27th Aug 2020 10 minutes read How Does SQL GROUP BY Work? Ignacio L. Bisso GROUP BY Table of Contents GROUP BY, Part 1: Grouping Data GROUP BY, Part 2: Aggregation Functions Grouping Records by Multiple Columns Grouping NULL Values Using WHERE with GROUP BY Avoiding Problems with GROUP BY Counting Problems Omitting Non-Aggregated Columns from GROUP BY There’s More to Do with GROUP BY Grouping results is a powerful SQL feature that allows you to compute key statistics for a group of records. GROUP BY is one of SQL’s most powerful clauses. It allows you to see data in a new way and find key metrics (like the average, maximal, and minimal values in a group of records). Without GROUP BY, all the results we obtain are oriented to records. With GROUP BY, we can create groups of records and calculate metrics on each group. In this article, you’ll learn how GROUP BY makes your SQL queries much more powerful and diverse. GROUP BY, Part 1: Grouping Data Let’s suppose we have a small hotel in Patagonia. We also have a database containing guests’ names, cities of origin, ages, check-in dates, check-out dates, and more. This data is in two tables called room_guest and guest. Have a look: room_guest guest_nameorigin_cityroom_numberday_inday_outageroom_levelamount_invoiced Juan B.San Pedro10012012-12-282013-01-0732standard$9500 Mary J.San Francisco10022013-01-022013-01-1223standard$6700 Peter S.Dubai20022013-01-022013-01-2965premium$34000 Clair BGenova20012014-07-022014-08-0221standard$16000 Meiling Y.San Francisco20022014-11-022014-11-1252standard$9500 Olek V.Dubai20032015-01-022015-01-3137premium$28400 Benjamin L.San Pedro20022016-01-022016-01-1561premium$15400 Arnaldo V.Genova10012017-01-012017-01-0443standard$2500 Mary J.San Francisco10022017-01-022017-01-0723standard$4800 Wei W.Los Angeles20022018-01-022018-01-2231standard$12000 Meiling Y.San Francisco20012018-01-022018-01-2252premium$17500 Peter S.Dubai20022019-01-022019-02-2565premium$32000 Arnaldo V.Genova20032019-08-052019-08-1743standard$11200 Mary J.San Francisco10012019-01-022019-01-1223standard$8900 guest guest_namepreferred_activitycity_namestatecountrycontinent activityCity_nameStateCountryContinent32 Juan B.trekkingSan PedroAndaluciaSpainEurope Mary J.trekkingSan FranciscoCaliforniaUnited StatesAmerica Peter S.trekkingDubaiDubaiArabiaAsia Chiara BskiingGenovaLiguriaItalyEurope Meiling Y.trekkingSan FranciscoCaliforniaUnited StatesAmerica Olek V.relaxingDubaiDubaiArabiaAsia Benjamin L.skiingSan PedroBuenos AiresArgentinaAmerica Wei W.trekkingLos AngelesCaliforniaUnited StatesAmerica Arnaldo V.skiingGenovaLiguriaItalyEurope We want to calculate some statistics so we can book more guests. The SQL GROUP BY clause lets us group records based on data in a given column (or columns). We can group records in the table room_guest based on the value of the column origin_city. Then all the records of guests from ‘Genova’ will belong to one group; all the records of guests from ‘Dubai’ will belong to another group, and so on. The following table shows each group of records in a different color. guest_nameorigin_cityroom_numberday_inday_outageroom_levelamount_invoiced Peter S.Dubai20022013-01-022013-01-2965premium$34000 Olek V.Dubai20032015-01-022015-01-3137premium$28400 Peter S.Dubai20022019-01-022019-02-2565premium$32000 Clair BGenova20012014-07-022014-08-0221standard$16000 Arnaldo V.Genova10012017-01-012017-01-0443standard$2500 Arnaldo V.Genova20032019-08-052019-08-1743standard$11200 Wei W.Los Angeles20022018-01-022018-01-2231standard$12000 Mary J.San Francisco10022013-01-022013-01-1223standard$6700 Mary J.San Francisco10022017-01-022017-01-0723standard$4800 Meiling Y.San Francisco20022014-11-022014-11-1252standard$9500 Meiling Y.San Francisco20012018-01-022018-01-2256premium$17500 Mary J.San Francisco10012019-01-022019-01-1223standard$8900 Benjamin L.San Pedro20022016-01-022016-01-1561premium$15400 Juan B.San Pedro10012012-12-282013-01-0732standard$9500 Now, suppose the hotel’s owner wants to know how many guests come from each city. To find out, we need to count the number of records in each group. In other words, we need the aggregate function COUNT(*), which returns the number of records in a group. COUNT() is a very common function; we’ll return to it later in this article. So, we need a query to create groups of records with the same value in origin_city and then count the number of records in each group. The query would look like this: SELECT origin_city, COUNT(*) AS quantity_of_guests FROM room_guest GROUP BY origin_city You can compare the number of guests from each city in the result table below against the colored table shown previously: origin_cityquantity_of_guests Dubai3 Genova3 Los Angeles1 San Francisco5 San Pedro2 Notice that the number of rows in the query results are the same as the quantity of groups created by the GROUP BY clause. One group for each city, one row for each city. GROUP BY, Part 2: Aggregation Functions While grouping by a value is handy, the real power of GROUP BY is when it’s used with aggregate functions. I’d go so far as to say that every SQL query using a GROUP BY clause should have at least one aggregate function. (But it’s not mandatory.) In the previous section, we mentioned that GROUP BY is used to create groups and calculate metrics. Metrics are calculated by aggregation functions like COUNT(), SUM(), AVG(), MIN(), and MAX(). The values calculated by each of these functions are self-explanatory. However, all of them have something in common: all aggregate functions return a value based on all the records in the group. Let’s consider an example. The hotel owner wants to know the maximum value invoiced for each room. Along with this, he wants to see the minimum and average invoiced value for each room. Here’s the query, followed by the results: SELECT room_number, MAX(amount_invoiced) AS max_amount_invoiced, MIN(amount_invoiced) AS min_amount_invoiced, AVG(amount_invoiced) AS average_amount_invoiced FROM room_guest GROUP BY room_number room_numbermax_amount_invoicedmin_amount_invoicedaverage_amount_invoiced 10019500.002500.006966.66 10026700.004800.005750.00 200117500.0016000.0016750.00 200234000.009500.0020580.00 200328400.0011200.0019800.00 Grouping Records by Multiple Columns In some cases, we may need to group by two or more columns. Can we do that with GROUP BY? We sure can! In the previous query, we created a report analyzing how much money each room is generating. However, some rooms can be configured at a premium or standard level (see room number 2002) during different seasons; thus, to do a correct analysis, we need to group records using two columns: room_number and room_level. Before going to the query, let's use colors to see how the records are grouped by the GROUP BY room_number, room_level clause. Remember that the records in each group must have exactly the same values in both room_number and room_level. For example, the first group is for room_number = 1001 and room_level = ‘standard’. guest_nameorigin_cityroom_numberday_inday_outageroom_levelamount_invoiced Mary J.San Francisco10012019-01-022019-01-1223standard$8900 Arnaldo V.Genova10012017-01-012017-01-0443standard$2500 Juan B.San Pedro10012012-12-282013-01-0732standard$9500 Mary J.San Francisco10022013-01-022013-01-1223standard$6700 Mary J.San Francisco10022017-01-022017-01-0723standard$4800 Meiling Y.San Francisco20012018-01-022018-01-2252premium$17500 Clair BGenova20012014-07-022014-08-0221standard$16000 Benjamin L.San Pedro20022016-01-022016-01-1561premium$15400 Peter S.Dubai20022013-01-022013-01-2965premium$34000 Peter S.Dubai20022019-01-022019-02-2565premium$32000 Meiling Y.San Francisco20022014-11-022014-11-1252standard$9500 Wei W.Los Angeles20022018-01-022018-01-2231standard$12000 Olek V.Dubai20032015-01-022015-01-3137premium$28400 Arnaldo V.Genova20032019-08-052019-08-1743standard$11200 The query is: SELECT room_number, room_level, MAX(amount_invoiced) AS max_amount_invoiced, MIN(amount_invoiced) AS min_amount_invoiced, AVG(amount_invoiced) AS average_amount_invoiced FROM room_guest GROUP BY room_number, room_level The following table shows the results of this query. You can compare this table with the previous table to verify the results. room_numberroom_levelmax_amount_invoicedmin_amount invoicedaverage_amoun_invoiced 1001standard9500.002500.006966.66 1002standard6700.004800.005750.00 2001premium17500.0017500.0017500.00 2001standard16000.0016000.0016000.00 2002premium34000.0015400.0027133.33 2002standard12000.009500.0010750.00 2003premium28400.0028400.0028400.00 2003standard11200.0011200.0011200.00 Grouping NULL Values Like any other value, NULL values have their own group; if we have a NULL in any of the columns in GROUP BY, an extra group of records is created for those records. To demonstrate this, we need to insert a pair of records with NULL values in the column origin_city: INSERT INTO into room_guest VALUES ('Kevin C.', NULL, 2001, '2019-07-25', '2019-08-07', NULL, 'standard', 10500); INSERT INTO into room_guest VALUES ('Karl J.', NULL, 1002, '2019-11-12', '2019-11-22', NULL, 'premium', 13900); Then, this query ... SELECT origin_city, COUNT(*) AS quantity_of_guests FROM room_guest GROUP BY origin_city … will show the following result. Notice the new group for NULL origin_city values in the first row: origin_cityquantity_of_guests NULL2 Dubai3 Genova3 Los Angeles1 San Francisco5 San Pedro2 Using WHERE with GROUP BY The WHERE clause is frequently used in SQL queries, so it’s important to understand how it works when combined with GROUP BY. The WHERE clause is applied before the GROUP BY. This means that all the records are filtered first by WHERE; then the records that match the WHERE condition are grouped using the GROUP BY criteria. As an example, let's use the previous query, but this time we’ll filter for guests coming from the cities of San Francisco and Los Angeles. The query is: SELECT room_number, room_level, MAX(amount_invoiced) AS max_amount_invoiced, MIN(amount_invoiced) AS min_amount_invoiced, AVG(amount_invoiced) AS average_amount_invoiced FROM room_guest WHERE origin_city IN (‘San Francisco’,’Los Angeles’ ) GROUP BY room_number, room_level As expected, this result set is shorter than the previous ones; the WHERE clause filtered out many guests, and only the records for rooms in San Francisco and Los Angeles were processed by the GROUP BY clause. room_numberroom_levelmax_amount_invoicedmin_amount_ invoicedaverage_amount_invoiced 1001standard8900.008900.008900.00 1002standard6700.004800.005750.00 2001premium17500.0017500.0017500.00 2002standard12000.009500.0010750.00 Avoiding Problems with GROUP BY When you’re getting started with GROUP BY, it’s common to run into the following problems. Here’s how to avoid them. Counting Problems Let’s look at a similar case where we need to add more than one extra column into the GROUP BY clause. In the first query, we grouped by origin_city. However, some cities share the same name (because they are in different states or countries). In our data set, we have two different cities named San Pedro, one in Argentina and the other in Spain. We don’t want to count them together, as they are two different places. To count these cities separately, we need to group records using the columns city_origin, state, and country. Then we will repeat the first query but add the columns state and country to the GROUP BY clause. However, if we add columns to the GROUP BY, we should also add them to the SELECT. Because the columns state and country are in the guest table, we have to JOIN the tables room_guest and guest. Here’s the query we have: SELECT origin_city, state, country COUNT(*) AS number_of_guests FROM room_guest JOIN guest ON guest.guest_name = room_guest.guest_name GROUP BY origin_city, state, country The results show two different “San Pedro” cities because we’ve used state and country as additional columns in the GROUP BY clause. origin_citystatecountrynumber_of_guests DubaiDubaiUAE3 GenovaLiguriaItaly3 Los AngelesCaliforniaUnited States1 San FranciscoCaliforniaUnited States5 San PedroBuenos AiresArgentina1 San PedroAndaluciaSpain1 There is still an issue to fix in this query: if the same person visited the hotel two times, we are counting this person twice. This is not necessarily wrong, but what if we want to know the number of unique visitors to the hotel? We’d need to use COUNT(distinct guest_name). The grouping function COUNT(distinct column) returns the quantity of unique values for a given column in a group of records. In the query below, we add the COUNT(distinct) function. We also maintain the original COUNT(*) so that the reader can compare both results: SELECT origin_city, state, country COUNT(distinct guest_name) AS number_of_unique_guests, COUNT(*) AS number_of_guests FROM room_guest JOIN guest ON guest.guest_name = room_guest.guest_name GROUP BY origin_city, state, country Now we can see the hotel received a total of three visits from a Dubai resident, but that these three visits were made by two distinct people (Peter S. and Olek V) . origin_citystatecountrynumber_of_unique_guestsnumber_of_guests DubaiDubaiUAE23 GenovaLiguriaItaly23 Los AngelesCaliforniaUnited States11 San FranciscoCaliforniaUnited States25 San PedroBuenos AiresArgentina11 San PedroAndaluciaSpain11 Before closing this section, I suggest you watch this 5-minute video on GROUP BY for beginners. It’s a super dynamic way to learn SQL. Omitting Non-Aggregated Columns from GROUP BY Another very common GROUP BY error is to add a non-aggregated column (i.e. a column that isn’t used in an aggregate function) in the SELECT that you don’t have in GROUP BY. To avoid this error, follow a very simple rule: All columns in SELECT should appear in the GROUP BY clause or be used in an aggregate function. Let’s try an invalid query to see the error: SELECT room_number, room_level, origin_city, --This column is invalid, is not in the GROUP BY COUNT(*) AS quantity_of_visitors, FROM room_guest GROUP BY room_number, room_level If we execute this query, we’ll get the following error: ERROR: The column «room_guest.origin_city» must be in the GROUP BY clause LINE 3: guest_age, We can fix the error by adding the column origin_city to the GROUP BY clause: SELECT room_number, room_level, Origin_city, COUNT(*) AS quantity_of_visitors FROM room_gest GROUP BY room_number, room_level, origin_city -- origin_city added If you’re trying to figure out the difference between GROUP BY and ORDER BY, read the Difference Between GROUP BY and ORDER BY in Simple Words. It will help you sort it out. There’s More to Do with GROUP BY So, we’ve learned how to use GROUP BY to group records by common values. We know the aggregate functions MIN(), MAX(), AVG(), and SUM() compute various statistics. And the COUNT() function does a lot of things: COUNT(*) counts all rows. COUNT(guest_name) counts all non-NULL values in the guest_name column. COUNT(distinct guest_name) counts all different non-NULL values in the guest_name column. When grouping, NULL gets its own group. And all non-aggregated columns in SELECT must be present in GROUP BY. Due to the length of the article, I didn’t cover the HAVING clause, which is a kind of WHERE clause used to filter groups instead of records. For those readers who want to go a step further, I’ll leave you a link to our SQL Basics course, which covers many interesting topics. It’s a great way to build your SQL skills! Tags: GROUP BY