27th Aug 2020 10 minutes read

How Does SQL GROUP BY Work?

GROUP BY

Table of Contents

GROUP BY, Part 1: Grouping Data
GROUP BY, Part 2: Aggregation Functions
Grouping Records by Multiple Columns
Grouping NULL Values
Using WHERE with GROUP BY
Avoiding Problems with GROUP BY
- Counting Problems
Omitting Non-Aggregated Columns from GROUP BY
There’s More to Do with GROUP BY

Grouping results is a powerful SQL feature that allows you to compute key statistics for a group of records.

GROUP BY is one of SQL’s most powerful clauses. It allows you to see data in a new way and find key metrics (like the average, maximal, and minimal values in a group of records).

Without GROUP BY, all the results we obtain are oriented to records. With GROUP BY, we can create groups of records and calculate metrics on each group. In this article, you’ll learn how GROUP BY makes your SQL queries much more powerful and diverse.

GROUP BY, Part 1: Grouping Data

Let’s suppose we have a small hotel in Patagonia. We also have a database containing guests’ names, cities of origin, ages, check-in dates, check-out dates, and more. This data is in two tables called room_guest and guest. Have a look:

room_guest

guest_name	origin_city	room_number	day_in	day_out	age	room_level	amount_invoiced
Juan B.	San Pedro	1001	2012-12-28	2013-01-07	32	standard	$9500
Mary J.	San Francisco	1002	2013-01-02	2013-01-12	23	standard	$6700
Peter S.	Dubai	2002	2013-01-02	2013-01-29	65	premium	$34000
Clair B	Genova	2001	2014-07-02	2014-08-02	21	standard	$16000
Meiling Y.	San Francisco	2002	2014-11-02	2014-11-12	52	standard	$9500
Olek V.	Dubai	2003	2015-01-02	2015-01-31	37	premium	$28400
Benjamin L.	San Pedro	2002	2016-01-02	2016-01-15	61	premium	$15400
Arnaldo V.	Genova	1001	2017-01-01	2017-01-04	43	standard	$2500
Mary J.	San Francisco	1002	2017-01-02	2017-01-07	23	standard	$4800
Wei W.	Los Angeles	2002	2018-01-02	2018-01-22	31	standard	$12000
Meiling Y.	San Francisco	2001	2018-01-02	2018-01-22	52	premium	$17500
Peter S.	Dubai	2002	2019-01-02	2019-02-25	65	premium	$32000
Arnaldo V.	Genova	2003	2019-08-05	2019-08-17	43	standard	$11200
Mary J.	San Francisco	1001	2019-01-02	2019-01-12	23	standard	$8900

guest

guest_name	preferred_activity	city_name	state	country	continent
activity	City_name	State	Country	Continent	32
Juan B.	trekking	San Pedro	Andalucia	Spain	Europe
Mary J.	trekking	San Francisco	California	United States	America
Peter S.	trekking	Dubai	Dubai	Arabia	Asia
Chiara B	skiing	Genova	Liguria	Italy	Europe
Meiling Y.	trekking	San Francisco	California	United States	America
Olek V.	relaxing	Dubai	Dubai	Arabia	Asia
Benjamin L.	skiing	San Pedro	Buenos Aires	Argentina	America
Wei W.	trekking	Los Angeles	California	United States	America
Arnaldo V.	skiing	Genova	Liguria	Italy	Europe

We want to calculate some statistics so we can book more guests. The SQL GROUP BY clause lets us group records based on data in a given column (or columns). We can group records in the table room_guest based on the value of the column origin_city. Then all the records of guests from ‘Genova’ will belong to one group; all the records of guests from ‘Dubai’ will belong to another group, and so on. The following table shows each group of records in a different color.

guest_name	origin_city	room_number	day_in	day_out	age	room_level	amount_invoiced
Peter S.	Dubai	2002	2013-01-02	2013-01-29	65	premium	$34000
Olek V.	Dubai	2003	2015-01-02	2015-01-31	37	premium	$28400
Peter S.	Dubai	2002	2019-01-02	2019-02-25	65	premium	$32000
Clair B	Genova	2001	2014-07-02	2014-08-02	21	standard	$16000
Arnaldo V.	Genova	1001	2017-01-01	2017-01-04	43	standard	$2500
Arnaldo V.	Genova	2003	2019-08-05	2019-08-17	43	standard	$11200
Wei W.	Los Angeles	2002	2018-01-02	2018-01-22	31	standard	$12000
Mary J.	San Francisco	1002	2013-01-02	2013-01-12	23	standard	$6700
Mary J.	San Francisco	1002	2017-01-02	2017-01-07	23	standard	$4800
Meiling Y.	San Francisco	2002	2014-11-02	2014-11-12	52	standard	$9500
Meiling Y.	San Francisco	2001	2018-01-02	2018-01-22	56	premium	$17500
Mary J.	San Francisco	1001	2019-01-02	2019-01-12	23	standard	$8900
Benjamin L.	San Pedro	2002	2016-01-02	2016-01-15	61	premium	$15400
Juan B.	San Pedro	1001	2012-12-28	2013-01-07	32	standard	$9500

Now, suppose the hotel’s owner wants to know how many guests come from each city. To find out, we need to count the number of records in each group. In other words, we need the aggregate function COUNT(*), which returns the number of records in a group. COUNT() is a very common function; we’ll return to it later in this article.

So, we need a query to create groups of records with the same value in origin_city and then count the number of records in each group. The query would look like this:

SELECT 	origin_city,
COUNT(*) AS quantity_of_guests
FROM   	room_guest 
GROUP BY	origin_city

You can compare the number of guests from each city in the result table below against the colored table shown previously:

origin_city	quantity_of_guests
Dubai	3
Genova	3
Los Angeles	1
San Francisco	5
San Pedro	2

Notice that the number of rows in the query results are the same as the quantity of groups created by the GROUP BY clause. One group for each city, one row for each city.

GROUP BY, Part 2: Aggregation Functions

While grouping by a value is handy, the real power of GROUP BY is when it’s used with aggregate functions. I’d go so far as to say that every SQL query using a GROUP BY clause should have at least one aggregate function. (But it’s not mandatory.)

In the previous section, we mentioned that GROUP BY is used to create groups and calculate metrics. Metrics are calculated by aggregation functions like COUNT(), SUM(), AVG(), MIN(), and MAX(). The values calculated by each of these functions are self-explanatory. However, all of them have something in common: all aggregate functions return a value based on all the records in the group.

Let’s consider an example. The hotel owner wants to know the maximum value invoiced for each room. Along with this, he wants to see the minimum and average invoiced value for each room. Here’s the query, followed by the results:

SELECT 	room_number,
MAX(amount_invoiced) AS max_amount_invoiced,
MIN(amount_invoiced) AS min_amount_invoiced,
AVG(amount_invoiced) AS average_amount_invoiced
FROM   	room_guest 
GROUP BY	room_number

room_number	max_amount_invoiced	min_amount_invoiced	average_amount_invoiced
1001	9500.00	2500.00	6966.66
1002	6700.00	4800.00	5750.00
2001	17500.00	16000.00	16750.00
2002	34000.00	9500.00	20580.00
2003	28400.00	11200.00	19800.00

Grouping Records by Multiple Columns

In some cases, we may need to group by two or more columns. Can we do that with GROUP BY? We sure can!

In the previous query, we created a report analyzing how much money each room is generating. However, some rooms can be configured at a premium or standard level (see room number 2002) during different seasons; thus, to do a correct analysis, we need to group records using two columns: room_number and room_level.

Before going to the query, let's use colors to see how the records are grouped by the GROUP BY room_number, room_level clause. Remember that the records in each group must have exactly the same values in both room_number and room_level. For example, the first group is for room_number = 1001 and room_level = ‘standard’.

guest_name	origin_city	room_number	day_in	day_out	age	room_level	amount_invoiced
Mary J.	San Francisco	1001	2019-01-02	2019-01-12	23	standard	$8900
Arnaldo V.	Genova	1001	2017-01-01	2017-01-04	43	standard	$2500
Juan B.	San Pedro	1001	2012-12-28	2013-01-07	32	standard	$9500
Mary J.	San Francisco	1002	2013-01-02	2013-01-12	23	standard	$6700
Mary J.	San Francisco	1002	2017-01-02	2017-01-07	23	standard	$4800
Meiling Y.	San Francisco	2001	2018-01-02	2018-01-22	52	premium	$17500
Clair B	Genova	2001	2014-07-02	2014-08-02	21	standard	$16000
Benjamin L.	San Pedro	2002	2016-01-02	2016-01-15	61	premium	$15400
Peter S.	Dubai	2002	2013-01-02	2013-01-29	65	premium	$34000
Peter S.	Dubai	2002	2019-01-02	2019-02-25	65	premium	$32000
Meiling Y.	San Francisco	2002	2014-11-02	2014-11-12	52	standard	$9500
Wei W.	Los Angeles	2002	2018-01-02	2018-01-22	31	standard	$12000
Olek V.	Dubai	2003	2015-01-02	2015-01-31	37	premium	$28400
Arnaldo V.	Genova	2003	2019-08-05	2019-08-17	43	standard	$11200

The query is:

SELECT 	room_number,
		room_level,
MAX(amount_invoiced) AS max_amount_invoiced,
MIN(amount_invoiced) AS min_amount_invoiced,
AVG(amount_invoiced) AS average_amount_invoiced
FROM   	room_guest 
GROUP BY	room_number, room_level

The following table shows the results of this query. You can compare this table with the previous table to verify the results.

room_number	room_level	max_amount_invoiced	min_amount invoiced	average_amoun_invoiced
1001	standard	9500.00	2500.00	6966.66
1002	standard	6700.00	4800.00	5750.00
2001	premium	17500.00	17500.00	17500.00
2001	standard	16000.00	16000.00	16000.00
2002	premium	34000.00	15400.00	27133.33
2002	standard	12000.00	9500.00	10750.00
2003	premium	28400.00	28400.00	28400.00
2003	standard	11200.00	11200.00	11200.00

Grouping NULL Values

Like any other value, NULL values have their own group; if we have a NULL in any of the columns in GROUP BY, an extra group of records is created for those records. To demonstrate this, we need to insert a pair of records with NULL values in the column origin_city:

INSERT INTO into room_guest VALUES ('Kevin C.', NULL, 2001, '2019-07-25', '2019-08-07', NULL, 'standard', 10500);
INSERT INTO into room_guest VALUES  ('Karl J.', NULL, 1002, '2019-11-12', '2019-11-22', NULL, 'premium', 13900);

Then, this query ...

SELECT 	origin_city,
COUNT(*) AS quantity_of_guests
FROM   	room_guest 
GROUP BY	origin_city

… will show the following result. Notice the new group for NULL origin_city values in the first row:

origin_city	quantity_of_guests
NULL	2
Dubai	3
Genova	3
Los Angeles	1
San Francisco	5
San Pedro	2

Using WHERE with GROUP BY

The WHERE clause is frequently used in SQL queries, so it’s important to understand how it works when combined with GROUP BY.

The WHERE clause is applied before the GROUP BY. This means that all the records are filtered first by WHERE; then the records that match the WHERE condition are grouped using the GROUP BY criteria.

As an example, let's use the previous query, but this time we’ll filter for guests coming from the cities of San Francisco and Los Angeles. The query is:

SELECT 	room_number,
		room_level,
MAX(amount_invoiced) AS max_amount_invoiced,
MIN(amount_invoiced) AS min_amount_invoiced,
AVG(amount_invoiced) AS average_amount_invoiced
FROM   	room_guest
WHERE		origin_city IN (‘San Francisco’,’Los Angeles’ )
GROUP BY	room_number, room_level

As expected, this result set is shorter than the previous ones; the WHERE clause filtered out many guests, and only the records for rooms in San Francisco and Los Angeles were processed by the GROUP BY clause.

room_number	room_level	max_amount_invoiced	min_amount_ invoiced	average_amount_invoiced
1001	standard	8900.00	8900.00	8900.00
1002	standard	6700.00	4800.00	5750.00
2001	premium	17500.00	17500.00	17500.00
2002	standard	12000.00	9500.00	10750.00

Avoiding Problems with GROUP BY

When you’re getting started with GROUP BY, it’s common to run into the following problems. Here’s how to avoid them.

Counting Problems

Let’s look at a similar case where we need to add more than one extra column into the GROUP BY clause. In the first query, we grouped by origin_city. However, some cities share the same name (because they are in different states or countries). In our data set, we have two different cities named San Pedro, one in Argentina and the other in Spain. We don’t want to count them together, as they are two different places.

To count these cities separately, we need to group records using the columns city_origin, state, and country. Then we will repeat the first query but add the columns state and country to the GROUP BY clause. However, if we add columns to the GROUP BY, we should also add them to the SELECT.

Because the columns state and country are in the guest table, we have to JOIN the tables room_guest and guest. Here’s the query we have:

SELECT 	origin_city, state, country
COUNT(*) AS number_of_guests
FROM   	room_guest 
JOIN		guest ON guest.guest_name = room_guest.guest_name
GROUP BY	origin_city, state, country

The results show two different “San Pedro” cities because we’ve used state and country as additional columns in the GROUP BY clause.

origin_city	state	country	number_of_guests
Dubai	Dubai	UAE	3
Genova	Liguria	Italy	3
Los Angeles	California	United States	1
San Francisco	California	United States	5
San Pedro	Buenos Aires	Argentina	1
San Pedro	Andalucia	Spain	1

There is still an issue to fix in this query: if the same person visited the hotel two times, we are counting this person twice. This is not necessarily wrong, but what if we want to know the number of unique visitors to the hotel? We’d need to use COUNT(distinct guest_name). The grouping function COUNT(distinct column) returns the quantity of unique values for a given column in a group of records.

In the query below, we add the COUNT(distinct) function. We also maintain the original COUNT(*) so that the reader can compare both results:

SELECT 	origin_city, state, country
COUNT(distinct guest_name) AS number_of_unique_guests,
COUNT(*) AS number_of_guests
FROM   	room_guest 
JOIN		guest ON guest.guest_name = room_guest.guest_name
GROUP BY	origin_city, state, country

Now we can see the hotel received a total of three visits from a Dubai resident, but that these three visits were made by two distinct people (Peter S. and Olek V) .

origin_city	state	country	number_of_unique_guests	number_of_guests
Dubai	Dubai	UAE	2	3
Genova	Liguria	Italy	2	3
Los Angeles	California	United States	1	1
San Francisco	California	United States	2	5
San Pedro	Buenos Aires	Argentina	1	1
San Pedro	Andalucia	Spain	1	1

Before closing this section, I suggest you watch this 5-minute video on GROUP BY for beginners. It’s a super dynamic way to learn SQL.

Omitting Non-Aggregated Columns from GROUP BY

Another very common GROUP BY error is to add a non-aggregated column (i.e. a column that isn’t used in an aggregate function) in the SELECT that you don’t have in GROUP BY. To avoid this error, follow a very simple rule: All columns in SELECT should appear in the GROUP BY clause or be used in an aggregate function.

Let’s try an invalid query to see the error:

SELECT    room_number,
	    room_level,
	    origin_city, --This column is invalid, is not in the GROUP BY
    COUNT(*) AS quantity_of_visitors,
FROM      room_guest 
GROUP BY  room_number, room_level

If we execute this query, we’ll get the following error:

ERROR:  The column «room_guest.origin_city» must be in the GROUP BY clause
LINE 3:   guest_age,

We can fix the error by adding the column origin_city to the GROUP BY clause:

	SELECT 	room_number,
			room_level,
			Origin_city,
			COUNT(*) AS quantity_of_visitors
	FROM		room_gest
	GROUP BY	room_number, room_level, origin_city -- origin_city added

If you’re trying to figure out the difference between GROUP BY and ORDER BY, read the Difference Between GROUP BY and ORDER BY in Simple Words. It will help you sort it out.

There’s More to Do with GROUP BY

So, we’ve learned how to use GROUP BY to group records by common values. We know the aggregate functions MIN(), MAX(), AVG(), and SUM() compute various statistics. And the COUNT() function does a lot of things:

COUNT(*) counts all rows.
COUNT(guest_name) counts all non-NULL values in the guest_name column.
COUNT(distinct guest_name) counts all different non-NULL values in the guest_name column.

When grouping, NULL gets its own group. And all non-aggregated columns in SELECT must be present in GROUP BY.

Due to the length of the article, I didn’t cover the HAVING clause, which is a kind of WHERE clause used to filter groups instead of records. For those readers who want to go a step further, I’ll leave you a link to our SQL Basics course, which covers many interesting topics. It’s a great way to build your SQL skills!

Tags:

GROUP BY