Back to articles list Articles Cookbook
Updated: 29th Oct 2024 17 minutes read

GROUP BY and Aggregate Functions: A Complete Overview

SQL’s GROUP BY and aggregate functions are essential in data aggregation – and for analyzing data and creating reports. Let’s explore them together!  

In SQL, GROUP BY and aggregate functions are one of the language’s most popular features. Data aggregation is critical for data analysis and reporting; to make sense of all the data in a dataset, it often needs to be aggregated.

But what is data aggregation? Put simply, it’s when you group data based on common values and perform a calculation for each data group. For example, you might group a list of dogs by their breed and then use a calculation to figure out the average height for each breed. This task is made possible by the GROUP BY clause and the aggregate function AVG().

If you need to refresh your knowledge about GROUP BY, our SQL Basics course is ideal. It has 129 interactive exercises that you can finish in about 10 hours. It covers topics such as filtering data, using logical operators, joining, and using set operators.

Syntax of GROUP BY with Aggregate Functions

GROUP BY is a clause in SQL that arranges data with the same values into groups. Grouping is done by column(s), where all the rows with the same value in that column belong to one group. You can then perform summary computations – such as counting, summing, or averaging values – for each group. For example, you can use GROUP BY to compute the number of employees in a department, the total salary or average salary per department, etc.

The syntax of GROUP BY with an aggregate function is:

SELECT 
  column1, 
  column2, 
  ..., 
  aggregate_function(column_name)
FROM table_name
WHERE filtering_condition
GROUP BY column1, column2, …
HAVING filtering_condition;

The aggregate function is written in the SELECT statement, and the function’s result will be shown as an additional column.

The GROUP BY clause contains the columns by which you want your output to be grouped. This clause is often used with the WHERE and HAVING clauses for filtering. In the syntax, WHERE comes before GROUP BY, while HAVING comes after it. (I’ll explain more about these two clauses later.) Also, your data will sometimes contain NULLs, so you’ll be interested to learn how GROUP BY treats NULLs.

An example of GROUP BY in action is shown in the following GIF.

How SQL GROUP BY works

This basic example groups data by the film director and shows the number of films each has made.

SQL Aggregate Functions

SQL’s aggregate functions are used when you want to do computations on data and return a single value.

The most common aggregate functions in SQL are:

  • COUNT() – Counts the rows in the stated column.
  • SUM() – Returns the sum of all values.
  • AVG() – Computes the average of a group of values.
  • MIN() – Returns the minimum (the lowest) of the given values.
  • MAX() – Returns the maximum (the highest) of the given values

See our guides on how to use these functions to compute aggregates for the whole table:

It’s important to note that the COUNT() function has  several different use variations:

  • COUNT(*) counts all rows in a result set, including NULL values and duplicates.
  • COUNT(expression) counts non-NULL values in a result set.
  • COUNT(DISTINCT) counts distinct values and ignores duplicates.

There are more details in our guide What is the Difference Between COUNT(*), COUNT(1), COUNT(column), and COUNT(DISTINCT)?

Using GROUP BY with SQL Aggregate Functions

I’ll use the table albums to show how aggregate functions and GROUP BY work together. The script for creating the table is here.

This table contains data about musical albums. Here’s a snapshot:

idartistalbumrelease_yearlengthlabel
1Grant GreenThe Latin Bit19630:38:56Blue Note Records
2AC/DCHigh Voltage19760:44:23Atlantic Records
3Brother Jack McDuffTo Seek a New Home19700:33:12Blue Note Records
4Grant GreenGreen Is Beautiful19700:37:33Blue Note Records
5Wayne ShorterMoto Grosso Feio19740:42:22Blue Note Records

Example 1: COUNT() with GROUP BY

Here’s an example code that counts the number of albums by label:

SELECT 
  label,
  COUNT(*) AS number_of_albums
FROM albums
GROUP BY label; 

I select the label and use COUNT(*) to find the number of albums. The asterisk (*) is a shorthand for counting all rows. When I group the output by label, COUNT() will show the number of albums per label.

Here’s the output:

labelnumber_of_albums
Atlantic Records13
Blue Note Records12
Stax Records14

Further reading:

Example 2: SUM() with GROUP BY

The following code sums the album lengths to show the total music available per artist:

SELECT 
  artist,
  SUM(length) AS total_music_available
FROM albums
GROUP BY artist;

The SUM() function is applied to the length column, and the output is then grouped by the artist.

This is the code result:

artisttotal_music_available
Isaac Hayes6:30:02
Otis Redding1:34:09
Brother Jack McDuff1:58:11
Aretha Franklin1:47:07
Grant Green3:10:11
John Prine1:21:57
Led Zeppelin3:32:07
Wayne Shorter2:38:02
Albert King3:08:28
AC/DC2:05:23

Further reading:

Example 3: AVG() with GROUP BY

This code calculates each artist’s average album length:

SELECT 
  artist,
  AVG(length) AS average_album_length
FROM albums
GROUP BY artist; 

To get the desired result, you need to use AVG() on the column length and group the output by artist.

Here’s what the output looks like:

artistaverage_album_length
Isaac Hayes1:05:00
Otis Redding0:31:23
Brother Jack McDuff0:39:24
Aretha Franklin0:35:42
Grant Green0:38:02
John Prine0:40:59
Led Zeppelin0:42:25
Wayne Shorter0:39:31
Albert King0:37:42
AC/DC0:41:48

Further reading:

Example 4: MIN() and MAX() with GROUP BY

Let’s find the shortest and the longest album length for each year:

SELECT 
  release_year,
  MIN(length) AS minimum_album_length,
  MAX(length) AS maximum_album_length
FROM albums
GROUP BY release_year
ORDER BY release_year; 

As in earlier examples, the aggregate functions are used on the length column; MIN() for the shortest and MAX() for the longest album.

The output is grouped by release year.

I also added the ORDER BY clause to sort the output from the earliest to the latest year so it would be more readable.

Here’s the code output:

release_yearminimum_album_lengthmaximum_album_length
19620:34:490:34:49
19630:38:560:38:56
19640:30:170:30:17
19650:32:220:42:45
19670:32:510:41:08
19680:29:300:29:30
19690:31:300:46:00
19700:33:120:43:04
19710:42:371:33:38
19720:37:500:44:46
19730:40:571:43:10
19740:42:220:46:00
19760:39:590:44:23
19770:41:010:41:01

Further reading:

Advanced GROUP BY Techniques

The previous section dealt with the basics of using SQL aggregate functions and GROUP BY. With these foundations set, we can now explore some advanced GROUP BY techniques.

HAVING

To filter aggregated results, you need to use the HAVING clause. Its syntax is shown here:

SELECT 
  column1, 
  column2, 
  ..., 
  aggregate_function(column_name)
FROM table_name
GROUP BY column1, column2, …
HAVING filtering_condition;

HAVING compares the aggregated values to the filtering condition and returns only the values that satisfy that condition. You can learn more in our article about the HAVING clause.

Let me show you an example. The following code calculates the average album length per artist, but it shows only those whose average album length is under 40 minutes.

SELECT 
  artist,
  AVG(length) AS average_album_length
FROM albums
GROUP BY artist
HAVING AVG(length) < '00:40:00';

I used the AVG() function just the way I did earlier. The HAVING clause is new. To filter data in the required way, I wrote in HAVING the same average calculation as the one appearing in SELECT. Then, I use the comparison operator ‘less than’ (<) to compare each grouped result with a value (in this case, 40 minutes). This value is enclosed in single quotes (''). The value in the quotes has to be written in the format HH:MM:SS (hours:minutes:seconds) because the values in the column length are of the INTERVAL data type.

Here’s the output:

artistaverage_album_length
Otis Redding0:31:23
Brother Jack McDuff0:39:24
Aretha Franklin0:35:42
Grant Green0:38:02
Wayne Shorter0:39:31
Albert King0:37:42

The principle is the same no matter what aggregate function you use, but here are the cookbooks to help you with some other functions:

Grouping by Multiple Columns

So far, I have grouped data by one column. However, it’s also possible to group by two or more columns.

It’s not complicated: You just have to list all the grouping columns in GROUP BY and separate them with commas.

By doing that, you group by the column values but also by the combination of values across all the grouping columns. If you have value A in column_1 and value B in column_2, this is one group. If there are, for example, values A in column_1 with value C in column_2, this will form another group – despite the values in column_1 being the same as in the earlier example. 

The code below shows the number of albums by the artist and the year of release:

SELECT 
  artist,
  release_year,
  COUNT(*) AS number_of_albums
FROM albums
GROUP BY artist, release_year
ORDER BY artist, release_year;

I’m using COUNT(*) to find the number of albums. The output is grouped by artist and release year. As you can see, those are exactly the columns in GROUP BY and they are separated by a comma.

To make the output more readable, I sorted data by artist alphabetically and from the earliest to the latest release year.

This is the output snapshot:

artistrelease_yearnumber_of_albums
AC/DC19762
AC/DC19771
Albert King19621
Albert King19671
Albert King19691
Albert King19721
Albert King19741
Aretha Franklin19671
Aretha Franklin19681
Aretha Franklin19721
Brother Jack McDuff19692
Brother Jack McDuff19701

Two Aggregate Functions in One Query

It’s also possible to write a query that has two aggregate functions. No, it’s not what you think: you don’t write one aggregate function within another. That’s not possible in SQL.

Two aggregate functions in one query can mean two things:

  1. Simple approach: Using two aggregate functions in two columns of the same SELECT.
  2. Advanced approach: Using the result of one aggregate function in another aggregate function.

One Query, Two Aggregate Functions: The Simple Approach

So, the simple approach is when you have two aggregate functions in two different aggregate columns in one SELECT. They can be the same function applied on different columns, different functions used on the same column, or different functions used on different columns. It really doesn’t matter; the point is you have two aggregations in one query. These aggregations don’t ‘communicate’, i.e. they don’t depend on one another in any way.

In the example, I’ll use the table package_deliveries with the script linked here.

The table shows information about the number of packages delivered daily to several cities.

Here are the first six rows in the table:

iddate_of_deliverynumber_of_packagescity_of_deliverycountry_of_delivery
12024-03-0149KrakowPoland
22024-03-01528WarsawPoland
32024-03-01158GdanskPoland
42024-03-0128RijekaCroatia
52024-03-0197SplitCroatia
62024-03-0164ZagrebCroatia

The code below calculates the total and the average number of packages by city.

SELECT 
  city_of_delivery,
  SUM(number_of_packages) AS total_number_of_packages,
  AVG(number_of_packages) AS average_number_of_packages
FROM package_deliveries
GROUP BY city_of_delivery;

You can see the query has two aggregates: one with SUM() and one with AVG(). It really doesn’t matter how many aggregate functions you write in the query. The only important thing is that each aggregation is treated as a separate code line or output column.

Here’s the result:

city_of_deliverytotal_number_of_packagesaverage_number_of_packages
Split531177.00
Izmir5,9361,978.67
Ankara9,1393,046.33
Gdansk712237.33
Rijeka627209.00
Zagreb930310.00
Istanbul13,8574,619.00
Krakow673224.33
Warsaw2,358786.00

One Query, Two Aggregate Functions: The Advanced Approach

The advanced version is when the aggregations ‘communicate’, meaning you first need to find one aggregated value and then use it to calculate another aggregated value.  Typically, this is achieved either through subqueries or Common Table Expressions (CTEs).

I’ll show you subquery usage in the example below. Our guide on combining two aggregate functions in GROUP BY covers both approaches.

The following code returns the average daily number of packages delivered in each country:

SELECT 
  country_of_delivery,
  AVG(pd.daily_number_of_packages) AS average_daily_number_of_packages
FROM 
  (SELECT 
     date_of_delivery, 
     country_of_delivery,
     SUM(number_of_packages) AS daily_number_of_packages
  FROM package_deliveries
  GROUP BY date_of_delivery, country_of_delivery) AS pd
GROUP BY country_of_delivery;

I first need to find the number of packages delivered daily in each country, which is the first aggregation. The second aggregation uses this result and calculates the daily average by country.

The first aggregation is written in the subquery. It uses the SUM() function and lists the columns date_of_delivery and country_of_delivery in SELECT. The output is grouped by the same columns.

Now, I can write the main query, referencing in AVG() the column daily_number_of_packages from the subquery. I group the output by country and get the average daily number of packages delivered for each country.

country_of_deliveryaverage_daily_number_of_packages
Turkey9,644.00
Croatia696.00
Poland1,247.67

GROUP BY with Conditional Aggregates

The CASE or CASE WHEN statement labels data based on its values. This statement can also be used with GROUP BY.

Grouping data by conditional aggregates means using CASE WHEN with GROUP BY. However – and this is important – the CASE WHEN you write in SELECT also has to appear in GROUP BY.

Using the table from the previous example, I can write a code that categorizes countries into EU and non-EU countries and shows the total number of packages delivered for each category:

SELECT
  CASE 
    WHEN country_of_delivery = 'Turkey' THEN 'Non-EU'
    ELSE 'EU'
  END,
  SUM(number_of_packages) AS total_number_of_packages
FROM package_deliveries
GROUP BY CASE 
   WHEN country_of_delivery = 'Turkey' THEN 'Non-EU'
   ELSE 'EU'
  END;

The CASE statement says that Turkey should be categorized as a non-EU country. This condition is stated in WHEN. The category for all other countries is EU, which is written in ELSE.

The CASE statement starts with CASE (hence the name) and ends with the keyword END.

I use the SUM() function to calculate the total number of packages. To show this value by category, I just copied the whole CASE WHEN statement in GROUP BY.

Here’s the output:

casetotal_number_of_packages
Non-EU28,932
EU5,831

If you need more examples to understand this fully, read one of our guides:

GROUP BY Extensions: ROLLUP, CUBE, GROUPING SETS

The three SQL GROUP BY extensions are:

  • ROLLUP – Creates a summary row for the groups listed in GROUP BY.
  • CUBE – Creates subtotals for all the combinations of the groups in GROUP BY.
  • GROUPING SETS – Aggregates output on multiple levels.

In the example, I’ll show you how GROUPING SETS works. You can learn how the other two extensions work in our GROUP BY extensions guide.

The following code returns the number of packages delivered per country and per date:

SELECT 
  country_of_delivery,
  date_of_delivery,
  SUM(number_of_packages) AS total_number_of_packages
FROM package_deliveries
GROUP BY GROUPING SETS(country_of_delivery, date_of_delivery)
ORDER BY country_of_delivery, date_of_delivery;

I start by using SUM() to calculate the number of packages.

In GROUP BY, I write GROUPING SETS with all the aggregation levels inside the parentheses.

The output shows the number of delivered packages by country and by delivery date:

country_of_deliverydate_of_deliverytotal_number_of_packages
CroatiaNULL2,088
PolandNULL3,743
TurkeyNULL28,932
NULL2024-03-012,730
NULL2024-03-0211,208
NULL2024-03-0320,825

It can help to think of grouping sets as two separate queries. The first query groups by country …

SELECT 
  country_of_delivery,
  NULL AS date_of_delivery,
  SUM(number_of_packages) AS total_number_of_packages
FROM package_deliveries
GROUP BY country_of_delivery
ORDER BY country_of_delivery;

… and returns the first three rows of the previous output:

country_of_deliverydate_of_deliverytotal_number_of_packages
CroatiaNULL2,088
PolandNULL3,743
TurkeyNULL28,932

The second query groups by delivery date …

SELECT 
  NULL AS country_of_delivery,
  date_of_delivery,
  SUM(number_of_packages) AS total_number_of_packages
FROM package_deliveries
GROUP BY date_of_delivery
ORDER BY date_of_delivery;

… and outputs the other three rows of the original result:

country_of_deliverydate_of_deliverytotal_number_of_packages
NULL2024-03-012,730
NULL2024-03-0211,208
NULL2024-03-0320,825

In other words, if you merge these two separate grouping queries using UNION

SELECT 
  country_of_delivery,
  NULL AS date_of_delivery,
  SUM(number_of_packages) AS total_number_of_packages
FROM package_deliveries
GROUP BY country_of_delivery

UNION 

SELECT 
  NULL AS country_of_delivery,
  date_of_delivery,
  SUM(number_of_packages) AS total_number_of_packages
FROM package_deliveries
GROUP BY date_of_delivery
ORDER BY country_of_delivery, date_of_delivery;

… you get exactly the same result as with GROUPING SETS.

country_of_deliverydate_of_deliverytotal_number_of_packages
CroatiaNULL2,088
PolandNULL3,743
TurkeyNULL28,932
NULL2024-03-012,730
NULL2024-03-0211,208
NULL2024-03-0320,825

Common Mistakes When Using GROUP BY

Confusing ORDER BY and GROUP BY

This is often a matter of not knowing what GROUP BY does.

Let me demonstrate with an example.

Say you want to find the total number of packages by country of delivery.

If you’re not familiar with GROUP BY, you might write a naive solution using ORDER BY.

SELECT 
  date_of_delivery,
  city_of_delivery,
  country_of_delivery
FROM package_deliveries
ORDER BY country_of_delivery;

Now that you have the data sorted by country, as shown in the snapshot below, you decide to manually sum the values to get the result by country.

date_of_deliverycity_of_deliverycountry_of_delivery
2024-03-01RijekaCroatia
2024-03-01SplitCroatia
2024-03-01ZagrebCroatia
2024-03-02RijekaCroatia
2024-03-02SplitCroatia
2024-03-02ZagrebCroatia
2024-03-03ZagrebCroatia
2024-03-03SplitCroatia
2024-03-03RijekaCroatia
2024-03-01KrakowPoland
2024-03-01WarsawPoland

But why would you manually do something that GROUP BY does automatically? Instead of ordering data by country, you should group by it and use SUM() to sum the data:

SELECT 
  country_of_delivery,
  SUM(number_of_packages) AS total_number_of_packages
FROM package_deliveries
GROUP BY country_of_delivery
ORDER BY country_of_delivery;

You immediately get the output you need:

country_of_deliverytotal_number_of_packages
Croatia2,088
Poland3,743
Turkey28,932

Read more about the differences between GROUP BY and ORDER BY and how to use GROUP BY and ORDER BY together.

Confusing WHERE and HAVING

Another common mistake is trying to filter aggregated values using WHERE. That is not possible — WHERE is used to filter individual rows before aggregation. On the other hand, HAVING is for filtering groups of rows after aggregation.

If you want to get the average number of packages delivered daily by city and show only those with more than 500 packages, you might decide to write this query:

SELECT 
  city_of_delivery,
  AVG(number_of_packages) AS average_number_of_packages
FROM package_deliveries
WHERE AVG(number_of_packages) > 500
GROUP BY city_of_delivery;

It will return an error because WHERE doesn’t accept an aggregate function as an argument. (This doesn’t mean you can't ever use WHERE with GROUP BY; you can, but not to filter groups.)

In this case, you should use HAVING:

SELECT 
  city_of_delivery,
  AVG(number_of_packages) AS average_number_of_packages
FROM package_deliveries
GROUP BY city_of_delivery
HAVING AVG(number_of_packages) > 500;

It’s the same query with AVG() and GROUP BY. The only difference is HAVING, where you compare the AVG() calculation with 500.

The output shows only cities whose daily average is above 500.

city_of_deliveryaverage_number_of_packages
Izmir1,978.67
Ankara3,046.33
Istanbul4,619.00
Warsaw786.00

Learn more about this in the article about HAVING vs. WHERE.

Listing Unaggregated Columns in SELECT but Not in WHERE

The simple rule you should remember is whatever unaggregated column you write in SELECT should also be included in GROUP BY. If you don’t do that, the database won’t know which value to show if several different values are within the same group.

For example, you write the query that is supposed to calculate the total number of packages by country and each city.

SELECT 
  country_of_delivery,
  city_of_delivery,
  SUM(number_of_packages) total_number_of_packages
FROM package_deliveries
GROUP BY country_of_delivery;

Sure, you have country and city in SELECT, but you forgot to put the city in GROUP BY. This query won’t run. It will return an error. While the messages are worded differently in PostgreSQL, Oracle, and other databases, their message is the same: the column city_of_delivery must appear in GROUP BY.

Even if it ran, the database would be confused. It would think something like this: “You want to show cities but don’t want to group by city? How am I supposed to know which city to show in the output when every country has three of them? Please put cities in GROUP BY so I can show each city as a separate group.”

So, you help it by including the city of delivery in GROUP BY:

SELECT 
  country_of_delivery,
  city_of_delivery,
  SUM(number_of_packages) total_number_of_packages
FROM package_deliveries
GROUP BY country_of_delivery, city_of_delivery;

Yes, it’s grouping by two columns and you’ve already learned how to do that. Here’s the query’s output:

country_of_deliverycity_of_deliverytotal_number_of_packages
CroatiaZagreb930
CroatiaRijeka627
CroatiaSplit531
TurkeyIstanbul13,857
PolandWarsaw2,358
PolandKrakow673
TurkeyAnkara9,139
PolandGdansk712
TurkeyIzmir5,936

Other Errors

There are also some other errors, such as forgetting GROUP BY with aggregate functions or not grouping by a unique key. Those and several more are covered in this article about common GROUP BY errors.

Further GROUP BY Resources

After reading this overview of the SQL GROUP BY clause, you should have an idea of its uses and challenges. But if you want to master GROUP BY, you’ll need to go into more detail.

The best way to do that is by checking out some of our courses and cookbooks. Here are some of my suggestions for further learning:

  • Creating Basic SQL Reports – This intermediate-level course focuses on the nuances of using GROUP BY – grouping by multiple columns, using HAVING, combining GROUP BY with CASE WHEN, the difference between COUNT(*) and COUNT(id), etc.
  • GROUP BY Extensions in SQL – This course covers CUBE, ROLLUP, and GROUPING SETS.
  • SQL Practice – The courses in this track have sections dedicated to different SQL topics, including quite a lot of GROUP BY practice.

And don’t forget about our articles featuring GROUP BY examples. Two of my favorites are SQL Practice: 10 GROUP BY Practice Exercises with Detailed Solutions and Top 9 SQL GROUP BY Interview Questions.

In addition to that, there’s an SQL for Data Analysis Cheat Sheet, which has a section dedicated to GROUP BY.

Happy learning!