21st Jul 2020 6 minutes read Using GROUP BY in SQL Zahin Rahman group by Table of Contents GROUP BY Syntax GROUP BY a Single Column Grouping with Aggregate Functions MAX(), MIN(), AVG(), SUM() GROUP BY Multiple Columns with COUNT() Mastering SQL GROUP BY Now that you know SQL’s core commands, power up your queries with the GROUP BY clause and aggregate functions. GROUP BY is a SQL clause that arranges data into groups based on a common value (or values). It is widely used to obtain summary information for large datasets as well as for computing basic aggregate statistics. Once you have grasped SQL’s basic commands like SELECT, WHERE, and JOIN, mastering the GROUP BY clause can help you implement your SQL knowledge in real-world business intelligence scenarios. SQL GROUP BY is most often used in combination with aggregate functions like COUNT(), MAX(), MIN(), SUM(), and AVG().It groups the result set by one or more columns, essentially creating summary rows for each group. These groups are usually based on categorical rather than continuous values. For example, if you are using a dataset of global customers for your business, you can use the GROUP BY clause to summarize information such as the “total number of customers by each country” or the “total sales per customer”. GROUP BY Syntax Now that we know what the GROUP BY clause is, let's look at the syntax of a basic GROUP BY query. SELECT column_name(s) FROM table_name WHERE [condition] GROUP BY column_name1, column_name2 HAVING [condition] ORDER BY column_name Here: SELECT is the standard SQL SELECT query. GROUP BY column_name1 performs the grouping based on values in column_name1. column_name2 is used when grouping is done on more than one column; you can group by any number of columns. This is optional. HAVING [condition] and WHERE [condition] are also optional; both are used to restrict the rows affected by the GROUP BY clause. The key difference is that WHERE filters the rows before grouping and HAVING filters the groups themselves once they are formed. Important: WHERE must precede the GROUP BY clause and HAVING must follow it. ORDER BY [condition] is used after GROUP BY to sort or order the data by a given column_name. At first, GROUP BY and ORDER BY appear to do the same thing – sort the data. But that is where their similarities end. As the name suggests, ORDER BY simply orders the complete dataset in a defined order, while GROUP BY groups the data as an aggregate output. Our earlier article, The Difference Between GROUP BY and ORDER BY in Simple Words, explains this in more detail; check it out if you need a refresher. Now, let’s see how the GROUP BY clause works with real queries. GROUP BY a Single Column To help understand the effect of the GROUP BY clause, let's execute a simple query on the Orders table, which has recorded 196 individual orders. Below, you can see the first five entries in the table: OrderIDCustomerIDEmployeeIDOrderDateShipperID 102489051996-07-043 102498161996-07-051 102503441996-07-082 102518431996-07-081 102527641996-07-092 Using the Orders table, let’s find out how many unique customers have placed an order. Here’s the query: SELECT CustomerID FROM Orders GROUP BY CustomerID RESULT Number of Records: 74 CustomerID 2 3 4 5 7 This is an example of a simple GROUP BY clause on a single column (CustomerID); the output places all the rows with the same value for the CustomerID column in the same group. It also returns the number of unique customers who have placed an order (74). Essentially, a GROUP BY clause without an aggregate function behaves similarly to a DISTINCT clause: it finds the unique values in a column. Grouping can also be done by multiple columns. This is most useful when used with aggregate function(s). Grouping with Aggregate Functions The practical application of SQL’s GROUP BY command is summarizing statistical parameters for groups of data; this is usually achieved with aggregate functions. We’ll explore some examples now, using the Orders tables and the OrderDetails table, a sample of which is shown below: OrderDetailIDOrderIDProductIDQuantity 1102481112 2102484210 310248725 410249149 5102495140 MAX(), MIN(), AVG(), SUM() GROUP BY is perfect for summarizing the statistical parameters of individual categories in your dataset. For example, let’s try to find the maximum, minimum, and average quantity of each ProductID per order and the total number of each product ordered. We’ll use the OrderDetails table and put the results in descending order by the sum quantities ordered for each product. This will shed light on the most popular ProductIDs. Here’s the query: SELECT ProductID, MAX(Quantity), Min(Quantity), AVG(Quantity), SUM (Quantity) FROM OrderDetails GROUP BY ProductID Order By SUM(Quantity) DESC RESULT Number of Records: 77 ProductIDMAX(Quantity)Min(Quantity)AVG(Quantity)SUM (Quantity) 3170432.714285714285715458 6080635.833333333333336430 35100441369 5970624.714285714285715346 260731341 The table below breaks down how some of these aggregate values are calculated for the ProductID values 2 and 31. The GROUP BY clause first groups the rows together for ProductID 2 and then calculates the AVG() and MAX() for the whole group. It does the same for each ProductID for each of the aggregate functions in the query. ProductIDQuantityAverage (?Quantity/# of ProductID)Max (Quantity) 2603160 250 245 240 235 ………… 317032.71470 3160 3156 3142 3140 ………… As you see, the GROUP BY command is an extremely useful tool for finding statistical summaries for any class of variables. GROUP BY Multiple Columns with COUNT() As mentioned earlier, you can use multiple columns in the GROUP BY. In other words, you’re placing all rows with the same values in both column_name1 and column_name2 in one group. Let’s consider an example where we want to count the number of times each employee has used each shipping service. Once again, we’ll use the Orders table. SELECT EmployeeID, ShipperID, COUNT(*) FROM Orders Group BY EmployeeID, ShipperID Order BY ShipperID, Count(*) DESC RESULT Number of Records: 26 EmployeeIDShipperIDCOUNT(*) 4112 118 217 317 617 …….…….……. This query could be used, for example, to get feedback on shippers’ customer service from the employees who have used that service. As we can see above, there are 26 unique interaction pairs between employees and shippers! How does grouping by two columns work behind the scenes? Rows with the same EmployeeID and ShipperID are paired into one group. The size of those groups (i.e. the number of records in the group) is calculated, as you can see below: EmployeeIDShipperIDCount 11 11 1112 11 11 …………. 41 41 418 41 41 …………. The GROUP BY command is also a powerful tool that can be combined with the SQL HAVING and WHERE clauses to define additional conditions for the returned results. Keep in mind that there are certain subtleties when GROUP BY is used with columns that have NULL values. Mastering SQL GROUP BY Needless to say, only practice will make you master this command! Practice is best done on real-world datasets, where you can visualize and relate to the business needs. The SQL Basics course includes a comprehensive set of interactive exercises. You’ll learn how to group rows and compute statistics with aggregate functions on practical datasets. Try it out, and good luck! Tags: group by