25th Jan 2022 | 15 minutes read

# The SQL OVER() Clause Explained

by Tihomir Babic

This is a detailed guide on using the OVER() clause in SQL window functions. I will focus specifically on this clause, which requires that you have at least a general idea of how window functions work. To make the best of this article, take a look at the SQL Window Functions Cheat Sheet. Feel free to keep it by your side as we go.

## What Is the SQL OVER() Clause?

Let’s start by answering the first question that springs to mind: what is an OVER() clause? It is what makes a function a window function. The OVER() clause lets you define the windows for window functions. It also lets you define the order in which a given window function is executed.

## The SQL OVER() Clause Syntax

The syntax of this clause is as follows:

```sql
SELECT <column name>,
       <window function> OVER(
         [PARTITION BY <column names>]
         [ORDER BY <column names>]
         [<ROWS or RANGE clause>])
FROM table;
```

The three distinct parts of the OVER() clause syntax are:

- PARTITION BY
- ORDER BY
- the window frame (ROWS or RANGE clause)

I’ll walk through each of these.

### PARTITION BY

This clause lets you define the window. It partitions a dataset into smaller segments called windows. When PARTITION BY is omitted, the window function is applied to the whole dataset.
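To see the omitted-PARTITION BY case in action, here is a minimal sketch (not from the article) that runs an aggregate with an empty OVER () against a made-up table. It uses SQLite through Python’s sqlite3 module; window functions require SQLite 3.25 or newer.

```python
import sqlite3

# Made-up sales table in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_sales (id INTEGER, product TEXT, sales REAL)")
conn.executemany(
    "INSERT INTO product_sales VALUES (?, ?, ?)",
    [(1, "Ax", 100.0), (2, "Hammer", 200.0), (3, "Ax", 50.0)],
)

# With an empty OVER (), the aggregate runs over the whole dataset:
# every row keeps its own data and carries the same grand total.
rows = conn.execute(
    "SELECT product, sales, SUM(sales) OVER () AS total_sales FROM product_sales"
).fetchall()

for product, sales, total in rows:
    print(product, sales, total)  # total_sales is 350.0 on every row
```

Swapping `OVER ()` for `OVER (PARTITION BY product)` would instead repeat the per-product total on each row, which is exactly what the next example demonstrates.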
Let’s use a small example dataset to see how it works:

| id | date       | sales    | product |
|----|------------|----------|---------|
| 1  | 2021-12-06 | 1,357.22 | Ax      |
| 2  | 2021-12-06 | 2,154.88 | Hammer  |
| 3  | 2021-12-07 | 1,984.88 | Ax      |
| 4  | 2021-12-07 | 3,147.11 | Hammer  |

If I want to see total sales by product, GROUP BY comes to mind. I can write a query like this:

```sql
SELECT product,
       SUM(sales) AS sales_per_product
FROM product_sales
GROUP BY product;
```

The result is:

| product | sales_per_product |
|---------|-------------------|
| Ax      | 3,342.10          |
| Hammer  | 5,301.99          |

GROUP BY works fine if I only want to see aggregate values for each group. However, if I want to see aggregate values for each group while preserving row-level data, I need to use PARTITION BY. For example:

```sql
SELECT product,
       date,
       sales,
       SUM(sales) OVER (PARTITION BY product) AS sales_per_product
FROM product_sales;
```

This query is just to show you what PARTITION BY does. Don’t worry; later on, I’ll show you example queries and explain them in detail. For now, take a look at the result of the query above:

| product | date       | sales    | sales_per_product |
|---------|------------|----------|-------------------|
| Ax      | 2021-12-06 | 1,357.22 | 3,342.10          |
| Ax      | 2021-12-07 | 1,984.88 | 3,342.10          |
| Hammer  | 2021-12-06 | 2,154.88 | 5,301.99          |
| Hammer  | 2021-12-07 | 3,147.11 | 5,301.99          |

Here, I have sales per product as in the previous example, but I also have the sales data for every product and date. The partition in this example is the product: the ax is one window partition, while the hammer is another. The window function sums all sales values for the ax; it then goes to the next window partition, resets, and sums all sales values for the hammer.

This illustrates the main difference between GROUP BY and PARTITION BY: GROUP BY collapses rows and returns one row of aggregate values per group, while PARTITION BY preserves all rows and repeats the aggregate value on every row of the partition. To clarify this topic further, read the article dedicated to the difference between GROUP BY and PARTITION BY.

### ORDER BY

You can probably guess what this clause does. Yes, it orders something. But what?
Unlike the regular ORDER BY, the ORDER BY in an OVER() clause does not sort the query result. When used as part of an OVER() clause, ORDER BY defines the order in which the window function is executed. Both ASC and DESC may be used with it. ASC means the window function is executed in ascending order: A to Z, from the lowest to the highest number, from the oldest to the newest date, and so on. If you use DESC, the window function is executed in the reverse order. If ORDER BY is omitted, the window function is executed in arbitrary order.

Whichever way you use it, keep in mind that ORDER BY in an OVER() clause has nothing to do with how the query output is sorted. You may still sort the result even when using window functions; you just have to write another ORDER BY clause at the end of your query, like you usually do.

### The Window Frame

A window frame is a set of rows related to the current row, i.e., the row where the window function is currently being executed. Window frames are defined using the ROWS or RANGE keyword. These keywords set the upper and lower bounds of a window frame, thus defining the frame itself. The difference between these two keywords is explained in an article about the RANGE clause.

#### The Default Window Frame

The default window frame depends on whether ORDER BY is present or not. There are two possible scenarios when ORDER BY is omitted:

- OVER (): both PARTITION BY and ORDER BY are omitted.
- OVER (PARTITION BY …): only ORDER BY is omitted.

In both cases, the default window frame is ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. This means the window frame is all rows: all rows before the current row, the current row, and all rows after it.

In addition, there are two possible scenarios when ORDER BY is present:

- OVER (ORDER BY …): PARTITION BY is omitted.
- OVER (PARTITION BY … ORDER BY …): neither PARTITION BY nor ORDER BY is omitted.
In these two cases, the default window frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This means all values (not rows!) before the current row, plus the current row itself.

#### Meaning of Window Frame Abbreviations

While I’m at it, here’s a list of all possible upper and lower bounds for the window frame and what they mean:

| Abbreviation        | Meaning                                     |
|---------------------|---------------------------------------------|
| UNBOUNDED PRECEDING | BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW |
| n PRECEDING         | BETWEEN n PRECEDING AND CURRENT ROW         |
| CURRENT ROW         | BETWEEN CURRENT ROW AND CURRENT ROW         |
| n FOLLOWING         | BETWEEN CURRENT ROW AND n FOLLOWING         |
| UNBOUNDED FOLLOWING | BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING |

If you want to see this in an example, here’s an article about defining a window frame in SQL window functions.

Now that we have gone through the technical stuff, it’s time to explain the OVER() clause with examples.

## The SQL OVER() Clause Examples

To understand the examples, you have to become familiar with the data first.

### The Dataset Used

All four examples use the same dataset, which consists of three tables:

- stores
- products
- inventory

The table stores is a list of the stores selling tea. Here’s what it looks like:

| id | store_name |
|----|------------|
| 1  | Zaandam    |
| 2  | Amsterdam  |
| 3  | Rotterdam  |
| 4  | Utrecht    |

The next table lists the tea producers and the products they make; it’s the products table:

| id | product_brand         | product_name  |
|----|-----------------------|---------------|
| 1  | Mr. Wellington        | Earl Grey     |
| 2  | Karnataka Tea Company | Kashmir Tchai |
| 3  | Fukuyama & Co.        | Matcha        |
| 4  | Fukuyama & Co.        | Sencha        |

The table inventory shows the quantities of tea available in every store. It has much more data than the first two tables, so here are just the first several rows:

| id | inventory_date | opening_quantity | closing_quantity | product_id | store_id |
|----|----------------|------------------|------------------|------------|----------|
| 1  | 2021-12-06     | 13               | 8                | 1          | 1        |
| 2  | 2021-12-07     | 8                | 1                | 1          | 1        |
| 3  | 2021-12-08     | 52               | 50               | 1          | 1        |
| 4  | 2021-12-09     | 50               | 44               | 1          | 1        |
| 5  | 2021-12-10     | 44               | 32               | 1          | 1        |
| 6  | 2021-12-06     | NULL             | NULL             | 2          | 1        |
| 7  | 2021-12-07     | NULL             | NULL             | 2          | 1        |
| 8  | 2021-12-08     | 28               | 20               | 2          | 1        |
| 9  | 2021-12-09     | 20               | 17               | 2          | 1        |
| 10 | 2021-12-10     | 17               | 17               | 2          | 1        |

As you can see, there are NULL values. They mean the tea was not available when the store opened or closed.
In the table above, product_id = 2 in store_id = 1 wasn’t available for two days: 2021-12-06 and 2021-12-07. Generally, the closing_quantity of a given day is the opening_quantity of the following day. If it’s not, a new order was delivered and accounted for before the store opened. This is the case, for example, in rows 2 and 3: the closing quantity on 2021-12-07 is 1, but the opening quantity on 2021-12-08 is 52.

Ready for some OVER() clause examples? I bet you are!

### Example 1: With PARTITION BY but Without ORDER BY

In this example, the task is to show the inventory date, the quantity at store open, and the quantity at store close. We also want the highest quantity at store close for each day, regardless of the store or the product. To do that, I’ve written this query:

```sql
SELECT inventory_date,
       opening_quantity,
       closing_quantity,
       MAX(closing_quantity) OVER (PARTITION BY inventory_date) AS daily_max
FROM inventory;
```

This query uses only PARTITION BY in the OVER() clause. What does it do? First, it selects three columns from the table inventory. Then I use the MAX() aggregate function to find the highest product quantity at store close. The OVER() clause makes this aggregate function a window function. In it, I partition the window by the column inventory_date. This means I get the highest product quantity at store close per inventory date, across all stores and all products.

The table below shows all the data for the first date (December 6) and the first row of the second date (December 7) for simplicity. Of course, the query returns much more data.

| inventory_date | opening_quantity | closing_quantity | daily_max |
|----------------|------------------|------------------|-----------|
| 2021-12-06     | 128              | 51               | 397       |
| 2021-12-06     | NULL             | NULL             | 397       |
| 2021-12-06     | 87               | 87               | 397       |
| 2021-12-06     | 22               | 21               | 397       |
| 2021-12-06     | 13               | 8                | 397       |
| 2021-12-06     | 247              | 200              | 397       |
| 2021-12-06     | 42               | 39               | 397       |
| 2021-12-06     | 33               | 33               | 397       |
| 2021-12-06     | 44               | 41               | 397       |
| 2021-12-06     | 73               | 71               | 397       |
| 2021-12-06     | 97               | 94               | 397       |
| 2021-12-06     | 412              | 397              | 397       |
| 2021-12-06     | 39               | 33               | 397       |
| 2021-12-06     | 62               | 58               | 397       |
| 2021-12-06     | 49               | 44               | 397       |
| 2021-12-06     | 79               | NULL             | 397       |
| 2021-12-07     | 58               | 54               | 327       |

The table lists the quantities at store open and store close for every store-product combination.
The highest quantity at store close is in the column daily_max. As you can see, it is 397 for the date 2021-12-06. You can verify this is the highest value: it appears in the row with an opening quantity of 412. The NULL values are shown, but they don’t impact the result.

The data is presented the same way for every date in the dataset. You can see the next date is 2021-12-07, and the highest product quantity at store close for this date is 327. The same principle applies to the rest of the table.

### Example 2: With Both PARTITION BY and ORDER BY

In this example, I’ll show you how to use both PARTITION BY and ORDER BY in an OVER() clause. To do that, I write a query that displays the product brand, the product name, and the inventory date. I also want to show the daily quantity sold. In addition, I want to see the daily cumulative quantity sold per brand across all stores. You need to know how to calculate a running total for this example.

Before showing you the query, let’s make sure you understand what a running total is. Here it is with the example data:

| id | date       | sales    | running_total | product |
|----|------------|----------|---------------|---------|
| 1  | 2021-12-06 | 1,357.22 | 1,357.22      | Ax      |
| 2  | 2021-12-06 | 2,154.88 | 3,512.10      | Hammer  |
| 3  | 2021-12-07 | 1,984.88 | 5,496.98      | Ax      |
| 4  | 2021-12-07 | 3,147.11 | 8,644.09      | Hammer  |

A running total is the sum of the current row plus all the preceding rows. In the first row, it’s 1,357.22 + 0 = 1,357.22 because there are no preceding rows. For the next row, it is 2,154.88 + 1,357.22 = 3,512.10. The same logic applies to all remaining rows: 1,984.88 + 3,512.10 = 5,496.98 and 3,147.11 + 5,496.98 = 8,644.09.

Now let’s calculate the running total in SQL:

```sql
SELECT product_brand,
       product_name,
       inventory_date,
       opening_quantity - closing_quantity AS daily_sales,
       SUM(opening_quantity - closing_quantity) OVER (
         PARTITION BY product_brand
         ORDER BY inventory_date) AS cumulative_sales_per_brand
FROM products p
JOIN inventory i
ON p.id = i.product_id;
```

The query selects certain columns from the tables products and inventory.
Then, I calculate the difference between the quantities at store open and store close in the column daily_sales. After that, this difference is used in the SUM() window function to get the sum of daily sales. The window is partitioned by the column product_brand because I want to see the sum of daily sales at the brand level. The ordinary SUM() aggregate function thus becomes a window function that calculates a cumulative sum. Finally, the window function is executed by inventory date in ascending order. Had I not used ORDER BY, the SUM() function would have been executed in arbitrary order, which is not much of a cumulative sum/running total.

Note: If neither ASC nor DESC is specified, ASC is the default.

The query returns the result below. Again, I’m only showing a handful of rows:

| product_brand  | product_name | inventory_date | daily_sales | cumulative_sales_per_brand |
|----------------|--------------|----------------|-------------|----------------------------|
| Fukuyama & Co. | Matcha       | 2021-12-06     | 5           | 27                         |
| Fukuyama & Co. | Sencha       | 2021-12-06     | 1           | 27                         |
| Fukuyama & Co. | Matcha       | 2021-12-06     | 0           | 27                         |
| Fukuyama & Co. | Sencha       | 2021-12-06     | 0           | 27                         |
| Fukuyama & Co. | Sencha       | 2021-12-06     | NULL        | 27                         |
| Fukuyama & Co. | Matcha       | 2021-12-06     | 3           | 27                         |
| Fukuyama & Co. | Sencha       | 2021-12-06     | 15          | 27                         |
| Fukuyama & Co. | Matcha       | 2021-12-06     | 3           | 27                         |
| Fukuyama & Co. | Matcha       | 2021-12-07     | 4           | 224                        |
| Fukuyama & Co. | Sencha       | 2021-12-07     | 98          | 224                        |
| Fukuyama & Co. | Matcha       | 2021-12-07     | 3           | 224                        |
| Fukuyama & Co. | Matcha       | 2021-12-07     | 11          | 224                        |
| Fukuyama & Co. | Sencha       | 2021-12-07     | 1           | 224                        |
| Fukuyama & Co. | Sencha       | 2021-12-07     | 70          | 224                        |
| Fukuyama & Co. | Matcha       | 2021-12-07     | 5           | 224                        |
| Fukuyama & Co. | Sencha       | 2021-12-07     | 5           | 224                        |

According to the result, the quantity sold for Fukuyama & Co. products is 27 for 2021-12-06, and the total quantity sold on 2021-12-07 is 197. However, the quantity in the column cumulative_sales_per_brand is 224. Why? Because it’s a cumulative sum: it shows the total sold on that day plus the totals of all prior days. In other words, you get 224 if you sum all the daily_sales values for 2021-12-06 and 2021-12-07 (27 + 197 = 224). This works the same way for all other dates and brands.
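If you want to experiment with this running-total pattern outside PostgreSQL, here is a sketch with made-up numbers (not the article’s tea dataset), using SQLite through Python’s sqlite3 module; SQLite supports window functions since version 3.25.

```python
import sqlite3

# Made-up daily sales for two products.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_sales (date TEXT, product TEXT, sales REAL)")
conn.executemany(
    "INSERT INTO product_sales VALUES (?, ?, ?)",
    [
        ("2021-12-06", "Ax", 10.0),
        ("2021-12-07", "Ax", 20.0),
        ("2021-12-06", "Hammer", 5.0),
        ("2021-12-07", "Hammer", 15.0),
    ],
)

# PARTITION BY restarts the running total for each product;
# ORDER BY makes SUM() accumulate from the oldest date onward.
rows = conn.execute(
    """
    SELECT product, date, sales,
           SUM(sales) OVER (PARTITION BY product ORDER BY date) AS running_total
    FROM product_sales
    ORDER BY product, date
    """
).fetchall()

for row in rows:
    print(row)
# ('Ax', '2021-12-06', 10.0, 10.0)
# ('Ax', '2021-12-07', 20.0, 30.0)
# ('Hammer', '2021-12-06', 5.0, 5.0)
# ('Hammer', '2021-12-07', 15.0, 20.0)
```

Note the final ORDER BY outside the OVER() clause: it only sorts the output, while the ORDER BY inside OVER() controls how the sum accumulates.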
### Example 3: With Both PARTITION BY and ORDER BY

Now, say you want to show the store name, the inventory date, the brand, and the products. Again, we will show the daily quantity sold, but this time alongside the average daily quantity sold at each store. This is a cumulative average that changes with every passing day and from the lowest to the highest daily quantity sold.

The general principle of calculating a cumulative average is shown in the table below:

| id | date       | sales    | cumulative_average | product |
|----|------------|----------|--------------------|---------|
| 1  | 2021-12-06 | 1,357.22 | 1,357.22           | Ax      |
| 2  | 2021-12-06 | 2,154.88 | 1,756.05           | Hammer  |
| 3  | 2021-12-07 | 1,984.88 | 1,832.33           | Ax      |
| 4  | 2021-12-07 | 3,147.11 | 2,161.02           | Hammer  |

A moving average is calculated as an average of a subset of the data. Generally, moving averages take into account the current row, a certain number of rows or values preceding it, and/or a certain number of rows or values following it. In the example above, I calculate an average that includes the current row and all the rows preceding it. This is a type of moving average, but I’ll call it a cumulative average to make the distinction.

The cumulative average for the first row is 1,357.22. This is the same as the row’s sales value, because there are no rows before the first one. The calculation for the second row is (1,357.22 + 2,154.88)/2 = 1,756.05; for the next one, (1,357.22 + 2,154.88 + 1,984.88)/3 = 1,832.33. The last cumulative average follows the same principle: (1,357.22 + 2,154.88 + 1,984.88 + 3,147.11)/4 = 2,161.02.
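The cumulative-average arithmetic above can be double-checked with a few lines of plain Python; the numbers are the four sales figures from the example table.

```python
# Recompute the cumulative averages from the example table:
# each average covers the current row plus all preceding rows.
sales = [1357.22, 2154.88, 1984.88, 3147.11]

running_sum = 0.0
cumulative_avg = []
for i, value in enumerate(sales, start=1):
    running_sum += value              # sum of rows 1..i
    cumulative_avg.append(running_sum / i)

for avg in cumulative_avg:
    print(round(avg, 2))
# 1357.22
# 1756.05
# 1832.33
# 2161.02
```

Rounded to two decimals, the results match the cumulative_average column row by row.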
Now, the cumulative average in SQL:

```sql
SELECT store_name,
       inventory_date,
       product_brand,
       product_name,
       opening_quantity - closing_quantity AS daily_sales,
       AVG(opening_quantity - closing_quantity) OVER (
         PARTITION BY store_name
         ORDER BY inventory_date, opening_quantity - closing_quantity
         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_avg_sales
FROM stores s
JOIN inventory i
ON s.id = i.store_id
JOIN products p
ON p.id = i.product_id;
```

I select several columns from all three tables. Then, I calculate the daily sales the same way as in Example 2. I use the AVG() aggregate function to calculate the average daily sales. I need the average by store, so I partition the window by the column store_name. Then, I order the data in the partition by inventory date and daily sales using an ORDER BY clause. This means the window function calculates the average from the oldest to the newest date and from the lowest to the highest daily sales. Why? Because I want to see my cumulative averages as they change. Finally, the window frame is defined as ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: the average includes the current row and all the preceding rows, within the defined window partition, of course.

Take a look at the resulting table:

| store_name | inventory_date | product_brand         | product_name  | daily_sales | cumulative_avg_sales |
|------------|----------------|-----------------------|---------------|-------------|----------------------|
| Zaandam    | 2021-12-06     | Fukuyama & Co.        | Matcha        | 0           | 0.00                 |
| Zaandam    | 2021-12-06     | Fukuyama & Co.        | Sencha        | 1           | 0.50                 |
| Zaandam    | 2021-12-06     | Mr. Wellington        | Earl Grey     | 5           | 2.00                 |
| Zaandam    | 2021-12-06     | Karnataka Tea Company | Kashmir Tchai | NULL        | 2.00                 |
| Zaandam    | 2021-12-07     | Fukuyama & Co.        | Sencha        | 1           | 1.75                 |
| Zaandam    | 2021-12-07     | Mr. Wellington        | Earl Grey     | 7           | 2.80                 |
| Zaandam    | 2021-12-07     | Fukuyama & Co.        | Matcha        | 11          | 4.17                 |
| Zaandam    | 2021-12-07     | Karnataka Tea Company | Kashmir Tchai | NULL        | 4.17                 |
| Zaandam    | 2021-12-08     | Fukuyama & Co.        | Matcha        | 0           | 3.57                 |
| Zaandam    | 2021-12-08     | Fukuyama & Co.        | Sencha        | 1           | 3.25                 |
| Zaandam    | 2021-12-08     | Mr. Wellington        | Earl Grey     | 2           | 3.11                 |
| Zaandam    | 2021-12-08     | Karnataka Tea Company | Kashmir Tchai | 8           | 3.60                 |
| Zaandam    | 2021-12-09     | Fukuyama & Co.        | Sencha        | 2           | 3.45                 |
| Zaandam    | 2021-12-09     | Karnataka Tea Company | Kashmir Tchai | 3           | 3.42                 |
| Zaandam    | 2021-12-09     | Mr. Wellington        | Earl Grey     | 6           | 3.62                 |
| Zaandam    | 2021-12-09     | Fukuyama & Co.        | Matcha        | 24          | 5.07                 |
| Zaandam    | 2021-12-10     | Karnataka Tea Company | Kashmir Tchai | 0           | 4.73                 |
| Zaandam    | 2021-12-10     | Fukuyama & Co.        | Matcha        | 0           | 4.44                 |
| Zaandam    | 2021-12-10     | Fukuyama & Co.        | Sencha        | 1           | 4.24                 |
| Zaandam    | 2021-12-10     | Mr. Wellington        | Earl Grey     | 12          | 4.67                 |

Let me prove the calculation is correct using the principle you already know. The cumulative average for the first row is 0.00 because the value of this row is 0 and there are no preceding rows. How about the second row? (0 + 1)/2 = 0.50. So far, so good! The third row: (0 + 1 + 5)/3 = 2.00. That’s easy. The same with the next row: (0 + 1 + 5 + NULL)/4 = 1.50. But the table above says the cumulative average is 2.00, not 1.50. Ha, SQL has made an error, right? No, you have made an error if you treated the NULL value as 0. When calculating averages, SQL simply ignores NULL values, both in the sum and in the row count. So, the calculation for this row is again (0 + 1 + 5)/3 = 2.00, which is exactly what the table shows. How about the next row? You skip the row with the NULL value and calculate the cumulative average this way: (0 + 1 + 5 + 1)/4 = 1.75.

### Example 4: Without PARTITION BY but With ORDER BY DESC

This time, I’ll show you how to use window functions without the PARTITION BY clause but with the data in the window in descending order. I want to rank the data by the quantity at store open, in descending order. In addition to the rank, I want the result to show the following columns: inventory_date, product_brand, product_name, and opening_quantity. Take a look:

```sql
SELECT inventory_date,
       product_brand,
       product_name,
       opening_quantity,
       DENSE_RANK() OVER (ORDER BY opening_quantity DESC) AS rank
FROM inventory i
JOIN products p
ON i.product_id = p.id;
```

First, there are several columns from the tables inventory and products.
I use the DENSE_RANK() function to rank the values. You may also use the RANK() or the ROW_NUMBER() function, but the result may be slightly different. In the OVER() clause, ORDER BY is used with the DESC keyword, so the ranking goes from the highest to the lowest quantity at store open. No partition is defined, so the ranking is applied to the whole dataset.

Here are several rows from the data the query returns:

| inventory_date | product_brand         | product_name  | opening_quantity | rank |
|----------------|-----------------------|---------------|------------------|------|
| 2021-12-06     | Karnataka Tea Company | Kashmir Tchai | NULL             | 1    |
| 2021-12-07     | Karnataka Tea Company | Kashmir Tchai | NULL             | 1    |
| 2021-12-09     | Fukuyama & Co.        | Sencha        | NULL             | 1    |
| 2021-12-08     | Karnataka Tea Company | Kashmir Tchai | 820              | 2    |
| 2021-12-09     | Karnataka Tea Company | Kashmir Tchai | 809              | 3    |
| 2021-12-09     | Karnataka Tea Company | Kashmir Tchai | 704              | 4    |
| 2021-12-06     | Fukuyama & Co.        | Sencha        | 412              | 5    |
| 2021-12-07     | Fukuyama & Co.        | Sencha        | 397              | 6    |
| 2021-12-08     | Fukuyama & Co.        | Sencha        | 327              | 7    |

As you can see, all NULL values are ranked 1. This is because I’m using PostgreSQL, which puts NULL values at the top when data is sorted in descending order. Since the first three rows have the same value, they have the same rank. The quantity 820 is ranked 2, while the quantity 809 is ranked 3. This goes on in the same way throughout the dataset.

To show that the same rank is allocated not only to NULL values, here are two rows you don’t see above:

| inventory_date | product_brand  | product_name | opening_quantity | rank |
|----------------|----------------|--------------|------------------|------|
| 2021-12-06     | Fukuyama & Co. | Matcha       | 87               | 19   |
| 2021-12-07     | Fukuyama & Co. | Matcha       | 87               | 19   |

As you can see, rows with duplicate values are ranked the same. Ranking functions can be very useful; I recommend you learn how to use all the ranking functions with PARTITION BY, too.

## Learning the SQL OVER() Clause Is Not Yet Over

My explanation of the OVER() clause is done, but that doesn’t mean your learning and practice are over. Now that you understand PARTITION BY, ORDER BY, and the window frame, continue to build on that knowledge. If you like reading about window functions, here’s a list of the best articles about window functions.
If practice is your thing, here’s an article showing you several window function examples. The best combination of learning through reading and practicing is in our Window Functions Course. It’s a part of the Advanced SQL Track, which, aside from window functions, covers topics such as GROUP BY extensions and recursive queries. Check it out! Tags: window functions