18th Dec 2020 12 minutes read

SQL Window Functions for Managers: Who Gets a Raise?

window functions

Table of Contents

What Are Window Functions?
Why Are Window Functions Useful?
Window Functions List
The Total Number and Mean of New Clients for Each Salesman
Salesman Ranking per Each Quarter
Grouping Salesmen by Their Performance
Number of Acquired Clients From Previous Quarter
Summary

Window functions are one of the most powerful features in modern SQL. In this article, you will learn why window functions are so great. I will list you the most useful window functions and briefly explain when and how you can use them.

Simple SELECT statements, in combination with WHERE, GROUP BY, and HAVING, are sufficient for many analyses at work. I used standard SQL for many years before I realized there is something more.

A few years ago, I started learning about a great “modern” feature: window functions. And I have been using them ever since. Once you master the syntax and realize how great window functions are, you will use them often.

In this article, I'm going to show you a list of window functions, why they’re so great, and how you can use them. Afterward, I will dive deep into examples.

What Are Window Functions?

Window functions are a special feature in SQL that allows you to do calculations across a set of rows. They are also called OVER functions or analytics functions. You might be thinking, “Hey, this is similar to a GROUP BY statement.” Well, there is a huge difference between them.

When using a basic select-group, we do calculations and display them on a group level. The rows are collapsed and presented in the result set.

When using a window function, there is no collapsing. Each record gets its own calculation. For each row in a table, we define its window frame and do a specific calculation.

Let’s explain this with an example. Below is the kpi_new_clients table that contains information about salesmen’s KPI (key performance indicator), a measure for success, each quarter:

salesman	period	new_clients
Olivia Smith	Q1	9
Olivia Smith	Q2	7
Olivia Smith	Q3	12
Olivia Smith	Q4	10
Lily Jones	Q1	7
Lily Jones	Q2	5
Lily Jones	Q3	8
Lily Jones	Q4	6
Alfred Brown	Q1	4
Alfred Brown	Q2	6
Alfred Brown	Q3	11
Alfred Brown	Q4	9
Sonny Lee	Q1	10
Sonny Lee	Q2	10
Sonny Lee	Q3	11
Sonny Lee	Q4	11

The table contains three columns:

salesman — first and last name of the salesmen
period — Q1, Q2, Q3, and Q4 represent each quarter in the year
new_clients — number of new clients that each salesman acquired in one quarter

Now, imagine you are the boss and you want to see an additional column, benchmark, that shows the best result achieved by the salesmen in each quarter. Once you have a benchmark assigned to each record, you can easily see how close each salesman was. This calculation can be easily done using window functions:

SELECT *,
      MAX(new_clients) OVER (PARTITION BY period) AS benchmark 
FROM kpi_new_clients;

Once you run this query, the SQL engine will display an entire dataset with the additional column, benchmark. This column represents the maximum value of new clients for a specific quarter:

Using a window function for the benchmark calculation.

Let’s briefly go through the syntax:

OVER denotes that this is a window function. Due to this keyword, sometimes window functions are also called OVER functions.
PARTITION BY tells us how the rows are grouped into logical chunks/groups. In our example, the rows are grouped on a period level, which means that for each quarter, we will separately calculate a maximum value.

Now that you are familiar with the syntax, I will briefly describe why window functions are so useful and go over the most commonly used ones.

Why Are Window Functions Useful?

By using window functions, you can perform certain complex business calculations with only a few lines of code. This is especially helpful with time series data analysis and calculations like moving averages, running totals, rankings, etc. This is nicely described in our articles When to use SQL window functions and SQL Course of the Month – Window Functions.

Also, by using window functions, you will have much cleaner and more readable code that is easier to maintain. I will show you what I mean with several examples later on.

Now, let's dive deep into the window functions list.

Window Functions List

Before I introduce the window functions list, I want to explain the main types of window functions:

Aggregate functions — These are regular aggregate functions that you have probably used with GROUP BY. However, they can also be used with OVER(). Unlike regular aggregations used in combination with GROUP BY, when they are used with OVER(), rows are not collapsed. Each record gets its own calculated values. This group of functions represents sum, avg, min, max, and count.
Ranking window functions — These are used to assign a rank or row number to each record inside a partition. The most famous functions in this group are rank(), dense_rank(), and row_number().
Positional window functions — Functions like first_value, last_value, lead, and lag return a single value from a particular row in each window frame (there are no aggregations). This “value” can be the value of the first/last record in each window frame, or it can return a value from the previous row or from the next row (lead/lag).
Distribution functions — In this group, there are two famous functions: cume_dist and percent_rank. Both calculate where each row value stands in a group of other values inside the same group/partition/window frame.

Okay, now that you are aware of the function types, it is time to introduce the most famous window functions. Below is the window function list:

Function	Category	Description
SUM()	AGGREGATION	returns total sum of all input values for each partition/window frame
AVG()	AGGREGATION	returns average value of all input values for each partition/window frame
MIN(), MAX()	AGGREGATION	returns min or max value among all input values for each partition/window frame
COUNT()	AGGREGATION	returns total number of input values for each partition/window frame
CUME_DIST()	DISTRIBUTION	returns cumulative distribution of input values for each partition/window frame (it is calculated by the formula: total number of rows that are less or equal to current value divided by the total number of rows in the partition/window frame)
PERCENT_RANK()	DISTRIBUTION	returns relative percent rank of a given row (similar to cume_dist, based on the number of rows in the group that have a lower value than the current row)
ROW_NUMBER()	RANKING	assigns a sequential integer to each row within the partition of a result set
RANK()	RANKING	assigns a rank to each value within the group/partition (values need to be ordered so that ORDER BY can define an order; if two values are the same, they receive the same rank)
DENSE_RANK()	RANKING	Assigns the same rank to rows with equal values for the ranking criteria (very similar to rank function)
NTILE()	RANKING	Assigns to each value in a list a number that represents a bucket or group (you specify a number of buckets, and ntile() determines what belongs to which bucket)
FIRST_VALUE()	POSITIONAL	returns a value of the expression for the first row according to the given order (it's different from max)
LAST_VALUE()	POSITIONAL	returns the value of the expression for the last row according to the given order
LEAD()	POSITIONAL	assigns a value to each row in a table that is stored in a row after the current one (next row value)
LAG()	POSITIONAL	assigns a value to each row in a table that is stored in a row before the current one (previous row value)

Now, it is time to see window functions in action. Let's go through several examples.

The Total Number and Mean of New Clients for Each Salesman

We could display the total number of new clients and mean value for each salesman in one year (Q1, Q2, Q3, and Q4) by using GROUP BY syntax:

SELECT salesman,
      avg(new_clients) AS average,
      sum(new_clients) AS total
FROM kpi_new_clients group by salesman;

You are probably now asking yourself, “Why should I use window functions if I can do the calculation by using GROUP BY? Why not stick with the regular syntax?” Well, in this case, window functions are great for adding averages and total sums to each input row as an additional attribute without collapsing.

With GROUP BY, you will get collapsed results:

salesman	average	total
Alfred Brown	7.5	30
Lily Jones	6.5	26
Olivia Smith	9.5	38
Sonny Lee	10.5	42

You got four rows. If you would like to see the averages and totals as additional values, use window functions. See the query below:

SELECT *,
  new_clients,
  avg(new_clients) OVER (PARTITION BY salesman) AS average,
  sum(new_clients) OVER (PARTITION BY salesman) AS total
FROM kpi_new_clients;

Once you run this query, you will see the whole table with two additional columns:

With a result presented like this, you can easily check in which quarter each salesman was below or above his/her average and answer similar questions. Our data set is now enriched with additional values that give us better insights.

Salesman Ranking per Each Quarter

We could analyze our dataset further by ranking our salesmen. Who was the best in each quarter? Who has acquired the most and who has not been successful in acquiring new clients?

To answer these questions, you can use window ranking functions—row_number, rank, and dense_rank. These analytics functions work similarly— rank is assigned to each record in a dataset by ordering records according to some value.

In this case, we are interested in who acquired the most clients. So, it is important that the column new_clients is ordered in descending order. Here is the SELECT statement:

SELECT *,
      ROW_NUMBER() OVER (PARTITION BY period ORDER BY new_clients desc) AS rownumber_salesman,
      RANK() OVER (PARTITION BY period ORDER BY new_clients desc) AS rank_salesman,
      dense_rank() OVER (PARTITION BY period ORDER BY new_clients desc) AS dense_rank_salesman
FROM kpi_new_clients;

Once you run this statement, the following output will be displayed on the screen:

salesman	period	new_clients	rownumber_salesman	rank_salesman	dense_rank_salesman
Sonny Lee	Q1	10	1	1	1
Olivia Smith	Q1	9	2	2	2
Lily Jones	Q1	7	3	3	3
Alfred Brown	Q1	4	4	4	4
Sonny Lee	Q2	10	1	1	1
Olivia Smith	Q2	7	2	2	2
Alfred Brown	Q2	6	3	3	3
Lily Jones	Q2	5	4	4	4
Olivia Smith	Q3	12	1	1	1
Alfred Brown	Q3	11	2	2	2
Sonny Lee	Q3	11	3	2	2
Lily Jones	Q3	8	4	4	3
Sonny Lee	Q4	11	1	1	1
Olivia Smith	Q4	10	2	2	2
Alfred Brown	Q4	9	3	3	3
Lily Jones	Q4	6	4	4	4

Here is a short code explanation:

In this example, we created three additional columns for the results of the window functions: rownumber_salesman, rank_salesman, and dense_rank_salesman.
The rows are grouped into partitions by quarters so that each ranking will be done on each partition separately.
row_number, rank, and dense_rank assigned 1 for the salesman with the highest number of new clients in a given period, two for the second-highest, and so on.
We used OVER in a combination with ORDER BY. ORDER BY is used for partition ordering (we sorted the records inside each partition in a specified order). In our example, each partition was sorted by the column new_clients in descending order. After, row ordering ranking was applied.

Notice that all three analytics functions worked in almost the same way. The only difference lies in the records that have the same value in new_clients.

Let's take a look at the numbers from the third quarter:

salesman	period	new_clients	rownumber_salesman	rank_salesman	dense_rank_salesman
Olivia Smith	Q3	12	1	1	1
Alfred Brown	Q3	11	2	2	2
Sonny Lee	Q3	11	3	2	2
Lily Jones	Q3	8	4	4	3

In the third quarter, Alfred and Sonny acquired the same number of clients?: 11. Olivia did better, and Lily acquired the least. How do we want to rank if more records have the same value in new_clients?

If we want Alfred and Sonny to get the same rank, we should use rank or dense_rank. These two functions assign the same rank to equal values. If we want each salesman to get a different rank, we should use row_number.

What about Lily if Alfred and Sonny share second place? Do we want to give her third or fourth place? If third, we should use dense_rank. If fourth, we should use rank.

This is the main difference between rank and dense_rank. dense_rank does not skip any ranks if there is a tie in the preceding records.

More details about ranking functions can be found in our article: Overview of Ranking Functions in SQL. Take a look if you are keen to learn more about ranking.

Grouping Salesmen by Their Performance

Another useful function is NTILE. This window function breaks ordered records into a specified number of equal groups.

So, we could form groups or buckets of salesmen by their performance. We could do this as follows:

WITH total_by_quarter AS (
SELECT 
   salesman,
   sum(new_clients) AS new_clients
FROM kpi_new_clients 
GROUP BY salesman
)
SELECT 
  *,
  NTILE(2) OVER (ORDER BY new_clients desc) AS bucket_performance 
FROM total_by_quarter;

Here is the code explanation:

We used CTEs (common table expressions). total_by_quarter is the temporary result set that stores the total number of acquired clients in the year (Q1, Q2, Q3, and Q4 merged) for each salesman.
Each salesman belongs to one bucket/group. This is stored in the column bucket_performance which is created by using NTILE.
NTILE takes one argument?: ?the number of buckets. In our case, this argument is set to 2, which means that two salesmen groups are created.

The result of our query looks like this:

Sonny and Olivia acquired more new clients, so they are in bucket 1. Alfred and Lily are in bucket 2. The groups are equal in size (they both have two salesmen), and they are formed according to the ordered new_clients variable.

Number of Acquired Clients From Previous Quarter

In the last example, I’m going to introduce a positional function. We are going to use lag to return the value of new_clients from the previous quarter for each salesman separately.

Here is the code:

SELECT 
  *,
  LAG(new_clients) OVER (PARTITION BY salesman ORDER BY period) 
     AS previous_period_new_clients 
FROM kpi_new_clients;

Keep this in mind:

Inside the OVER clause, we used PARTITION BY (we want to group the records for each salesman separately). For each salesman, the rows are ordered by period column.
To each current row, we assigned the value of the previous one (the lag function does this).
Some rows are populated with null. These are rows for which the previous row value is unknown.

Similarly, the lead function can be used as well. lead takes values from the following row.

Summary

In this article, you learned about window functions — why they are useful and when you could use them. I gave you a list of window functions, and through several examples, I explained the syntax and how you can use them at work.

If you want to learn more, I suggest taking the Window Functions course available on our LearnSQL.com platform. This course is interactive, which means it contains a lot of examples for you to practice with.

If you are working with MS SQL or PostgreSQL, I suggest Window Functions in MS SQL Server or Window Functions in PostgreSQL. All three courses mentioned above contain the same content, except that they are for different SQL engines.

There is also a cool cheat sheet that you can use while practicing: SQL Window Functions Cheat Sheet. It is a good resource for learning window functions.

Tags:

window functions