23rd Apr 2021 7 minutes read

What Is the Role of DISTINCT in SQL?

Data Engineering

Table of Contents

SQL DISTINCT on a Single Column
SQL DISTINCT with Multiple Columns
SQL DISTINCT with Aggregate Functions
- DISTINCT with Aggregate Functions and GROUP BY
Learning about DISTINCT in SQL Is Just the Start!

The DISTINCT keyword in SQL plays an important role in removing duplicate values from your results. This article provides examples showing exactly how DISTINCT works.

By default, SQL queries show all the returned rows, including duplicate rows, in the result set. The DISTINCT keyword in the SELECT clause is used to eliminate duplicate rows and display a unique list of values. In other words, the DISTINCT keyword retrieves unique values from a table.

The basic syntax of the DISTINCT keyword in SQL is:

SELECT DISTINCT
  column_1,
  column_2,
  …,
  column_n
FROM table_name

(If you find the above syntax confusing, consider LearnSQL.com’s SQL Basics course. It covers the fundamentals of SQL, including the DISTINCT keyword.)

Note that the DISTINCT keyword must be placed before the columns you specify. If you list more than one column, DISTINCT applies to the combination of their values, and SQL returns each unique combination once. In other words, SQL DISTINCT looks for distinct rows, not for distinct values within a single column.

Let’s see the DISTINCT keyword in action with a practical example. First, though, we should familiarize ourselves with the data we’ll be working with. Imagine we have a table called employee_info, which contains details about the employees at our company:

employee_id	first_name	last_name	department	age
127374	Joshua	Rees	Sales	26
293421	Karen	Bower	Sales	45
244813	Luke	Payne	NULL	42
463281	Max	Paige	Human Resources	53
374241	Julia	Glover	Human Resources	29
568421	Rose	Harris	Human Resources	43
NULL	NULL	NULL	NULL	NULL
839211	Olivia	Baker	Sales	32
293493	Karen	Bower	Sales	45
463281	Max	Paige	Human Resources	53
244813	Luke	Payne	NULL	42
933821	Bernadette	Parsons	Sales	NULL
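
To follow along, the table above can be recreated locally. The sketch below is one way to do it, assuming Python's built-in sqlite3 module and an in-memory database. The article does not name a database engine, and the column types are guesses, since only the values are shown.

```python
import sqlite3

# In-memory SQLite database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Column types are assumptions; the article only shows the values.
conn.execute("""
    CREATE TABLE employee_info (
        employee_id INTEGER,
        first_name  TEXT,
        last_name   TEXT,
        department  TEXT,
        age         INTEGER
    )
""")

# Python's None is stored as SQL NULL.
rows = [
    (127374, "Joshua", "Rees", "Sales", 26),
    (293421, "Karen", "Bower", "Sales", 45),
    (244813, "Luke", "Payne", None, 42),
    (463281, "Max", "Paige", "Human Resources", 53),
    (374241, "Julia", "Glover", "Human Resources", 29),
    (568421, "Rose", "Harris", "Human Resources", 43),
    (None, None, None, None, None),
    (839211, "Olivia", "Baker", "Sales", 32),
    (293493, "Karen", "Bower", "Sales", 45),
    (463281, "Max", "Paige", "Human Resources", 53),
    (244813, "Luke", "Payne", None, 42),
    (933821, "Bernadette", "Parsons", "Sales", None),
]
conn.executemany("INSERT INTO employee_info VALUES (?, ?, ?, ?, ?)", rows)

row_count = conn.execute("SELECT COUNT(*) FROM employee_info").fetchone()[0]
print(row_count)  # 12 rows, duplicates and NULLs included
```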

SQL DISTINCT on a Single Column

Let’s see the effect of the DISTINCT keyword when we try to SELECT the employee_id column. First, here is the query before applying the DISTINCT keyword:

SELECT employee_id
FROM employee_info

This results in the following:

employee_id
127374
293421
244813
463281
374241
568421
NULL
839211
293493
463281
244813
933821

Executing this query returns every value in the employee_id column, one per row of the employee_info table, duplicates included. On closer inspection, we can see some duplicate values: 244813 and 463281 each appear twice.

Let’s add the DISTINCT keyword to address these duplicate values. Remember, we add the DISTINCT keyword before we specify the columns:

SELECT DISTINCT employee_id 
FROM employee_info

employee_id
127374
293421
244813
463281
374241
568421
NULL
839211
293493
933821

Our result set is now shorter: 10 rows instead of 12. There are several things we can learn from this result.

SQL DISTINCT Removes Duplicate Values

If we contrast the original table to our result, we can see that the duplicates were successfully filtered out. Using the DISTINCT keyword is just one method for removing duplicate values, as shown by this article on the different ways to remove duplicates in SQL.

The employee_info table contained two occurrences of the following values:

employee_id
244813
463281

However, our result now only shows these values once. The DISTINCT keyword has had the desired effect.
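
The comparison above can be checked with a small script. This is a sketch that assumes Python's sqlite3 module and a one-column version of employee_info; the GROUP BY ... HAVING query is an extra illustration, not part of the article, and is used only to list the values that are stored more than once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee_info (employee_id INTEGER)")

# employee_id values in the order they appear in the article's table.
ids = [127374, 293421, 244813, 463281, 374241, 568421,
       None, 839211, 293493, 463281, 244813, 933821]
conn.executemany("INSERT INTO employee_info VALUES (?)", [(i,) for i in ids])

all_ids = [r[0] for r in conn.execute(
    "SELECT employee_id FROM employee_info")]
distinct_ids = [r[0] for r in conn.execute(
    "SELECT DISTINCT employee_id FROM employee_info")]

# Values stored more than once, i.e. the ones DISTINCT collapses.
repeated = [r[0] for r in conn.execute("""
    SELECT employee_id
    FROM employee_info
    GROUP BY employee_id
    HAVING COUNT(*) > 1
    ORDER BY employee_id
""")]

print(len(all_ids), len(distinct_ids), repeated)
# 12 10 [244813, 463281]
```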

Pay Attention to the Columns Used

When you first view the employee_info table, the following two rows appear to be duplicates:

employee_id	first_name	last_name	department	age
293421	Karen	Bower	Sales	45
293493	Karen	Bower	Sales	45

On closer inspection, you can see that the value of employee_id is different. Since we applied the DISTINCT keyword to this employee_id column, both of these values are returned.

The Inclusion of NULL Values

You’ve probably noticed the NULL value in the result set. For the purposes of DISTINCT, SQL treats NULL like any other value and treats all NULLs as duplicates of one another, so a single NULL is returned. If you would like to remove NULL values from the result, you could write the following query:

SELECT DISTINCT
  employee_id
FROM employee_info
WHERE employee_id IS NOT NULL
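
A short sketch of how DISTINCT handles NULL, assuming Python's sqlite3 module. The data here is a hypothetical four-row subset with a second NULL added (the article's table has only one NULL employee_id), so that the collapsing of NULLs is visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee_info (employee_id INTEGER)")

# Hypothetical subset: two real IDs and two NULLs (None becomes NULL).
conn.executemany("INSERT INTO employee_info VALUES (?)",
                 [(127374,), (None,), (244813,), (None,)])

# DISTINCT keeps a single NULL, however many NULLs are stored.
distinct_ids = [r[0] for r in conn.execute(
    "SELECT DISTINCT employee_id FROM employee_info")]

# Filtering with IS NOT NULL removes it from the result entirely.
non_null_ids = [r[0] for r in conn.execute("""
    SELECT DISTINCT employee_id
    FROM employee_info
    WHERE employee_id IS NOT NULL
    ORDER BY employee_id
""")]

print(distinct_ids.count(None), non_null_ids)
# 1 [127374, 244813]
```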

SQL DISTINCT with a Subquery

You can also use the DISTINCT keyword inside a subquery. However, an IN, ANY, or EXISTS condition is satisfied as soon as the subquery returns at least one matching row, so duplicate rows in the subquery cannot change the outcome. This makes DISTINCT unnecessary in an IN, ANY, or EXISTS subquery.
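
The point about IN subqueries can be illustrated with a sketch, again assuming Python's sqlite3 module. The training_log table is hypothetical and exists only to give the subquery some duplicate rows; the same outer query is run with and without DISTINCT in the subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee_info (employee_id INTEGER, first_name TEXT)")
conn.executemany("INSERT INTO employee_info VALUES (?, ?)",
                 [(127374, "Joshua"), (293421, "Karen"), (839211, "Olivia")])

# Hypothetical table: one row per training session attended,
# so an employee_id can appear several times.
conn.execute("CREATE TABLE training_log (employee_id INTEGER)")
conn.executemany("INSERT INTO training_log VALUES (?)",
                 [(127374,), (127374,), (839211,)])

query = """
    SELECT first_name
    FROM employee_info
    WHERE employee_id IN (SELECT {distinct} employee_id FROM training_log)
    ORDER BY first_name
"""
with_distinct = conn.execute(query.format(distinct="DISTINCT")).fetchall()
without_distinct = conn.execute(query.format(distinct="")).fetchall()

print(with_distinct == without_distinct)  # True
```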

SQL DISTINCT with Multiple Columns

The columns you specify as DISTINCT play a key role in what data gets returned. Look at the following query:

SELECT DISTINCT
  employee_id,
  first_name,
  last_name 
FROM employee_info

We are looking for unique combinations of these three columns. Executing this query yields the following result:

employee_id	first_name	last_name
127374	Joshua	Rees
293421	Karen	Bower
244813	Luke	Payne
463281	Max	Paige
374241	Julia	Glover
568421	Rose	Harris
NULL	NULL	NULL
839211	Olivia	Baker
293493	Karen	Bower
933821	Bernadette	Parsons

When scanning this result set, you may be expecting to see completely unique names along with their employee_id. For example, look at this data:

employee_id	first_name	last_name
293421	Karen	Bower
293493	Karen	Bower

The employee_id value is unique for both rows, meaning these rows are also considered unique. If we changed the query to the following:

SELECT DISTINCT
  first_name,
  last_name 
FROM employee_info

... employee_id is no longer considered; now we’re looking for unique combinations of names. Thus, only one row for “Karen Bower” is shown:

first_name	last_name
Joshua	Rees
Karen	Bower
Luke	Payne
Max	Paige
Julia	Glover
Rose	Harris
NULL	NULL
Olivia	Baker
Bernadette	Parsons

If you are struggling to achieve your desired results when using the DISTINCT keyword, pay special attention to the columns you have included. If this specific example is still causing you confusion, check out this article on how to count distinct values in SQL.
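
To see how the column list changes the result, the two queries above can be run side by side. This is a sketch assuming Python's sqlite3 module and a three-column version of employee_info; only the row counts are compared, since DISTINCT does not guarantee any particular row order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employee_info (
        employee_id INTEGER, first_name TEXT, last_name TEXT
    )
""")
people = [
    (127374, "Joshua", "Rees"), (293421, "Karen", "Bower"),
    (244813, "Luke", "Payne"), (463281, "Max", "Paige"),
    (374241, "Julia", "Glover"), (568421, "Rose", "Harris"),
    (None, None, None), (839211, "Olivia", "Baker"),
    (293493, "Karen", "Bower"), (463281, "Max", "Paige"),
    (244813, "Luke", "Payne"), (933821, "Bernadette", "Parsons"),
]
conn.executemany("INSERT INTO employee_info VALUES (?, ?, ?)", people)

# Uniqueness is judged on (employee_id, first_name, last_name).
three_cols = conn.execute("""
    SELECT DISTINCT employee_id, first_name, last_name
    FROM employee_info
""").fetchall()

# Uniqueness is judged on (first_name, last_name) only.
two_cols = conn.execute("""
    SELECT DISTINCT first_name, last_name
    FROM employee_info
""").fetchall()

print(len(three_cols), len(two_cols))  # 10 9
```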

SQL DISTINCT with Aggregate Functions

The DISTINCT keyword can be used along with aggregate functions. The most common aggregate functions are:

MIN() - Returns the smallest (minimal) value.
MAX() - Returns the largest (maximal) value.
SUM() - Returns the sum of values specified.
AVG() - Returns the average of the specified values (e.g. in a column).
COUNT() - Returns the number of items (e.g. in a column).

When DISTINCT is used with SUM(), AVG(), and COUNT(), it eliminates duplicate values before the sum, average, or count is calculated. While you can use the DISTINCT keyword with MIN() and MAX(), it is pointless to do so; it will have no impact on the results.
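
The effect on each aggregate can be checked against the age column of the example table. The sketch below assumes Python's sqlite3 module and a one-column table holding only the ages; it is an illustration of the rule above, not a query from the article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee_info (age INTEGER)")

# Ages in table order; None stands for the two NULL ages.
ages = [26, 45, 42, 53, 29, 43, None, 32, 45, 53, 42, None]
conn.executemany("INSERT INTO employee_info VALUES (?)", [(a,) for a in ages])

(sum_all, sum_distinct,
 count_all, count_distinct,
 avg_all, avg_distinct,
 min_all, min_distinct,
 max_all, max_distinct) = conn.execute("""
    SELECT SUM(age),   SUM(DISTINCT age),
           COUNT(age), COUNT(DISTINCT age),
           AVG(age),   AVG(DISTINCT age),
           MIN(age),   MIN(DISTINCT age),
           MAX(age),   MAX(DISTINCT age)
    FROM employee_info
""").fetchone()

# SUM, COUNT and AVG change once repeated ages are removed.
print(sum_all, sum_distinct)      # 410 270
print(count_all, count_distinct)  # 10 7
print(avg_all != avg_distinct)    # True

# MIN and MAX are the same with or without DISTINCT.
print(min_all, min_distinct, max_all, max_distinct)  # 26 26 53 53
```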

Let’s apply this information to our previous example involving the employee_info table. If we wanted to count the number of employees currently working at our company while taking care to remove duplicate entries, we’d write the following query:

SELECT COUNT(DISTINCT employee_id)
FROM employee_info

The result of this query is:

COUNT(DISTINCT employee_id)
9

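This is the simplest example of how the DISTINCT keyword can be used in conjunction with aggregate functions. Note that COUNT() ignores NULL values in the column it is given, which is why the NULL employee_id does not add to the count. Executing this query without the DISTINCT keyword would include duplicate values in the count: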

SELECT COUNT(employee_id) 
FROM employee_info

The result of this query is:

COUNT(employee_id)
11

If you think your table may contain duplicate records, using DISTINCT is an excellent way to ensure your reporting is accurate and remains unaffected by duplicate values. If this use case isn’t clear to you, I recommend reading this article on counting distinct values in SQL. There are many other use cases for COUNT(); for additional useful examples, check out this article explaining the differences between common applications of the COUNT() function.
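
The three common forms of COUNT() can be compared in one query. This is a sketch assuming Python's sqlite3 module and a one-column version of employee_info; COUNT(*) is included for contrast and does not appear in the queries above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee_info (employee_id INTEGER)")
ids = [127374, 293421, 244813, 463281, 374241, 568421,
       None, 839211, 293493, 463281, 244813, 933821]
conn.executemany("INSERT INTO employee_info VALUES (?)", [(i,) for i in ids])

total_rows, non_null_ids, unique_ids = conn.execute("""
    SELECT COUNT(*),                    -- every row, NULLs included
           COUNT(employee_id),          -- rows with a non-NULL employee_id
           COUNT(DISTINCT employee_id)  -- unique non-NULL employee_id values
    FROM employee_info
""").fetchone()

print(total_rows, non_null_ids, unique_ids)  # 12 11 9
```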

DISTINCT with Aggregate Functions and GROUP BY

Let’s use an aggregate function with the DISTINCT keyword to help us solve a common problem. We want to count the distinct employee_id values recorded for each first and last name pair, so that we can isolate cases where the same name appears with more than one employee_id. Here’s the query:

SELECT
  first_name,
  last_name,
  COUNT(DISTINCT employee_id) 
FROM employee_info 
GROUP BY first_name, last_name

And the result:

first_name	last_name	COUNT(DISTINCT employee_id)
NULL	NULL	0
Bernadette	Parsons	1
Joshua	Rees	1
Julia	Glover	1
Karen	Bower	2
Luke	Payne	1
Max	Paige	1
Olivia	Baker	1
Rose	Harris	1

We can see from these results that we have two different entries for Karen Bower’s employee_id. This is an excellent piece of information to have. If there are not two different individuals named Karen Bower working at our company, we know that this information was entered in error and can be considered a duplicate to be removed.
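
To return only the suspicious names instead of scanning the full list, one option is to add a HAVING clause to the grouped query. The sketch below assumes Python's sqlite3 module; the HAVING filter and the id_count alias are additions for illustration and are not part of the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employee_info (
        employee_id INTEGER, first_name TEXT, last_name TEXT
    )
""")
people = [
    (127374, "Joshua", "Rees"), (293421, "Karen", "Bower"),
    (244813, "Luke", "Payne"), (463281, "Max", "Paige"),
    (374241, "Julia", "Glover"), (568421, "Rose", "Harris"),
    (None, None, None), (839211, "Olivia", "Baker"),
    (293493, "Karen", "Bower"), (463281, "Max", "Paige"),
    (244813, "Luke", "Payne"), (933821, "Bernadette", "Parsons"),
]
conn.executemany("INSERT INTO employee_info VALUES (?, ?, ?)", people)

# Keep only name pairs that map to more than one distinct employee_id.
suspects = conn.execute("""
    SELECT first_name,
           last_name,
           COUNT(DISTINCT employee_id) AS id_count
    FROM employee_info
    GROUP BY first_name, last_name
    HAVING COUNT(DISTINCT employee_id) > 1
""").fetchall()

print(suspects)  # [('Karen', 'Bower', 2)]
```

Max Paige and Luke Payne each appear twice in the table, but always with the same employee_id, so they are not flagged.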

Learning about DISTINCT in SQL Is Just the Start!

The ability to isolate and exclude duplicate data from your query results is an important skill for all SQL users. The DISTINCT keyword plays an important role in your ability to do this effectively. If you enjoyed this article and want to continue learning about SQL, consider our SQL Fundamentals learning track. It will teach you basic SQL clauses like WHERE, GROUP BY, ORDER BY, and HAVING. You’ll also learn how to JOIN tables and add, modify, or remove data from a database.
