23rd Apr 2021
What Is the Role of DISTINCT in SQL?
Andrew Bone
Data Engineering

Table of Contents
  SQL DISTINCT on a single column
  SQL DISTINCT Removes Duplicate Values
  Pay Attention to the Columns Used
  The Inclusion of NULL values
  SQL DISTINCT with a Subquery
  SQL DISTINCT with Multiple Columns
  SQL DISTINCT with Aggregate Functions
  DISTINCT with Aggregate Functions and GROUP BY
  Learning about DISTINCT in SQL Is Just the Start!

The DISTINCT keyword in SQL removes duplicate values from your results. This article provides examples showing exactly how DISTINCT works.

By default, a SQL query returns every matching row, including duplicate rows. Placing the DISTINCT keyword in the SELECT clause eliminates those duplicates, so each value (or combination of values) appears only once in the result set.

The basic syntax of the DISTINCT keyword in SQL is:

    SELECT DISTINCT column_1, column_2, …, column_n
    FROM table_name

(If you find the above syntax confusing, consider LearnSQL.com’s SQL Basics course. It covers the fundamentals of SQL, including the DISTINCT keyword.)

Note that the DISTINCT keyword must be placed before the columns you specify. If you list more than one column, DISTINCT applies to the combination of all the listed columns, and SQL returns only the unique combinations of their values. In other words, SQL DISTINCT looks for distinct rows, not distinct values within a single column.

Let’s see the DISTINCT keyword in action with a practical example. First, though, we should familiarize ourselves with the data we’ll be working with.
Imagine we have a table called employee_info, which contains details about the employees at our company:

    employee_id  first_name  last_name  department       age
    127374       Joshua      Rees       Sales            26
    293421       Karen       Bower      Sales            45
    244813       Luke        Payne      NULL             42
    463281       Max         Paige      Human Resources  53
    374241       Julia       Glover     Human Resources  29
    568421       Rose        Harris     Human Resources  43
    NULL         NULL        NULL       NULL             NULL
    839211       Olivia      Baker      Sales            32
    293493       Karen       Bower      Sales            45
    463281       Max         Paige      Human Resources  53
    244813       Luke        Payne      NULL             42
    933821       Bernadette  Parsons    Sales            NULL

SQL DISTINCT on a single column

Let’s see the effect of the DISTINCT keyword when we SELECT the employee_id column. First, here is the query before applying the DISTINCT keyword:

    SELECT employee_id
    FROM employee_info

This results in the following:

    employee_id
    127374
    293421
    244813
    463281
    374241
    568421
    NULL
    839211
    293493
    463281
    244813
    933821

Executing this query returns the entire employee_id column, exactly as it is stored in the employee_info table. On closer analysis, we can see some duplicate values: 244813 and 463281 each appear twice.

Let’s add the DISTINCT keyword to address these duplicate values. Remember, we add the DISTINCT keyword before we specify the columns:

    SELECT DISTINCT employee_id
    FROM employee_info

    employee_id
    127374
    293421
    244813
    463281
    374241
    568421
    NULL
    839211
    293493
    933821

The result set is shorter: 10 rows instead of 12. There are several things we can learn from this response.

SQL DISTINCT Removes Duplicate Values

If we compare the original table with our result, we can see that the duplicates were successfully filtered out. Using the DISTINCT keyword is just one method for removing duplicate values, as shown by this article on the different ways to remove duplicates in SQL.

The employee_info table contained two occurrences of each of the following values:

    employee_id
    244813
    463281

However, our result now shows each of these values only once. The DISTINCT keyword has had the desired effect.
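To try the single-column comparison yourself, here is a minimal sketch that loads the sample rows into an in-memory SQLite database through Python’s built-in sqlite3 module. The choice of SQLite and Python, the column types, and the variable names are assumptions made for illustration; the article itself does not name a database engine.

```python
import sqlite3

# Sample rows copied from the employee_info table above (None stands for NULL).
ROWS = [
    (127374, "Joshua", "Rees", "Sales", 26),
    (293421, "Karen", "Bower", "Sales", 45),
    (244813, "Luke", "Payne", None, 42),
    (463281, "Max", "Paige", "Human Resources", 53),
    (374241, "Julia", "Glover", "Human Resources", 29),
    (568421, "Rose", "Harris", "Human Resources", 43),
    (None, None, None, None, None),
    (839211, "Olivia", "Baker", "Sales", 32),
    (293493, "Karen", "Bower", "Sales", 45),
    (463281, "Max", "Paige", "Human Resources", 53),
    (244813, "Luke", "Payne", None, 42),
    (933821, "Bernadette", "Parsons", "Sales", None),
]

# Column types are an assumption; the article only shows the values.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee_info ("
    "employee_id INTEGER, first_name TEXT, last_name TEXT, "
    "department TEXT, age INTEGER)"
)
conn.executemany("INSERT INTO employee_info VALUES (?, ?, ?, ?, ?)", ROWS)

# Without DISTINCT: one value per stored row, duplicates included.
all_ids = [row[0] for row in conn.execute("SELECT employee_id FROM employee_info")]

# With DISTINCT: each value, including NULL, is returned once.
distinct_ids = [
    row[0] for row in conn.execute("SELECT DISTINCT employee_id FROM employee_info")
]

print(len(all_ids), "rows without DISTINCT")
print(len(distinct_ids), "rows with DISTINCT")
```

The two printed counts should match the 12-row and 10-row listings shown earlier. The order in which a database returns DISTINCT rows is not guaranteed unless you add ORDER BY.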

Pay Attention to the Columns Used

When you first view the employee_info table, it appears that these rows are duplicates:

    employee_id  first_name  last_name  department  age
    293421       Karen       Bower      Sales       45
    293493       Karen       Bower      Sales       45

On closer inspection, you can see that the employee_id values are different. Since we applied the DISTINCT keyword to the employee_id column only, both of these values are returned.

The Inclusion of NULL values

You’ve probably noticed the presence of a NULL value in the returned dataset. DISTINCT treats NULL as a value in its own right: NULLs are not discarded, and if a column contains several NULLs, they collapse into a single NULL in the result. If you would like to remove NULL values from the result, you could write the following query, which returns only the distinct rows whose employee_id is not NULL:

    SELECT DISTINCT *
    FROM employee_info
    WHERE employee_id IS NOT NULL

SQL DISTINCT with a Subquery

You can also use the DISTINCT keyword as part of a subquery. However, it is worth mentioning that an IN, ANY, or EXISTS condition is satisfied as soon as the subquery yields at least one matching row, so duplicate rows in the subquery’s output cannot change the outcome. This makes using DISTINCT inside an IN, ANY, or EXISTS subquery unnecessary.

SQL DISTINCT with Multiple Columns

The columns you specify as DISTINCT play a key role in what data gets returned. Look at the following query:

    SELECT DISTINCT employee_id, first_name, last_name
    FROM employee_info

We are looking for unique combinations of these three columns. Executing this query yields the following result:

    employee_id  first_name  last_name
    127374       Joshua      Rees
    293421       Karen       Bower
    244813       Luke        Payne
    463281       Max         Paige
    374241       Julia       Glover
    568421       Rose        Harris
    NULL         NULL        NULL
    839211       Olivia      Baker
    293493       Karen       Bower
    933821       Bernadette  Parsons

When scanning this result set, you might expect each name to appear only once alongside its employee_id. For example, look at this data:

    employee_id  first_name  last_name
    293421       Karen       Bower
    293493       Karen       Bower

The employee_id value differs between the two rows, so the combinations are different and both rows are considered unique.
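The NULL filter and the three-column query above can be checked with a short sketch. It again assumes an in-memory SQLite database accessed through Python’s sqlite3 module, with illustrative column types and variable names that are not part of the original article.

```python
import sqlite3

# Sample rows from the employee_info table (None stands for NULL).
ROWS = [
    (127374, "Joshua", "Rees", "Sales", 26),
    (293421, "Karen", "Bower", "Sales", 45),
    (244813, "Luke", "Payne", None, 42),
    (463281, "Max", "Paige", "Human Resources", 53),
    (374241, "Julia", "Glover", "Human Resources", 29),
    (568421, "Rose", "Harris", "Human Resources", 43),
    (None, None, None, None, None),
    (839211, "Olivia", "Baker", "Sales", 32),
    (293493, "Karen", "Bower", "Sales", 45),
    (463281, "Max", "Paige", "Human Resources", 53),
    (244813, "Luke", "Payne", None, 42),
    (933821, "Bernadette", "Parsons", "Sales", None),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee_info ("
    "employee_id INTEGER, first_name TEXT, last_name TEXT, "
    "department TEXT, age INTEGER)"
)
conn.executemany("INSERT INTO employee_info VALUES (?, ?, ?, ?, ?)", ROWS)

# DISTINCT over three columns compares whole (id, first, last) combinations.
three_cols = conn.execute(
    "SELECT DISTINCT employee_id, first_name, last_name FROM employee_info"
).fetchall()

# Both Karen Bower rows survive because their employee_id values differ.
karen_rows = [row for row in three_cols if row[1:] == ("Karen", "Bower")]

# The NULL filter from the article: distinct full rows with a non-NULL id.
non_null_rows = conn.execute(
    "SELECT DISTINCT * FROM employee_info WHERE employee_id IS NOT NULL"
).fetchall()

print(len(three_cols), "distinct (employee_id, first_name, last_name) rows")
print(karen_rows)
print(len(non_null_rows), "distinct rows with a non-NULL employee_id")
```

Note that the duplicate Luke Payne rows still collapse into one even though their department is NULL, because DISTINCT treats two NULLs as equal when comparing rows.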
If we changed the query to select only the name columns:

    SELECT DISTINCT first_name, last_name
    FROM employee_info

... employee_id is no longer considered; now we’re looking for unique combinations of names. Thus, only one row for “Karen Bower” is shown:

    first_name  last_name
    Joshua      Rees
    Karen       Bower
    Luke        Payne
    Max         Paige
    Julia       Glover
    Rose        Harris
    NULL        NULL
    Olivia      Baker
    Bernadette  Parsons

If you are struggling to achieve your desired results when using the DISTINCT keyword, pay special attention to the columns you have included.

If this specific example is still causing you confusion, check out this article on how to count distinct values in SQL.

SQL DISTINCT with Aggregate Functions

The DISTINCT keyword can be used alongside aggregate functions. The most common aggregate functions are:

    MIN() - Returns the smallest (minimal) value.
    MAX() - Returns the largest (maximal) value.
    SUM() - Returns the sum of the specified values.
    AVG() - Returns the average of the specified values (e.g. in a column).
    COUNT() - Returns the number of items (e.g. in a column).

When DISTINCT is used with SUM(), AVG(), or COUNT(), it eliminates duplicate values before the sum, average, or count is calculated. While you can use the DISTINCT keyword with MIN() and MAX(), it is pointless to do so: the smallest and largest values are the same whether or not duplicates are present.

Let’s apply this information to our previous example involving the employee_info table. If we wanted to count the number of employees currently working at our company while taking care to remove duplicate entries, we’d write the following query:

    SELECT COUNT(DISTINCT employee_id)
    FROM employee_info

The result of this query is:

    COUNT(DISTINCT employee_id)
    9

Note that COUNT() ignores NULL values, so the row with a NULL employee_id is not counted. This is the simplest example of how the DISTINCT keyword can be used in conjunction with aggregate functions.
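As a rough check of the aggregate behavior described above, the sketch below runs the COUNT(DISTINCT ...) query and compares SUM, MIN, and MAX with and without DISTINCT on the age column. SQLite via Python’s sqlite3 module, the column types, and the variable names are assumptions for illustration, and the age comparisons are an added example rather than queries from the article.

```python
import sqlite3

# Sample rows from the employee_info table (None stands for NULL).
ROWS = [
    (127374, "Joshua", "Rees", "Sales", 26),
    (293421, "Karen", "Bower", "Sales", 45),
    (244813, "Luke", "Payne", None, 42),
    (463281, "Max", "Paige", "Human Resources", 53),
    (374241, "Julia", "Glover", "Human Resources", 29),
    (568421, "Rose", "Harris", "Human Resources", 43),
    (None, None, None, None, None),
    (839211, "Olivia", "Baker", "Sales", 32),
    (293493, "Karen", "Bower", "Sales", 45),
    (463281, "Max", "Paige", "Human Resources", 53),
    (244813, "Luke", "Payne", None, 42),
    (933821, "Bernadette", "Parsons", "Sales", None),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee_info ("
    "employee_id INTEGER, first_name TEXT, last_name TEXT, "
    "department TEXT, age INTEGER)"
)
conn.executemany("INSERT INTO employee_info VALUES (?, ?, ?, ?, ?)", ROWS)


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


# The article's query: duplicates are removed before counting, NULLs are skipped.
distinct_id_count = scalar("SELECT COUNT(DISTINCT employee_id) FROM employee_info")

# SUM changes with DISTINCT because repeated ages are added only once.
sum_all = scalar("SELECT SUM(age) FROM employee_info")
sum_distinct = scalar("SELECT SUM(DISTINCT age) FROM employee_info")

# MIN and MAX are unaffected by DISTINCT.
min_same = scalar("SELECT MIN(age) = MIN(DISTINCT age) FROM employee_info")
max_same = scalar("SELECT MAX(age) = MAX(DISTINCT age) FROM employee_info")

print(distinct_id_count, sum_all, sum_distinct, min_same, max_same)
```

One caution the SUM comparison makes visible: DISTINCT inside an aggregate removes repeated values, not repeated people. The two Karen Bower rows have different employee_id values but the same age, so SUM(DISTINCT age) counts that age only once.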
Running the same count without the DISTINCT keyword includes duplicate values:

    SELECT COUNT(employee_id)
    FROM employee_info

The result of this query is:

    COUNT(employee_id)
    11

If you think your table may contain duplicate records, using DISTINCT is an excellent way to keep your reporting accurate and unaffected by duplicate values. If this use case isn’t clear to you, I recommend reading this article on counting distinct values in SQL.

There are many other use cases for COUNT(); for additional useful examples, check out this article explaining the differences between common applications of the COUNT() function.

DISTINCT with Aggregate Functions and GROUP BY

Let’s use an aggregate function with the DISTINCT keyword to help us solve a common problem. We want to find the number of distinct employee_id values recorded for each name pairing in our table. This isolates the case where the employee’s name is exactly the same but the employee_id is different. Here’s the query:

    SELECT first_name, last_name, COUNT(DISTINCT employee_id)
    FROM employee_info
    GROUP BY first_name, last_name

And the result:

    first_name  last_name  COUNT(DISTINCT employee_id)
    NULL        NULL       0
    Bernadette  Parsons    1
    Joshua      Rees       1
    Julia       Glover     1
    Karen       Bower      2
    Luke        Payne      1
    Max         Paige      1
    Olivia      Baker      1
    Rose        Harris     1

We can see from these results that we have two different employee_id entries for Karen Bower. This is an excellent piece of information to have. If there are not two different individuals named Karen Bower working at our company, we know that one of these entries was made in error and can be considered a duplicate to be removed.

Learning about DISTINCT in SQL Is Just the Start!

The ability to isolate and exclude duplicate data from your query results is an important skill for all SQL users. The DISTINCT keyword plays an important role in your ability to do this effectively.
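As a recap, the duplicate-name check from the GROUP BY section can be reproduced with the sketch below. It assumes an in-memory SQLite database accessed through Python’s sqlite3 module; the column types, the variable names, and the final filtering step in Python are illustrative additions, not part of the original article.

```python
import sqlite3

# Sample rows from the employee_info table (None stands for NULL).
ROWS = [
    (127374, "Joshua", "Rees", "Sales", 26),
    (293421, "Karen", "Bower", "Sales", 45),
    (244813, "Luke", "Payne", None, 42),
    (463281, "Max", "Paige", "Human Resources", 53),
    (374241, "Julia", "Glover", "Human Resources", 29),
    (568421, "Rose", "Harris", "Human Resources", 43),
    (None, None, None, None, None),
    (839211, "Olivia", "Baker", "Sales", 32),
    (293493, "Karen", "Bower", "Sales", 45),
    (463281, "Max", "Paige", "Human Resources", 53),
    (244813, "Luke", "Payne", None, 42),
    (933821, "Bernadette", "Parsons", "Sales", None),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee_info ("
    "employee_id INTEGER, first_name TEXT, last_name TEXT, "
    "department TEXT, age INTEGER)"
)
conn.executemany("INSERT INTO employee_info VALUES (?, ?, ?, ?, ?)", ROWS)

# The article's query: distinct employee_id values per (first_name, last_name).
query = (
    "SELECT first_name, last_name, COUNT(DISTINCT employee_id) "
    "FROM employee_info "
    "GROUP BY first_name, last_name"
)
counts = {(first, last): n for first, last, n in conn.execute(query)}

# Names recorded under more than one employee_id are candidates for review.
suspects = [name for name, n in counts.items() if n > 1]

print(counts)
print(suspects)
```

Only the Karen Bower pairing should appear in suspects. The exactly repeated Max Paige and Luke Payne rows share a single employee_id each, so they count as 1 here and need a different check, such as comparing COUNT(*) with COUNT(DISTINCT employee_id).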

If you enjoyed this article and want to continue learning about SQL, consider our SQL Fundamentals learning track. It will teach you basic SQL clauses like WHERE, GROUP BY, ORDER BY, and HAVING. You’ll also learn how to JOIN tables and add, modify, or remove data from a database.