24th Oct 2024 14 minutes read NULLs and Handling Missing Data in SQL Agnieszka Kozubek-Krycuń NULL SQL Operators Table of Contents What Is NULL in SQL? The Dataset Comparison Operators with NULL Three-Valued Logic in SQL Using NULL in SQL Functions Functions That Work with NULLs COALESCE NULLIF NULL in GROUP BY and Aggregate Functions NULL and GROUP BY NULL and Aggregate Functions NULL and JOIN NULL in ORDER BY Handle Missing Data with NULL in SQL! Handling missing data (i.e. NULLs) in SQL can be challenging. NULLs can pose a lot of traps, especially if you don’t understand how they work. In this article, we’ll talk about dealing with NULL in SQL. We’ll also explain how to avoid common mistakes when working with NULLs. Having missing data in your database is an unavoidable reality of life. There are many reasons why you may have missing data in your database: the complete data isn’t yet available, users provide incomplete information, changes to the database schema, database malfunctions, human error, and more. Working with missing data in SQL queries is challenging. In this article I’ll show the most common pitfalls associated with incomplete data in SQL. If you want to practice working with missing or unknown data in SQL, I recommend our SQL Practice track. At the time of writing, it contains 10 courses for you to practice SQL – and we keep adding more! The courses are divided into sections and many of them have a section dedicated to NULL. I specifically recommend these courses to practice handling NULL in SQL: Basic SQL Practice: A Store SQL Practice: University Basic SQL Practice: Run Track Through Queries! Basic SQL Practice: Blog & Traffic Data What Is NULL in SQL? In SQL, NULL represents a missing or undefined value in a database. It is used to denote the fact that the value in a field is absent or unknown. A common beginner problem with NULLs is that your database will often not show NULLs explicitly. In the table below, the like field for post ID 1, the location field for post ID 2, and the views field for post ID 3 are all NULL. However, a database does not show NULL explicitly; instead, it shows an empty field. You have to be aware that NULL is a possibility to guess that these fields are NULL. idtitlelocationviewslikes 1Quick Morning Routines!London94,365 2Eco-Friendly Living Tips123,8916,587 3Healthy Snacks on the GoParis9,457 It’s important to note, however, that NULL is different from an empty string or from a zero. NULL is the absence of value; it means that the value is unknown. As we shall soon see, NULL itself is actually not a real value. Many people say or write something like “There are NULL values in this field”, but that is technically incorrect. When designing a table in a database, a database designer may decide that NULL is not allowed for a particular field. This is done by setting a NOT NULL constraint for a column when the table is created. Primary keys (columns that identify rows in a table) are also NOT NULL by default. It is considered a best practice to avoid NULLs in your database design. So, as many columns as possible should be defined as NOT NULL. It’s best to only allow NULLs when strictly necessary. However, data can be messy and sometimes NULL is unavoidable. The Dataset In this article, we’ll be using data from table posts. Imagine you’re scraping data from your favorite social media platform to do some analysis on it. The data from that platform is stored in this table. Here are the fields: id – The ID of the post. title – The title of the post. url – The URL (web address) of the post. creator – The name of the post’s creator. published – The date the post was published. type – The post’s type. location –Where the post is published; this can be NULL if the location is unknown or irrelevant. views – How many views each post has; this can be NULL if the creator chooses not to make this data publicly visible. likes – The number of likes given to the post; this can be NULL if the creator chooses not to make this data publicly visible. dislikes – The number of dislikes given to the post; this can be NULL because the platform does not show this data anymore. However, we may have the dislikes data for older posts. Now that we’ve reviewed the data, let’s use it to understand NULL. Comparison Operators with NULL Even something as seemingly simple as how NULL behaves when used with comparison operators can be counterintuitive and surprising for beginners. Suppose we want to find out how many rows in the table posts have missing views fields. For example: SELECT COUNT(*) FROM posts WHERE views = NULL; Result: 0 rows Great, zero rows have missing views. Amazing! How many of them are non-missing, then? SELECT COUNT(*) FROM posts WHERE views <> NULL; Result: 0 rows Also zero? Something must be wrong. The issue here is that you should use the IS NULL and IS NOT NULL operators to test for NULLs: SELECT COUNT(*) FROM posts WHERE views IS NULL; Result: 34 rows SELECT COUNT(*) FROM posts WHERE views IS NOT NULL; Result: 66 rows Why are these results so different from the previous two queries? Three-Valued Logic in SQL The comparison problems for NULL come from the fact that NULL is not a real value. This is the most important point you have to understand to work with NULL efficiently. SQL uses three-valued logic. Each logical condition in SQL can have one of three values: TRUE, FALSE or NULL. NULL here means “I don’t know”. Whenever you use a condition in WHERE, SQL returns rows for which the logical condition in WHERE is TRUE. It does not return the rows for which the condition is FALSE (as you’d expect) and for which the condition is NULL (which is not always what you’d expect). Let’s consider an example of how WHERE works: SELECT COUNT(*) FROM posts WHERE views < 100; This query counts posts where the views column has a value and that value is less than 100. Posts with unknown views are not counted. This may be counterintuitive: when you display the posts data, you see the emptiness in many views fields. You intuitively think that surely this emptiness is less than 100. But the emptiness means that the database doesn’t have the data; since it doesn’t have the data, it can’t say if it’s less than 100 or not. So the rows that have an empty views field are not counted. SELECT COUNT(*) FROM posts WHERE views < likes; This query returns rows where the number of views and the number of likes are both known and the views value is less than the likes value. It does not return rows where the views value is unknown or the likes value is unknown. If you want to include the posts with empty views fields in the result, you have to explicitly filter for NULL: SELECT COUNT(*) FROM posts WHERE views < likes OR views IS NULL; Let’s go back again to the queries we started with: SELECT COUNT(*) FROM posts WHERE views = NULL; SELECT COUNT(*) FROM posts WHERE views <> NULL; The WHERE condition compares the views column with NULL. However, NULL means “I don’t know the value”. The database cannot say if views is equal (or not) to an unknown value. It may or may not be, so the database says NULL – i.e. “I don’t know” – and those rows are not returned in the result. Remember: Test for NULL with IS NULL and IS NOT NULL Comparison operators (like <, <=, >, >=, =, <>, and LIKE) return NULL if one of the arguments is NULL. If you want to include NULL, test for it explicitly with IS NULL or IS NOT NULL. Using NULL in SQL Functions NULL is equally troublesome in operators and functions. Most functions and operators return NULL when given NULL as an argument. Example 1: Imagine we want to return the header for each post. (The header consists of the title, a dash, and the location). Here’s the query: SELECT title || ‘ - ‘ || location FROM posts; Result: Quick Morning Routines! - London Healthy Snacks on the Go - Paris The query returns NULL if the title or the location are missing. The post with ID 2 has NULL as the result of our query, since its location is unknown. Example 2: The same is true for arithmetic operators. Say you want to calculate post engagement as the sum of likes and dislikes: SELECT title, likes, dislikes, likes + dislikes AS engagement FROM posts; Result: titlelikesdislikesengagement Quick Morning Routines!251530 Eco-Friendly Living Tips10 Healthy Snacks on the Go34 If any of the likes or dislikes fields is NULL, then the value returned in the engagement column is also NULL. Example 3: The same behavior is exhibited by regular functions, like UPPER(): SELECT title, UPPER(creator) FROM posts; titleUPPER(creator) Quick Morning Routines!JENNY Eco-Friendly Living Tips Healthy Snacks on the GoRACHEL82 The creator of the post “Eco-Friendly Living Tips” is unknown, and thus the expression UPPER(creator) returns NULL. Functions That Work with NULLs Fortunately, there are functions in SQL that help mitigate those problems with NULL. COALESCE COALESCE() takes many arguments and returns the first non-NULL value from its arguments. It is typically used to replace NULL with a meaningful value in another function or expression. We could modify our engagement query like this: SELECT title, likes, dislikes, COALESCE(likes, 0) + COALESCE(dislikes, 0) AS engagement FROM posts; Whenever the likes or dislikes value is NULL, the COALESCE() function replaces it with 0. The new value is used in the computation and we avoid NULL results: titlelikesdislikesengagement Quick Morning Routines!251530 Eco-Friendly Living Tips1010 Healthy Snacks on the Go3434 You can also use COALESCE() to give a meaningful label for NULLs in the results. The following query replaces NULL with “Unknown” in the result set; the field itself is still NULL in the database: SELECT title, COALESCE(location, ‘Unknown’) AS location FROM posts; Here’s the result: titlelocation Quick Morning Routines!London Eco-Friendly Living TipsUnknown Healthy Snacks on the GoParis Remember: You use the COALESCE() function: To provide a meaningful label for NULL in reports. To give a value for NULL in computations. NULLIF Another function working with NULL is NULLIF. This one is a bit weird: it takes two arguments and returns NULL if the arguments are equal. In practice, you use NULLIF to avoid division by zero: SELECT title, likes / NULLIF(views, 0) FROM posts; You want to compute the likes to views ratio for posts. However, if the views value is 0, you could get a division by zero error. In order to avoid this, you use the NULLIF function. If views is equal to zero, then NULLIF(views, 0) returns NULL. NULL in division results in NULL result and avoids the division by zero error. Here we take advantage of NULL cascading over the results of the computations. NULL in GROUP BY and Aggregate Functions When working with missing values, it’s good to know how NULL behaves in GROUP BY and aggregate functions. NULL and GROUP BY GROUP BY puts rows into groups based on common values in a given column. You can then apply aggregate functions to each group and compute summaries for each group. This query counts the number of posts for each location: SELECT location, COUNT(*) FROM posts GROUP BY location; With GROUP BY, all rows with NULL in the column are put into one group; you compute statistics for this group like any other. In our example, all posts with an unknown location are put into one group: locationCOUNT London45 Paris23 12 …… NULL and Aggregate Functions Generally speaking, aggregate functions also ignore NULLs. But there are some important variants in how some aggregate functions handle NULLs. The functions SUM(), MIN(), MAX() all ignore NULLs: SELECT type, SUM(views), MIN(views), MAX(views) FROM posts GROUP BY type; typeSUMMINMAX video230,4855,632100,589 image159,3401,28945,003 text34,2242563,341 infographics The SUM() function treats NULL as if it were 0, so NULL does not influence the result of the SUM. But if all values in the group are NULL, the result of the SUM() is NULL. In our example, we don't have any view data for the infographics group, so the sum is NULL for this group. The MIN() and MAX() functions also ignore NULL; they return the minimum and maximum values from the known values. Only if all values in the group are NULL will these functions return NULL. Our infographic group has no data, so the minimum and maximum values are reported as NULL. The COUNT() function is a bit more subtle when it comes to handling NULL. There are three variants of COUNT syntax: COUNT(*), COUNT(expression), COUNT(DISTINCT). You can read about them in our article What is the Difference Between COUNT(*), COUNT(1), COUNT(column), and COUNT(DISTINCT)?: SELECT COUNT(*), COUNT(location), COUNT(DISTINCT location) FROM posts; COUNTCOUNTCOUNT 1007852 The expression COUNT(*) counts all the rows in the results set. There are 100 posts in our posts table, so this expression returns 100. The expression COUNT(location) counts non-NULL values in the given column. In our example, it will count posts where the location column is not NULL. It will ignore posts with unknown locations. Finally, the COUNT(DISTINCT location) counts distinct non-NULL values; in other words, it ignores repeated values. It will count how many different locations there are in our posts table. The AVG() function ignores NULL. This is generally what you expect. However, you should be careful when using AVG() with COALESCE(). All of the following variants return different values: AVG(views), AVG(COALESCE(views,0)), COALESCE(AVG(views)). Remember: Rows with NULLs in GROUP BY columns are put into a separate group. Aggregate functions ignore NULL and only use known values in computations. Use COALESCE if you want to replace an unknown value with a specific value. NULL and JOIN You have to remember about NULL when using JOIN, especially with OUTER JOINs like LEFT JOIN or FULL JOIN. There can be NULLs in columns coming from the right table. Imagine we have another table, comments, that contains data about post comments. It has information in the following columns: id – A unique identifier for each comment. post_id – The ID of the post this comment is about. content – The content of the comment author – The author of the comment upvotes – The number of upvotes given to this comment; this can be NULL downvotes – The number of downvotes given to this comment; it can be NULL We want to count how many comments there are for each post, but we want to include posts with no comments in the results. You have to use posts LEFT JOIN comments to include all posts. Next, you have to remember to use COUNT(comments.id) and not COUNT(*) when counting comments. The latter will count rows regardless of whether the row is related to the comment. The correct way is to use COUNT(comments.id). If there are no comments, the id is NULL and it is not counted. SELECT posts.title, COUNT(comments.id) FROM posts LEFT JOIN comments ON posts.id = comments.post_id; Another problem to keep in mind is that the WHERE condition can sometimes “cancel” the OUTER JOIN. In the query below, we want to find comments with upvotes higher than 100. If the post has some comments with an unknown number of upvotes, these comments will not be included in the result. If the post only has comments with an unknown amount of upvotes, the post will not be included at all – despite our using LEFT JOIN. The WHERE condition will effectively “cancel” the LEFT JOIN: SELECT posts.title, comments.content FROM posts LEFT JOIN comments ON posts.id = comments.post_id WHERE upvotes > 100; Remember: LEFT JOIN, RIGHT JOIN, or FULL JOIN can introduce NULL in the result. The WHERE condition can “cancel out” the OUTER JOIN. NULL in ORDER BY When you’re creating a report, you often want to sort the data in a specific order – e.g. in alphabetical, ascending, or descending order. How does NULL behave in sorting? When you order by a column that contains NULL, rows with NULL will come up first or last, depending on the database engine you’re using. For example, MySQL places NULLs first for ascending order sorts, while Oracle places them last for ascending order sorts. You can check the default behavior of your database in its documentation. If you don’t remember the default behavior or you don’t like the default behavior, you can use the NULLS FIRST or NULLS LAST operators after ORDER BY to specify the desired behavior: SELECT title, views FROM posts ORDER BY views DESC NULLS LAST; This ensures any rows with a NULL in them are listed last: titleviews Quick Morning Routines!120,365 …… Eco-Friendly Living Tips256 Easy At-Home Workouts for All Levels Healthy Snacks on the Go You can read in detail about how NULL works with ORDER BY in How ORDER BY and NULL Work Together in SQL. Handle Missing Data with NULL in SQL! Handling NULL and missing data in SQL is an important skill for anyone working with data. Understanding the nuances of NULL, its behavior in different operations, and the best practices for managing missing data ensures that your queries are accurate and your analyses are reliable. To deepen your understanding of SQL, consider buying our All Forever SQL package. This one-time payment offer provides lifetime access to all current and future SQL courses. The courses cover everything from basic queries to advanced SQL; what you learn will be useful at all levels of your career. Also, be sure to check out our SQL Practice track with 10 SQL practice courses and over 1,000 exercises. Elevate your skills for the long term with LearnSQL.com! Tags: NULL SQL Operators