Back to articles list Articles Cookbook
14 minutes read

NULLs and Handling Missing Data in SQL

Handling missing data (i.e. NULLs) in SQL can be challenging. NULLs can pose a lot of traps, especially if you don’t understand how they work. In this article, we’ll talk about dealing with NULL in SQL. We’ll also explain how to avoid common mistakes when working with NULLs.

Having missing data in your database is an unavoidable reality of life. There are many reasons why you may have missing data in your database: the complete data isn’t yet available, users provide incomplete information, changes to the database schema, database malfunctions, human error, and more. Working with missing data in SQL queries is challenging. In this article I’ll show the most common pitfalls associated with incomplete data in SQL.

If you want to practice working with missing or unknown data in SQL, I recommend our SQL Practice track. At the time of writing, it contains 10 courses for you to practice SQL – and we keep adding more! The courses are divided into sections and many of them have a section dedicated to NULL. I specifically recommend these courses to practice handling NULL in SQL:

What Is NULL in SQL?

In SQL, NULL represents a missing or undefined value in a database. It is used to denote the fact that the value in a field is absent or unknown.

A common beginner problem with NULLs is that your database will often not show NULLs explicitly. In the table below, the like field for post ID 1, the location field for post ID 2, and the views field for post ID 3 are all NULL. However, a database does not show NULL explicitly; instead, it shows an empty field. You have to be aware that NULL is a possibility to guess that these fields are NULL.

idtitlelocationviewslikes
1Quick Morning Routines!London94,365
2Eco-Friendly Living Tips123,8916,587
3Healthy Snacks on the GoParis9,457

It’s important to note, however, that NULL is different from an empty string or from a zero. NULL is the absence of value; it means that the value is unknown. As we shall soon see, NULL itself is actually not a real value. Many people say or write something like “There are NULL values in this field”, but that is technically incorrect.

When designing a table in a database, a database designer may decide that NULL is not allowed for a particular field. This is done by setting a NOT NULL constraint for a column when the table is created. Primary keys (columns that identify rows in a table) are also NOT NULL by default.

It is considered a best practice to avoid NULLs in your database design. So, as many columns as possible should be defined as NOT NULL.  It’s best to only allow NULLs when strictly necessary. However, data can be messy and sometimes NULL is unavoidable.

The Dataset

In this article, we’ll be using data from table posts. Imagine you’re scraping data from your favorite social media platform to do some analysis on it. The data from that platform is stored in this table. Here are the fields:

  • id – The ID of the post.
  • title – The title of the post.
  • url – The URL (web address) of the post.
  • creator – The name of the post’s creator.
  • published – The date the post was published.
  • type – The post’s type.
  • location –Where the post is published; this can be NULL if the location is unknown or irrelevant.
  • views – How many views each post has; this can be NULL if the creator chooses not to make this data publicly visible.
  • likes – The number of likes given to the post; this can be NULL if the creator chooses not to make this data publicly visible.
  • dislikes – The number of dislikes given to the post; this can be NULL because the platform does not show this data anymore. However, we may have the dislikes data for older posts.

Now that we’ve reviewed the data, let’s use it to understand NULL.

Comparison Operators with NULL

Even something as seemingly simple as how NULL behaves when used with comparison operators can be counterintuitive and surprising for beginners.

Suppose we want to find out how many rows in the table posts have missing views fields. For example:

SELECT COUNT(*)
FROM posts
WHERE views = NULL;

Result: 0 rows

Great, zero rows have missing views. Amazing! How many of them are non-missing, then?

SELECT COUNT(*)
FROM posts
WHERE views <> NULL;

Result: 0 rows

Also zero? Something must be wrong.

The issue here is that you should use the IS NULL and IS NOT NULL operators to test for NULLs:

SELECT COUNT(*)
FROM posts
WHERE views IS NULL;

Result: 34 rows

SELECT COUNT(*)
FROM posts
WHERE views IS NOT NULL;

Result: 66 rows

Why are these results so different from the previous two queries?

Three-Valued Logic in SQL

The comparison problems for NULL come from the fact that NULL is not a real value. This is the most important point you have to understand to work with NULL efficiently.

SQL uses three-valued logic. Each logical condition in SQL can have one of three values: TRUE, FALSE or NULL. NULL here means “I don’t know”. Whenever you use a condition in WHERE, SQL returns rows for which the logical condition in WHERE is TRUE. It does not return the rows for which the condition is FALSE (as you’d expect) and for which the condition is NULL (which is not always what you’d expect).

Let’s consider an example of how WHERE works:

SELECT COUNT(*)
FROM posts
WHERE views < 100;

This query counts posts where the views column has a value and that value is less than 100. Posts with unknown views are not counted. This may be counterintuitive: when you display the posts data, you see the emptiness in many views fields. You intuitively think that surely this emptiness is less than 100. But the emptiness means that the database doesn’t have the data; since it doesn’t have the data, it can’t say if it’s less than 100 or not. So the rows that have an empty views field are not counted.

SELECT COUNT(*)
FROM posts
WHERE views < likes;

This query returns rows where the number of views and the number of likes are both known and the views value is less than the likes value. It does not return rows where the views value is unknown or the likes value is unknown.

If you want to include the posts with empty views fields in the result, you have to explicitly filter for NULL:

SELECT COUNT(*)
FROM posts
WHERE views < likes OR views IS NULL;

Let’s go back again to the queries we started with:

SELECT COUNT(*)
FROM posts
WHERE views = NULL;

SELECT COUNT(*)
FROM posts
WHERE views <> NULL;

The WHERE condition compares the views column with NULL. However, NULL means “I don’t know the value”. The database cannot say if views is equal (or not) to an unknown value. It may or may not be, so the database says NULL – i.e. “I don’t know” – and those rows are not returned in the result.

Remember:

  • Test for NULL with IS NULL and IS NOT NULL
  • Comparison operators (like <, <=, >, >=, =, <>, and LIKE) return NULL if one of the arguments is NULL. If you want to include NULL, test for it explicitly with IS NULL or IS NOT NULL.

Using NULL in SQL Functions

NULL is equally troublesome in operators and functions. Most functions and operators return NULL when given NULL as an argument.

Example 1: Imagine we want to return the header for each post. (The header consists of the title, a dash, and the location). Here’s the query:

SELECT 
  title || ‘ - ‘ || location
FROM posts;

Result:

Quick Morning Routines! - London
Healthy Snacks on the Go - Paris

The query returns NULL if the title or the location are missing. The post with ID 2 has  NULL as the result of our query, since its location is unknown.

Example 2: The same is true for arithmetic operators. Say you want to calculate post engagement as the sum of likes and dislikes:

SELECT 
  title, 
  likes, 
  dislikes, 
  likes + dislikes AS engagement
FROM posts;

Result:

titlelikesdislikesengagement
Quick Morning Routines!251530
Eco-Friendly Living Tips10
Healthy Snacks on the Go34

If any of the likes or dislikes fields is NULL, then the value returned in the engagement column is also NULL.

Example 3: The same behavior is exhibited by regular functions, like UPPER():

SELECT 
  title, 
  UPPER(creator)
FROM posts;
titleUPPER(creator)
Quick Morning Routines!JENNY
Eco-Friendly Living Tips
Healthy Snacks on the GoRACHEL82

The creator of the post “Eco-Friendly Living Tips” is unknown, and thus the expression UPPER(creator) returns NULL.

Functions That Work with NULLs

Fortunately, there are functions in SQL that help mitigate those problems with NULL.

COALESCE

COALESCE() takes many arguments and returns the first non-NULL value from its arguments. It is typically used to replace NULL with a meaningful value in another function or expression. We could modify our engagement query like this:

SELECT 
  title, 
  likes, 
  dislikes, 
  COALESCE(likes, 0) + COALESCE(dislikes, 0) AS engagement
FROM posts;

Whenever the likes or dislikes value is NULL, the COALESCE() function replaces it with 0. The new value is used in the computation and we avoid NULL results:

titlelikesdislikesengagement
Quick Morning Routines!251530
Eco-Friendly Living Tips1010
Healthy Snacks on the Go3434

You can also use COALESCE() to give a meaningful label for NULLs in the results. The following query replaces NULL with “Unknown” in the result set; the field itself is still NULL in the database:

SELECT 
  title, 
  COALESCE(location, ‘Unknown’) AS location
FROM posts;

Here’s the result:

titlelocation
Quick Morning Routines!London
Eco-Friendly Living TipsUnknown
Healthy Snacks on the GoParis

Remember: You use the COALESCE() function:

  • To provide a meaningful label for NULL in reports.
  • To give a value for NULL in computations.

NULLIF

Another function working with NULL is NULLIF. This one is a bit weird: it takes two arguments and returns NULL if the arguments are equal. In practice, you use NULLIF to avoid division by zero:

SELECT 
  title, 
  likes / NULLIF(views, 0)
FROM posts;

You want to compute the likes to views ratio for posts. However, if the views value is 0, you could get a division by zero error. In order to avoid this, you use the NULLIF function. If  views is equal to zero, then  NULLIF(views, 0) returns NULL.

NULL in division results in NULL result and avoids the division by zero error. Here we take advantage of NULL cascading over the results of the computations.

NULL in GROUP BY and Aggregate Functions

When working with missing values, it’s good to know how NULL behaves in GROUP BY and aggregate functions.

NULL and GROUP BY

GROUP BY puts rows into groups based on common values in a given column. You can then apply aggregate functions to each group and compute summaries for each group. This query counts the number of posts for each location:

SELECT 
  location, 
  COUNT(*)
FROM posts
GROUP BY location;

With GROUP BY, all rows with NULL in the column are put into one group; you compute statistics for this group like any other.

In our example, all posts with an unknown location are put into one group:

locationCOUNT
London45
Paris23
12

NULL and Aggregate Functions

Generally speaking, aggregate functions also ignore NULLs. But there are some important variants in how some aggregate functions handle NULLs.

The functions SUM(), MIN(), MAX() all ignore NULLs:

SELECT 
  type, 
  SUM(views), 
  MIN(views), 
MAX(views)
FROM posts
GROUP BY type;
typeSUMMINMAX
video230,4855,632100,589
image159,3401,28945,003
text34,2242563,341
infographics

The SUM() function treats NULL as if it were 0, so NULL does not influence the result of the SUM. But if all values in the group are NULL, the result of the SUM() is NULL. In our example, we don't have any view data for the infographics group, so the sum is NULL for this group.

The MIN() and MAX() functions also ignore NULL; they return the minimum and maximum values from the known values. Only if all values in the group are NULL will these functions return NULL. Our infographic group has no data, so the minimum and maximum values are reported as NULL.

The COUNT() function is a bit more subtle when it comes to handling NULL. There are three variants of COUNT syntax: COUNT(*), COUNT(expression), COUNT(DISTINCT). You can read about them in our article  What is the Difference Between COUNT(*), COUNT(1), COUNT(column), and COUNT(DISTINCT)?:

SELECT 
  COUNT(*), 
  COUNT(location), 
  COUNT(DISTINCT location)
FROM posts;
COUNTCOUNTCOUNT
1007852

The expression COUNT(*) counts all the rows in the results set. There are 100 posts in our posts table, so this expression returns 100.

The expression COUNT(location) counts non-NULL values in the given column. In our example, it will count posts where the location column is not NULL. It will ignore posts with unknown locations.

Finally, the COUNT(DISTINCT location) counts distinct non-NULL values; in other words, it ignores repeated values. It will count how many different locations there are in our posts table.

The AVG() function ignores NULL. This is generally what you expect. However, you should be careful when using AVG() with COALESCE(). All of the following variants return different values: AVG(views), AVG(COALESCE(views,0)), COALESCE(AVG(views)).

Remember:

  • Rows with NULLs in GROUP BY columns are put into a separate group.
  • Aggregate functions ignore NULL and only use known values in computations.
  • Use COALESCE if you want to replace an unknown value with a specific value.

NULL and JOIN

You have to remember about NULL when using JOIN, especially with OUTER JOINs like LEFT JOIN or FULL JOIN.  There can be NULLs in columns coming from the right table.

Imagine we have another table, comments, that contains data about post comments. It has information in the following columns:

  • id – A unique identifier for each comment.
  • post_id – The ID of the post this comment is about.
  • content – The content of the comment
  • author – The author of the comment
  • upvotes – The number of upvotes given to this comment; this can be NULL
  • downvotes – The number of downvotes given to this comment; it can be NULL

We want to count how many comments there are for each post, but we want to include posts with no comments in the results. You have to use posts LEFT JOIN comments to include all posts.

Next, you have to remember to use COUNT(comments.id) and not COUNT(*) when counting comments. The latter will count rows regardless of whether the row is related to the comment. The correct way is to use COUNT(comments.id). If there are no comments, the id is NULL and it is not counted.

SELECT 
  posts.title, 
  COUNT(comments.id)
FROM posts
LEFT JOIN comments
ON posts.id = comments.post_id;

Another problem to keep in mind is that the WHERE condition can sometimes “cancel” the OUTER JOIN. In the query below, we want to find comments with upvotes higher than 100. If the post has some comments with an unknown number of upvotes, these comments will not be included in the result. If the post only has comments with an unknown amount of upvotes, the post will not be included at all – despite our using LEFT JOIN. The WHERE condition will effectively “cancel” the LEFT JOIN:

SELECT 
  posts.title, 
  comments.content
FROM posts
LEFT JOIN comments
ON posts.id = comments.post_id
WHERE upvotes > 100;

Remember:

  • LEFT JOIN, RIGHT JOIN, or FULL JOIN can introduce NULL in the result.
  • The WHERE condition can “cancel out” the OUTER JOIN.

NULL in ORDER BY

When you’re creating a report, you often want to sort the data in a specific order – e.g. in alphabetical, ascending, or descending order. How does NULL behave in sorting?

When you order by a column that contains NULL, rows with NULL will come up first or last,  depending on the database engine you’re using. For example, MySQL places NULLs first for ascending order sorts, while Oracle places them last for ascending order sorts. You can check the default behavior of your database in its documentation.

If you don’t remember the default behavior or you don’t like the default behavior, you can use the NULLS FIRST or NULLS LAST operators after ORDER BY to specify the desired behavior:

SELECT 
  title, 
  views
FROM posts
ORDER BY views DESC NULLS LAST;

This ensures any rows with a NULL in them are listed last:

titleviews
Quick Morning Routines!120,365
Eco-Friendly Living Tips256
Easy At-Home Workouts for All Levels
Healthy Snacks on the Go

You can read in detail about how NULL works with ORDER BY in How ORDER BY and NULL Work Together in SQL.

Handle Missing Data with NULL in SQL!

Handling NULL and missing data in SQL is an important skill for anyone working with data. Understanding the nuances of NULL, its behavior in different operations, and the best practices for managing missing data ensures that your queries are accurate and your analyses are reliable.

To deepen your understanding of SQL, consider buying our All Forever SQL package. This one-time payment offer provides lifetime access to all current and future SQL courses. The courses cover everything from basic queries to advanced SQL; what you learn will be useful at all levels of your career. Also, be sure to check out our SQL Practice track with 10 SQL practice courses and over 1,000 exercises. Elevate your skills for the long term with LearnSQL.com!