
How to Validate SQL Query Results

Updated: Apr 26



As a data analyst, how do I make sure my SQL queries return the correct results? Unlike in courses, there’s no single correct answer. A query might run without errors and still be inaccurate. Data validation has honestly saved me many times: there have been occasions when my first query returned no errors, but after investigating further I found it was incorrect. Data analysts need to deliver accurate, clean data. Without it, making good business decisions is a major challenge.


How I Validate Data

Below, I'll dive into the three main ways I validate data, with examples for each. These cover my methodology rather than a step-by-step guide.


Note: All examples in this blog post use simplified, fictional data for educational purposes only. In the real world, tables typically contain many more rows and columns, but I have simplified the data here to focus on the core concepts.


Method 1: Look at data as a whole

Take a comprehensive view of the data. This skill takes time to develop, and I’m still working on it, but I have a good enough understanding of the business to spot-check when numbers look incorrect. This is great for a quick check while I’m drafting the query. It doesn't require any specific tool; I’m just looking at the results as a whole and comparing them with the results I expect.


Example: Aggregation by Date


Below is a PostgreSQL query that gets the total purchases by day from the sales table.



The query selects the transaction date and the sum of purchase amounts from the sales table, filters transactions between January 1 and January 31, 2024, and groups the results by transaction date.
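Based on that description, the query might look something like this. This is a sketch, not the exact code from the screenshot; the column names transaction_timestamp and purchase_amount are assumed from the raw sales data shown later in the post.

```sql
-- Sketch of the daily-total query described above.
-- Column names (transaction_timestamp, purchase_amount) are assumed
-- from the raw sales data shown later in this post.
SELECT
    transaction_timestamp::date AS transaction_date,
    SUM(purchase_amount) AS total_purchases
FROM sales
WHERE transaction_timestamp >= '2024-01-01'
  AND transaction_timestamp < '2024-02-01'
GROUP BY transaction_timestamp::date
ORDER BY transaction_date;
```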


The results are returning six figures for purchases, which is what I expect. It would be a red flag if it were returning seven figures for each day. At a quick glance the aggregation seems to be working properly, but I would probably do a more in-depth validation (see the next two methods below).

| transaction_date | total_purchases |
| ---------------- | --------------- |
| 2024-01-01       | 100000          |
| 2024-01-02       | 150000          |
| 2024-01-03       | 125000          |


Method 2: Check results against database

Verify query results against the database directly. Instead of relying on my query’s aggregation, I look at the database and tables directly to make sure it’s correct. I mostly use this for simple aggregations like counts or sums, especially over a specific date range.


Example: Count of Total Purchases on a Specific Date


This PostgreSQL query gets the count of purchases on a specific date from the sales table.



The query extracts the count of purchases from the sales table for a specific date, January 1, 2024, using the COUNT function, and groups the count by the transaction date.
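A query matching that description might look like this (a sketch, using the same assumed column names as the raw sales data shown below):

```sql
-- Sketch of the single-day count query described above.
SELECT
    transaction_timestamp::date AS transaction_date,
    COUNT(*) AS count_purchases
FROM sales
WHERE transaction_timestamp::date = '2024-01-01'
GROUP BY transaction_timestamp::date;
```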


Result of the query:

| transaction_date | count_purchases |
| ---------------- | --------------- |
| 2024-01-01       | 10000           |

To validate the data, I’m going to look at the total count of purchases in that time period directly in the sales table, using this filter:

transaction_timestamp::date = '2024-01-01'

Note: in some database tools you can apply the WHERE clause as a filter at the top of the results without writing an entire query. This is how it works in DBeaver, the main tool that I use.


This is the raw data from the sales table. The timestamp is in YYYY-MM-DD HH:MM:SS format. We are going to ignore the item_code column for now because we only care about the total count of purchases on 2024-01-01.

| id    | transaction_timestamp | purchase_amount | item_code |
| ----- | --------------------- | --------------- | --------- |
| 1     | 2024-01-01 00:00:30   | 50              | A         |
| 2     | 2024-01-01 00:00:45   | 50              | A         |
| 3     | 2024-01-01 00:00:57   | 40              | B         |
| 4     | 2024-01-01 00:01:15   | 50              | A         |
| 5     | 2024-01-01 00:01:30   | 30              | C         |
| ...   | ...                   | ...             | ...       |
| 10000 | 2024-01-01 23:58:15   | 30              | C         |

Depending on the database tool, you can click a column and get metrics for it, including count, sum, min, and max. This is how I would quickly check the count of all purchases made (regardless of item) on 2024-01-01.
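If your tool doesn't surface per-column metrics, the same spot check can be done with a short query against the raw table (a sketch; the result should match the count_purchases value from the original query):

```sql
-- Cross-check: count the raw rows for the day, with no grouping involved.
SELECT COUNT(*) AS raw_row_count
FROM sales
WHERE transaction_timestamp::date = '2024-01-01';
```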


Method 3: Get subset of data

For queries where the result set is large, I look at a subset of the data to validate. Manually aggregating raw data can be a challenge, especially when there is a lot of it; breaking it into smaller chunks reduces the complexity. I make sure to take a subset large enough to give a manageable but representative sample for validation. I typically use this for analyses of specific user activity, like playing games.


Example:

This PostgreSQL query returns the total purchase amount per item by day from the sales table.



The query aggregates total purchases from the sales table over January 2024, grouping the results by transaction date and item code.
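A query consistent with that description might look like this (a sketch with the same assumed column names; the date range is taken from the results table below, which runs through January 31):

```sql
-- Sketch of the per-item daily totals query.
SELECT
    transaction_timestamp::date AS transaction_date,
    SUM(purchase_amount) AS total_purchases,
    item_code
FROM sales
WHERE transaction_timestamp >= '2024-01-01'
  AND transaction_timestamp < '2024-02-01'
GROUP BY transaction_timestamp::date, item_code
ORDER BY transaction_date, item_code;
```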


These are the results.

| transaction_date | total_purchases | item_code |
| ---------------- | --------------- | --------- |
| 2024-01-01       | 10000           | A         |
| 2024-01-01       | 15000           | B         |
| 2024-01-01       | 15000           | C         |
| 2024-01-02       | 9000            | A         |
| 2024-01-02       | 12000           | B         |
| 2024-01-02       | 7000            | C         |
| ...              | ...             | ...       |
| 2024-01-31       | 8000            | C         |

Instead of checking every single item for every single day of the month, I’m going to check the total_purchases only for the dates 2024-01-01 and 2024-01-15, and check the total_purchases for the items within those dates.


  • 2024-01-01
      ◦ A
      ◦ B
      ◦ C
  • 2024-01-15
      ◦ A
      ◦ B
      ◦ C

I’m going to validate my query by looking at the total purchase amount for that time period directly in the sales table, filtering transactions for one day at a time. Note: in some database tools you can apply the WHERE clause as a filter at the top of the results without writing an entire query. This is how it works in DBeaver, the main tool that I use.


transaction_timestamp::date = '2024-01-01'

This is the raw data from the sales table.

| id  | transaction_timestamp | purchase_amount | item_code |
| --- | --------------------- | --------------- | --------- |
| 1   | 2024-01-01 05:00      | 50              | A         |
| 2   | 2024-01-01 05:05      | 50              | A         |
| 3   | 2024-01-01 05:10      | 40              | B         |
| 4   | 2024-01-01 05:20      | 50              | A         |
| 5   | 2024-01-01 05:25      | 30              | C         |
| ... | ...                   | ...             | ...       |
| 100 | 2024-01-01 23:50      | 30              | B         |

Then I’d export the data to Excel and get the total purchases by item using a pivot table. I would repeat this process for the other day I’m looking at, 2024-01-15. I use Excel because it’s easy to pivot data there.
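If you'd rather not export at all, the same per-item totals for one day can be computed directly in SQL and compared against the pivot table (a sketch with the same assumed column names):

```sql
-- Per-item totals for a single day, to compare against the Excel pivot.
SELECT
    item_code,
    SUM(purchase_amount) AS total_purchases
FROM sales
WHERE transaction_timestamp::date = '2024-01-01'
GROUP BY item_code
ORDER BY item_code;
```

The trade-off is that this re-queries the same table, so a tool outside the database (like Excel) gives a more independent check.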


Tips

A few tips I’ve learned from validating data:

  1. Get help when you need it. If something doesn’t look right, ask a team member to review your work. Sometimes when I’ve been looking at a project for too long, my eyes glaze over and I miss something obvious. Don’t worry, this happens to all of us from time to time.

  2. Use different validation methods. I like to use a variety of methods when I validate. For instance, if the query aggregates data, I’ll cross-validate against the raw data using a tool like Excel. This varied approach strengthens validation and reduces errors.

Conclusion

While the validation process may seem tedious, it’s necessary. Accurate data ensures you’re basing insights on reliable information and helps the business make informed decisions. As a data analyst, you’ll probably spend a lot more of your time on cleaning and validation than you think. It’s not the most exciting task you’ll do, but it is the foundation of accurate, data-driven decisions.


