As a data analyst, how do I make sure my SQL queries return the correct results? Unlike in courses, there's no single correct answer. A query can run without errors and still be inaccurate. Data validation has honestly saved me so many times; more than once my first query ran cleanly, but after digging in I found it was wrong. Data analysts need to deliver accurate, clean data. Without it, making good business decisions is a major challenge.
How I Validate Data
Below, I'll walk through the three main ways I validate data, with examples for each. These cover my methodology rather than serving as a step-by-step guide.
Note: All examples in this blog post use simplified, fictional data for educational purposes only. In the real world, tables typically contain many more rows and columns, but I have simplified the data here to keep the focus on the core concepts.
Method 1: Look at the data as a whole
Take a comprehensive view of the data. This kind of intuition takes time to develop, and I'm still working on it, but I have a good understanding of the business and can spot-check when numbers look off. This is great for a quick sanity check while I'm drafting a query. It doesn't require any specific tool; I'm just looking at the results as a whole and comparing them with the results I expect.
Example: Aggregation by Date
Below is a PostgreSQL query that gets the total purchases by day from the sales table.
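For illustration, here is a minimal sketch of such a query, not the exact one I ran; it assumes the sales table columns (transaction_timestamp, purchase_amount) shown in the raw data later in this post:

```sql
-- Sketch only: sum of purchases per day from the sales table.
SELECT
    transaction_timestamp::date AS transaction_date,
    SUM(purchase_amount)        AS total_purchases
FROM sales
GROUP BY transaction_timestamp::date
ORDER BY transaction_date;
```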
The results are returning six figures for purchases, which is what I expect. It would be a red flag if the query were returning seven figures for each day. At a quick glance, the aggregation seems to be working properly, but I would probably still do a more in-depth validation (see the next two methods below).
| transaction_date | total_purchases |
| --- | --- |
| 2024-01-01 | 100000 |
| 2024-01-02 | 150000 |
| 2024-01-03 | 125000 |
Method 2: Check results against the database
Verify query results against the database directly. Instead of relying on my query's aggregation, I look at the tables themselves to make sure the result is correct. I mostly use this for simple aggregations like counts or sums, especially over a specific date range.
Example: Count of Total Purchases on a Specific Date
The PostgreSQL query below returns the count of purchases for a specific day from the sales table.
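As before, this is a sketch of what such a query could look like rather than the exact query:

```sql
-- Sketch only: count of purchases on a single day.
SELECT
    transaction_timestamp::date AS transaction_date,
    COUNT(*)                    AS count_purchases
FROM sales
WHERE transaction_timestamp::date = '2024-01-01'
GROUP BY transaction_timestamp::date;
```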
Result of the query:
| transaction_date | count_purchases |
| --- | --- |
| 2024-01-01 | 10000 |
To validate the data, I'm going to look at the total count of purchases in that time period directly in the sales table, filtering with:
transaction_timestamp::date = '2024-01-01'
Note: in some database tools you can just add the WHERE condition as a filter at the top of the results without having to write the entire query. This is how it works in DBeaver, which is the main tool I use.
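If your tool doesn't have that kind of filter box, the standalone version of this check is just a filtered SELECT (again, a sketch against the same sales table):

```sql
-- Sketch: pull the raw rows for the day being validated.
SELECT *
FROM sales
WHERE transaction_timestamp::date = '2024-01-01';
```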
This is the raw data from the sales table. The timestamp is in YYYY-MM-DD HH:MM:SS format. We are going to ignore the item_code column for now because we only care about the total count of purchases on 2024-01-01.
| id | transaction_timestamp | purchase_amount | item_code |
| --- | --- | --- | --- |
| 1 | 2024-01-01 00:00:30 | 50 | A |
| 2 | 2024-01-01 00:00:45 | 50 | A |
| 3 | 2024-01-01 00:00:57 | 40 | B |
| 4 | 2024-01-01 00:01:15 | 50 | A |
| 5 | 2024-01-01 00:01:30 | 30 | C |
| ... | ... | ... | ... |
| 10000 | 2024-01-01 23:58:15 | 30 | C |
Depending on the database tool, you can just click a column and get metrics for it, including count, sum, min, and max. This is how I would quickly check the count of all purchases made (regardless of item) on 2024-01-01.
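If your tool doesn't expose those column metrics, the same quick check can be written as a small aggregate query; this is a sketch, not part of the tool-based workflow described above:

```sql
-- Sketch: column metrics for purchase_amount on 2024-01-01,
-- computed in SQL instead of through the tool's UI.
SELECT
    COUNT(*)             AS count_purchases,
    SUM(purchase_amount) AS sum_purchase_amount,
    MIN(purchase_amount) AS min_purchase_amount,
    MAX(purchase_amount) AS max_purchase_amount
FROM sales
WHERE transaction_timestamp::date = '2024-01-01';
```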
Method 3: Get a subset of data
For queries where the result set is large, I validate against a subset of the data. Manually aggregating raw data is a challenge when there's a lot of it, and breaking it up into smaller chunks reduces the complexity. I make sure the subset is large enough to give a manageable but representative sample for validation. I typically use this for analysis of specific user activity, like playing games.
Example:
The PostgreSQL query below returns the total purchase amount by item and day from the sales table.
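For illustration, a sketch of such a query could look like this:

```sql
-- Sketch only: total purchase amount per item per day.
SELECT
    transaction_timestamp::date AS transaction_date,
    SUM(purchase_amount)        AS total_purchases,
    item_code
FROM sales
GROUP BY transaction_timestamp::date, item_code
ORDER BY transaction_date, item_code;
```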
These are the results.
| transaction_date | total_purchases | item_code |
| --- | --- | --- |
| 2024-01-01 | 10000 | A |
| 2024-01-01 | 15000 | B |
| 2024-01-01 | 15000 | C |
| 2024-01-02 | 9000 | A |
| 2024-01-02 | 12000 | B |
| 2024-01-02 | 7000 | C |
| ... | ... | ... |
| 2024-01-31 | 8000 | C |
Instead of checking every single item for every single day of the month, I'm only going to check total_purchases for each item on two dates: 2024-01-01 and 2024-01-15.
- 2024-01-01: items A, B, C
- 2024-01-15: items A, B, C
I'm going to validate my query by looking at the total purchase amounts for those dates directly in the sales table, filtering transactions for one day at a time. As noted above, in DBeaver I can just add the filter condition at the top of the results instead of writing the entire query.
transaction_timestamp::date = '2024-01-01'
This is the raw data from the sales table.
| id | transaction_timestamp | purchase_amount | item_code |
| --- | --- | --- | --- |
| 1 | 2024-01-01 05:00 | 50 | A |
| 2 | 2024-01-01 05:05 | 50 | A |
| 3 | 2024-01-01 05:10 | 40 | B |
| 4 | 2024-01-01 05:20 | 50 | A |
| 5 | 2024-01-01 05:25 | 30 | C |
| ... | ... | ... | ... |
| 100 | 2024-01-01 23:50 | 30 | B |
Then I'd export the data to Excel and get the total purchases by item using a pivot table; I use Excel because it's easy to pivot data there. I would then repeat this process for the other day I chose, 2024-01-15.
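For reference only, the per-item totals the pivot table produces correspond to an aggregate like the sketch below. The point of the Excel step is to compute these numbers independently of the original query, so this just describes what the pivot is doing rather than replacing it:

```sql
-- Sketch: what the Excel pivot computes for the spot-checked day,
-- i.e., total purchase amount per item on 2024-01-01.
SELECT
    item_code,
    SUM(purchase_amount) AS total_purchases
FROM sales
WHERE transaction_timestamp::date = '2024-01-01'
GROUP BY item_code
ORDER BY item_code;
```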
Tips
A few tips I’ve learned from validating data:
Get help when you need it. If something doesn't look right, ask a team member to review your work. Sometimes when I've been looking at a project for too long, my eyes glaze over and I miss something obvious. Don't worry; this happens to all of us from time to time.
Use different validation methods. I like to use a variety of methods when I validate. For instance, if the query aggregates data, I'll cross-validate against the raw data using a tool like Excel. This diverse approach strengthens the validation and reduces errors.
Conclusion
While the validation process may seem tedious, it's necessary. Validating ensures you're basing insights on reliable data and helps the business make informed decisions. As a data analyst, you'll probably spend a lot more of your time on cleaning and validation than you expect. It's not the most exciting task you'll do, but it is the foundation of accurate, data-driven decisions.