I remember the first time I was handed real data at my job. I was feeling pretty confident with SQL due to my projects and courses. But when I started to look at the database, I paused. I realized that perfectly clean data and neatly structured tables was not reality. In the real world data can get messy and table relationships can be hard to understand. I learned that while courses helped me build a strong foundation in SQL using it in the real world is very different. In this post, I'll be going into some challenges and insights on using SQL.
SQL in Real Life
Working with Dirty Data
In courses most of the data you work with is clean, organized, and perfectly structured. This is because it's easier to work and practice with clean data. It's good practice. In reality, a significant portion of a data analyst's role is working with messy and dirty data. Dirty data refers to data that's incomplete, improperly formatted, or duplicated. In other words, it's not suitable for analysis. This can mean dealing with missing values, duplicates, or inconsistent formatting.
For example, duplicated entries where there should be unique values, null values in crucial columns, inconsistent data categorizations (e.g. the same product being categorized as different things). At my work, for instance, when I encounter missing values, I investigate why the data might be missing. Was it a data entry error, or is the data unavailable. Depending on the reason, I might fill in the data myself or make a note of the reason.
Maintain a Focus on Business Objectives
If you take a course you're given problems and challenges to solve. Simply practicing is the reason why. But often every query you do, there's a business need or a question that needs to be answered. It's important to always remember the business objective behind your tasks. Whenever I'm writing SQL, I ask myself, "why are we doing this?" This simple question helps align my query with business goals. It helps prevent me from getting lost in the weeds.
I made this mistake in my first few weeks as an analyst. I overlooked the business context when I was writing my SQL query. I created a query that returned all of the data the stakeholder requested. But, it wasn’t exactly what the stakeholder was looking for. While the query returned the correct data, without understanding the business needs it was useless.
Understanding Table Relationships
In a perfect world databases would have easy-to-understand ERDs (entity relationship models) that show how databases are related to each other. In reality, there may not be an ERD or it may be so complex it's difficult to understand at first. For writing effective queries you need to understand the way data tables relate to each other. For example, knowing how a 'customers' table relates to an 'orders' table can be the difference between an accurate and an erroneous query.
If available, always start with your company's documentation. It can be a good resource for understanding table relationships. I also like to draw out the basic table relationships on a piece of paper. It helps me map out how they relate to one another. But in general this will take time but it's the foundation to your role as a data analyst.
The Art of Writing Queries
In courses there's often one correct answer. But in reality, SQL is as much an art as it is a science. Writing queries is open-ended, and there isn't just one 'correct' way to get the desired output. Over time, you develop your own style, preferences, and tricks to extract data efficiently.
For example, if you want to find out the total amount of sales for a particular product category from an e-commerce database. This information is in two tables orders and products . The orders table contains the sales data, and the products table categorizes each product.
Using a subquery:
SELECT products.category, SUM(sale_amount) as total_sales
FROM products
JOIN (
SELECT product_id, SUM(sale_amount) as sale_amount
FROM orders
GROUP BY product_id
) ON products.product_id = orders.product_id
WHERE products.category = 'Electronics'
GROUP BY products.category;
Here we first aggregate the sales data in the orders table by product_id . This subquery is then joined with the products table to get the category information.
Using a CTE:
WITH AggregatedOrders AS (
SELECT product_id, SUM(sale_amount) as sale_amount
FROM orders
GROUP BY product_id
)
SELECT products.category, SUM(AggregatedOrders.sale_amount) as total_sales
FROM products
JOIN AggregatedOrders ON products.product_id = AggregatedOrders.product_id
WHERE products.category = 'Electronics'
GROUP BY products.category;
In this query we use a CTE called AggregatedOrders to first aggregate the sales data by product_id . Then we use the CTE in the main query to join it with the products table and get the category information.
Both methods get the same results but are different ways of structuring the query. Choosing which one to use is on personal preference and how you want your query to be structure. Subqueries can sometimes make the query more compact, but CTEs make the query more readable and modular.
My Top 10 SQL Functions and How I Use Them
I've gone over what it's really like using SQL. Below I'm going to share the 10 most used functions I use at work (besides the basics like SELECT, FROM, WHERE etc.). I'm not going to dive into a detailed explanations of each function (check out SQL Tutorial from W3Schools for an explanation).
COUNT: I use it all the time to get the number of unique entries in specific columns.
SUM: It's good for getting total figures, like total sales or total number of logins.
INNER JOIN: I use this when I need to merge tables based on common columns, making sure data from both tables matches.
LEFT JOIN: Useful when I need data from one table and any corresponding data from a second table, but still want to retain all the original rows from the first table. This is the most common JOIN I use and most analysts feel the same way.
CASE WHEN: It's great for giving conditional statements and I often combine it with other functions like COUNT or SUM.
HAVING: Useful for filtering aggregated data, especially after using a GROUP BY clause.
DATE: This helps me convert strings or other data types into date format. A lot of my job involves working with dates.
CURRENT_DATE: Frequently used to filter data up to today's date or to add a timestamp to my analyses.
EXTRACT: Essential for pulling specific components from a date, like month or year, especially when analyzing trends over time.
DATE-DIFF: My go-to function when calculating the difference between two dates, such as the days between a user's first and last login.
Note: I mostly use BigQuery and PostgreSQL at my work. But every SQL language is different and you should double check the documentation for the language you're using.
How Can You Get Better At SQL?
While courses can build a foundation it's difficult to fully prepare for SQL until you have a full-time role. But there are ways to do it. Here are a few suggestions:
Projects, specifically ones where you create your own data set. It gives you practice cleaning the data, creating your own tables, analyzing the data, and giving insights.
Volunteer with a non-profit and help them analyze their data. I did this and it helped me get hands on experience with handling messy data.
Work part-time in a role that lets you handle and analyze a lot of data. This doesn't necessarily have to be a "data analyst role". I had a marketing role which let me utilize my data analysis skills.
Freelancing/consulting is also an option but it can be difficult to get started. If you need advice on how to get started I'd recommend checking out this LinkedIn post from Benjamin Rogojan.
If you are interested in courses I'd recommend checking out these ones:
Solve Real-World Data Problems with SQL from Jess Ramos. Note this is more of an intermediate and project-based course, so you need to have a good understanding of SQL before taking it. But you'll practice things like CTEs, CASE statements, UNIONs and more.
Any course from Maven Analytics, I've heard great things about their platform and courses. The courses go over real world problems and they also have free datasets where you can practice your own analysis skills.
Conclusion
In conclusion, while SQL is a powerful tool in a data analyst's toolkit, mastering it goes beyond knowing its syntax. It's about understanding the data, the business objectives, and the myriad ways in which queries can be structured to extract meaningful insights. As with many things in life, practice, experience, and continuous learning are key.
コメント