Data content creators often talk about the importance of having a portfolio, especially for people who are transitioning careers. The reason is that it showcases your skills, especially if you don't have previous experience in data, and it demonstrates your passion for learning. I've talked about the benefits of having a portfolio plenty of times. But one of the most common challenges is where to find a good dataset.
The datasets you're given in courses are usually small and already clean. They're good for practicing and improving your SQL knowledge, but it's not the same as SQL in the real world. Ideally, you want datasets that are as close to the real world as possible: ones that have a large number of rows and multiple tables that relate to each other.
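If you already know a little Python, here is a minimal sketch of what "multiple tables that relate to each other" means in practice. The customers and orders tables, their columns, and the sample rows are all made up for illustration; they don't come from any of the datasets mentioned in this post. It uses Python's built-in sqlite3 module, so there's nothing to install.

```python
import sqlite3

# In-memory database so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Two hypothetical tables: each order points back to a customer
# through the shared customer_id column.
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    amount REAL
);
INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben');
INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# Join the tables to answer a question neither table answers alone:
# how many orders did each customer place, and how much did they spend?
rows = conn.execute("""
SELECT c.name, COUNT(o.order_id) AS order_count, SUM(o.amount) AS total_spent
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.customer_id
GROUP BY c.name
ORDER BY c.name
""").fetchall()

for name, order_count, total_spent in rows:
    print(name, order_count, total_spent)
```

Course datasets often come as a single flat table, so you never get to practice this kind of join. A dataset with several related tables lets you show that skill in a portfolio.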
Where to find data
There are two main ways you can find complex datasets and a third bonus one.
Collect Personal Data
The first is time consuming, but it's what I did: you collect your own data and create a dataset. I did this with my weight lifting project. This demonstrates your ability to work through an entire data project from start to finish. You can collect, clean, analyze, and share your findings. Most analysts won't be dealing with collecting data and designing databases. That's usually reserved for data engineers or back-end data people. Fair warning, this method is ambitious and takes time. But there are no legal issues because it is all your data.
The simplest way to do this is to use an app that automatically collects the data for you, then export it to a .csv file or an Excel spreadsheet. I did this for my weight lifting project. I have an app where I track all my sets, reps, and weights, and I was able to export it all to a CSV file. Another tried and true method is entering your data into Excel yourself. Then you can either analyze it in Excel, export it to another tool like Google BigQuery, or visualize it using something like Tableau.
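For readers who already know some Python, here is a rough sketch of what a first pass over an exported CSV could look like. The column names (date, exercise, sets, reps, weight_lbs) and the sample rows are assumptions for illustration only; a real app export will have its own layout, so adjust the names to match your file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample of an app export. Real exports will differ.
SAMPLE_CSV = """date,exercise,sets,reps,weight_lbs
2023-01-02,Squat,3,5,185
2023-01-02,Bench Press,3,5,135
2023-01-09,Squat,3,5,195
"""


def total_volume_by_exercise(csv_file):
    """Sum sets * reps * weight for each exercise in the file."""
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        volume = int(row["sets"]) * int(row["reps"]) * float(row["weight_lbs"])
        totals[row["exercise"]] += volume
    return dict(totals)


if __name__ == "__main__":
    # With a real export, pass open("your_export.csv", newline="") instead
    # of the in-memory sample (the file name here is a placeholder).
    totals = total_volume_by_exercise(io.StringIO(SAMPLE_CSV))
    for exercise, volume in totals.items():
        print(f"{exercise}: {volume:.0f} lbs")
```

The same summary could just as easily be done with a pivot table in Excel or a GROUP BY in SQL. The point is only that once the data is in a CSV, you can take it into whatever tool you're practicing with.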
Find Publicly Available Data
Another option is to find public and free datasets. The benefit is that you get a dataset that's already created, so you can start analyzing right away. The drawback is that the data might not be clean or the quality might not be that good. Also, you might not know how the data was collected or have any other background information.
I've already collected a few links for this in my resources blog post. Here are the sites:
Datahub - This site covers a wide range of topics from climate change to entertainment, but it focuses on economic and business data.
Dataset Search - You're able to use Google to search for datasets. It's great if you have a particular topic in mind.
Kaggle - It has a variety of free datasets provided by users, covering everything from arts & entertainment to social science data.
Data Gov - Public data from the US government on everything from crime to healthcare.
Maven Analytics Data Playground - Datasets that are hand-picked by Maven's instructors. These range from fun datasets, like the Harry Potter movie scripts, to more business-focused ones, like the sales of a pizza place.
Awesome Public Datasets - A list of topic focused public data sources that are high quality. These are collected from blogs, answers, and user responses.
Datacamp Datasets - These datasets are from a variety of fields from real estate to retail. All the datasets have the data and packages needed.
NASA Data - Open data that NASA provides to the public. The dataset pages only hold the metadata, and the actual data may be on another NASA site. There will be links to the data in these other locations.
Google BigQuery - It's free to sign up, and Google has plenty of free datasets to practice with. You'll have to use Google's BigQuery syntax, which differs from other SQL dialects, but the basics are the same.
Why not scrape data?
Now, if you're familiar with Python and web scraping, you may be asking why I don't advocate for web scraping. It might not be legal. That's not always the case, but you want to be careful about which sites you scrape. Check out this video from Luke Barousse, Building a bot to scrape job data... How NOT to collect data, to see why he had to alter his project about LinkedIn job postings. Check out this link for the specific time when he discusses the legality of it.
Also, if you're a beginner data analyst, you probably won't want to start with Python. I consider it a bonus skill to learn once you have the fundamentals down. For now, if you're just starting out and trying to build a robust portfolio, I recommend finding some free datasets or building your own. If you do want to scrape, be careful: research the site you're getting the data from and the possible legal issues around it.
Bonus: Paid Courses/Newsletters
A bonus suggestion that may not be as straightforward, but it works: see if any courses or other paid resources give out a dataset. For example, Jess Ramos has a paid subscriber option for her newsletter that will get you exclusive datasets and projects.
I'm actually working on a SQL course with Luke Barousse that will have a full dataset you can use for a portfolio project. In full transparency, the actual course will be free, but there will be an optional paid portion that will get you some extra things. It will also help Luke's channel and me.
The first step to building a portfolio is to get data. While it can be challenging to find free and publicly available data, it's not impossible. If you have any other suggestions for sites to find datasets, feel free to email me at firstname.lastname@example.org and I'd be happy to add them to my data analytics resource blog page.
If you want to know how to build a robust portfolio, I suggest checking out this blog post I made a few months ago: A Guide to Creating a Well Rounded Data Analytics Portfolio.