“How much can anyone really care about sepal length?” my friend complained to me over coffee a few days ago. She was referring to the built-in `iris` dataset in R, whose data dates all the way back to 1936. “Why do college professors try to teach us data science with crappy, boring, pointless data when there’s so much great data out there for data science projects?”
She’s right. It’s really tough to motivate yourself to learn data science or do data science projects when your data is boring or meaningless to you. I know I struggled to motivate myself to learn data science until I found some good crunchy data that interested me.
In this article, I’m going to break down 10 amazing websites where you can grab some really awesome data for data science projects. The purpose will be to showcase a variety of data that might appeal to you. Ultimately, these websites should help you find data you care about, do a cool data science project, and use that to get a job.
How Did I Vet These Data Sources?
If you see a website in this article, it’s because the data it contains is:
- Freely available. You won’t have to pay for it.
- Community-oriented. It’s not just going to be a file; there will be some commentary and explanation around it.
- Cool. It’s something that someone, somewhere will care about. Maybe you!
- Clean-ish. The data is tidy enough that you’ll get to practice the fun part of data science – analyzing, visualizing, sharing, and so on.
- Language-agnostic. You can dig into these with Python, R, SQL, or any other language you like.
1. Google’s Dataset Search
I’m cheating a little bit, because this isn’t really a website for datasets, but rather a search engine for datasets. But it’s too good not to include.
Google’s Dataset Search is just like Google, but for datasets. You type in your query, and Google returns as many datasets as it has on that subject.
For example, searching “cats” brings me over one hundred datasets, including a dataset containing over 9,000 images of cats.
2. Kaggle’s Datasets
Kaggle’s Datasets is also a search engine, but it’s both more limited and more focused.
It’s more limited because it only contains datasets that people have published with Kaggle. But it’s more focused because the datasets aren’t just whatever random set of numbers Google scraped. Kaggle is a home for data science competitions, so the datasets it collects are extremely relevant to data science.
Kaggle also lets you filter by your specific interest. For example, I can stumble across that same cat dataset if I search “cat” with the “computer vision” filter on.
3. KDNuggets
This may come as a surprise to you, but KDNuggets curates a great set of datasets. These datasets are specifically for Data Science, Machine Learning, AI & Analytics.
Many of these aren’t KDNuggets exclusives, but it’s a good list to poke around in. It’s worth noting that when you sign up as a KDNuggets email subscriber, you also get access to World Data AI, which itself contains 3.5 billion datasets.
4. Government websites
I could easily expand this list to about a million entries simply by individually listing each of the government websites I like to use to get data. I won’t. Instead, I’ll offer a small list here:
Governments are constantly collecting data to do studies, and many of them publish that data online.
5. Pudding.cool
If you like your data to come with a heady dose of pop culture, look no further than Pudding.cool. This website looks at topics as varied as repetitive pop lyrics, women’s pockets, and how The Big Bang Theory gets censored by the Chinese government.
It’s more of a digital magazine: it publishes longform essays about culture and backs them with plenty of data. I’m including it here because they tell awesome stories and share their data.
Another essay-driven pop culture website with freely available data you can purloin. They focus more on sports and politics. It’s less data-driven, but I’m giving it a spot on this list because it still curates and shares datasets.
7. Tidy Tuesdays
Now, the reality of the matter is that data often isn’t tidy at all. Tidy Tuesdays isn’t exactly a website with datasets per se, but it’s a weekly event and community with an emphasis on using data science to explore untidy data.
Every week, a new dataset drops. Participants are encouraged to share their cleaning techniques and visualizations with each other on GitHub and Twitter.
8. GitHub
GitHub is the home of a lot of data. You can easily search, filter, and download data to play around with on your own. However, the data quality is highly variable: because anyone can upload data, it’s not always in great condition.
Still, I feel the benefits make up for that.
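Since quality varies so much, it can help to run a quick sanity check on any file you download before you invest time in it. Here’s a minimal Python sketch of one way to do that, using only the standard library: it counts the rows and the empty cells per column. The sample data and the `summarize_missing` helper are made up for illustration; in practice you’d read the text of whatever CSV you actually downloaded.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a CSV you downloaded from a repo.
SAMPLE_CSV = """name,breed,age_years
Milo,Siamese,3
Luna,,5
Oscar,Maine Coon,
"""


def summarize_missing(csv_text):
    """Return (row_count, Counter of empty or absent cells per column)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = Counter()
    row_count = 0
    for row in reader:
        row_count += 1
        for column, value in row.items():
            if column is None:
                # Extra cells beyond the header land under the None key; skip them.
                continue
            # Short rows yield None; blank cells yield an empty string.
            if value is None or value.strip() == "":
                missing[column] += 1
    return row_count, missing


rows, missing = summarize_missing(SAMPLE_CSV)
print(f"{rows} rows")
for column, count in missing.items():
    print(f"{column}: {count} missing")
```

If a column you care about turns out to be mostly empty, that’s a good sign to look for a different dataset, or to plan for some cleaning first.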
9. Buzzfeed
Buzzfeed doesn’t just do quizzes that comment on the human condition by asking you to build a salad. It may not be as well known for this, but Buzzfeed does a lot of quality data journalism.
10. Awesome Public Datasets
I’m ending this list with a pretty self-explanatory title: Awesome Public Datasets. This repo lives on GitHub and contains (mostly) free datasets to explore. They come from online datasets, user suggestions, and research papers.