10 Easily Available Datasets For Deep Learning

Machine learning and deep learning rely on data. Without a suitable dataset, even a well-built model will not perform. Deep learning in particular requires cleaned, labelled and categorised datasets so that the neural network can be as accurate as possible: every decision, forecast and classification it makes depends on the data it was trained on. The framework and modelling can be handled by any of several libraries, but the data itself is the crucial part. The same holds in data science more broadly, where exploratory data analysis and data cleaning are the steps on which the results depend.

Let us have a look at some of the publicly available datasets for deep learning purposes:


1. MNIST:

MNIST is a dataset of 70,000 handwritten digits (60,000 for training and 10,000 for testing) and a standard for deep learning practice. It is a subset of NIST Special Database 19, which contains handwritten uppercase and lowercase letters as well as digits. MNIST is a great dataset for anyone who wants to practise pattern recognition and other learning techniques on real-world data, and it requires no data cleaning. It serves as a benchmark for computer vision, learning and classification tools.
Handwritten digits from MNIST dataset
Image source: https://github.com/cazala/mnist
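Most deep learning frameworks can download MNIST directly; a minimal sketch of the usual preprocessing (scaling pixels to [0, 1] and one-hot encoding the digit labels), shown here on randomly generated arrays with MNIST's 28x28 shape in place of the real download:

```python
import numpy as np

# Stand-in for the real MNIST arrays: 60,000 training images of 28x28
# greyscale pixels (0-255) and one digit label (0-9) per image.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(60000, 28, 28), dtype=np.uint8)
labels = rng.integers(0, 10, size=60000)

# Scale pixel intensities to [0, 1] and flatten each image to a 784-vector,
# the usual input format for a simple fully connected classifier.
x = images.reshape(len(images), -1).astype(np.float32) / 255.0

# One-hot encode the labels: digit 3 becomes [0,0,0,1,0,0,0,0,0,0].
y = np.eye(10, dtype=np.float32)[labels]

print(x.shape, y.shape)  # (60000, 784) (60000, 10)
```

The same two steps apply unchanged once the arrays come from the real dataset.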


2. EMNIST:

The EMNIST dataset extends MNIST to handwritten letters and is a good next step for people new to deep and machine learning research. Like MNIST, it is derived from NIST Special Database 19, and it covers digits along with uppercase and lowercase letters. The link is: https://www.westernsydney.edu.au/bens/home/reproducible_research/emnist.


3. CLEVR:

CLEVR stands for Compositional Language and Elementary Visual Reasoning. It is a diagnostic dataset for testing visual reasoning, developed to support Stanford research led by Fei-Fei Li on enabling computers to see and make sense of what they see. In total it holds around 100k rendered images and roughly a million generated questions: the training split has 70k images and 699,989 questions, the validation split has 15k images and 149,991 questions, and the test split has 15k images and 14,988 questions. The dataset's purpose is Visual Question Answering. The link is: https://github.com/facebookresearch/clevr-iep.
CLEVR dataset reasoning
Image source: https://cs.stanford.edu/people/jcjohns/clevr/
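The questions ship as JSON files that pair each question and answer with the rendered image it refers to. A small sketch of reading one such entry, using a single hand-written record in place of the real file (the field names follow CLEVR's published format):

```python
import json

# One entry in the shape of CLEVR's questions JSON; the real files hold
# hundreds of thousands of these under a top-level "questions" key.
sample = json.loads("""
{
  "questions": [
    {
      "image_filename": "CLEVR_train_000000.png",
      "question": "How many red cubes are there?",
      "answer": "2"
    }
  ]
}
""")

# Pair each question with its image so a visual question answering model
# can be fed both inputs together.
pairs = [(q["image_filename"], q["question"], q["answer"])
         for q in sample["questions"]]
print(pairs[0])
```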


4. Grammatical Error Correction corpus:

This dataset was developed for checking and correcting grammatical errors. The corpus covers a wide range of language skill levels, and its annotations are fluency edits that make sentences sound more native and grammatically correct. It serves as a benchmark for Grammatical Error Correction (GEC), comparing four systems to identify areas for improvement. Current machine learning systems still cannot correct text as well as a language expert, but with the help of deep learning, GEC systems may soon approach human-level performance. Either way, the dataset has set a remarkable benchmark for grammar correction.
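GEC systems are scored on the edits they propose relative to a reference correction. A toy illustration of extracting word-level edits with Python's standard difflib (a simplification, not the M2-style scorer real benchmarks use):

```python
import difflib

# A learner sentence and its reference correction, split into words.
source = "He go to school every days".split()
corrected = "He goes to school every day".split()

# Collect the word-level edits a correction system would be credited for.
edits = []
matcher = difflib.SequenceMatcher(a=source, b=corrected)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":
        edits.append((tag, source[i1:i2], corrected[j1:j2]))

print(edits)  # two replacements: go -> goes, days -> day
```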

5. Google's Open Images:

This dataset contains roughly 9 million annotated, labelled image URLs spanning about 6,000 image categories, making it one of the largest datasets available for training, as per: https://research.googleblog.com/2016/09/introducing-open-images-dataset.html. All the images carry a Creative Commons Attribution licence, and the labels cover a broad range of real-life subjects. A collaboration between Google, CMU and Cornell University, it is one of the most valuable datasets for people in the AI field.
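Because the release is a set of URLs plus separate label files rather than the images themselves, working with it starts by joining CSVs on the image ID. A tiny sketch with two made-up rows (the column names follow the released files; the URLs here are placeholders):

```python
import csv
import io

# Two stand-in rows in the shape of Open Images' image index;
# the real CSVs run to millions of rows.
raw = io.StringIO(
    "ImageID,OriginalURL\n"
    "0001,https://example.com/cat.jpg\n"
    "0002,https://example.com/dog.jpg\n"
)

# Build an id -> URL lookup so label annotations (kept in separate CSVs,
# keyed by ImageID) can be joined back to the images they describe.
urls = {row["ImageID"]: row["OriginalURL"] for row in csv.DictReader(raw)}
print(len(urls))  # 2
```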

6. STL-10 dataset:

This dataset is used for image recognition and draws inspiration from the CIFAR-10 dataset. It has a corpus of 100k unlabelled images along with 500 labelled training images and 800 test images per class, across 10 classes. The dataset is best suited to unsupervised, self-taught and deep learning systems. Its images are also of higher resolution (96x96) than CIFAR-10's, which makes it a bit more challenging: developing scalable supervised learning systems is tougher at this size. The link is: http://cs.stanford.edu/~acoates/stl10/.

STL-10 dataset images
Image source: https://cs.stanford.edu/~acoates/stl10
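STL-10 is distributed as flat binary files of uint8 pixel values, one 96x96 RGB image after another, stored column-major. A sketch of decoding that layout into standard image arrays, using synthetic bytes in place of the downloaded file:

```python
import numpy as np

# Fake two images' worth of raw bytes in STL-10's on-disk layout:
# all values for one image's red channel, then green, then blue,
# with each channel stored column-major.
n, side = 2, 96
raw = np.arange(n * 3 * side * side, dtype=np.uint8).tobytes()

# Decode: reshape to (N, channels, width, height), then transpose so each
# image comes out in the usual (height, width, channels) layout.
images = (np.frombuffer(raw, dtype=np.uint8)
            .reshape(n, 3, side, side)
            .transpose(0, 3, 2, 1))
print(images.shape)  # (2, 96, 96, 3)
```

With the real files, `raw` would come from reading `train_X.bin` or `unlabeled_X.bin` from disk.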

7. Uber 2B Trip Dataset:

The name is pretty self-explanatory. The dataset covers more than 4.5 million Uber pickups in NYC from April to September 2014, and millions more from January to June 2015. It also includes trip-level data for 10 other for-hire vehicle (FHV) companies, plus aggregated data for 329 more FHV companies. It has been used to analyse the traffic congestion that NYC attributes to Uber and other FHV services. The link is: https://github.com/fivethirtyeight/uber-tlc-foil-response.
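Each row in the trip files is a timestamped pickup with coordinates, so congestion-style analyses boil down to grouping by time or place. A small sketch counting pickups per hour of day, on three made-up rows in the shape of the released files:

```python
import csv
import io
from collections import Counter

# Three stand-in rows shaped like the FiveThirtyEight trip files,
# which record a timestamp and location for every pickup.
raw = io.StringIO(
    "Date/Time,Lat,Lon\n"
    "4/1/2014 0:11:00,40.7690,-73.9549\n"
    "4/1/2014 0:17:00,40.7267,-74.0345\n"
    "4/1/2014 17:21:00,40.7316,-73.9873\n"
)

# Count pickups per hour of day -- the kind of aggregation behind
# the NYC congestion analyses built on this dataset.
per_hour = Counter(
    row["Date/Time"].split()[1].split(":")[0]
    for row in csv.DictReader(raw)
)
print(per_hour["0"], per_hour["17"])  # 2 1
```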

8. Data Science Bowl 2017:

This famous competition among enthusiasts and professionals carried a $1 million prize. The dataset contains thousands of high-resolution, low-dose CT lung scans in DICOM format, sourced from the National Cancer Institute. Each scan consists of many axial slices of the chest cavity. The link is: https://www.kaggle.com/c/data-science-bowl-2017/data.
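A common first step with these CT scans is converting the raw pixel values to Hounsfield Units (HU), the standard density scale, using the rescale slope and intercept carried in each DICOM header. A sketch of that linear conversion with typical values, not read from a real file:

```python
import numpy as np

# Typical DICOM rescale parameters for CT: HU = pixel * slope + intercept.
slope, intercept = 1.0, -1024.0

# A stand-in 2x2 patch of raw scanner values.
raw_pixels = np.array([[0, 1024], [2048, 3071]], dtype=np.int16)

# On the HU scale: air is about -1024, water 0, dense bone above +1000.
hu = raw_pixels * slope + intercept
print(hu)
```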

9. Maluuba NewsQA dataset:

Maluuba, a Montreal-based company, compiled this crowd-sourced dataset to help develop algorithms capable of answering human-level queries, i.e. questions that require human-level comprehension and reasoning. It is a database of CNN news articles with over 100k question-answer pairs, and a good fit for people interested in Natural Language Processing. The link is: https://github.com/Maluuba/newsqa.
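NewsQA marks each answer as a character span inside the source article rather than free text, so recovering an answer is a string slice. A tiny sketch with a made-up article and span:

```python
# A hypothetical article and a question whose answer is marked as a
# character span (start, end offsets) into the article text.
article = "The rover landed on Mars on Thursday after a seven month trip."
question = "Where did the rover land?"
span = (20, 24)

# Recover the answer by slicing the article at the annotated offsets.
answer = article[span[0]:span[1]]
print(answer)  # Mars
```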

10. YouTube 8M dataset:

This dataset is for people researching or interested in video analysis. It is a gigantic training corpus of 8 million YouTube videos with tagged objects in them, and it has been instrumental in advancing several strands of video-based research, including transfer learning, noisy-data modelling and domain adaptation. It was even the subject of a Kaggle competition with a $100k prize. The link is: https://research.google.com/youtube8m/download.html.

YouTube 8M dataset with sample videos
Image source: https://ai.googleblog.com/2017/02/an-updated-youtube-8m-video.html
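YouTube-8M distributes precomputed per-frame feature vectors rather than raw video, and a common baseline collapses them into one video-level vector by averaging over time before classification. A sketch with random arrays standing in for the real features:

```python
import numpy as np

# Stand-in for one video's frame-level features: one 1024-dimensional
# feature vector per sampled frame (300 frames here).
rng = np.random.default_rng(0)
frames = rng.normal(size=(300, 1024)).astype(np.float32)

# Mean-pool over time to get a single video-level representation,
# which a simple classifier can then consume directly.
video_level = frames.mean(axis=0)
print(video_level.shape)  # (1024,)
```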


There are many more machine and deep learning datasets available on Kaggle. Depending on the problem area, though, deep learning datasets usually need to be cleaned and formatted beforehand. Hopefully the datasets mentioned above can help. For more, you can get in touch with our Deep Learning Kolkata centres.


