A selection of machine learning datasets

    Hello reader!

    This article is a guide to open datasets for machine learning. To start, I have collected a selection of interesting and (relatively) fresh datasets. As a bonus, at the end of the article I list useful links for finding datasets on your own.

    Fewer words, more data.


    A selection of datasets for machine learning:

    • Game of Thrones deaths and battles - this dataset combines three data sources, each based on information from the book series.
    • Global Terrorism Database - Over 180,000 terrorist attacks worldwide, 1970-2017.
    • Bitcoin, historical data - Bitcoin data at 1-minute intervals from selected exchanges, January 2012 - March 2019.
    • FIFA 19 full player dataset - 18k+ FIFA 19 players, ~90 attributes extracted from the latest FIFA database.
    • YouTube Video Statistics - Daily trending statistics for YouTube videos.
    • Overview of Suicide Indicators from 1985 to 2016 - Comparison of socio-economic information with suicide rates by year and country.
    • Huge Stock Market Dataset - historical daily prices and volumes of all US stocks and ETFs.
    • World development indicators - indicators of the development of countries from around the world.
    • Kaggle Machine Learning & Data Science Survey 2017 - Great insight into the state of data science and machine learning.
    • Gun Violence Data - A comprehensive record of more than 260,000 U.S. gun violence incidents in 2013-2018.
    • Chest X-ray (pneumonia) - 5,863 images, 2 categories.
    • Gender Recognition by Voice - This database was created to identify a voice as male or female based on the acoustic properties of voice and speech. The dataset consists of 3,168 recorded voice samples collected from male and female speakers.
    • Student Alcohol Consumption - The data was obtained from a survey of secondary school students taking math and Portuguese language courses. It contains a lot of interesting social, gender and educational information about the students.
    • Malaria cell dataset - cell images for detecting malaria.
    • Surveys of young people - data on the preferences, interests, habits, opinions and fears of young people.
    • World University Rankings - Explore the best universities in the world.
    • Credit Card Fraud Detection - Anonymous credit card transaction datasets marked as fraudulent or genuine.
    • Heart disease data - this database contains 76 attributes, such as age, gender, type of chest pain, resting blood pressure and others.
    • European football database - 25,000+ matches, plus player and team attributes for European professional football.
    • Wine Reviews - 130k wine reviews with variety, location, winery, price and description.
    • Baidu Apolloscapes - a large dataset for recognizing 26 semantically different objects such as cars, bicycles, pedestrians, buildings, street lamps, etc.
    • Comma.ai - more than seven hours of highway driving. The dataset includes information about vehicle speed, acceleration, steering angle and GPS coordinates.
    • Flowers Recognition - This dataset contains 4,242 images of flowers. The data was collected from Flickr, Google Images and Yandex Images.
    • Daily market price of every cryptocurrency - historical cryptocurrency prices for all tokens.
    • Chocolate Rating - An expert rating of more than 1,700 chocolate bars.
    • Health Insurance Marketplace - Data on health and dental plans offered in the US health insurance marketplace.
    • Heartbeat sounds - a dataset for classifying heartbeat abnormalities from stethoscope audio.
    • Database of anime recommendations - recommendations from 76,000 users on myanimelist.net
    • Blood cell images - 12,500 images: 4 different types of cells.
    • Chest x-ray - more than 112,000 chest radiographs from more than 30,000 unique patients.
    • Homicide Reports, 1980-2014 - The Murder Accountability Project is the most comprehensive database of homicides in the United States currently available.
    • Used Car Database - Over 370,000 used cars. The content of the data is in German, so you need to translate it first if you do not speak German.
    • The home of the U.S. Government's open data - data, tools, and resources for research, web and mobile app development, and data visualization.
    • National Center for Chronic Disease Prevention and Health Promotion (NCCDPHP). The center is working on reducing risk factors for chronic diseases.
    • The largest UK collection of social, economic and demographic resources.
    • EconData - several thousand economic time series, prepared by a number of US government agencies and distributed in various formats and media.
    • Coastal Research Center - interesting data on the sea and its biological composition. Here you can find datasets ranging from the analysis of data from the Red Sea model to the study of temperature and currents over the narrow southern California shelf.
    • Sign Language Digits Dataset - a sign language digits dataset from Ankara Ayrancı Anadolu High School, Turkey.
    • Red Wine Quality - a simple and clear practice dataset for regression or classification modeling.
    • Tables of the English Football Premier League (1968-2019).
    • HotpotQA Dataset - a question-answering dataset that supports building more explainable question-answering systems.
    • xView - one of the largest publicly available sets of overhead imagery of the Earth. It contains images of various scenes from around the world, annotated with bounding boxes.
    • Labelme - Large dataset of annotated images.
    • ImageNet - Image dataset for new algorithms, organized according to the WordNet hierarchy, in which each node is represented by hundreds or thousands of images.
    • LSUN - a dataset of images organized by scene and category, with partially labeled data.
    • MS COCO is a large-scale dataset for detecting and segmenting objects.
    • COIL100 - 100 different objects, each imaged at every angle of a full 360-degree rotation.
    • Visual Genome - a dataset of ~100 thousand images with detailed annotations.
    • Google's Open Images - A collection of 9 million image URLs, annotated with labels spanning over 6,000 categories, under a Creative Commons license.
    • Labeled Faces in the Wild - A collection of 13,000 labeled facial images of people, for use in applications that involve face recognition technology.
    • Stanford Dogs Dataset - Contains 20,580 images of 120 dog breeds.
    • Indoor Scene Recognition - a dataset for recognizing building interiors. Contains 15,620 images in 67 categories.
    • Oxford's Robotic Car - Over 100 repetitions of a single route through Oxford, captured over the course of a year. The dataset covers different combinations of weather conditions, traffic and pedestrians, as well as longer-term changes such as road works.
    • Cityscapes Dataset - a large dataset containing recordings of one hundred street scenes in 50 cities.
    • KUL Belgium Traffic Sign Dataset - more than 10,000 annotations of thousands of different traffic signs in Belgium.
    • LISA Laboratory for Intelligent & Safe Automobiles - a dataset with traffic signs, traffic lights, recognized vehicles and trajectories.
    • Bosch Small Traffic Light Dataset - dataset with 24,000 annotated traffic lights.
    • WPI datasets - dataset for the recognition of traffic lights, pedestrians and road markings.
    • Berkeley DeepDrive - a huge dataset for autonomous driving. It contains over 100,000 videos with over 1,100 hours of driving recordings at different times of day and in various weather conditions.
    • MIMIC-III - a dataset with anonymized health data for ~40,000 patients who underwent intensive care (demographic data, vital signs, laboratory tests and medications).
    • Amazon Reviews - Contains about 35 million reviews from Amazon spanning 18 years. The data includes product and user information, ratings, and the review text itself.
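
    Many of the tabular datasets above are distributed as CSV files, so a first look usually means counting rows and checking for missing values. Below is a minimal sketch of that step using only the Python standard library. The function name summarize_csv, the sample columns and the file name in the comment are hypothetical and not taken from any particular dataset.

```python
import csv
import io


def summarize_csv(stream):
    """Return (row_count, missing_per_column) for a CSV text stream."""
    reader = csv.DictReader(stream)
    columns = reader.fieldnames or []
    missing = {name: 0 for name in columns}
    rows = 0
    for row in reader:
        rows += 1
        for name in columns:
            value = row.get(name)
            # Treat absent or blank cells as missing values.
            if value is None or value.strip() == "":
                missing[name] += 1
    return rows, missing


# A tiny in-memory sample stands in for a downloaded file.
# For a real file (hypothetical name), one could use:
#   with open("dataset.csv", newline="", encoding="utf-8") as f:
#       rows, missing = summarize_csv(f)
sample = io.StringIO("country,year,rate\nA,2015,10.5\nB,2016,\nC,,7.1\n")
rows, missing = summarize_csv(sample)
print(rows, missing)
```

    For larger files or richer statistics, a dataframe library is usually more convenient; the sketch only shows the idea.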

    Useful links for finding datasets:

    • Of course, Kaggle is the meeting place for all lovers of machine learning competitions.
    • Google Dataset Search - Search for datasets across the Internet. If necessary, you can also add your own datasets.
    • UCI Machine Learning Repository - a collection of databases, domain theories, and data generators that the machine learning community uses for the empirical analysis of machine learning algorithms.
    • VisualData - search for datasets for machine vision, with convenient categorization.
    • DATA USA - a complete set of publicly available data from the USA with visualization, description and infographics.

    That brings our short selection to an end. If you have something to add or share, write in the comments.

    Knowledge to all!
    Subscribe to the Neuron channel on Telegram (@neurondata), where fresh articles and news from the world of data science appear every week. Thanks to everyone who helps with useful links, especially Igor Mariarty, Andrey Bondarenko and Matvey Kochergin.


    And what data could you collect?

    • Number of mosquitoes killed - 27.7% (23 votes)
    • The amount of coffee drunk in a lifetime - 19.2% (16 votes)
    • The number of references to your name when the project is being released - 24% (20 votes)
    • Your salary data (actually not) - 28.9% (24 votes)
