Alexei Chernobrovov, Consultant on Analytics and Data Monetization

Where to find data for Machine Learning: 3 methods for preparing a dataset from open sources

Every data scientist dreams of a clean sample without errors, outliers or missing values that still contains all the data needed to solve the task at hand. In reality, however, the information is often incorrect or incomplete. Preparing a dataset for modeling (cleaning, normalizing and generating variables) therefore takes a lot of time, because the effectiveness of Machine Learning (ML) models depends on the quality of the original dataset.

At the same time, the data needed to form predictors and target variables may simply be absent from the initial sample. The data scientist then faces the question of how to enrich the dataset. In this article, we consider 3 legal ways of obtaining other people's big data for our own business tasks:

  1. the use of ready datasets;
  2. working with web-platforms that provide statistics;
  3. the use of information from external websites.

1. Ready datasets

Structured data samples on various topics can be downloaded independently from the following sources:

1.1 Competition platforms and Data Science (DS) and ML competitions that provide sets of raw data as part of solving a specific problem;

1.2 DS and ML websites and communities that host datasets for individual and group education and for scientific research;

1.3 Open data – information in a machine-readable format posted on the Internet and available to all users for free use and further publication without copyright restrictions or other methods of control [1].

Each of these sources is described in more detail below.

1.1 Competition platforms and Data Science and Machine Learning competitions

These are individual and team olympiads, championships, hackathons, and data analysis and machine learning competitions held nationally and around the world, usually online. In Russia, the biggest and most popular events are the following [2]:

  • Russian ML Cup – Mail.Ru Group online Machine Learning championship for adult participants (18+) with professional experience in Data Science – winners usually receive valuable prizes and are invited to an interview with the company;
  • - one of the biggest platforms for individual and team data analysis championships in Russia and Eastern Europe. Its clients are major international and Russian companies: Avito, Gazprom, Rosbank and others. They set participants a real business challenge (credit scoring, for example), provide the initial data, and evaluate and reward the winners [3].

At the international level, the following competition venues are considered to be the most popular [2]:

  • Kaggle – a Google platform for machine learning and data analysis competitions. The organizers are major companies (Google, Intel, Mercedes-Benz and others), which set the evaluation criteria, terms and prizes of their corporate competitions themselves. Up to 20 competitions run at the same time, and anyone can take part. Winners receive cash prizes ranging from 15 to 100 thousand dollars. In addition, Kaggle hosts datasets that can be downloaded and used for free, even outside of competitions.
  • KDD CUP – the cup of the Association for Computing Machinery's Special Interest Group on Knowledge Discovery and Data Mining, for teams of up to 10 people. As a rule, the main task is to solve a socially significant problem, for example, predicting air pollution. Winners receive a cash reward and present their solutions in person at the KDD CUP workshop in London.

Participating in such competitions helps both beginners and experienced DS and ML specialists to improve their professional level.

All participants of ML competitions get the main prize: important experience


1.2 Data Science and Machine Learning websites and communities

The largest collections of structured datasets that are free for individual and group use in science, education or business can be found at the following sources.

The following datasets are the most popular samples for learning ML:

  • Titanic – a structured sample on passenger survival in the wreck of the legendary ship;
  • Boston – the Boston property price dataset;
  • Visual Genome – a huge database of images with detailed annotations.

Also the following sources could be used:

1.3 Open data

Open data can come from various sources, the largest and most regular of which are state bodies and scientific communities. For example, as part of the e-Government concept, which has been actively developing around the world since the early 2000s [4], many countries, including Russia, have created websites for disseminating information processed in the general government sector: open government data (OGD).

OGD are published by government agencies, local governments, or organizations under their jurisdiction. The Russian open data portal contains more than 22 thousand datasets in CSV format covering all areas of activity, from the economy to electronics, both for the country as a whole and for individual regions [5]. In addition, under Decree of the Government of the Russian Federation No. 583 of 10.07.2013, each subject of the federation and each federal-level department (ministries, services, committees, etc.) must regularly publish structured data arrays in machine-readable formats (CSV, JSON, XML) on their Internet portals for universal and free use [6]. For example, the Moscow government open data portal contains more than 600 datasets about the capital, from the register of entrepreneurs to bicycle parking lots.

OGD are unnumbered records of objects with a minimal set of descriptive characteristics. For example, the Moscow bicycle parking dataset is a table of 3 columns: the name of the parking lot, its address and the number of places. There are no data on usability, security or automation. Ready-made features therefore cannot be extracted directly from OGD sets; however, the information they contain is useful when you build predictors yourself.
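As a minimal sketch of how such a table can still feed a predictor, the Python snippet below aggregates a bicycle parking table into one feature: the total number of parking places per district. The column names, the sample rows and the assumption that each address starts with a district name are hypothetical; the real dataset may be laid out differently.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the shape described above: name, address, places.
# The column names and the "district, street" address layout are assumptions.
SAMPLE_CSV = """name,address,places
Parking 1,"District A, Street 1",10
Parking 2,"District A, Street 2",6
Parking 3,"District B, Street 3",8
"""


def places_per_district(csv_text):
    """Sum parking places per district (the text before the first comma)."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        district = row["address"].split(",")[0].strip()
        totals[district] += int(row["places"])
    return dict(totals)


features = places_per_district(SAMPLE_CSV)
print(features)
```

The resulting dictionary could then be joined to the main sample by district, which is one possible way of turning a descriptive register into a numeric predictor.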

Moreover, OGD sets contain no aggregate statistics, for example on the subsistence level by region. Such information should be sought on the website of the Unified Interdepartmental Statistical Information System (UISIS). On the UISIS portal, data come in different formats, both tabular and text (XML, DOC), so using them requires additional time to convert the information.

Datasets on international organizations, events and situations are also available, for free or by paid subscription, on web portals that aggregate information, for example the Knoema service. In addition, you can use Google's search tools that focus specifically on datasets.

Searching for data is an important task for a data scientist


2. Web-platforms with statistics

While state statistics show accomplished facts, for example the number of mortgage loans issued in a region, search query statistics indicate the intentions of potential customers. This helps to understand consumer needs, which is especially important in demand forecasting and other analytics tasks, not only in marketing. For example, you can identify the frequency of seasonal acute respiratory viral infections by region, or analyze traffic congestion using Yandex.Maps to build optimal logistics for delivering goods or transporting passengers.

To obtain statistical data, including in real time, you can use the API (Application Programming Interface) provided by third-party services. Examples include API solutions from Google, the APIs of various Yandex services, and trading systems that receive information from stock, currency and cryptocurrency exchanges for automated trading in securities. The dataset, as a rule in JSON format, is accessed through a URL. The data scientist then has to analyze the contents of the JSON file further using their own scripts.
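The workflow described above (request a URL, receive JSON, process it with your own script) might look roughly like the following Python sketch. It uses only the standard library; the field names and the sample payload are hypothetical and do not correspond to any particular service.

```python
import json
from urllib.request import urlopen


def fetch_json(url, timeout=10):
    """Download a URL and decode its body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def to_rows(payload, fields):
    """Flatten a list of JSON objects into rows, using None for missing keys."""
    return [[item.get(field) for field in fields] for item in payload]


# Hypothetical payload standing in for an API response; no request is sent here.
sample = json.loads('[{"region": "A", "queries": 120}, {"region": "B"}]')
rows = to_rows(sample, ["region", "queries"])
print(rows)
```

With a real service, `fetch_json` would be called with the endpoint URL from that service's documentation, and its result passed to `to_rows`. Missing keys become `None`, so gaps in the response stay visible for later cleaning.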

Free and commercial tools for web audit and SEO promotion (Wincher, MegaIndex, Similarweb and others) help to collect keywords and search phrases, determine their frequency and cluster them by various criteria (gender, age, region, etc.). You can also collect such statistics yourself by installing Google Analytics or Yandex.Metrica counters on your site. The word search service from Yandex is useful for this as well. Note that these services do not provide datasets in the usual structured form, so you have to preprocess the data yourself.

3. Scraping of external websites

Not all websites provide an API for obtaining data automatically. This is where scraping technologies come in.

Scraping is the process of analyzing the contents of web pages using parser robots: special programs or scripts. The best-known parsers on the web are search engine robots, which analyze sites and save the results in their databases in order to return the most relevant documents for a search. You can write your own web parser in Python or PHP to automatically obtain data from external resources according to specified characteristics (images, keywords, etc.). The easiest way, however, is to use the ready-made libraries that exist for all popular programming languages.
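To illustrate what a very small self-written parser could look like, here is a sketch based on `html.parser` from the Python standard library. It collects link targets and image sources from a page. The sample HTML is hypothetical, and a real script would first download the page and would usually rely on a dedicated scraping library instead.

```python
from html.parser import HTMLParser


class LinkAndImageParser(HTMLParser):
    """Collect link targets and image sources from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])
        elif tag == "img" and attributes.get("src"):
            self.images.append(attributes["src"])


# Hypothetical page content; a real script would download the HTML first.
SAMPLE_HTML = """
<html><body>
  <a href="/catalog">Catalog</a>
  <img src="/img/item.png" alt="Item">
  <a>No target</a>
</body></html>
"""

parser = LinkAndImageParser()
parser.feed(SAMPLE_HTML)
print(parser.links, parser.images)
```

Tags without the relevant attribute are skipped, so the collected lists contain only usable values.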

Using information from external sites in this way is perhaps the most creative way to enrich the initial sample, since it relies on finding available data from competitors or related fields within the business context of the problem being solved. This is how the Red Roof hotel chain built its marketing strategy: it promptly offered passengers of delayed or canceled flights accommodation in hotels near the airports. Possible delays or cancellations are predicted from adverse weather forecasts, which are publicly available on many websites. The maximum number of guests is determined from the number of passenger seats on the planes of canceled or delayed flights. Based on this information, the potential room occupancy is calculated and an advertising campaign is built around the offer for a particular location. This approach increased the chain's growth by 10% in the areas covered by the campaign [7].

However, to apply this method in practice, a data scientist must immerse themselves deeply in the specifics of the analyzed industry. Furthermore, the data obtained by scraping need additional processing before they can be used to build predictors for an ML model. As a rule, this means writing additional scripts to process the received data.
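The arithmetic behind the hotel example can be sketched in a few lines of Python. Everything numeric here is an assumption made for illustration: the flight list, the share of stranded passengers who book a room and the number of guests per room do not come from the case study.

```python
from collections import defaultdict

# Hypothetical canceled flights: (airport code, passenger seats on the plane).
CANCELED_FLIGHTS = [
    ("AAA", 180),
    ("AAA", 150),
    ("BBB", 200),
]

# Assumed values for illustration only; they do not come from the case study.
SHARE_NEEDING_ROOM = 0.1  # fraction of stranded passengers who book a hotel
GUESTS_PER_ROOM = 2


def potential_rooms(flights, share=SHARE_NEEDING_ROOM, per_room=GUESTS_PER_ROOM):
    """Estimate potential room demand per airport from canceled seats."""
    seats = defaultdict(int)
    for airport, seat_count in flights:
        seats[airport] += seat_count
    # Integer ceiling division: a partly filled room still counts as a room.
    return {
        airport: -(-round(total * share) // per_room)
        for airport, total in seats.items()
    }


demand = potential_rooms(CANCELED_FLIGHTS)
print(demand)
```

In a real project, the flight list would come from scraped schedule and weather data, and the share and room size would be estimated from historical bookings.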


An ideal dataset is a cleaned sample without errors, outliers or missing values, but with the complete set of data needed to solve the task.


All 3 methods and their components can be combined in practice to solve a single business task. This can significantly enrich the original dataset and improve both the quality of the training sample and the effectiveness of the ML models trained on it. The side effect is a significant increase in data preparation time, since you need to find a source with the appropriate information, obtain the data, process the resulting sets and build features for ML modeling from them.