Alexei Chernobrovov
Consultant on Analytics and Data Monetization

Where to put Big Data, or why do you need a Data Lake?

Earlier, I described data warehouses and data marts in the article about ETL processes. Now let's look at another important element of modern IT infrastructure for storing Big Data. In this article, you will learn what a Data Lake is, why it is needed, how it is used, what technologies it is based on, and what risks arise from using it incorrectly.

What is a Data Lake and who needs it?

One of the distinguishing features of Big Data is the variety of formats in which information arrives: posts on social networks, multimedia files, technical equipment logs, records from corporate databases, and so on. To extract useful information from all this data, it must first be collected. This is what a Data Lake is for: a repository for the large volumes of unstructured data collected or generated by a single company [1].

Unlike a corporate data warehouse (DWH), a Data Lake stores unstructured, so-called raw information: for example, video from drones and surveillance cameras, traffic telemetry, photographs, user behavior logs, and metrics of information system load. Such data is not yet suitable for day-to-day analytics in BI systems, but it can be used to quickly test new business hypotheses with ML algorithms. Thus, the typical Data Lake user is the Data Scientist, whereas a DWH is used by many more people: analysts, subject matter experts, and executives. However, this is not the only difference between a Data Lake and a DWH. To better understand what these elements of enterprise IT infrastructure have in common and how they differ, let's compare them by the following criteria [2]:

  • usefulness of content;
  • data structures (types);
  • agility;
  • availability;
  • cost.

 

| Feature | Data Warehouse | Data Lake |
|---|---|---|
| Usefulness | Only useful, up-to-date data is stored | All data is stored, including "useless" data that may become useful in the future or may never be needed |
| Structure | Well-structured data in a single format | Structured, semi-structured, and unstructured heterogeneous data in all formats: multimedia files, text, and binary data from different sources |
| Agility | Low: structure and data types are designed in advance and cannot be changed during operation | High: new data types and structures can be added during operation |
| Availability | A clear structure makes data retrieval and processing fast | The lack of a clear structure requires additional processing before the data can be used in practice |
| Cost | High, due to the complexity of design and upgrades, including the price of equipment for fast and efficient operation | Much cheaper than a DWH: the main expense is storing gigabytes of information |

 

Thus, a Data Lake offers the following benefits for the practical application of Data Science (DS) in business [3]:

  • Scalability - a distributed file system allows new machines or nodes to be added as needed, without changing the storage structure or complex reconfiguration;
  • Cost-effectiveness - a Data Lake can be built on free Apache Hadoop software, without expensive licenses or servers, using the required number of relatively inexpensive machines;
  • Versatility - large volumes of heterogeneous data can be used for almost any research task, from forecasting demand to identifying user preferences or the effect of weather on product quality;
  • Speed of launch - the data already accumulated in the Data Lake lets you quickly test the next ML model without spending time and engineering resources on collecting information from different sources.

 

Fig. 1. Data Lake vs. DWH

 

What a Data Lake is based on 

As noted above, in most cases a Data Lake is built on commercial Apache Hadoop distributions (Cloudera/Hortonworks, MapR, Arenadata) or on cloud platforms such as Amazon Web Services, Microsoft Azure, Mail.ru, Yandex, and other cloud providers. There are also ready-made products from specialized vendors in the corporate Big Data sector: Teradata, Zaloni, HVR, Podium Data, Snowflake, etc. [4]

In any case, regardless of the chosen base, a Data Lake includes the following components [5]:

  • Data ingestion tools for batch or streaming modes. For example, continuous data collection can be organized with Apache Kafka or NiFi, and batch collection with Apache Airflow (see the ingestion sketch after this list).
  • File storage, which should be scalable, fault-tolerant, and cheap enough. For example, HDFS (Apache Hadoop Distributed File System) or Amazon S3.
  • Cataloging and search tools for quickly finding relevant information using metadata, with complementary solutions such as Apache Solr or Amazon Elasticsearch Service.
  • Data processing tools to transform, filter, and otherwise prepare data for later use. For example, Apache Spark, a Big Data framework for working with data in near real time, including ML modeling with Spark MLlib (see the processing sketch below).
  • Information security components: a secure network perimeter (Apache Knox Gateway), backup, replication and recovery, SSL encryption, secure authentication protocols (Kerberos), and access control policies with Apache Ranger and Atlas.
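
To make the ingestion step more concrete, here is a minimal sketch of streaming raw events into the lake through Kafka. It is only an illustration: the broker address and the raw-clickstream topic are hypothetical, and the kafka-python client is used for brevity.

```python
# Minimal streaming ingestion sketch (hypothetical broker and topic names).
# pip install kafka-python
import json
from datetime import datetime, timezone

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Raw events enter the lake as-is: no schema is enforced at write time.
event = {
    "user_id": 42,
    "action": "page_view",
    "ts": datetime.now(timezone.utc).isoformat(),
}
producer.send("raw-clickstream", event)
producer.flush()
```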
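
The processing step can be sketched in the same spirit with PySpark: read raw JSON logs from the lake, filter and transform them, and write a cleaned dataset back. The paths and column names here are illustrative.

```python
# Batch processing sketch with PySpark (illustrative paths and columns).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-processing").getOrCreate()

# Read raw, schemaless JSON logs straight from the lake.
raw = spark.read.json("hdfs:///lake/raw/clickstream/")

# Transform and filter: keep valid events and derive an event date.
clean = (
    raw.filter(F.col("user_id").isNotNull())
       .withColumn("event_date", F.to_date("ts"))
)

# Write the processed dataset back, partitioned for faster reads.
clean.write.mode("overwrite").partitionBy("event_date").parquet(
    "hdfs:///lake/processed/clickstream/"
)
```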

Although a Data Lake is mainly positioned as a repository of raw information, it can also store processed data. When used correctly, a Data Lake lets users quickly request smaller, more relevant, and more flexible datasets than a DWH can offer for similar query execution times. This is possible thanks to schema-on-read: the data schema is not predetermined but is computed ad hoc, on the fly, at the moment of access. Thus, in practice, the Data Lake can be used together with a DWH, providing a data-driven business infrastructure [6].
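
As a rough illustration of schema-on-read (reusing the hypothetical clickstream files from the sketches above): the structure is decided at query time rather than load time, so two queries can project entirely different schemas onto the same raw files.

```python
# Schema-on-read sketch: structure is chosen at query time, not at load time.
from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Query 1: let Spark infer the schema from the raw JSON itself.
inferred = spark.read.json("hdfs:///lake/raw/clickstream/")
inferred.printSchema()

# Query 2: project an explicit, narrower schema onto the very same files.
explicit = StructType([
    StructField("user_id", LongType()),
    StructField("action", StringType()),
])
narrow = spark.read.schema(explicit).json("hdfs:///lake/raw/clickstream/")
narrow.show(5)
```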

 

Fig. 2. Data sources and Data Lake processes

 

How not to turn a Data Lake into a data swamp

For all their merits, Data Lakes are vulnerable to the following risks [7]:

  • Poor data quality, caused by the lack of quality control at load time, the simplicity of the loading process, and the cheapness of storage;
  • The difficulty of determining the value of data: on the one hand, the philosophy of Big Data assumes that any information may be important; on the other hand, if a business needs certain data quickly, this is usually known in advance, and it is logical to load such information directly into a DWH or BI system;
  • The Data Lake turning into a swamp, as a consequence of the previous two points.

To prevent the Data Lake from turning into a swamp, it is necessary to fine-tune the data governance process, assessing the quality of information even before it is uploaded to the Data Lake. This can be done in the following ways [2]:

  • Cut off sources with obviously unreliable data;
  • Set up role-based access policies that allow only certain categories of employees to upload information;
  • Check basic file parameters, such as the size of images or video/audio recordings (a validation sketch follows this list).
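
A minimal sketch of such a pre-load check is shown below. It assumes the Pillow library and purely illustrative thresholds and paths; files that fail the basic parameter checks never reach the lake.

```python
# Pre-load validation sketch: reject files that fail basic parameter checks.
# pip install Pillow
from pathlib import Path

from PIL import Image

MIN_WIDTH, MIN_HEIGHT = 640, 480    # illustrative image-quality thresholds
MAX_SIZE_BYTES = 500 * 1024 * 1024  # illustrative per-file size limit

def is_acceptable(path: Path) -> bool:
    """Check basic file parameters before the file enters the lake."""
    if path.stat().st_size > MAX_SIZE_BYTES:
        return False
    if path.suffix.lower() in {".jpg", ".jpeg", ".png"}:
        with Image.open(path) as img:
            width, height = img.size
        return width >= MIN_WIDTH and height >= MIN_HEIGHT
    return True  # non-image files pass this particular check

uploads = [p for p in Path("incoming/").glob("*") if is_acceptable(p)]
print(f"{len(uploads)} files passed validation")
```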

In addition to the above recommendations, Teradata, one of the leading providers of Big Data analytics solutions, gives five more tips for the effective deployment of a Data Lake [7]:

  • Integrate the Data Lake with other elements of the corporate IT infrastructure: the DWH, information system databases, cloud services, and other sources of potentially relevant data. In doing so, keep a balance between storage capacity, storage speed, and the reasonable cost of the solution.
  • Don't clutter up the Data Lake. Instead of one global repository, it makes sense to organize several spaces and categorize data right away (see the layout sketch after this list). This also improves the feature most important from the user's point of view: read speed.
  • Maintain trust in the data by capturing its provenance and verifying the quality of its metadata.
  • Give Data Scientists and analysts tools to explore and profile data and get answers to their queries in the Data Lake, organizing them into cross-functional teams with data engineers, developers, and business experts.
  • Ensure security by preventing potential data breaches and losses through access control policies, a secure perimeter, and backup, replication, and recovery tools.
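
One common way to implement the "several spaces" advice is a zoned layout, where each zone is a prefix in the same store. The sketch below uses a widespread naming convention; the bucket and zone names are illustrative, not prescribed by the article.

```python
# Illustrative zoned layout for a Data Lake (hypothetical bucket and zones).
ZONES = {
    "raw": "s3://corporate-lake/raw/",          # as-ingested, immutable data
    "staging": "s3://corporate-lake/staging/",  # cleaned and validated data
    "curated": "s3://corporate-lake/curated/",  # analysis-ready datasets
    "sandbox": "s3://corporate-lake/sandbox/",  # Data Scientists' experiments
}

def zone_path(zone: str, dataset: str) -> str:
    """Build the storage path for a dataset in a given zone."""
    return f"{ZONES[zone]}{dataset}/"

print(zone_path("curated", "sales/daily"))  # s3://corporate-lake/curated/sales/daily/
```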

The correlation between the purity of the Data Lake and the enterprise's level of managerial maturity according to the CMMI model is also quite interesting. In particular, when well-established business processes are continuously monitored, modern Big Data tools with integrated machine learning allow the Data Lake to self-organize, performing continuous collection, aggregation, and meta-partitioning of information using so-called data pipelines [8]. A minimal pipeline sketch follows the table below.

| Level of management maturity | State and nature of data | Data Lake condition |
|---|---|---|
| Initial | Data is duplicated or partially absent, represented in different formats and systems, and not connected with each other; the share of manual data processing is high | Local data repository with no defined order of automated processing |
| Managed | Information is processed automatically quite successfully within one division but is not integrated with other corporate processes and structures (departments, branches, etc.) | Data puddle or data swamp |
| Defined | Data sharing between different processes, systems, and structures of the enterprise is partially automated; there is a unified catalog of corporate data | Data Lake |
| Quantitatively managed | Data synchronization between different processes, systems, and enterprise structures is not fully automated; some procedures are run on demand or manually | Managed Data Lake |
| Optimizing | Automated procedures for the creation, update, exchange, and synchronization of data between different processes, systems, and business units are well established and function successfully | Self-organizing Data Lake |
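
To ground the term "data pipeline," here is a minimal sketch of a scheduled pipeline in Apache Airflow (mentioned above as a batch ingestion tool), assuming Airflow 2.x; the task names and callables are hypothetical placeholders.

```python
# Minimal data pipeline sketch with Apache Airflow 2.x (hypothetical tasks).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def collect():
    """Pull new raw files from source systems into the lake."""
    ...

def aggregate():
    """Aggregate the collected data and write partitioned datasets."""
    ...

with DAG(
    dag_id="lake_maintenance",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    collect_task = PythonOperator(task_id="collect", python_callable=collect)
    aggregate_task = PythonOperator(task_id="aggregate", python_callable=aggregate)

    collect_task >> aggregate_task  # aggregation runs only after collection
```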

 

To sum up

MarketsandMarkets predicts that the Data Lake market will grow to $8.81 billion by 2021, at an annual growth rate of 28.3% [4]. This trend is confirmed by the study "Angling for Insights in Today's Data Lake," which shows almost 10% revenue growth for companies using this technology [2]. However, for all the versatility and high potential usefulness of a Data Lake, organizing a clean Data Lake that brings real benefit to the business is quite a challenge. As with any DS project, it is first of all necessary to assess its prospects, taking into account the costs of implementation. Today, organizing a Data Lake is justified in companies that may be classified as data-driven organizations: telecoms, large retailers (especially online stores), banks, and industrial holdings with a distributed branch structure. Typically, processes in such organizations are at CMMI levels 4-5, and the IT infrastructure, including Big Data, is a complex set of proprietary and custom-built solutions for the continuous delivery of various data. A Data Lake will therefore fit perfectly into the existing IT environment, complementing and enriching it with new data.

Sources
