Earlier, I described data warehouses and data marts in the article about ETL processes. Now let's look at another key element of modern IT infrastructure for storing Big Data. In this article, you will learn what a Data Lake is, why it is needed, how it is used, which technologies it is based on, and what risks arise from using it incorrectly.
One of the defining features of Big Data is the variety of formats in which information arrives: posts on social networks, multimedia files, equipment logs, records from corporate databases, and so on. To extract useful information from all this data, it must first be collected. This is what a Data Lake is for: a repository for large volumes of unstructured data collected or generated by a single company [1].
Unlike a corporate Data Warehouse (DWH), a Data Lake stores unstructured, so-called raw information: for example, video from drones and surveillance cameras, traffic telemetry, photographs, logs of user behavior, and load metrics of information systems. Such data is not yet suitable for everyday analytics in BI systems, but it can be used to quickly test new business hypotheses with ML algorithms. Thus, the typical Data Lake user is the Data Scientist, whereas a DWH serves a much wider audience: analysts, subject-matter experts, and executives. However, this is not the only difference between a Data Lake and a DWH. To better understand what these elements of enterprise IT infrastructure have in common and how they differ, let's compare them by the following criteria [2]:
| Feature | Data Warehouse | Data Lake |
| --- | --- | --- |
| Usefulness | Only useful, up-to-date data is stored | All data is stored, including "useless" data that may become useful in the future or may never be needed |
| Structure | Well-structured data in a single format | Structured, semi-structured, and unstructured heterogeneous data in all formats: multimedia files, text, and binary data from different sources |
| Agility | Low: structure and data types are designed up front and cannot be changed during operation | High: new data types and structures can be added during operation |
| Availability | A clear structure makes data retrieval and processing fast | The lack of a clear structure requires additional processing before the data can be used in practice |
| Cost | High, due to the complexity of design and upgrades, including the price of equipment for fast and efficient operation | Much lower than a DWH, since the main expense is storing gigabytes of information |
Thus, a Data Lake offers the following benefits for the practical application of Data Science (DS) in business [3]:
As noted above, in most cases a Data Lake is built on commercial Apache Hadoop distributions (Cloudera/Hortonworks, MapR, Arenadata) or on cloud platforms such as Amazon Web Services, Microsoft Azure, Mail.ru, Yandex, and other cloud providers. There are also ready-made products from specialized vendors in the corporate Big Data sector: Teradata, Zaloni, HVR, Podium Data, Snowflake, etc. [4]
In any case, regardless of the chosen platform, a Data Lake includes the following components [5]:
Although a Data Lake is mainly positioned as a repository of raw information, it can also store processed data. When used correctly, a Data Lake lets users quickly request smaller, more relevant, and more flexible datasets than a DWH at similar query execution times. This is possible thanks to the schema-on-read approach: the data schema is not predetermined in advance but is computed ad hoc, at the time of access. Thus, in practice, a Data Lake can be used together with a DWH, providing a data-driven business infrastructure [6].
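To make schema-on-read concrete, here is a minimal Python sketch. The records, field names, and inference logic are all hypothetical, but it shows the key idea: the schema is derived at the moment of access from raw JSON lines, rather than being fixed before the data is loaded.

```python
import json

# Hypothetical raw records as they might land in a data lake:
# heterogeneous JSON lines with no predefined schema.
raw_lines = [
    '{"user": "alice", "event": "click", "ts": 1700000000}',
    '{"user": "bob", "event": "view", "device": "mobile"}',
    '{"sensor_id": 42, "temperature": 21.5}',
]

def infer_schema(lines):
    """Derive a field -> set-of-type-names mapping at read time (schema-on-read)."""
    schema = {}
    for line in lines:
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema

schema = infer_schema(raw_lines)
```

Note that new fields (like `device` or `sensor_id`) simply appear in the inferred schema when new kinds of records arrive, which is exactly the agility the comparison table above attributes to the Data Lake.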
For all their merits, Data Lakes are vulnerable to the following risks [7]:
To prevent the Data Lake from turning into a data swamp, it is necessary to fine-tune the data governance process, assessing the quality of the information even before it is uploaded to the Data Lake. This can be done in the following ways [2]:
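As an illustration of assessing quality before upload, a minimal pre-ingestion gate might look like the Python sketch below. The required-field contract and the checks themselves are assumptions for the example, not part of any specific product; records that fail are quarantined rather than dumped into the lake.

```python
# Hypothetical contract: every incoming record must carry these fields.
REQUIRED_FIELDS = {"source", "timestamp", "payload"}

def validate(record):
    """Return a list of quality problems; an empty list means the record passes."""
    problems = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if "timestamp" in record and not isinstance(record["timestamp"], (int, float)):
        problems.append("timestamp is not numeric")
    return problems

def ingest(records):
    """Split incoming records into accepted and quarantined batches."""
    accepted, quarantined = [], []
    for record in records:
        (quarantined if validate(record) else accepted).append(record)
    return accepted, quarantined

good = {"source": "cam-01", "timestamp": 1700000000, "payload": "..."}
bad = {"source": "cam-02", "payload": "no timestamp"}
accepted, quarantined = ingest([good, bad])
```

The design choice here is deliberate: validation produces a diagnosis rather than silently dropping data, so quarantined records can later be repaired and re-ingested instead of polluting the lake.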
In addition to the above recommendations, Teradata, one of the leading providers of Big Data analytics solutions, offers five more tips for the effective deployment of a Data Lake [7]:
The correlation between the cleanliness of a Data Lake and an enterprise's level of managerial maturity according to the CMMI model is also quite interesting. In particular, once well-established business processes are continuously monitored, modern Big Data tools with integrated machine learning capabilities allow the Data Lake to self-organize, continuously collecting, aggregating, and partitioning information by metadata using so-called data pipelines [8].
| Level of management maturity | State and nature of the data | Data Lake condition |
| --- | --- | --- |
| Initial | Data is duplicated or partially absent, represented in different formats and systems, and not connected with each other; the share of manual data processing is high | Local data repository with no defined order of automated processing |
| Managed | Information is processed automatically quite successfully within one division, but is not integrated with other corporate processes and structures (departments, branches, etc.) | Data puddle or data swamp |
| Defined | Data sharing between different processes, systems, and structures of the enterprise is partially automated; there is a unified catalog of corporate data | Data Lake |
| Quantitatively managed | Data synchronization between different processes, systems, and enterprise structures is not fully automated; some procedures run on demand or manually | Managed Data Lake |
| Optimizing | Automated procedures for the creation, update, exchange, and synchronization of data between different processes, systems, and business units are well established and function successfully | Self-organized Data Lake |
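The collect, aggregate, and partition stages that such data pipelines perform can be sketched in plain Python. All names, the event shapes, and the metadata key used for partitioning are illustrative assumptions, not a real pipeline framework:

```python
from collections import defaultdict

def collect(sources):
    """Stage 1: pull raw events from every source into one stream,
    tagging each event with where it came from."""
    for source, events in sources.items():
        for event in events:
            yield {"source": source, **event}

def partition(stream, key):
    """Stage 2: group events into partitions by a metadata key."""
    partitions = defaultdict(list)
    for event in stream:
        partitions[event.get(key, "unknown")].append(event)
    return partitions

def aggregate(partitions):
    """Stage 3: compute per-partition summary metadata (here, just counts)."""
    return {name: len(events) for name, events in partitions.items()}

# Hypothetical sources feeding the lake.
sources = {
    "app_logs": [{"kind": "error"}, {"kind": "info"}],
    "iot": [{"kind": "telemetry"}],
}
parts = partition(collect(sources), key="kind")
summary = aggregate(parts)
```

In a real self-organized lake these stages would run continuously and the partitioning key could itself be chosen by ML over the metadata, but the collect-partition-aggregate flow stays the same.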
MarketsandMarkets predicts that the Data Lake market will grow to $8.81 billion by 2021, at an annual growth rate of 28.3% [4]. This trend is confirmed by the study "Angling for Insights in Today's Data Lake," which shows almost 10% revenue growth for companies using this technology [2]. However, for all the versatility and high potential usefulness of a Data Lake, building a clean Data Lake that brings real benefit to the business is quite a challenge. As with any DS project, it is first necessary to assess its prospects, taking the implementation costs into account. Today, building a Data Lake is justified in companies that can be classified as data-driven organizations: telecoms, large retailers (especially online stores), banks, and industrial holdings with a distributed branch structure. In such organizations, processes are typically at CMMI level 4-5, and the IT infrastructure, including Big Data, is a complex set of proprietary and in-house solutions for the continuous delivery of various data. Therefore, a Data Lake will fit naturally into the existing IT environment, complementing and enriching it with new data.