What is a Data Lake?
Pentaho CTO James Dixon is credited with coining the term “Data Lake”. As he describes it in his blog entry, “If you think of a Data Mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the Data Lake is a large body of water in a more natural state. The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
These days, demands for BI data stores are changing. BI data consumers not only require cleansed and nicely modeled data, updated on a daily basis, but also raw, uncleansed and unmodeled data that is available in near real-time. With new and much more powerful tooling like Power BI, users can shape and cleanse data in a way that fits their personal needs without the help of the IT department. This calls for a different approach when it comes to offering data to these users.
BI data consumers also demand a very short time-to-market for new data: they don’t want to wait a few months until data is made available by a BI team; they want it today. The raw, uncleansed form of data in a Data Lake can be loaded very quickly because it’s suitable for generated data loading technologies and replication, which makes this short time-to-market possible. Once users have discovered the data and have acquired enough insights that they want to share with the entire organization in a conformed way, the data can be brought to traditional Data Warehouses and cubes in a predictable manner.
Furthermore, there is a rise in the presence of unstructured and semi-structured data, and in the need to have “big data” available for ad-hoc analyses. To store and analyze these forms of data, new technologies and data structures are required.
Once the Data Lake is in place, a lot of data streams from the sources into the “lake” without it being known up front whether it is eligible for answering business questions. The data can’t be modeled yet, because it’s not clear how it will be used later on. Data consumers get the possibility to discover data and find answers before the questions are even defined. This differs fundamentally from the concept of a Data Warehouse, in which the data is delivered through predefined data structures, based on relevant business cases and questions.
From a technology point of view, a Data Lake is a repository that offers storage for large quantities and varieties of unstructured, semi-structured and structured data derived from all possible sources. It can be formed by multiple underlying databases that store these differently structured forms of data in both SQL and NoSQL technologies.
For the semi-structured/unstructured side of data which is used for big data analytics, Data Lakes based on Hadoop and other NoSQL technologies are common. For the semi-structured/structured data, SQL technologies are the way to go.
In this blog post I will describe the semi-structured/structured, relational appearance of the Data Lake in the form of a SQL Server database: The Relational Data Lake.
Extract Load (Transform)
Data in a Data Lake is in raw form. Transformations are not performed during loading, and relationships and constraints between tables are not created; this is the default for transactional replication and keeps the loading process as lean and fast as possible. Because of the lack of transformations, movement of the data follows the Extract-Load-(Transform) (EL(T)) pattern instead of the traditional E-T-L. This pattern makes loading data into the Data Lake easier, faster and much more suitable to perform using replication technologies or generated SSIS processes, for example with BIML. This creates a very attractive time-to-market for data that is added to the Data Lake. Latency of data is kept as low as possible; preferably data is loaded in near real-time: data should stream into the lake continuously.
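To illustrate the idea, a raw Data Lake table can be a 1:1 mirror of its source, with no keys, constraints or transformations, so that replication or generated loaders can fill it fast. A minimal sketch (all table and column names here are hypothetical):

```sql
-- Hypothetical raw table mirroring a source table 1:1.
-- No primary key, no foreign keys, no transformations: only a
-- technical load timestamp is added as metadata.
CREATE TABLE raw.SalesOrder
(
    OrderId        int           NOT NULL,  -- source key, deliberately no PK constraint
    CustomerCode   nvarchar(20)  NULL,
    OrderDate      datetime2     NULL,
    Amount         decimal(18,2) NULL,
    LoadTimestamp  datetime2     NOT NULL
        DEFAULT SYSUTCDATETIME()            -- when the row streamed into the lake
);
```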
Transformations take place after the data is loaded into the Data Lake, where applicable. Cosmetic transformations, like translations from technical object and column names to meaningful descriptions that end users understand, or other lightweight transformations, can be performed in new structures (like SQL views) that are created inside the Data Lake.
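Such a cosmetic transformation can be as simple as a view that renames technical columns to user-friendly ones, a sketch with hypothetical names:

```sql
-- Hypothetical view layering meaningful names over the raw table,
-- without moving or changing the underlying data.
CREATE VIEW lake.SalesOrders
AS
SELECT
    OrderId      AS [Order Number],
    CustomerCode AS [Customer Code],
    OrderDate    AS [Order Date],
    Amount       AS [Order Amount]
FROM raw.SalesOrder;
```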
Unlike Data Marts and Data Warehouses, which are optimized for data analysis by storing only the required attributes and sometimes dropping data below the required level of aggregation, a Data Lake always retains all attributes and (if possible) all records. This way it is future-proof for solutions that will require this data at a later moment in time, or for users who will discover the data.
Data is made accessible through structures that can either be accessed directly or exposed indirectly as OData Feeds. These structures are secured and are the only objects end users or other processes have access to. The feeds can be accessed with whatever tool or technology is best suited to the task at any moment in time, for example Power BI tooling like Excel PowerPivot/PowerQuery.
We normally create SQL Views in which security rules and required transformations are applied.
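A minimal sketch of this approach: a view that applies a filter rule, combined with a grant that makes the view the only object end users can read (all object, column and role names are hypothetical):

```sql
-- Hypothetical view applying a security/filter rule on raw data.
CREATE VIEW lake.CustomerOrders
AS
SELECT OrderId, CustomerCode, OrderDate, Amount
FROM raw.SalesOrder
WHERE Region = 'EMEA';   -- example of a security rule inside the view
GO

-- End users only get SELECT on the view layer; no permissions are
-- granted on the raw schema, so the underlying tables stay hidden.
CREATE ROLE LakeReaders;
GRANT SELECT ON OBJECT::lake.CustomerOrders TO LakeReaders;
```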
The Data Lake also acts as a hub for other repositories and solutions like Data Warehouses and Operational Cubes.
Success of the Data Lake depends on good master data. When end users discover new raw data in the Data Lake, they need to be able to combine it with high-quality master data to get proper insights. Therefore a master data hub is a must-have when a Data Lake is created. This hub can just be a database with master data structures in it; master data management on this data is preferable but not required. The master data hub should be a standalone solution, independent of the other BI solutions, as master data isn’t part of these solutions but is only used as a data source. It should be sourced independently too, preferably using master data tooling or tools like SSIS. Just like data from the Data Lake, master data should only be accessed through structures that can also be exposed as OData Feeds.
Next to the purpose of combining master data with data from the Data Lake, the master data can be used as a source for other BI solutions like Data Warehouses. There, the master data structures are often used as Data Warehouse Dimensions. To prevent unnecessarily loading master data into the Data Warehouse that already exists in the master data hub, it can be a good choice to leave the master data out of the Data Warehouse Dimensions. Only the business keys are stored, which can be used to retrieve the data from the master data hub when required. This way the Data Warehouse remains slim and fast to load, and master data is stored in a single centralized data store.
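A sketch of such a “slim” dimension that stores only the business key and joins the descriptive master data in from the hub at query time (all names are hypothetical):

```sql
-- Hypothetical slim dimension: only a surrogate key and the business key.
CREATE TABLE dw.DimCustomer
(
    CustomerKey  int IDENTITY(1,1) PRIMARY KEY,
    CustomerCode nvarchar(20) NOT NULL  -- business key into the master data hub
);

-- When descriptive attributes are needed, they are retrieved from the
-- master data hub instead of being duplicated in the Data Warehouse.
SELECT d.CustomerKey, m.CustomerName, m.Segment
FROM dw.DimCustomer AS d
JOIN mdh.Customer   AS m
    ON m.CustomerCode = d.CustomerCode;
```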
The entire Data Lake architecture with all the described components fits in the model below. From bottom to top the highlights are:
- Extract/Load data from the sources to the Data Lake, preferably in near real-time.
- The Data Lake can consist of multiple SQL (and NoSQL) databases.
- Transformations and authorizations are handled in views.
- The Data Lake acts as hub for other BI solutions like Data Warehouses and Cubes.
- The master data hub is in the center of the model and in the center of the entire architecture. It’s loaded as a standalone solution and isn’t part of any of the other BI solutions.
- Traditional BI will continue to exist and continue to be just as important as it has always been. It will be sourced from the Data Warehouses and cubes (and master data hub).
- The Discovery Platform with its new Power BI tooling is the place where “various users of the lake can come to examine, dive in, or take samples.” These samples can be combined with the data from the master data hub.
Data Lake Challenges
Setting up a Data Lake comes with many challenges, especially regarding data governance. For example, it’s easy to create any view in the Data Lake and lose control over who gets access to what data. From a business perspective it can be very difficult to deliver the master data structures that are so important for the success of the Data Lake. And from a user perspective, users who get insights from the raw data can draw wrong conclusions; therefore the Data Warehouse should still be offered as a clean, trusted data structure for decision makers and as a data source for conformed reports and dashboards.
The Data Lake can be a very valuable data store that complements the traditional Data Warehouses and Cubes, which will stay as important as they are now for many years to come. But considering the increasing amount and variety of data, the more powerful self-service ETL and data modeling tooling that is appearing, and the shortened time-to-market required for near real-time data from source all the way up to the user, the Data Lake offers a future-proof data store and hub that enables the answering of yet-undefined questions and gives users personal data discovery and shaping possibilities.
Thanks go to my Macaw colleague Martijn Muilwijk for brainstorming on this subject and reviewing this blog post.
Very interesting and useful post. Just some questions on architecture…
Do you see Microsoft’s APS appliance positioning itself to become the data lake? Also, for a traditional on prem “Microsoft shop” environment just starting to get more exposure to unstructured data, are there ways you can easily setup, develop & maintain a data lake without purchasing APS?
Any suggestions on architecture would be appreciated.
The Analytics Platform System (hardware, software and services) is very complete in the sense that it can host loads of structured and unstructured (Hadoop) data. The biggest advantage is its performance.
It has some functional limitations with respect to entering and retrieving data, so its characteristics resemble those of a traditional Data Warehouse (not real-time).
I would suggest not implementing a Data Lake with (just) APS.
As Jorg points out, the solution may exist of more than one underlying platform.
And yes, if you want you can stay with Microsoft technology. HDInsight is available in Azure, so easily ‘implemented’ and suitable for hybrid (cloud / on-premise) usage.
Thanks for your comments. I think I’ll start with HDInsight as you suggested as it is a cheap and easy way for me to start bringing data in.
In your experience are there certain non T-SQL languages that are required to load and retrieve data from HDInsight?
The state of BI in our organization really consists of what you call a Data Lake. We are currently investigating what our next steps should be.
What do you consider as off-the-shelf solutions for defining and exposing master data? Something like Boomi or SAS Master Data Management? Or is it as simple as exposing data through OData?
I would love to read up more in regards to how master data works with the Data Lake if you have any additional material.
@Tony, using HDInsight gives you the possibility to issue Hive and Pig queries. The big advantage of these queries is that they are SQL-like and therefore easy to understand; you don’t have to write any Java code.
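To give an impression of how SQL-like Hive is, here is a hypothetical HiveQL sketch that defines an external table over files in storage and runs an aggregation over it (table, columns and path are made up for illustration):

```sql
-- Hypothetical Hive table over tab-separated log files in storage.
CREATE EXTERNAL TABLE weblogs (
    log_date STRING,
    user_id  STRING,
    url      STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/weblogs';

-- A plain SQL-style aggregation: the ten most visited URLs.
SELECT url, COUNT(*) AS hits
FROM weblogs
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```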
A nice introduction to HDInsight can be found here:
The following tutorial is a good starting point for learning Hive with HDInsight:
@Jimmy, I would recommend going into a software package selection process to pick one of the off-the-shelf solutions that meets the requirements of your organization. As I wrote, the master data solution and data store/hub isn’t part of the BI solutions or the Data Lake; it’s only used as a data source for these solutions. Master data management and the implementation of master data solutions can be a very difficult and time-consuming process, but offering master data structures so they can be combined with data from the Data Lake is a must-have. Therefore you could start with just gathering, storing and offering master data without going into the master data management process. This could be something that comes at a later moment in time.
Thanks for the links Jorg!
Hello, a nice read. A bit of a newbie question: how is relational, structured data stored in a data lake for retention and retrieval? Let’s say we had a legacy application with a SQL backend, and we had a need to keep the data for retention purposes but need to access it from time to time. Normally you would use a data archiving solution that is designed for this purpose. With a data lake, it’s object-store oriented, right? Does the SQL DB just get stored as one object, so you would not be able to query it without breaking down the tables and fields into separate objects? You would then need to know all the objects that make up the DB to run a query against it?