What is a Data Lake?
Pentaho CTO James Dixon is credited with coining the term “Data Lake”. As he describes it in his blog entry, “If you think of a Data Mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the Data Lake is a large body of water in a more natural state. The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
These days, demands for BI data stores are changing. BI data consumers not only require cleansed and nicely modeled data, updated on a daily basis, but also raw, uncleansed and unmodeled data which is available near real-time. With new and much more powerful tooling like Power BI, users can shape and cleanse data in a way that fits their personal needs without the help of the IT department. This calls for a different approach when it comes to offering data to these users.
BI data consumers also demand a very short time-to-market of new data, they don’t want to wait for a few months until data is made available by a BI team, they want it today. The raw uncleansed form of data in a Data Lake can be loaded very quickly because it’s suitable for generated data loading technologies and replication, which makes this short time-to-market possible. Once users have discovered the data and have acquired enough insights that they want to share with the entire organization in a conformed way, the data can be brought to traditional Data Warehouses and cubes in a predictable manner.
Furthermore there is rise in the presence of unstructured and or semi-structured data and the need to have “big data” available for adhoc analyses. To store and analyze these forms of data new technologies and data structures are required.
When the Data Lake comes in place a lot of data streams from sources into the “lake” without knowing up front if it is eligible for answering business questions. The data can’t be modeled yet, because it’s not clear how it will be used later on. Data consumers will get the possibility to discover data and find answers before they are even defined. This differs fundamentally from the concept of a Data Warehouse in which the data is delivered through predefined data structures, based on relevant business cases and questions.
From a technology view, a Data Lake is a repository which offers storage for large quantities and varieties of both unstructured, semi-structured and structured data derived from all possible sources. It can be formed by multiple underlying databases which store these different structured forms of data in both SQL and NoSQL technologies.
For the semi-structured/unstructured side of data which is used for big data analytics, Data Lakes based on Hadoop and other NoSQL technologies are common. For the semi-structured/structured data, SQL technologies are the way to go.
Extract Load (Transform)
Data in a Data Lake is in raw form. Transformations will not be performed during loading and relationships and constraints between tables will not be created which is the default for transactional replication and keeps the loading process as lean and fast as possible. Because of the lack of transformations, movement of the data follows the Extract-Load-(Transform) (EL(T)) pattern instead of the traditional E-T-L. This pattern makes loading of data to the Data Lake easier, faster and much more suitable to perform using replication technologies or generated SSIS processes, for example with BIML. This creates a very attractive time-to-market for data which is added to the Data Lake. Latency of data is as low as possible, preferable data is loaded in near real-time: data should stream into the lake continuously.
Transformations take place after the data is loaded into the Data Lake, where applicable. Cosmetic transformations like translations from technical object and column names to meaningful descriptions which end users understand or other lightweight transformations can be performed in new structures (like SQL views) that are created inside the Data Lake.
Unlike Data Marts and Data Warehouses, which are optimized for data analysis by storing only the required attributes and sometimes dropping data below the required level of aggregation, a Data Lake always retains all attributes and (if possible) all records. This way it will be future proof for solutions that will require this data in a later moment in time or for users that will discover the data.
Data is made accessible through structures which can either be accessed directly, or indirectly through the exposure as OData Feeds. These structures are secured and are the only objects end users or other processes have access to. The feeds can be accessed with any tool or technology that is best suited to the task at any moment in time, for example using Power BI tooling like Excel PowerPivot/PowerQuery.
We normally create SQL Views in which security rules and required transformation are applied.
The Data Lake also acts as a hub for other repositories and solutions like Data Warehouses and Operational Cubes.
Success of the Data Lake depends on good master data. When end users discover new raw data from the Data Lake they need to be able to combine it with high quality master data to get proper insights. Therefore a master data hub is a must have when a Data Lake is created. This hub should just be a database with master data structures in it, master data management on this data is preferable but not required. The master data hub should be a standalone solution, independent from the other BI solutions, as master data isn’t part of these solutions but is only used as data source. It should be sourced independently too, preferable using master data tooling or using tools like SSIS. Just like with data from the Data Lake, master data should also only be accessed through structures which can also be exposed as OData Feeds.
Next to the purpose of combining master data with data from the Data Lake, the master data can be used as source for other BI solutions like Data Warehouses. In there, the master data structures are often used as Data Warehouse Dimensions. To prevent the unnecessary duplicate loading of master data in the Data Warehouse that already exists in the master data hub, it can be a good choice to leave the master data out of the Data Warehouse Dimensions. Only the business keys are stored which can be used to retrieve the data from the master data hub when required. This way the Data Warehouse remains slim and fast to load and master data is stored in a single centralized data store.
The entire Data Lake architecture with all the described components are fit in the model below. From bottom to top the highlights are:
- Extract/Load data from the sources to the Data Lake, preferably in near real-time.
- The Data Lake can consist of multiple SQL (and NoSQL) databases.
- Transformations and authorizations are handled in views.
- The Data Lake acts as hub for other BI solutions like Data Warehouses and Cubes.
- The master data hub is in the center of the model and in the center of the entire architecture. It’s loaded as a standalone solution and isn’t part of any of the other BI solutions.
- Traditional BI will continue to exist and continue to be just as important as it has always been. It will be sourced from the Data Warehouses and cubes (and master data hub).
- The Discovery Platform with its new Power BI tooling is the place where “various users of the lake can come to examine, dive in, or take samples.” These samples can be combined with the data from the master data hub.
Data Lake Challenges
Setting up a Data Lake comes with many challenges, especially on the aspect of data governance. For example it’s easy to create any view in the Data Lake and lose control on who gets access to what data. From a business perspective it can be very difficult to deliver the master data structures that are so important for the success of the Data Lake. And from a user perspective wrong conclusions can be made by users who get insights from the raw data, therefore the Data Warehouse should still be offered as a clean trusted data structure for decision makers and a data source for conformed reports and dashboards.
The Data Lake can be a very valuable data store that complements the traditional Data Warehouses and Cubes that will stay as important as they are now for many years to come. But considering the increased amount and variety of data, the more powerful self-service ETL and data modeling tooling which appear and the shortened required time-to-market of near real-time data from source up and to the user, the Data Lake offers a future proof data store and hub that enables the answering of yet undefined questions and gives users personal data discovery and shaping possibilities.
Thanks go to my Macaw colleague Martijn Muilwijk for brainstorming on this subject and reviewing this blog post.