Relational Data Lake

What is a Data Lake?
Pentaho CTO James Dixon is credited with coining the term “Data Lake”. As he describes it in his blog entry, “If you think of a Data Mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the Data Lake is a large body of water in a more natural state. The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

These days, demands for BI data stores are changing. BI data consumers not only require cleansed and nicely modeled data, updated on a daily basis, but also raw, uncleansed and unmodeled data that is available in near real-time. With new and much more powerful tooling like Power BI, users can shape and cleanse data in a way that fits their personal needs, without the help of the IT department. This calls for a different approach to offering data to these users.

BI data consumers also demand a very short time-to-market for new data: they don't want to wait a few months until a BI team makes the data available, they want it today. The raw, uncleansed form of data in a Data Lake can be loaded very quickly because it's suitable for generated data loading technologies and replication, which makes this short time-to-market possible. Once users have discovered the data and have acquired insights that they want to share with the entire organization in a conformed way, the data can be brought to traditional Data Warehouses and cubes in a predictable manner.

Furthermore, there is a rise in the presence of unstructured and semi-structured data and in the need to have "big data" available for ad hoc analyses. To store and analyze these forms of data, new technologies and data structures are required.

When a Data Lake is in place, a lot of data streams from the sources into the "lake" without knowing up front whether it will be eligible for answering business questions. The data can't be modeled yet, because it's not clear how it will be used later on. Data consumers get the possibility to discover data and find answers to questions before those questions are even defined. This differs fundamentally from the concept of a Data Warehouse, in which the data is delivered through predefined data structures, based on relevant business cases and questions.

Technology
From a technology point of view, a Data Lake is a repository which offers storage for large quantities and varieties of unstructured, semi-structured and structured data derived from all possible sources. It can be formed by multiple underlying databases which store these differently structured forms of data in both SQL and NoSQL technologies.
[Image: Data Lake technologies]

For the semi-structured/unstructured data that is used for big data analytics, Data Lakes based on Hadoop and other NoSQL technologies are common. For semi-structured/structured data, SQL technologies are the way to go.

In this blog post I will describe the semi-structured/structured, relational appearance of the Data Lake in the form of a SQL Server database: The Relational Data Lake.
[Image: the Relational Data Lake]

Extract Load (Transform)
Data in a Data Lake is kept in raw form. Transformations are not performed during loading, and relationships and constraints between tables are not created (which is the default for transactional replication); this keeps the loading process as lean and fast as possible. Because of the lack of transformations, the movement of data follows the Extract-Load-(Transform) (EL(T)) pattern instead of the traditional E-T-L. This pattern makes loading data into the Data Lake easier, faster and much more suitable for replication technologies or generated SSIS processes, for example with BIML. This creates a very attractive time-to-market for data which is added to the Data Lake. Latency should be as low as possible; preferably, data is loaded in near real-time: data should stream into the lake continuously.
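
To make the EL(T) pattern more concrete, here is a minimal T-SQL sketch of such a raw load. The names (a lake schema, a Sales_Order table, a SourceErp source database) are purely hypothetical, and in practice a statement like this would be generated (for example with BIML) or replaced by transactional replication rather than written by hand.

-- Hypothetical landing table in the Data Lake database: it mirrors the source columns
-- one-to-one, with no primary key, foreign keys or other constraints.
CREATE TABLE lake.Sales_Order
(
    OrderId     int,
    CustomerNo  nvarchar(20),
    OrderDate   datetime2,
    Amount      decimal(18, 2),
    LoadedAtUtc datetime2 NOT NULL DEFAULT SYSUTCDATETIME()  -- load metadata only
);
GO

-- EL-style load: extract and load as-is, no transformations, no lookups, no business rules.
INSERT INTO lake.Sales_Order (OrderId, CustomerNo, OrderDate, Amount)
SELECT OrderId, CustomerNo, OrderDate, Amount
FROM   SourceErp.dbo.Sales_Order;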

Transformations take place after the data is loaded into the Data Lake, where applicable. Cosmetic transformations, like translating technical object and column names into meaningful descriptions that end users understand, or other lightweight transformations, can be performed in new structures (like SQL views) that are created inside the Data Lake.
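
As a small, hypothetical illustration of such a cosmetic transformation, the view below (building on the lake.Sales_Order example above) only translates technical column names into descriptions end users understand; the underlying raw data stays untouched.

-- Cosmetic view over the raw landing table: technical names become user-friendly names.
CREATE VIEW lake.[Sales Orders]
AS
SELECT OrderId    AS [Order Number],
       CustomerNo AS [Customer Number],
       OrderDate  AS [Order Date],
       Amount     AS [Order Amount]
FROM   lake.Sales_Order;
GO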

Unlike Data Marts and Data Warehouses, which are optimized for data analysis by storing only the required attributes and sometimes dropping data below the required level of aggregation, a Data Lake always retains all attributes and (if possible) all records. This way it will be future-proof for solutions that will require this data at a later moment in time or for users that will discover the data.

Accessing data
Data is made accessible through structures which can either be accessed directly, or indirectly through exposure as OData feeds. These structures are secured and are the only objects end users or other processes have access to. The feeds can be accessed with whatever tool or technology is best suited to the task at any moment in time, for example Power BI tooling like Excel PowerPivot/PowerQuery.

We normally create SQL views in which security rules and the required transformations are applied.
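
A minimal sketch of how that could look, reusing the hypothetical objects from the examples above: consumers get access to the view layer only, and a simple row-level rule is applied inside a view through a (hypothetical) authorization table.

-- End users and OData feeds connect through a role that can only read the views.
CREATE ROLE datalake_consumer;
GRANT SELECT ON OBJECT::lake.[Sales Orders] TO datalake_consumer;
DENY  SELECT ON OBJECT::lake.Sales_Order    TO datalake_consumer;  -- the raw table stays hidden
GO

-- Security rule inside a view: each login only sees the customers it is mapped to
-- in a hypothetical lake.UserCustomerAuthorization (LoginName, CustomerNo) table.
CREATE VIEW lake.[My Sales Orders]
AS
SELECT o.[Order Number], o.[Customer Number], o.[Order Date], o.[Order Amount]
FROM   lake.[Sales Orders]            AS o
JOIN   lake.UserCustomerAuthorization AS a
  ON   a.CustomerNo = o.[Customer Number]
 AND   a.LoginName  = SUSER_SNAME();
GO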

The Data Lake also acts as a hub for other repositories and solutions like Data Warehouses and Operational Cubes.

Master Data
The success of the Data Lake depends on good master data. When end users discover new raw data from the Data Lake, they need to be able to combine it with high-quality master data to get proper insights. Therefore a master data hub is a must-have when a Data Lake is created. This hub can simply be a database with master data structures in it; master data management on this data is preferable but not required. The master data hub should be a standalone solution, independent from the other BI solutions, as master data isn't part of these solutions but is only used as a data source. It should be sourced independently too, preferably using master data tooling or tools like SSIS. Just like data from the Data Lake, master data should only be accessed through structures which can also be exposed as OData feeds.
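
As an illustration, a master data structure in such a hub can be as simple as the hypothetical customer table and view below (the mdh schema name is an assumption); just like the Data Lake, the hub is only accessed through views, which can also be published as OData feeds.

-- Hypothetical master data structure: customers keyed by their business key.
CREATE TABLE mdh.Customer
(
    CustomerNo   nvarchar(20)  NOT NULL PRIMARY KEY,  -- business key
    CustomerName nvarchar(100) NOT NULL,
    Segment      nvarchar(50)  NULL,
    Country      nvarchar(50)  NULL
);
GO

-- The access structure over the hub, never the table itself.
CREATE VIEW mdh.[Customers]
AS
SELECT CustomerNo   AS [Customer Number],
       CustomerName AS [Customer Name],
       Segment,
       Country
FROM   mdh.Customer;
GO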

Besides being combined with data from the Data Lake, the master data can be used as a source for other BI solutions like Data Warehouses, where the master data structures are often used as Data Warehouse Dimensions. To prevent loading master data into the Data Warehouse that already exists in the master data hub, it can be a good choice to leave the descriptive master data out of the Data Warehouse Dimensions: only the business keys are stored, and these can be used to retrieve the data from the master data hub when required. This way the Data Warehouse remains slim and fast to load, and master data is stored in a single, centralized data store.
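
A sketch of that idea, reusing the hypothetical mdh objects above (dw.FactSales is assumed to exist, and the hub is assumed to be reachable from the Data Warehouse through a synonym or three-part name): the dimension stores only the surrogate key and the business key, and the descriptive attributes are picked up from the hub at query time.

-- Slim dimension: no descriptive master data attributes, only keys.
CREATE TABLE dw.DimCustomer
(
    CustomerKey int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    CustomerNo  nvarchar(20)      NOT NULL   -- business key, points to the master data hub
);
GO

-- Reporting query that retrieves the descriptive attributes from the hub when required.
SELECT f.SalesAmount,
       c.[Customer Name],
       c.Segment
FROM   dw.FactSales    AS f
JOIN   dw.DimCustomer  AS d ON d.CustomerKey       = f.CustomerKey
JOIN   mdh.[Customers] AS c ON c.[Customer Number] = d.CustomerNo;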

Architecture
The entire Data Lake architecture, with all the described components, fits in the model below. From bottom to top the highlights are:

  • Extract/Load data from the sources to the Data Lake, preferably in near real-time.
  • The Data Lake can consist of multiple SQL (and NoSQL) databases.
  • Transformations and authorizations are handled in views.
  • The Data Lake acts as a hub for other BI solutions like Data Warehouses and Cubes.
  • The master data hub is in the center of the model and in the center of the entire architecture. It’s loaded as a standalone solution and isn’t part of any of the other BI solutions.
  • Traditional BI will continue to exist and continue to be just as important as it has always been. It will be sourced from the Data Warehouses and cubes (and master data hub).
  • The Discovery Platform with its new Power BI tooling is the place where “various users of the lake can come to examine, dive in, or take samples.” These samples can be combined with the data from the master data hub.

[Image: Data Lake BI architecture]

Data Lake Challenges
Setting up a Data Lake comes with many challenges, especially with regard to data governance. For example, it's easy to create yet another view in the Data Lake and lose control over who gets access to what data. From a business perspective, it can be very difficult to deliver the master data structures that are so important for the success of the Data Lake. And from a user perspective, users who derive insights from the raw data can draw wrong conclusions; therefore the Data Warehouse should still be offered as a clean, trusted data structure for decision makers and as a data source for conformed reports and dashboards.
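
One small, practical countermeasure for the access-control part is to audit the permissions on the view layer regularly. The query below uses only the standard SQL Server catalog views and lists which principals have been granted or denied which permissions on the views in the Data Lake database.

-- List explicit permissions on all views in the current (Data Lake) database.
SELECT pr.name                                  AS principal_name,
       SCHEMA_NAME(o.schema_id) + '.' + o.name  AS view_name,
       pe.permission_name,
       pe.state_desc                            -- GRANT / DENY
FROM   sys.database_permissions pe
JOIN   sys.objects              o  ON o.object_id     = pe.major_id
                                  AND o.type          = 'V'
JOIN   sys.database_principals  pr ON pr.principal_id = pe.grantee_principal_id
WHERE  pe.class = 1                              -- object/column level permissions
ORDER BY view_name, principal_name;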

Summary
The Data Lake can be a very valuable data store that complements the traditional Data Warehouses and Cubes, which will stay as important as they are now for many years to come. But considering the increasing amount and variety of data, the ever more powerful self-service ETL and data modeling tooling that keeps appearing, and the shortened time-to-market required for near real-time data from source all the way up to the user, the Data Lake offers a future-proof data store and hub that enables answering questions that have not yet been defined and gives users personal data discovery and shaping possibilities.

Thanks go to my Macaw colleague Martijn Muilwijk for brainstorming on this subject and reviewing this blog post.

SSIS – Package design pattern for loading a data warehouse – Part 2

Since my last blog post about an SSIS package design pattern I've received quite a few positive reactions and feedback. Microsoft also added a link to the post on the SSIS portal, which made it clear to me that there is quite some interest in this subject.

The feedback I received was mainly about two things:
1. Can you visualize the process, or make it clearer without the whole technical story, so it's easier to understand?
2. How should the Extract phase of the ETL process be implemented when source tables are used by multiple dimensions and/or fact tables?

In this post I will try to answer these questions. By doing so I hope to offer a complete design pattern that is usable for most data warehouse ETL solutions developed using SSIS.


SSIS package design pattern for loading a data warehouse

Using one SSIS package per dimension / fact table gives developers and administrators of ETL systems quite a few benefits and has been advised by Kimball ever since SSIS was released. I have mentioned these benefits in my previous post and will not repeat them here.

When using a single modular package approach, developers sometimes face problems concerning flexibility or a difficult debugging experience. Therefore, they sometimes choose to spread the logic of a single dimension or fact table across multiple packages. I have thought about a design pattern that keeps the benefits of a single modular package approach while still offering all the flexibility and debugging functionality developers need.

If you have a little programming knowledge you will have heard about classes and functions. Now think of your SSIS package as a class or object that exists within code. These classes contain functions that you can call separately from other classes (packages). That would be nice functionality to have, but unfortunately it is not possible within SSIS by default.
To realize this functionality in SSIS, I treat SSIS Sequence Containers as functions and SSIS packages as classes.
I personally always use four Sequence Containers in my SSIS packages:
– SEQ Extract (extract the necessary source tables to a staging database)
– SEQ Transform (transform these source tables to a dimension or fact table)
– SEQ Load (load this table into the data warehouse)
– SEQ Process (process the data warehouse table to the cube)

The technical trick that I performed – you can read about its inner workings in my previous post – makes it possible to execute only a single Sequence Container within a package, just like calling a function of a class when programming code.
The execution of a single dimension or fact table can now be performed from a master SSIS package like this:

1 – [Execute Package Task] DimCustomer.Extract
2 – [Execute Package Task] DimCustomer.Transform
3 – [Execute Package Task] DimCustomer.Load
4 – [Execute Package Task] DimCustomer.Process

The package is executed 4 times with an Execute Package Task, but each time only the desired function (Sequence Container) will run.

If we look at this in a UML sequence diagram we see the following:
[Image: ETL_Example1 – UML sequence diagram of the design pattern]

I think this sequence diagram gives you a good overview of how this design pattern is organized. For the technical solution and the download of a template package you should check my previous post.


How should the Extract phase of the ETL process be implemented when a single source table is used by multiple dimensions and/or fact tables?

One of the questions that came up with using this design pattern is how to handle the extraction of source tables that are used in multiple dimensions and/or fact tables. The problem here is that a single table would be extracted multiple times which is, of course, undesirable.

By coincidence I was reading the book "SQL Server 2008 Integration Services: Problem – Design – Solution" (which is a great book!) and one of the data extraction best practices (Chapter 5) is to use one package for the extraction of each source table. Each of these packages has a very simple data flow from the source table to the destination table within the staging area.
Of course this approach is more time-consuming than using one big extract package with all the table extracts in it, but fortunately it also gives you some benefits:
– Debugging: sometimes a source has changed, e.g. a column has been renamed or removed completely. The error that SSIS logs when this occurs will point the administrators straight to the right package and source table. Another benefit here is that only one package will fail and needs to be edited, while the others can still execute and remain unharmed.
– Flexibility: you can execute a single table extract from anywhere (master package or dim/fact package).

I recently created some solutions using this extract approach and really liked it. I used 2 SSIS projects:
– one with the dimension and fact table packages
– one with only the extract packages
I have used the following naming convention for the extract packages: Source_Table.dtsx, and deployed them to a separate SSIS folder. This way the packages won't clutter the overview during development.
A tip here is to use BIDS Helper; it has great functionality for deploying one or more packages from BIDS.

Merging this approach into the design pattern gives the following result:
– The dimension and fact table Extract Sequence Containers no longer contain data flow tasks but Execute Package Tasks which point to the extract packages.
– The Extract Sequence Container of the master package executes all the necessary extract packages at once.

This way a single source table will always be extracted only once when executing your ETL from the master package, and you still have the possibility to unit test your entire dimension or fact table packages.
Drawing this approach again in a sequence diagram gives us the following example with a run from the master package (only the green Sequence Containers are executed):
[Image: ETL_Example2 – sequence diagram of a run from the master package]

And like this with a run of a single Dimension package:
[Image: ETL_Example2_Customer – sequence diagram of a run of a single dimension package]

Overall, the design pattern will now always look like this when executed from a master package:
[Image: ETL_Overview – the design pattern executed from the master package]

Conclusion
I think this design pattern is now good enough to be used as a standard approach for most data warehouse ETL projects using SSIS. Thanks for all the feedback! New feedback is of course more than welcome!

SSIS – Package design pattern for loading a data warehouse

I recently had a chat with some BI developers about the design patterns they're using in SSIS when building an ETL system. We all agreed on creating multiple packages for the dimensions and fact tables and one master package for the execution of all these packages.

These developers even created multiple packages per single dimension/fact table:

  • One extract package where the extract(E) logic of all dim/fact tables is stored
  • One dim/fact package with the transform(T) logic of a single dim/fact table
  • One dim/fact package with the load(L) logic of a single dim/fact table

I like the idea of building the Extract, Transform and Load logic separately, but I do not like the way the logic was spread over multiple packages.
I asked them why they chose this solution, and they gave multiple reasons:

  • Enable running the E/T/L parts separately, for example: run only the entire T phase of all dim/fact tables.
  • Run the extracts of all dimensions and fact tables simultaneously to keep the loading window on the source system as short as possible.

To me these are good reasons: running the E/T/L phases separately is something a developer often wants during the development and testing of an ETL system.
Keeping the loading window on the source system as short as possible is critical in some projects.

Despite the good arguments to design their ETL system like this, I still prefer the idea of having one package per dimension / fact table, with complete E/T/L logic, for the following reasons:

  • All the logic is in one place
  • Increase understandability
  • Perform unit testing
  • If there is an issue with a dimension or fact table, you only have to make changes in one place, which is safer and more efficient
  • You can see your packages as separate ETL “puzzle pieces” that are reusable
  • It’s good from a project manager point of view; let your customer accept dimensions and fact tables one by one and freeze the appropriate package afterwards
  • The overview in BIDS; having an enormous number of packages does not make things clearer 😉
  • Simplifies deployment after changes have been made
  • Changes are easier to track in source control systems
  • Team development will be easier; multiple developers can work on different dim/fact tables without bothering each other.

So basically my goal was clear: to build a solution that has all the possibilities the aforementioned developers asked for, but in one package per dimension / fact table; the best of both worlds.

Solution:

The solution I've created is based on a parent–child package structure: one parent (master) package executes multiple child (dim/fact) packages, and there is a single child package for each dimension and fact table. Each of these packages contains the following Sequence Containers in its Control Flow:
[Image: Control Flow of a child package]

Normally it would not be possible to execute only the Extract, Transform, Load or (cube) Process Sequence Containers of the child (dim/fact) packages simultaneously.

To make this possible I have created four Parent package variable configurations, one for each ETL phase Sequence Container in the child package:
[Images: parent package variable configurations]

Each of these configurations is set on the Disable property of one of the Sequence Containers:
[Image: configuration mapped to the Disable property of a Sequence Container]

Using this technique makes it possible to run separate Sequence Containers of a child package from the master package, simply by disabling or enabling the appropriate Sequence Containers with parent package variables.
Because the default value of the Disable property of the Sequence Containers is False, you can still run an entire standalone child package without the need to change anything.

OK, so far, so good. But how do I execute only one phase of all the dimension and fact packages simultaneously? Well, it's quite simple:

First, add 4 Sequence Containers to the master package, one for each phase of the ETL, just like in the child packages.


Then add Execute Package Tasks for all your packages in every Sequence Container:

[Image: Execute Package Tasks inside each Sequence Container of the master package]


If you were to execute this master package now, every child package would run 4 times, because each of the 4 Sequence Containers contains an Execute Package Task that runs that same package.
To get the required functionality I have created 4 variables inside each Sequence Container (scope). These are used as parent variables to set the Disable properties in the child packages. So basically I've created 4 variables x 4 Sequence Containers = 16 variables for the entire master package.

Variables for the EXTRACT Sequence Container (vDisableExtract False):
[Image: variables of the EXTRACT Sequence Container]

Variables for the TRANSFORM Sequence Container (vDisableTransform False):
[Image: variables of the TRANSFORM Sequence Container]

The variables of the LOAD and PROCESS Sequence Containers are based on the same technique.

Results:

Run all phases of a standalone package: Just execute the package:
[Image: standalone package run]

Run a single phase of the ETL system (Extract/Transform/Load/Process): execute the desired Sequence Container in the master package:

[Image: RunAllTransforms – running a single phase from the master package]

Run a single phase of a single package from the master package:
[Image: running a single phase of a single package from the master package]

Run multiple phases of the ETL system, for example only the T and L: Disable the Sequence Containers of the phases that need to be excluded in the master package:

[Image: running multiple phases while excluding others]

Run all the child packages in the right order from the master package:
When you add a breakpoint on, for example, the LOAD Sequence Container you see that all the child packages are at the same ETL phase as their parent: 
[Image: master package run halted at a breakpoint on the LOAD Sequence Container]


When pressing Continue the package completes: 
[Image: master package run completed]


Conclusion:

This parent/child package design pattern for loading a Data Warehouse gives you all the flexibility and functionality you need. It’s ready and easy to use during development and production without the need to change anything.

With only a single SSIS package for each dimension and fact table you now have the functionality that separate packages would offer. You will be able to, for example, run all the Extracts for all dimensions and fact tables simultaneously like the developers asked for and still have the benefits that come with the one package per dimension/fact table approach.

Of course having a single package per dimension or fact table will not be the right choice in all cases, but I think it is a good standard approach.
The same applies to the ETL phases (Sequence Containers): I use E/T/L/P, but if you have different phases, that's fine; you can still use the same technique.

Download the solution with template packages from the URLs below. The only thing you need to do is change the connection managers to the child packages (to your location on disk) and run the master package!

Download for SSIS 2008

Download for SSIS 2005

If you have any suggestions, please leave them as a comment. I would like to know what your design pattern is as well!

ATTENTION: See Part-2 on this subject for more background information!

Backgrounds:

How to: Use the Values of Parent Variables in a Child Package: http://technet.microsoft.com/en-us/library/ms345179.aspx