SSIS – Package design pattern for loading a data warehouse

I recently had a chat with some BI developers about the design patterns they’re using in SSIS when building an ETL system. We all agreed in creating multiple packages for the dimensions and fact tables and one master package for the execution of all these packages.

These developers even created multiple packages per single dimension/fact table:

  • One extract package where the extract(E) logic of all dim/fact tables is stored
  • One dim/fact package with the transform(T) logic of a single dim/fact table
  • One dim/fact package with the load(L) logic of a single dim/fact table

I like the idea of building the Extract, Transform and Load logic separately, but I do not like the way the logic was spread over multiple packages.
I asked them why they chose for this solution and there were multiple reasons:

  • Enable running the E/T/L parts separately, for example: run only the entire T phase of all dim/fact tables.
  • Run the extracts of all dimensions and fact tables simultaneously to keep the loading window on the source system as short as possible.

To me these are good reasons, running the E/T/L phases separately is a thing a developer often wants during the development and testing of an ETL system.
Keeping the loading window on the source system as short as possible is something that’s critical in some projects.

Despite the good arguments to design their ETL system like this, I still prefer the idea of having one package per dimension / fact table, with complete E/T/L logic, for the following reasons:

  • All the logic is in one place
  • Increase understandability
  • Perform unit testing
  • If there is an issue with a dimension or fact table, you only have to make changes in one place, which is safer and ore efficient
  • You can see your packages as separate ETL “puzzle pieces” that are reusable
  • It’s good from a project manager point of view; let your customer accept dimensions and fact tables one by one and freeze the appropriate package afterwards
  • The overview in BIDS, having an enormous amount of packages does not make it clearer 😉
  • Simplifies deployment after changes have been made
  • Changes are easier to track in source control systems
  • Team development will be easier; multiple developers can work on different dim/fact tables without bothering each other.

So basically my goal was clear: to build a solution that has all the possibilities the aforesaid developers asked for, but in one package per dimension / fact table; the best of both worlds.

Solution:

The solution I’ve created is based on a parent-child package structure. One parent (master) package will execute multiple child (dim/fact) packages. This solution is based on a single (child) package for each dimension and fact table. Each of these packages contains the following Sequence Containers in the Control Flow: 
 ChildControlFlow 

Normally it would not be possible to execute only the Extract, Transform, Load or (cube) Process Sequence Containers of the child (dim/fact) packages simultaneously.

To make this possible I have created four Parent package variable configurations, one for each ETL phase Sequence Container in the child package:
clip_image003

clip_image005

Each of these configurations is set on the Disable property of one of the Sequence Containers:
clip_image007

Using this technique makes it possible to run separate Sequence Containers of the child package from the master package, simply by did- or enabling the appropriate sequence containers with parent package variables.
Because the default value of the Disable property of the Sequence Containers is False, you can still run an entire standalone child package, without the need to change anything.

Ok, so far, so good. But, how do I execute only one phase of all the dimension and fact packages simultaneously? Well quite simple:

First add 4 Sequence Containers to the Master package. One for each phase of the ETL, just like in the child packages


Add Execute Package Tasks for all your packages in every Sequence Container

clip_image008


If you would execute this master package now, every child package would run 4 times as there are 4 Execute Package Tasks that run the same package in every sequence container.
To get the required functionality I have created 4 variables inside each Sequence Container (Scope). These will be used as parent variable to set the Disable properties in the child packages. So basically I’ve created 4 variables x 4 Sequence Containers = 16 variables for the entire master package.

Variables for the EXTRACT Sequence Container (vDisableExtract False):
clip_image009

Variables for the TRANSFORM Sequence Container (vDisableTransform False):
clip_image010

The LOAD and PROCESS Sequence Containers contain variables are based on the same technique.

Results:

Run all phases of a standalone package: Just execute the package:
clip_image011

Run a single phase of the ETL system (Extract/Transform/Load/Process): Execute the desired sequence container in the main package:

RunAllTransforms 

Run a single phase of a single package from the master package:
RunSinglePhaseOfOnePackage

Run multiple phases of the ETL system, for example only the T and L: Disable the Sequence Containers of the phases that need to be excluded in the master package:

RunMultiplePhasesAndExcludeOthers

Run all the child packages in the right order from the master package:
When you add a breakpoint on, for example, the LOAD Sequence Container you see that all the child packages are at the same ETL phase as their parent: 
RunCompleteMasterPackageBreakPoint


When pressing Continue the package completes: 
RunCompleteMasterPackageBreakPointCompleted


Conclusion:

This parent/child package design pattern for loading a Data Warehouse gives you all the flexibility and functionality you need. It’s ready and easy to use during development and production without the need to change anything.

With only a single SSIS package for each dimension and fact table you now have the functionality that separate packages would offer. You will be able to, for example, run all the Extracts for all dimensions and fact tables simultaneously like the developers asked for and still have the benefits that come with the one package per dimension/fact table approach.

Of course having a single package per dimension or fact table will not be the right choice in all cases but I think it is a good standard approach.
Same applies to the ETL phases (Sequence Containers). I use E/T/L/P, but if you have different phases, which will be fine, you can still use the same technique.

Download the solution with template packages from the URL’s below. Only thing you need to do is change the connection managers to the child packages (to your location on disk) and run the master package!

Download for SSIS 2008

Download for SSIS 2005

If you have any suggestions, please leave them as a comment. I would like to know what your design pattern is as well!

ATTENTION: See Part-2 on this subject for more background information!

Backgrounds:

How to: Use the Values of Parent Variables in a Child Package: http://technet.microsoft.com/en-us/library/ms345179.aspx

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s