Since my last blog post about a SSIS package design pattern I’ve received quite some positive reactions and feedback. Microsoft also added a link to the post on the SSIS portal which made it clear to me that there is quite some attention for this subject.
The feedback I received was mainly about two things:
1. Can you visualize the process or make it clearer without the whole technical story so it’s easier to understand.
2. How should the Extract phase of the ETL process be implemented when source tables are used by multiple dimensions and/or fact tables.
In this post I will try to answer these questions. By doing so I hope to offer a complete design pattern that is usable for most data warehouse ETL solutions developed using SSIS.
SSIS package design pattern for loading a data warehouse
Using one SSIS package per dimension / fact table gives developers and administrators of ETL systems quite some benefits and is advised by Kimball since SSIS has been released. I have mentioned these benefits in my previous post and will not repeat them here.
When using a single modular package approach, developers sometimes face problems concerning flexibility or a difficult debugging experience. Therefore, they sometimes choose to spread the logic of a single dimension or fact table in multiple packages. I have thought about a design pattern with the benefits of a single modular package approach and still having all the flexibility and debugging functionalities developers need.
If you have a little bit of programming knowledge you must have heard about classes and functions. Now think about your SSIS package as a class or object that exists within code. These classes contain functions that you can call separately from other classes (packages). That would be some nice functionality to have, but unfortunately this is not possible within SSIS by default.
To realize this functionality in SSIS I thought about SSIS Sequence Containers as functions and SSIS packages as classes.
I personally always use four Sequence Containers in my SSIS packages:
– SEQ Extract (extract the necessary source tables to a staging database)
– SEQ Transform (transform these source tables to a dimension or fact table)
– SEQ Load (load this table into the data warehouse)
– SEQ Process (process the data warehouse table to the cube)
The technical trick that I performed – you can read about the inner working in my previous post – makes it possible to execute only a single Sequence Container within a package, just like with functions in classes when programming code.
The execution of a single dimension or fact table can now be performed from a master SSIS package like this:
1 – [Execute Package Task] DimCustomer.Extract
2 – [Execute Package Task] DimCustomer.Transform
3 – [Execute Package Task] DimCustomer.Load
4 – [Execute Package Task] DimCustomer.Process
The package is executed 4 times with an Execute Package Task, but each time only the desired function (Sequence Container) will run.
I think this sequence diagram gives you a good overview of how this design pattern is organized. For the technical solution and the download of a template package you should check my previous post.
How should the Extract phase of the ETL process be implemented when a single source table is used by multiple dimensions and/or fact tables?
One of the questions that came up with using this design pattern is how to handle the extraction of source tables that are used in multiple dimensions and/or fact tables. The problem here is that a single table would be extracted multiple times which is, of course, undesirable.
On coincidence I was reading the book “SQL Server 2008 Integration Services: Problem – Design – Solution” (which is a great book!) and one of the data extraction best practices (Chapter 5) is to use one package for the extraction of each source table. Each of these packages would have a very simple dataflow from the source table to the destination table within the staging area.
Of course this approach will be more time consuming than using one big extract package with all table extracts in it but fortunately it also gives you some benefits:
– Debugging, sometimes a source has changed, i.e. a column’s name could have been changed or completely deleted. The error that SSIS will log when this occurs will point the administrators straight to the right package and source table. Another benefit here is that only one package will fail and needs to be edited, while the others can still execute and remain unharmed.
– Flexibility, you can execute a single table extract from anywhere (master package or dim/fact package).
I recently created some solutions using this extract approach and really liked it. I used 2 SSIS projects:
– one with the dimension and fact table packages
– one with only the extract packages
I have used the following naming conventions on the extract packages: Source_Table.dtsx and deployed them to a separate SSIS folder. This way the packages won’t bother the overview during development.
A tip here is to use BIDS Helper; it has a great functionality to deploy one or more packages from BIDS.
Merging this approach in the design pattern will give the following result:
– The dimension and fact table extract Sequence Containers will no longer have data flow tasks in it but execute package tasks which point to the extract packages.
– The Extract Sequence Container of the master package will execute all the necessary extract packages at once.
This way a single source table will always get extracted only one time when executing your ETL from the master package and you still have the possibility to unit test your entire dimension or fact table packages.
Drawing this approach again in a sequence diagram gives us the following example with a run from the master package (only the green Sequence Containers are executed):
I think this design pattern is now good enough to be used as a standard approach for the most data warehouse ETL projects using SSIS. Thanks for all the feedback! New feedback is of course more than welcome!