Since my last blog post about an SSIS package design pattern I’ve received quite a few positive reactions and feedback. Microsoft also added a link to the post on the SSIS portal, which made it clear to me that this subject is getting quite a bit of attention.
The feedback I received was mainly about two things:
1. Can you visualize the process, or make it clearer without the whole technical story, so it’s easier to understand?
2. How should the Extract phase of the ETL process be implemented when source tables are used by multiple dimensions and/or fact tables?
In this post I will try to answer these questions. By doing so I hope to offer a complete design pattern that is usable for most data warehouse ETL solutions developed using SSIS.
SSIS package design pattern for loading a data warehouse
Using one SSIS package per dimension / fact table gives developers and administrators of ETL systems quite a few benefits and has been advised by Kimball since SSIS was released. I mentioned these benefits in my previous post and will not repeat them here.
When using a single modular package approach, developers sometimes face problems concerning flexibility or a difficult debugging experience. Therefore, they sometimes choose to spread the logic of a single dimension or fact table across multiple packages. I have thought about a design pattern that keeps the benefits of a single modular package approach while still offering all the flexibility and debugging functionality developers need.
If you have a little programming knowledge you will have heard of classes and functions. Now think of your SSIS package as a class or object in code. Such classes contain functions that you can call individually from other classes (packages). That would be nice functionality to have, but unfortunately it is not possible within SSIS by default.
To realize this functionality in SSIS, I treat SSIS Sequence Containers as functions and SSIS packages as classes.
I personally always use four Sequence Containers in my SSIS packages:
– SEQ Extract (extract the necessary source tables to a staging database)
– SEQ Transform (transform these source tables to a dimension or fact table)
– SEQ Load (load this table into the data warehouse)
– SEQ Process (process the data warehouse table to the cube)
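To make the class/function analogy concrete, here is a minimal sketch in ordinary code (Python purely as a stand-in; the class name, method names, and log list are hypothetical illustrations, not part of SSIS):

```python
class DimCustomerPackage:
    """Illustrative analogy only: an SSIS package modeled as a class whose
    callable "functions" correspond to the four Sequence Containers."""

    def __init__(self):
        self.log = []

    def extract(self):    # SEQ Extract: source tables -> staging database
        self.log.append("Extract")

    def transform(self):  # SEQ Transform: staging -> dimension shape
        self.log.append("Transform")

    def load(self):       # SEQ Load: dimension table -> data warehouse
        self.log.append("Load")

    def process(self):    # SEQ Process: warehouse table -> cube
        self.log.append("Process")
```

Each method can be called on its own, which is exactly the ability the pattern tries to recreate for Sequence Containers inside a package.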
The technical trick that I performed – you can read about its inner workings in my previous post – makes it possible to execute only a single Sequence Container within a package, just like calling functions on classes in program code.
The execution of a single dimension or fact table can now be performed from a master SSIS package like this:
1 – [Execute Package Task] DimCustomer.Extract
2 – [Execute Package Task] DimCustomer.Transform
3 – [Execute Package Task] DimCustomer.Load
4 – [Execute Package Task] DimCustomer.Process
The package is executed four times with an Execute Package Task, but each time only the desired function (Sequence Container) will run.
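The master-package idea can be sketched as follows (a Python stand-in, not actual SSIS code; the container names and the idea of a per-container disable flag are assumptions based on the pattern as described):

```python
# The four "functions" (Sequence Containers) inside each child package.
CONTAINERS = ["Extract", "Transform", "Load", "Process"]

def execute_package(package_name, function, log):
    """Mimics one Execute Package Task: the whole child package runs,
    but every Sequence Container except the requested one is disabled."""
    for container in CONTAINERS:
        disabled = (container != function)  # e.g. a vDisable<Container> flag
        if not disabled:
            log.append(f"{package_name}.{container}")

def run_master(log):
    # Four Execute Package Tasks against the same child package,
    # each enabling a different Sequence Container.
    for function in CONTAINERS:
        execute_package("DimCustomer", function, log)

log = []
run_master(log)
# log now holds: DimCustomer.Extract, DimCustomer.Transform,
#                DimCustomer.Load, DimCustomer.Process
```

The point of the sketch is that the child package is invoked once per phase, yet each invocation performs only one phase of work.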
If we look at this in a UML sequence diagram we see the following:
I think this sequence diagram gives you a good overview of how this design pattern is organized. For the technical solution and the download of a template package you should check my previous post.
How should the Extract phase of the ETL process be implemented when a single source table is used by multiple dimensions and/or fact tables?
One of the questions that came up with using this design pattern is how to handle the extraction of source tables that are used by multiple dimensions and/or fact tables. The problem here is that a single table would be extracted multiple times, which is, of course, undesirable.
By coincidence I was reading the book “SQL Server 2008 Integration Services: Problem – Design – Solution” (which is a great book!) and one of the data-extraction best practices (Chapter 5) is to use one package for the extraction of each source table. Each of these packages has a very simple data flow from the source table to the destination table within the staging area.
Of course this approach is more time-consuming than using one big extract package with all the table extracts in it, but fortunately it also gives you some benefits:
– Debugging: sometimes a source changes, e.g. a column is renamed or removed entirely. The error that SSIS logs when this occurs points administrators straight to the right package and source table. Another benefit is that only one package will fail and need editing, while the others can still execute, unharmed.
– Flexibility: you can execute a single table extract from anywhere (the master package or a dimension/fact package).
I recently created some solutions using this extract approach and really liked it. I used two SSIS projects:
– one with the dimension and fact table packages
– one with only the extract packages
I used the naming convention Source_Table.dtsx for the extract packages and deployed them to a separate SSIS folder. This way the packages don’t clutter the overview during development.
A tip here is to use BIDS Helper; it has great functionality for deploying one or more packages directly from BIDS.
Merging this approach in the design pattern will give the following result:
– The extract Sequence Containers of the dimension and fact table packages will no longer contain Data Flow Tasks but Execute Package Tasks that point to the extract packages.
– The Extract Sequence Container of the master package will execute all the necessary extract packages at once.
This way a single source table is extracted only once when you run your ETL from the master package, and you still have the possibility to unit test your entire dimension or fact table packages.
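The deduplication idea in the master package’s Extract phase can be sketched like this (a Python stand-in for the Execute Package Tasks; the package names and the dependency mapping are hypothetical examples):

```python
# Hypothetical mapping: which extract packages each dim/fact package needs.
DEPENDENCIES = {
    "DimCustomer": ["Sales_Customer.dtsx", "Sales_Region.dtsx"],
    "FactSales":   ["Sales_Customer.dtsx", "Sales_Order.dtsx"],
}

def master_extract(dependencies):
    """Run every required extract package exactly once, even when
    several dimension/fact packages share the same source table."""
    executed = []
    for extracts in dependencies.values():
        for package in extracts:
            if package not in executed:  # a shared table is extracted once
                executed.append(package)
    return executed

print(master_extract(DEPENDENCIES))
# Sales_Customer.dtsx appears only once in the result
```

Running a single dimension package on its own would instead execute just that package’s own extract list, which is what keeps unit testing possible.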
Drawing this approach again in a sequence diagram gives us the following example with a run from the master package (only the green Sequence Containers are executed):
And like this with a run of a single Dimension package:
Overall, the design pattern will now always look like this when executed from a master package:
I think this design pattern is now good enough to be used as a standard approach for most data warehouse ETL projects using SSIS. Thanks for all the feedback! New feedback is, of course, more than welcome!
Jorg, did you take a look at the SQLBI Methodology (http://www.sqlbi.com/sqlbimethodology.aspx)?
We use the concept of OLTP Mirror as a first stage of data extraction and the staging is just a temporary area for processing data. We don’t necessarily use SSIS to extract data, but for example we use SqlBulkTool (http://www.sqlbi.com/sqlbulktool.aspx – it’s a free tool provided with source code), which is faster and handles partitioning too.
I’d like to get your feedback if you had a chance to read it.
That looks very promising. I will definitely take a look at it and give you my feedback.
Very interesting article. I found it very useful, and I am designing an ETL process I am currently responsible for carrying out around this principle. Just one question, or rather a different approach, with regard to this second part… (or maybe I just misunderstood you completely).
What if, instead of creating a separate package for each source table, you use precedence constraints in the appropriate and necessary manner to process the dimensions or fact tables that require the same source tables sequentially? One might argue that it removes the parallelism, but I guess it’s just a different approach that I am considering.
If, for example, dimPlayer and dimCountry both require the source table tPlayerDetails, you set up a data flow in the EXTRACT container of the dimPlayer package and a precedence constraint to dimCountry. You omit the extract from tPlayerDetails in the dimCountry package completely, since it was carried out in the dimPlayer package.
Good to hear that you benefit from this approach!
I think your approach works fine when executing the complete ETL process, but what if you want to unit test dimCountry: how do you extract the data from tPlayerDetails?
If I understand you correctly, this will not be possible without running the extract of dimPlayer by hand before running dimCountry. Of course this isn’t a big problem, just a detail, and your approach will work fine when deploying your solution to a production environment!
What is the purpose of staging data into tables when you can directly extract, cleanse, transform, and load to dims/facts in a single Data Flow Task?
Because you usually want to minimize the pressure on the source systems and keep the load window as small as possible. Besides these reasons, it’s usually easier to extract only the changed records when using a staging area that’s identical to the source system.
But in some cases you can do everything in a single Data Flow Task, while in other cases you want to use one or more staging areas. It just depends.
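A minimal sketch of that changed-records idea (Python stand-in; in a real extract this would typically be a WHERE clause against a last-modified column, which is an assumption about the source schema):

```python
def changed_rows(source_rows, last_run):
    """Return only the rows modified since the previous extract run.
    source_rows: list of dicts with a hypothetical 'modified' timestamp."""
    return [row for row in source_rows if row["modified"] > last_run]

source = [
    {"id": 1, "modified": 10},
    {"id": 2, "modified": 25},
    {"id": 3, "modified": 30},
]

# Only rows changed after timestamp 20 are pulled into staging.
print(changed_rows(source, last_run=20))
```

Because the staging area mirrors the source, the comparison against the previous load is straightforward and the source system is touched only for the delta.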
I designed a package with many dimensions that hold very little data. Can we club them into a single package? Would that be a good approach?
I’ve been thinking about Vinod’s comment about the purpose of going to the trouble of creating staging tables since I’m considering that too.
As a developer I’m really keen on TDD in that it’s a documentation of intent and is a protection against future breaking changes. I think that both of those are hugely important in ETL and staging tables would allow a controlled set of data to be created and then transformed and loaded to a destination. The destination could then be inspected to see that the developers intentions had been met.
For instance, a slowly changing dimension could be tested by loading the staging tables, executing the T and L parts of a package, change the staging data and then re-running the package. A simple Sql script could inspect the target to ensure that the data has been loaded correctly.
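As a sketch of what such a test could look like for a Type 2 slowly changing dimension (all names and the deliberately simplified merge logic here are hypothetical, not a production implementation):

```python
def load_scd2(target, staging, run_no):
    """Simplified Type 2 load: expire the current row and insert a new
    version whenever a tracked attribute (here 'city') has changed."""
    for row in staging:
        current = [t for t in target
                   if t["key"] == row["key"] and t["end"] is None]
        if current and current[0]["city"] != row["city"]:
            current[0]["end"] = run_no  # expire the old version
            current = []
        if not current:
            target.append({"key": row["key"], "city": row["city"],
                           "start": run_no, "end": None})

target = []
load_scd2(target, [{"key": 1, "city": "Utrecht"}], run_no=1)    # first load
load_scd2(target, [{"key": 1, "city": "Amsterdam"}], run_no=2)  # changed data
# target now holds two versions of key 1: the expired Utrecht row
# and the current Amsterdam row
```

The assertions at the end play the role of the inspection script the comment describes: load, change the staging data, rerun, then verify that history was preserved.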
If your load into the DWH needs to update rows, it will cause fragmentation in the destination table. It is better to prepare the data in staging and merge source and destination together into a new table or partition, which is then ‘switched in’ at the end.
As reported to our CIO who saw Jorg’s solution:
His solution is intriguing and supposedly recommended by Kimball. If Kimball agrees, we need to re-read that chapter in the toolkit and perhaps take Jorg’s recommendation seriously.
Unwieldiness will be a concern in real life scenarios. Economies of scale too.
Our ultimate goal of completely automating ETL code generation seems less (or at least equally “un”) attainable in this approach. But it seems that as Jorg gets more and more feedback, he is moving in a traditional direction anyway.
One concern here is a paradigm in which dims are extracted just like, but separately from, facts. Dims are a byproduct of facts when it comes to relevance and ETL. They are only business keys in the extract phase of ETL. They are easily extracted with new (and, if necessary, old) facts.
It appears that maintenance is not diminished, nor is the challenge involved when more than one person is making a change.
Do you have an updated template with the new source table packages for me to look at, please?
Excellent blog, thank you.
Hi Jorg, great posts. I have one question about the approach used for a single source table used by multiple dimensions. You move the extract dataflow into its own separate package, and call all of them into the EXTRACT container of the master. But how do you prevent it from being called twice during the master execution? Do you modify the scoped variable vDisableExtract to true as well?