Replace the multiple “Lookup Error Output” Derived Columns and Union All’s with a single Derived Column and get a performance boost…
Almost every BI developer needs to perform lookups while loading the Data Warehouse. When loading fact tables, for example, lookups are used to retrieve the (surrogate) primary keys of the dimension tables that are connected to the fact table.
I, and I think most developers, perform the DWH fact-dimension lookups in the following way:
1. Each fact record contains business keys to the different dimensions.
2. Each dimension record contains a business key and a surrogate key (integer).
3. Each dimension contains one unknown member, with a surrogate key of value 0.
4. Match the dimension business key in the fact record to the business key in the dimension to retrieve the dimension surrogate key, using a Lookup component. If the lookup finds no match, point to the unknown member in the dimension (surrogate key 0).
5. Store the fact record in the DWH fact table with the gained surrogate keys of the dimensions.
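Conceptually, steps 4 and 5 amount to the following T-SQL; the table and column names (FactStaging, DimCustomer, CustomerBK, CustomerSK) are hypothetical, chosen only to illustrate the pattern:

```sql
-- FactStaging holds the incoming fact rows with business keys;
-- DimCustomer holds the business key plus the surrogate key.
-- A LEFT JOIN with ISNULL falls back to the unknown member
-- (surrogate key 0) when no dimension row matches, which is
-- exactly what the Lookup component does in the data flow.
SELECT  f.SalesAmount,
        ISNULL(d.CustomerSK, 0) AS CustomerSK
FROM    FactStaging AS f
LEFT JOIN DimCustomer AS d
        ON d.CustomerBK = f.CustomerBK;
```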
Loading a fact table in the way described above would typically look like this:
I have tested this solution with 313,341 records. This took 1 minute and 23 seconds.
This way of loading the fact table contains a lot of semi-blocking components: the Union Alls. As I explained in an earlier blog post about semi-blocking components, they cause an unnecessary negative impact on performance.
There is a much more efficient way to load a fact table in a data warehouse:
1. Set all the Lookup Error Outputs to “Ignore Failure”.
2. Delete all the Union All and Derived Column components.
3. Add a Derived Column component to the end of the flow and add the unknown surrogate keys like this:
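With the error outputs set to “Ignore Failure”, an unmatched lookup leaves its surrogate key column NULL, so the single Derived Column only has to replace those NULLs with 0. The expressions would look roughly like this (the column names are illustrative, not from the post):

```
Replace CustomerSK :  ISNULL(CustomerSK) ? 0 : CustomerSK
Replace ProductSK  :  ISNULL(ProductSK) ? 0 : ProductSK
```

On SSIS 2012 and later, REPLACENULL(CustomerSK, 0) is a shorter equivalent.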
The data flow now looks like this:
Loading exactly the same number of records (313,341) now took just 45 seconds! That is a performance gain of about 45%. Loading your fact table like this not only decreases your loading time, it also takes less development time. So a win-win situation!
A minor downside might be that you can’t easily see how many records failed a particular lookup, but it’s not hard to build a simple (SSRS) report that gives you this overview. It might be a good opportunity to include the information of all your fact tables in a single report; it will give you a much better overview of lookup failures during the load of your fact tables than all the separate SSIS data flow tasks.
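Such a report could be driven by a query that counts the fact rows pointing at the unknown member per dimension; again, the table and column names here are hypothetical:

```sql
-- Rows with surrogate key 0 are the ones that failed a lookup
-- during the load, one result row per dimension.
SELECT 'DimCustomer' AS Dimension, COUNT(*) AS LookupFailures
FROM   FactSales
WHERE  CustomerSK = 0
UNION ALL
SELECT 'DimProduct', COUNT(*)
FROM   FactSales
WHERE  ProductSK = 0;
```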
I want to know: what are synchronous and asynchronous transformations?
Do you have a download of this sample?
great idea Jorg !!! simple ! yet it saves a lot of time
How do you handle Late Arriving Dimensions? Do you insert Inferred Dimensions during fact load?
To be able to get the most current surrogate key, you need to use a SQL query in the Lookup component that selects only the active/current record; otherwise you will start getting multiple surrogate keys once the dimension contains historical records.
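Such a Lookup source query might look like this, assuming the dimension tracks history with an IsCurrent flag (some designs use EndDate IS NULL instead; names are illustrative):

```sql
-- Only current rows feed the Lookup cache, so each business key
-- maps to exactly one surrogate key.
SELECT CustomerBK, CustomerSK
FROM   DimCustomer
WHERE  IsCurrent = 1;
```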
Just FYI… I haven’t tested this exact number of rows, nor do I have your table layout, so I can’t repeat this test, but I’d bet money that if you simply outer joined all the dim tables to the fact query in the source of the data flow, using CASE WHEN statements to substitute the unknown dimension rows, the runtime would drop to something like 1 second. I’m guessing your rows and dimensions are very small with that kind of throughput, since you’re driving those rows through the memory of the ETL tool through multiple tasks. My tests of that with my own data were about as fast as watching paint dry. When I pushed it to the database, it screamed.
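The approach described here would look something like the following source query (hypothetical table and column names, sketched under the assumption of one fact staging table and two dimensions):

```sql
-- All lookups pushed into the database: outer join every dimension
-- and substitute the unknown member (surrogate key 0) with CASE WHEN.
SELECT  f.SalesAmount,
        CASE WHEN c.CustomerSK IS NULL THEN 0 ELSE c.CustomerSK END AS CustomerSK,
        CASE WHEN p.ProductSK  IS NULL THEN 0 ELSE p.ProductSK  END AS ProductSK
FROM    FactStaging AS f
LEFT JOIN DimCustomer AS c ON c.CustomerBK = f.CustomerBK
LEFT JOIN DimProduct  AS p ON p.ProductBK  = f.ProductBK;
```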
Rule #1 I tell all new ETL developers is: never ever make the ETL tool do something that can be done exponentially faster in the database if you need scalability. That applies to all ETL tools. Of course, if you’re only loading 300K rows, you can probably do anything you want and who cares? If you’re loading a billion rows like I do, the above scenario is going to take a mighty long time.
How would this approach work if planning to use the data in something like Power BI? I would want to ensure the records that do not have a link are still shown with the default ‘0’ attributes.