Replace the multiple “Lookup Error Output” Derived Columns and Union All’s with a single Derived Column and get a performance boost…
Almost every BI developer needs to perform lookups while loading the Data Warehouse. When loading the fact tables for example, lookups are used to receive the (surrogate) primary keys of the dimension tables that are connected to the fact table.
I, and I think most developers, perform the DWH fact-dimension lookups in the following way:
1. Each fact record contains business keys to the different dimensions.
2. Each dimension record contains a business key and a surrogate key (integer).
3. Each dimension contains one unknown member, with a surrogate key of value 0.
4. Match the dimension business key in the fact record to the business key in the dimension to receive the dimension surrogate key, using a Lookup Component. If the lookup gained no match, point to the unknown member in the dimension (surrogate key 0).
5. Store the fact record in the DWH fact table with the gained surrogate keys of the dimensions.
Loading a fact table in the way described above would typically look like this:
I have tested this solution with 313.341 records. This took 1 minute and 23 seconds.
This way of loading the fact table contains a lot of semi-blocking components: the Union All’s. Read more about semi-blocking components in this blog, posted by me a little while ago. These components cause an unnecessary negative impact on the performance, as you can read in my other blog.
There is a much more efficient way to load a fact table in a data warehouse:
1. Set the all the Lookup Error Outputs to “Ignore Failure”.
2. Delete all the Union All and Derived Column components.
3. Add a Derived Column component to the end of the flow and add the unknown surrogate keys like this:
The data flow now looks like this:
The loading of exactly the same amount of records (313.341) now took just 45 seconds! This is a performance boost of almost 40%. Loading your fact table like this does not only decrease your loading time, it also takes less development time. So a win-win situation!
A minor downside might be that you can’t easily see how much records have failed a particular lookup component, but it’s not so hard to make a simple (SSRS) report which gives you this overview. It might be a good opportunity to include the information of all your fact tables in one single report, it will give you a much better overview of lookup-failures during the load of your fact tables then all the separate SSIS dataflow tasks.