BI on your terms with SQL Server 2016

Over the last few years Microsoft's strategy was all about cloud first (or cloud only?): new BI products and updates to existing products were released to the cloud at a high pace, with almost no investment in on-premises BI. In 2015 Microsoft seems to be changing course. The aim is now much more on enabling hybrid scenarios, investing a lot in both the cloud (Power BI/Azure) and on-premises with SQL Server 2016.
Microsoft's message regarding BI for 2015/2016 is: "BI on your terms".

BI on your terms means leveraging up-to-date possibilities for one, or a combination (hybrid), of the following architectures:

  • Cloud with Azure and Power BI

  • On-premises with SQL Server 2016

  • Server-driven or self-service

To be able to offer quality hybrid architectures, Microsoft is investing a lot in the on-premises BI suite with SQL Server 2016 and has announced that it will keep investing in it in the coming years. So no longer cloud first only, as we have seen in previous years, but a focus on hybrid possibilities and, if you prefer, on-premises only.

For the first time in many years, an exciting version of SQL Server is coming in terms of BI. The main topics are:

  • Hybrid BI (Cloud/On-Premises)

  • Modern Reports

  • Enhanced Analysis

  • Mobile BI

Below is an overview of the new BI-related features per SQL Server 2016 service or product. As the length of this list shows, SQL Server 2016 will be a massive BI release!

Analysis Services Tabular

  • Enhanced modeling capabilities in the semantic layer

    • Many-to-many relationships

    • Bi-directional cross filtering. This means you can not only filter on the 1 side of a 1-to-many relationship in your tabular model, but also on the many side. For example, take two connected tables, Sales → Product:

      • Product: product, product category

      • Sales: sales date, connection to the Product table

        You can now select the products sold while filtering on sales date (the many side) and on product category (the 1 side) at the same time. This is not possible in today's version of SSAS Tabular. (A relational T-SQL sketch of this question follows after this list.)

  • Time intelligence

    • Date/time columns are automatically converted to rich date/time tables, running from the column's minimum date to its maximum date

  • New DAX functions

    • A lot of new functions for calculations that currently require quite complex formulas, such as current time, date difference, percentile, product, geometric mean, median, etc.

  • Performance improvements

    • For end users

      • Query engine optimizations

    • For developers

      • Metadata operations: modeling-related operations are much faster

    • For data processing

      • Parallel partition processing

  • Expose on-premises tabular models in the cloud (hybrid) → a Power BI feature, already possible today with SQL Server 2012.
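To make the bi-directional cross filtering example above more concrete: the business question it enables is easy to express in plain T-SQL against hypothetical Product and Sales tables (table, column and filter values below are illustrative, not taken from a real model). With bi-directional cross filtering the tabular model can answer this directly, because a filter on Sales (the many side) now also filters Product (the 1 side):

-- Which products in a given category were sold in a given period?
SELECT DISTINCT p.ProductName
FROM dbo.Product AS p
INNER JOIN dbo.Sales AS s
    ON s.ProductKey = p.ProductKey
WHERE s.SalesDate >= '20160101'
  AND s.SalesDate <  '20160201'
  AND p.ProductCategory = N'Bikes';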

Analysis Services Dimensional

  • Netezza as a data source (Netezza Data Warehouse | IBM – NDM Technologies)

  • Performance improvements

    • Unnatural hierarchies

    • Distinct counts

    • Other performance improvements in areas where multidimensional is currently not performant

  • DBCC (Database Consistency Checker) support: checks the logical and physical integrity of objects in the specified database

  • Expose on-premises multidimensional cubes in the cloud with Power BI (hybrid)

SQL Server Database Engine

  • Integration of the R analytical engine: predictive analytics capabilities via T-SQL queries

  • PolyBase available without the need for PDW, making it possible to query both structured relational SQL data and unstructured Hadoop data through T-SQL statements

  • Data encryption for stored data and data in motion

  • Row-level security

  • Updates to the in-memory OLTP engine, for example updateable in-memory nonclustered columnstore indexes

  • Parsing and storing native JSON data (see the sketch after this list)

  • XEvents-based monitoring in Management Studio
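To give an idea of the JSON support, below is a minimal T-SQL sketch based on the announced JSON functions (JSON_VALUE, OPENJSON and FOR JSON). The document structure and names are made up for illustration, and the exact syntax may still evolve before RTM:

DECLARE @order NVARCHAR(MAX) =
    N'{ "OrderId": 42, "Customer": "Contoso", "Lines": [ { "Product": "Bike", "Qty": 2 } ] }';

-- Extract scalar values from the JSON document
SELECT JSON_VALUE(@order, '$.OrderId')  AS OrderId,
       JSON_VALUE(@order, '$.Customer') AS Customer;

-- Shred the nested array into a relational rowset
SELECT Product, Qty
FROM OPENJSON(@order, '$.Lines')
     WITH (Product NVARCHAR(50) '$.Product',
           Qty     INT          '$.Qty');

-- And the other way around: return a relational result as JSON
SELECT OrderId = 42, Customer = N'Contoso'
FOR JSON PATH;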

Reporting Services

  • New look and feel, and the possibility to apply themes and branding using CSS

  • New visualizations and chart types, like tree maps and sunbursts

  • Improved, flexible parameter panel with support for:

    • Autocomplete

    • Search

    • Hierarchical tree display

  • Runs in all modern browsers, on desktops as well as tablets (any device)

  • Integration of the R analytical engine

  • Power Query as a data source

  • Pin on-premises SSRS reports to Power BI dashboards (hybrid)

Integration Services

  • High availability support

  • Power Query integration

  • Azure Data Factory integration (hybrid)

    • Execute on-premises SSIS packages from Azure Data Factory

    • Azure Data Factory data flow task

    • Azure storage connector

    • Azure cmdlets

  • OData 4.0 support

  • Hadoop File System (HDFS) support

  • JSON support

  • New Oracle/Teradata connector (4.0)

  • Incremental deployment options

  • Custom logging levels

  • SSIS package templates to reuse ETL code

Mobile BI

  • In the cloud with Power BI

    • Power BI app for Windows Phone (coming soon) and iOS

  • On-premises with Datazen Server

    • Now available for free to SQL Server Enterprise Edition customers (2008 or later)

    • All major platforms: Windows Phone, Android, iOS

    • Beautiful interface and data visualizations

    • Layouts can be optimized for phone, tablet and laptop

SharePoint vNext integration

  • Edit mode for PowerPivot Excel workbooks in the browser

  • Support for Excel vNext (Office 2016) DAX functions

Master Data Services

  • Improved performance for large models

  • Row-level compression per entity

  • Improved user interface

  • Configurable retention settings

  • Enhanced security possibilities for read, write, delete and create operations, and support for multiple system administrators with specific permissions

  • The Excel add-in is 15 times faster and has been updated to support bulk entity-based staging operations

Visual Studio

  • Database and BI project types merged into one Visual Studio

  • A new scripting language for tabular models. Currently tabular models are wrapped in multidimensional constructs, and on deployment they are reverse-engineered back to the tabular model. The new native language for tabular will be easier to understand, modify and deploy.

  • The SSIS designer supports previous versions of SQL Server

Of course there is also still a lot of exciting news coming from the cloud side of Microsoft BI; for example, Azure Data Lake has been announced, following the principles of my blog post about the relational data lake. You can expect a post about Azure Data Lake on this blog soon!



P.S. Don't forget to suggest and vote for feature requests for SQL Server yourself at: http://aka.ms/SqlBiUserVoice

 

SSIS Denali CTP3 – What’s new?

Last week Microsoft released CTP3 of SQL Server Integration Services (SSIS), code name: Denali. In this blog post I will look into the new key features and some of the minor improvements in SSIS Denali.

1. Development

Shared Connection Managers
Connection Managers can now be shared at the SSIS project level.
You can create them in the Solution Explorer, in the Connection Managers folder:
clip_image001

Once created, they automatically appear in all your SSIS packages. Their names are shown in bold so you can distinguish them from your package-level connection managers:
clip_image002

You can also create Shared Cache Connection Managers for your cached lookups. This is very useful for lookups that are performed multiple times, for example when you look up the same dimension tables repeatedly in your fact table ETL. You can convert a local cache connection manager by simply right-clicking it:
clip_image003

What happened to the Shared Data Sources we knew from prior SSIS versions? Well, they are gone. Shared Data Sources only existed at design time, not at runtime, so they were not really useful anyway. Shared Connection Managers do exist at runtime and offer even more possibilities, as I will show later on, so they replace the old Shared Data Sources.

Data Flow – Column mappings
SSIS has always mapped columns from sources to transformations or destinations with the help of lineage IDs. Every column had a unique metadata ID that was known by all components in the data flow. If something changed in the source, this broke the lineage IDs and raised error messages like: "The external metadata column collection is out of synchronization with the data source columns."
To fix this error you had to re-map all broken lineage IDs with the "Restore Invalid Column References Editor".
In Denali, lineage IDs are no longer used. Mappings are done on column names, which is great because you can now auto-map on column names and even copy/paste pieces of another data flow and connect them by mapping the corresponding column names.

Data Flow – Flexible order of authoring
This improvement lets you edit data flow components even when they don't have an input attached. Theoretically you can build your data flow backwards: start with the destination and work back to the source.

Data Flow – Groupings
You can now group data flow components. Select the components you wish to group, right click and select Group:
clip_image004 

The result is some sort of a data flow sequence container:
clip_image005

By clicking the arrow it will collapse:
clip_image006

Data flow groups are 100% eye candy; you can’t set any properties on them.

Data Flow – Data Quality Services Cleansing transformation
With this transformation you can apply data quality rules in the data flow. This is done by using a Knowledge Base, which you can create yourself or download from the Windows Azure Marketplace. For example, you could apply a rule that checks whether a given postal code column is valid for a particular town column in your record.

Data Flow – Data Tap
In Denali we have the possibility to attach a "tap" to a data flow path (arrow). The tap captures all data coming through and dumps it to CSV files (see the sketch below).
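A data tap is attached to a pending execution through the catalog's add_data_tap stored procedure. The T-SQL sketch below shows the idea; the folder, project, package and data flow path names are illustrative and the exact signature may differ slightly between CTP builds:

DECLARE @execution_id BIGINT, @tap_id BIGINT;

-- Create a pending execution for the package
EXEC SSISDB.catalog.create_execution
     @folder_name = N'Test', @project_name = N'Test',
     @package_name = N'Package.dtsx',
     @execution_id = @execution_id OUTPUT;

-- Attach a tap to a specific data flow path; the captured rows are dumped to a CSV file
EXEC SSISDB.catalog.add_data_tap
     @execution_id = @execution_id,
     @task_package_path = N'\Package\Data Flow Task',
     @dataflow_path_id_string = N'Paths[OLE DB Source.OLE DB Source Output]',
     @data_filename = N'DataTapOutput.csv',
     @data_tap_id = @tap_id OUTPUT;

-- Start the execution; the CSV file is written to the DataDumps folder of the SSIS installation
EXEC SSISDB.catalog.start_execution @execution_id;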

2. Configuration

SSIS Denali is fundamentally different from its predecessors when it comes to the concept and usage of configurations. SSIS package configurations are obsolete (they will still be available if you really want to use them ;-)) and parameters and environments are the new kids in town.

Parameters
SSIS Parameters look a lot like SSIS variables, but there are some differences. There are two types of parameters:
1. Package Parameters:
Think of Package Parameters as C# parameters, which are passed as input to a C# function (your package). You can set them when executing (calling) a package, and their lifecycle is limited to the SSIS package itself.

The difference with SSIS variables? You can set the parameters while executing a package with SSMS or an Execute Package Task.

Define your Package Parameters at the Package Parameters tab:
clip_image007

2. Project Parameters:
Identical to Package Parameters except for their scope: these parameters exist at the project level and can be referenced from all SSIS packages in your project, whereas Package Parameters only live inside a single SSIS package.

You can define Project Parameters in the solution explorer within your SSIS project:
clip_image008

Both Package and Project Parameters can be referenced from your SSIS packages; you recognize them by their prefix, $Package or $Project:
clip_image009

Setting parameters in an Execute Package Task is done on the new Parameter bindings tab:
clip_image010

It's also possible to parameterize SSIS tasks on the Control Flow by right-clicking them and choosing Parameterize:
clip_image011

Loads of capabilities here! I now realize that I have created a workaround for Package Parameters with my SSIS Package design pattern for loading a data warehouse where I (mis)used Package Configurations as Package Parameters. Creating a new package design pattern for Denali definitely goes on my TODO list!

Environments
Environments are collections of SSIS package settings that can be defined on the SSIS Server. At runtime, the environment overrides these settings in the SSIS packages. You can create multiple environments, and when you run your SSIS packages you can select which environment they should use. It's also possible to run multiple SSIS projects under one environment, so there is flexibility all around.

To make the principle of Environments clear right away, I have created a simple example that you will commonly use: two Environments, one with development settings and one with production settings.

I have deployed a Test SSIS project, containing one SSIS package with one Shared Connection Manager, to the server. Notice the new Integration Services folder structure in SSMS:
clip_image012

Next right click Environments and choose Create Environment:
clip_image013

Type Development as name for the Environment and click OK:
clip_image014

Now double click the Development Environment:
clip_image015

Click the Variables tab and create an Environment Variable that will override the Shared Connection Manager's connection string. Type the connection string for the development server as the Value and click OK:
clip_image017

Next, create another Environment with the name Production and also create an Environment Variable with the name SCM_EnvironmentVar. The only difference between these two variables should be their value: the Production variable contains the production server connection string instead of the development one.
You now have two Environments, each with one Environment Variable for the Shared Connection Manager:
clip_image018

We now need to reference the Development and Production Environments in the Test project so they can be used. Right click the Test project and choose Configure:
clip_image019

Go to the References page and click Add:
clip_image021

Add both Environments:
clip_image023

Both Environments are now referenced from the Test SSIS project and are available. Let's configure the Shared Connection Manager so it uses the value specified in one of the Environments. Click the Parameters page, then the Connection Managers tab, and then the browse (…) button next to the ConnectionString property of the Shared Connection Manager:
clip_image025

Now select Use Environment Variable and select SCM_EnvironmentVar. Click OK:
clip_image026

The name of our Environment Variable is shown in the Value box of the ConnectionString property:
clip_image028

We have now successfully attached the Environment Variable to override the ConnectionString property with a value from either the Development or the Production Environment. Executing the package under one of the two Environments is very easy. To do so, right-click the package and choose Run:
clip_image029

The Run Package dialog box appears and you instantly see a message that reminds you to assign a value to the ConnectionString property of connection manager SharedConnectionManager:
clip_image030

At the bottom of the dialog box you can select the Environment your package should use. Select the .\Development Environment and click OK:
clip_image031

In the overview report you see the package has successfully run under the Development Environment:
clip_image032

If you change the package to use the Production Environment and you run the package again, you’ll get the following execution information:
clip_image033

Pretty straightforward and easy to use. The big difference? All the configuration work is done in SQL Server Management Studio, exactly the place where it should be done. In the current SSIS versions most of this work is done within Visual Studio, and configuration is done by editing tables. So unfriendly! Big improvement, great work MS!
In this example I have set a property of a Shared Connection Manager, but it is of course also possible to set Package or Project Parameters with Environment Variables.
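The same setup can also be scripted. Below is a minimal T-SQL sketch using the SSISDB catalog stored procedures, following the folder, project and variable names of the example above; the exact parameter names (and the CM.<name>.<property> convention for connection manager properties) are assumptions that may differ slightly between builds:

-- Create the Development environment and its variable
EXEC SSISDB.catalog.create_environment
     @folder_name = N'Test', @environment_name = N'Development';

EXEC SSISDB.catalog.create_environment_variable
     @folder_name = N'Test', @environment_name = N'Development',
     @variable_name = N'SCM_EnvironmentVar', @data_type = N'String', @sensitive = 0,
     @value = N'Data Source=DEVSERVER;Initial Catalog=SourceDB;Integrated Security=SSPI;', -- illustrative connection string
     @description = N'Connection string for the development server';

-- Reference the environment from the Test project (reference_type R = relative, same folder)
DECLARE @reference_id BIGINT;
EXEC SSISDB.catalog.create_environment_reference
     @folder_name = N'Test', @project_name = N'Test',
     @environment_name = N'Development', @reference_type = N'R',
     @reference_id = @reference_id OUTPUT;

-- Bind the shared connection manager's ConnectionString to the environment variable (value_type R = referenced)
EXEC SSISDB.catalog.set_object_parameter_value
     @object_type = 20, @folder_name = N'Test', @project_name = N'Test',
     @parameter_name = N'CM.SharedConnectionManager.ConnectionString',
     @parameter_value = N'SCM_EnvironmentVar', @value_type = 'R';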

3. Deployment

We have had the option to deploy SSIS packages from Visual Studio for a long time, thanks to BIDS Helper. In Denali, Microsoft finally added this functionality out of the box. They even went further: you can now also build your projects from within Visual Studio!

There are two ways to deploy (push) your projects to the new SSIS Server: directly from Visual Studio, or by using an Integration Services Project Deployment File. Both use the new Integration Services Deployment Wizard.
Deploying directly from Visual Studio is done by right-clicking your project and choosing Deploy. The Deployment Wizard then pops straight up:
clip_image034

The other way is to build your project first. This was already possible in SSIS versions prior to Denali, but it did not really serve a useful purpose.

In Denali this is different. Building your project generates an .ispac file that contains your entire project and can be used by the Deployment Wizard. Again a very nice new feature.

Handing over a new release of an SSIS solution to a customer's IT department for deployment can now be done in a very nice, professional manner. .ispac files can also be opened by Visual Studio to import your SSIS project.
Building your project can be done by right clicking your project and choosing Build:
clip_image035

The output window displays the following:
clip_image036

When looking at the bin\Development folder we see the new Test.ispac file:
clip_image037

clip_image038

Double-clicking this file (or choosing Deploy in Visual Studio) starts the new Integration Services Deployment Wizard:

You can now select the project that you wish to deploy. You can either use the .ispac project deployment file or deploy an existing project located in any Integration Services catalog. The latter option is useful when, for example, you wish to deploy a TEST SSIS solution that is already located in one catalog to an ACCEPTANCE environment in another catalog.
clip_image039

The project is loaded and validated:
clip_image040

Next, select the destination, which has to be an SSIS Server:
clip_image041

Review your selections and click on Deploy to start the actual deployment:
clip_image042

The results show a successful deployment. Notice that the protection level has been changed. What happened here?
The wizard clears the protection level, as we know it from Visual Studio, to prevent "cannot decrypt password" errors.

Instead of using protection levels, passwords are stored in clear text inside the package. The entire package (with the passwords in it) is then stored encrypted in the SSIS Server tables you deploy to.
clip_image043

The project has now been deployed to the server:
clip_image044

When you right-click the Test project and choose Versions, you can see the current version of your project:
clip_image046

If you deploy a newer version later on, the existing version(s) stay on the server and you can easily roll back to a previous version if you've made a bad deployment.
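Deployment can also be scripted against the catalog, which is convenient for automated releases. A minimal T-SQL sketch that pushes an .ispac file with the catalog.deploy_project stored procedure (the file path and folder name are illustrative):

-- Read the .ispac file into a varbinary and deploy it to the Test folder of the catalog
DECLARE @ProjectBinary VARBINARY(MAX), @operation_id BIGINT;

SELECT @ProjectBinary = BulkColumn
FROM OPENROWSET(BULK N'C:\Projects\Test\bin\Development\Test.ispac', SINGLE_BLOB) AS ispac;

EXEC SSISDB.catalog.deploy_project
     @folder_name = N'Test',
     @project_name = N'Test',
     @project_stream = @ProjectBinary,
     @operation_id = @operation_id OUTPUT;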

The diagram below shows the entire deployment life cycle (source: TechNet):
clip_image048

4. Management

The new SSIS Server is the central storage and administration point for your SSIS solutions. It is not a standalone server that you need to manage; basically, it is a database with stored procedures.

You now have an Integration Services node available in SSMS when you connect to the database engine:
clip_image049

Under the Integration Services node you will find your SSISDB catalog, which holds all your SSIS solutions with their packages, environments, etc. These objects are physically stored in a SQL Server database with the same name as the catalog:
clip_image050

This database also contains the stored procedures that hold all the programming code for the SSIS Server:
clip_image051

With the help of these stored procedures you can manage your SSIS Server: for example your parameter values and connection managers, and you can override properties by using environments.
Besides the configuration functionality I discussed earlier, you can also implement security to control access to the catalog, at both folder and package level. When you right-click a folder or package in the catalog and choose Properties, you get the following window where you can manage security access:
clip_image052

Finally, you get an out-of-the-box reporting dashboard, built on the out-of-the-box logging functionality that SSIS offers. On this dashboard you can see the execution history of your package and its sub-packages, view which parameters were used, view specific messages, and get a performance report over time.
All of this information is logged automatically whenever a package runs on the server, and it is very detailed; you can even get the row counts between SSIS data flow transformations!

When running a package you are able to select a logging level on the Advanced tab:
clip_image053
You can choose from:
None: turn logging off for performance reasons
Basic: error and warning logging
Performance: detailed trace information
Verbose: diagnostics and fault debugging

When I run my Test package I’ll get the following execution dashboard. There are some hyperlinks that navigate to more detailed reports.
clip_image055

So there is no need to create your own logging framework anymore; it's all out of the box!
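Running packages and choosing a logging level can also be done entirely from T-SQL with the same catalog stored procedures, which is convenient for scheduling. A minimal sketch; folder, project, package and parameter names are illustrative:

DECLARE @execution_id BIGINT;

-- Create a pending execution for the package
EXEC SSISDB.catalog.create_execution
     @folder_name = N'Test', @project_name = N'Test',
     @package_name = N'Package.dtsx',
     @execution_id = @execution_id OUTPUT;

-- Set a (hypothetical) package parameter (object_type 30 = package parameter)
EXEC SSISDB.catalog.set_execution_parameter_value
     @execution_id = @execution_id, @object_type = 30,
     @parameter_name = N'pEnvironmentName', @parameter_value = N'DEV';

-- Set the logging level (object_type 50 = system parameter; 0 = None, 1 = Basic, 2 = Performance, 3 = Verbose)
EXEC SSISDB.catalog.set_execution_parameter_value
     @execution_id = @execution_id, @object_type = 50,
     @parameter_name = N'LOGGING_LEVEL', @parameter_value = 1;

-- Start the execution; the built-in reports will show its progress and results
EXEC SSISDB.catalog.start_execution @execution_id;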

SSIS – Connect to Oracle on a 64-bit machine (Updated for SSIS 2008 R2)

We recently had a few customers where a connection to Oracle on a 64-bit machine was necessary. A quick search on the internet showed that this could be a big problem; I found all kinds of blog and forum posts from developers complaining about it. A lot of developers will recognize the following error message:

Test connection failed because of an error in initializing provider. Oracle client and networking components were not found. These components are supplied by Oracle Corporation and are part of the Oracle Version 7.3.3 or later client software installation.
Provider is unable to function until these components are installed.


After a lot of searching, trying and debugging I think I found the right way to do it!

Problems

Because BIDS is a 32-bit application, on both 32-bit and 64-bit machines, it cannot see the 64-bit Oracle driver. Because of this, connecting to Oracle from BIDS on a 64-bit machine will never work if you only install the 64-bit Oracle client.

Another problem is the "Microsoft Provider for Oracle": this driver only exists in a 32-bit version, and Microsoft has no plans to create a 64-bit one in the near future.

The last problem I know of is in the Oracle client itself: it seems that a connection will never work with the Instant Client, so always use the full client.
There are also a lot of problems with the 10g client, one of them being that the driver can't handle the "(x86)" in the SQL Server installation path. So using the 10g client is not an option!

Solution

  • Download the Oracle 11g full client.
  • Install the 32-bit AND the 64-bit version of the 11g full client (Installation Type: Administrator) and reboot the server afterwards. The 32-bit version is needed for development from BIDS, which is 32-bit; the 64-bit version is needed for production with the SQL Agent, which is 64-bit.
  • Configure the Oracle clients (both 32- and 64-bit) by editing the files tnsnames.ora and sqlnet.ora. Try to do this with an Oracle DBA or, even better, let him/her do it.
  • Use the "Oracle Provider for OLE DB" from SSIS; don't use the "Microsoft Provider for Oracle" because a 64-bit version of it does not exist.
  • Schedule your packages with the SQL Agent.

Background information

  • Visual Studio (BI Dev Studio) is a 32-bit application.
  • SQL Server Management Studio is a 32-bit application.
  • dtexecui.exe is a 32-bit application.
  • dtexec.exe has both 32-bit and 64-bit versions.
  • There are x64 and x86 versions of the Oracle provider available.
  • SQL Agent is a 64-bit process.

My advice to BI consultants is to get an Oracle DBA or professional for the installation and configuration of the two full clients (32- and 64-bit). Tell the DBA to download the biggest client available; that way you can be sure they pick the right one 😉

Testing whether the clients have been installed and configured correctly can be done with the Windows ODBC Data Source Administrator:
Start… Programs… Administrative Tools… Data Sources (ODBC)

ADDITIONAL STEPS FOR SSIS 2008 R2

It seems that, unfortunately, some additional steps are necessary for SQL Server 2008 R2 installations:

1. Open REGEDIT (Start… Run… REGEDIT) on the server and search for the following entry (for the 32-bit driver): HKEY_LOCAL_MACHINE\Software\Microsoft\MSDTC\MTxOCI
Make sure the following values are entered:

image

2. Next, search for the following entry (for the 64-bit driver): HKEY_LOCAL_MACHINE\Software\Wow6432Node\Microsoft\MSDTC\MTxOCI
Make sure the same values as above are entered.

3. Reboot your server.

SSIS Denali as part of “Enterprise Information Management”

While watching the SQL PASS session "What's Coming Next in SSIS?" by Steve Swartz, the Group Program Manager for the SSIS team, an interesting question came up:

Why is SSIS thought of as BI, when we use it so frequently for other sorts of data problems?

Steve's answer was that he breaks the world of data work into three parts:

  • Process of inputs

  • BI

  • Enterprise Information Management
    All the work you have to do when you have a lot of data and need to make it useful, clean it, and get it to the right place. This covers master data management, data quality work, data integration, and lineage analysis to keep track of where the data came from. All of these are part of Enterprise Information Management.
    image

Next, Steve said that Microsoft is developing SSIS as part of a large push in all of these areas in the next release of SQL Server. So SSIS will be, next to a BI tool, part of Enterprise Information Management in the next release of SQL Server.

I'm interested in the different ways people use SSIS; I've basically used it for ETL, data migrations and processing inputs. In which ways do you use SSIS?

The next version of SSIS is coming!

The latest releases of SQL Server contained (almost) no new SSIS features. With the release of SSIS 2008, the ability to use C# scripts, the improved data flow engine and the cached lookup were the most thrilling new features. The release of SQL Server 2008 R2 only gave us a bulk insert mode for the ADO.NET destination, which was a bit disappointing.

Fortunately, Matt Mason from the SSIS team announced that the next version of SQL Server (SQL 11) will contain quite some exciting new functionality for SSIS!

– Undo/Redo support. Finally, this should have been added a long time ago 😉

– Improved copy/paste mechanism. Let’s hope we keep the formatting of components after copy/pasting them!

– Data flow sequence container

– New icons and rounded corners for tasks and transformations

– Improved backpressure for data flow transformations with multiple inputs (for example a Merge Join). When one of the inputs receives too much data compared to the other, the component receiving the data can tell the data flow that it needs more data on the other input

– The Toolbox window will automatically locate and show newly installed custom tasks

I'm curious about the first CTP!

SSIS – Delete all files except for the most recent one

Quite often one or more sources for a data warehouse consist of flat files. Most of the time these files are delivered as a zip file with a date in the file name, for example FinanceDataExport_20100528.zip.

I currently work on a project that does a full load into the data warehouse every night. A zip file with some flat files in it is dropped into a directory on a daily basis. Sometimes there are multiple zip files in the directory; this can happen because the ETL failed or because somebody put a new zip file in the directory manually. Because the ETL isn't incremental, only the most recent file needs to be loaded. To implement this I used the simple code below; it checks which file is the most recent and deletes all other files.

Usage is quite simple: just copy/paste the code into your Script Task and create two SSIS variables:

  • SourceFolder (type String): the folder that contains the (zip) files
  • DateInFilename (type Boolean): a flag; set it to True if your filename ends with the date YYYYMMDD, set it to False if the creation date of the files should be used

Note: In a previous blog post I wrote about unzipping zip files within SSIS, you might also find this useful: SSIS – Unpack a ZIP file with the Script Task

Public Sub Main()

    'Note: this script requires "Imports System.IO" at the top of the script class
    '(for DirectoryInfo, FileInfo and File).

    'Use this piece of code to loop through a set of files in a directory
    'and delete all files except for the most recent one, based on a date in the filename
    'or on the file creation date.

    'File name example:
    'DataExport_20100413.zip

    Dim rootDirectory As New DirectoryInfo(Dts.Variables("SourceFolder").Value.ToString) 'Set the directory in SSIS variable SourceFolder. For example: D:\Export\
    Dim mostRecentFile As String = ""
    Dim currentFileDate As Integer
    Dim mostRecentFileDate As Integer
    Dim currentFileCreationDate As Date
    Dim mostRecentFileCreationDate As Date

    Dim dateInFilename As Boolean = CType(Dts.Variables("DateInFilename").Value, Boolean) 'If your filename ends with the date YYYYMMDD, set SSIS variable DateInFilename to True. If not, set it to False.


    If dateInFilename Then

        'Check which file is the most recent
        For Each fi As FileInfo In rootDirectory.GetFiles("*.zip")

            currentFileDate = CInt(Left(Right(fi.Name, 12), 8)) 'Get date from current filename (based on a file that ends with: YYYYMMDD.zip)

            If currentFileDate > mostRecentFileDate Then

                mostRecentFileDate = currentFileDate
                mostRecentFile = fi.Name

            End If

        Next

    Else 'Date is not in filename, use creation date

        'Check which file is the most recent
        For Each fi As FileInfo In rootDirectory.GetFiles("*.zip")

            currentFileCreationDate = fi.CreationTime 'Get creation date of current file

            If currentFileCreationDate > mostRecentFileCreationDate Then

                mostRecentFileCreationDate = currentFileCreationDate
                mostRecentFile = fi.Name

            End If

        Next

    End If


    'Delete all files except the most recent one
    For Each fi As FileInfo In rootDirectory.GetFiles("*.zip")

        If fi.Name <> mostRecentFile Then

            File.Delete(rootDirectory.ToString + "\" + fi.Name)

        End If

    Next

    Dts.TaskResult = ScriptResults.Success
End Sub

SSIS – Package design pattern for loading a data warehouse – Part 2

Since my last blog post about an SSIS package design pattern I've received quite a few positive reactions and feedback. Microsoft also added a link to the post on the SSIS portal, which made it clear to me that there is quite some interest in this subject.

The feedback I received was mainly about two things:
1. Can you visualize the process, or make it clearer without the whole technical story, so it's easier to understand?
2. How should the Extract phase of the ETL process be implemented when source tables are used by multiple dimension and/or fact tables?

In this post I will try to answer these questions. By doing so I hope to offer a complete design pattern that is usable for most data warehouse ETL solutions developed using SSIS.


SSIS package design pattern for loading a data warehouse

Using one SSIS package per dimension / fact table gives developers and administrators of ETL systems quite a few benefits, and has been advised by Kimball since SSIS was released. I have mentioned these benefits in my previous post and will not repeat them here.

When using a single modular package approach, developers sometimes face problems concerning flexibility or a difficult debugging experience. Therefore, they sometimes choose to spread the logic of a single dimension or fact table over multiple packages. I have thought up a design pattern that keeps the benefits of a single modular package approach while still offering all the flexibility and debugging capabilities developers need.

If you have a little bit of programming knowledge you must have heard about classes and functions. Now think of your SSIS package as a class or object that exists in code. Such classes contain functions that you can call separately from other classes (packages). That would be nice functionality to have, but unfortunately it is not possible within SSIS by default.
To realize this functionality in SSIS I treat SSIS Sequence Containers as functions and SSIS packages as classes.
I personally always use four Sequence Containers in my SSIS packages:
– SEQ Extract (extract the necessary source tables to a staging database)
– SEQ Transform (transform these source tables to a dimension or fact table)
– SEQ Load (load this table into the data warehouse)
– SEQ Process (process the data warehouse table to the cube)

The technical trick that I used (you can read about its inner workings in my previous post) makes it possible to execute only a single Sequence Container within a package, just like calling a function of a class in programming code.
The execution of a single dimension or fact table can now be performed from a master SSIS package like this:

1 – [Execute Package Task] DimCustomer.Extract
2 – [Execute Package Task] DimCustomer.Transform
3 – [Execute Package Task] DimCustomer.Load
4 – [Execute Package Task] DimCustomer.Process

The package is executed 4 times with an Execute Package Task, but each time only the desired function (Sequence Container) will run.

If we look at this in a UML sequence diagram we see the following:
ETL_Example1

I think this sequence diagram gives you a good overview of how this design pattern is organized. For the technical solution and the download of a template package you should check my previous post.


How should the Extract phase of the ETL process be implemented when a single source table is used by multiple dimensions and/or fact tables?

One of the questions that came up with using this design pattern is how to handle the extraction of source tables that are used in multiple dimensions and/or fact tables. The problem here is that a single table would be extracted multiple times which is, of course, undesirable.

By coincidence I was reading the book "SQL Server 2008 Integration Services: Problem – Design – Solution" (which is a great book!) and one of the data extraction best practices (Chapter 5) is to use one package for the extraction of each source table. Each of these packages has a very simple data flow from the source table to the destination table in the staging area.
Of course this approach is more time-consuming than using one big extract package with all table extracts in it, but fortunately it also gives you some benefits:
– Debugging: sometimes a source has changed, e.g. a column's name could have been changed or the column could have been deleted completely. The error that SSIS logs when this occurs will point the administrators straight to the right package and source table. Another benefit here is that only one package will fail and needs to be edited, while the others can still execute and remain unharmed.
– Flexibility: you can execute a single table extract from anywhere (master package or dim/fact package).

I recently created some solutions using this extract approach and really liked it. I used two SSIS projects:
– one with the dimension and fact table packages
– one with only the extract packages
I used the following naming convention for the extract packages, Source_Table.dtsx, and deployed them to a separate SSIS folder. This way the packages don't clutter the overview during development.
A tip here is to use BIDS Helper; it has great functionality to deploy one or more packages from BIDS.

Merging this approach into the design pattern gives the following result:
– The dimension and fact table Extract Sequence Containers no longer contain data flow tasks, but Execute Package Tasks that point to the extract packages.
– The Extract Sequence Container of the master package executes all the necessary extract packages at once.

This way a single source table is only extracted once when executing your ETL from the master package, and you still have the possibility to unit test your entire dimension or fact table packages.
Drawing this approach again in a sequence diagram gives us the following example with a run from the master package (only the green Sequence Containers are executed):
ETL_Example2

And like this with a run of a single Dimension package:
ETL_Example2_Customer

Overall, the design pattern will now always look like this when executed from a master package:
ETL_Overview

Conclusion
I think this design pattern is now good enough to be used as a standard approach for most data warehouse ETL projects using SSIS. Thanks for all the feedback! New feedback is of course more than welcome!

SSIS – Package design pattern for loading a data warehouse

I recently had a chat with some BI developers about the design patterns they use in SSIS when building an ETL system. We all agreed on creating multiple packages for the dimensions and fact tables, and one master package for the execution of all these packages.

These developers even created multiple packages per single dimension/fact table:

  • One extract package where the extract(E) logic of all dim/fact tables is stored
  • One dim/fact package with the transform(T) logic of a single dim/fact table
  • One dim/fact package with the load(L) logic of a single dim/fact table

I like the idea of building the Extract, Transform and Load logic separately, but I do not like the way the logic was spread over multiple packages.
I asked them why they chose this solution, and there were multiple reasons:

  • Enable running the E/T/L parts separately, for example: run only the entire T phase of all dim/fact tables.
  • Run the extracts of all dimensions and fact tables simultaneously to keep the loading window on the source system as short as possible.

To me these are good reasons; running the E/T/L phases separately is something a developer often wants during the development and testing of an ETL system.
Keeping the loading window on the source system as short as possible is critical in some projects.

Despite the good arguments to design their ETL system like this, I still prefer the idea of having one package per dimension / fact table, with complete E/T/L logic, for the following reasons:

  • All the logic is in one place
  • Increased understandability
  • Easier unit testing
  • If there is an issue with a dimension or fact table, you only have to make changes in one place, which is safer and more efficient
  • You can see your packages as separate ETL "puzzle pieces" that are reusable
  • It's good from a project manager's point of view; let your customer accept dimensions and fact tables one by one and freeze the appropriate package afterwards
  • A better overview in BIDS; having an enormous number of packages does not make things clearer 😉
  • Simplified deployment after changes have been made
  • Changes are easier to track in source control systems
  • Team development will be easier; multiple developers can work on different dim/fact tables without bothering each other.

So basically my goal was clear: to build a solution that has all the possibilities the aforesaid developers asked for, but in one package per dimension / fact table; the best of both worlds.

Solution:

The solution I’ve created is based on a parent-child package structure. One parent (master) package will execute multiple child (dim/fact) packages. This solution is based on a single (child) package for each dimension and fact table. Each of these packages contains the following Sequence Containers in the Control Flow: 
 ChildControlFlow 

Normally it would not be possible to execute only the Extract, Transform, Load or (cube) Process Sequence Containers of the child (dim/fact) packages simultaneously.

To make this possible I have created four Parent package variable configurations, one for each ETL phase Sequence Container in the child package:
clip_image003

clip_image005

Each of these configurations is set on the Disable property of one of the Sequence Containers:
clip_image007

Using this technique makes it possible to run separate Sequence Containers of a child package from the master package, simply by disabling or enabling the appropriate Sequence Containers with parent package variables.
Because the default value of the Disable property of the Sequence Containers is False, you can still run an entire standalone child package without the need to change anything.

OK, so far, so good. But how do I execute only one phase of all the dimension and fact packages simultaneously? Well, it's quite simple:

First, add 4 Sequence Containers to the master package, one for each phase of the ETL, just like in the child packages.


Add Execute Package Tasks for all your packages in every Sequence Container

clip_image008


If you executed this master package now, every child package would run 4 times, as there are 4 Execute Package Tasks that run the same package, one in every Sequence Container.
To get the required functionality I have created 4 variables inside each Sequence Container (scope). These are used as parent variables to set the Disable properties in the child packages. So basically I've created 4 variables x 4 Sequence Containers = 16 variables for the entire master package.

Variables for the EXTRACT Sequence Container (vDisableExtract False):
clip_image009

Variables for the TRANSFORM Sequence Container (vDisableTransform False):
clip_image010

The variables of the LOAD and PROCESS Sequence Containers are based on the same technique.

Results:

Run all phases of a standalone package: Just execute the package:
clip_image011

Run a single phase of the ETL system (Extract/Transform/Load/Process): Execute the desired sequence container in the main package:

RunAllTransforms 

Run a single phase of a single package from the master package:
RunSinglePhaseOfOnePackage

Run multiple phases of the ETL system, for example only the T and L: Disable the Sequence Containers of the phases that need to be excluded in the master package:

RunMultiplePhasesAndExcludeOthers

Run all the child packages in the right order from the master package:
When you add a breakpoint on, for example, the LOAD Sequence Container you see that all the child packages are at the same ETL phase as their parent: 
RunCompleteMasterPackageBreakPoint


When pressing Continue the package completes: 
RunCompleteMasterPackageBreakPointCompleted


Conclusion:

This parent/child package design pattern for loading a Data Warehouse gives you all the flexibility and functionality you need. It’s ready and easy to use during development and production without the need to change anything.

With only a single SSIS package for each dimension and fact table you now have the functionality that separate packages would offer. You will be able to, for example, run all the Extracts for all dimensions and fact tables simultaneously, as the developers asked for, and still have the benefits that come with the one-package-per-dimension/fact-table approach.

Of course having a single package per dimension or fact table will not be the right choice in all cases, but I think it is a good standard approach.
The same applies to the ETL phases (Sequence Containers). I use E/T/L/P, but if you have different phases, that is fine; you can still use the same technique.

Download the solution with template packages from the URLs below. The only thing you need to do is change the connection managers that point to the child packages (to your location on disk) and run the master package!

Download for SSIS 2008

Download for SSIS 2005

If you have any suggestions, please leave them as a comment. I would like to know what your design pattern is as well!

ATTENTION: See Part-2 on this subject for more background information!

Backgrounds:

How to: Use the Values of Parent Variables in a Child Package: http://technet.microsoft.com/en-us/library/ms345179.aspx

SSIS – Blowing-out the grain of your fact table

Recently I had to create a fact table with a lower grain than the source database. My source database contained order lines with a start and end date and monthly revenue amounts.

To create reports that showed overall monthly revenue per year, lowering the grain was necessary. Because the lines contained revenue per month I decided to blow out the grain of my fact table to monthly records for all the order lines of the source database. For example, an order line with a start date of 1 January 2009 and an end date of 31 December 2009 should result in 12 order lines in the fact table, one line for each month.

To achieve this result I exploded the source records against my DimDate. I used a standard DimDate:
clip_image001[4]

The query below did the job; use it in an SSIS source component and it will explode the order lines to a monthly grain:

Code Snippet

SELECT OL.LineId
      ,DD.ActualDate
      ,OL.StartDate
      ,OL.EndDate
  FROM OrderLine OL
  INNER JOIN DimDate DD
      ON DD.Month
      BETWEEN (YEAR(OL.StartDate)*100+MONTH(OL.StartDate))
          AND (YEAR(OL.EndDate)*100+MONTH(OL.EndDate))
  WHERE DD.DayOfMonth = 1

Some explanation about this query below:

· I always want to connect a record to the first day of the month in DimDate; that's why this WHERE clause is used:

Code Snippet

WHERE DD.DayOfMonth = 1

· Because I want to join on the month (format: YYYYMM) of DimDate, I need to format the start and end dates in the same way (YYYYMM):

Code Snippet

(YEAR(OL.StartDate)*100+MONTH(OL.StartDate))

The source, order lines with a start and end date:
clip_image002[4]

The Result, monthly order lines:
clip_image003[4]
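For readers who want to try the explode-to-months technique without the original source system, the self-contained sketch below reproduces it with table variables for DimDate and OrderLine (all table and column names are illustrative); the final SELECT returns 12 monthly rows for the single order line:

-- Minimal, self-contained illustration of the explode-to-months technique
DECLARE @DimDate TABLE (ActualDate DATE, [Month] INT, DayOfMonth INT);
DECLARE @OrderLine TABLE (LineId INT, StartDate DATE, EndDate DATE);

-- Fill DimDate with the first day of every month of 2009 (12 rows)
;WITH Months AS (
    SELECT CAST('2009-01-01' AS DATE) AS FirstOfMonth
    UNION ALL
    SELECT DATEADD(MONTH, 1, FirstOfMonth) FROM Months WHERE FirstOfMonth < '2009-12-01'
)
INSERT INTO @DimDate (ActualDate, [Month], DayOfMonth)
SELECT FirstOfMonth, YEAR(FirstOfMonth) * 100 + MONTH(FirstOfMonth), 1 FROM Months;

-- One order line that runs for the whole year
INSERT INTO @OrderLine VALUES (1, '2009-01-01', '2009-12-31');

-- The join from the post: returns 12 monthly rows for LineId 1
SELECT OL.LineId, DD.ActualDate, OL.StartDate, OL.EndDate
FROM @OrderLine OL
INNER JOIN @DimDate DD
    ON DD.[Month] BETWEEN YEAR(OL.StartDate) * 100 + MONTH(OL.StartDate)
                      AND YEAR(OL.EndDate) * 100 + MONTH(OL.EndDate)
WHERE DD.DayOfMonth = 1;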