Replication Services as ETL extraction tool

In my last blog post I explained the principles of Replication Services and the possibilities it offers in a BI environment. One of the possibilities I described was the use of snapshot replication as an ETL extraction tool:
“Snapshot Replication can also be useful in BI environments, if you don’t need a near real-time copy of the database, you can choose to use this form of replication. Next to an alternative for Transactional Replication it can be used to stage data so it can be transformed and moved into the data warehousing environment afterwards.
In many solutions I have seen developers create multiple SSIS packages that simply copies data from one or more source systems to a staging database that figures as source for the ETL process. The creation of these packages takes a lot of (boring) time, while Replication Services can do the same in minutes. It is possible to filter out columns and/or records and it can even apply schema changes automatically so I think it offers enough features here. I don’t know how the performance will be and if it really works as good for this purpose as I expect, but I want to try this out soon!”

Well I have tried it out and I must say it worked well. I was able to let replication services do work in a fraction of the time it would cost me to do the same in SSIS.
What I did was the following:

  1. Configure snapshot replication for some Adventure Works tables, this was quite simple and straightforward.
  2. Create an SSIS package that executes the snapshot replication on demand and waits for its completion.
    This is something that you can’t do with out of the box functionality. While configuring the snapshot replication two SQL Agent Jobs are created, one for the creation of the snapshot and one for the distribution of the snapshot. Unfortunately these jobs are  asynchronous which means that if you execute them they immediately report back if the job started successfully or not, they do not wait for completion and report its result afterwards. So I had to create an SSIS package that executes the jobs and waits for their completion before the rest of the ETL process continues.

Fortunately I was able to create the SSIS package with the desired functionality. I have made a step-by-step guide that will help you configure the snapshot replication and I have uploaded the SSIS package you need to execute it.

Configure snapshot replication

0 
The first step is to create a publication on the database you want to replicate.
Connect to SQL Server Management Studio and right-click Replication, choose for New.. Publication…

1 
The New Publication Wizard appears, click Next

2
Choose your “source” database and click Next

3
Choose Snapshot publication and click Next

4 
You can now select tables and other objects that you want to publish

5
Expand Tables and select the tables that are needed in your ETL process

6
In the next screen you can add filters on the selected tables which can be very useful. Think about selecting only the last x days of data for example.

7
Its possible to filter out rows and/or columns. In this example I did not apply any filters.

8
Schedule the Snapshot Agent to run at a desired time, by doing this a SQL Agent Job is created which we need to execute from a SSIS package later on.

9
Next you need to set the Security Settings for the Snapshot Agent. Click on the Security Settings button.

10 
In this example I ran the Agent under the SQL Server Agent service account. This is not recommended as a security best practice. Fortunately there is an excellent article on TechNet which tells you exactly how to set up the security for replication services. Read it here and make sure you follow the guidelines!

11 
On the next screen choose to create the publication at the end of the wizard

12
Give the publication a name (SnapshotTest) and complete the wizard

13 
The publication is created and the articles (tables in this case) are added

Now the publication is created successfully its time to create a new subscription for this publication.

14 
Expand the Replication folder in SSMS and right click Local Subscriptions, choose New Subscriptions

15 
The New Subscription Wizard appears

16 
Select the publisher on which you just created your publication and select the database and publication (SnapshotTest)

17 
You can now choose where the Distribution Agent should run. If it runs at the distributor (push subscriptions) it causes extra processing overhead. If you use a separate server for your ETL process and databases choose to run each agent at its subscriber (pull subscriptions) to reduce the processing overhead at the distributor.

18
Of course we need a database for the subscription and fortunately the Wizard can create it for you. Choose for New database

19 
Give the database the desired name, set the desired options and click OK

20
You can now add multiple SQL Server Subscribers which is not necessary in this case but can be very useful.

21 
You now need to set the security settings for the Distribution Agent. Click on the …. button

22
Again, in this example I ran the Agent under the SQL Server Agent service account. Read the security best practices here

23 
Click Next

24 
Make sure you create a synchronization job schedule again. This job is also necessary in the SSIS package later on.

25
Initialize the subscription at first synchronization

26
Select the first box to create the subscription when finishing this wizard

27
Complete the wizard by clicking Finish

28
The subscription will be created

29
In SSMS you see a new database is created, the subscriber. There are no tables or other objects in the database available yet because the replication jobs did not ran yet.

Now expand the SQL Server Agent, go to Jobs and search for the job that creates the snapshot:

 30
Rename this job to “CreateSnapshot”

Now search for the job that distributes the snapshot:

 31
Rename this job to “DistributeSnapshot”

Create an SSIS package that executes the snapshot replication

We now need an SSIS package that will take care of the execution of both jobs. The CreateSnapshot job needs to execute and finish before the DistributeSnapshot job runs. After the DistributeSnapshot job has started the package needs to wait until its finished before the package execution finishes.
The Execute SQL Server Agent Job Task is designed to execute SQL Agent Jobs from SSIS. Unfortunately this SSIS task only executes the job and reports back if the job started succesfully or not, it does not report if the job actually completed with success or failure. This is because these jobs are asynchronous.

The SSIS package I’ve created does the following:

  1. It runs the CreateSnapshot job
  2. It checks every 5 seconds if the job is completed with a for loop
  3. When the CreateSnapshot job is completed it starts the DistributeSnapshot job
  4. And again it waits until the snapshot is delivered before the package will finish successfully

33

Quite simple and the package is ready to use as standalone extract mechanism. After executing the package the replicated tables are added to the subscriber database and are filled with data:

 32

Download the SSIS package here (SSIS 2008)

Conclusion

In this example I only replicated 5 tables, I could create a SSIS package that does the same in approximately the same amount of time. But if I replicated all the 70+ AdventureWorks tables I would save a lot of time and boring work! With replication services you also benefit from the feature that schema changes are applied automatically which means your entire extract phase wont break. Because a snapshot is created using the bcp utility (bulk copy) it’s also quite fast, so the performance will be quite good.

Disadvantages of using snapshot replication as extraction tool is the limitation on source systems. You can only choose SQL Server or Oracle databases to act as a publisher.

So if you plan to build an extract phase for your ETL process that will invoke a lot of tables think about replication services, it would save you a lot of time and thanks to the Extract SSIS package I’ve created you can perfectly fit it in your usual SSIS ETL process.

Replication Services in a BI environment

In this blog post I will explain the principles of SQL Server Replication Services without too much detail and I will take a look on the BI capabilities that Replication Services could offer in my opinion.

SQL Server Replication Services provides tools to copy and distribute database objects from one database system to another and maintain consistency afterwards. These tools basically copy or synchronize data with little or no transformations, they do not offer capabilities to transform data or apply business rules, like ETL tools do.
The only “transformations” Replication Services offers is to filter records or columns out of your data set. You can achieve this by selecting the desired columns of a table and/or by using WHERE statements like this:
SELECT <published_columns> FROM [Table] WHERE [DateTime] >= getdate() – 60

There are three types of replication:

Transactional Replication

Transactional replication components and data flow
This type replicates data on a transactional level. The Log Reader Agent reads directly on the transaction log of the source database (Publisher) and clones the transactions to the Distribution Database (Distributor), this database acts as a queue for the destination database (Subscriber). Next, the Distribution Agent moves the cloned transactions that are stored in the Distribution Database to the Subscriber.
The Distribution Agent can either run at scheduled intervals or continuously which offers near real-time replication of data!

So for example when a user executes an UPDATE statement on one or multiple records in the publisher database, this transaction (not the data itself) is copied to the distribution database and is then also executed on the subscriber. When the Distribution Agent is set to run continuously this process runs all the time and transactions on the publisher are replicated in small batches (near real-time), when it runs on scheduled intervals it executes larger batches of transactions, but the idea is the same.

Snapshot Replication

Snapshot replication components and data flow
This type of replication makes an initial copy of database objects that need to be replicated, this includes the schemas and the data itself. All types of replication must start with a snapshot of the database objects from the Publisher to initialize the Subscriber. Transactional replication need an initial snapshot of the replicated publisher tables/objects to run its cloned transactions on and maintain consistency.

The Snapshot Agent copies the schemas of the tables that will be replicated to files that will be stored in the Snapshot Folder which is a normal folder on the file system. When all the schemas are ready, the data itself will be copied from the Publisher to the snapshot folder. The snapshot is generated as a set of bulk copy program (BCP) files. Next, the Distribution Agent moves the snapshot to the Subscriber, if necessary it applies schema changes first and copies the data itself afterwards. The application of schema changes to the Subscriber is a nice feature, when you change the schema of the Publisher with, for example, an ALTER TABLE statement, that change is propagated by default to the Subscriber(s).

Merge Replication
Merge replication is typically used in server-to-client environments, for example when subscribers need to receive data, make changes offline, and later synchronize changes with the Publisher and other Subscribers, like with mobile devices that need to synchronize one in a while. Because I don’t really see BI capabilities here, I will not explain this type of replication any further.

Replication Services in a BI environment
Transactional Replication can be very useful in BI environments. In my opinion you never want to see users to run custom (SSRS) reports or PowerPivot solutions directly on your production database, it can slow down the system and can cause deadlocks in the database which can cause errors. Transactional Replication can offer a read-only, near real-time database for reporting purposes with minimal overhead on the source system.

Snapshot Replication can also be useful in BI environments, if you don’t need a near real-time copy of the database, you can choose to use this form of replication. Next to an alternative for Transactional Replication it can be used to stage data so it can be transformed and moved into the data warehousing environment afterwards.
In many solutions I have seen developers create multiple SSIS packages that simply copies data from one or more source systems to a staging database that figures as source for the ETL process. The creation of these packages takes a lot of (boring) time, while Replication Services can do the same in minutes. It is possible to filter out columns and/or records and it can even apply schema changes automatically so I think it offers enough features here. I don’t know how the performance will be and if it really works as good for this purpose as I expect, but I want to try this out soon!

Update:
I got a questing regarding the supported Replication Services features in the different versions of SQL Server (Standard,Enterprise,etc). There is a nice table on MSDN that shows this!