Orchestrate Azure Data Factory pipelines and other Azure Data Platform management tasks using Azure Automation

Azure Data Factory (ADF) is a great SaaS solution to compose and orchestrate your Azure data services. It works well for creating, scheduling and managing your data pipelines, but it has limitations that can make it hard to use in some scenarios. The two main ones are:

1.       Some tasks that you regularly want to perform can’t be accomplished with ADF.
A few examples are:

·         Start/Pause an Azure SQL Data Warehouse

·         Start/Pause an Azure Analysis Services instance

·         Process an Azure Analysis Services cube

·         Manage on-premises resources

 

2.       Triggering (running) ADF pipelines on demand.
For most people it is hard to switch from the well-known SQL Agent jobs – in which it’s really easy to trigger SSIS processes and execute other tasks – to scheduling ADF pipelines. You now have to create activity windows and define data slices that depend on the availability of data sets. Most people would like to be able to trigger an ADF pipeline on demand, or when some other task finishes successfully.

The tasks listed at point 1 – and almost everything else you can think of – can be done using Azure PowerShell with Azure Resource Manager. Microsoft does an awesome job here: every new Azure service has great ARM support from the start. To make use of PowerShell in Azure, and more importantly to automate your scripts and processes, Azure Automation comes into play. It makes it possible to run all your PowerShell scripts in Azure as a SaaS solution. It is an orchestrator in which you can execute all kinds of Data Platform related operations, both in Azure and on-premises using Hybrid Runbook Workers, enabling hybrid Data Platform orchestration scenarios.
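To give an idea of how little code such a task takes, below is a minimal sketch that pauses and resumes an Azure SQL Data Warehouse using the AzureRM.Sql cmdlets. The resource group, server and database names are placeholders you would replace with your own.

# Placeholder names; replace with your own resource group, logical server and data warehouse
$rgn = "AzureDataPlatformOrchestration"
$srv = "adpo-sqlserver"
$dwh = "adpo-dwh"

# Pause the Azure SQL Data Warehouse to stop paying for compute
Suspend-AzureRmSqlDatabase -ResourceGroupName $rgn -ServerName $srv -DatabaseName $dwh

# Resume it again when the compute is needed
Resume-AzureRmSqlDatabase -ResourceGroupName $rgn -ServerName $srv -DatabaseName $dwh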

Triggering ADF pipelines on demand is a bit trickier. An “Execute now” command does not exist in PowerShell, which is understandable if you know how the scheduling and execution mechanism of ADF has been built. I don’t think this will change easily, unless major changes are made to ADF. Fortunately, there is a workaround! If you have ever deployed a new ADF pipeline you might have noticed that after deployment, pipelines start immediately when they have an activity window with a date/time in the past. Because it is possible to update the activity window date/time of a pipeline using PowerShell, it’s also possible to trigger a pipeline’s execution by changing its activity window date/time to some value in the past.

In this blog post I will show how you can trigger an ADF pipeline on a daily basis and monitor its execution using PowerShell code that runs in an Azure Automation runbook. Because executing other tasks (point 1) using PowerShell is easy, it becomes possible to orchestrate advanced Azure Data Platform tasks, with the execution of an ADF pipeline as just a part of the total process.

Azure Setup

In this example I’ve used the Azure setup shown below. The ADF pipeline contains one simple copy activity that copies a file from one blob storage container to another.

·         Resource Group

o   Blob Storage

§  Container: input

§  Container: output

o   Azure Data Factory

o   Azure Automation


Azure Data Factory Setup

I created the Azure Data Factory pipeline with the Copy Data wizard:

[Screenshot: Copy Data wizard]

I configured the pipeline to “Run regularly on schedule” with a recurring pattern of “Daily”, “every 1 day” (see the blue rectangle in the screenshot below).

Choosing “Run once now” would set the pipeline property “PipelineMode” to “OneTime” and would disable the ADF scheduler. At first this might seem like the option you want, since we are going to trigger the pipeline ourselves on demand, but unfortunately that configuration has some limitations:

·         The ADF diagram view does not show one-time pipelines. This makes it impossible to monitor the pipeline using the Monitor & Manage dashboard.

·         One-time pipelines can’t be updated. This would block us from updating the activity window properties to trigger the pipeline.

The “Start date time” is automatically set to yesterday with the current time. This causes the pipeline to start running immediately after deployment; as explained, ADF automatically triggers pipelines with a start date/time in the past. In this case we want to trigger the execution ourselves, so set the date to some date in the future (see the red rectangle in the screenshot below).

[Screenshot: Copy Data wizard scheduling page, with the recurrence (blue) and start date time (red) highlighted]

For this pipeline I chose to create a simple example that copies a file from one blob storage container to another. Of course, you can design your pipeline any way you prefer, e.g. using the Copy Data wizard or Visual Studio. Just make sure you schedule it with a daily recurrence.
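If you want to double-check the result of the wizard from PowerShell, you can retrieve the deployed pipeline and inspect its properties (active period, paused state). A quick sketch, using the same resource names as in the runbook script later in this post; the exact property names on the returned object may differ slightly:

# Placeholder names matching the rest of this post
$rgn = "AzureDataPlatformOrchestration"
$dfn = "adpo-adf"
$pln = "CopyPipeline-cu6"

# Show the deployed pipeline definition, including its active period and paused state
Get-AzureRmDataFactoryPipeline -ResourceGroupName $rgn -DataFactoryName $dfn -Name $pln |
    Select-Object -ExpandProperty Properties |
    Format-List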


Azure Automation Setup

The next step is to create a new Azure Automation account. Within that account we need to import the Data Factory PowerShell cmdlets, because the standard set of PowerShell modules in Automation does not contain any ADF-related functions.
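If you prefer scripting this setup over clicking through the portal, the module import can also be done with the AzureRM.Automation cmdlets. A sketch under some assumptions: the Automation account name is a placeholder, and the PowerShell Gallery package URL used as content link is an assumption that may need adjusting. Otherwise, follow the portal steps below.

# Placeholder names; the gallery content link is an assumption
$rgn = "AzureDataPlatformOrchestration"
$aan = "adpo-automation"

# Import the AzureRM.DataFactories module into the Automation account
New-AzureRmAutomationModule -ResourceGroupName $rgn -AutomationAccountName $aan `
    -Name "AzureRM.DataFactories" `
    -ContentLink "https://www.powershellgallery.com/api/v2/package/AzureRM.DataFactories"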

Navigate to Assets:

[Screenshot: Assets]

Click Modules:
[Screenshot: Modules]

Browse the gallery:
[Screenshot: Browse gallery]

 

Search for “Data Factory”, select AzureRM.DataFactories and click Import:
[Screenshot: Import AzureRM.DataFactories from the gallery]

AzureRM.DataFactories should now appear in your list of available modules:
[Screenshot: Modules list with AzureRM.DataFactories]

 

Create Credential

Now we need to create a Credential to be able to log in automatically and run our PowerShell script unattended from Azure Automation.

Navigate to Assets again and then click Credentials
[Screenshot: Credentials]

 

Click “Add a credential” and supply a user account that has the required permissions to access your Azure Data Factory. You can use the organizational account you use to log in to the Azure Portal. It might look like jorgk@yourorganizationalaccountname.com or something like that.
[Screenshot: New credential]
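As an alternative to the portal, the credential asset can also be created with PowerShell. A sketch, assuming the same hypothetical Automation account name as before and using the credential name “adpo-auto-cred” that the runbook script below expects:

# Placeholder names; adjust to your own environment
$rgn = "AzureDataPlatformOrchestration"
$aan = "adpo-automation"

# Prompt for the account that has access to the Data Factory
$cred = Get-Credential -Message "Account with access to the Data Factory"

# Create the credential asset used by the runbook
New-AzureRmAutomationCredential -ResourceGroupName $rgn -AutomationAccountName $aan `
    -Name "adpo-auto-cred" -Value $cred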

Automation Runbook
We are now ready to create the Automation Runbook that will trigger the Azure Data Factory pipeline by updating its active period to a date/time in the past.

The script performs the following steps:

1.       Authenticate with the Automation Credential

2.       Connect to the Azure Data Factory

3.       Update the pipeline active period to yesterday

4.       Unpause the pipeline; execution will begin

5.       Monitor the pipeline execution

6.       Pause the pipeline

Copy/paste the script below to a Windows PowerShell Script (.ps1) file and name it “TriggerAdfPipeline.ps1”.

# Variables; modify
$rgn = "AzureDataPlatformOrchestration" #Resource Group Name
$acn = "adpo-auto-cred" #Automation Credential Name
$dfn = "adpo-adf" #Data Factory Name
$pln = "CopyPipeline-cu6" #PipeLine Name
$dsn = "OutputDataset-hgv" #DataSet Name (output dataset of pipeline that needs to be produced)

# To test from PowerShell client, uncomment the 2 rows below and provide subscription ID
#Login-AzureRmAccount
#Set-AzureRMContext -SubscriptionId "00000000-0000-0000-0000-000000000000"

# Authenticate
# To test from PowerShell client, comment out the 2 rows below
$AzureCred = Get-AutomationPSCredential -Name $acn
Add-AzureRmAccount -Credential $AzureCred | Out-Null

# Get data factory object
$df=Get-AzureRmDataFactory -ResourceGroupName $rgn -Name $dfn
If($df) {
           Write-Output "Connected to data factory $dfn in resource group $rgn."
}

# Create start/end DateTime (yesterday)
$sdt = [System.DateTime]::Today.AddDays(-1) #Yesterday 12:00:00 AM
$edt = [System.DateTime]::Today.AddSeconds(-1) #Yesterday 11:59:59 PM

# Update active period to yesterday
$apr=Set-AzureRmDataFactoryPipelineActivePeriod -DataFactory $df -PipelineName $pln -StartDateTime $sdt -EndDateTime $edt
If($apr) {
           Write-Output "Pipeline $pln of data factory $dfn updated with StartDateTime $sdt and EndDateTime $edt."
}

# Unpause pipeline
$rpl = Resume-AzureRmDataFactoryPipeline -DataFactory $df -Name $pln
If($rpl) {
           Write-Output "Pipeline $pln resumed."
}

# Create arrays that hold all possible data factory slice states
$failedStates = "Failed Validation", "Timed Out", "Skip", "Failed"
$pendingStates = "Retry Validation", "PendingValidation", "Retry", "InProgress", "PendingExecution"
$finishedStates = "Ready"

# Wait while data factory slice is in pending state
While (Get-AzureRmDataFactorySlice -DataFactory $df -DatasetName $dsn -StartDateTime $sdt -EndDateTime $edt | Where -Property State -In $pendingStates) {
           Write-Output "Slice status is:"
           Get-AzureRmDataFactorySlice -DataFactory $df -DatasetName $dsn -StartDateTime $sdt -EndDateTime $edt | Select -Property State
           Write-Output "Wait 15 seconds"
           Start-Sleep 15
}

# Since data factory slice is not pending (anymore), it is either failed or finished
If(Get-AzureRmDataFactorySlice -DataFactory $df -DatasetName $dsn -StartDateTime $sdt -EndDateTime $edt | Where -Property State -In $failedStates){
           Write-Output "Slice failed."
           Get-AzureRmDataFactorySlice -DataFactory $df -DatasetName $dsn -StartDateTime $sdt -EndDateTime $edt | Select -Property State
    } ElseIf (Get-AzureRmDataFactorySlice -DataFactory $df -DatasetName $dsn -StartDateTime $sdt -EndDateTime $edt | Where -Property State -In $finishedStates) {
           Write-Output "Slice finished."
    } Else {
           Write-Output "No State found?"
           Get-AzureRmDataFactorySlice -DataFactory $df -DatasetName $dsn -StartDateTime $sdt -EndDateTime $edt | Select -Property State
}

# Pause pipeline
$spr = Suspend-AzureRmDataFactoryPipeline -DataFactory $df -Name $pln
If($spr){
           Write-Output "Pipeline $pln paused."
}

Navigate to your Azure Automation account in the Azure Portal and click “Runbooks”:

[Screenshot: Runbooks]

Click “Add a runbook”:

[Screenshot: Add a runbook]

Now select the TriggerAdfPipeline.ps1 file to import the PowerShell script as Runbook:

[Screenshot: Import runbook from TriggerAdfPipeline.ps1]

 

Runbook TriggerAdfPipeline is created. Click it to open it:

[Screenshot: Runbook TriggerAdfPipeline]

 

A Runbook must be published before you are able to start or schedule it. Click Edit:

[Screenshot: Edit runbook]

 

Before publishing, test the Runbook first. Click on the Test pane button and then click Start:

[Screenshot: Test pane]

 

The pipeline slice ran successfully:

[Screenshot: Test output, slice finished successfully]

Now publish the Runbook:

[Screenshot: Publish runbook]

You can now create webhooks, schedule and monitor your Runbook jobs:

[Screenshot: Runbook webhooks, schedules and jobs]
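Once a webhook has been created for the runbook, any process that can make an HTTP POST request can start it, which is an easy way to trigger the whole flow from another tool. A minimal sketch; the webhook URL below is a placeholder for the URL you get (only once) when creating the webhook:

# Placeholder webhook URL; copy the real one when you create the webhook
$webhookUri = "https://s1events.azure-automation.net/webhooks?token=..."

# Start the runbook by posting to its webhook
Invoke-RestMethod -Method Post -Uri $webhookUri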

Extending the script with additional tasks

It’s easy to perform other Azure Data Platform related tasks using PowerShell and Automation Runbooks. Because the script I created waits for the ADF pipeline to finish, you can easily execute other tasks before or after the pipeline’s execution. The following examples will most likely be useful (a short sketch for the Analysis Services case follows the list):

·         Pause/Start Azure SQL Data warehouse

o   Documentation

o   PowerShell Gallery

·         Pause/Start Azure Analysis Services

o   PowerShell Gallery

·         Process Azure Analysis Services

o   See my blog post Process Azure Analysis Services databases from Azure Automation
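As an illustration of the Analysis Services bullets above, resuming the instance before the pipeline runs and pausing it afterwards only takes a few extra lines around the existing runbook script. A sketch with placeholder names, assuming the AzureRM.AnalysisServices module has been imported into the Automation account:

# Placeholder names; adjust to your own environment
$rgn = "AzureDataPlatformOrchestration"
$aas = "adpoaas"

# Before triggering the ADF pipeline: start the Azure Analysis Services instance
Resume-AzureRmAnalysisServicesServer -ResourceGroupName $rgn -Name $aas

# ... trigger and monitor the ADF pipeline here (see TriggerAdfPipeline.ps1) ...

# After the pipeline has finished: pause the instance again
Suspend-AzureRmAnalysisServicesServer -ResourceGroupName $rgn -Name $aas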

 

Update: Triggering/calling pipelines for a second time

The script provided in this blog post only works if you call your pipeline no more than once a day. If you want to trigger your pipeline multiple times a day, this solution needs a minor modification.

  1. The pipeline schedule should be “every 15 minutes” instead of daily
  2. Update the “TriggerAdfPipeline.ps1” script so it also works when the pipeline is triggered again within the same day:

Change:

# Create start/end DateTime (yesterday)
$sdt = [System.DateTime]::Today.AddDays(-1) #Yesterday 12:00:00 AM
$edt = [System.DateTime]::Today.AddSeconds(-1) #Yesterday 11:59:59 PM


To:

# Create start/end DateTime
$sdt = Get-Date
$sdt = $sdt.AddMinutes(-2) #Now -2 minutes
$edt = Get-Date
$edt = $edt.AddMinutes(-1) #Now -1 minute

7 comments

  1. That is really helpful, thank you.  I am aware of the Automation piece, but had not looked at it and so had not brought it into my daily processes/workflows.  But now I will!


  2. Hi,
    thanks for this great post.
    Unfortunately I think this method only works when calling the pipeline for the first time. The “second time”, setting the active period to “yesterday” does not force the pipeline to run again. The period has to be set to a date before the last successful pipeline run (if the last forced run was yesterday, then to re-run it I have to set the active period to two days ago).
    Setting the active date period to the future at the end of the PowerShell script doesn’t solve the problem either.
    Regards


  3. There is a workaround to make it work for the “second time”, but the ADF pipeline can only be triggered once every 15 minutes:
    1. The ADF pipeline schedule should be “every 15 minutes” instead of daily
    2. Update the “TriggerAdfPipeline.ps1” script to make it work for the “second time”:
    $sdt = [System.DateTime]::Today.AddDays(-1) #Yesterday 12:00:00 AM
    $edt = [System.DateTime]::Today.AddSeconds(-1) #Yesterday 11:59:59 PM
    to
    $sdt = Get-Date
    $sdt = $sdt.AddMinutes(-2)
    $edt = Get-Date
    $edt = $edt.AddMinutes(-1)


  4. How do I get the runtime of a dataset or a pipeline?
    If I’m looping through each dataset or pipeline, how do I subtract the processing start time from the processing end time?


  5. I tried to run the script (TriggerAdfPipeline.ps1) using the email and password I normally use to log in to Azure, and I get the following error messages:
    1. Login-AzureRmAccount : Unable to find an entry point named ‘GetPerAdapterInfo’ in DLL ‘iphlpapi.dll’.
    2. Get-AutomationPSCredential : Credentials asset not found. To create this Credentials asset, navigate to the Assets blade and create a Credentials asset named: TestAcct.

