Azure SQL Database scheduled autoscaling

One of the major benefits of Azure is scalability: the cloud is built to be elastic so it can be as cost-effective as possible. Azure solutions can be scaled in ways that would never be possible with your own hardware. You can scale up and out when demand arises and, more importantly, scale down and in when you no longer need all the resources, saving a lot of money.

Autoscaling based on a schedule allows you to scale your solution according to predictable resource demand. For example, you might require the highest capacity on Monday during peak hours, while traffic decreases during the rest of the week, allowing you to scale down. Outside business hours and during weekends you could then scale down further to the absolute minimum.

Azure SQL Database currently offers four service tiers: Basic, Standard, Premium, and Premium RS. Databases are isolated from each other and portable. You can change between tiers easily and without downtime.
Azure SQL database tiers and performance levels


Scheduled autoscaling with Azure Automation

To automate scaling of Azure SQL Databases based on a schedule I have created an Azure Automation PowerShell runbook called AzureSQLDatabaseScheduledAutoScaling. You can easily import this runbook into your Azure Automation account and then execute it against your database by entering values for a few parameters that specify your database details, your Automation Run As account and a database-specific scaling schedule. A step-by-step guide can be found at the bottom of this post.

Scheduling of the runbook itself can be done by creating a recurring hourly Automation Schedule. A single runbook can have multiple jobs running at one time. This means it is possible to create multiple schedules for the AzureSQLDatabaseScheduledAutoScaling runbook, enabling scaling of multiple databases simultaneously.

Scaling of the database will only be performed if the runbook execution moment matches a day/time slot specified in the scaling schedule parameter. If the runbook executes and the database is already in the desired edition/tier, no work will be done. Don't worry about the costs of Azure Automation: the first 500 run minutes are free and after that you pay €0.002 per minute (as of September 2017).

Azure Automation Runbook Architecture

When you start using this solution, keep the following in mind:

  • The length of time for a switch of edition/tier varies, but is generally under 4 seconds, and under 30 seconds in 99% of cases. The duration of the entire scale process depends on both the size and service tier of the database before and after the change. For example, a 250 GB database that is changing to, from, or within a Standard service tier should complete within 6 hours. For a database of the same size that is changing performance levels within the Premium service tier, it should complete within 3 hours.
  • Behind the scenes, changing the edition/tier of a database creates a replica of the original database at the new edition/tier, and then switches connections over to the replica. No data is lost during this process. During the switch over to the replica, connections to the database are disabled, so some transactions in flight may be rolled back. If there are large numbers of transactions in flight at the moment connections are disabled, the length of time for the switch over may be longer.
  • To scale a database down, the database must be smaller than the maximum allowed size of the target service tier.
  • Objects that are not supported in the target tier can become inaccessible. For example, when you scale down a Premium or Premium RS database that contains clustered columnstore indexes to Standard or Basic, the tables will be inaccessible until you scale back up to Premium (RS) again.
  • The Premium RS tier is still in preview and currently not yet supported by the Set-AzureRmSqlDatabase cmdlet that is used in the code of this solution. Trying to scale to it results in the following error message: Cannot bind parameter 'Edition'. Cannot convert value "Premium RS" to type "Microsoft.Azure.Commands.Sql.Database.Model.DatabaseEdition". Error: "Unable to match the identifier name Premium RS to a valid enumerator name. Specify one of the following enumerator names and try again: None, Premium, Basic, Standard, DataWarehouse, Free"


The Runbook PowerShell script

The PowerShell script below can be used in an Azure Automation Runbook with type “PowerShell”.

The script contains the following logic:

  1. Authenticate with Azure Automation Run As account.
  2. Get the current day of week and current time, and convert it to the time zone that is specified.
  3. Check which of the specified ‘slots’ in the scaling schedule is currently active.
  4. Scale database to edition/tier that is specified in the scaling schedule.
    If no schedule for the current day/time is found, scale to default edition/tier.
    If database is already in target edition/tier, do nothing.

.SYNOPSIS
Vertically scale an Azure SQL Database up or down according to a
schedule using Azure Automation.

.DESCRIPTION
This Azure Automation runbook enables vertically scaling of
an Azure SQL Database according to a schedule. Autoscaling based
on a schedule allows you to scale your solution according to
predictable resource demand. For example you could require a
high capacity (e.g. P2) on Monday during peak hours, while the rest
of the week the traffic is decreased allowing you to scale down
(e.g. P1). Outside business hours and during weekends you could then
scale down further to a minimum (e.g. S0). This runbook
can be scheduled to run hourly. The code checks the
scalingSchedule parameter to decide if scaling needs to be
executed, or if the database is in the desired state already and
no work needs to be done. The script is Timezone aware.

.PARAMETER resourceGroupName
Name of the resource group to which the database server is
assigned.

.PARAMETER azureRunAsConnectionName
Azure Automation Run As account name. Needs to be able to access
the $serverName.

.PARAMETER serverName
Azure SQL Database server name.

.PARAMETER databaseName
Azure SQL Database name (case sensitive).

.PARAMETER scalingSchedule
Database Scaling Schedule. It is possible to enter multiple
comma separated schedules: [{},{}]
Weekdays start at 0 (sunday) and end at 6 (saturday).
If the script is executed outside the scaling schedule time slots
that you defined, the default edition/tier (see below) will be
configured.

.PARAMETER scalingScheduleTimeZone
Time Zone of time slots in $scalingSchedule.
Available time zones: [System.TimeZoneInfo]::GetSystemTimeZones().

.PARAMETER defaultEdition
Azure SQL Database Edition that will be used outside the slots
specified in the scalingSchedule parameter value.
Example values: Basic, Standard, Premium RS, Premium.
For more information on editions/tiers levels,
http://msdn.microsoft.com/en-us/library/azure/dn741336.aspx

.PARAMETER defaultTier
Azure SQL Database Tier that will be used outside the slots
specified in the scalingSchedule parameter value.
Example values: Basic, S0, S1, S2, S3, PRS1, PRS2, PRS4,
PRS6, P1, P2, P4, P6, P11, P15.

.EXAMPLE
-resourceGroupName myResourceGroup
-azureRunAsConnectionName AzureRunAsConnection
-serverName myserver
-databaseName myDatabase
-scalingSchedule [{WeekDays:[1], StartTime:"06:59:59",
StopTime:"17:59:59", Edition: "Premium", Tier: "P2"},
{WeekDays:[2,3,4,5], StartTime:"06:59:59",
StopTime:"17:59:59", Edition: "Premium", Tier: "P1"}]
-scalingScheduleTimeZone W. Europe Standard Time
-defaultEdition Standard
-defaultTier S0

.NOTES
Author: Jorg Klein
Last Update: 18/09/2017

param(
[parameter(Mandatory=$true)]
[string] $resourceGroupName, 

[parameter(Mandatory=$true)]
[string] $azureRunAsConnectionName, 

[parameter(Mandatory=$true)]
[string] $serverName, 

[parameter(Mandatory=$true)]
[string] $databaseName, 

[parameter(Mandatory=$true)]
[string] $scalingSchedule, 

[parameter(Mandatory=$true)]
[string] $scalingScheduleTimeZone, 

[parameter(Mandatory=$true)]
[string] $defaultEdition, 

[parameter(Mandatory=$true)]
[string] $defaultTier
) 

Write-Output "Script started." 

#Authenticate with Azure Automation Run As account (service principal)
$runAsConnectionProfile = Get-AutomationConnection `
-Name $azureRunAsConnectionName
Add-AzureRmAccount -ServicePrincipal `
-TenantId $runAsConnectionProfile.TenantId `
-ApplicationId $runAsConnectionProfile.ApplicationId `
-CertificateThumbprint $runAsConnectionProfile.CertificateThumbprint | Out-Null
Write-Output "Authenticated with Automation Run As Account." 

#Get current date/time and convert to $scalingScheduleTimeZone
$stateConfig = $scalingSchedule | ConvertFrom-Json
$startTime = Get-Date
Write-Output "Azure Automation local time: $startTime."
$toTimeZone = [System.TimeZoneInfo]::FindSystemTimeZoneById($scalingScheduleTimeZone)
Write-Output "Time zone to convert to: $toTimeZone."
$newTime = [System.TimeZoneInfo]::ConvertTime($startTime, $toTimeZone)
Write-Output "Converted time: $newTime."
$startTime = $newTime 

#Get current day of week, based on converted start time
$currentDayOfWeek = [Int]($startTime).DayOfWeek
Write-Output "Current day of week: $currentDayOfWeek." 

# Get the scaling schedule for the current day of week
$dayObjects = $stateConfig | Where-Object {$_.WeekDays -contains $currentDayOfWeek } `
|Select-Object Edition, Tier, `
@{Name="StartTime"; Expression = {[datetime]::ParseExact($_.StartTime,"HH:mm:ss", [System.Globalization.CultureInfo]::InvariantCulture)}}, `
@{Name="StopTime"; Expression = {[datetime]::ParseExact($_.StopTime,"HH:mm:ss", [System.Globalization.CultureInfo]::InvariantCulture)}} 

# Get the database object
$sqlDB = Get-AzureRmSqlDatabase `
-ResourceGroupName $resourceGroupName `
-ServerName $serverName `
-DatabaseName $databaseName
Write-Output "DB name: $($sqlDB.DatabaseName)"
Write-Output "Current DB status: $($sqlDB.Status), edition: $($sqlDB.Edition), tier: $($sqlDB.CurrentServiceObjectiveName)" 

if($dayObjects -ne $null) { # Scaling schedule found for this day
    # Get the scaling schedule for the current time. If there is more than one available, pick the first
    $matchingObject = $dayObjects | Where-Object { ($startTime -ge $_.StartTime) -and ($startTime -lt $_.StopTime) } | Select-Object -First 1
    if($matchingObject -ne $null)
    {
        Write-Output "Scaling schedule found. Check if current edition & tier is matching..."
        if($sqlDB.CurrentServiceObjectiveName -ne $matchingObject.Tier -or $sqlDB.Edition -ne $matchingObject.Edition)
        {
            Write-Output "DB is not in the edition and/or tier of the scaling schedule. Changing!"
            $sqlDB | Set-AzureRmSqlDatabase -Edition $matchingObject.Edition -RequestedServiceObjectiveName $matchingObject.Tier | out-null
            Write-Output "Change to edition/tier as specified in scaling schedule initiated..."
            $sqlDB = Get-AzureRmSqlDatabase -ResourceGroupName $resourceGroupName -ServerName $serverName -DatabaseName $databaseName
            Write-Output "Current DB status: $($sqlDB.Status), edition: $($sqlDB.Edition), tier: $($sqlDB.CurrentServiceObjectiveName)"
        }
        else
        {
            Write-Output "Current DB tier and edition matches the scaling schedule already. Exiting..."
        }
    }
    else { # Scaling schedule not found for current time
        Write-Output "No matching scaling schedule time slot for this time found. Check if current edition/tier matches the default..."
        if($sqlDB.CurrentServiceObjectiveName -ne $defaultTier -or $sqlDB.Edition -ne $defaultEdition)
        {
            Write-Output "DB is not in the default edition and/or tier. Changing!"
            $sqlDB | Set-AzureRmSqlDatabase -Edition $defaultEdition -RequestedServiceObjectiveName $defaultTier | out-null
            Write-Output "Change to default edition/tier initiated."
            $sqlDB = Get-AzureRmSqlDatabase -ResourceGroupName $resourceGroupName -ServerName $serverName -DatabaseName $databaseName
            Write-Output "Current DB status: $($sqlDB.Status), edition: $($sqlDB.Edition), tier: $($sqlDB.CurrentServiceObjectiveName)"
        }
        else
        {
            Write-Output "Current DB tier and edition matches the default already. Exiting..."
        }
    }
}
else # Scaling schedule not found for this day
{
    Write-Output "No matching scaling schedule for this day found. Check if current edition/tier matches the default..."
    if($sqlDB.CurrentServiceObjectiveName -ne $defaultTier -or $sqlDB.Edition -ne $defaultEdition)
    {
        Write-Output "DB is not in the default edition and/or tier. Changing!"
        $sqlDB | Set-AzureRmSqlDatabase -Edition $defaultEdition -RequestedServiceObjectiveName $defaultTier | out-null
        Write-Output "Change to default edition/tier initiated."
        $sqlDB = Get-AzureRmSqlDatabase -ResourceGroupName $resourceGroupName -ServerName $serverName -DatabaseName $databaseName
        Write-Output "Current DB status: $($sqlDB.Status), edition: $($sqlDB.Edition), tier: $($sqlDB.CurrentServiceObjectiveName)"
    }
    else
    {
        Write-Output "Current DB tier and edition matches the default already. Exiting..."
    }
} 

Write-Output "Script finished."


Step-by-step setup guide

Navigate to the Azure portal and create an Azure Automation account in the same Azure subscription as the Azure SQL Database that you want to scale. Make sure the Run As account has been created. When you create a new Automation account in the Azure portal, it automatically creates a new service principal (Run As account) and assigns it the Contributor role-based access control (RBAC) role on the subscription by default. These rights are sufficient to scale Azure SQL Databases.
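The default subscription-wide Contributor assignment works, but if you want to narrow the scope of the Run As service principal you could assign it a more limited role on just the SQL server instead. The sketch below is a hypothetical example using the AzureRM cmdlets; the search string, resource group, server name and subscription ID are placeholders for your own values.

# Find the Run As service principal (its display name normally starts with the Automation account name)
$appId = (Get-AzureRmADServicePrincipal -SearchString "MyAutomationAccount_").ApplicationId

# Grant it the built-in "SQL DB Contributor" role, scoped to a single SQL server
New-AzureRmRoleAssignment `
    -ServicePrincipalName $appId `
    -RoleDefinitionName "SQL DB Contributor" `
    -Scope "/subscriptions/<subscriptionId>/resourceGroups/myResourceGroup/providers/Microsoft.Sql/servers/myserver"

# Optionally remove the subscription-wide Contributor assignment afterwards:
# Remove-AzureRmRoleAssignment -ServicePrincipalName $appId -RoleDefinitionName "Contributor" -Scope "/subscriptions/<subscriptionId>"

The "SQL DB Contributor" role allows managing databases, including changing their service objective, without granting access to the rest of the subscription.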

I have uploaded my PowerShell script to the TechNet Script Center so you can easily import it as a runbook. Navigate to your Automation account and click the "Runbooks Gallery" menu item. Make sure you filter the Gallery Source on "Script Center" and search for "Azure SQL Database scheduled autoscaling". Open the item from the gallery and click Import.
Runbooks Gallery

Before you can use the runbook you have to publish it first. Click on “Overview” if you are not already on that page and then click “Edit”.
Edit Runbook

Now click “Publish”.
Publish Runbook

The runbook is now ready to use. The next step is to schedule it to run hourly, of course with your desired parameters. Click "Schedules" and then "Add a schedule".
Schedule Runbook

Give your schedule a name, set the start time to a whole hour, set Recurrence to "Recurring", Recur every "1 Hour" and set "Set expiration" to "No".
Recurring Schedule Runbook
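If you prefer to script this step rather than use the portal, the same hourly schedule can be created and linked to the runbook with the AzureRM.Automation cmdlets. This is a minimal sketch under the assumption that your Automation account is called "MyAutomationAccount" in resource group "myResourceGroup"; all other values are placeholders and mirror the parameter example shown earlier.

# Create an hourly recurring schedule that starts at midnight tomorrow (a whole hour)
New-AzureRmAutomationSchedule `
    -ResourceGroupName "myResourceGroup" `
    -AutomationAccountName "MyAutomationAccount" `
    -Name "HourlyDatabaseScaling" `
    -StartTime (Get-Date).Date.AddDays(1) `
    -HourInterval 1

# Link the schedule to the runbook and pass the runbook parameters as a hashtable
Register-AzureRmAutomationScheduledRunbook `
    -ResourceGroupName "myResourceGroup" `
    -AutomationAccountName "MyAutomationAccount" `
    -RunbookName "AzureSQLDatabaseScheduledAutoScaling" `
    -ScheduleName "HourlyDatabaseScaling" `
    -Parameters @{
        resourceGroupName        = "myResourceGroup"
        azureRunAsConnectionName = "AzureRunAsConnection"
        serverName               = "myserver"
        databaseName             = "myDatabase"
        scalingSchedule          = '[{WeekDays:[1], StartTime:"06:59:59", StopTime:"17:59:59", Edition: "Premium", Tier: "P2"}]'
        scalingScheduleTimeZone  = "W. Europe Standard Time"
        defaultEdition           = "Standard"
        defaultTier              = "S0"
    }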

Now configure the parameters. An explanation of each parameter can be found under "The Runbook PowerShell script" earlier in this article.
Runbook parameters

When you now navigate to “Jobs” you get an overview of all executions.
Runbook Jobs

When you open a job and look at the output, you will see all details of that particular run. For example, I can see my database was scaled down from P1 to S0 on Friday at 18:00:
Runbook Job Output

Connect Logic Apps to Data Lake Store using a service principal

For a couple of months now, Azure Logic Apps has offered an Azure Data Lake connector that can perform the following actions on an Azure Data Lake Store:

  • Azure Data Lake – Append File (concurrent)
  • Azure Data Lake – Read File
  • Azure Data Lake – Upload File
  • Azure Data Lake – Append File (sequential)
  • Azure Data Lake – Create Folder
  • Azure Data Lake – Delete File
  • Azure Data Lake – List Files

Lately I have noticed several people on forums who are facing "Unauthorized" errors while using the Azure Data Lake connector. In this blog post I will explain how you can authenticate using a service principal that has been authorized on the Azure Data Lake Store for a specific action. As an example I will show how to create a folder.

Azure Data Lake Store uses Azure Active Directory for authentication. When you use the Logic Apps Azure Data Lake connector, you see that there are two possible ways to authenticate:
000

You can either sign in with an Azure AD account, or you can connect using a service principal, the option I will describe. With both options your Logic App obtains an OAuth 2.0 token that is attached to each request made to the Data Lake Store.

So what is a service principal? It is basically an impersonation identity of your application, registered in Azure Active Directory. Using a service principal is preferred over running under your own credentials, because:

  • You can assign permissions to the app identity that are different than your own permissions. Typically, these permissions are restricted to only what the app needs to do.
  • You do not have to change the app credentials if your responsibilities change.
  • You can use a certificate to automate authentication when executing an unattended script.

When you register an Azure AD application in the Azure portal, you receive a service principal object in addition to an application object. You can create multiple service principals in case you have a multi-tenant application. That is not the case for the example I describe in this blog post, but I advise you to read the following article to fully understand the concept of applications and service principals, and the relationship between them: Application and service principal objects in Azure Active Directory (Azure AD)

Now let me guide you through the following steps:

  1. Register an application in Azure AD.
  2. Grant the service principal access to an Azure Data Lake Store.
  3. Use the service principal in a Logic App to connect to the Azure Data Lake Store and create a folder.

Overview

Let's first start with what is required to configure the Azure Data Lake connector in the Logic App to use a service principal:
00

  • Connection Name: Display Name of the API Connection that will be created
  • Tenant: Directory ID property of your Azure Active Directory tenant
  • Client ID: Application ID of your Application Registration in Azure Active Directory
  • Client Secret: Key of your Application Registration in Azure Active Directory


Find out your Azure Active Directory ID (Tenant)

Navigate to the Azure portal, then go to the Azure Active Directory page and click Properties. Now copy the “Directory ID” and store it (Tenant property).
0
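If you are already logged in with Azure PowerShell, the same value can also be retrieved without the portal; a minimal sketch using the AzureRM cmdlets:

# The TenantId column contains the same value as the "Directory ID" shown in the portal
Get-AzureRmSubscription | Select-Object Name, TenantId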

Register Application in Azure Active Directory

Navigate to the Azure Active Directory page again. On the Overview page, click “App registrations”
1

Click “New application registration”.
2

Enter a name for your App registration, select Web app / API as the application type and enter a dummy Sign-on URL, as we don't intend to use web authorization.
3

You now probably see a long list of existing App registrations. Search for the one you just created and click it.
4

Copy and save the "Application ID" value; we need it for the Client ID property of the Logic App Data Lake connector.
5

Now click “All settings”, followed by “Keys”.
6A

Enter a key Description and choose a duration. A new key Value will be created. Copy and save it. We need it for the Client Secret property of the Logic App Data Lake connector.
6B
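If you prefer scripting over the portal, the app registration and its service principal can also be created with the AzureRM.Resources cmdlets. This is a hypothetical sketch; the display name, URIs and secret are placeholders, and note that newer AzureRM/Az versions expect a SecureString or credential object for the password instead of a plain string.

# Create the application registration (placeholders; -Password took a plain string in 2017-era AzureRM versions)
$app = New-AzureRmADApplication `
    -DisplayName "LogicAppDataLakeDemo" `
    -HomePage "http://LogicAppDataLakeDemo" `
    -IdentifierUris "http://LogicAppDataLakeDemo" `
    -Password "<a-strong-client-secret>"

# Create the service principal for the application
New-AzureRmADServicePrincipal -ApplicationId $app.ApplicationId

# $app.ApplicationId is the Client ID for the Logic App connector; the password above is the Client Secret
$app.ApplicationId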

Grant the Application Registration (service principal) access to the Azure Data Lake Store

In this example I will grant the Application Registration write and execute permissions, so the Logic App will be able to create a folder in the Azure Data Lake Store.

In the Azure portal, navigate to your Azure Data Lake Store. Click “Overview” and then “Data Explorer”.
7

Click “Access”.
8A

Click “Add”.
8B

You can now search for Active Directory users or groups, but also for Application Registrations. Search for your App Registration name and click it.
9

To be able to create folders, Write and Execute permissions are required. In this example I grant access to "This folder and all children"; in my case "this folder" is the root (/ folder). I add these permissions as "An access permission entry", which means the permissions are applied to the currently existing files and folders. "A default permission entry" will also be applied to folders and files that are added to this folder in the future.
10

The permissions have been granted to the App Registration (service principal).
11
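The same permissions can also be granted from PowerShell with the AzureRM.DataLakeStore module. This is a minimal sketch with placeholder values; it sets the access ACL entry and the default ACL entry on the root folder only, whereas the portal option used above also applies the entry to already existing children.

# Grant Write and Execute on the root folder to the service principal (by its object ID)
Set-AzureRmDataLakeStoreItemAclEntry -AccountName "myadlsaccount" -Path / `
    -AceType User -Id "<service-principal-object-id>" -Permissions WriteExecute

# Add the same entry as a default ACL entry so newly created children inherit it
Set-AzureRmDataLakeStoreItemAclEntry -AccountName "myadlsaccount" -Path / `
    -AceType User -Id "<service-principal-object-id>" -Permissions WriteExecute -Default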

Create the Logic App and use the service principal to connect

Create a new Logic App from the Azure portal. From the online Logic App Designer search for the Azure Data Lake Actions. Choose “Create Folder”.
12

Click “Connect with Service Principal”.
13

Configure the connection using the values we saved in the previous steps.
15

Now enter your Azure Data Lake Store Account Name. This is just the name of your Azure Data Lake Store. Also enter a name for the folder in the Folder Path box.
16

You can now click the Run button to test your Logic App which should finish successfully.
17

When you now check the Data Explorer of your Azure Data Lake Store you will see the new folder has been created.
18

Azure Data Lake Store encryption using Azure Key Vault for key management

Case
You want to create an encrypted Azure Data Lake Store (ADLS) with a master encryption key that is stored and managed in your own existing Azure Key Vault.

Using this setup, which is shown in the diagram below, all data in your Data Lake Store will be encrypted before it gets stored on disk. To decrypt the data, a master encryption key is required.
In this scenario a "customer managed" key will be used, which means the key is created and managed in your own Azure Key Vault. This is an alternative to a key that is managed and owned by the Data Lake Store service, which is the default. Managing keys in the Key Vault gives additional possibilities, like revoking access to the key for the ADLS service identity or even permanently deleting the key from the Key Vault.

clip_image001 

Solution
In this blog post I'll guide you through the 3 steps below, all in an automated way using PowerShell scripting and an Azure Resource Manager (ARM) template to create your encrypted ADLS. I plan to blog later about the possibilities that Visual Studio Team Services offers to perform these deployment tasks.

1. Create a new "customer managed" key in your existing Azure Key Vault
2. Create a new ADLS with data encryption enabled
3. Grant the ADLS service principal access to Azure Key Vault and enable Key Vault managed encryption using your "customer managed" key

 

Prerequisites

• Create an Azure Resource Group. I have created one named "adls-keyvault-demo" (akd).
• Create an Azure Key Vault if you do not already have one. I have created one named "akd-keyvault".
• AzureRM 4.1.0 module from the PowerShell Gallery. Required since we will use the new Enable-AzureRmDataLakeStoreKeyVault PowerShell function.

 

PowerShell script

Executing the PowerShell script below creates the new key in your existing Azure Key Vault, then creates a new ADLS using an ARM template (see below) and finally enables Key Vault managed encryption for your new ADLS. The comments in the script give further explanation, and messages during execution are written to the Windows PowerShell console to inform you about what's happening. Make sure you have at least AzureRM 4.1.0 installed and that the account you will use has sufficient permissions.

The following variables are used:

• subscriptionId – Azure Subscription ID
• rg – Azure Resource Group name
• keyVaultUri – Key Vault DNS name. Check your Key Vault properties in the Azure portal.
• keyName – Name of the Key Vault key that will be used for the ADLS
• armTemplateFileAdls – Path of your ADLS ARM template JSON file. You can find the definition below the PowerShell script; copy/paste it into a JSON file and store it on disk.
• adlsName – Name of your ADLS


# Variables; modify
$subscriptionId = "00000000-0000-0000-0000-000000000000"
$rg = "adls-keyvault-demo"
$keyVaultUri = "https://akd-keyvault.vault.azure.net/"
$keyName = "akd-adls-key"
$armTemplateFileAdls = "C:\CreateEncryptedADLS.JSON"
$adlsName = "akdadls"

#Authenticate to Azure and set the subscription context
Login-AzureRmAccount
Set-AzureRMContext -SubscriptionId $subscriptionId

Write-Host "Get Key Vault Name from URI $keyVaultUri"
$keyVaultHost = ([System.Uri]$keyVaultUri).Host
$keyVaultName = $keyVaultHost.Substring(0, $keyVaultHost.IndexOf('.'))

Write-Host "Creating software-protected key $keyName in Key Vault $keyVaultName"
$adlsKey = Add-AzureKeyVaultKey -Destination Software -Name $keyName -VaultName $keyVaultName

#Get the current version identifier of the key, which will be used for the creation of the ADLS (encryptionKeyVersion parameter)
$adlsKeyId = $adlsKey.Version.ToString()

Write-Host "Create new encrypted ADLS by deploying ARM template $armTemplateFileAdls in resource group $rg"
New-AzureRmResourceGroupDeployment -ResourceGroupName $rg -TemplateFile $armTemplateFileAdls `
-DataLakeStoreName $adlsName -KeyVaultName $keyVaultName -DataLakeStoreKeyVaultKeyName $keyName -DataLakeStoreKeyVaultKeyVersion $adlsKeyId

#Get the ADLS account and its Service Principal Id
$adlsAccount = Get-AzureRmDataLakeStoreAccount -Name $adlsName
$adlsAccountSPId = $adlsAccount.Identity.PrincipalId

Write-Host "Grant ADLS account Service Principal $adlsAccountSPId required permissions on the Key Vault"
#Grant ADLS account access to perform encrypt, decrypt and get operations with the key vault
Set-AzureRmKeyVaultAccessPolicy -VaultName $keyVaultName -ObjectId $adlsAccountSPId -PermissionsToKeys encrypt,decrypt,get -BypassObjectIdValidation

Write-Host "Enable ADLS Key Vault managed encryption"
Enable-AzureRmDataLakeStoreKeyVault -Account $adlsAccount.Name

Write-Host "ADLS $adlsName is now encrypted using key $keyName in Key Vault $keyVaultName"


 

ARM Template ADLS


{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "DataLakeStoreName": {
      "type": "string",
      "minLength": 1
    },
    "KeyVaultName": {
      "type": "string",
      "minLength": 1
    },
    "DataLakeStoreKeyVaultKeyName": {
      "type": "string",
      "minLength": 1
    },
    "DataLakeStoreKeyVaultKeyVersion": {
      "type": "string",
      "minLength": 1
    }
  },
  "resources": [
    {
      "type": "Microsoft.DataLakeStore/accounts",
      "name": "[parameters('DataLakeStoreName')]",
      "apiVersion": "2016-11-01",
      "location": "North Europe",
      "tags": {
        "displayName": "Datalake Store"
      },
      "identity": {
        "type": "SystemAssigned"
      },
      "properties": {
        "encryptionState": "Enabled",
        "encryptionConfig": {
          "type": "UserManaged",
          "keyVaultMetaInfo": {
            "keyVaultResourceId": "[resourceId('Microsoft.KeyVault/vaults', parameters('KeyVaultName'))]",
            "encryptionKeyName": "[parameters('DataLakeStoreKeyVaultKeyName')]",
            "encryptionKeyVersion": "[parameters('DataLakeStoreKeyVaultKeyVersion')]"
          }
        }
      }
    }
  ]
}


 

After you successfully execute the PowerShell script, navigate to the Azure portal to check if everything is OK.

Data Lake Store → Settings → Encryption

image

The account is successfully encrypted using the Key Vault key. The ADLS account has a generated service principal named "RN_akdadls", to which we granted permissions on the Key Vault in the PowerShell script.

 

Key Vault → Settings → Keys

clip_image005

The key has been created and is enabled.

 

Key Vault → Settings → Access policies

clip_image007

The ADLS Service Principal has an access policy that we set with the PowerShell script.

Opening it shows the key permissions:

clip_image009


Special thanks to my Macaw colleague Simon Zeinstra for working together on this solution!

Hybrid-BI data platform network architecture

This blog post gives an overview of a typical hybrid Azure/on-premises BI data platform network architecture. The most important settings, configurations, shortcomings and other things you should take into account are described. For more details and in-depth documentation links to Microsoft pages are provided at the bottom of this post.

The following common scenarios are covered:

• Write data from on-premises data sources to an Azure SQL Database or Azure SQL Data Warehouse using SQL Server Integration Services.

• Use the On-Premises Data Gateway (or ADF gateway) to make on-premises sources available for various Azure services like PowerBI.com and Azure Data Factory, or to be able to process an Azure Analysis Services cube with on-premises data.

• Connect to an Azure SQL Database using client tools or PowerBI.com.

• Connect to an Azure Analysis Services tabular model using client tools or PowerBI.com.

The architecture diagram below (click to open) shows how the different on-premises resources connect to Azure services. For every connection the ports that should be open in your firewall, network protocols, encryption methods and authentication types are shown.

Azure DDP

  

1. For the SSIS ADO.NET driver, use the following connection string parameters to use an encrypted connection: Encrypt=True and TrustServerCertificate=False (a test sketch follows after this list).

2. All connections to Azure SQL Database require encryption (SSL/TLS) at all times while data is "in transit" to and from the database.

3. Set the "Allow access to Azure services" property for your Azure SQL Database server to ON. Be aware that this means not only the Azure services in your Azure subscription can reach your Azure SQL Database server, but all Azure services worldwide, including those of other customers.

4. Only SQL Server Authentication is currently supported when connecting to an Azure SQL Database from SSIS, Power BI Desktop or Excel. Power BI Desktop models deployed to PowerBI.com will therefore also connect to an Azure SQL Database using SQL Server Authentication. The latest version of SQL Server Management Studio does support Azure Active Directory Integrated Authentication.

5. To access Azure SQL Database from your local computer, ensure the firewall on your network and local computer allows outgoing communication on TCP port 1433. Outgoing traffic can be filtered to allow only traffic to Azure datacenter IP addresses for your region. This is sometimes a requirement before organizations want to allow outgoing traffic through port 1433. Inbound connections on port 1433 can be blocked.

6. By default in Power BI Desktop, the "Encrypt connections" option is checked for your data source. If the data source doesn't support encryption, Power BI Desktop will prompt you to ask whether an unencrypted connection should be used.

7. Port 443 is used for default communication to PowerBI.com. Ports 5671 and 5672 can be used for Advanced Message Queuing Protocol (AMQP). Ports 9350 through 9354 can be used for listeners on Service Bus Relay over TCP. If you use the ADF gateway to write data to Azure SQL Database or Azure SQL DW, outgoing traffic through port 1433 needs to be allowed.
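As a quick aid for point 1, you can verify such an encrypted connection string from PowerShell before embedding it in your SSIS ADO.NET connection manager. The server, database and credentials below are placeholders.

# Open and close a test connection to Azure SQL Database over an encrypted channel (port 1433)
$connectionString = "Data Source=yourserver.database.windows.net,1433;Initial Catalog=yourDatabase;" +
                    "User ID=yourUser;Password=yourPassword;Encrypt=True;TrustServerCertificate=False;"
$connection = New-Object System.Data.SqlClient.SqlConnection($connectionString)
$connection.Open()
Write-Output "Connection state: $($connection.State)"
$connection.Close()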

Links:

Process Azure Analysis Services databases from Azure Automation

In my last blog post I showed how to trigger Azure Data Factory (ADF) pipelines from Azure Automation. I also mentioned the option to process an Azure Analysis Services cube from Azure Automation, for example right after your ADF data processing finishes, which will probably be a common use case. In this blog post I show you how you can use the Analysis Services PowerShell provider, also known as SQLASCMDLETS, from Azure Automation.

Create custom SQLASCMDLETS module
The SQLASCMDLETS are not (yet) available in the PowerShell Gallery, so unfortunately it's not possible to import the cmdlets straight into Automation like I did with the ADF cmdlets in my previous blog post. Instead we have to create our own module which will contain the SQLASCMDLETS and their dependencies.

UPDATE: You can now use the SqlServer PowerShell Gallery module. Creating a custom module as described below is not required anymore. Instead, follow the 3 steps below and continue reading below the custom module instructions.

1. Navigate to https://www.powershellgallery.com/packages/SqlServer/

2. Click the “Deploy to Azure Automation” button

3. Import the SqlServer module to your Azure Automation account


The required files come with SQL Server Management Studio (SSMS), which you can download and install for free. It's important to note that you need the latest version (140) of the SQLASCMDLETS, which is shipped with the latest Release Candidate of SSMS. Download and install it.

If you try to use the previous version of the SQLASCMDLETS (130), you will get an error in Automation because it tries to authenticate with a claims token, while only Windows authentication is supported by the 130 version of SQLASCMDLETS: "The value 'ClaimsToken' is not supported for the connection string property".

After installing SSMS you should now be able to see the following directory: C:\Program Files (x86)\Microsoft SQL Server\140\Tools\PowerShell\Modules\SQLASCMDLETS

Copy the SQLASCMDLETS folder to a temporary location, for example C:\SQLASCMDLETS.
You will need the following files:

  • Microsoft.AnalysisServices.PowerShell.Cmdlets.dll
  • SQLASCMDLETS.PSD1

We also need the libraries SQLASCMDLETS depends on. Search your computer for the following files and copy paste them to the C:\SQLASCMDLETS folder. Make sure you copy them from a folder that has “140” in the path so you are sure you have the correct version.

  • Microsoft.AnalysisServices.Core.dll
  • Microsoft.AnalysisServices.Tabular.dll
  • Microsoft.AnalysisServices.dll

Now zip the entire folder, make sure the name is “SQLASCMDLETS.zip”.
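If you want to script the zip step as well, Compress-Archive (available since PowerShell 5.0) can do it; a minimal sketch assuming you copied the folder to C:\SQLASCMDLETS:

# Zip the module folder; the resulting SQLASCMDLETS.zip can be uploaded to Azure Automation
Compress-Archive -Path "C:\SQLASCMDLETS" -DestinationPath "C:\SQLASCMDLETS.zip"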


Import custom SQLASCMDLETS module to Azure Automation
Navigate to your Azure Automation account in the Azure portal.

Click Assets:       

Click Modules:       

Click Add a module to import a custom module:       

Now upload the SQLASCMDLETS.zip file:       

The zip file will be extracted:       

Wait until the extraction finished and the status changes to Available. Click on the module name:       

You now see the available activities including the one we will use to process the Azure Analysis Services Database.



Create Azure Automation Credential
Now we need to create a Credential to be able to automatically login and run our PowerShell script unattended from Azure Automation.

Navigate to Assets again and then click Credentials:
clip_image016

Click “Add a credential” and enter an organization account that has permissions to process your Azure Analysis Services database. Make sure you enter the User Principal Name (UPN) and not a Windows AD account. It is often the email address and may look like jorgk@yourorganizationalaccountname.com. Give the new Credential a name, I chose “adpo-auto-cred”. It will be referenced in the PowerShell script below.
clip_image018


Create Automation Runbook
You can use the simple PowerShell script below to process your Azure Analysis Services database from Azure Automation. It will use the “adpo-auto-cred” credential to authenticate and will process your database using the Invoke-ProcessASDatabase SQLASCMDLETS function.
Replace “dbname” with your database name and “server” with your server, e.g. asazure://westeurope.asazure.windows.net/yourserver and you are good to go.

Copy/paste the script below to a Windows PowerShell Script (.ps1) file and name it “ProcessASDatabase.ps1”.


$AzureCred = Get-AutomationPSCredential -Name "adpo-auto-cred"

Add-AzureRmAccount -Credential $AzureCred | Out-Null

Invoke-ProcessASDatabase -databasename "dbname" -server "server" -RefreshType "Full" -Credential $AzureCred


Navigate to your Azure Automation account in the Azure Portal and click “Runbooks”:
clip_image020

Click “Add a runbook”:
clip_image022

Click "Import an existing runbook" and select the ProcessASDatabase.ps1 file to import the PowerShell script as a runbook:
clip_image024

Runbook ProcessASDatabase is created. Click it to open it:
clip_image026

A Runbook must be published before you are able to start or schedule it. Click Edit:
clip_image027

Before publishing, test the Runbook first. Click on the “Test pane” button and then click Start:
clip_image029

The script executed successfully:
clip_image031

Connect to your Azure AS server with SSMS and check the database properties to be sure processing succeeded:
clip_image033

Now publish the Runbook:
clip_image035

That’s it, you now have an Azure Automation Runbook that you can schedule, monitor and integrate with your other data platform related tasks!

Microsoft Data Platform MVP

MVP_Logo_Preferred_Cyan300_RGB_300ppi

What a great start of 2017! I have received my first Microsoft Data Platform MVP award:

Mail

Thank you to everyone who felt I deserved this! I am honored to be part of such a great group of experts now and I look forward to collaborating with everybody in 2017.

Special thanks to my employer Macaw for always encouraging me to share my knowledge with the community for the past 10 years and giving me the space, capabilities and innovative projects to make this possible!

Have a wonderful 2017!

Jorg.

Orchestrate Azure Data Factory pipelines and other Azure Data Platform management tasks using Azure Automation

Azure Data Factory (ADF) is a great SaaS solution to compose and orchestrate your Azure data services. It works fine to create, schedule and manage your data pipelines but it has limitations that can make it hard to use in some scenarios. The two main ones are:

1. Some tasks that you regularly want to perform can't be accomplished with ADF.
A few examples are:

   • Start/Pause an Azure SQL Data Warehouse
   • Start/Pause an Azure Analysis Services instance
   • Process an Azure Analysis Services cube
   • Manage on-premises resources

 

2. Triggering (running) ADF pipelines on demand.
For most people it is hard to switch from the well-known SQL Agent jobs – in which it's really easy to trigger SSIS processes and execute other tasks – to scheduling ADF pipelines. You now have to create activity windows and define data slices that are dependent on the availability of data sets. Most would like to be able to trigger an ADF pipeline either on demand or when some other task finishes successfully.

The tasks listed at point 1 – and almost everything else you can think of – can be done using Azure PowerShell with Azure Resource Manager. Microsoft does an awesome job here: every new SaaS solution has great ARM support from the start. To make use of PowerShell in Azure and, more importantly, to automate your scripts and processes, Azure Automation comes into play. It makes it possible to run all your PowerShell scripts in Azure as a SaaS solution. It is an orchestrator in which you can execute all kinds of data platform related operations, both in Azure and on-premises via hybrid workers, enabling hybrid data platform orchestration scenarios.

Triggering ADF pipelines on demand is a bit trickier. An "Execute now" command in PowerShell does not exist, which is understandable if you know how the scheduling and execution mechanism of ADF has been built. I don't think it will be easy to change this, unless some major changes to ADF are made. Fortunately, there is a workaround! If you have ever deployed a new ADF pipeline you might have noticed that after deployment, pipelines start immediately when they have an activity window with a date/time in the past. Because it is possible to update the activity window date/time of a pipeline using PowerShell, it's also possible to trigger a pipeline's execution by changing its activity window date/time to some value in the past.

In this blog post I will show how you can trigger an ADF pipeline on a daily basis and monitor its execution using PowerShell code that runs in an Azure Automation runbook. Because executing other tasks (point 1) using PowerShell is easy, it becomes possible to orchestrate advanced Azure Data Platform tasks, with the execution of an ADF pipeline as just a part of the total process.
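In essence the workaround comes down to two calls that also appear in the full runbook script later in this post: move the pipeline's active period to yesterday and resume the pipeline. A stripped-down sketch (resource group, data factory and pipeline names are placeholders):

# Shift the pipeline's active period to yesterday so ADF considers the slice overdue...
$df  = Get-AzureRmDataFactory -ResourceGroupName "myResourceGroup" -Name "myDataFactory"
$sdt = [System.DateTime]::Today.AddDays(-1)    #Yesterday 12:00:00 AM
$edt = [System.DateTime]::Today.AddSeconds(-1) #Yesterday 11:59:59 PM
Set-AzureRmDataFactoryPipelineActivePeriod -DataFactory $df -PipelineName "myPipeline" -StartDateTime $sdt -EndDateTime $edt

# ...and resume the pipeline so the overdue slice starts executing immediately
Resume-AzureRmDataFactoryPipeline -DataFactory $df -Name "myPipeline"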

Azure Setup

In this example I’ve used the Azure setup shown below. The ADF pipeline contains one simple copy activity that copies a file from one blob storage container to another.

• Resource Group
  • Blob Storage
    • Container: input
    • Container: output
  • Azure Data Factory
  • Azure Automation

Azure Data Factory Setup

I created the Azure Data Factory pipeline with the Copy Data wizard:

clip_image001[4] 

I configured the pipeline to “Run regularly on schedule” with a recurring pattern of “Daily”, “every 1 day” (see the blue rectangle in the screenshot below).

Choosing "Run once now" would set the pipeline property "PipelineMode" to "OneTime" and would disable the ADF scheduler. Initially this might seem to be the option you want to use, as we want to trigger the pipeline ourselves on demand, but unfortunately that configuration has some limitations:

• The ADF diagram view does not show one-time pipelines. This makes it impossible to monitor the pipeline using the Monitor & Manage dashboard.

• One-time pipelines can't be updated. This would block us from updating the activity window properties to trigger the pipeline.

The "Start date time" is automatically set to yesterday with the current time. This triggers the pipeline to start running immediately after deployment; as explained, ADF automatically triggers pipelines with a start date/time in the past. In this case we want to trigger the execution ourselves, so set the date to some date in the future (see the red rectangle in the screenshot below).

clip_image003[4]

For this pipeline I chose to create a simple example that copies a file from one blob storage container to another. Of course, you can design your pipeline any way you prefer, e.g. using the Copy Data wizard or Visual Studio. Just make sure you schedule it with a daily recurrence.


Azure Automation Setup

The next step is to create a new Azure Automation account. Within that account we need to import the Data Factory PowerShell cmdlets, because the standard set of PowerShell cmdlets in Automation does not contain any ADF related functions.

Navigate to Assets:

clip_image004[4]

Click Modules:
clip_image005[4]

Browse the gallery:
clip_image006[4]

 

Search for “Data Factory”, select AzureRM.DataFactories and click Import:
clip_image007[4]

AzureRM.DataFactories should now appear in your list of available modules:
clip_image009[4]


Create Credential

Now we need to create a Credential to be able to automatically login and run our PowerShell script unattended from Azure Automation.

Navigate to Assets again and then click Credentials
clip_image010[4]

 

Click “Add a credential” and supply a user account that has the required permissions to access your Azure Data Factory. You can use the organizational account you use to login to the Azure Portal. It might look like jorgk@yourorganizationalaccountname.com or something like that.
clip_image012[4]

Automation Runbook
We are now ready to create the Automation Runbook which will trigger the Azure Data Factory pipeline, by updating its Active Period to a date/time in the past.

The script performs the following steps:

1. Authenticate with the Automation Credential
2. Connect to the Azure Data Factory
3. Update the pipeline active period to yesterday
4. Unpause the pipeline; execution will begin
5. Monitor the pipeline execution
6. Pause the pipeline

Copy/paste the script below to a Windows PowerShell Script (.ps1) file and name it “TriggerAdfPipeline.ps1”.


 

# Variables; modify
$rgn = "AzureDataPlatformOrchestration" #Resource Group Name
$acn = "adpo-auto-cred" #Automation Credential Name
$dfn = "adpo-adf" #Data Factory Name
$pln = "CopyPipeline-cu6" #PipeLine Name
$dsn = "OutputDataset-hgv" #DataSet Name (output dataset of pipeline that needs to be produced)

# To test from a PowerShell client, uncomment the 2 rows below and provide your subscription ID
#Login-AzureRmAccount
#Set-AzureRMContext -SubscriptionId "00000000-0000-0000-0000-000000000000"

# Authenticate
# To test from a PowerShell client, comment out the 2 rows below
$AzureCred = Get-AutomationPSCredential -Name $acn
Add-AzureRmAccount -Credential $AzureCred | Out-Null

# Get data factory object
$df = Get-AzureRmDataFactory -ResourceGroupName $rgn -Name $dfn
If($df) {
    Write-Output "Connected to data factory $dfn in resource group $rgn."
}

# Create start/end DateTime (yesterday)
$sdt = [System.DateTime]::Today.AddDays(-1) #Yesterday 12:00:00 AM
$edt = [System.DateTime]::Today.AddSeconds(-1) #Yesterday 11:59:59 PM

# Update active period to yesterday
$apr = Set-AzureRmDataFactoryPipelineActivePeriod -DataFactory $df -PipelineName $pln -StartDateTime $sdt -EndDateTime $edt
If($apr) {
    Write-Output "Pipeline $pln of data factory $dfn updated with StartDateTime $sdt and EndDateTime $edt."
}

# Unpause pipeline
$rpl = Resume-AzureRmDataFactoryPipeline -DataFactory $df -Name $pln
If($rpl) {
    Write-Output "Pipeline $pln resumed."
}

# Create arrays that hold all possible data factory slice states
$failedStates = "Failed Validation", "Timed Out", "Skip", "Failed"
$pendingStates = "Retry Validation", "PendingValidation", "Retry", "InProgress", "PendingExecution"
$finishedStates = "Ready"

# Wait while the data factory slice is in a pending state
While (Get-AzureRmDataFactorySlice -DataFactory $df -DatasetName $dsn -StartDateTime $sdt -EndDateTime $edt | Where -Property State -In $pendingStates) {
    Write-Output "Slice status is:"
    Get-AzureRmDataFactorySlice -DataFactory $df -DatasetName $dsn -StartDateTime $sdt -EndDateTime $edt | Select -Property State
    Write-Output "Wait 15 seconds"
    Start-Sleep 15
}

# Since the data factory slice is not pending (anymore), it has either failed or finished
If(Get-AzureRmDataFactorySlice -DataFactory $df -DatasetName $dsn -StartDateTime $sdt -EndDateTime $edt | Where -Property State -In $failedStates){
    Write-Output "Slice failed."
    Get-AzureRmDataFactorySlice -DataFactory $df -DatasetName $dsn -StartDateTime $sdt -EndDateTime $edt | Select -Property State
} ElseIf (Get-AzureRmDataFactorySlice -DataFactory $df -DatasetName $dsn -StartDateTime $sdt -EndDateTime $edt | Where -Property State -In $finishedStates) {
    Write-Output "Slice finished."
} Else {
    Write-Output "No State found?"
    Get-AzureRmDataFactorySlice -DataFactory $df -DatasetName $dsn -StartDateTime $sdt -EndDateTime $edt | Select -Property State
}

# Pause pipeline
$spr = Suspend-AzureRmDataFactoryPipeline -DataFactory $df -Name $pln
If($spr){
    Write-Output "Pipeline $pln paused."
}

 


Navigate to your Azure Automation account in the Azure Portal and click “Runbooks”:

clip_image013[4]

Click “Add a runbook”:

clip_image014[4]

Now select the TriggerAdfPipeline.ps1 file to import the PowerShell script as Runbook:

clip_image016[4]

 

Runbook TriggerAdfPipeline is created. Click it to open it:

clip_image018[4]

 

A Runbook must be published before you are able to start or schedule it. Click Edit:

clip_image020[4]

 

Before publishing, test the Runbook first. Click on the Test pane button and then click Start:

clip_image022[4]

 

The pipeline slice ran successfully:

clip_image024[4]

Now publish the Runbook:

clip_image026[4]

You can now create webhooks, schedule and monitor your Runbook jobs:

clip_image028[4]

Extending the script with additional tasks

It's easy to perform other Azure data platform related tasks using PowerShell and Automation runbooks. Because the script I created waits for the ADF pipeline to end, you can easily execute other tasks before or after the pipeline's execution. The following examples will most likely be useful:

• Pause/Start Azure SQL Data Warehouse
  • Documentation
  • PowerShell Gallery
• Pause/Start Azure Analysis Services
  • PowerShell Gallery
• Process Azure Analysis Services
  • See my blog post Process Azure Analysis Services databases from Azure Automation
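For the first two tasks listed above, the corresponding AzureRM cmdlets can be dropped straight into the runbook before or after the pipeline run. A minimal sketch with placeholder names; Suspend-/Resume-AzureRmSqlDatabase apply to Azure SQL Data Warehouse databases, and the Analysis Services cmdlets require the AzureRM.AnalysisServices module.

# Pause / start an Azure SQL Data Warehouse
Suspend-AzureRmSqlDatabase -ResourceGroupName "myResourceGroup" -ServerName "myserver" -DatabaseName "myDataWarehouse"
Resume-AzureRmSqlDatabase -ResourceGroupName "myResourceGroup" -ServerName "myserver" -DatabaseName "myDataWarehouse"

# Pause / start an Azure Analysis Services server
Suspend-AzureRmAnalysisServicesServer -ResourceGroupName "myResourceGroup" -Name "myasserver"
Resume-AzureRmAnalysisServicesServer -ResourceGroupName "myResourceGroup" -Name "myasserver"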

Update: Triggering/calling pipelines for a second time

The script that is provided in this blog post only works if you call your pipeline no more than once a day. If you want to trigger your pipeline multiple times a day, this solution needs a minor modification.

  1. The pipeline schedule should be "every 15 minutes" instead of daily
  2. Update the "TriggerAdfPipeline.ps1" script to make it work for subsequent runs:

Change:

# Create start/end DateTime (yesterday)
$sdt = [System.DateTime]::Today.AddDays(-1) #Yesterday 12:00:00 AM
$edt = [System.DateTime]::Today.AddSeconds(-1) #Yesterday 11:59:59 PM

To:

# Create start/end DateTime
$sdt = Get-Date
$sdt = $sdt.AddMinutes(-2) #Now -2 minutes
$edt = Get-Date
$edt = $edt.AddMinutes(-1) #Now -1 minute

Azure Data Lake Analytics U-SQL decryption possibilities

The following information applies to the situation as of October 2016.

One of the things that make U-SQL so powerful is C# integration. It gives you the possibility to create your own C# classes and methods and then use them in your U-SQL scripts. Recently we executed a couple of big data projects at Macaw using the Azure Data Lake Store (ADLS) and Azure Data Lake Analytics (ADLA). During one of these projects we had to decrypt some custom Rijndael (AES) encrypted JSON content in the ADLS and store the result as a CSV. These CSV files would then be encrypted by default by enabling encryption at rest for the entire ADLS.

We were able to support two decryption scenarios: decrypt the entire file or decrypt a part of a file (particular sensitive string values). I will go into detail about each scenario.

Scenario 1: Decrypt entire files
Because we were dealing with JSON files and U-SQL does not offer a native JSON extractor yet, we used the JsonExtractor C# sample that is shared on GitHub by Michal Rys.
We modified the Extract method of the JsonExtractor class to include a CustomDecryptor that will decrypt the input.BaseStream (the entire file). We also set AtomicFileProcessing to true.


using System.IO;

using System.Linq;

using System.Collections.Generic;

using Microsoft.Analytics.Interfaces;

using Newtonsoft.Json.Linq;

using Custom.Framework.Utilities.Encoding;

 

namespace CustomADLALibrary

{

 

    [SqlUserDefinedExtractor(AtomicFileProcessing = true)]

    public class CustomExtractor : IExtractor

    {

        /// <summary/>

        private string rowpath;

 

        /// <summary/>

        public CustomExtractor(string rowpath = null)

        {

            this.rowpath = rowpath;

        }

 

        /// <summary/>

        public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)

        {

 

            CustomDecryptor Decryptor = new CustomDecryptor();

            using (var reader = new StreamReader(input.BaseStream))

            {

                var jsonAsString = reader.ReadToEnd();

                var decryptedString = Decryptor.Decrypt(jsonAsString);

                var root = JToken.Parse(decryptedString);

 

                foreach (JObject o in SelectChildren(root, this.rowpath))

                {

                    this.JObjectToRow(o, output);

 

                    yield return output.AsReadOnly();

                }

            }

        }

 

        /// <summary/>

        private static IEnumerable<JObject> SelectChildren(JToken root, string path)

        {

            if (!string.IsNullOrEmpty(path))

            {

                return root.SelectTokens(path).OfType<JObject>();

            }

            var o = root as JObject;

            if (o != null)

            {

                return new[] { o };

            }

            return root.Children().OfType<JObject>();

        }

 

        /// <summary/>

        protected virtual void JObjectToRow(JObject o, IUpdatableRow row)

        {

            foreach (var c in row.Schema)

            {

                JToken token = null;

                object value = c.DefaultValue;

 

                if (o.TryGetValue(c.Name, out token) && token != null)

                {

                    value = JsonFunctions.ConvertToken(token, c.Type) ?? c.DefaultValue;

                }

                row.Set<object>(c.Name, value);

            }

        }

    }

}


The modified Extract method is using a CustomDecryptor object in which the actual decryption takes place:


using System;

using System.IO;

using System.Security.Cryptography;

 

public class CustomDecryptor

{

 

    private readonly byte[] _key = { — }; //YourKey

    private readonly byte[] _iv = { — }; //YourIv

 

    private readonly ICryptoTransform _encryptor;

    private readonly ICryptoTransform _decryptor;

 

    public CustomDecryptor()

    {

        var myRijndael = new RijndaelManaged { Key = this._key, IV = this._iv, Padding = PaddingMode.PKCS7 };

 

        this._encryptor = myRijndael.CreateEncryptor(myRijndael.Key, myRijndael.IV);

        this._decryptor = myRijndael.CreateDecryptor(myRijndael.Key, myRijndael.IV);

    }

 

    public string Decrypt(string input)

    {

        // Create the streams used for decryption.

        using (MemoryStream msDecrypt = new MemoryStream(Convert.FromBase64String(input)))

        {

            using (CryptoStream csDecrypt = new CryptoStream(msDecrypt, _decryptor, CryptoStreamMode.Read))

            {

                using (StreamReader srDecrypt = new StreamReader(csDecrypt))

                {

                    return srDecrypt.ReadToEnd();

                }

            }

        }

    }

}


From U-SQL we can now easily decrypt entire files. Make sure you register the Newtonsoft.Json assembly and your custom assembly (CustomADLALibrary), which contains the modified JsonExtractor, before you try to execute the U-SQL job. See my previous blog post for instructions on how to programmatically register U-SQL assemblies using PowerShell.


REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [CustomADLALibrary];

 

DECLARE @INPUT_FILE string = @"somePath/encryptedFile.json";

DECLARE @OUTPUT_FILE string = @"someOtherPath/decryptedFile.csv";

 

@decryptedExtract =

    EXTRACT column1 string,

            column2 string

    FROM @INPUT_FILE

    USING new CustomADLALibrary.CustomExtractor();

@result =

    SELECT *

    FROM @decryptedExtract;

OUTPUT @result

TO @OUTPUT_FILE

USING Outputters.Csv(quoting : false);


Scenario 2: Decrypt string values
In this case the files were not completely encrypted; only particular sensitive string values in the files were. Therefore the custom extractor was not needed and the decryption could take place directly in the U-SQL SELECT statement, as shown in the example below.


REFERENCE ASSEMBLY [CustomADLALibrary];

 

DECLARE @INPUT_FILE string = @"somePath/stringEncryptedFile.csv";

DECLARE @OUTPUT_FILE string = @"someOtherPath/decryptedFile.csv";

 

@extract =

    EXTRACT column1 string,

            column2 string

    FROM @INPUT_FILE

    USING Extractors.Csv();

 

@result =

    SELECT column1 AS unencryptedColumn,

           new WeakObjectEncryptor().Decrypt(column2) AS decryptedColumn

    FROM @extract;

 

OUTPUT @result

TO @OUTPUT_FILE

USING Outputters.Csv(quoting : false);


These decryption examples show one of the many possibilities of the powerful new U-SQL language. It's quite easy to use and understandable for both people with a SQL background and C# programmers. In this case I worked together with my colleague Luuk Gortzak, who helped with the C# scripts. Credits to Luuk for helping me out!

Script Azure Data Lake Analytics assembly deployments to U-SQL Catalog

Registering your custom assemblies using Visual Studio Data Lake Tools is easy, just right click your U-SQL C# Class Library project, click “Register Assembly”, fill in the details and your assembly gets published and created in your U-SQL Catalog.

But what if you want to script this process to automate your deployments? That is not documented yet, and as I follow the principle that we should be able to rebuild the whole environment with the press of a button, I found out how to do this.

Step 0 – Prerequisites
Install the latest version of Azure PowerShell. The PowerShell script below won't work with an older version.
http://aka.ms/webpi-azps

Step 1 – Build your dll from Visual Studio.
We need the dll of your custom assembly. Create it by right-clicking your U-SQL Class Library project and choosing Build. You can see where the dll has been created on disk in the output window.

Step 2 – Create a U-SQL job file that will create the assembly.
Modify the U-SQL script below and store it in a file somewhere locally on disk. Name it createAssembly.usql


USE DATABASE yourUsqlCatalog;

 

CREATE ASSEMBLY IF NOT EXISTS yourCSharp

FROM @"/Assemblies/yourCSharp.dll";


Step 3 – Upload the assembly to your ADLS and execute the U-SQL job using PowerShell.
Enter values for the variables in the PowerShell script below and execute it.


#Variables; modify
$dataLakeStoreName = "yourAdlsAccount"
$dataLakeAnalyticsName = "yourAdlaAccount"
$assemblyLocalPath = "c:\yourCSharp.dll" #step 1
$usqlScriptLocalPath = "c:\createAssembly.usql" #step 2
$assemblyAdlsPath = "/Assemblies/yourCSharp.dll" #Assemblies folder will be created if it does not exist
$azureSubscriptionId = "00000000-0000-0000-0000-000000000000"

#Login (login pop-up appears)
Login-AzureRmAccount

#Connect to the Azure Subscription in which your ADLA Catalog exists
Set-AzureRMContext -SubscriptionId $azureSubscriptionId

#Import dll to ADLS
Import-AzureRmDataLakeStoreItem -AccountName $dataLakeStoreName -Path $assemblyLocalPath -Destination $assemblyAdlsPath

#Submit new job to ADLA (createAssembly.usql)
$job = Submit-AzureRmDataLakeAnalyticsJob -Name "Create Assembly" -AccountName $dataLakeAnalyticsName -ScriptPath $usqlScriptLocalPath -DegreeOfParallelism 1

While (($t = Get-AzureRmDataLakeAnalyticsJob -AccountName $dataLakeAnalyticsName -JobId $job.JobId).State -ne "Ended"){
    Write-Host "Job status: $($t.State)..."
    Start-Sleep -seconds 10
}

Get-AzureRmDataLakeAnalyticsJob -AccountName $dataLakeAnalyticsName -JobId $job.JobId


Step 4 – Validate.
Validate that your dll has been uploaded to a folder in your ADLS called "Assemblies", and next, that your assembly has been created in your U-SQL Catalog, using Visual Studio Server Explorer (Azure).

Step 5 – Reference your new assembly in your U-SQL scripts.
You can now start using your assembly by referencing it in the first lines of code in your U-SQL script.


USE DATABASE yourUsqlCatalog;

 

REFERENCE ASSEMBLY yourCSharp;


 

Use AdlCopy to generate U-SQL jobs that copy data between Azure Blob Storage and Azure Data Lake Store

AdlCopy is a command-line tool (it runs on the user’s machine) that allows you to copy data from Azure Storage Containers or Blobs into Azure Data Lake Store.

You can use the AdlCopy tool in two ways:

  • Standalone, where the tool uses Data Lake Store resources to perform the task. This can be a cheaper and more ad-hoc/manual option to move data from blob storage to ADLS compared to using Data Factory or ADLA U-SQL jobs. I assume only ADLS transaction costs will be incurred if your blob storage resides in the same region as your ADLS.
  • Using a Data Lake Analytics account, where the units assigned to your Data Lake Analytics account are used to perform the copy operation. You would typically use this option when the data to be moved is in the range of gigabytes and terabytes, and you want better and predictable performance throughput.

Another advantage is that on execution the AdlCopy tool generates an ADLA job that contains U-SQL code for each source file from your blob storage that will be copied to the ADLS. You can reuse the code of this job to schedule execution of the copy process on a regular basis. A requirement for this scenario is that your blob storage source files have static paths and file names, as the U-SQL code of the generated job contains hardcoded references to the source file locations.

Download AdlCopy: http://aka.ms/downloadadlcopy

Instructions: https://azure.microsoft.com/en-us/documentation/articles/data-lake-store-copy-data-azure-storage-blob/

Copy all data from blob storage container to ADLS folder using a generated U-SQL job that can be reused:

1. Start Command Prompt

2. Navigate to the directory where AdlCopy is installed:

%HOMEPATH%\Documents\adlcopy

3. Run the following command (standalone, copy is performed by ADLS):

AdlCopy.exe /Source https://yourblob.blob.core.windows.net/yourcontainer/ /Dest swebhdfs://yourdatalakestore.azuredatalakestore.net/yourfolder/ /SourceKey –

4. Data is copied:

clip_image001

5. Run the following command to execute the process using the Data Lake Analytics service:

AdlCopy.exe /Source https://yourblob.blob.core.windows.net/yourcontainer/ /Dest swebhdfs://yourdatalakestore.azuredatalakestore.net/yourfolder/ /SourceKey – /Account yourDataLakeAnalyticsAccount /Units 2

6. Now because the ADLA service has been used a U-SQL job was generated and executed, which can be seen in the azure portal:

clip_image003

7. The U-SQL code shows that for each file in the blob storage EXTRACT from blob storage and OUTPUT to ADLS statements were created:

clip_image005

8. You can now choose "Duplicate Script", delete the DROP ASSEMBLY IF EXISTS and CREATE ASSEMBLY statements which you don't need again, and save your new job. You can now execute/schedule the copy process again without the need for the AdlCopy tool.

clip_image006
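If you save the duplicated script to a .usql file, you can resubmit it whenever needed with the same cmdlet used earlier in this post for the assembly deployment; a minimal sketch with placeholder names:

# Resubmit the copy job that AdlCopy generated, without running AdlCopy again
Submit-AzureRmDataLakeAnalyticsJob -Name "Blob to ADLS copy" -AccountName "yourDataLakeAnalyticsAccount" -ScriptPath "C:\CopyBlobToAdls.usql" -DegreeOfParallelism 2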