Connect Logic Apps to Data Lake Store using a service principal

For a few months now, Azure Logic Apps has offered an Azure Data Lake connector that can perform the following actions on an Azure Data Lake Store:

  • Azure Data Lake – Append File (concurrent)
  • Azure Data Lake – Read File
  • Azure Data Lake – Upload File
  • Azure Data Lake – Append File (sequential)
  • Azure Data Lake – Create Folder
  • Azure Data Lake – Delete File
  • Azure Data Lake – List Files

Lately I have noticed several people on forums facing “Unauthorized” errors while using the Azure Data Lake connector. In this blog post I will explain how you can authenticate using a service principal that has been authorized on the Azure Data Lake Store for a specific action. As an example I will show how to create a folder.

Azure Data Lake Store uses Azure Active Directory for authentication. When you use the Logic Apps Azure Data Lake connector, you will see that there are two possible ways to authenticate.

You can either sign in with an Azure AD account, or you can connect using a service principal, which is the option I will describe. With both options, your Logic App ends up with an OAuth 2.0 token that is attached to each request made to the Data Lake Store.
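
To illustrate what happens behind the scenes, here is a minimal sketch of how such a token could be requested manually with PowerShell. This assumes the AAD v1 token endpoint and the Data Lake resource URI https://datalake.azure.net/; the tenant ID, client ID and secret below are placeholders for the values we gather later in this post. The connector does all of this for you.

# Hedged sketch: request an OAuth 2.0 token for Azure Data Lake Store
# using the client credentials flow (placeholder tenant/client values).
$tenantId     = "00000000-0000-0000-0000-000000000000"   # Directory ID
$clientId     = "00000000-0000-0000-0000-000000000000"   # Application ID
$clientSecret = "your-app-registration-key"              # Key (Client Secret)

$body = @{
    grant_type    = "client_credentials"
    client_id     = $clientId
    client_secret = $clientSecret
    resource      = "https://datalake.azure.net/"         # ADLS resource URI
}

$token = Invoke-RestMethod -Method Post `
    -Uri "https://login.microsoftonline.com/$tenantId/oauth2/token" `
    -Body $body

# The access token is sent as a Bearer token on each request to the Data Lake Store
$token.access_token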

So what is a service principal? It is basically an impersonation identity of your application, which is registered in Azure Active Directory. Using a service principal is preferred over running under your own credentials, because:

  • You can assign permissions to the app identity that are different than your own permissions. Typically, these permissions are restricted to only what the app needs to do.
  • You do not have to change the app credentials if your responsibilities change.
  • You can use a certificate to automate authentication when executing an unattended script.

When you register an Azure AD application in the Azure portal, you receive a service principal object in addition to an application object. A multi-tenant application can have multiple service principals, one per tenant. That is not the case for the example I describe in this blog post, but I advise you to read the following article to fully understand the concept of applications and service principals, and the relationship between them: Application and service principal objects in Azure Active Directory (Azure AD)

Now let me guide you through the following steps:

  1. Registering an application in Azure AD
  2. Granting the service principal access to an Azure Data Lake Store
  3. Using the service principal in a Logic App to connect to the Azure Data Lake Store and create a folder

Overview

Let’s start with what is required to configure the Azure Data Lake connector in the Logic App to use a service principal:

  • Connection Name: Display Name of the API Connection that will be created
  • Tenant: Directory ID property of your Azure Active Directory tenant
  • Client ID: Application ID of your Application Registration in Azure Active Directory
  • Client Secret: Key of your Application Registration in Azure Active Directory


Find out your Azure Active Directory ID (Tenant)

Navigate to the Azure portal, then go to the Azure Active Directory page and click Properties. Now copy the “Directory ID” and store it (Tenant property).
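
If you prefer scripting, the Directory ID can also be retrieved with the AzureRM PowerShell module; a minimal sketch (depending on the module version the property is exposed as Tenant.Id or Tenant.TenantId):

# Hedged sketch: retrieve the Directory ID (tenant) with Azure PowerShell
Login-AzureRmAccount
(Get-AzureRmContext).Tenant.Id

# Or list all tenants your account has access to
Get-AzureRmTenant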

Register Application in Azure Active Directory

Navigate to the Azure Active Directory page again. On the Overview page, click “App registrations”.

Click “New application registration”.

Enter a name for your App registration, select Web app / API as the application type and enter a dummy Sign-on URL, as we don’t intend to use web authorization.

You now probably see a long list of existing App registrations. Search for the one you just created and click it.

Copy and save the “Application ID” value; we need it for the Client ID property of the Logic App Data Lake connector.

Now click “All settings”, followed by “Keys”.

Enter a key description and choose a duration. After you save, a new key value will be created. Copy and store it; we need it for the Client Secret property of the Logic App Data Lake connector.
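
The same app registration can also be created with Azure PowerShell. A minimal sketch, assuming the AzureRM module; the display name, URLs and key value are placeholders, and the exact parameter forms (for example whether -Password takes a plain string or a SecureString) vary between module versions:

# Hedged sketch: create the app registration, its service principal and a key
$app = New-AzureRmADApplication -DisplayName "LogicAppAdlsDemo" `
    -HomePage "https://localhost/logicappadlsdemo" `
    -IdentifierUris "https://localhost/logicappadlsdemo"

# Create the service principal for the application object
$sp = New-AzureRmADServicePrincipal -ApplicationId $app.ApplicationId

# Add a key (client secret) to the application registration
New-AzureRmADAppCredential -ObjectId $app.ObjectId -Password "your-client-secret"

# The Application ID is the Client ID for the Logic App connector
$app.ApplicationId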

Grant the Application Registration (service principal) access to the Azure Data Lake Store

In this example I will grant the Application Registration write and execute permissions, so the Logic App will be able to create a folder in the Azure Data Lake Store.

In the Azure portal, navigate to your Azure Data Lake Store. Click “Overview” and then “Data Explorer”.

Click “Access”.

Click “Add”.

You can now search not only for Active Directory users or groups, but also for App registrations. Search for your App registration by name and click it.

To be able to create folders, Write and Execute permissions are required. In this example I grant access to “This folder and all children”; in my case “this folder” is the root (/) folder. I add these permissions as “An access permission entry”, which means the permissions are applied to the currently existing files and folders. “A default permission entry” would also be applied to folders and files that are added to this folder in the future.

The permissions have been granted to the App Registration (service principal).
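
The same permissions can be granted from PowerShell as well. A minimal sketch, assuming the AzureRM.DataLakeStore cmdlets; the account name and object ID are placeholders, and note that you need the Object ID of the service principal, not the Application ID:

# Hedged sketch: grant the service principal Write and Execute permissions on the root folder
$adlsAccount = "yourdatalakestore"
$spObjectId  = "00000000-0000-0000-0000-000000000000"   # Object ID of the service principal

# Access permission entry on the existing root folder
Set-AzureRmDataLakeStoreItemAclEntry -AccountName $adlsAccount -Path "/" `
    -AceType User -Id $spObjectId -Permissions WriteExecute

# Default permission entry, so future children of the root folder inherit the permissions
Set-AzureRmDataLakeStoreItemAclEntry -AccountName $adlsAccount -Path "/" `
    -AceType User -Id $spObjectId -Permissions WriteExecute -Default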

Create the Logic App and use the service principal to connect

Create a new Logic App from the Azure portal. From the online Logic App Designer, search for the Azure Data Lake actions and choose “Create Folder”.

Click “Connect with Service Principal”.

Configure the connection using the values we saved in the previous steps.

Now enter your Azure Data Lake Store Account Name. This is just the name of your Azure Data Lake Store. Also enter a name for the folder in the Folder Path box.

You can now click the Run button to test your Logic App, which should finish successfully.

When you now check the Data Explorer of your Azure Data Lake Store, you will see that the new folder has been created.
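
You can also verify the result from PowerShell; a minimal sketch, assuming the account and folder names used above:

# Hedged sketch: check that the folder created by the Logic App exists
Test-AzureRmDataLakeStoreItem -AccountName "yourdatalakestore" -Path "/yourfolder"

# Or list the children of the root folder
Get-AzureRmDataLakeStoreChildItem -AccountName "yourdatalakestore" -Path "/"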

Azure Data Lake Store encryption using Azure Key Vault for key management

Case
You want to create an encrypted Azure Data Lake Store (ADLS) with a master encryption key that is stored and managed in your own existing Azure Key Vault.

Using this setup, which is shown in the diagram below, all data in your Data Lake Store will be encrypted before it is stored on disk. To decrypt the data, a master encryption key is required.
In this scenario a “customer managed” key will be used, which means the key is created and managed in your own Azure Key Vault. This is an alternative to a key that is managed and owned by the Data Lake Store service, which is the default. Managing the key in Key Vault gives you additional possibilities, like revoking the ADLS service identity’s access to the key or even permanently deleting the key from the Key Vault.

[Diagram: ADLS encryption with a customer managed key in Azure Key Vault]

Solution
In this blog post I’ll guide you through the 3 steps below, all in an automated way, using PowerShell scripting and an Azure Resource Manager (ARM) template to create your encrypted ADLS. I plan to blog later about the possibilities that Visual Studio Team Services offers to perform these deployment tasks.

  1. Create a new “customer managed” key in your existing Azure Key Vault
  2. Create a new ADLS with data encryption enabled
  3. Grant the ADLS service principal access to the Azure Key Vault and enable Key Vault managed encryption using your “customer managed” key

 

Prerequisites

  • An Azure Resource Group. I have created one named “adls-keyvault-demo” (akd).
  • An Azure Key Vault, if you do not already have one. I have created one named “akd-keyvault”.
  • The AzureRM 4.1.0 module (or later) from the PowerShell Gallery, required because we will use the new Enable-AzureRmDataLakeStoreKeyVault PowerShell function.

 

PowerShell script

Executing the PowerShell script below first creates the new key in your existing Azure Key Vault, then creates a new ADLS using an ARM template (see below) and finally enables Key Vault managed encryption for your new ADLS. The comments in the script give further explanation, and messages written to the Windows PowerShell console during execution inform you about what is happening. Make sure you have at least AzureRM 4.1.0 installed and that the account you will use has sufficient permissions.

The following variables are used:

  • subscriptionId – Azure Subscription ID
  • rg – Azure Resource Group name
  • keyVaultUri – Key Vault DNS name; check your Key Vault properties in the Azure portal
  • keyName – Name of the Key Vault key that will be used for the ADLS
  • armTemplateFileAdls – Path of your ADLS ARM template JSON file. You can find the definition below the PowerShell script; copy/paste it into a JSON file and store it on disk
  • adlsName – Name of your ADLS


# Variables; modify
$subscriptionId = "00000000-0000-0000-0000-000000000000"
$rg = "adls-keyvault-demo"
$keyVaultUri = "https://akd-keyvault.vault.azure.net/"
$keyName = "akd-adls-key"
$armTemplateFileAdls = "C:\CreateEncryptedADLS.JSON"
$adlsName = "akdadls"

# Authenticate to Azure and set the subscription context
Login-AzureRmAccount
Set-AzureRMContext -SubscriptionId $subscriptionId

Write-Host "Get Key Vault Name from URI $keyVaultUri"
$keyVaultHost = ([System.Uri]$keyVaultUri).Host
$keyVaultName = $keyVaultHost.Substring(0, $keyVaultHost.IndexOf('.'))

Write-Host "Creating software-protected key $keyName in Key Vault $keyVaultName"
$adlsKey = Add-AzureKeyVaultKey -Destination Software -Name $keyName -VaultName $keyVaultName

# Get the current version identifier of the key, which will be used for the creation of the ADLS via the encryptionKeyVersion parameter
$adlsKeyId = $adlsKey.Version.ToString()

Write-Host "Create new encrypted ADLS by deploying ARM script $armTemplateFileAdls in resource group $rg"
New-AzureRmResourceGroupDeployment -ResourceGroupName $rg -TemplateFile $armTemplateFileAdls `
    -DataLakeStoreName $adlsName -KeyVaultName $keyVaultName -DataLakeStoreKeyVaultKeyName $keyName -DataLakeStoreKeyVaultKeyVersion $adlsKeyId

# Get the ADLS account and its Service Principal Id
$adlsAccount = Get-AzureRmDataLakeStoreAccount -Name $adlsName
$adlsAccountSPId = $adlsAccount.Identity.PrincipalId

Write-Host "Grant ADLS account Service Principal $adlsAccountSPId required permissions on the Key Vault"
# Grant the ADLS account access to perform encrypt, decrypt and get operations with the Key Vault key
Set-AzureRmKeyVaultAccessPolicy -VaultName $keyVaultName -ObjectId $adlsAccountSPId -PermissionsToKeys encrypt,decrypt,get -BypassObjectIdValidation

Write-Host "Enable ADLS Key Vault managed encryption"
Enable-AdlStoreKeyVault -Account $adlsAccount.Name

Write-Host "ADLS $adlsName is now encrypted using key $keyName in Key Vault $keyVaultName"


 

ARM Template ADLS


{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "DataLakeStoreName": {
      "type": "string",
      "minLength": 1
    },
    "KeyVaultName": {
      "type": "string",
      "minLength": 1
    },
    "DataLakeStoreKeyVaultKeyName": {
      "type": "string",
      "minLength": 1
    },
    "DataLakeStoreKeyVaultKeyVersion": {
      "type": "string",
      "minLength": 1
    }
  },
  "resources": [
    {
      "type": "Microsoft.DataLakeStore/accounts",
      "name": "[parameters('DataLakeStoreName')]",
      "apiVersion": "2016-11-01",
      "location": "North Europe",
      "tags": {
        "displayName": "Datalake Store"
      },
      "identity": {
        "type": "SystemAssigned"
      },
      "properties": {
        "encryptionState": "Enabled",
        "encryptionConfig": {
          "type": "UserManaged",
          "keyVaultMetaInfo": {
            "keyVaultResourceId": "[resourceId('Microsoft.KeyVault/vaults', parameters('KeyVaultName'))]",
            "encryptionKeyName": "[parameters('DataLakeStoreKeyVaultKeyName')]",
            "encryptionKeyVersion": "[parameters('DataLakeStoreKeyVaultKeyVersion')]"
          }
        }
      }
    }
  ]
}


 

After you successfully execute the PowerShell script, navigate to the Azure portal to check if everything is OK.

Data Lake Store → Settings → Encryption


The account is successfully encrypted using the Key Vault key. The ADLS account has a generated Service Principal named “RN_akdadls”, to which we granted permissions on the Key Vault in the PowerShell script.

 

Key Vault → Settings → Keys


The key has been created and is enabled.

 

Key Vault → Settings → Access policies


The ADLS Service Principal has an access policy that we set with the PowerShell script.

Opening it shows the encrypt, decrypt and get key permissions.

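The same checks can be done from PowerShell; a minimal sketch, reusing the variables from the script above:

# Hedged sketch: verify the encryption configuration and the Key Vault setup
$adlsAccount = Get-AzureRmDataLakeStoreAccount -Name $adlsName
$adlsAccount.EncryptionState     # expected: Enabled
$adlsAccount.EncryptionConfig    # expected: Type UserManaged plus the Key Vault metadata

# Verify the key and the access policies on the Key Vault
Get-AzureKeyVaultKey -VaultName $keyVaultName -Name $keyName
(Get-AzureRmKeyVault -VaultName $keyVaultName).AccessPolicies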


Special thanks to my Macaw colleague Simon Zeinstra for working together on this solution!

Azure Data Lake Analytics U-SQL decryption possibilities

The following information applies to the situation as of October 2016.

One of the things that make U-SQL so powerful is C# integration. It gives you the possibility to create your own C# classes and methods and then use them in your U-SQL scripts. Recently we executed a couple of big data projects at Macaw using the Azure Data Lake Store (ADLS) and Azure Data Lake Analytics (ADLA). During one of these projects we had to decrypt some custom Rijndael (AES) encrypted JSON content in the ADLS and store the result as a CSV. These CSV files would then be encrypted by default by enabling encryption at rest for the entire ADLS.

We were able to support two decryption scenarios: decrypt the entire file or decrypt a part of a file (particular sensitive string values). I will go into detail about each scenario.

Scenario 1: Decrypt entire files
Because we were dealing with JSON files and U-SQL does not offer a native JSON extractor yet, we used the JsonExtractor C# sample that is shared on GitHub by Michal Rys.
We modified the Extract method of the JsonExtractor class to include a CustomDecryptor that decrypts input.BaseStream (the entire file). Next to that, we set AtomicFileProcessing to true.


using System.IO;
using System.Linq;
using System.Collections.Generic;
using Microsoft.Analytics.Interfaces;
using Newtonsoft.Json.Linq;
using Custom.Framework.Utilities.Encoding;

namespace CustomADLALibrary
{
    [SqlUserDefinedExtractor(AtomicFileProcessing = true)]
    public class CustomExtractor : IExtractor
    {
        /// <summary/>
        private string rowpath;

        /// <summary/>
        public CustomExtractor(string rowpath = null)
        {
            this.rowpath = rowpath;
        }

        /// <summary/>
        public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
        {
            CustomDecryptor Decryptor = new CustomDecryptor();
            using (var reader = new StreamReader(input.BaseStream))
            {
                var jsonAsString = reader.ReadToEnd();
                var decryptedString = Decryptor.Decrypt(jsonAsString);
                var root = JToken.Parse(decryptedString);

                foreach (JObject o in SelectChildren(root, this.rowpath))
                {
                    this.JObjectToRow(o, output);
                    yield return output.AsReadOnly();
                }
            }
        }

        /// <summary/>
        private static IEnumerable<JObject> SelectChildren(JToken root, string path)
        {
            if (!string.IsNullOrEmpty(path))
            {
                return root.SelectTokens(path).OfType<JObject>();
            }
            var o = root as JObject;
            if (o != null)
            {
                return new[] { o };
            }
            return root.Children().OfType<JObject>();
        }

        /// <summary/>
        protected virtual void JObjectToRow(JObject o, IUpdatableRow row)
        {
            foreach (var c in row.Schema)
            {
                JToken token = null;
                object value = c.DefaultValue;

                if (o.TryGetValue(c.Name, out token) && token != null)
                {
                    value = JsonFunctions.ConvertToken(token, c.Type) ?? c.DefaultValue;
                }
                row.Set<object>(c.Name, value);
            }
        }
    }
}


The modified Extract method uses a CustomDecryptor object in which the actual decryption takes place:


using System;
using System.IO;
using System.Security.Cryptography;

public class CustomDecryptor
{
    private readonly byte[] _key = { — }; //YourKey
    private readonly byte[] _iv = { — }; //YourIv

    private readonly ICryptoTransform _encryptor;
    private readonly ICryptoTransform _decryptor;

    public CustomDecryptor()
    {
        var myRijndael = new RijndaelManaged { Key = this._key, IV = this._iv, Padding = PaddingMode.PKCS7 };

        this._encryptor = myRijndael.CreateEncryptor(myRijndael.Key, myRijndael.IV);
        this._decryptor = myRijndael.CreateDecryptor(myRijndael.Key, myRijndael.IV);
    }

    public string Decrypt(string input)
    {
        // Create the streams used for decryption.
        using (MemoryStream msDecrypt = new MemoryStream(Convert.FromBase64String(input)))
        {
            using (CryptoStream csDecrypt = new CryptoStream(msDecrypt, _decryptor, CryptoStreamMode.Read))
            {
                using (StreamReader srDecrypt = new StreamReader(csDecrypt))
                {
                    return srDecrypt.ReadToEnd();
                }
            }
        }
    }
}


From U-SQL we can now easily decrypt entire files. Before you execute the U-SQL job, make sure you register the Newtonsoft.Json assembly and your custom assembly (CustomADLALibrary) that contains the modified JsonExtractor. See my previous blog post for instructions on how to programmatically register U-SQL assemblies using PowerShell; a short sketch also follows after the U-SQL script below.


REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [CustomADLALibrary];

DECLARE @INPUT_FILE string = @"somePath/encryptedFile.json";
DECLARE @OUTPUT_FILE string = @"someOtherPath/decryptedFile.csv";

@decryptedExtract =
    EXTRACT column1 string,
            column2 string
    FROM @INPUT_FILE
    USING new CustomADLALibrary.CustomExtractor();

@result =
    SELECT *
    FROM @decryptedExtract;

OUTPUT @result
TO @OUTPUT_FILE
USING Outputters.Csv(quoting : false);
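
As mentioned above, both assemblies must be registered in the ADLA catalog before this job can run. A minimal sketch of how that could be scripted with PowerShell; the local paths, store folder and account names are placeholder values:

# Hedged sketch: upload the assembly DLLs to the Data Lake Store and register them in ADLA
$adlsAccount = "yourdatalakestore"
$adlaAccount = "yourdatalakeanalytics"

Import-AzureRmDataLakeStoreItem -AccountName $adlsAccount -Path "C:\assemblies\Newtonsoft.Json.dll" -Destination "/assemblies/Newtonsoft.Json.dll" -Force
Import-AzureRmDataLakeStoreItem -AccountName $adlsAccount -Path "C:\assemblies\CustomADLALibrary.dll" -Destination "/assemblies/CustomADLALibrary.dll" -Force

# Register the assemblies by submitting a small U-SQL job
$script = @'
CREATE ASSEMBLY IF NOT EXISTS [Newtonsoft.Json] FROM "/assemblies/Newtonsoft.Json.dll";
CREATE ASSEMBLY IF NOT EXISTS [CustomADLALibrary] FROM "/assemblies/CustomADLALibrary.dll";
'@
Submit-AzureRmDataLakeAnalyticsJob -Account $adlaAccount -Name "RegisterAssemblies" -Script $script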


Scenario 2: Decrypt string values
In this case the files were not completely encrypted; only particular sensitive string values in them were. Therefore the custom extractor was not needed and the decryption could take place directly in the U-SQL SELECT statement, as shown in the example below.


REFERENCE ASSEMBLY [CustomADLALibrary];

DECLARE @INPUT_FILE string = @"somePath/stringEncryptedFile.csv";
DECLARE @OUTPUT_FILE string = @"someOtherPath/decryptedFile.csv";

@extract =
    EXTRACT column1 string,
            column2 string
    FROM @INPUT_FILE
    USING Extractors.Csv();

@result =
    SELECT column1 AS unencryptedColumn,
           new WeakObjectEncryptor().Decrypt(column2) AS decryptedColumn
    FROM @extract;

OUTPUT @result
TO @OUTPUT_FILE
USING Outputters.Csv(quoting : false);


These decryption examples show just one of the many possibilities of the powerful new U-SQL language. It’s quite easy to use and understandable for both people with a SQL background and C# programmers. In this case I worked together with my colleague Luuk Gortzak, who helped with the C# scripts. Credits to Luuk for helping me out!

Use AdlCopy to generate U-SQL jobs that copy data between Azure Blob Storage and Azure Data Lake Store

AdlCopy is a command-line tool (it runs on the user’s machine) that allows you to copy data from Azure Storage Containers or Blobs into Azure Data Lake Store.

You can use the AdlCopy tool in two ways:

  • Standalone, where the tool uses Data Lake Store resources to perform the task. This can be a cheap and more ad-hoc/manual option to move data from blob storage to ADLS compared to using Data Factory or ADLA U-SQL jobs. I assume only ADLS transaction costs will be incurred if your blob storage resides in the same region as your ADLS.
  • Using a Data Lake Analytics account, where the units assigned to your Data Lake Analytics account are used to perform the copy operation. You would typically use this option when the data to be moved is in the range of gigabytes and terabytes, and you want better and predictable performance throughput.

Another advantage is that on execution the AdlCopy tool generates an ADLA job that contains U-SQL code for each source file from your blob storage that will be copied to the ADLS. You can reuse the code of this job to schedule execution of the copy process on a regular basis. A requirement for this scenario is that your blob storage source files have static paths and file names, as the U-SQL code of the generated job contains hardcoded references to the source file locations.

Download AdlCopy: http://aka.ms/downloadadlcopy

Instructions: https://azure.microsoft.com/en-us/documentation/articles/data-lake-store-copy-data-azure-storage-blob/

Copy all data from blob storage container to ADLS folder using a generated U-SQL job that can be reused:

1. Start Command Prompt

2. Navigate to the directory where AdlCopy is installed:

%HOMEPATH%\Documents\adlcopy

3. Run the following command (standalone, copy is performed by ADLS):

AdlCopy.exe /Source https://yourblob.blob.core.windows.net/yourcontainer/ /Dest swebhdfs://yourdatalakestore.azuredatalakestore.net/yourfolder/ /SourceKey –

4. The data is copied.


5. Run the following command to execute the process using the Data Lake Analytics service:

AdlCopy.exe /Source https://yourblob.blob.core.windows.net/yourcontainer/ /Dest swebhdfs://yourdatalakestore.azuredatalakestore.net/yourfolder/ /SourceKey – /Account yourDataLakeAnalyticsAccount /Units 2

6. Because the ADLA service has been used this time, a U-SQL job was generated and executed, which can be seen in the Azure portal.


7. The U-SQL code shows that for each file in the blob storage, an EXTRACT (from blob storage) and an OUTPUT (to ADLS) statement were created.


8. You can now choose “Duplicate Script”, delete the DROP ASSEMBLY IF EXISTS and CREATE ASSEMBLY statements which you don’t need again, and save your new job. You can then execute and schedule the copy process again without needing the AdlCopy tool; a sketch follows below.

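For example, the saved U-SQL script could be resubmitted from PowerShell (and scheduled with the scheduler of your choice). A minimal sketch, assuming the duplicated script has been saved locally; the file name and account name are placeholders:

# Hedged sketch: resubmit the duplicated U-SQL copy script as a new ADLA job
$adlaAccount = "yourDataLakeAnalyticsAccount"
$scriptPath  = "C:\jobs\CopyBlobToAdls.usql"

$job = Submit-AzureRmDataLakeAnalyticsJob -Account $adlaAccount `
    -Name "ScheduledBlobToAdlsCopy" `
    -ScriptPath $scriptPath `
    -DegreeOfParallelism 2

# Wait for the job to finish and show the result
Wait-AzureRmDataLakeAnalyticsJob -Account $adlaAccount -JobId $job.JobId
Get-AzureRmDataLakeAnalyticsJob -Account $adlaAccount -JobId $job.JobId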