Azure Data Lake Analytics U-SQL decryption possibilities

Posted by

The following information applies to the situation as of October 2016.

One of the things that make U-SQL so powerful is C# integration. It gives you the possibility to create your own C# classes and methods and then use them in your U-SQL scripts. Recently we executed a couple of big data projects at Macaw using the Azure Data Lake Store (ADLS) and Azure Data Lake Analytics (ADLA). During one of these projects we had to decrypt some custom Rijndael (AES) encrypted JSON content in the ADLS and store the result as a CSV. These CSV files would then be encrypted by default by enabling encryption at rest for the entire ADLS.

We were able to support two decryption scenarios: decrypt the entire file or decrypt a part of a file (particular sensitive string values). I will go into detail about each scenario.

Scenario 1: Decrypt entire files
Because we were dealing with JSON files and U-SQL does not offer a native JSON extractor yet, we used the JsonExtractor C# sample that is shared on GitHub by Michal Rys
.
We modified the Extract method of the JsonExtractor class to include a CustomDecryptor that will decrypt the input.BaseStream (= entire file). Next to that we set AtomicFileProcessing to true.

using System.IO;
using System.Linq;
using System.Collections.Generic;
using Microsoft.Analytics.Interfaces;
using Newtonsoft.Json.Linq;
using Custom.Framework.Utilities.Encoding;

namespace CustomADLALibrary
{

        [SqlUserDefinedExtractor(AtomicFileProcessing = true)]
        public class CustomExtractor : IExtractor
        {
                // summary
                private string rowpath;

                // summary
                public CustomExtractor(string rowpath = null)
                {
                        this.rowpath = rowpath;
                }

                // summary
                public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
                {

                        CustomDecryptor Decryptor = new CustomDecryptor();
                        using (var reader = new StreamReader(input.BaseStream))
                        {
                                var jsonAsString = reader.ReadToEnd();
                                var decryptedString = Decryptor.Decrypt(jsonAsString);
                                var root = JToken.Parse(decryptedString);

                                foreach (JObject o in SelectChildren(root, this.rowpath))
                                {
                                        this.JObjectToRow(o, output);

                                        yield return output.AsReadOnly();
                                }
                        }
                }

                // summary
                private static IEnumerable<JObject> SelectChildren(JToken root, string path)
                {
                        if (!string.IsNullOrEmpty(path))
                        {
                                return root.SelectTokens(path).OfType<JObject>();
                        }
                        var o = root as JObject;
                        if (o != null)
                        {
                                return new[] { o };
                        }
                        return root.Children().OfType<JObject>();
                }

                // summary
                protected virtual void JObjectToRow(JObject o, IUpdatableRow row)
                {
                        foreach (var c in row.Schema)
                        {
                                JToken token = null;
                                object value = c.DefaultValue;

                                if (o.TryGetValue(c.Name, out token) && token != null)
                                {
                                        value = JsonFunctions.ConvertToken(token, c.Type) ?? c.DefaultValue;
                                }
                                row.Set<object>(c.Name, value);
                        }
                }
        }
}

The modified Extract method is using a CustomDecryptor object in which the actual decryption takes place:

using System;
using System.IO;
using System.Security.Cryptography;

public class CustomDecryptor
{

        private readonly byte[] _key = { -- }; //YourKey
        private readonly byte[] _iv = { -- }; //YourIv

        private readonly ICryptoTransform _decryptor;

        public CustomDecryptor()
        {
                var myRijndael = new RijndaelManaged { Key = this._key, IV = this._iv, Padding = PaddingMode.PKCS7 };

                this._encryptor = myRijndael.CreateEncryptor(myRijndael.Key, myRijndael.IV);
                this._decryptor = myRijndael.CreateDecryptor(myRijndael.Key, myRijndael.IV);
        }

        public string Decrypt(string input)
        {
                // Create the streams used for decryption.
                using (MemoryStream msDecrypt = new MemoryStream(Convert.FromBase64String(input)))
                {
                        using (CryptoStream csDecrypt = new CryptoStream(msDecrypt, _decryptor, CryptoStreamMode.Read))
                        {
                                using (StreamReader srDecrypt = new StreamReader(csDecrypt))
                                {
                                        return srDecrypt.ReadToEnd();
                                }
                        }
                }
        }
}

From U-SQL we can now easily decrypt entire files. Make sure you publish the Microsoft.Json assembly and your custom assembly (CustomADLALibrary) that contains the modified JsonExtractor that is used before you try to execute the U-SQL job. See my previous blog post for instructions on how to programmatically register U-SQL assemblies using PowerShell.

REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [CustomADLALibrary];

DECLARE @INPUT_FILE string = @”somePath/encryptedFile.json”;
DECLARE @OUTPUT_FILE string = @”someOtherPath/decryptedFile.csv”;

@decryptedExtract =
    EXTRACT column1 string,
            column2 string
    FROM @INPUT_FILE
    USING new CustomADLALibrary.CustomExtractor();

@result =
    SELECT *
    FROM @decryptedExtract;

OUTPUT @result
TO @OUTPUT_FILE
USING Outputters.Csv(quoting : false);

Scenario 2: Decrypt string values
In this case the files were not completely encrypted but only particular sensitive string values in the files. Therefore the custom extractor was not needed and the decryption could take place directly in the U-SQL SELECT statement as shown in the example below.

REFERENCE ASSEMBLY [CustomADLALibrary];

DECLARE @INPUT_FILE string = @”somePath/stringEncryptedFile.csv”;
DECLARE @OUTPUT_FILE string = @”someOtherPath/decryptedFile.csv”;

@extract =
    EXTRACT column1 string,
            column2 string
    FROM @INPUT_FILE
    USING Extractors.Csv();

@result =
    SELECT column1 AS unencryptedColumn,
           new WeakObjectEncryptor().Decrypt(column2) AS decryptedColumn
    FROM @extract;

OUTPUT @result
TO @OUTPUT_FILE
USING Outputters.Csv(quoting : false);

These decryption examples show one of the many possibilities of the new powerful new U-SQL language. It’s quite easy to use and understandable for both people with a SQL background and C# programmers. In this case I worked together with my colleague Luuk Gortzak who helped with the C# scripts. Credits to Luuk for helping me out!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s