Finally, at Ignite, Azure Data Factory Version 2 was announced! A giant step forward if you ask me. In this blog post I will give an overview of the highlights of this exciting new preview version of Azure’s data movement and transformation PaaS service. ADF V2 pricing can be found here.
The first good thing to mention is the documentation Microsoft has written: there is a lot of it and it is very detailed. You can find it here: Azure Data Factory V2 Preview Documentation. On the left side you will find links to the different articles, and on the main page links to the excellent step-by-step tutorials.
So, what’s new?
Execute SSIS packages
ADF V2 builds upon the original Azure Data Factory data movement and transformation service. While there were only minor changes since the initial release, mainly new connectors, this release is huge. Not only does ADF V2 contain a lot of new and changed functionality, it is now also possible to run SSIS packages in Azure, integrated with ADF, meaning you can orchestrate the execution of your SSIS packages. This enables lift-and-shift scenarios from on-premises to Azure and makes all those nice fine-grained SSIS data transformations and the rich set of connectors available from within ADF. SSIS packages are moved to the cloud using an SSIS “Integration Runtime” (IR), a managed hosting environment that comes with ADF V2 (more info below). Using the SSIS IR you can deploy, execute, manage and monitor SSIS packages in Azure. Key here is that ADF acts as the orchestrator and that, in my opinion, ADF-SSIS is mainly interesting for lift-and-shift scenarios of existing on-premises solutions to the cloud.
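To give an impression of what such orchestration could look like, here is a minimal sketch of a pipeline that starts a package hosted in the SSIS catalog (SSISDB) through a Stored Procedure activity; the linked service, folder, project and package names below are all hypothetical:

```json
{
  "name": "RunSsisPackagePipeline",
  "properties": {
    "activities": [
      {
        "name": "ExecuteSsisPackage",
        "type": "SqlServerStoredProcedure",
        "linkedServiceName": {
          "referenceName": "SsisdbLinkedService",
          "type": "LinkedServiceReference"
        },
        "typeProperties": {
          "storedProcedureName": "sp_executesql",
          "storedProcedureParameters": {
            "stmt": {
              "value": "DECLARE @exec_id BIGINT; EXEC [SSISDB].[catalog].[create_execution] @folder_name = N'MyFolder', @project_name = N'MyProject', @package_name = N'LoadSales.dtsx', @execution_id = @exec_id OUTPUT; EXEC [SSISDB].[catalog].[start_execution] @exec_id;"
            }
          }
        }
      }
    ]
  }
}
```

The pattern simply calls the standard SSIS catalog procedures create_execution and start_execution on the SSISDB hosted by the SSIS IR, so everything the catalog offers (environments, logging, reports) keeps working.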
Integration Runtime (IR)
The Integration Runtime or IR is the compute infrastructure used by ADF V2; it determines where your activity runs or is dispatched from. There are three IR types:
- Azure: Public Azure (PaaS) network, with public accessible endpoints.
- Self-hosted: On-premises machine or Azure VM (IaaS) inside a private network (VNet).
- Azure-SSIS: Can be provisioned in either the public Azure network or a private Azure VNet. On-premises data is accessible if you connect your VNet to your on-premises network (VPN).
More activity types
There are now three types of pipeline activities that determine, at a high level, what you can do with ADF V2:
- Data movement: Move data between on-premises/cloud data stores.
- Activity dispatch: Execute and monitor a data transformation activity on Azure compute like Azure SQL Database, Azure Data Lake Analytics, Azure Machine Learning or Azure HDInsight. You are now also able to trigger Logic Apps or Azure Functions using the new Web Activity, which is described in more detail below.
- SSIS package: Execute SSIS packages that run on managed Azure compute.
Next to the data flow, ADF V2 comes with a separate Control Flow that provides more orchestration capabilities, similar to what the SSIS Control Flow offers:
- Branching/chaining activities in a sequence using activity dependencies.
- Conditional expressions (similar to SSIS precedence constraints) using the If Condition Activity.
- Parameters that can be used on pipeline and activity level (consume). It is now also possible to pass the output state of an activity to another activity.
- Execute Pipeline Activity to execute another pipeline.
- ForEach Activity that enables foreach looping.
- Until Activity for do-until looping.
- GetMetadata Activity that can be used to validate or retrieve metadata. In combination with a conditional expression and a do-until loop you can trigger a pipeline when data is ready/available.
- Web Activity to call REST endpoints. You can pass datasets to be consumed by the activity. This enables HTTP triggering of Azure Logic Apps and Azure Functions. This is huge! An example of a pipeline with a Web Activity that calls a Logic App can be found here.
- Lookup Activity retrieves a JSON data set from an external data source. Currently files, SQL and Table storage are supported. Useful to retrieve a list of files, records or tables that need to be processed by other activities.
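To give an impression of the JSON behind these control flow constructs, here is a minimal, hypothetical sketch: a pipeline parameter feeds an If Condition expression, and the true branch calls a Logic App via a Web Activity (the parameter name and URL are made up for illustration):

```json
{
  "name": "ControlFlowSketch",
  "properties": {
    "parameters": {
      "environment": { "type": "String" }
    },
    "activities": [
      {
        "name": "IfProduction",
        "type": "IfCondition",
        "typeProperties": {
          "expression": {
            "value": "@equals(pipeline().parameters.environment, 'prod')",
            "type": "Expression"
          },
          "ifTrueActivities": [
            {
              "name": "NotifyLogicApp",
              "type": "WebActivity",
              "typeProperties": {
                "url": "https://example.logic.azure.com/trigger",
                "method": "POST",
                "body": { "status": "pipeline started" }
              }
            }
          ]
        }
      }
    ]
  }
}
```

Chaining works the same way everywhere: an activity declares a dependsOn list with dependency conditions such as Succeeded or Failed, which is what gives you the branching behaviour comparable to SSIS precedence constraints.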
The Lookup, Copy and Stored Procedure activities can be combined into a design pattern for incremental loads using a watermark, enabling you to load only changed data after an initial full load. A tutorial can be found here.
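The watermark pattern can be sketched as follows (table, column, dataset and procedure names are hypothetical): a Lookup retrieves the last stored watermark, the Copy activity selects only rows changed since that watermark, and a Stored Procedure activity writes the new watermark back afterwards:

```json
{
  "name": "IncrementalLoadSketch",
  "properties": {
    "activities": [
      {
        "name": "LookupOldWatermark",
        "type": "Lookup",
        "typeProperties": {
          "source": {
            "type": "SqlSource",
            "sqlReaderQuery": "SELECT WatermarkValue FROM dbo.Watermark WHERE TableName = 'Sales'"
          },
          "dataset": { "referenceName": "WatermarkDataset", "type": "DatasetReference" }
        }
      },
      {
        "name": "IncrementalCopy",
        "type": "Copy",
        "dependsOn": [
          { "activity": "LookupOldWatermark", "dependencyConditions": [ "Succeeded" ] }
        ],
        "typeProperties": {
          "source": {
            "type": "SqlSource",
            "sqlReaderQuery": {
              "value": "SELECT * FROM dbo.Sales WHERE ModifiedDate > '@{activity('LookupOldWatermark').output.firstRow.WatermarkValue}'",
              "type": "Expression"
            }
          },
          "sink": { "type": "SqlSink" }
        }
      },
      {
        "name": "UpdateWatermark",
        "type": "SqlServerStoredProcedure",
        "dependsOn": [
          { "activity": "IncrementalCopy", "dependencyConditions": [ "Succeeded" ] }
        ],
        "linkedServiceName": { "referenceName": "SinkSqlLinkedService", "type": "LinkedServiceReference" },
        "typeProperties": { "storedProcedureName": "dbo.usp_update_watermark" }
      }
    ]
  }
}
```

Note how the Copy activity consumes the output of the Lookup through an expression, which is exactly the passing of activity output state mentioned above.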
On-demand triggering of pipelines
A minor change, but something a lot of people have been asking for. ADF V1 is based on time slices, for example to run daily or hourly. This works fine if you load data accordingly, but what if you want to run at specific times, or on-demand? I blogged about a workaround a while ago, which worked but was far from ideal. Good to see pipelines can now be triggered on-demand or by wall-clock time.
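A wall-clock trigger is a separate JSON resource that references one or more pipelines. As a sketch (trigger, pipeline name and start time are hypothetical), a daily schedule trigger could look like this:

```json
{
  "name": "DailyTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2017-10-01T06:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": { "referenceName": "MyPipeline", "type": "PipelineReference" }
      }
    ]
  }
}
```

On-demand runs can be started through PowerShell or the REST API instead of waiting for a trigger to fire.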
Monitoring
Monitoring of your pipelines can now be done from Operations Management Suite (OMS), next to Azure Monitor and the Azure portal.
Key Vault integration
You can now securely store credentials used by ADF in Azure Key Vault; ADF retrieves them when activities need them. This works through the ADF service identity (service principal) that is created automatically when a new data factory is provisioned. A service principal is an App Registration in Azure Active Directory that represents a specific application, in this case ADF.
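In a linked service definition this looks roughly as follows (linked service and secret names are hypothetical): instead of an inline connection string you reference a secret in a Key Vault linked service, and ADF resolves it at runtime using its service identity:

```json
{
  "name": "SqlLinkedService",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "KeyVaultLinkedService",
          "type": "LinkedServiceReference"
        },
        "secretName": "SqlConnectionString"
      }
    }
  }
}
```

The big win is that no secrets end up in your ADF JSON or source control; granting the ADF service principal access to the vault is all that is needed.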
Expressions and functions
The number of available expressions and functions has increased dramatically. They can be used anywhere in a JSON string value.
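For example (the property names below are illustrative), expressions start with an @ prefix, can be interpolated inside strings with @{...}, and can use functions such as concat, utcnow and formatDateTime:

```json
{
  "folderPath": {
    "value": "@concat('raw/', formatDateTime(utcnow(), 'yyyy/MM/dd'))",
    "type": "Expression"
  },
  "fileName": {
    "value": "@{pipeline().RunId}.csv",
    "type": "Expression"
  }
}
```

Combined with system variables like pipeline().RunId and pipeline parameters, this makes it easy to build dynamic paths and queries that in V1 required awkward workarounds.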
Although this is not an ADF V2 feature, I still want to mention ADF code generation using BIML for those who don’t know about it yet. It has shown its value to me during some recent projects in which we generated pipelines for hundreds of incrementally loaded tables in minutes! Let’s hope it will be made compatible with ADF V2 soon!