Overview: Azure Data Factory v2

Azure Data Factory (ADF) was first released as an Azure platform service in 2015, the same year it became generally available. The service was designed to be the central resource for data orchestration in the cloud. Whether the requirement is to simply copy data from source to destination or to kick off a Data Lake Analytics transformation job, ADF can orchestrate it. ADF is a fully managed cloud service built for complex hybrid Extract, Transform, Load (ETL), Extract, Load, Transform (ELT) and data integration processing.

At the end of 2017, Microsoft released ADF version 2 in public preview. The new version introduced various enhancements and incorporated customer feedback on the initial version.

Let’s walk through some of the key enhancements — including SQL Server Integration Services (SSIS) capabilities (finally!).

Azure Data Factory Version 2 Capabilities

The overall concept of data sets, activities and pipelines remains intact within version 2; however, the new version brings a few changes. A few highlights include:

Control flow concepts: Version 2 supports common ETL concepts such as chaining activities, branching activities, pipeline parameterization, custom state passing, looping containers, delta workflows, the Get Metadata activity, the Web activity and more. Let’s dig deeper into a few of these concepts:
- Chaining activities: A new activity property, dependsOn, lists the names of the upstream activities that must finish before the activity runs.
- Branching activities: Similar to an if statement in a programming language, the new If Condition activity evaluates a Boolean expression and runs one set of downstream activities when the result is True and another when it is False.
- Looping containers: The ForEach activity is a repeating control flow that iterates over a specified collection and runs a set of activities for each item. Similarly, the Until activity repeats its activities until the provided condition evaluates to True.
- Delta workflows: Common in ETL processing, delta loads read only the data that has changed since the last execution. ADF version 2 includes a Lookup activity that can retrieve the last-processed value (a watermark, for example), which makes delta loads straightforward to implement.
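As a rough illustration of chaining and delta loads together, the sketch below writes a trimmed pipeline definition as a Python dictionary and derives a valid run order from the dependsOn entries. The pipeline, activity, table and stored procedure names are all hypothetical, dataset and linked service references are left out for brevity, and the ordering helper is only a local illustration of what dependsOn expresses, not how the service schedules activities.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Trimmed pipeline definition as a Python dict. Every name below is
# hypothetical; dataset and linked service references are omitted.
pipeline = {
    "name": "DeltaCopyPipeline",
    "properties": {
        "activities": [
            {
                # Reads the last processed watermark value.
                "name": "LookupLastWatermark",
                "type": "Lookup",
                "typeProperties": {
                    "source": {
                        "type": "SqlSource",
                        "sqlReaderQuery": "SELECT WatermarkValue FROM dbo.Watermark",
                    }
                },
            },
            {
                # Copies only rows changed since the watermark.
                "name": "CopyChangedRows",
                "type": "Copy",
                "dependsOn": [
                    {"activity": "LookupLastWatermark", "dependencyConditions": ["Succeeded"]}
                ],
                "typeProperties": {
                    "source": {
                        "type": "SqlSource",
                        "sqlReaderQuery": (
                            "SELECT * FROM dbo.Orders WHERE ModifiedDate > "
                            "'@{activity('LookupLastWatermark').output.firstRow.WatermarkValue}'"
                        ),
                    },
                    "sink": {"type": "SqlSink"},
                },
            },
            {
                # Advances the watermark after a successful copy.
                "name": "UpdateWatermark",
                "type": "SqlServerStoredProcedure",
                "dependsOn": [
                    {"activity": "CopyChangedRows", "dependencyConditions": ["Succeeded"]}
                ],
                "typeProperties": {"storedProcedureName": "dbo.usp_UpdateWatermark"},
            },
        ]
    },
}


def execution_order(activities):
    """Return activity names in an order that respects every dependsOn entry."""
    graph = {
        activity["name"]: {dep["activity"] for dep in activity.get("dependsOn", [])}
        for activity in activities
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(pipeline["properties"]["activities"])
print(order)
```

Because each activity names only its direct upstream dependency, the chain can be extended by adding one more activity with its own dependsOn entry.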
SSIS functionality: Much requested in customer feedback, ADF version 2 lets users “lift and shift” SSIS solutions from on premises to the cloud. The service allows the user to spin up an Azure-SSIS Integration Runtime, essentially a fully managed cluster of Virtual Machines (VMs) dedicated to running the packages. SSISDB can be hosted on a Platform as a Service (PaaS) instance of Azure SQL Database to manage project deployments. Within the ADF visual authoring tool, a pipeline can be created that contains a Stored Procedure activity that executes packages in SSISDB. The pipeline can then be tied to a trigger, which schedules all future executions.
Flexible scheduling: Triggers contain properties that determine when pipelines need to be kicked off and executed. There are two types of triggers used in ADF version 2:
- Schedule Trigger: triggers based on wall-clock schedules
- Tumbling Window Trigger: triggers that operate on a periodic interval while retaining state
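To make the tumbling window idea concrete, here is a minimal Python sketch that lists the fixed-size, contiguous, non-overlapping windows that have fully elapsed since a start time. It is an illustration of the windowing concept only, not the ADF trigger API; the function name and the sample dates are assumptions.

```python
from datetime import datetime, timedelta


def tumbling_windows(start, interval, now):
    """Yield (window_start, window_end) for each window that has fully elapsed."""
    window_start = start
    while window_start + interval <= now:
        yield window_start, window_start + interval
        window_start += interval


# Hypothetical example: hourly windows starting at midnight, evaluated at 03:30.
windows = list(
    tumbling_windows(
        start=datetime(2018, 1, 1, 0, 0),
        interval=timedelta(hours=1),
        now=datetime(2018, 1, 1, 3, 30),
    )
)
for window_start, window_end in windows:
    print(window_start.isoformat(), "->", window_end.isoformat())
```

Each window maps to one pipeline run for that exact slice of time, which is the state a tumbling window trigger retains and a plain wall-clock schedule does not.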
Visual authoring and monitoring: In early 2018, based on customer feedback, Microsoft released a rich, interactive visual interface for authoring and monitoring ADF pipelines. This allows users to build and publish pipelines without writing a single line of code. Integration with Visual Studio Team Services Git provides source control and a full change history.

Differences between Azure Data Factory V1 and V2

The following table introduces high-level differences between the two ADF services.

Feature	ADF version 1	ADF version 2
Data sets	Data sets refer to source and destination data stores (tables, files, folders, etc.); the availability property defines the processing window time slice (hourly, daily, monthly)	Availability property removed; triggers replace it
Linked services	Connection string information for external resources	Added the connectVia property to reference an Integration Runtime
Pipelines	Logical grouping of activities with properties for start time, end time and paused state	Start time, end time and paused state properties removed; use triggers instead
Activities	Actions to perform within a pipeline; data movement and transformation	Added control flow activities
Hybrid data movement	Data Management Gateways orchestrate on-premises to cloud data transfer	Integration Runtimes: Azure (cloud only), Self-hosted (hybrid on-premises and cloud), Azure-SSIS (SSIS execution)
Parameters	N/A	Key-value parameters passed to pipeline activities via manual execution or triggers
Expressions	Built-in system variables and functions	JSON values that are evaluated at runtime and return another JSON value
Pipeline runs	N/A	Single instance of pipeline execution assigned unique GUID value
Activity Runs	N/A	Instance of activity execution within pipeline
Trigger Runs	N/A	Instance of trigger execution
Scheduling	Pipeline start & end time, dataset availability	Trigger executions

One of the larger changes is the move away from time slices and data set availability toward a more traditional ETL scheduling approach. Instead of an activity waiting for a data set to become available while a pipeline is executing, the pipeline itself is triggered and kicks off the activity, regardless of the state of the data set.

The Integration Runtime (IR) is the compute infrastructure used by ADF version 2 for data movement, activity execution and SSIS package execution. The IR provides the bridge between the activity and the linked services it references. The linked service references the IR, and the IR provides the compute environment where the activity runs. Where possible, the activity runs in the region nearest to the target data store for the most efficient performance.
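The relationship between a linked service and its IR can be sketched as follows, with the linked service definition written as a Python dictionary. The linked service name, the IR name and the connection string placeholder are hypothetical, and the helper function is an illustration only; it assumes that a linked service without a connectVia entry falls back to the default Azure IR, named AutoResolveIntegrationRuntime.

```python
# Linked service definition as a Python dict. "OnPremSqlServer" and
# "MySelfHostedIR" are hypothetical names used for illustration.
linked_service = {
    "name": "OnPremSqlServer",
    "properties": {
        "type": "SqlServer",
        "typeProperties": {"connectionString": "<connection string placeholder>"},
        "connectVia": {
            "referenceName": "MySelfHostedIR",
            "type": "IntegrationRuntimeReference",
        },
    },
}


def integration_runtime_for(service):
    """Return the IR name a linked service points at, or the assumed default."""
    connect_via = service["properties"].get("connectVia")
    if connect_via is None:
        # Assumption: with no connectVia, the default Azure IR is used.
        return "AutoResolveIntegrationRuntime"
    return connect_via["referenceName"]


print(integration_runtime_for(linked_service))
```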

Azure IR instances, as mentioned in the table above, can perform activities between cloud data stores only. Azure IR is a fully managed, serverless compute in Azure.
Self-hosted IR can perform activities between cloud data and private network data stores. This can be installed on premises or in a virtual private network.
Azure-SSIS IR instances are built specifically for executing SSIS packages. Because the IR runs in the public cloud, it must join a Virtual Network (VNet) that connects to the on-premises network in order to read on-premises data.
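The three IR types above can be read as a simple decision rule. The sketch below encodes that rule in Python; the function and parameter names are hypothetical, and real deployments may weigh other factors such as region, cost and network topology.

```python
def choose_integration_runtime(runs_ssis_package: bool, touches_private_network: bool) -> str:
    """Pick an IR type using the rules of thumb described above."""
    if runs_ssis_package:
        # SSIS packages run only on the Azure-SSIS IR.
        return "Azure-SSIS"
    if touches_private_network:
        # Private network or on-premises data stores need a self-hosted IR.
        return "Self-hosted"
    # Cloud-to-cloud activities can use the fully managed Azure IR.
    return "Azure"


print(choose_integration_runtime(runs_ssis_package=False, touches_private_network=True))
```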

The future of Azure Data Factory

The introduction of native SSIS capabilities in ADF version 2 was a key addition to the cloud data orchestration service. It gives clients a stepping stone off their on-premises servers and toward a cloud-first strategy, without having to completely rearchitect their existing SSIS data integration processes as they would have for ADF version 1. Along with the SSIS integration, many other features, such as control flow activities and triggers, allow for greater flexibility in pipeline executions. As ADF version 2 moves from public preview toward general availability, users can submit feedback to Microsoft for further enhancements to the service.

Jeff Dudenhoeffer

Data Solutions Consultant

Jeff is a member of the data solutions team in the Nashville branch of Insight. He has comprehensive experience across the complete business intelligence lifecycle while also leading various clients in their migration from on-premise solutions to the cloud using Microsoft Azure services.
