A Quest For Low-Code Architecture with Azure Data Factory

Over almost a decade of hands-on DevOps experience, the biggest obstacle I have faced as a leader has been slow product delivery. That is an umbrella term covering a plethora of nuances, but in this article I’ll focus on the one I believe is crucial for any process: architecture.

Selecting and crafting an architecture is not a single-person game, yet the burden of the final decision (and the enormous responsibility) almost always lies with a single person.

And here’s your ticket to our journey: 🎟️

The beginning

Some time ago, I was part of a team working with a great, forward-thinking company. The project had a classic three-tier architecture (front end, back end, and databases) layered on… four production environments.



That was because the client’s globally dispersed teams each operated in different countries with their own restrictions, regulations, and compliance standards. On top of that, the sensitive business data at the core of the client’s business was ‘locked’ within those countries.

After a year of successful app development, the client asked the team to add data visibility integrated with their Power BI.

I should admit that at this point I was seriously considering low-code platforms. The market is growing rapidly, adoption rates climb every month, and it seemed like a good interim solution for our needs. But more on that later.

The first architecture concept we considered was built around Azure Functions: let a function pull the data from the databases and merge it into Power BI’s database. Although Azure Functions was definitely a cost-saving option, we did not choose it because:

  • It would not scale for us: the 10-minute execution limit on our plan would not allow long-running processing jobs
  • The service is fairly hard to monitor and maintain
  • The 1.5 GB RAM limit is below our actual needs

Then we gave Azure Analysis Services a try because of its near-real-time load capabilities, estimating about a week of development. The logic was the same: Analysis Services pulls the data from the databases and merges it into Power BI’s database in real time. Unfortunately, Analysis Services only supported a single data source for our scenario.

Our next candidate was Stream Analytics. We expected the following flow: an Azure Function gets the data from the database sources and pushes it into Event Hubs, then Stream Analytics reads it from Event Hubs and merges it into Power BI’s database. The challenge was transforming the relational data from CSV to JSON, since our Stream Analytics job expected JSON input. That would increase development time and project costs, and hurt overall maintainability.

In the context of this story, the good-to-go budget benchmark was around a few hundred dollars.

The final and, frankly, most successful idea popped up while we were tinkering with ETL (extract, transform, load) services and the data factory concept. On the blueprint, the flow looked like this: extract only the latest changes from the production databases, transform them, and load the data into Power BI’s database. Cost efficiency, the data refresh interval requirement (an hour!), and support for multiple data sources made me feel like Indiana Jones finally finding the Holy Grail.

I must confess that this approach had no clear deployment strategy, and data optimization seemed like a mess. Fortunately, we were able to craft our own data optimization pipelines, which I will describe below, but this path still had its pitfalls.

Data-Factory-as-a-Salvation

The first challenge was optimizing costs, because loading all the data regularly would waste time, money, and engineering resources. We set the data refresh interval to one hour, which was a cheaper solution than 5- or 10-minute intervals. Data merging was another challenge: pulling it from multiple Azure SQL databases looked like a threat to overall system performance.
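
For illustration, here is a minimal sketch of what that hourly refresh looks like as a Data Factory schedule trigger, defined in JSON and applied with the Az.DataFactory cmdlets; the resource group, factory, trigger, and pipeline names are hypothetical:

```powershell
# Minimal sketch: an hourly schedule trigger for the ETL pipeline,
# defined as JSON and applied with Az.DataFactory. Names are hypothetical.
$triggerJson = @"
{
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Hour",
                "interval": 1,
                "startTime": "2021-06-01T00:00:00Z",
                "timeZone": "UTC"
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "MultipleDatabasesETL",
                    "type": "PipelineReference"
                }
            }
        ]
    }
}
"@
Set-Content -Path ".\HourlyEtlTrigger.json" -Value $triggerJson

# Create or update the trigger, then start it
Set-AzDataFactoryV2Trigger -ResourceGroupName "rg-reporting-prod" `
    -DataFactoryName "adf-reporting-prod" -Name "HourlyEtlTrigger" `
    -DefinitionFile ".\HourlyEtlTrigger.json" -Force
Start-AzDataFactoryV2Trigger -ResourceGroupName "rg-reporting-prod" `
    -DataFactoryName "adf-reporting-prod" -Name "HourlyEtlTrigger" -Force
```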

We started with the following architecture. First, we added replica databases alongside the production ones to avoid unnecessary load on the system. Then Azure Data Factory pulls the data from the replicas, and an Azure Function performs the transformation. We decided to work with deltas of the latest changes since the last committed transaction and write the transformed data into a SQL database. Straight from the oven, this data was transmitted to Power BI’s database (the reporting DB).
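
The article doesn’t name the delta mechanism explicitly; assuming SQL Server change tracking on the source databases, enabling it looks roughly like this (the table name is hypothetical):

```powershell
# Sketch: enable change tracking on the source database and one table,
# assuming change tracking is the delta mechanism (an assumption, not
# something the article states). Table name is hypothetical.
$sql = @"
ALTER DATABASE CURRENT
    SET CHANGE_TRACKING = ON
    (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.Orders
    ENABLE CHANGE_TRACKING
    WITH (TRACK_COLUMNS_UPDATED = OFF);
"@

Import-Module SqlServer  # provides Invoke-Sqlcmd
Invoke-Sqlcmd -ConnectionString $env:SOURCE_DB_CONNSTR -Query $sql
```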

We had two pipelines: a ‘single’ one for an individual database, and a ‘multiple’ one that iterated the data transformation over all four production databases. The ‘multiple’ pipeline had three activities (a way to kick it off manually is sketched right after the list):

– `GetFunctionAppKey`: retrieve the key of our Azure Function from Azure Key Vault, where we stored all the secrets

– `SetCurrentTimestamp`: set the current date and time

– `ExecuteDatabaseETL`: run the ‘single’ pipeline for each database
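
For a quick manual run of the ‘multiple’ pipeline (for example, to smoke-test a change outside the hourly schedule), the Az.DataFactory cmdlets are enough; the names below are hypothetical:

```powershell
# Rough sketch: trigger the 'multiple' pipeline on demand and wait for the result.
# Resource group, factory and pipeline names are hypothetical.
$rg      = "rg-reporting-prod"
$factory = "adf-reporting-prod"

$runId = Invoke-AzDataFactoryV2Pipeline -ResourceGroupName $rg `
    -DataFactoryName $factory -PipelineName "MultipleDatabasesETL"

# Poll the run until it leaves the queued/in-progress states
do {
    Start-Sleep -Seconds 30
    $run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $rg `
        -DataFactoryName $factory -PipelineRunId $runId
} while ($run.Status -eq "InProgress" -or $run.Status -eq "Queued")

Write-Output "Pipeline finished with status: $($run.Status)"
```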

The ‘single’ pipeline has a few more steps. We used `GetCurrentVersion` and `GetLastVersion` to fetch the latest committed change version of the source database and the version already applied to the destination database, and then compared the two. That is how we knew whether the source database held transactions newer than what the destination database already had.
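
A minimal sketch of that comparison, assuming change tracking supplies the versions and a hypothetical watermark table in the destination stores the last synced one (the real pipeline did this via Lookup activities):

```powershell
# Minimal sketch: compare the source change-tracking version with the last
# version applied to the destination. Connection strings, table and column
# names are hypothetical.
Import-Module SqlServer  # provides Invoke-Sqlcmd

$sourceConn = $env:SOURCE_DB_CONNSTR
$destConn   = $env:DEST_DB_CONNSTR

# Latest committed change-tracking version on the source replica
$current = (Invoke-Sqlcmd -ConnectionString $sourceConn -Query `
    "SELECT CHANGE_TRACKING_CURRENT_VERSION() AS v;").v

# Last version we already merged into the reporting database (watermark table)
$last = (Invoke-Sqlcmd -ConnectionString $destConn -Query `
    "SELECT LastSyncVersion AS v FROM dbo.EtlWatermark;").v

if ($current -gt $last) {
    Write-Output "New changes: source version $current > destination version $last"
    # ...this is where the Azure Function that builds the delta script is invoked
} else {
    Write-Output "Destination is up to date."
}
```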

After the comparison, we invoke the Azure Function, passing the outputs of `GetCurrentVersion` and `GetLastVersion` to it. The result of the function execution is a script that pulls the latest data changes from the source database. Basically, the function generated the script and produced the output for the next stage, `GetTablesWithChanges`.

Our goal was to understand which tables had changed according to the generated script. Then we copied those tables from the source database into temporary tables in the destination database, which is why we only copied tables with deltas. Because some data could be deleted or updated along the way, we built a script to merge all the data without conflicts.
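
The article doesn’t include the merge script itself, so here is a minimal sketch of the idea for one hypothetical table, assuming the delta has already been copied into a staging table; inserts, updates, and deletes are reconciled in a single `MERGE` statement:

```powershell
# Minimal sketch of the conflict-free merge for one hypothetical table.
# The real pipeline landed deltas in temporary tables and generated a script
# per table; the names and columns here are illustrative only.
$mergeSql = @"
MERGE dbo.Orders AS target
USING stg.Orders_Delta AS source
    ON target.OrderId = source.OrderId
WHEN MATCHED AND source.IsDeleted = 1 THEN
    DELETE                                  -- row was removed in the source
WHEN MATCHED THEN
    UPDATE SET target.Status    = source.Status,
               target.UpdatedAt = source.UpdatedAt
WHEN NOT MATCHED BY TARGET AND source.IsDeleted = 0 THEN
    INSERT (OrderId, Status, UpdatedAt)
    VALUES (source.OrderId, source.Status, source.UpdatedAt);
"@

Invoke-Sqlcmd -ConnectionString $env:DEST_DB_CONNSTR -Query $mergeSql
```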

While the Data Factory work was still in the implementation phase, we started tinkering with CI/CD approaches. The first approach we considered is Microsoft’s. With it, all development happens in Azure Data Factory itself, and the pipelines are committed to the main branch. On top of that, there is a separate branch called `adf_publish` that holds the generated ARM templates. So the Data Factory is deployed via ARM templates, while the pipelines themselves are stored as JSON files.

At the core of this approach is deploying the Data Factory from the ARM template. I don’t think it’s a great approach, but nobody’s perfect.

Once these changes land in GitHub from the Data Factory, you can deploy the ARM templates. Under the hood it works as follows: a PowerShell script stops the Data Factory triggers, deploys the template, and then resumes the triggers. This is how the deployment is done in the development, test, and production environments, with each environment getting its own ARM template parameters.
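
A minimal sketch of that pre/post-deployment flow with the Az PowerShell modules; resource names and template paths are hypothetical:

```powershell
# Minimal sketch: stop triggers, deploy the ARM template, restart triggers.
# Resource group, factory name and template paths are hypothetical.
$rg      = "rg-reporting-prod"
$factory = "adf-reporting-prod"

# Stop every active trigger so in-flight runs don't collide with the deployment
$triggers = Get-AzDataFactoryV2Trigger -ResourceGroupName $rg -DataFactoryName $factory
$triggers | Where-Object { $_.RuntimeState -eq "Started" } | ForEach-Object {
    Stop-AzDataFactoryV2Trigger -ResourceGroupName $rg -DataFactoryName $factory `
        -Name $_.Name -Force
}

# Deploy the ARM template exported to the adf_publish branch,
# with environment-specific parameters
New-AzResourceGroupDeployment -ResourceGroupName $rg `
    -TemplateFile ".\ARMTemplateForFactory.json" `
    -TemplateParameterFile ".\ARMTemplateParametersForFactory.prod.json"

# Resume the triggers once the new definitions are in place
$triggers | ForEach-Object {
    Start-AzDataFactoryV2Trigger -ResourceGroupName $rg -DataFactoryName $factory `
        -Name $_.Name -Force
}
```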

There’s also another approach, the JSON one, which follows the same development logic (devs make changes directly in the Data Factory UI, and the Data Factory commits them to the main branch), but the deployment starts from the main branch instead of the previously mentioned `adf_publish` branch. It’s as simple as that: we create a package in Azure DevOps, release it, and run a PowerShell script so the deployment happens across multiple environments. Although the JSON approach is not officially endorsed by Microsoft, I believe it is pretty good.
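
A rough sketch of what that PowerShell step can look like when pushing the JSON definitions straight from the main branch, assuming the default Data Factory Git folder layout; resource names are hypothetical:

```powershell
# Rough sketch: push pipeline JSON files from the repo's main branch
# straight into the target Data Factory. Names are hypothetical.
$rg      = "rg-reporting-test"
$factory = "adf-reporting-test"

Get-ChildItem -Path ".\pipeline" -Filter "*.json" | ForEach-Object {
    # The file name is assumed to match the pipeline name
    Set-AzDataFactoryV2Pipeline -ResourceGroupName $rg -DataFactoryName $factory `
        -Name $_.BaseName -DefinitionFile $_.FullName -Force
}

# Linked services, datasets and triggers have matching cmdlets
# (Set-AzDataFactoryV2LinkedService, Set-AzDataFactoryV2Dataset, Set-AzDataFactoryV2Trigger)
```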

I also considered Terraform deployment and the Infrastructure-as-Code approach. The concept is cool, and so are Terraform’s options for integrating with Data Factory and Azure Pipelines. The issue is that you still have to follow the flow Microsoft recommends: define pipelines as JSON, make changes in the UI, commit to the main branch, and then pull that JSON and translate it into Terraform code. For me, a person who loves simplicity, it’s too much to tinker with.

In the end, we delivered the migration on time, and both the client and our team were happy.

Concluding thoughts

For our case, the ETL scenario with Azure Data Factory was the best approach: get the data from Azure SQL, process it with Data Factory and an Azure Function, push it into the reporting database and, finally, into Power BI. Data is pulled into the system best suited to present it, which is ETL at its best.

However, every solution has its trade-offs, and these are the ones I want you to know beforehand:

– Paying for every single step, especially real-time processing

– Writing custom Azure Functions and pipelines

– Adopting a UI-first approach to changes, since it is a low-code platform

– Learning CI/CD practices that differ from classic SDLC ones

Frankly, the main trade-off for me as a leader was adopting yet another low-code tool with approaches that differ from what I had used before.

I also recommend considering the following:

– Code branching strategy (the Data Factory flow differs from Gitflow, GitHub Flow, and others)

– The deployment flow for the development, test, and production environments

– The CI/CD tool for Data Factory

– The deployment method (ARM template, PowerShell script, Azure add-ons, and so on)

– The artifacts (ARM templates, JSON-based pipelines, and so on)

With all that said, I believe low-code platforms are a good fit for startups and companies that want faster time to market. Without the burden of writing brilliant code, you can roll out a well-suited architecture, kickstart the development process, and set up a decent CI/CD process. Feel free to drop me a line about this and that on LinkedIn or Facebook.
