While investigating what Dataform can do, we missed the option to implement modern CI/CD workflows. Looking a bit further into how Dataform works, though, we found a pretty elegant solution. Here’s what we did.
What this is about and the initial situation
CI/CD stands for “continuous integration/continuous deployment” and describes the practice of “enforcing automation in building, testing and deployment of applications”. Put simply, it is a formal workflow or pipeline that deploys your code changes to environments and runs tests (unit/integration) automatically, thus ensuring and enforcing quality. This has been regarded as best practice for a couple of years, but as usual, working with data hasn’t kept up with the development. Tailor-made solutions often had to be implemented, which cost both time and money and might not even deliver the desired quality. In modern software engineering, by contrast, a plethora of tools has matured, of which CircleCI is one of the most popular.
A typical setup of environments would be the following: developers do their work locally and on branches. When satisfied with the changes, they commit and push their code back to master, where it is picked up by CircleCI and deployed to the development environment. Once compilation and unit tests succeed, the code can be deployed to a test environment to run integration tests, confirming the new version won’t break any dependencies on other services. Finally, the changes may be deployed into production, making them available to the users. Visually, such a workflow looks like this:
In data warehousing, adopting common development standards is becoming more and more popular. Dataform, a platform to build and manage data workflows in cloud data warehouses, takes some steps in this direction by simulating dependencies within workflows and “compiling” the code on the fly. This applies only to the active work of the data engineer, though, not to the system integration done afterwards. The tight integration with GitHub is another argument for Dataform, but it is only the first step towards real CI/CD, and unfortunately also the last one at the time of writing. While I appreciate the work that the people at Dataform are doing in this direction, they obviously can’t get everything in place at once. So what options do we as developers currently have? It turns out there are quite a few possibilities if you’re willing to dig a little deeper.
Finding a working solution using Dataform and CircleCI
How CircleCI works
Whenever you push code to the master branch, CircleCI will run the workflow defined for that repository. A workflow in CircleCI is the definition of what shall be done following a commit: what code to deploy where, which tests to run and so on. You can also use any command available, e.g. to create a Git tag which uniquely names the latest version of the code. This is where it gets interesting for us, since Git tags can be used by Dataform. So if we could somehow get back to Dataform from within the CircleCI workflow and report which tag was just created and where to run it, we could automatically deploy our new code to the different environments.
Looking at the workflow pictured above and lifting the view a little, this is what we want to do:
To break it down into single steps, we need functionality to:
- create a new version tag in Git
- write the tag to the right place in Dataform’s environments.json file
- commit and push the updated file to the master branch in Git
A closer look at the steps needed…
The first step goes beyond the scope of this article but you might find git semver to be quite helpful in this task. Another tip would be to persist the version by writing it to a file in your Workspace so you have access to it from jobs further down the Workflow.
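Although tagging itself is out of scope here, a minimal sketch may help to picture the first step. It assumes plain MAJOR.MINOR.PATCH tags such as 0.0.15 and always bumps the patch number. The function names next_patch_version and create_git_tag are hypothetical and not part of git semver or CircleCI.

```shell
#!/bin/sh
# Hypothetical helper: bump the patch part of a MAJOR.MINOR.PATCH version.
next_patch_version() {
  # $1 is a version such as 0.0.15
  echo "$1" | awk -F. '{ printf "%d.%d.%d\n", $1, $2, $3 + 1 }'
}

# Hypothetical helper: find the highest existing version tag, create the
# next one, push it and persist it to version.txt for later jobs.
create_git_tag() (
  set -e
  latest=$(git tag --list --sort=-v:refname '[0-9]*.[0-9]*.[0-9]*' | head -n 1)
  # Start from 0.0.0 when the repository has no version tag yet (assumption).
  new=$(next_patch_version "${latest:-0.0.0}")
  git tag "$new"
  git push origin "$new"
  echo "$new" > version.txt
)
```

Calling next_patch_version 0.0.15 prints 0.0.16. To make version.txt available in later jobs it still has to be persisted to the Workspace, as mentioned above.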
The next part is where it gets a bit difficult: you have to write the Git tag back into your environments.json file, then commit and push it. Preferably you can do that for each environment, with the possibility of controlling manually when to deploy. The challenge here is technical rather than administrative; the administrative part is really straightforward.
Let’s assume a setup as discussed above, with development, test and production, then Dataform’s environments.json file could look like this:
{
  "environments": [
    {
      "name": "development",
      "configOverride": {
        "schemaSuffix": "_dev"
      },
      "gitRef": "0.0.15"
    },
    {
      "name": "test",
      "configOverride": {
        "schemaSuffix": "_test"
      },
      "gitRef": "0.0.14"
    },
    {
      "name": "production",
      "configOverride": {},
      "gitRef": "0.0.12"
    }
  ]
}
Each environment has a gitRef node defining which version of the code to run using a tag. This is what we want to update automatically whenever code is pushed to master. We need to be able to define which tag should be set (as mentioned above, this may be persisted to a file to be available later on) and for which environment. Here are the two main steps that do the trick:
steps:
  - run:
      name: Update Dataform environments.json
      command: |
        export TAG=$(cat version.txt)
        # Write TAG to environments.json (this requires GNU sed)
        # replace only n-th occurrence of gitRef; n being between 1-3 (dev to prod)
        if [ -f "environments.json" ]; then
          git branch --set-upstream-to=origin/${CIRCLE_BRANCH} ${CIRCLE_BRANCH}
          git pull --no-edit
          LINE=$(grep -n gitRef environments.json|cut -f1 -d":" |tr "\n" " "|awk '{print $<< parameters.environment_no >>;}')
          sed -i "${LINE}s/\"gitRef\"\:.*/\"gitRef\"\: \"${TAG}\"/1" environments.json
        fi
  - update_git:
      path: environments.json
The run step does the magic. It first reads the Git tag from a file called version.txt into a variable. Then, if environments.json is present, it sets the upstream of the current branch, pulls the latest changes and finally changes the file itself:
- row 11 (the line starting with LINE=) extracts the line number of the wanted environment’s gitRef within the file; e.g. in our example file shown above the test environment would return “15”
- note the usage of parameters.environment_no (1 = development, 2 = test, 3 = production); it defines which occurrence of the found line numbers to pick
- sed is then used to replace the value of gitRef in that line using a regular expression
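To try the replacement logic outside of CircleCI, the same idea can be wrapped in a small shell function. This is only a sketch under a few assumptions: the name set_git_ref is made up, every gitRef sits on its own line as in the example file, and the tag contains no characters that are special to sed (such as / or &). Unlike the step above, it stops when no matching line is found. With an empty line number, sed would otherwise rewrite every gitRef in the file.

```shell
#!/bin/sh
# Sketch: set the n-th "gitRef" in a file to the given tag.
# Usage: set_git_ref FILE ENVIRONMENT_NO TAG
set_git_ref() (
  file=$1; env_no=$2; tag=$3
  # Line number of the n-th line containing "gitRef" (same idea as row 11).
  line=$(grep -n '"gitRef"' "$file" | cut -f1 -d: | sed -n "${env_no}p")
  # Stop here rather than let sed touch every line when nothing was found.
  if [ -z "$line" ]; then
    echo "no gitRef number $env_no in $file" >&2
    exit 1
  fi
  # Write to a temporary file and move it into place, so that GNU sed's
  # -i option is not required.
  sed "${line}s/\"gitRef\":.*/\"gitRef\": \"${tag}\"/" "$file" > "$file.tmp" &&
    mv "$file.tmp" "$file"
)
```

A call such as set_git_ref environments.json 2 "$(cat version.txt)" would then update the test environment only.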
After that, a custom update_git step is used to commit and push the changes back to master. This is a handy, self-contained CircleCI command that I recommend you implement as well: it configures git, adds the file(s) passed in as a parameter to a changeset, commits them and pushes to master. As the commit message is hardcoded to start with [ci skip], CircleCI won’t run a new Workflow following the push; otherwise we would end up in an endless loop of Workflows.
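The update_git command itself is not shown in this article, so the following is only a guess at what its shell part could look like. The function names, the git identity and the commit message text are placeholders, not taken from the real command.

```shell
#!/bin/sh
# Build the commit message; the [ci skip] prefix tells CircleCI not to
# start a new Workflow for this commit.
ci_skip_message() {
  echo "[ci skip] $1"
}

# Hypothetical body of an update_git command: configure git, stage the
# given files, commit them and push to master.
update_git() (
  set -e
  git config user.email "ci@example.com"   # placeholder identity
  git config user.name "CircleCI"          # placeholder identity
  git add -- "$@"
  # If nothing is staged there is nothing to commit; leave quietly.
  if git diff --cached --quiet; then
    exit 0
  fi
  git commit -m "$(ci_skip_message "Update $*")"
  git push origin HEAD:master
)
```

Called as update_git environments.json, this would produce a commit titled “[ci skip] Update environments.json”. Checking for staged changes first avoids a failing commit when the tag in the file was already up to date.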
…and putting it all together
At this point we have all the tools needed to get our workflow running, so let’s go through what an elegant setup could look like, one that gives the developer control over where their code is deployed.
To recap briefly, we want to create a new Git tag and write it as the gitRef value of the development environment. When the developer is happy with the tests performed, they should be able to manually approve a deploy of the code to test and production. Have a look at the following and I’m sure you’ll find your way directly:
workflows:
  deploy_dataform:
    jobs:
      - create_git_tag:
          filters:
            branches:
              only: master
      - write_tag_to_environments_file:
          environment_no: 1
          name: set_dev_git_ref
          requires:
            - create_git_tag
      - test_deploy_approval:
          type: approval
          requires:
            - set_dev_git_ref
      - write_tag_to_environments_file:
          environment_no: 2
          name: set_test_git_ref
          requires:
            - test_deploy_approval
      - prod_deploy_approval:
          type: approval
          requires:
            - set_test_git_ref
      - write_tag_to_environments_file:
          environment_no: 3
          name: set_prod_git_ref
          requires:
            - prod_deploy_approval
So, the Workflow in the example is called deploy_dataform and consists of six jobs. The first one creates the Git tag and saves it in a text file as described before, so that jobs further down can read the tag from that file. This job is limited to the master branch, so commits on feature or bugfix branches will not trigger CircleCI. Next comes the update of Dataform’s environments.json file. Note the usage of the parameter environment_no, which is set to “1”, indicating that we want to change the gitRef value for the development environment. The requires key defines that this job can only start if the previous one, create_git_tag, has succeeded, so the jobs run in sequence. Finally, we see a decision job for the first time, with test_deploy_approval being defined as “type: approval”. This means that the developer has to manually approve that the following jobs should be executed, which gives them control over where and when to deploy the changes.
After that, write_tag_to_environments_file is used again, but this time with “2” as environment_no, triggering a deploy to the test environment. After another approval by the developer, the same job also takes care of getting the code out into production.
This concludes this little article about getting CI/CD going on Dataform with a little help from CircleCI. I really hope you found an idea or two useful. But here’s a little secret: the above won’t solve all your problems! That’s why I’m already working on part two, where I’ll show you some more helpful tricks. Just two words: “Datamodel changes”.