CI/CD stands for Continuous Integration and Continuous Delivery. It means that we automatically test, validate and deliver new versions of our data pipelines so that improvements are available quickly, securely and reliably.
At Blenddata, we want to make data engineering as reliable and scalable as modern software development. CI/CD helps us streamline and accelerate the development, testing and delivery of data pipelines. This means new features are available faster and with fewer errors. Automation and standardisation allow us to move faster and detect errors earlier, and customer data stacks benefit directly from improvements to our modern data stack template.
A common problem at companies not yet working with CI/CD is that small errors or unexpected problems are only discovered after going live. This can cause unreliable data or even downtime of important processes. With CI/CD, we prevent this by automatically validating and testing every change, ensuring that everything works as expected.
CI/CD ensures that with every change to our data pipelines, we automatically check that everything is still working as expected. To do this, we use GitLab as our CI/CD platform, in which we have created reusable building blocks (templates) for our CI/CD processes. This allows us to easily set which steps (such as testing, building and deploying) should be performed automatically for each project. Every project running on our Modern Data Stack makes use of this.
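To illustrate the idea, a reusable building block in GitLab CI could look roughly like the sketch below. It is a simplified example and not our actual template: the job names, stages, image and the requirements.txt file are assumptions made for illustration. It relies only on standard GitLab features, namely hidden jobs (names starting with a dot) and the extends keyword.

```yaml
# Hypothetical template content; all names below are illustrative assumptions.
stages:
  - test
  - build
  - deploy

# Hidden job (leading dot): it never runs on its own and only
# serves as a building block for concrete jobs.
.python-tests:
  stage: test
  image: python:3.12
  script:
    - pip install -r requirements.txt pytest
    - pytest

# A concrete job in a project reuses the building block via extends.
unit-tests:
  extends: .python-tests
```

A project then only decides which building blocks it extends, while the implementation of each step stays in one central place.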
With every new version of a data pipeline, we automatically run tests. This ensures that new additions do not break existing functionality and that the data expected by the customer is still correct. New feature? Then that calls for a new test.
For enterprise customers, we connect to the tools they already use, such as Azure DevOps. This way, we make sure that our working method always fits the customer’s needs and processes.
Our CI/CD pipelines usually run the following steps:
Our CI/CD templates are centrally managed in their own repository. Projects use these templates in their .gitlab-ci.yml:
YAML:
include:
  - project: blend-data/cicd-templates
    ref: 2.x.x
    file: /customer-projects/mds-with-terraform.yml
As a result, all projects benefit directly from improvements and bug fixes in the templates. Updates to templates are automatically offered as merge requests in customer projects through Renovate and Copier. For this, we have a configuration in our Renovate bot:
JSON:
"packageRules": [
  {
    "matchManagers": ["copier"],
    "updatePinnedDependencies": true,
    "groupName": "Copier Project Template Update",
    "rebaseWhen": "behind-base-branch",
    "automerge": false,
    "rangeStrategy": "replace",
    "commitMessagePrefix": "fix(copier): ",
    "commitMessageAction": "Update project template",
    "commitMessageTopic": "{{currentVersion}}",
    "schedule": [
      "* 6 * * 1",
      "* 6 * * 4"
    ]
  }
],
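With this, we set the following: the rule applies only to updates found by Renovate's copier manager, which are grouped under the name "Copier Project Template Update". Updates are never merged automatically (automerge is false), so every merge request is reviewed first. The branch is rebased only when it is behind the base branch, commit messages get the prefix "fix(copier): ", and the schedule limits these updates to the 06:00 hour on Mondays and Thursdays.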
We find it especially important that our CI/CD covers all critical aspects of a project, such as the data pipelines. That way, we have confidence that each modification works well and does not cause unexpected problems. Our modular approach allows us to take exactly the right steps for each project and omit unnecessary ones: in a project without Python, we do not run Python tests, and in an infrastructure project we run only Terraform tests and no data tests. Conditional jobs ensure that only relevant steps are executed. This keeps the process clear, fast and reliable.
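As a rough sketch of how conditional jobs can work in GitLab CI, the example below uses the rules keyword with exists, so that a job is only added to the pipeline when matching files are present in the repository. The job names, images and commands are assumptions for illustration and do not reflect our actual templates.

```yaml
# Hypothetical jobs; names, images and commands are illustrative assumptions.
python-tests:
  stage: test
  image: python:3.12
  script:
    - pip install pytest
    - pytest
  rules:
    # Add this job only when the repository contains Python files.
    - exists:
        - "**/*.py"

terraform-validate:
  stage: test
  image:
    name: hashicorp/terraform:latest
    # Override the image entrypoint so GitLab can run the script lines.
    entrypoint: [""]
  script:
    - terraform init -backend=false
    - terraform validate
  rules:
    # Add this job only when the repository contains Terraform files.
    - exists:
        - "**/*.tf"
```

With rules like these, a project without Python never gets a Python test job, and nothing needs to be switched off by hand per project.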
Since we have been using CI/CD, we notice that errors are detected much earlier. This gives peace of mind: we know that what we deliver works as it should. We can also easily revert to a previous version if necessary, and we automatically keep a detailed changelog.
Our security has improved because we use tools like Renovate Bot. It automatically keeps the libraries and packages we use up to date, which helps prevent security problems. We also have a template that we synchronise to all customer projects via CI/CD. This way, all our customers benefit directly from improvements in our Modern Data Stack.
We make data engineering reliable by applying software best practices. Thanks to our CI/CD approach, our data pipelines are stable, secure and always up to date. Want to know how we can accelerate and improve your data project?
Or are you curious about how to run your tests automatically, independent of the environment?
Feel free to get in touch! We will be happy to help you further.