CI/CD | Reliable Data Pipelines

The purpose of CI/CD

At Blenddata, we want to make data engineering as reliable and scalable as modern software development. CI/CD helps us streamline and accelerate the development, testing and delivery of data pipelines. This means new features are available faster and with fewer errors. Automation and standardisation allow us to move faster and detect errors earlier, and customer data stacks benefit directly from improvements to our Modern Data Stack template.

A common problem at companies not yet working with CI/CD is that small errors or unexpected problems are only discovered after going live. This can cause unreliable data or even downtime of important processes. With CI/CD, we prevent this by automatically validating and testing every change, ensuring that everything works as expected.

Why is it important?

CI/CD ensures that with every change to our data pipelines, we automatically check that everything is still working as expected. To do this, we use GitLab as our CI/CD platform, in which we have created reusable building blocks (templates) for our CI/CD processes. This allows us to easily set which steps (such as testing, building and deploying) should be performed automatically for each project. Every project running on our Modern Data Stack makes use of this.

With every new version of a data pipeline, we automatically run tests. This ensures that new additions do not break existing functionality and that the data the customer expects is still correct. New feature? Then that calls for a new test.

For enterprise customers, we connect to the tools they already use, such as Azure DevOps. This way, we make sure that our working method always fits the customer’s needs and processes.

What steps do we follow in our CI/CD pipeline?

Our CI/CD pipelines usually run the following steps:

Code validation: Linting and build checks with tools like Nox and pre-built Docker images (Python + UV). We check that the code is clean and follows our agreed conventions.
Functional validation: Automatic tests (unit/integration) run locally as well as in the pipeline, thanks to a shared Nox configuration. New features always get new tests.
End-to-end data pipeline tests & data validation: Full data pipeline runs with dedicated test databases per client, including data validation after transformations (dbt, Dagster). We check that the data is still correct and complete.
Build: A new production version of the pipeline is automatically built and tagged.
Infrastructure as Code deployment: With Terraform, infrastructure changes are automatically deployed to Azure, fully versioned.
Automatic dependency updates: Renovate Bot ensures libraries and packages stay up-to-date, minimising security risks.
Changelog & versioning: semantic-release automatically applies semantic version tags and maintains a detailed changelog.
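
To make the data validation step more concrete, here is a minimal Python sketch of the kind of check that could run after a transformation. It is an illustration only: the function name, the row format (a list of dictionaries) and the two rules (row count and required values) are assumptions, not the checks from our actual dbt or Dagster projects.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ValidationResult:
    passed: bool
    errors: List[str]


def validate_rows(rows, required_columns, expected_row_count):
    """Check completeness (row count) and correctness (no missing required values).

    Hypothetical helper: `rows` is assumed to be a list of dicts read
    from a test database after the pipeline has run.
    """
    errors = []
    if len(rows) != expected_row_count:
        errors.append(f"expected {expected_row_count} rows, got {len(rows)}")
    for index, row in enumerate(rows):
        for column in required_columns:
            if row.get(column) is None:
                errors.append(f"row {index}: missing value for '{column}'")
    return ValidationResult(passed=not errors, errors=errors)
```

A CI job could fail the pipeline whenever `passed` is false and print the collected errors, so a broken transformation is caught before it reaches production.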

Tooling in detail

Nox: For defining and running test sessions, both locally and in CI.
Docker: Ensures consistent, isolated and reproducible environments for building, testing and running data pipelines. By capturing all dependencies and configurations in a Docker image, we avoid ‘works on my machine’ problems and are sure that pipelines run the same everywhere: locally, in CI and in production.
UV: Fast, modern Python package manager, standard in our images.
Terraform: Infrastructure as Code, with automatic validation and deployment.
Terraform docs: Generates documentation automatically from the Terraform code.
Renovate Bot: Automatic dependency updates and security patches.
Semantic-release: Automatic version management and changelog generation.
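
The idea behind automatic version management can be sketched in a few lines of Python. This is a simplified illustration, not semantic-release's own implementation: it assumes commit messages in the Conventional Commits style (`fix:`, `feat:`, a `!` or a `BREAKING CHANGE:` footer for breaking changes), and the function names are hypothetical.

```python
import re

# Bump levels ordered from smallest to largest.
_LEVELS = {"patch": 1, "minor": 2, "major": 3}


def bump_level(message):
    """Map one commit message to 'major', 'minor', 'patch' or None."""
    header = message.splitlines()[0] if message else ""
    # Matches "type:", "type(scope):" and "type!:" / "type(scope)!:".
    match = re.match(r"^(\w+)(\([^)]*\))?(!)?:", header)
    if match is None:
        return None
    if match.group(3) or "BREAKING CHANGE:" in message:
        return "major"
    if match.group(1) == "feat":
        return "minor"
    if match.group(1) in ("fix", "perf"):
        return "patch"
    return None


def next_version(current, messages):
    """Return the next 'MAJOR.MINOR.PATCH' version for a list of commit messages."""
    major, minor, patch = (int(part) for part in current.split("."))
    levels = [level for level in map(bump_level, messages) if level]
    if not levels:
        return current  # nothing release-worthy, keep the version
    highest = max(levels, key=_LEVELS.get)
    if highest == "major":
        return f"{major + 1}.0.0"
    if highest == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"
```

The most significant change since the last release decides the bump: a single new feature among several fixes produces a minor release, not a patch.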

Our CI/CD templates are centrally managed in their own repository. Each project references these templates in its .gitlab-ci.yml.

As a result, all projects benefit directly from improvements and bug fixes in the templates. Updates to templates are automatically offered as merge requests in customer projects through Renovate and Copier, based on a configuration in our Renovate bot.

This configuration does the following:

Renovate reads the template version recorded in .copier-answer.yaml, the file Copier adds to each repository when it is created.
On Monday and Thursday mornings at 06:00, it checks whether the repository's version is older than the latest template version. If so, it creates a merge request with the new changes.
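
The logic of this check can be illustrated with a short Python sketch. In practice Renovate performs the comparison and scheduling itself. The function names are hypothetical, and the sketch assumes that template versions are semantic version tags such as `v1.2.3`.

```python
from datetime import datetime


def parse_version(tag):
    """Turn 'v1.2.3' or '1.2.3' into a comparable tuple (1, 2, 3)."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def needs_update(project_version, template_version):
    """True when the project was generated from an older template version."""
    return parse_version(project_version) < parse_version(template_version)


def in_check_window(moment):
    """True on Monday (0) and Thursday (3) during the 06:00 hour."""
    return moment.weekday() in (0, 3) and moment.hour == 6
```

Comparing versions as tuples of integers rather than as strings matters here: as plain text, `v1.10.0` would sort before `v1.2.0`, and the update would be missed.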

What are efficient CI/CD pipelines?

We find it especially important that our CI/CD pipelines cover all critical aspects of a project (such as the data pipelines). That way, we have confidence that each new modification works well and does not cause unexpected problems. Our modular approach allows us to take exactly the right steps for each project and omit unnecessary ones: in a project without Python, we do not run Python tests, and in an infrastructure project we run only Terraform tests and no data tests. Conditional jobs ensure that only relevant steps are executed. This keeps the process clear, fast and reliable.
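
The principle behind conditional jobs can be sketched in Python as follows. In GitLab CI, conditions like these are typically expressed with `rules` in the job definition. The job names and marker files below are assumptions chosen for illustration, not the contents of our templates.

```python
from pathlib import PurePosixPath


def select_jobs(files):
    """Return the CI jobs relevant to a project, based on the files it contains."""
    names = {PurePosixPath(path).name for path in files}
    suffixes = {PurePosixPath(path).suffix for path in files}
    jobs = []
    # Python projects get linting and tests.
    if "pyproject.toml" in names or ".py" in suffixes:
        jobs += ["python-lint", "python-tests"]
    # Projects with dbt models get data tests.
    if "dbt_project.yml" in names:
        jobs.append("data-tests")
    # Infrastructure code gets Terraform validation.
    if ".tf" in suffixes:
        jobs.append("terraform-validate")
    return jobs
```

For a pure infrastructure repository containing only `.tf` files, this selects a single `terraform-validate` job and skips the Python and data tests entirely.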

The advantages of CI/CD implementation

Since adopting CI/CD, we detect errors much earlier. This gives peace of mind: we know that what we deliver works as it should. We can also easily revert to a previous version if necessary, and we automatically keep a detailed changelog.

Our security has improved because we use tools like Renovate Bot. It automatically keeps the libraries and packages we use up-to-date, which helps prevent security problems. We also have a template that we synchronise to all customer projects via CI/CD. This way, all our customers benefit directly from improvements in our Modern Data Stack.

We streamline your software

We make data engineering reliable by applying software engineering best practices. Thanks to our CI/CD approach, our data pipelines are stable, secure and always up-to-date. Want to know how we can accelerate and improve your data project? Feel free to get in touch.
