For many data engineers, it is a frustratingly familiar scenario: you set up a new data pipeline, everything runs smoothly and end users get their data on time. Then, a few weeks later, things start to falter. What was once a quick 20-minute job now drags on for hours. Stakeholders ask, "can you pull up this data quickly?" and are met with a nasty surprise. What happened? Why is something that used to run so efficiently suddenly so slow?
In this article, we focus on data pipelines that transform structured data within a data warehouse. Think customer data, transactions or operational logs. These transformations often build on each other, so small delays accumulate quickly.
Fortunately, something can be done about this. An effective remedy is to deliberately choose between `views`, `tables` and `incremental` models for each step. This avoids unnecessary recalculation and returns pipelines to workable runtimes, without drastic refactoring. That this makes a real impact is shown by a practical example from dbt Labs, where a single smart materialisation choice saved 90 minutes of runtime.
To materialise strategically, it is important to fully understand the three most commonly used forms:

- `view`: no data is stored; the query runs every time the view is read. Cheap on storage, but the full computation is repeated on every read.
- `table`: the result is recomputed and stored in full on every run. Reads are fast, but each run rebuilds everything from scratch.
- `incremental`: only new or changed records are processed and added to the stored result. Runs stay fast as the data grows, at the cost of extra logic and monitoring.
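As a concrete illustration, this is roughly what the choice looks like in dbt, where the materialisation is set in the model's config block. A minimal sketch; the model name `fct_orders` and its columns are invented for the example:

```sql
-- models/fct_orders.sql
-- Swap materialized between 'view', 'table' and 'incremental' to change
-- how this model is built.
{{ config(
    materialized='incremental',
    unique_key='order_id'
) }}

select
    order_id,
    customer_id,
    order_date,
    amount
from {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- On incremental runs, only process rows newer than what is already stored.
  where order_date > (select max(order_date) from {{ this }})
{% endif %}
```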
Note that as soon as you materialise something, you introduce the risk of data drift: a situation where your materialised data no longer matches the actual source data. In `incremental` models this happens, for example, when a source dataset is later supplemented with records for a day you have already processed. We will not go into this further here, but be aware that materialisation always requires good monitoring and validation.
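A common mitigation for such late-arriving records is to reprocess a short lookback window on every incremental run. A hedged sketch, assuming an `stg_events` model with an `event_date` column; the three-day window is an arbitrary choice, and `dateadd` is warehouse-specific syntax (Snowflake, among others):

```sql
-- models/fct_events.sql
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

select
    event_id,
    event_date,
    payload
from {{ ref('stg_events') }}

{% if is_incremental() %}
  -- Reprocess the last 3 days so late-arriving records are still picked up;
  -- unique_key lets dbt update rows that were already processed.
  where event_date >= (
      select dateadd(day, -3, max(event_date)) from {{ this }}
  )
{% endif %}
```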
Notice that your pipelines are getting slower? Then it is worth checking whether you have made any of the following choices:

- heavy transformations left as a `view`, so the same expensive query is recomputed every time it is read or referenced;
- large, mostly append-only datasets rebuilt from scratch as a `table` on every run, where an `incremental` model would only process the new records;
- everything materialised as a `table` "just to be safe", paying for storage and full rebuilds that nobody actually needs.
Recent research by Bram Reinders (TU Eindhoven), carried out at Blenddata, shows that intermediate layers can also benefit from materialisation. His analysis shows that materialising certain intermediate tables, even if end users never query them directly, can significantly reduce recalculation in downstream models and thus improve overall pipeline performance.
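To make this concrete: if several downstream models `ref` the same heavy intermediate model, switching that one model from a `view` to a `table` means its expensive logic runs once per pipeline run instead of once per consumer. A hypothetical sketch; the model and column names are invented:

```sql
-- models/int_customer_activity.sql
-- Referenced by several downstream marts. As a view, its aggregation
-- would be recomputed for every consumer; as a table, it runs once.
{{ config(materialized='table') }}

select
    customer_id,
    count(*)        as total_events,
    max(event_date) as last_seen
from {{ ref('stg_events') }}
group by customer_id
```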
By applying these guidelines, you prevent your pipeline from becoming sluggish as your data grows, and you can keep delivering quickly to your stakeholders with minimal intervention.
By thinking strategically about how you materialise your data (`view`, `table`, `incremental`), you improve performance, optimise storage and get much faster pipelines. Smart materialisation saves time, reduces frustration and keeps stakeholders happy.
Are you working on a data pipeline that seems to be getting slower and slower? Or do you want to future-proof your existing setup? We would be happy to help you think it through.
Contact us to make your pipelines faster and more robust, or check out our vacancies if you enjoy working on these kinds of challenges yourself. Would you rather spar about your data structure or transformations first? Feel free to reach out; we are happy to help you further.