For many data engineers, it is a frustratingly familiar scenario: you set up a new data pipeline, everything runs smoothly and end users get their data on time. Then, a few weeks later, things start to falter. What was once a quick 20-minute job now drags on for hours. Stakeholders ask, "can you pull up this data quickly?" and are met with a nasty surprise. What happened? Why is something that used to run so efficiently suddenly so slow?
In this article, we focus on data pipelines that transform structured data within a data warehouse. Think customer data, transactions or operational logs. These transformations often build on each other, so small delays accumulate quickly.
Fortunately, something can be done about this. An effective remedy is to deliberately choose between `views`, `tables` and `incremental` models for each step. This avoids unnecessary recalculation and returns pipelines to workable runtimes, without drastic refactoring. That this makes a real impact is shown by a practical example from dbt Labs, where a single smart materialisation choice saved 90 minutes of runtime.
To materialise strategically, it is important to fully understand the three most commonly used forms:

- `view`: no data is stored; the query runs every time the view is read. Cheap on storage, but the full computation is repeated on every read.
- `table`: the result is recomputed and stored in full on every run. Reads are fast, but each run rebuilds everything from scratch.
- `incremental`: only new or changed records are processed and added to the stored result. Runs stay fast as the data grows, at the cost of extra logic and monitoring.
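As a concrete illustration, this is roughly what the choice looks like in dbt, where the materialisation is set in the model's config block. A minimal sketch; the model name `fct_orders` and its columns are invented for the example:

```sql
-- models/fct_orders.sql
-- Swap materialized between 'view', 'table' and 'incremental' to change
-- how this model is built.
{{ config(
    materialized='incremental',
    unique_key='order_id'
) }}

select
    order_id,
    customer_id,
    order_date,
    amount
from {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- On incremental runs, only process rows newer than what is already stored.
  where order_date > (select max(order_date) from {{ this }})
{% endif %}
```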
Note that as soon as you materialise something, you introduce the risk of data drift: a situation where your materialised data no longer matches the actual source data. In `incremental` models this happens, for example, when a source dataset is later supplemented with records for a day you have already processed. We will not go into this further here, but be aware that materialisation always requires good monitoring and validation.
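A common mitigation for such late-arriving records is to reprocess a short lookback window on every incremental run. A hedged sketch, assuming an `stg_events` model with an `event_date` column; the three-day window is an arbitrary choice, and `dateadd` is warehouse-specific syntax (Snowflake, among others):

```sql
-- models/fct_events.sql
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

select
    event_id,
    event_date,
    payload
from {{ ref('stg_events') }}

{% if is_incremental() %}
  -- Reprocess the last 3 days so late-arriving records are still picked up;
  -- unique_key lets dbt update rows that were already processed.
  where event_date >= (
      select dateadd(day, -3, max(event_date)) from {{ this }}
  )
{% endif %}
```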
Notice that your pipelines are getting slower? Then it is worth checking whether you have made any of the following choices:

- heavy transformations left as a `view`, so the same expensive query is recomputed every time it is read or referenced;
- large, mostly append-only datasets rebuilt from scratch as a `table` on every run, where an `incremental` model would only process the new records;
- everything materialised as a `table` "just to be safe", paying for storage and full rebuilds that nobody actually needs.
Recent research by Bram Reinders (TU Eindhoven), carried out at Blenddata, shows that intermediate layers can also benefit from materialisation. His analysis shows that materialising certain intermediate tables, even if end users never query them directly, can significantly reduce recalculation in downstream models and thus improve overall pipeline performance.
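To make this concrete: if several downstream models `ref` the same heavy intermediate model, switching that one model from a `view` to a `table` means its expensive logic runs once per pipeline run instead of once per consumer. A hypothetical sketch; the model and column names are invented:

```sql
-- models/int_customer_activity.sql
-- Referenced by several downstream marts. As a view, its aggregation
-- would be recomputed for every consumer; as a table, it runs once.
{{ config(materialized='table') }}

select
    customer_id,
    count(*)        as total_events,
    max(event_date) as last_seen
from {{ ref('stg_events') }}
group by customer_id
```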
By applying these guidelines, you prevent your pipeline from becoming sluggish as your data grows, and you can keep delivering quickly to your stakeholders with minimal intervention.
By thinking strategically about how you materialise your data (`view`, `table`, `incremental`), you improve performance, optimise storage and get much faster pipelines. Smart materialisation saves time, reduces frustration and keeps stakeholders happy.
Are you working on a data pipeline that seems to be getting slower and slower? Or do you want to future-proof your existing setup? We would be happy to help you think it through.
Contact us to make your pipelines faster and more robust, or check out our vacancies if you enjoy working on these kinds of challenges yourself. Would you rather spar about your data structure or transformations first? Feel free to reach out; we are happy to help you further.