Materialized Views, probably one of the more sought-after features of BigQuery, recently was announced to be generally available. I took a deep-dive into the feature and would like to present you with some of the use-cases and benefits they provide but also point out what to look out for.

Googles’ interpretation of Materialized Views

Traditionally, a materialized view was a SELECT-query whose results were physically written to disk and regularly refreshed. While this is true also within BigQuery, there are some aspects that differ from the traditional approach which Google points out prominently in its announcement.

The two features all developers will love are Zero Maintenance and Fresh Data: no more uncertainty if the data is up to date with the source tables, no more refreshing before using. Google guarantees that all changes downstream at all times will be reflected in the Materialized View, automatically and without any further actions needed.

To me, having to consider the state of the data in a Materialized View has always been a hassle big enough to make me hesitant to use them in the first place and seeing this addressed certainly feels like a big step towards maturity.

The 3rd feature is Smart Tuning which basically means that Materialized Views are considered by the BigQuery optimizer. So whenever someone writes a query against the source table that can be expressed via the Materialized View, this will be recognized and the precomputed data will be used instead, saving both time and processing (and in the long run even money).

Features that you want to use…

Of course, the first thing that comes to mind is Performance! Having physically available, pre-computed aggregates of huge fact-tables just in the granularity needed for your computations will save you a lot of processing. Say you have a sales table containing around 5 million rows per day, you can create a Materialized View aggregating per article. Assuming a portfolio of 5000 articles this table would condense the fact-table-size by a factor of 1000 or in other words: to get the complete January sales of that one popular article BigQuery would need to scan 155.000 instead of 155 million rows! And that is without any further tuning…

As mentioned in the introduction, the Query Rewrite feature makes life pretty easy. Materialized Views are known to the optimizer which can identify and leverage them to reroute queries from the source table. Users don’t even need any knowledge of the existing Materialized Views and their aggregations; whenever the requested data can be provided by the view it will be used instead, with all the performance gains described earlier.

Google addresses another common problem with the traditional implementation of Materialized Views: data staleness. In order to reflect the latest data in the source table, the Materialized View would have to be refreshed which also means that in between two refresh operations the data was increasingly out of date. BigQuery keeps track of changes in the source and can incorporate them into the results. This of course requires scanning and aggregating the source-table resulting in processing time and costs so you need to give your refreshing strategy some thoughts as discussed later in this text.

Finally, Partitioning and Clustering are very useful techniques when dealing with huge tables and so it’s great news that they’re possible in the context of Materialized Views as well. Partitioning is restricted to the column used in the source table allowing for partition alignment which ensures efficient data refresh – only changed partitions in the source need to be refreshed in the Materialized View. Clustering can be done on any column and is independent of the source table. Both techniques can further boost performance of your queries as explained in my previous article.

…and what to look out for!

Materialized Views as in Googles’ interpretation are a tool to use in a data warehouse and are designed with that in mind. This leads to limitations on what you can do compared to the more traditional RDBMS flavours of Materialized Views.

Concretely this means that at least one aggregation is a must (note that a group by clause remains optional). This might lead to some problems: assume you have a table containing orders with all the order line items (the articles contained in the order) in a child table. The order date might reside in the parent-table but is quite essential when aggregating the daily sales per product. You would have to incorporate the order-date into the order line item table already when creating it meaning you need to plan ahead from the moment you’re ingesting new data or will need to do some refactoring later on. And refactoring in databases has always been fun, hasn’t it…

Also, Materialized Views are limited to a single source table and can’t be used to build big data products gathering the information of several source-tables in one joined approach; instead their main intention is clearly to break up fact-tables to pre-computed, smaller units.

As the name already reveals, Materialized Views are written to disk and as such are subject to additional storage cost. As you’ll typically aggregate huge amounts of data to considerably smaller tables, the cost involved may be relatively small but it should be kept in mind nonetheless. Always weigh the gain you have on the processing side against the additional costs – which actually leads over to the last consideration:

Have an eye on the maintenance costs that are directly related to the Materialized Views. The automatic refresh job that makes them so handy does cause billable activity such as processing data in order to write changes in the source table to the Materialized View. This can be affected by changing the option refresh_interval_minutes of the view:

CREATE MATERIALIZED VIEW [project].[dataset].[materialized_view_name]
OPTIONS (enable_refresh = true, refresh_interval_minutes = 60) AS
SELECT ...

-- or in case the Materialized View already exists:
ALTER MATERIALIZED VIEW [project].[dataset].[materialized_view_name]
SET OPTIONS (refresh_interval_minutes = 60)

The default refresh interval is set to 30 minutes.

Alternatively you may disable the automatic refresh completely. This will be especially useful when you have fixed intervals updating the source table, e.g. nightly batches in the form of ETL pipelines.

CREATE MATERIALIZED VIEW [project].[dataset].[materialized_view_name]
OPTIONS (enable_refresh = false) AS
SELECT ...

-- or in case the Materialized View already exists:
ALTER MATERIALIZED VIEW [project].[dataset].[materialized_view_name]
SET OPTIONS (enable_refresh = false)

Simply run a manual refresh at the end of the batch job and you’re done.

Note that this does not affect the fact that you’ll always get correct data as per the source table! Querying a non-refreshed Materialized View simply means that the source table still needs to be considered, rendering higher cost and latency.

Conclusion

Materialized Views are a very helpful tool when working with data warehouses and Googles’ interpretation of them includes some fresh, powerful ideas which have the potential to set standards. They come at the cost of some compromises that developers should be aware of. The use-case is clearly optimization of reporting on huge fact-tables and this should be kept in mind, especially when diverging from this ideal.

More insights and blog posts

When we come across interesting technical things on our adventures, we usually write about them. Sharing is caring!