Using advanced Liquid templates to optimize query performance
In today’s post, Zoran Stipanicev, Senior Data Engineer - Business Intelligence, deep dives into using Liquid templates in the Looker data platform to improve query performance.
I’ve been working as a Data Engineer at GetYourGuide for more than three years, and I‘ve been using Looker since the start of the implementation more than two and a half years ago. Looker is a unified platform for data that delivers actionable business insights to every employee at the point of decision. In this time, Looker has become a vital part of the everyday work for a lot of my colleagues at GetYourGuide, and I have learned a lot about the many features that Looker offers. Here, I will share my learnings on Liquid templates, an open-source template language.
The Looker documentation is great, but they cover only the basic usage of Liquid templates, like accessing dimension or parameter value and modifying generated query with a Liquid IF-ELSE or injecting filters into derived tables.
In this post, I will showcase the usage of Liquid filters to manipulate parameter values entered by end users to inject constant values into a generated query to improve query performance.
Our use case is generating the best performing query for analyzing AB experiments. To better understand our use case, I will first explain our setup.
This would be a pretty straightforward model with Looker, but, since our Events table is very big, we want to enable partition pruning. This only works if we have constant values in the generated query. We also want to enable partition pruning on the Assigned Visitors table, even though it’s a lot smaller than the Events table. Here, we have the same issue: we need constant values in our generated query. The good thing is that both tables are partitioned by the date column, so we can use the same values for both tables.
Since we are analyzing experiments, we need to get the start and end date from our Experiment Config table and insert them as constants into our generated query. It would also give us some added benefits to remove the join to the config table. Even though it’s a small table, joins are expensive performance-wise.
So, how can we do that with Looker? The only way to inject user-entered value into the generated query is by using parameters. One way to achieve this is to expect users to enter all required information from the Experiment Config table. This, however, would provide a bad user experience. Ideally, users could select the experiment by name from a list, and we would somehow get all the required values as constants without the user entering it in manually.
If you read Looker’s documentation on parameters, you know we can suggest values for the parameter from an Explore, which in this case would be based on our Config table. That being said, it isn’t possible to display one value, i.e. experiment name, and get some other value i.e. experiment start date as a return from a parameter when using explore-based suggestions. Luckily, Liquid templates have a “filter” feature which allows us to manipulate values. Liquid filters are like functions; we can do a substring or split a string into an array and extract specific elements. Since this is not explained in Looker documentation, let’s see how it works:
The first step is to generate a suggest string containing all the values we need for a single experiment. Our suggest string would look like:
If you are loading suggestions from an explore, the parameter must be of the type “string”. If you only want to extract one value of a constant length, you can do a substring, which looks like this:
In the above example, “slice” would extract the last 10 characters from a parameter value. Though it isn’t obvious from the code, it’s important to note that Liquid also counts single quotes. If, in the above example, the second argument was 11, the return value would look something like “012345689’”. This would be the same if you extracted a substring from the start of the string — character 0 would be the single quote.
If we want to extract multiple values which are not of constant length, we will split the string into an array and then extract the value:
The expressions in this example have 2 parts. First, we assign the return value of the split to a variable called array_var. Then, we access a specific item of the array. Again, be mindful of the opening and closing single quotes as those will be included in the value of the first and last array element, respectively. An easy fix for the closing quote is to add a delimiter at the end of the string. Of course, you can add one at the beginning, but it might look awkward. Also, please note that it isn’t possible to reuse variables defined in one field inside other fields, not even if you add the other field in the required fields property. You must copy the expression assigning variables to each field. If you use a derived table, you only have to assign the variable inside the SQL property for that table once, and then you can use it multiple times throughout the query.
You can also use the return values from slice and split to modify a generated query. Here is an example using Liquid CASE WHEN (you can find IF-ELSE examples in the Looker documentation):
In the above example, we first assign an array produced by split to a variable, and then we use the second value of the array in the case statement to determine which code will be generated.
Let’s see how this applies to our Experiments use case. If we apply the described techniques, we will inject hard-coded values into our generated query to get something like:
In our generated query, we will inject all of the values we need for partition pruning on both tables (dates) and we will also eliminate the join to the Config table because we have injected values for the Experiment ID and Experiment name.
One more tip for the end: if you build a native derived table which depends on a parameter, you must use the bind_filters property to pass its value into the native derived table. Here’s an example of how to use bind_filters with parameters:
You might ask why we used the same name for both fields. The reason is, the from_field value is the parameter name from the Explore we are joining to our native derived table, and the to_field value is the name of the parameter inside the native derived table, which, in our case, is the same parameter. Therefore, we set the same name for both properties.
Before applying the above-described methods, we had a hard-coded date filter which we could only hope wouldn’t disturb any of the experiment analysis while we tried to improve our query performance by enabling Spark to prune partitions. Since Spark would scan more data than needed, this was not optimal. It also wasn’t 100% safe because if people wanted to analyze an experiment starting further in the past than our hard-coded filter, they wouldn’t get the correct data. Even worse, they might not even be aware of it. With this new solution, we generate optimal query and we are certain our users receive complete data every time.
Have fun optimizing your queries.