Learning to Rank journey: the logbook
In today’s post, Felipe Besson, Data Engineer, discusses how the Search team applied Learning to Rank to ensure customers find the activity they’re looking for.
Helping travelers find the best things to do in their destination is the main goal of GetYourGuide's Search team. This is an important and difficult task: travelers consider many different variables when choosing the right in-destination activity, like duration, price, availability, language, and more. It’s crucial that our ranking algorithms learn to rank these activities and show travelers exactly what they’re looking for. In this article, I outline our journey of applying Learning to Rank (LTR) to improve search at GetYourGuide. If you want to dive deeper into the details, check out this video from our talk at Berlin Buzzwords 2018.
What is Learning to Rank and why use it?
LTR consists of applying Machine Learning (ML) to rank, or sort, the most relevant items for a particular query. In our case, items are activities and a query is a search for one of these activities. For example, a customer creates a query for “things to do in Paris” and receives the item “Eiffel Tower Tickets” in return. As with any other ML problem, an algorithm can "learn" from a training set which items, and which properties of those items, are most relevant for a given query.
The key challenge of applying LTR is defining relevance. Although travelers have different preferences, they tend to converge on what the best (most relevant) things to do in a destination are. In LTR solutions, the relevance of an item, also called a judgment, is therefore measured based on user behavior. On top of user preference, relevance is also shaped by business constraints: a best-selling activity is not only good for the business, but is also a good indicator of quality and popularity for the traveler.
Given well-defined criteria of relevance and a good database of user behavior, an LTR solution can improve our search features and bring relevant items to travelers based on their own preferences, which is exactly what we need.
What do we want to achieve?
On GetYourGuide's website, we have many paths through which travelers can discover awesome places to travel. One very important pathway is the Location Page. Many new visitors interact with GetYourGuide via Location Pages, which are very similar to regular landing pages. The key difference, however, is that our Location Pages also function as search pages.
Bringing relevant products to travelers is the key challenge on a Location Page. Since this page is about a location, we already know where the traveler is going, so we can make some inferences about their intentions. These pages are also used to drive travelers from web search engines to our website.
Given the importance of Location Pages, for travelers and for GetYourGuide, we decided to apply LTR on Location Pages. Of course, this led to some challenges. Let's consider an important Location Page: the Eiffel Tower page. We know travelers on this page want to travel to Paris and, probably, visit the Eiffel Tower. But, do travelers want a ticket to the top, a cruise around the tower, or to eat in the Eiffel Tower restaurant?
We need to extract the real user intention on those pages. One option is to use Search Engine Marketing (SEM) keywords to find out what travelers want. Those keywords are the search terms driving travelers from the search engine to our website, and we could use them to diversify our results. Other approaches might exist to accurately capture traveler intentions, but SEM keywords look promising.
How we do LTR
To apply LTR on Location Pages, we began by defining the steps we needed to follow to be successful. To learn (and also fail) quickly, we decided to follow very standard steps.
Getting judgments
A judgment is a measure of how relevant a document is for a query. In our case, a judgment is how relevant an activity is for a particular location. In LTR, judgments can be extracted from historical data, or they can be defined by domain experts. Although the second option is less scalable than getting the judgments from data, experts can help you define initial criteria of relevance when the definition of relevance is still unclear.
It is also a good approach when the data is inconsistent or incomplete. To help us start on LTR (and also establish initial criteria of relevance), we sourced our judgments from domain experts. Each judgment had this shape:
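As a hedged illustration of that shape (the activity titles and grades below are made up for illustration, not our real data):

```
grade  query                     document (activity)
3      Eiffel Tower restaurant   Dinner at the Eiffel Tower restaurant
2      Eiffel Tower restaurant   Eiffel Tower summit ticket
0      Eiffel Tower restaurant   Seine river sightseeing cruise
```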
In this example, documents related to the query "Eiffel Tower restaurant" are judged on a scale of 0 to 3 based on their relevance. In our project, the query was based on the SEM keyword intention, and the documents were activities belonging to the location "Eiffel Tower". We created a dataset consisting of high-traffic keywords, and domain experts (internal GetYourGuide stakeholders) judged each example. In total, we collected around 30k samples, which serve as the basis for our LTR model to learn. This judgment list is our ground truth about what the model needs to learn.
Building training sets
After getting a judgment list, we also needed to add features to make a training set. Since we are trying to solve a ranking problem, our features will be attributes of our documents and the relevance of these attributes against the queries. We had three types of features:
Query-document: the similarity between the query and the document. It is very common to use text-similarity metrics like BM25 against document attributes such as title and description.
Business metrics: how good a document is based on business performance metrics, such as conversion rate, # of bookings, and # of clicks.
Document: activity attributes like # of reviews, price, duration, etc.
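With the Elasticsearch LTR plugin, such features are stored as a featureset. Here is a hedged sketch, where the feature names, field names, and the choice of a BM25 title match are illustrative rather than our production configuration:

```json
{
  "featureset": {
    "features": [
      {
        "name": "title_bm25",
        "params": ["keywords"],
        "template_language": "mustache",
        "template": { "match": { "title": "{{keywords}}" } }
      },
      {
        "name": "number_of_reviews",
        "params": [],
        "template_language": "mustache",
        "template": {
          "function_score": {
            "field_value_factor": { "field": "nr_reviews", "missing": 0 }
          }
        }
      }
    ]
  }
}
```

A featureset like this would typically be registered through the plugin's `_ltr/_featureset` endpoint, so feature values can later be logged and scored inside Elasticsearch.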
We could easily have ended up with hundreds of features in a training set, which is not always desirable. The process of picking the right features should be iterative and reproducible, so we built data pipelines to support us in this task. We currently use Databricks to run the pipeline and Elasticsearch as the search engine, with an LTR plugin that allows us to keep the features we want to try inside Elasticsearch.
A judgment list and a configuration describing which features we want to try are fed into our pipeline. The feature values are then calculated by sending queries to Elasticsearch, and everything is grouped together to create a training set. After training and validating a model, we can upload it into Elasticsearch to calculate the LTR ranking in real time: when a traveler sends a query (e.g., "Eiffel Tower Tickets"), Elasticsearch runs the LTR model to calculate the most relevant results.
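The last step of that grouping, assembling judgments and their feature values into a training file, can be sketched roughly like this (Ranklib's text format; the IDs and feature values are illustrative, not our real data):

```python
# Sketch: join judgments with feature values into Ranklib-style rows.
# judgments maps (query_id, doc_id) -> graded relevance (0-3);
# features maps (query_id, doc_id) -> feature values in a fixed order.

def to_ranklib_rows(judgments, features):
    rows = []
    for (qid, doc_id), grade in sorted(judgments.items()):
        values = features[(qid, doc_id)]
        # Ranklib expects features as "index:value", 1-indexed.
        feats = " ".join(f"{i}:{v}" for i, v in enumerate(values, start=1))
        rows.append(f"{grade} qid:{qid} {feats} # {doc_id}")
    return rows

judgments = {(1, "eiffel-summit"): 3, (1, "seine-cruise"): 1}
features = {(1, "eiffel-summit"): [12.3, 0.8], (1, "seine-cruise"): [2.1, 0.4]}
print("\n".join(to_ranklib_rows(judgments, features)))
```

Each row is one (query, document) example; the comment after `#` keeps the document traceable during debugging.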
Our LTR model
With a training set in hand, we can train a model. We use Ranklib, a very well-known tool for training LTR models. It supports linear, pointwise, and pairwise models. A pointwise model considers each document individually for a query: Is this activity relevant for the keyword? A pairwise model, on the other hand, evaluates a pair of documents against a query: Is activity A more relevant than activity B for this query? Very successful pairwise models used in LTR projects are MART, RankNet, and LambdaMART.
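To make the pointwise/pairwise distinction concrete, here is a small sketch (activity names and grades are made up) of how graded judgments for one query turn into pairwise preferences:

```python
# Pointwise: each (query, document, grade) row is one training example.
# Pairwise: every pair of documents with different grades becomes a
# preference "A should rank above B" for that query.
from itertools import combinations

def pairwise_preferences(graded_docs):
    """graded_docs: list of (doc_id, grade) for a single query."""
    prefs = []
    for (doc_a, grade_a), (doc_b, grade_b) in combinations(graded_docs, 2):
        if grade_a > grade_b:
            prefs.append((doc_a, doc_b))
        elif grade_b > grade_a:
            prefs.append((doc_b, doc_a))
        # equal grades express no preference, so no pair is emitted
    return prefs

docs = [("summit-ticket", 3), ("dinner-cruise", 1), ("walking-tour", 1)]
print(pairwise_preferences(docs))
# [('summit-ticket', 'dinner-cruise'), ('summit-ticket', 'walking-tour')]
```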
The goal of these models is to optimize the ordering of items (activities) for a particular query. The exact score each item gets isn’t important, but rather its relative ranking position among all the other items. To measure the output of the models, we used a metric called Discounted Cumulative Gain (DCG). The idea behind DCG is that highly relevant documents appearing lower in the search result list should be penalized, so the relevance value is reduced logarithmically proportional to the position of the result:

DCG@k = Σ (i = 1..k) rel_i / log2(i + 1)
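A minimal sketch of DCG and its normalized version NDCG@k, using the common linear-gain formulation (the grades in the example are illustrative):

```python
import math

def dcg_at_k(relevances, k):
    """Relevance is discounted logarithmically by result position."""
    return sum(rel / math.log2(pos + 2)        # pos 0 -> log2(2) = 1
               for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """Normalize by the DCG of the ideal (perfectly sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

ranked = [3, 2, 3, 0, 1]   # judgments in the order the model ranked them
print(round(ndcg_at_k(ranked, 10), 4))
```

A perfect ranking scores 1.0; swapping a highly relevant document further down the list lowers the score, which is exactly the behavior we want to optimize.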
We used NDCG@10, a normalized version of DCG applied to the top 10 results in the list. We also developed data pipelines running on top of Databricks to perform a grid search and optimize the hyperparameters used by our model. After many iterations of training and validating models, we got our best candidate: a LambdaMART model with NDCG@10 = 0.9282 and the following features:
Business metrics, like # of clicks and # of bookings
Document features, like # of reviews and deal price
Query-document features, like activity title and highlights
However, after offline evaluations, we concluded we couldn't deploy our model in production.
Lessons learned so far
The ranking generated by the model didn't include our best sellers (top-rated and most-booked activities), which are very important for the business. After investigating the model results in detail, we had some very important learnings. As in many ML problems, having good data is key. There are two main takeaways regarding our data.
What does relevance mean for us?
Let's recap our use case: Location Pages. These pages might be our first point of contact with travelers, so visitors are not yet familiar with GetYourGuide. Also, travelers probably land on Location Pages from search engines. We asked domain experts to judge how good a document is for a particular query, but there are two main issues with that:
The experts already knew GetYourGuide and had great knowledge of Location Pages
The judgments didn't include business metrics
Although domain experts' labeling can be a good starting point for a new LTR project, it might not work for us. Since many travelers are visiting Location Pages for the first time and those pages are important for the business, the issues above are hard to solve using domain experts' labeling. Should we then get judgments from our historical data? An interesting paper about LTR for e-commerce studies the importance of including business metrics as relevance factors for LTR. Basically, relevance should increase with the perceived utility of an item for the query, and this utility can be measured based on the performance funnel of the item:
Metrics like Click-Through Rate (CTR), Conversion Rate (CR), and conversions might then be good factors to add to the default relevance score.
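As a purely illustrative sketch (the additive formula and the weights are assumptions for illustration, not our production logic), blending funnel metrics into a relevance grade could look like:

```python
# Hypothetical blend of an expert judgment with funnel-based utility.
# ctr and cr are rates in [0, 1]; weights are arbitrary for the example.

def blended_grade(expert_grade, ctr, cr, w_ctr=2.0, w_cr=4.0):
    utility = w_ctr * ctr + w_cr * cr   # perceived utility from the funnel
    return expert_grade + utility

# A mid-grade activity that clicks and converts well can overtake an
# activity the experts graded higher but that performs poorly:
print(blended_grade(2, ctr=0.30, cr=0.15))
print(blended_grade(3, ctr=0.02, cr=0.00))
```

The point of the sketch is only that business performance shifts the final grade; the right functional form and weights would themselves need to be learned or tuned.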
Do you know real user intentions?
Since we don't have an explicit user query on the Location Page, SEM keyword intentions provide a good proxy for what the user was looking for before finding GetYourGuide. However, this approach has many limitations. In many cases, we don't get what the user really typed into the search engine, but rather what was matched. In other cases, the keyword that brought a traveler to GetYourGuide might only have been a trigger for starting a discovery journey, rather than leading the user directly to their final destination. In that case, traveler interaction (click behavior) is a more valuable way to judge the quality of those keywords.
Although we have had some setbacks along the way, we won't stop now. Our infrastructure for building new and awesome models is ready, so the journey continues. Our next effort will concentrate on improving our training set. We will focus on:
Expanding SEM keywords: they are a good proxy but need to be enriched to better identify traveler needs
Picking judgments from data: as discussed, we need to extract our judgments from user click behavior to include traveler intention
Adding business metrics to relevance: on top of default relevance, business metrics will be good signals to define what is relevant for us
An awesome LTR model is coming. Stay tuned and keep searching!