Taming Data Science: discovering best practices
Data Scientist Maximilian Jenders and Senior Data Scientist Ivan Rivera share best practices for building Machine Learning models.
A team tasked with bringing machine learning models into production can only perform well if the requirements of both Data Scientists and Machine Learning Engineers are met. A few months ago, our Machine Learning Engineer Sam highlighted the importance of a good engineering foundation for any reasonably complex platform. A strong foundation enables Data Scientists to continuously come up with new machine learning models and easily bring their results online. But a foundation alone is not enough: best practices also need to be followed on the Data Science side. With learning being one of our core values, we reviewed our recent experiences building Machine Learning models in GetYourGuide’s Performance Marketing group and collected our most important insights:
Thou shall make strategic investments into automation: when to go slower in order to be faster
All development requires a trade-off between quality and speed. We initially evaluated a model with a manual setup that could quickly produce a few metrics, but we soon realized that automating this step would be tremendously helpful. Although it meant longer development time up front, automation would remove errors caused by copy-and-paste inconsistencies when adding a new model, provide a unified view of the performance of all variants, and let us easily re-run evaluation steps. Given the trade-off, we found automation to be a good investment in many cases. However, it does not make sense to automate steps right away, before you know whether they will be relied upon repeatedly. As a rule of thumb, we like to wait until we have seen three to five uses, so that we know the step is actually in demand and can make the code flexible enough to serve our use cases.
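As an illustration only, a minimal sketch of an automated evaluation step might look like the following. It runs every model variant through the same metrics on the same test data and returns one unified report. The metric choice (MAE and RMSE), the function names, and the idea of representing each variant as a plain prediction function are assumptions made for this example.

```python
from math import sqrt
from typing import Callable, Dict, Sequence

# Hypothetical type for this sketch: a variant maps one feature row to a prediction.
Predictor = Callable[[Sequence[float]], float]


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Square root of the average squared difference."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# One shared metric table, so every variant is scored in exactly the same way.
METRICS = {"mae": mean_absolute_error, "rmse": root_mean_squared_error}


def evaluate_variants(
    variants: Dict[str, Predictor],
    features: Sequence[Sequence[float]],
    targets: Sequence[float],
) -> Dict[str, Dict[str, float]]:
    """Score all variants on the same test set and return {variant: {metric: value}}."""
    report = {}
    for name, predict in variants.items():
        predictions = [predict(row) for row in features]
        report[name] = {
            metric_name: metric(targets, predictions)
            for metric_name, metric in METRICS.items()
        }
    return report


if __name__ == "__main__":
    # Toy data and toy variants, purely for demonstration.
    test_features = [[1.0], [2.0], [3.0]]
    test_targets = [2.0, 4.0, 6.0]
    toy_variants = {
        "double": lambda row: 2 * row[0],
        "constant": lambda row: 4.0,
    }
    for variant, scores in evaluate_variants(toy_variants, test_features, test_targets).items():
        print(variant, scores)
```

Adding a new variant then means adding one entry to the dictionary rather than copying and adapting evaluation code, which is where copy-and-paste inconsistencies tend to creep in.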
Thou shall document your progress: version your models
While iterating on new models, it’s typical to add more features, either by connecting new data sources or by doing additional feature engineering. At some point, we noticed that one feature had been calculated incorrectly and occasionally contained default values instead of real values. After fixing the issue, we needed to re-run the models trained on this data and had to spend time chasing down the affected model and parameter versions. The process would have been quicker if we had built up central documentation with an overview of which model ran, when it ran, and what data was used. In other words, we needed version control for model iterations.
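One lightweight way to keep such an overview is an append-only log with one record per training run. The sketch below is an assumption about how this could be done, not a description of a specific tool: the file format (JSON Lines), the field names, and the helpers `data_fingerprint`, `log_run`, and `runs_using_feature` are all hypothetical.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def data_fingerprint(raw: bytes) -> str:
    """Short, stable identifier for the exact bytes of the training data."""
    return hashlib.sha256(raw).hexdigest()[:12]


def log_run(registry: Path, model_name: str, params: dict, features: list, data_id: str) -> dict:
    """Append one record describing which model ran, when, and on what data."""
    record = {
        "model": model_name,
        "params": params,
        "features": sorted(features),
        "data_id": data_id,
        "trained_at": datetime.now(timezone.utc).isoformat(),
    }
    with registry.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def runs_using_feature(registry: Path, feature: str) -> list:
    """Return all logged runs that were trained with the given feature."""
    if not registry.exists():
        return []
    with registry.open(encoding="utf-8") as handle:
        records = [json.loads(line) for line in handle if line.strip()]
    return [record for record in records if feature in record["features"]]


if __name__ == "__main__":
    # Toy example: two runs, then a lookup of the runs affected by a broken feature.
    with tempfile.TemporaryDirectory() as tmp:
        registry_path = Path(tmp) / "model_runs.jsonl"
        data_id = data_fingerprint(b"price,clicks\n10,3\n")
        log_run(registry_path, "baseline", {"depth": 3}, ["price"], data_id)
        log_run(registry_path, "v2", {"depth": 5}, ["price", "clicks"], data_id)
        for run in runs_using_feature(registry_path, "clicks"):
            print(run["model"], run["trained_at"])
```

With a log like this, finding every model that needs re-training after a feature fix becomes a single lookup instead of a search through notebooks and chat history.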
Thou shall keep your stakeholders well-informed: evangelise the working cycles of ML
As a Data Scientist, it’s easy to lock yourself into a research mode where you explore possible data sources and sensible feature-engineering techniques. Unlike regular engineering work, it is impossible to plan out required steps in this stage. While the desired outcome (an improvement in business metrics) is clear, the path is shrouded in fog. This phase can be stressful for stakeholders who aren’t accustomed to this ambiguity.
In our case, Product Managers were reading the teams’ weekly status updates and asking questions about next steps and the overall timeline. We noticed they were concerned when our answers were, “I don’t know yet, I’ll just have to try out more approaches.” That being said, we knew once a good solution was identified, mapping it to workable code would be easy.
We learned that to assure stakeholders of a project’s progress and maintain morale when things were moving slowly, we needed to time-box the individual project stages. Defining specific steps (“in 2 weeks we’ll try out the best model so far on a live test set”) helped keep things on track and put us in control of the project. We also scheduled regular updates and brainstorming meetings with stakeholders to explain the progress made and provide a space for exchanging tips and feedback amongst those with great domain knowledge.
Thou shall test and verify in production: it only works if it works in production
After a few iterations, we arrived at a new model that significantly improved our error metric on the test data. Before starting a full A/B test, we like to first run a small-scale crash-test to ensure the new data pipelines work as intended and are filled on time. It also gives us some quick insights: even if the subset we test on is small and has high variance, we can still eyeball the data. In one case, the crash-test showed deteriorating performance in production even though the offline error metric had improved. This taught us that our error metric was not 100% reliable and that we should invest time in refining it.
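Eyeballing can be complemented by a rough sanity check that compares the offline verdict with what the small production sample shows. The following sketch is only one possible way to do this and rests on several assumptions: the live metric is one where higher is better, each group has at least two observations, and a difference of more than two standard errors is treated as "clear". The function name `crash_test_verdict` and the verdict labels are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence


def crash_test_verdict(
    offline_error_old: float,
    offline_error_new: float,
    live_old: Sequence[float],
    live_new: Sequence[float],
) -> str:
    """Compare the offline error metric with a small live sample.

    Assumes a lower offline error is better and a higher live value is better.
    Each live sample needs at least two observations.
    """
    if offline_error_new >= offline_error_old:
        return "offline_not_improved"

    # Difference in average live performance and a rough standard error for it.
    delta = mean(live_new) - mean(live_old)
    standard_error = sqrt(
        stdev(live_old) ** 2 / len(live_old) + stdev(live_new) ** 2 / len(live_new)
    )

    if delta < -2 * standard_error:
        # Offline says "better", production says "clearly worse".
        return "disagree"
    if delta > 2 * standard_error:
        return "agree"
    # Small, noisy sample: not enough evidence either way.
    return "inconclusive"


if __name__ == "__main__":
    # Toy numbers, purely for demonstration.
    control = [1.0, 1.2, 0.9, 1.1, 1.0]
    candidate = [0.5, 0.6, 0.4, 0.55, 0.45]
    print(crash_test_verdict(0.30, 0.20, control, candidate))
```

A "disagree" verdict does not prove the new model is worse, since the sample is small, but it is a signal that the offline error metric may not reflect production performance and deserves a closer look before the full A/B test.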
To sum up, we learned that while an agile, fast process is key to iterating on possible solutions, we should take time to build resilient testing pipelines and document them to avoid issues in the future. Keeping stakeholders informed and engaged builds trust and sustains motivation, while occasionally testing solutions on small production subsets helps ensure we will run a smooth A/B test in the end.