Focusing on the tools of the trade: Building a Time Machine with Docker.
In this blog post, Senior Backend Engineer Daniel Huguenin writes about a focus-day project called Time Machine: why it was needed, how it was created, and what it accomplished.
Focus day: Building a time machine with Docker.
Learning is one of GetYourGuide’s Core Values. Monthly “focus days” are one reflection of this part of our culture. The goal of these focus days is to get things done that feel important to the teams but that have no priority in regular sprints. They provide a way for us to explore and experiment with new technologies and to drive adoption by solving real problems with these new technologies. At the same time, they also present an opportunity to work with people from other teams more closely than usual. Insights into previously unknown parts of the system are often one of the fruits of such cooperation. In this article, I’m going to talk about a recent focus day project called Time Machine, which I built along with my colleague Alex Eftimie from the DevOps team.
The problem: Time travel.
It’s the nature of our business that some problems only become apparent weeks or even months after the damage was done. When a customer books an activity and something goes wrong, they might only notice once the activity takes place - for example when a barcode on the ticket is not recognized by the provider of the activity. More time passes before the problem is reported to us and eventually makes it to the engineering department.
As a developer, the first thing I want to do when investigating such a case is to check if there are any errors related to that booking. GetYourGuide is a data-driven company and a large part of our data is in log files. Unfortunately, these files are not easily accessible to developers after a while: Our Sentry and Kibana systems in production no longer index log files older than one week for performance reasons. However, those files are available on a long-term archival server. Analyzing them is a cumbersome process that involves connecting to a server, decompressing a range of log files into a temporary file, and then searching through that file to find the line you’re looking for - if it actually exists. If only we had noticed the problem three weeks ago when the errors were still in Kibana…
The solution: A dockerized time machine.
At this point, you are probably wondering if we have really built a time machine - and in some sense, we have: Our time machine provides developers with a copy of our usual ELK (Elasticsearch-Logstash-Kibana) stack and feeds it with log files from a selected time period. This allows developers to work with them in the same way as if they were still in the main Kibana system.
Docker was our flux capacitor of choice for this undertaking: As GetYourGuide is moving towards a containerized future, Docker is becoming an omnipresent technology for our developers. One of the best aspects of Docker in my view is that it allows you to do work in an isolated environment that can be discarded at virtually no cost without affecting the host system - and recreated in a reproducible fashion. That makes Docker an excellent tool for experimentation. It also reduces the DevOps resources required to run applications, as containers can be operated by developers.
With Docker, we can run a containerized twin of our ELK stack without interfering with the main stack. To make the system accessible to developers, we chose to build a web interface - the controller - that allows its users to select the data they want to look at and handles creation and teardown of the associated containers.
One container to rule them all.
Ideally, such a controller would be powered by a container as well. When working with Docker containers, we usually interact with the Docker client. The client handles the communication with a service called the Docker daemon, and it is the daemon that actually runs the containers.
In our case, we want a container to manage other containers. Therefore, there will be a Docker client running inside that container. Note that all containers are still operated by the same Docker daemon: Having a separate Docker daemon running inside the controlling container (i.e. Docker inside Docker) is a bad idea.
So, how do you make a container talk to the Docker daemon that’s running it? By default, the Docker daemon only listens to a local socket at /var/run/docker.sock, and the client connects to that socket by default. Mounting this socket into the container is, therefore, a simple way to establish the required communication channel:
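As a rough illustration of the socket mount described above, the following sketch builds a `docker run` command line that bind-mounts the daemon socket into a container. The image name, the published port, and the function name are placeholders invented for this example, not the values used in the actual project.

```python
from typing import List

# Default daemon socket, as named in the text above.
DOCKER_SOCKET = "/var/run/docker.sock"


def controller_run_command(image: str, host_port: int) -> List[str]:
    """Build a `docker run` argv that bind-mounts the daemon socket.

    `image` and `host_port` are hypothetical inputs for this sketch.
    """
    return [
        "docker", "run", "--detach",
        # Same path on host and in container, so a Docker client inside
        # the container finds the socket where it looks by default.
        "--volume", f"{DOCKER_SOCKET}:{DOCKER_SOCKET}",
        # Expose the web server of the controller (port 80 is assumed).
        "--publish", f"{host_port}:80",
        image,
    ]


if __name__ == "__main__":
    # "time-machine-controller" is an assumed image name.
    print(" ".join(controller_run_command("time-machine-controller", 8080)))
```

Because the client inside the container talks to the host's daemon, any container it starts becomes a sibling of the controller rather than a child of it.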
In practice, we encountered an additional problem with permissions, which we solved quickly with:
The second part is to actually install the Docker client inside the container. This is not as simple as it sounds because the Docker setup on Linux is quite tedious in general. On top of that, we chose php:7.0-apache as the base image for our controller, but it does not have all the packages required to execute the commands in the setup description. For example, we had to manually install lsb-release. There is also an official Docker image that contains the Docker client already, but of course, it lacks Apache and PHP - so we chose to manually install the lesser evil.
Reap what you sow.
Before data can be displayed in Kibana, it has to be fed into Elasticsearch by a harvester that is run by Logstash. Given that our log files are in gzip format, we only had to write a script that uses gunzip to decompress them and print them to stdout. On top of that, the script touches a marker file so that future invocations of the harvester do not deliver the same records again. Instead, on its second execution the script kills the calling process, effectively terminating the harvest as no additional records need to be delivered. This works well in our case because it’s a one-time import.
In addition, we had to configure Logstash to read the occurrence timestamp of each event from the event data itself. The default timestamp corresponds to the time of harvest, which is in the present and obviously not what we want. This was a simple change to the Logstash configuration file that we mount into the Logstash container.
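The original harvester was a small script around gunzip. The sketch below is not that script; it only illustrates the same decompress-once idea in Python. The marker path and function name are assumptions, and where the original kills the calling process on a repeated run, this version merely returns False and leaves the reaction to the caller.

```python
import gzip
import shutil
import sys
from pathlib import Path
from typing import BinaryIO, Iterable


def harvest(archives: Iterable[Path], marker: Path, out: BinaryIO) -> bool:
    """Stream the decompressed archives to `out` once.

    Returns True if records were delivered, False if the marker file
    shows that an earlier run has already delivered them.
    """
    if marker.exists():
        # Nothing new to deliver; the original script kills the calling
        # process at this point, this sketch only reports the situation.
        return False
    for archive in archives:
        with gzip.open(archive, "rb") as source:
            shutil.copyfileobj(source, out)
    marker.touch()  # remember that the one-time import has happened
    return True


if __name__ == "__main__":
    paths = [Path(arg) for arg in sys.argv[1:]]
    # "/tmp/harvest.done" is a hypothetical marker location.
    delivered = harvest(paths, Path("/tmp/harvest.done"), sys.stdout.buffer)
    sys.exit(0 if delivered else 1)
```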
Enabling concurrent use and collaboration.
Most companies that are more than a one-person show need to cater for concurrent use of their internal applications. The same applies to our project: If developer A is looking at some old log file, that shouldn’t block developer B, who would like to look at a different one. Ideally, we also wouldn’t display both log files in the same Kibana instance, to minimize interference. The simple solution to this problem is to spin up a fresh set of containers for each use case. For lack of a better name, we decided to call such a set a cluster.
Users can start and stop their own clusters from the web interface. They are also provided with links to the associated Kibana instances, which will be listening on different ports. To start a cluster, the user selects a type of log file and a date range, and picks a name for the cluster. Internally, the controller then generates an env_file that defines an environment variable $FILES, which lists the associated log files to be imported. The list generation is trivial as our logs are filed according to occurrence date. For example, the path to the general error logs from October 21, 2015, would be /error/2015/10/21/*.gz. By means of another environment variable, we can dynamically reference the environment file from a docker-compose.yml that defines our ELK stack setup. The harvester’s configuration relies on $FILES to tell the harvester which files to import.
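To make the list generation concrete, here is a minimal sketch that turns a log type and a date range into one glob per day, following the /error/2015/10/21/*.gz layout mentioned above, and renders an env_file body. The function names and the space-separated format of FILES are assumptions made for this example; the real controller may format the variable differently.

```python
from datetime import date, timedelta
from typing import List


def log_globs(log_type: str, start: date, end: date) -> List[str]:
    """Return one glob per day between start and end, both inclusive."""
    days = (end - start).days + 1
    return [
        f"/{log_type}/{day:%Y/%m/%d}/*.gz"
        for day in (start + timedelta(days=offset) for offset in range(days))
    ]


def render_env_file(log_type: str, start: date, end: date) -> str:
    """Render an env_file body defining FILES (space-separated, assumed)."""
    return "FILES=" + " ".join(log_globs(log_type, start, end)) + "\n"


if __name__ == "__main__":
    print(render_env_file("error", date(2015, 10, 21), date(2015, 10, 23)), end="")
```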
The controller starts a cluster by using docker-compose. The -p parameter is used to specify a project name, which consists of a fixed prefix and the cluster name chosen by the developer. Distinct project names allow us to start multiple clusters from the same docker-compose file, and the prefix helps us identify the containers that are part of the time machine. We process the output of:
to identify running instances and the ports that have been assigned to them.
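The following sketch shows one way the project naming and the port lookup could fit together. It is built on several assumptions that are not taken from the original project: the prefix is `timemachine`, cluster names are lowercase alphanumeric, the listing comes from something like `docker ps --format '{{.Names}}\t{{.Ports}}'`, container names start with the project name followed by `_` or `-`, and Kibana listens on its default port 5601 inside the container.

```python
import re
from typing import Dict, List

PREFIX = "timemachine"  # hypothetical fixed prefix
KIBANA_PORT = 5601      # Kibana's default port inside the container


def compose_up_command(cluster: str) -> List[str]:
    """Build the docker-compose call that starts one cluster."""
    return ["docker-compose", "-p", f"{PREFIX}{cluster}", "up", "-d"]


def kibana_ports(ps_output: str) -> Dict[str, int]:
    """Map cluster name -> host port published for Kibana.

    Expects one "name<TAB>ports" line per container (assumed format).
    """
    mapping = re.compile(rf":(\d+)->{KIBANA_PORT}/tcp")
    found: Dict[str, int] = {}
    for line in ps_output.splitlines():
        name, _, ports = line.partition("\t")
        match = mapping.search(ports)
        if name.startswith(PREFIX) and match:
            # The project name is everything before the first separator.
            project = re.split(r"[_-]", name, maxsplit=1)[0]
            found[project[len(PREFIX):]] = int(match.group(1))
    return found


if __name__ == "__main__":
    sample = (
        "timemachinebooking42_kibana_1\t0.0.0.0:32768->5601/tcp\n"
        "timemachinebooking42_elasticsearch_1\t9200/tcp, 9300/tcp\n"
        "unrelated_web_1\t0.0.0.0:80->80/tcp\n"
    )
    print(compose_up_command("booking42"))
    print(kibana_ports(sample))
```

Filtering on the prefix keeps containers that do not belong to the time machine out of the result, which is the role the text assigns to the fixed prefix.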
The final result.
Our final solution provides our developers with a web interface where they can select the type of log file to analyze and the date range they are interested in. Once the user has chosen a cluster name, the required containers are started and the user is provided with a link to the corresponding Kibana instance. After the analysis is done, the cluster can be killed from the web interface. In practice, it looks like this:
The focus day project was a success: just one week after its release, the Time Machine came in useful when it helped identify a bug that had caused a problem report. It was a success on a personal level too, as we enjoyed working on the problem as a team. We combined our strengths to build something bigger than what we could have achieved individually and turned my daydream into a reality! I can’t wait for the next focus day - and luckily I have a time machine now!
Thank you, Daniel, for sharing your work. Would you like to join our Engineering Team? We're hiring.