Building a logging infrastructure
In today’s post, we hear from Alex Eftimie, Software Engineer DevOps, about the motivation behind and the process of building a logging infrastructure.
Log files, or short logs, are documents storing events generated by various applications. For instance, a load balancer will generate access logs with details about each processed request, while a web application could generate payment logs, recording every payment detail. Nevertheless, a Linux system will log system events such as authentication attempts. At GetYourGuide logs are a first class citizen, playing a key role in development but also in the way we get to know our customers better and drive our business.
Developers use logs for debugging, DevOps for assessing the infrastructure’s health and security and the Data team generates meaningful reports and fuel data products. Therefore, it’s important for us to have complete logs and to trust them. Live inspection is a bonus and always comes in handy.
Prior to this project, we had a logging system in place based on quite an old version of Logstash and Redis as a temporary storage. Although it did the job, it used a lot of resources, had many single points of failure and was difficult to extend or scale. Owning the data was also a requirement, so the decision was taken to replace it with a newer, modern system.
We planned a pipeline which would have all our servers at one end and persistent storage plus live inspection at the other end. In the middle we wanted a service that could temporarily store the data and act as a buffer between the producing and consuming ends.
On a high level perspective, it looks like this:
We believe in Free Software and the benefits of Open Source, and we contribute back as much as we can. Our choice of tools for this project were the ELK stack (Elasticsearch, Logstash, Kibana), Apache Kafka, Consul and HAProxy.
We wanted a fast, low footprint agent for collecting the logs on the nodes, so we picked Filebeat as the harvester. Its job is to tail files and forward the data to a Logstash instance we called logmaster. The logmaster parses data, decides what topic it belongs, then forwards it to Kafka.
Kafka stores the data as a rolling buffer in separate topics, with different retention policies. For example, we keep system logs for one month and access logs only for a week.
Kafka is great for parallel consumption of data, guaranteeing that a message is only consumed once, within the same consumer group. Our current consumers are a log archiver to S3, which we call centralog, and a log indexer to Elasticsearch. On top of Elasticsearch we run Kibana, sporting meaningful visualisation and customizable dashboards.
Kibana - image source: https://www.elastic.co/products/kibana
Resilience to failure
You may ask yourself, what happens if the logmaster goes down? Or Kafka crashes? Or centralog stops uploading data to S3?
To avoid that, we started by multiplying the nodes in different availability zones. By having a full mesh connection between logmaster nodes and Kafka nodes we are sure that data will reach the storage, as long as one node is still up. On the producer side, Filebeat pushes data to a local HAProxy instance and that load balances the connection to all known healthy logmaster nodes. We get health information using Consul and we update the HAProxy configuration fully transparent to the Filebeat process.
Filebeat harvests files and produces batches of data. For a batch to be accepted as published in Filebeat, it needs to follow the flow: HAproxy -> Logmaster -> Kafka. If Kafka refuses to commit the message, or the intermediary services are down, the batch is marked as unpublished. Therefore, the files act as a buffer, and in the case of a logging infrastructure failure, the producers will simply resume from where they were left when the infrastructure comes back online.
On the consumer side, independent processes will connect to Kafka and process the data for various purposes. One of these processes is logindexer which parses messages, converts them into structured, searchable data and stores it in Elasticsearch. On top of that, we run the powerful Kibana interface allowing developers to drill down information and generate reports.
Rolling it out
Prior to this setup, we already had a log aggregation system in place. Since the new system used a different tool for harvesting, we were able to run it in parallel with the old one for about a month, until we decided it was stable enough and could go into production.
To be on top of everything, we set up monitors collecting metrics from various parts of the system. These monitors produce data into the local StatsD forwarder, and based on this we defined alerts in DataDog. A few examples of these metrics are:
The first challenge that we faced after going live was that we quickly ran out of disk space. There were two main reasons for that: we didn’t know how much data was coming in and Kafka’s cleanup process didn’t kick in. Kafka works as a rolling buffer, so old data is usually removed to make space for the new one. This only happens when a certain size limit is reached per topic or the data is old enough. Because we had multiple topics, the default size limit was not enough; also, increasing the number of partitions changed the date attributes on the files, so the age condition also did not apply. We finally enabled compression, and also configured clear data retention rules based on size and age, and that fixed the disk space problem.
Another thing that we noticed shortly after starting to run real traffic on the infrastructure was that almost all of the consumer traffic was going against a single Kafka node. Increasing the number of partitions per topic did not solve the issue. The solution was to manually assign a different node as the partition leader.
Maybe the most difficult challenge was that, after a few months of having the system in place, we detected that, randomly, the first line in the aggregated log file was broken. It took a lot of debugging at every possible level in the chain, and some help from our friends at Elastic, to finally detect the culprit: inode reuse and a broken Filebeat configuration. When harvesting files, Filebeat keeps a registry of filenames, inodes and offset. In our case, this registry kept accumulating, also due to an in place naming convention that generated timestamped files every day. With such a large registry, eventually the inode of an older file (rotated and removed from this) was reused and then mistaken with the inode of a new file, thus resulting in starting the harvest from a wrong offset.
How we upgraded
At the beginning of November last year, we decided to migrate from various versions of Filebeat (1.2), Elasticsearch (2.4) and Kibana (4.0) to the unified 5.0 release. The migration was needed due to bugfixes (and a complete rewrite of Filebeat) present in the 5.0 release and also because the new Elasticsearch release opened more doors for metrics aggregation. Along with the ELK Stack migration, we upgraded from Kafka 0.9 to 0.10. It was a major version migration for all of the involved tools, one that required reconfiguration and restart, and would usually involve a bit of downtime.
As with the above-described design, the migration couldn’t have run more smoothly:
First we shut down the Kafka cluster; in the meantime, Filebeat automatically paused pushing data to logmaster
Then we upgraded Kafka node by node
We started access to the Kafka cluster again, and caught up
We did the same for Elasticsearch; in this case the indexer process simply waited for the cluster to come back online and then started indexing the accumulated data
Kibana, Logstash and Filebeat processes followed, one by one
There was no data loss, just a short interruption in the flow of live data.
There’s always a debate if one should build services (mashup, re- invent the wheel) or go for managed services (there are a few great ones out there). In the Logging Infrastructure project, we took the first option - for reasons like flexibility and full control over our data. We learned a lot in the process, and there are still more opportunities awaiting. One direction we can improve is monitoring: looking at frequency and volume, or even trying to detect anomalies. Another one would be to add more consumers directly to Kafka (and shortcut going through S3 or Elasticsearch). Finally, it would be great if we could link data coming from different sources, based on the relation of the application logic behind the logs - for example link requests logs with the associated database logs.