PART 1: Close the Gates! Avoiding Bad Web Traffic at Scale.
In the first of this two-part article about fighting bad web traffic, DevOps Engineer Eleanor Berger describes what bad traffic is and how to identify it.
At GetYourGuide we see our mission as serving great content to all users quickly and efficiently, helping them to discover and book the best experiences anywhere, anytime. From an operations perspective, this means responding to and processing each web request quickly and accurately. All web requests are welcome and all are treated equally – we are proud to be able to serve many millions of them every few hours, and we take great care to ensure that none are missed or rejected. None, that is, bar a few.
While almost all web requests arriving at our servers represent customers looking for activities, a small proportion originates from sources that are not nearly as nice and cheerful. As a service on the open internet, we are exposed to some upsetting, and potentially harmful, behaviour: attackers, hooligans, scanners, scrapers, you name it. They are out there on the internet, and a popular service like ours gets its share of unwanted traffic. Such traffic has the potential to make life hard for us as we strive to give the best service to the 99.99% of our users who are honest and well-meaning, so we always put some effort into identifying and protecting our systems from such bad traffic.
In this article, I’d like to give you a glimpse into what goes on behind the curtains of a popular web application that receives its share of unwanted traffic, and explain some of the tools and techniques we’ve adopted for making sure that attacks never harm our systems or any of our legitimate users.
Identifying malicious requests
On the surface, all web requests look similar; they are mostly well-structured chunks of HTTP-protocol traffic. No request ever comes with the tag “mean and harmful - please avoid”, but some of these requests are, in fact, mean and harmful. We need to identify and avoid serving them. How do you separate the wheat from the chaff, then?
First, we need to understand what constitutes bad traffic. This can be different from one service or application to another, but we roughly identify the following patterns of behaviour that we would like to avoid.
Scanners / Script Kiddies
Some sources on the internet scan vast ranges of the address space, sending common attacks in the hope of finding servers that exhibit well-known vulnerabilities open to exploitation. We make sure that our systems are not vulnerable to any such exploits by maintaining up-to-date systems, but traffic from such sources is a bad sign: we know for sure that these people don’t mean well.
Scrapers / Content Thieves
GetYourGuide features content from many thousands of users, tour operators, partners, and employees. This content represents the essence of hundreds of thousands of working hours. Some of the lazier people on the internet want to copy that content and make use of it for their own commercial purposes without permission or any investment into its creation. Since much of the content is openly available to the public on the web, they try to copy it en masse by pretending to be legitimate users or search engines.
Denial of Service (DoS) Attackers
Some sources try to overwhelm our servers with a very high volume of requests in the hope of taking our servers down, slowing them enough to degrade the service offered to legitimate customers, or simply making it more expensive for us to operate.
Now that we know what kind of requests we don’t want to serve, we need to find ways to identify such requests as they arrive at our servers and before we expend any effort on processing them. Once they are identified, we can simply avoid responding to them.
As mentioned before, web requests, good or bad, don’t look all that different from one another, and since we’re dealing with an extremely high volume of requests, we need to establish automated ways to identify which is which.
Identifying Known Bad Sources
The easiest way to identify bad requests is by looking at where they’re coming from. Some IP addresses or ranges have already been identified as attackers, and we can simply avoid responding to requests from them.
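As a minimal sketch of this idea, the check below matches a request’s source address against a list of blocked networks using Python’s standard ipaddress module. The specific addresses are hypothetical placeholders (documentation ranges); a real blocklist would be much larger and typically fed from threat-intelligence sources.

```python
import ipaddress

# Hypothetical blocklist; real lists are far longer and updated continuously.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),    # a range previously seen attacking
    ipaddress.ip_network("198.51.100.17/32"),  # a single known-bad address
]

def is_blocked(client_ip: str) -> bool:
    """Return True if the request's source address falls in a blocked range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in network for network in BLOCKED_NETWORKS)
```

In practice this kind of lookup usually happens at the edge (load balancer, WAF, or CDN) rather than in application code, so blocked requests never reach the servers at all.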
Identifying Content in Request Payload
Some requests contain identifying marks in the request payload itself which easily tag them as bad requests. For example, some bots include user-agent headers that we can filter by, and some scanners make requests to URLs which we don’t serve at all and which match already-published vulnerabilities. This tells us that the request is made with bad intent and that we should avoid responding to it.
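A simple sketch of such payload-based filtering, with purely illustrative patterns: the user-agent substrings and exploit paths below are examples of the kinds of rules one might write, not a real ruleset.

```python
# Illustrative deny rules only; real rulesets are larger and maintained over time.
BAD_USER_AGENT_SUBSTRINGS = ("sqlmap", "masscan", "nikto")
KNOWN_EXPLOIT_PATHS = ("/wp-login.php", "/.env", "/phpmyadmin")

def looks_malicious(path: str, headers: dict) -> bool:
    """Return True if the request path or headers carry known bad-traffic marks."""
    user_agent = headers.get("User-Agent", "").lower()
    if any(marker in user_agent for marker in BAD_USER_AGENT_SUBSTRINGS):
        return True
    # We serve none of these paths, so any request for them is a scanner probe.
    if any(path.startswith(p) for p in KNOWN_EXPLOIT_PATHS):
        return True
    return False
```

Note that user-agent headers are trivially forged, so rules like these catch only the lazier bots; they are a first filter, not a complete defence.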
Suspicious Behaviour Over Long-Running Sessions
While some bad traffic can be identified simply by looking at individual requests, most of the traffic we want to avoid is made up of many requests that look perfectly good one-by-one, but over time reveal themselves to be bad. The rapid, high-volume traffic of a DoS attack is made up of requests that each look legitimate, but the intervals between them are so small, and their number so high, that they couldn’t possibly originate from a real user browsing the application. Scrapers attempting to copy huge amounts of content from the site send requests that look exactly the same as requests from users, but they issue them in predictable series that make their intentions obvious. By tracking browsing sessions over time, we can identify behaviours that are unwanted.
Be sure to read the second part of this article to learn more about how we fight bad web traffic.
Thanks again to Eleanor for sharing her insights into fighting bad web traffic. Want to join our team? We're hiring.