Bots, crawlers & spiders
12 Jan 2020 » MSA
If you have a website, sooner or later it is going to be found by bots. There is no way you can prevent this from happening, so you need to be ready to deal with them. This is the first of a 2-part series on this topic.
Let’s start with the basics: what is a bot?
A device or piece of software that can execute commands, reply to messages, or perform routine tasks, as online searches, either automatically or with minimal human intervention (often used in combination)
Since this blog is about digital marketing and this post focuses on the World Wide Web, we are going to restrict this definition to web bots, also called spiders or crawlers:
A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing (web spidering).
In layman's terms, you can think of it as a computer program that behaves similarly to a human with a browser.
Friendly bots are the ones that you want to visit your website. The most famous is Googlebot, which, you guessed it, belongs to Google. This is how your website ends up in Google’s index. All search engines and other similar applications have their own bots, so expect to see a lot of traffic generated by them. In my case, in the last 3 days, I have received 400K requests from friendly bots.
In general, you want these bots to access your website easily. I will not get into the details of how to deal with them; you need an SEO expert to help you with that. I once tried to be one, but it was too much for me. Suffice it to say that, if you get it wrong, your website can be banned by Google, which means it will stop appearing in Google results.
This is where things get hairy. Unfortunately, where there is hard-earned money, someone will want to take a share of it (and I am not talking precisely about the government). So, almost since the inception of the web, people have created bots to steal your information or your money. It goes without saying that you need to protect your website against these bots. However, it is completely impossible to prevent all unfriendly bots, as new ones are created continuously. This means it is a task that never ends and you need to stay alert.
Here are some examples, several of them from my own experience.
Denial of service (DoS) attack
You will also see it called a distributed denial of service (DDoS) attack when the traffic comes from many machines at once. Web servers are designed to handle a limited number of requests per unit of time. The attack is as simple as having thousands or millions of bots around the world access your website at the same time, with the objective of bringing it down. Usually, the goal is to stop the website from working, which translates into lost money.
There is even a website dedicated to showing these attacks in real time: https://www.digitalattackmap.com/.
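To make the idea of "a maximum number of requests per unit of time" more concrete, here is a minimal sketch of a sliding-window rate limiter in Python. The class name, the parameters and the use of an IP address as the client identifier are all assumptions made for illustration; real web servers and firewalls ship their own, far more robust, mechanisms.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical helper: allow at most max_requests per client per window."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # One queue of request timestamps per client (assumed: keyed by IP).
        self._hits = defaultdict(deque)

    def allow(self, client_id: str, now: float) -> bool:
        hits = self._hits[client_id]
        # Forget requests that have fallen out of the time window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: reject or delay this request
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=3, window_seconds=1.0)
results = [limiter.allow("203.0.113.7", t) for t in (0.0, 0.1, 0.2, 0.3)]
```

In this example the first three requests are accepted and the fourth, arriving within the same second, is rejected. Bear in mind that a per-client limit like this one does little on its own against a truly distributed attack, because the traffic is spread across many different sources.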
A long time ago I worked with a company that owned a directory. One of their main concerns was being cloned. All the information is available in plain sight and it costs a lot of money to compile. On the other hand, it is very simple to create a robot that reads all the content and clones it, at a fraction of the cost. Obviously, there was a significant effort to prevent these bots from succeeding.
I remember one bot we found, which was trying to mimic a regular browser. We traced a spike in traffic to a User-Agent claiming to be Internet Explorer 7 on Windows 7. However, this combination was impossible: Windows 7 shipped with IE8 and did not support IE7. This made it clear that it was an unfriendly bot.
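A check like the one that caught this bot could be sketched in Python as follows. This is not the code we used at the time; the function name and the regular expressions are my own assumptions. One caveat: later versions of Internet Explorer in compatibility view also report "MSIE 7.0", but they add a "Trident/" token, so the sketch only flags User-Agents without it.

```python
import re

MSIE_RE = re.compile(r"MSIE (\d+)\.")
WINNT_RE = re.compile(r"Windows NT (\d+)\.(\d+)")


def is_impossible_ie7_on_win7(user_agent: str) -> bool:
    """Hypothetical check: flag a UA claiming genuine IE7 on Windows 7."""
    msie = MSIE_RE.search(user_agent)
    winnt = WINNT_RE.search(user_agent)
    if not msie or not winnt:
        return False
    claims_ie7 = int(msie.group(1)) == 7
    # Windows 7 identifies itself as "Windows NT 6.1".
    claims_win7 = (int(winnt.group(1)), int(winnt.group(2))) == (6, 1)
    # IE8 and later in compatibility view send "MSIE 7.0" plus "Trident/".
    has_trident = "Trident/" in user_agent
    return claims_ie7 and claims_win7 and not has_trident


suspicious = is_impossible_ie7_on_win7(
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1)"
)
```

A single rule like this only catches careless bots. A User-Agent is trivial to fake correctly, so treat it as one signal among several rather than as proof.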
In any case, this situation can arise with any website. Common applications like wget allow you to mirror a website very easily. To give an example close to home, I have seen some of my own posts mirrored without my permission.
Another client I had, an airline, did not want third parties to sell their tickets. They implemented all sorts of techniques to prevent any unauthorised transactions. On the other hand, some unscrupulous travel agencies created bots that behaved like a human visitor and sold tickets directly. I remember reviewing the Adobe Analytics data feed and seeing how these unauthorised transactions were actually happening. My host at this company did not initially believe it, but, in the end, he had to admit it.
I would argue that bots attempting unauthorised access are the most common kind. There is always someone trying to gain more privileges in order to perform unauthorised activities. For example, when I visited a bank, they told me that they were hammered 24/7 with attacks trying to access their clients’ accounts.
This blog gets its share of this type of attack, where I can see people trying to log in to the admin section. Since I use WordPress, some people have created automated scripts targeting known vulnerabilities. The following Wordfence report shows last week’s attempts:
What can you do?
Well, the first thing is to have a good IT team that knows how to:
- Be very welcoming to the friendly robots
- Prevent unfriendly robots from achieving their goals
As you can see, this is a very tricky situation, as these seem to be totally opposite requirements. However, it is possible to make sure friendly bots access the public section of the website and unfriendly bots are kept at bay. You just need the resources.
Finally, you can use the Adobe Experience Cloud to help you deal with bots. I will talk about it in my next post.