The User-Agent (Part I)

09 Dec 2018 » MSA

The User-Agent parameter is a piece of information that browsers attach to the HTTP(S) requests they make, sent as an HTTP request header. In today’s post, I will demystify this HTTP parameter and explain how it works. There will be a second part, where I will explain how this parameter is used in Adobe products.

Let’s start with the basics. What is a user agent? According to the W3C:

A user agent is any software that retrieves, renders and facilitates end user interaction with Web content, or whose user interface is implemented using Web technologies.

You will have noticed that I have been using “user agent” and “User-Agent” as two distinct concepts. In case it was not clear, the former is the software (as defined above) and the latter is the HTTP parameter. I will use this convention in the rest of this post.

Types of user agents

In the most common case, this software is your browser, the application you are using to read this post, but there are other cases. Think of it as software that fetches the HTML and all the assets from the website, renders the page and executes the JavaScript, so that you can interact with the website.

The other typical case is any client software that interacts with a remote web server. Nothing prevents a developer from creating such a piece of software. Examples include:

  • Web crawlers. These are programs, used by search engines, to systematically browse the whole World Wide Web, in order to index it. The best example is Googlebot, used by Google.
  • Malicious bots. Any piece of software trying to take advantage of a website.
  • Applications like curl or wget.
  • Programming language libraries, like Apache HttpClient for Java or Python’s http.client module.

HTTP and the User-Agent

HTTP defines a request header, aptly named “User-Agent”, for the user agent software to populate. It is meant to indicate what type of software is making the request, so that the server receiving the request can know which client is connecting to it. However, and crucially, this parameter can be any string.

For browsers, the typical format is something like:

Mozilla/[version] ([system and browser information]) [platform] ([platform details]) [extensions]

Note that, for legacy reasons, it always starts with “Mozilla”, irrespective of the software vendor: Google, Microsoft, Mozilla, Opera… Other types of software use a simplified version of the previous format.

For those of you who are curious about the User-Agent value of your browser, I suggest you try WhatIsMyBrowser.com. As an example, the browser I am using right now shows:

Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63

Let’s parse it:

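  • Mozilla/5.0. A legacy compatibility token. Practically every browser sends it, so it says nothing about the actual vendor or version.
  • X11. I am using the X Window System.
  • Ubuntu. Obvious, but also notice that it does not say which version of Ubuntu I am using.
  • Linux x86_64. I am running a 64-bit Linux operating system.
  • rv:63.0. The release version of Gecko, which matches the Firefox version: 63.0.
  • Gecko/20100101. Gecko is the Firefox rendering engine. The date that follows is a fixed placeholder on desktop Firefox, not a real build date.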
  • Firefox/63. Again, stating that this is Firefox 63.
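
The breakdown above can also be done mechanically. What follows is only a minimal sketch in Python, assuming the string follows the usual "product/version (comment)" layout; the function name split_user_agent and the regular expression are my own and not part of any standard library.

```python
import re

# Matches either a parenthesised comment or a product[/version] token.
TOKEN_RE = re.compile(r"\(([^)]*)\)|([^\s/()]+)(?:/(\S+))?")


def split_user_agent(ua):
    """Split a User-Agent string into (products, comments).

    products: list of (name, version) tuples; version is None when absent.
    comments: one list per parenthesised group, split on ';'.
    """
    products, comments = [], []
    for match in TOKEN_RE.finditer(ua):
        comment, name, version = match.groups()
        if comment is not None:
            # A "(...)" group: break it into its ';'-separated parts.
            comments.append([part.strip() for part in comment.split(";")])
        else:
            products.append((name, version))
    return products, comments


ua = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63"
products, comments = split_user_agent(ua)
print(products)
print(comments)
```

Since the format is not enforced, a sketch like this will happily "parse" strings that make no sense, so treat its output as a hint rather than a fact.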

In summary, there is a lot of information we can gather from this simple string. Depending on the User-Agent, we know which browser is being used and we can use it to tailor the experience.
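
To give an idea of how a server might act on this, here is a deliberately naive Python sketch. The function classify_user_agent and its labels are hypothetical, invented for this illustration; real sites usually rely on maintained detection libraries rather than a handful of substring checks.

```python
def classify_user_agent(ua):
    """Return a coarse, illustrative label for a User-Agent string."""
    lowered = ua.lower()
    # Check crawlers first: Googlebot's string also starts with "Mozilla".
    if "googlebot" in lowered:
        return "crawler"
    if lowered.startswith(("curl/", "wget/")):
        return "command-line tool"
    if "firefox/" in lowered:
        return "firefox"
    if lowered.startswith("mozilla/"):
        return "other browser"
    return "unknown"


print(classify_user_agent(
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63"
))
print(classify_user_agent("curl/7.58.0"))
```

The order of the checks matters, because so many different clients start their string with “Mozilla”.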

Limitations

It is precisely the limitations of the User-Agent HTTP parameter that cause trouble for digital marketers. The main limitation is that anyone can fake it: nothing enforces its format or content. Let me illustrate this with an example:

$ curl -v -o /dev/null http://yahoo.com
* Rebuilt URL to: http://yahoo.com/
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 2001:4998:58:1836::10...
* TCP_NODELAY set
* Connected to yahoo.com (2001:4998:58:1836::10) port 80 (#0)
> GET / HTTP/1.1
> Host: yahoo.com
> User-Agent: curl/7.58.0
> Accept: */*
> 
< HTTP/1.1 301 Moved Permanently
< Date: Sun, 09 Dec 2018 16:49:50 GMT
< Connection: keep-alive
< Via: http/1.1 media-router-fp1014.prod.media.bf1.yahoo.com (ApacheTrafficServer [c s f ])
< Server: ATS
< Cache-Control: no-store, no-cache
< Content-Type: text/html
< Content-Language: en
< X-Frame-Options: SAMEORIGIN
< Location: https://yahoo.com/
< Content-Length: 8
< 
{ [8 bytes data]
100     8  100     8    0     0     37      0 --:--:-- --:--:-- --:--:--    37
* Connection #0 to host yahoo.com left intact

As you can see, by default, my curl version identifies itself correctly and adds the version number. However, nothing prevents me from doing the following:

$ curl -H "User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63" -v -o /dev/null http://yahoo.com
* Rebuilt URL to: http://yahoo.com/
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 2001:4998:c:1023::5...
* TCP_NODELAY set
* Connected to yahoo.com (2001:4998:c:1023::5) port 80 (#0)
> GET / HTTP/1.1
> Host: yahoo.com
> Accept: */*
> User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63
> 
< HTTP/1.1 301 Moved Permanently
< Date: Sun, 09 Dec 2018 18:04:05 GMT
< Connection: keep-alive
< Via: http/1.1 media-router-fp1014.prod.media.gq1.yahoo.com (ApacheTrafficServer [c s f ])
< Server: ATS
< Cache-Control: no-store, no-cache
< Content-Type: text/html
< Content-Language: en
< X-Frame-Options: SAMEORIGIN
< Set-Cookie: B=98s0s5le0qm8l&b=3&s=9s; expires=Mon, 09-Dec-2019 18:04:05 GMT; path=/; domain=.yahoo.com
< Location: https://yahoo.com/
< Content-Length: 8
< 
{ [8 bytes data]
100     8  100     8    0     0     24      0 --:--:-- --:--:-- --:--:--    24
* Connection #0 to host yahoo.com left intact

Now, curl is telling the whole world that it is a Mozilla Firefox browser! More importantly, the server cannot verify the identity of this “browser” from the header. You may have also noticed another interesting consequence of this change. When Yahoo detected curl, it did not set any cookie. However, when tricked into thinking that it is “talking” with a real browser, it sends back a cookie (the Set-Cookie header in the second response).
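
curl is not special here: any HTTP library lets you do the same. As a small sketch, this is how it could look with Python’s standard urllib.request module, which identifies itself as “Python-urllib” plus a version number unless told otherwise. The snippet only builds the request and does not send it; the variable names are mine.

```python
import urllib.request

FAKE_UA = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63"

# Build (but do not send) a request that carries the spoofed header.
request = urllib.request.Request(
    "http://yahoo.com/",
    headers={"User-Agent": FAKE_UA},
)

# urllib stores header names capitalised, hence "User-agent" here.
print(request.get_header("User-agent"))

# Sending it would be a single extra call:
# urllib.request.urlopen(request)
```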

Another notable limitation is what we see with Apple devices. They do not include the actual hardware model, just the iOS version. For example, the following User-Agent string:

Mozilla/5.0 (iPhone; CPU iPhone OS 11_4_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.0 Mobile/15E148 Safari/604.1

only tells us that it is an iPhone running iOS 11.4.1. There is no way of knowing which iPhone model this Safari browser is running on.
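
To make this concrete, here is a short Python sketch that pulls out everything this string reveals about the device. The function ios_info and its regular expression are my own assumptions about the layout shown above, not an official Apple format description.

```python
import re

# Device family, then the iOS version written with underscores.
IOS_RE = re.compile(r"\((iPhone|iPad|iPod)[^)]*? OS (\d+(?:_\d+)*) like Mac OS X")


def ios_info(ua):
    """Return (device_family, ios_version) or None if the UA is not iOS-like."""
    match = IOS_RE.search(ua)
    if match is None:
        return None
    device, raw_version = match.groups()
    # "11_4_1" becomes "11.4.1".
    return device, raw_version.replace("_", ".")


ua = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 11_4_1 like Mac OS X) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.0 Mobile/15E148 Safari/604.1"
)
print(ios_info(ua))
```

The device family and the operating system version are all we get; nothing in the string identifies the hardware generation.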

As I said at the beginning, this is just the first part of a two-part series. In my next post, I will explain how Adobe products use the User-Agent string and the consequences of these limitations. Stay tuned!


