Analyzing user behaviors via traffic dumps despite encryption

I have always heard that it was possible to guess the content of "HTTPS" requests based on traffic patterns. For example: a large long download from youtube.com, is, well, very likely to be a video :).

But how easy is it to go from a traffic dump, captured with something like tcpdump to the list of pages visited by an user? Mind me: I'm talking about pages here - URLs, actual content. Not just domain names.

Turns out it's not that hard. Or at least, I was able to do just that within a few tries. All you need to do is crawl web sites to build "indexes of fingerprints", and well, match them to the https traffic.

In this post, I'd like to show you how some properties of HTTPS can easily be exploited to see the pages you visited and violate your privacy - or track your behavior - without actually deciphering the traffic.

The post also describes a naive implementation of a fingerprinting and matching algorithm that I used on a few experiments to go from traffic dump, to the list of full URLs visited.

Let me get this straight from the beginning: there is no new attack described in this document, no new discovery about TLS and its implementation. It's all about playing with information that has always been available in cleartext on TLS (or SSL) streams.

If you are curious about the conclusions, you can jump there and skip all the gory details of the article.

Note that given the recent news that the US Senate voted to kill various privacy rules, one of the widely given advices to protect your privacy was to use HTTPS.

This is great advice: you should always use HTTPS. However, it is unlikely to achieve the goal you may think it achieves. For example, your ISP will still be able to see which sites and pages you visit most of the times, even if you use HTTPS and strong encryption.

Note also that the recipes in here apply to HTTP/1.0 and HTTP/1.1 only, no HTTP/2.0.

The basic ingredients, or what you need to know

When your browser makes an HTTPS request, both your request and reply are encrypted.

However, there are a number of things that are not encrypted, and easily visible to an attacker through a traffic dump obtained with tools like tcpdump or wireshark.

From this data, and I'll show you examples briefly, one can easily tell:

the IP address of the server you connected to. This is basic TCP/IP: there must be a cleartext IP address in every packet your computer sends and receive for the packet to make it to the server and back to your computer.
the number of connections opened to the remote server, and which packet belongs to which connection. Again, this is basic TCP/IP, an attacker just needs to look at the port numbers, and know how TCP works.
the domain name of the server. This is generally in clear text in the HTTPS request thanks to SNI. It is important to allow the remote host to pick the correct certificate. If you use an old browser that does not support SNI, it is still very likely that your computer made a DNS request earlier with the name in cleartext.

the size of your request, and the size of the reply. This might be a bit more surprising, but no major browser enables pipelining over the HTTP protocol by default. This means that whenever the browser requests a page, it will wait for the reply to come back before sending another request over a given TCP connection. Sure, there are multiple TCP connections, but for one connection, there's always one - and only one - request in flight, and one reply.

If an attacker can observe the traffic (again, a traffic dump, a MITM proxy, ...), than it is trivial to write a couple lines of code that will count how many bytes were sent over a connection until a response starts, count the bytes of the response until the next request starts, and so on.

You can see an example of such code here. But once you remove the empty ACKs, this is what a tcpdump of an HTTPS session looks like (with some edits for readability and anonimity):

# TCP handshake: SYN, SYN & ACK, ACK
10.11.24.36.49302 > 192.168.0.1.443: Flags [S], seq 3076221335, win 29200, length 0
192.168.0.1.443 > 10.11.24.36.49302: Flags [S.], seq 2430086164, ack 3076221336, win 28960, length 0
10.11.24.36.49302 > 192.168.0.1.443: Flags [.], ack 1, win 229, length 0

# Browser starts TLS handshake. Sends a bunch of stuff, including the
# name of the host it is connecting to. Dump obtained with "-s0 -xX" flags.
10.11.24.36.49302 > 192.168.0.1.443: Flags [P.], seq 1:290, ack 1, win 229, length 289
       [...]
       0x00f0:  0018 7777 776d 7972 616e 6f6d 7465 7374  ..www.myrandomte
       0x0100:  7369 7465 6374 2e63 6f6d ff01 0001 0000  stsite.com......

# TLS handshake continues, server returns its certificate.
192.168.0.1.443 > 10.11.24.36.49302: Flags [.], seq 1:1449, ack 290, win 939, length 1448
192.168.0.1.443 > 10.11.24.36.49302: Flags [.], seq 1449:2897, ack 290, win 939, length 1448
192.168.0.1.443 > 10.11.24.36.49302: Flags [P.], seq 2897:4097, ack 290, win 939, length 1200
192.168.0.1.443 > 10.11.24.36.49302: Flags [.], seq 4097:5545, ack 290, win 939, length 1448
192.168.0.1.443 > 10.11.24.36.49302: Flags [P.], seq 5545:5871, ack 290, win 939, length 326

# TLS handshake still, client negotiates the keys to use.
10.11.24.36.49302 > 192.168.0.1.443: Flags [P.], seq 290:365, ack 5871, win 342, length 75
10.11.24.36.49302 > 192.168.0.1.443: Flags [P.], seq 365:416, ack 5871, win 342, length 51

# TLS handshake still, server pretty much says he's happy with the keys.
192.168.0.1.443 > 10.11.24.36.49302: Flags [P.], seq 5871:6113, ack 416, win 939, length 242

# YAY! HTTP Get request encrypted over TLS
10.11.24.36.49302 > 192.168.0.1.443: Flags [P.], seq 416:533, ack 6113, win 364, length 117

# And the web page from the server
192.168.0.1.443 > 10.11.24.36.49302: Flags [P.], seq 6113:7518, ack 533, win 939, length 1405
192.168.0.1.443 > 10.11.24.36.49302: Flags [.], seq 7518:8966, ack 533, win 939, length 1448
192.168.0.1.443 > 10.11.24.36.49302: Flags [.], seq 8966:10414, ack 533, win 939, length 1448
192.168.0.1.443 > 10.11.24.36.49302: Flags [.], seq 10414:11862, ack 533, win 939, length 1448

# Until pretty much the TLS connection is closed.
192.168.0.1.443 > 10.11.24.36.49302: Flags [P.], seq 33376:34223, ack 533, win 939, length 847
10.11.24.36.49302 > 192.168.0.1.443: Flags [P.], seq 533:564, ack 34223, win 817, length 31
192.168.0.1.443 > 10.11.24.36.49302: Flags [P.], seq 34223:34254, ack 564, win 939, length 31
192.168.0.1.443 > 10.11.24.36.49302: Flags [F.], seq 34254, ack 564, win 939, length 0

Now, whenever your browser fetches a remote page and tries to display it to you, there are a few other things to keep in mind:

any web page is made by tens of resources: images, stylesheets, javascript, sometimes fetched from other remote sites (eg, bootstrap, google fonts, analytics, ads, ...).

Some of those resources will be cached by your browser, some won't.

For example: how often do you visit your bank web site? or your broker? Do you think the images/css/js are in your cache? Note that some web sites will instruct your browser not to cache the content, especially if any of those resources are dynamically generated.
some web pages fill themselves up with data retrieved via javascript, ajax is pretty common.

For example, let's visit https://www.duckduckgo.com, a very privacy conscious search engine. As you type in the search box you will see suggestions underneath.

Do you know how that works? It's quite simple: for every character you type in, a bit of javascript will send an HTTPS request to the duckduckgo servers for the letter you typed, and get back a reply with autocompletion suggestions. So every letter results in a request, and a reply.
Most users "click on links" to browse the internet. Generally, they start off a search engine, or a bookmarked page, or an initial url they remember. But from then on, they keep moving around the web site by clicking links. This is important in the next few sections.

How do we put all of this together

So, where am I going with all of this? How do I glue all of this together?

Well, the basic idea is really simple: if you forget about caching, if you fetch the same HTTPS static resources twice you'll see two requests for the same page which have the same size, and receive two responses of roughly the same size, and roughly fetch the same set of resources the page is made of (eg, css, js, ...).

When I say "roughly", I actually mean "almost the same" under many circumstances, if you take into account a few glitches. Of course real life is not that simple, but we'll get to that later.

All of this in turn means that if you account for those "facts of life", maybe, and I say maybe, an attacker (or an ISP...) can:

Record your traffic. As a starting point, the size of each request and reply, some timing, and some information that can be extracted from TLS headers. Basically, a fancier, better written, and faster version of this. Note that any ISP can do this easily.
Browse the same HTTPS domains you visited with some sort of crawler, and for each page build some sort of fingerprint based on the request and response size, which other resources each page requires, and so on.
Go over the collected requests and replies, and try to match the fingerprints identified to those collected, to determine which sites you visited, and what you did on those sites.

Of those steps, we established that 1 is simple to do. A very fast version in C / C++ to run on an embedded device to produce a "summary" to send to some "collection and processing pipeline" would probably take less than a day of work to write by an experienced programmer.

Building a crawler, like the one in 2, also does not sound that hard to do. It's been done before, hasn't it? Note also that an attacker doesn't have to crawl the entire internet beforehand. As long as the content has not changed, he can do so whenever, based on the observed domains you browsed.

An entrepreneurial kind of attacker could also build an index and sell it as a service. He could even sell packages with different kinds of indexes based on which behaviors you want to track of your users. For example: want to know what your users are doing financially? Here's an index of popular banks and financial sites. Want to know their sexual preferences? Here's an index of popular sites for dating, forums, or with sexual content. And so on.

The hard part seems to be mostly in 3: matching the fingerprint (and of course, building it). Or is it that hard? Let's start talking about the facts of life we temporarily set aside earlier.

The facts of life

There are several problems with using the size of a request and/or the size of a reply as a fingerprint:

The size must be unique. Well, unique enough. Is it? Well, it depends on the site. A catalog, for example, where every page is exactly the same except for some description and has million of entries won't work well. But by crawling a bunch of web sites (mostly financial institutions, but also a dating site, and an airline) I found that > 80% of pages have sizes that are different by at least 4 bytes, while only < 3% have exactly the same size. This, after manually adjusting for things like tracking images or some internal urls that really nobody would care about being able to tell apart in HTTPS traffic.

I also tried crawling under different usernames some parts of the sites that require registration. The finding are similar: although there are small variations in size due to dynamic conent (things like "hello Mr. Doe", or "your total is $0.0 vs $10.0"), the sizes are highly distinctive of which parts of the web interface you are browsing.

These statistics mean that if an attacker could see the exact size of the reply, and had a reasonable way to match it, by just looking at that size he would be able to guess the page correctly > 90% of the times.

Unfortunately, this is not always the case. But if we combine this with matching the request size as well, and being smart about which resources can be navigated to at any given time, the confidence of guessing correctly increases significantly. However...
Some encryption schemes will round the sizes, add some padding. Block ciphers will, with a typical rounding of 16 bytes (128 bit blocks). Stream ciphers, like AES in GCM mode (one of the most common) will not, they will just add a constant to the size, SIZE = LEN(CLEARTEXT) + K, with K = 16, for AES-GCM, generally.
Some standard features of the HTTP protocol will actually mess with the size. And this is probably the hardest problem to solve.

So, let's go over those HTTP features that mess up my grand scheme to snoop on your HTTPS transactions.

Chunked encoding - reply size

First, we have a problem with chunked encoding. Well, sometimes we do.

Let me explain. With old HTTP/1.0, or before chunked encoding was introduced, HTTP was a simple "open connection", "send request", "get reply back", "close connection", "open new connection", "send new request", and so on.

At a certain point, someone figured out this was expensive, so they looked for a way to reuse the same connection for multiple requests. The main problem was figuring out when a reply ended. So they introduced chunked encoding. With this encoding, a reply is made of a few bytes that indicate how big the "next chunk is", followed by that many bytes of reply itself, then another few bytes for the next chunk. If the size of the chunk is 0, the reply is complete.

For example, to return the string "Hello world", a chunked reply with 2 chunks could look like: {5}Hello{6} world{0}.

The problem this poses is that the encoding is encrypted, so the number of chunks returned and size markers themselves are not visible. Each size marker takes two bytes, and the number of chunks can be a constant, or change all the time depending on the server implementation.

However, many web servers or HTTP frameworks just return the whole reply as a single chunk.

The reason is simple: HTTP requires that cookies and errors must be returned in headers, before anything else. So if you have php / node / java / ... code generating the content, and your framework or web server wants to support setting a cookie or returning an error at any time, it must buffer the whole reply, and flush the buffer only when it is sure no error or cookie can be set anymore. Once we have the whole reply in a buffer, it is cheaper to just send a single chunk.

Reverse proxies, though, don't have this problem. Quite the opposite: they have an incentive to use small buffers to reduce memory consumption and latency. Depending on the implementation and configuration, and on network or backend load conditions, you may end up with a varying number of chunks - different for each reply.

If I had to guess, most web servers end up returning a single chunk. Either way, it is easy to detect the different behaviors at crawling time, and look at how many chunks are returned on average for replies.

For each chunk, the size will increase by 2 bytes, and our algorithm will need to work on a range of sizes.

Cookies

Second, we have a problem with cookies: they can affect both the request and the response size.

On the request side, the browser might be passing a cookie. On the response side, the server might be setting (or changing) a cookie.

There are two common ways cookies are used:

They maintain a unique identifier, used to retrieve (and store) information related to the user.
They maintain state, often signed and encrypted, to prevent tampering, that the web site uses to determine where you stand in a complex interaction.

Either way, cookies can be observed while crawling. The first case is generally simple: your browser will get a coookie once, and keep returning it. So it's a constant added to every request size to the server, and to responses setting it.

The second case is more complex: the size of the request and reply will keep varying based on the state. A few crawling sessions could be used to narrow down a range of minimum cookie size and maximum, which would once again give us a range, and be enough for the algorithm described.

Referrer

Every time you click a link on a web page, the browser has to load the new page. When loading this new page, it generally passes down a "Referrer" header indicating the URL of the page that contained the link (well, depending on the referrer policy, and a few other rules).

This means that the size of the request to load a page changes based on the URL that contianed the link the user clicked on.

If we use the request size to tell pages apart well, this is a problem.

However, there are a few important things to note that actually help the steps described above:

When crawling a web site, the crawler knows both the pages containing links, and the pages linked. It can then compute a "normalized" request size, by removing the size of the referrer header, which becomes a simple property of an arc connecting the two pages in a graph.
If the algorithm to match pages works well, it generally knows the page visited before by the user, and can guess the length of the referrer header. Note that some web pages (notably, google) set an "origin" policy, so the "Referrer" header is always set to the domain name of the original site.

The problem is thus most visible on the first HTTP transaction tracked of a user, where the origin is unknown, and the content of the referrer header can't be guessed.

The algorithm described below can however take this into account, by broadening the range of sizes considered acceptable in those cases, or by only looking at the response size.

Others

There are several other parameters that can affect the request size and response size, or what is observed on the wire.

Things like different browsers sending different headers, negotiating gzip encoding, language settings, or caching behaviors sending headers with 'If-Modified-Since' or CORS causing OPTIONS requests.

The web server can also have dynamic pages with advertisements or banners that are changed server side, or that react to any of the client provided headers.

My experiments here were very limited: just what I could simulate easily with the browsers on my laptop.

However, I believe that with relatively modest efforts it would be possible to add heuristics to take into account those differences, by, for example, crawling multiple times simulating different browser behaviors, increasing the size ranges accepted, and so on.

Algorithm

To take into account all of the above, the algorithm I used does something like this:

Collect traffic dumps. For each connection, extract the domain name of the server visited and size of ciphertext requests, size of ciphertext replies, and timestamp of when each started, and when each ended.
If the domain name has been crawled before, use the already generated index. If not, crawl the web site and generate the index.
The index is made of 4 main data structures:
1. A graph, with each node being a web resource (a .html page, a .css, a .js, and so on with relative URL), "load arcs" indicating which other resources it loaded, and "link arcs" indicating which other pages this page may lead to. For example, a "test.html" loading a "test.css" and "test.js" file would have 3 nodes, one per resource, and 2 "load arcs" going from "test.html" to "test.css" and "test.js". If the page linked to "foo.html" and "bar.html", it would additionally have 2 "link arcs" going to those pages.
  
  Each node also has a response size and requset size range tied to it. For example, if "test.css" was retrieved from 3 different pages, and one of the responses or requests (adjusted for the referer) was larger, it should keep track of the range size of "test.css".
2. A response range index, a range index using the response size as the key. For example, if the measured response size is 3756 bytes, and we estimate there might be an error of 16 bytes, it should return all nodes (from the graph above) that could fit in 3740 and 3772 bytes.
3. A request range index, a range index, same as above, but using the request size as the key. Note that the request size varies based on the referrer header, the size should be adjusted based on that (normalize at crawling, and adjust at matching based on the possible origin nodes).
4. (optional) A delta range index, a range index, using the delta between the size of the base request and the size of each derived request as the key, and the list of archs that caused that delta as a value. For example, if a "test.html" requst was 372 bytes, which resulted in a "test.css" request which was 392 bytes, the arch connecting the two would be indexed under the key 20 in the delta range index.
As the crawling happens, the crawler adjusts the range index used as the key above based on the referer, cookies set, behavior with chunked encoding, or the fact that different content is returned for the same url.
Sort the requests and responses by the time they started.
For each request in turn:
1. lookup the response size in the response range index, and create an array of "possible response nodes". This might yield to one result already.
2. to verify the correctness of the guess, or to further narrow down a set of possibilities, lookup the estimate of the request size in the request range index, and create an array of "possible request nodes". Then, compute the intersection between "possible response nodes" and "possible request nodes".
3. repeat from 1 for the next request. To verify the correctness of the guess, or to further narrow it down, limit the set of nodes to those connected to the set of nodes computed from the previous request (recursively).
This should lead to the exact set of nodes visited. If not at the first request response observed, after a few iterations.

Note also that if the browser request size is significantly different from that used by the crawler, the algorithm can use the delta range index instead. Eg, looking up the delta between subsequent requests. It should also be relatively straightforward to assume a + or - K error on the measures, and either increase the set of nodes, or backtrack the algorithm to add (or remove a constant) and verify if a graph can be built then.

Some fun with autocomplete

Note that this algorithm also works well when trying to understand what an users is typing on an input form, where ajax is used for autocomplete.

Those sites are relatively simple: as the user types in more characters, more HTTPS requests are sent to the server to narrow down the list of possible responses.

Unfortunately, the same approach I described before can be used: by measuring the request size and response size, and by querying ("crawling") the web site once by myself and registering the request and response size, the algorithm could successfully guess the string that was being looked up.

I tried two web sites: on one, users were supposed to enter the source and destination airport, where whatever the user typed was the prefix of either the city or airport code. On the second web site, the autocomplete was used to fill in the address.

What I did, in both cases:

Created a list of possible prefixes, sorted from most likely to least likely. For example, I got the full list of airports worldwide, the list of addresses of a large city, and created a list of likely prefixes.
Queried the server exactly as if the browser was doing it, for the first letter only, capturing request and response size.
Matching the size of the request and responses given to the user to the ones I queried myself, to guess the first letter.
Repeated the process for the second letter and so on, caching the results, so that following requests would be decoded correctly.

Again, modulo a few glitches caused by "what happens in real life", it worked.

Conclusions

So where, does the approach described here work well? Well, within a few experiments of tuning the algorithm, by "crawling" a small set of web sites and then analyzing the tcpdump of a random browsing session, I was able to tell with reasanoable certainty that I had been looking on my bank's web site for a mortgage, and tried to sell stocks on my trading account.

I was also able to guess which destination I was buying an airplane ticket for with the autocomplete trick I described before, or which category of images I was browsing in a couple random web sites I tried.

Not all pages could be guessed due to the facts of life I explained earlier, and this was an extremely limited experiment with a very naive and rough implementation. Still, I am pretty sure that things can be improved by using using more advanced data mining algorithms, heuristics, a better crawler, and in general by spending more time on this.

Which is all kind of scary, and brings up the next question.

What can I really do to protect my privacy? I don't have great advice at this point. You should keep using HTTPS as much as possible, that's still great advice. Note that your credit card number or SSN are still secure, what you type cannot be deciphered with the approach I described - an attacker can only make informed guesses. However, you should assume that an attacker knows exactly which pages you are visiting, how long you spend on them and how you got there - despite HTTPS.

The general advice is to use a VPN to increase your privacy, but I am personally not very fond of this suggestion.

On one side, it makes it very easy to tell that you are purposedly trying to hide your traffic. On the other, it does not really solve the problem. The VPN provider himself could be profiting from your data: they have your credit card on file, they know the IP you are coming from, they see your traffic going out, they even know you are a person that for some reason is willing to pay to protect his own privacy, and they might just be profiling you and selling your data just like your ISP.

And once we established that even with VPNs there's nothing technically protecting your privacy, the only thing that's left is the user agreement you "signed", which I am sure you read, right? And you are sure it's better than the one you signed with your ISP?

If I was in charge of running any sort of law enforcement or spy agency, I'd probably monitor traffic from/to VPN providers more closely than that of any other random ISP.

The next suggestion would be "go for tor!", which I sometimes use. But it is often slow, and certainly does not give you the best browsing experience, especially if you disable javscript, cookies, and are seriously protecting your privacy.

Some web sites, like duck duck go (and kudos to their developers) return X-Something headers filled with random data to break the autocomplete analysis I described.

It should be trivial to create a browser extension to add random data in request headers, but preventing reply analysis requires work on the server side.

Debian releases - stable, testing, unstable From PDF to interactive map