
Modern spam filters and end-to-end encryption

Hi
Trevor (translator's note: as far as I understand, this means Trevor Perrin) asked me to write up my thoughts on spam filters and end-to-end encryption, so that all the information is collected in one message rather than scattered across the forum. In particular, he asked me to dump my knowledge on the following topics:
- How do spam filters work in large email services today?
- How would widespread adoption of end-to-end (E2E) encryption affect them?
- What can be moved to the client (and what are the resulting pros and cons)?
- Is this realistic for email?
- What changes when moving from email to other asynchronous systems (for example, chat) or to new protocols? That is, are spam problems specific to the email protocol, or a general flaw of such systems?
I will briefly describe my experience in this area to establish my credentials: I worked at Google for seven and a half years. For 4.5 of those years I worked on the Gmail security team, which is very closely tied to the anti-spam team (they use the same applications and the same alerting systems).
Around 2010 we hit spammers hard enough that they could no longer make money with their old methods. Some of them switched to hijacking accounts on an industrial scale using compromised passwords, and then sent spam from the hijacked accounts. I was the technical lead of the new anti-hijacking team. We spent 2.5 years fighting account theft. At the beginning of 2013 we announced victory, and a few months later Edward Snowden published information showing that the NSA / GCHQ had been tapping the security system we had developed.
Since then, everything seems to have calmed down. You could say that, from Gmail's point of view, the war on spam has been won ... at least for now.
If you prefer video, a few years ago I gave a presentation at the RIPE64 conference in Ljubljana: ripe64.ripe.net/archives/video/25
In January I left Google to devote all my time to Bitcoin. I am currently working on a P2P crowdfunding application that will make it possible to raise funding through a decentralized structure.
So let's go.
A brief history of the spam war
In the beginning was ... the regular expression. Gmail does support regular-expression filtering, but only as a last resort. Regular expressions are easy to get wrong. We once blocked an email from an unfortunate Italian named “Olivia Gradina” (translator's note: read it as “Oli via Gra dina”; do you see the hint of sildenafil?). On top of that, this approach handles internationalization poorly and is easily bypassed by randomization.
Then the email community began to compile and share lists of bad IP addresses. That is how Spamhaus appeared. This approach paid off because it devalued the resources spammers had paid for. But fierce battles were fought over the lists, because the blacklist keepers became judge, jury and executioner over the flow of email. It turned out that the question “What is spam and what is not?” is very controversial. Many mailing-list operators did not consider themselves spammers, but in the absence of a clear definition they were sometimes blacklisted.
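The regular-expression pitfall described above is easy to reproduce. The following is a minimal sketch, not Gmail's actual rule: the pattern and the function name `naive_filter` are assumptions made purely for illustration. It shows both failure modes mentioned in the text, a false positive on an innocent name and a trivial bypass by character substitution.

```python
import re

# Hypothetical rule: match the drug name even when whitespace is inserted
# between the letters. This is an illustration, not a real production rule.
NAIVE_RULE = re.compile(r"v\s*i\s*a\s*g\s*r\s*a", re.IGNORECASE)


def naive_filter(text: str) -> bool:
    """Return True if the naive rule would flag the text as spam."""
    return NAIVE_RULE.search(text) is not None


# False positive: the innocent name contains "via Gra".
print(naive_filter("Regards, Olivia Gradina"))
# Easy bypass: swapping one letter for a digit defeats the rule.
print(naive_filter("Buy v1agra now"))
```

Tightening the pattern to avoid the first problem makes the second one worse, which is one reason such rules end up being a last resort.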
To get around RBLs (Realtime Blocking Lists), spammers started using botnets. In response, spam fighters mapped the Internet and built the “PBL” (Policy Block List): ranges of IP addresses assigned to residential subnets, which therefore, in principle, should not be sending mail at all. Botnets generate incredible amounts of spam, but this spam is the easiest to filter. During my time at Gmail, the spam and security teams spent very little time fighting botnets.
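The idea behind a policy block list can be sketched in a few lines. This is only an illustration under stated assumptions: the ranges below are reserved documentation networks standing in for residential address pools, and `POLICY_BLOCK_LIST` and `is_policy_blocked` are hypothetical names, not part of any real blocklist service.

```python
import ipaddress

# Hypothetical policy list. The ranges are reserved documentation networks
# (RFC 5737) standing in for residential address pools.
POLICY_BLOCK_LIST = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/25"),
]


def is_policy_blocked(sender_ip: str) -> bool:
    """Return True if the connecting IP falls inside a listed range."""
    address = ipaddress.ip_address(sender_ip)
    return any(address in network for network in POLICY_BLOCK_LIST)


print(is_policy_blocked("198.51.100.23"))  # inside the first listed range
print(is_policy_blocked("192.0.2.1"))      # not listed
```

The check looks only at where the connection comes from, not at the message itself, which is why botnet spam sent from home machines is comparatively cheap to reject.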
Then came webmail services like Gmail. The very first version of Gmail simply used SpamAssassin. This was quickly recognized as not good enough, and we built our own filter. The Gmail filter architecture was described in a 2006 paper: Sender Reputation in a Large Webmail Service.
I will briefly summarize the paper. The main technique of the new filter was to heuristically guess the domain of the email's sender (domains are harder to obtain and more stable than IP addresses) and then compute a reputation for it. A reputation is a score from 0 to 100, where 100 is a perfect reputation and 0 is definitely spam. For example, if a sender has a reputation of 70, then we treat its email as spam in about 30% of cases and let it through otherwise. The reputation is a moving average computed from exact counts of manual verdicts made with the "Report Spam" / "Not Spam" buttons and from the automatic verdicts of the filter itself. Naturally, manual reports carry much more weight and allow the filter to correct itself.
This approach has another advantage: it sidesteps all the controversy around the exact definition of "spam". The new definition is simple: spam is whatever our users call spam. It is hard to argue with such a definition. Moreover, it is very easy to put into practice, and it is flexible enough to adapt to spammers' new ideas.
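The reputation score described above can be sketched as an exponentially weighted moving average. This is a simplified illustration, not the algorithm from the paper: the class name, the neutral starting score of 50, and both weights are assumptions chosen only to show manual reports counting for more than automatic verdicts.

```python
from dataclasses import dataclass

# Hypothetical smoothing weights: a manual report moves the score more than
# one of the filter's own automatic verdicts. Real values are not given.
MANUAL_WEIGHT = 0.20
AUTO_WEIGHT = 0.02


@dataclass
class Reputation:
    score: float = 50.0  # assumed neutral start for a sender not seen before

    def update(self, is_spam: bool, manual: bool) -> float:
        """Move the score toward 0 (spam) or 100 (good) by the verdict's weight."""
        target = 0.0 if is_spam else 100.0
        weight = MANUAL_WEIGHT if manual else AUTO_WEIGHT
        self.score += weight * (target - self.score)
        return self.score

    def spam_share(self) -> float:
        """Rough share of this sender's email expected to be treated as spam."""
        return 1.0 - self.score / 100.0


sender = Reputation()
sender.update(is_spam=True, manual=True)    # a user pressed "Report Spam"
sender.update(is_spam=False, manual=False)  # the filter itself saw nothing wrong
print(round(sender.score, 2), round(sender.spam_share(), 2))
```

Because each update is a weighted step toward 0 or 100, the score always stays inside that range, and a burst of "Not Spam" reports can pull a wrongly penalized sender back up.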
It is worth noting a few points:
- The reputation system must be able to see all email. Seeing only spam is not enough, because then the reputation cannot correct itself. The “Not Spam” button is just as important as the “Report Spam” button. Most “not spam” verdicts are implicit: the user simply does not mark the email as spam.
- Reputation must be calculated quickly. If you receive an email whose sender has no known reputation, you have no choice but to let it through. This encourages spammers to try to outrun the learning system. The first version of the reputation system used MapReduce and computed reputations in batches, with a delay measured in hours. It was eventually replaced by an online system that computes scores on the fly. That system is an impressive piece of engineering: essentially a global, peer-to-peer, real-time learning system. There are no central nodes. The filter is distributed around the world and can survive the loss of several data centers. I am scared to think how one would build such a system outside a well-controlled environment. Even within a proprietary, centralized environment I had to rack my brains quite a bit ...
- Reputation spreads from one entity to another. If we know that a particular link is bad and it appears in an email from an IP address with an unknown reputation, then that IP address also gets a bad reputation, and vice versa. This turned out to be important. As the number of features used to determine reputation grows, it becomes harder for spammers to change them all at the same time. This is especially true for botnets, where precise control over the sending machines is difficult. If a spammer fails to randomize even one tiny aspect across all their emails at the same time, then all their links and IP addresses are automatically tainted and they lose money.
- Reputation systems have an inherent problem. You need a large number of users, so accounts must be free. If accounts are free, spammers can register many of them, mark their own emails as “not spam”, and mount a Sybil attack. This is not a hypothetical problem.
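The way reputation spreads between features in the list above can be illustrated with a toy sketch. It assumes each email is reduced to a list of feature strings and that an unknown feature simply inherits the mean score of the known features it appears with; the real propagation rule is not described in the text, and `propagate` is a hypothetical name.

```python
from typing import Dict, Iterable


def propagate(reputations: Dict[str, float], features: Iterable[str]) -> Dict[str, float]:
    """Give unscored features the mean score of the known features seen with them."""
    features = list(features)
    known = [reputations[f] for f in features if f in reputations]
    if not known:
        return reputations  # nothing to learn from in this email
    inherited = sum(known) / len(known)
    for feature in features:
        # setdefault leaves already-scored features untouched.
        reputations.setdefault(feature, inherited)
    return reputations


scores = {"link:bad.example": 5.0}
# An email from a never-seen IP address carries the known-bad link.
propagate(scores, ["link:bad.example", "ip:203.0.113.7"])
print(scores["ip:203.0.113.7"])
```

With this rule a single reused feature is enough to taint everything that appears next to it, which is why a spammer has to randomize every aspect of a campaign at once.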
The reputation system was designed to calculate reputations for a number of features of an email besides the sender's domain. One such feature is the domains of the clickable links in the body. Links became a critical battlefield that was fought over for several years. The reason is clear: spammers need to sell something, so they need to bring the user to their store. No matter what they call their product, the link to the final site has to work. The battle went as follows:
- It all started with plain links in the HTML of the email. Filters began to block emails containing such links.
- Spammers began to obfuscate links and asked users to reassemble the link by hand and type it into the address bar. This did not work well. Most users would not or could not do it, and revenues fell.
- Spammers started buying and creating random domains in bulk. Top-level domains such as .com are expensive, but there are cheaper ones, and the reputation of some entire top-level domains (for example, .cc) sank through the floor.
- When registrars began to tighten the screws, spammers ran out of top-level domains. They turned to stealing reputation. For example, they created blogs on sites that let anyone register a subdomain: *.blogspot.com, *.livejournal.com and others. URL shorteners became spammers' best friends. Literally every URL-shortening service became a battlefield where operators fought spammers over the reputation of the domain.
- Spammers started hacking websites. This approach did not always work well, because it is a rare website that can offer legitimate mail with a good reputation. Hacked sites are also a good source of passwords.
- Large content-hosting sites, such as Google, tie the spam filter into the hosting engine. As soon as the reputation of a user's URL drops, hosting for it is shut down automatically. The first versions of such systems were too slow. One of my projects at Google was building a real-time system to remove such content automatically.
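Link domains can only be scored once they have been pulled out of the message. The sketch below shows one plausible way to collect link host names from an HTML body using only the Python standard library. It is an assumption-laden illustration (`LinkHostCollector` and `link_hosts` are hypothetical names) and deliberately ignores the obfuscation tricks listed above.

```python
from html.parser import HTMLParser
from typing import Set
from urllib.parse import urlsplit


class LinkHostCollector(HTMLParser):
    """Collect the host names of all <a href="..."> links in an HTML body."""

    def __init__(self) -> None:
        super().__init__()
        self.hosts: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # hostname is None for mailto: links and relative URLs.
                host = urlsplit(value).hostname
                if host:
                    self.hosts.add(host)


def link_hosts(html_body: str) -> Set[str]:
    """Return the set of lower-cased host names linked from the body."""
    collector = LinkHostCollector()
    collector.feed(html_body)
    collector.close()
    return collector.hosts


body = '<p><a href="http://shop.example.com/buy">Buy</a> <a href="mailto:a@example.org">mail</a></p>'
print(link_hosts(body))
```

Keeping the full host name rather than only the registered domain matters for shared hosts such as *.blogspot.com, where each subdomain belongs to a different user and needs its own reputation.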
Between 2006 (when registration opened) and 2010, we built a spam filter for account registration. We did a very good job, if I may say so myself. Look at the prices of “free” webmail accounts on buyaccs.com (a Russian account store). Note that hotmail / outlook.com accounts cost $10 per thousand, while Gmail accounts are an order of magnitude more expensive. When we started, Gmail accounts cost $25 per thousand, and we managed to push the price up fourfold. It is hard to push it much higher, because all major websites use phone number verification to handle false positives at registration, and at the current price level it becomes more profitable to buy SIM cards in bulk.
A lot of magic goes into fighting mass registration. For example, I created a system that generates randomly encrypted JavaScript that resists reverse-engineering attempts. The script can detect automated registration tools and shut them down [1].
How will end-to-end encryption affect all of this?
The following conclusions can be drawn from my stories above:
- Large amounts of data are really important both for blocking spam and for identifying good emails.
- The reaction speed of the system matters. Many battles with spammers boiled down to who was faster. If it takes you 3 minutes to compute a reputation, you have already been outrun.
- It is important to police your users. Reputation cannot be calculated if user actions cannot be trusted. This creates a seemingly paradoxical situation: free accounts still cost money (if you need a large number of them).
The first problem with E2E cryptography is that the reputation database needs data from all emails. We can imagine a mail client that decrypts and analyzes an email and then sends a good / bad report to some hypothetical central repository. But in the end this central repository will learn not only who you are talking to, but also which links your emails contain. This is extremely valuable information. The more features you need to analyze, the more acute this problem becomes.
The second problem is that if the central repository cannot read your emails, it cannot be sure your reports are truthful. With unencrypted email this is not a problem, because the spam filter itself extracts the information it needs from the messages. If spammers want to game the system, they still have to send real emails to themselves, which raises their costs. In a world where spam filters cannot read email, spammers can freely submit completely fictitious “good email” reports. It gets even worse, because spammers can start competing with each other by sending false negative reports. We have seen something similar in our AdWords system.
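The effect of unverifiable reports can be shown with simple arithmetic. The sketch below assumes a deliberately simplified score, the plain share of “not spam” reports, rather than the weighted moving average a real filter would use. The report counts are invented for illustration and `reputation_from_reports` is a hypothetical name.

```python
def reputation_from_reports(not_spam: int, spam: int) -> float:
    """Score from 0 (spam) to 100 (good): the share of 'not spam' reports."""
    total = not_spam + spam
    if total == 0:
        raise ValueError("no reports yet")
    return 100.0 * not_spam / total


# 50 real recipients, all of whom report the campaign as spam.
honest = reputation_from_reports(not_spam=0, spam=50)
# The spammer adds 450 fabricated "not spam" reports from accounts they control.
poisoned = reputation_from_reports(not_spam=450, spam=50)
print(honest, poisoned)
```

If the repository cannot check that a report corresponds to a real message with the claimed content, fabricating the second set of reports costs the spammer almost nothing.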
The third problem is that spam filters rely heavily on security through obscurity, because it works well. Some of the features used in the analysis are widely known (for example, the sender's IP address and links), but many others are kept secret. If the filtering logic is moved to clients, spammers will be able to see exactly what they need to randomize in order to confuse the reputation system.
Perhaps these two problems can be solved with trusted computing. It lets you run encrypted programs on private data, while the hardware “proves” to the central server that the program really ran. But it will be difficult to combine security through obscurity with end-to-end encryption: if your email passes through a black box, that box could in theory steal the contents of the message. You would have to trust something that computes secret features from your messages. Then why not just trust Gmail today?
The fourth problem: anonymity and spam filters do not mix well. Essentially, you need to nip spam in the bud at the point where the email is sent. Terminating accounts is a fundamental anti-spam tool. All major webmail and social services force users to pass phone number verification if the security filter is triggered. Usually a random code is sent by SMS, or a phone call is made, to check that the user is real. This approach works because phone numbers cost money and almost all of us have at least one. But in many countries anonymous phone numbers are prohibited, and operators are required to check identity documents before selling a SIM card. This means you can be de-anonymized with plausible deniability: even if you provide no personal data at registration, the government can force you to reveal your location and / or identity at any time. It does not need to do anything special for this. If it can intercept your password, it can arouse the suspicion of the site's security system, wait for you to enter a phone number, and pull out all the metadata it needs (I have never encountered such a situation, but it is theoretically possible).
And the last problem: spam filters are demanding on CPU and disk storage. Many users today access their mail exclusively via mobile phones. Smartphone resources are limited, and the harder they work, the faster the battery drains. Just powering up the radio and downloading a message costs some battery. Even if you try to run outdated 1990s spam-fighting techniques on your phone, the phone will most likely be doomed. Only a revolutionary breakthrough in battery technology can save it.
As a result, I do not see a realistic way to return to filtering spam entirely on the client side.
What happens if everyone moves from email to other messaging systems?
SMS spam is a good example. There is not much of it, because phone companies act as spam filters. Governments also try to take part by introducing penalties for SMS spam in order to deter future offenders, sending a message to would-be criminals, so to speak. Email spam boomed long before governments began to respond to it. It is therefore interesting to compare how the two systems were handled.
Apps like WhatsApp do not seem to suffer much from spam. But I think this mostly reflects the good work of their spam / abuse teams. They are in the most advantageous position: it is a million times easier to fight when there is a single center from which you can control everything and change anything at any time. You can kill accounts and control the flow of registrations. Without a single control center, you can rely only on inbound filtering and suffer in silence when spammers find a way around your defenses. In addition, you usually do not control the clients.
General thoughts and conclusions
If you look at how the war against spammers was won, you see incredible effort spent over several years. The war analogy fits: there were two warring sides and many interesting battles, clashes of tactics and weapons. I could keep telling war stories all day, but then this email would get very long.
Trying to replay this war in the context of total encryption will be like trying to fight blindfolded and handcuffed. You will be crushed in a minute.
Therefore, I think we need a fundamentally new approach. The first idea that comes to mind is to charge a fee for sending email. But this is a bad idea for several reasons. The most obvious one: free global communication is one of humanity's greatest achievements, comparable to putting a man on the moon. A person in rural China can send me an email in a few seconds, for free, and I can reply, for free! Think about that for a second.
Another reason is that a fee for email erases the difference between spammers and legitimate bulk senders. Many companies send large volumes of email that users actually want. Take Facebook, for example. If every email cost money, some honest and useful companies would not be able to operate.
Another approach is to require some kind of monetary deposit. There is a protocol that lets you sacrifice some bitcoins as a fee to miners. You can then prove that you spent the money by signing a challenge with the key that made the payment. This would let you legitimize anonymous mailboxes, from which you can then send as much email as you like. There is also a way to calculate reputations: spam / not-spam reports can store only the proof of sending, and reputations are then computed from these reports. Email from a sender who does not yet have a reputation can be held until it is checked by volunteers. Another option is to allow cross-signing: a member with a good reputation can temporarily vouch for an email, which raises the sender's reputation and brings a reciprocal gain in reputation.
For this reason, I am interested in projects at the intersection of Bitcoin and E2E messaging. In my opinion, these are fundamentally related things.
To summarize: I am known in the Bitcoin community for my radical ideas. For example, I have suggested that there is a trade-off between privacy and abuse. Many people in cryptographic communities passionately reject this idea and (unfortunately) the person who dared to voice it. I hope the stories above show how I came to these conclusions. I think that striving for perfect privacy without taking abuse of that privacy into account is a bad path for any system that wants to achieve wide adoption.