Unfinished Spam Article

    It so happened that I had to deal with the problem of spam. Here, in fact, is what I had to deal with. There is a lot of text, mostly of a general nature.

    Spam. This short word carries unpleasant associations, and it even makes some system administrators shudder. I think that these days every computer user has encountered spam and knows firsthand what it is.

    So what do we call spam?

    Spam is the sending of messages that users do not expect to receive, to a large number of recipients at the same time.

    Mass distribution is the main sign of spam. Spammers are not original in this approach: they merely follow a path that nature worked out over millions of years of evolution.

    A simple calculation shows that each cedar tree gives an average of one million viable seeds per life. However, the number of trees on our planet is not increasing, which means that the level of natural selection is approximately one in a million. It is this level that ensures full reproduction and stability of the species. (http://kedr.forest.ru/culture.html)

    For example, in mackerel (Scomber scombrus), about 99.9996% of individuals die in the first 50-70 days of their life. Thus, out of a million spawned eggs (and one female mackerel produces up to half a million small eggs at a time, which float in the water column) only a few individuals survive to maturity. Nevertheless, mackerel remains a very common fish, as evidenced by its presence on the shelves of fish shops. (http://elementy.ru/news/430696)

    Spammers can be placed in the same row: in an effort to deliver a letter to the recipient, they rely on mass production. And, most sadly, this approach pays off.

    According to reports, the percentage of users who made purchases advertised in spam messages (for example, by following the “Click here” link found in most messages) was 11% in 2005 (http://www.technewsworld.com/story/44655.html), about 6% in 2006 (http://www.yale.edu/its/email/spam/whyspam.html), and 29% in 2008 (http://www.marshal.com/pages/newsitem.asp?article=748&thesection=news).
    That is an unbelievable number, but given that only 622 people were interviewed in the latest study, the results are likely far from accurate. Yet even if the result is overstated, there are still hundreds or thousands of people who prove every day that spam generates income for those who send it.

    On the other hand, there are figures of a completely different order (http://habrahabr.ru/blogs/spam/44353/): "Real spam CTR is 0.000008%." This is probably not the whole truth either. But there is still income.

    From this point of view, the only way to defeat spam is not to respond to it. Or to make sure that spam messages do not reach those careless recipients who still respond to them. Which, at this stage of the development of our civilization, is alas impossible.

    What is spam (http://ru.wikipedia.org/wiki/Spam)

    Every spam mailing has a specific goal; otherwise its costs would simply be pointless. The content of the letters changes depending on that goal.

    The vast majority of spam messages are advertisements (for comparison, phishing or virus emails in total amount to no more than two percent of the total number of spam).

    Among the advertisements, the most popular (September statistics) are: “Adult spam” - 28%, “Medications; goods and services for health” - 19%, “Education” - 12%, “Replicas of luxury goods” - 6%, “Leisure and travel” - 6%.

    Advertising of spam services themselves has recently hovered around the 5% mark.

    See what spammers themselves say about spam:
    “Some people think that sending spam is an unethical method of advertising, but as practice shows, many people look at letters of this kind, at least out of curiosity. And among them there is always a potential client who is interested in the offer. Since spamming is a fairly inexpensive service, many customers, having evaluated the effectiveness of such a lever of influence on potential consumers, become regular customers and recommend this method of advertising to their colleagues.” (http://www.direct-mail-reklama.ru)


    Technically, spam is distributed mainly by email. The shortcomings of mail protocols developed at the end of the last century, the simplicity of implementing mass-mailing software, and the availability of address databases leave a wide field for activity here. For example, any beginner can read a spam FAQ (forum.antichat.ru/thread58130.html) or get a more or less tolerable answer to a question of interest (forum.antichat.ru/thread72829.html).

    The mailing techniques themselves have also changed with technological progress: from direct mailings by the spammers themselves to hacking users' computers and building specialized botnets (http://www.viruslist.com/en/spam/info?chapter=156608519).

    Recently, spam has also been spreading through IM, forums, and blogs. For example, www.xakep.ru has a note entitled “Microsoft takes 5th place in the list of the most spam-tolerant providers”.

    An example with classmates. I personally received this message from a classmate:
    “Hello, please help, there is no money in my account. Vote for me, please: send an SMS to 3649 with the text ‘XX 222761’ (it costs 10 rubles).”

    Besides using various media, spammers show an amazing wealth of imagination in the form of their messages: from sending messages as pictures to littering the text with junk in order to defeat automatic filters.

    Spam Issues

    The main types of harm caused by spam (http://beskov.ru/2006/05/16/spam-harm/).

    1. Load on network channels - according to the latest data, about 80% of messages sent on the Internet are spam and viruses. Increased load raises the risk of failures and the cost of transmitting obviously unnecessary data.
    2. Clogging of space on mail servers - in many cases spam can be detected and deleted, but this does not always happen.
    3. Load on the computing power of mail servers engaged in spam filtering.
    4. Cost of the staff time spent configuring servers, cleaning up spam, and tuning anti-spam filters (lost working time).
    5. Clogging of space on users' machines - if a user collects mail with a client program, in many cases spam lands directly on the machine and is stored there until it is deleted.
    6. Cost of the time users spend viewing spam and removing it from their mailboxes.

    In particular, for the period September 15-21, 2008, the share of spam amounted to 80.5% (www.spamtest.ru/document.html?context=15946&pubid=208050461).

    Also, do not forget about the moral side of spam. When every third letter in the mailbox is spam, it can have a depressing effect on the psyche, or at the very least spoil users' mood. A poorly working filter forces them to regularly clean their inbox of spam messages that have leaked through, as well as to browse the junk folder for false positives.

    For Russia, the total damage to all spam victims exceeds $200 million a year, while the income of spam companies, even by the most immodest estimates, may amount to several million dollars a year.

    Spam fighting

    How can spam be resisted? So far, progressive mankind has not come up with all that many ways. By the way they "work" with spam, the methods are divided into blocking and filtering. Blocking means ignoring any message from a blocked host. Filtering means ignoring messages that fall under the definition of spam after their content has been analyzed (that is, the recipient's side must still receive at least part of the message).

    By the way they are configured, methods can be loosely divided into local and distributed. Distributed methods allow you to "learn from the mistakes of others."

    Any method of countering spam can be described by a number of characteristics, which we will use to evaluate and compare the methods.

    - Efficiency. The most important parameter, expressed as the percentage of spam messages correctly identified and hidden from the user. Subtracting this value from one hundred gives the rate of "false negatives", that is, the percentage of spam messages that still reach the user.
    - False positives. “Clean” emails that are classified as spam or otherwise do not reach the user.
    - Influence on network channels (how far the method reduces the load by reducing the number of spam messages).
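
    As a rough illustration of the first two characteristics, the sketch below computes them from a labeled test run. The function name, the input format, and the choice to express false positives as a percentage of clean mail are assumptions made for this example, not definitions taken from any particular product.

```python
def percent(part, whole):
    """Return part/whole as a percentage, or None when whole is zero."""
    return 100.0 * part / whole if whole else None


def filter_metrics(results):
    """Compute filter characteristics from (is_spam, was_blocked) pairs.

    Assumption: false positives are reported as a share of clean mail.
    """
    spam_total = spam_blocked = clean_total = clean_blocked = 0
    for is_spam, was_blocked in results:
        if is_spam:
            spam_total += 1
            spam_blocked += was_blocked  # True counts as 1
        else:
            clean_total += 1
            clean_blocked += was_blocked

    efficiency = percent(spam_blocked, spam_total)
    return {
        "efficiency": efficiency,
        # Spam that still reached the user: one hundred minus efficiency.
        "false_negative": None if efficiency is None else 100.0 - efficiency,
        "false_positive": percent(clean_blocked, clean_total),
    }


# Hypothetical test run: 10 spam messages (8 blocked), 10 clean (1 blocked).
sample = ([(True, True)] * 8 + [(True, False)] * 2
          + [(False, True)] * 1 + [(False, False)] * 9)
metrics = filter_metrics(sample)
```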

    So, let us briefly list the main methods of combating spam. Blocking includes:

    Blacklists. They radically solve the traffic problem: traffic can be cut roughly in half. However, the effectiveness of this method is far from ideal: if the filter removes about 50 percent of spam, the share of legitimate messages deleted is about 30 percent. This harms honest companies.

    At the other extreme are whitelists, which ignore all mail except that from listed senders. They do not solve the traffic problem. Efficiency is 100% :) Unfortunately, the share of false positives also tends toward that limit.

    There are also greylists, the next step in the evolution of blacklists:
    The greylisting method relies on the fact that the "behavior" of spamming software differs from that of ordinary mail servers: spammers do not try to resend a message when a temporary error occurs, as the SMTP protocol requires. More precisely, when trying to bypass the protection they use a different relay, a different return address, and so on in subsequent attempts, so to the receiving side these look like attempts to send different letters (www.redcom.ru/isp/ispNews/netNews/ni1170203597).
    The level of false positives drops to a few percent, which is very good.
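
    The greylisting logic described above can be sketched roughly as follows. This is a minimal illustration, not the implementation of any real mail server: the class name, the 300-second delay, the reply strings, and the use of a (client IP, sender, recipient) triplet as the key are all assumptions made for the example.

```python
import time


class Greylist:
    """Temporarily reject the first delivery attempt for an unseen triplet."""

    def __init__(self, min_delay=300, clock=time.time):
        self.min_delay = min_delay   # seconds a sender must wait (assumed value)
        self.clock = clock           # injectable clock, handy for testing
        self.first_seen = {}         # triplet -> time of first attempt

    def check(self, client_ip, sender, recipient):
        triplet = (client_ip, sender.lower(), recipient.lower())
        now = self.clock()
        # Remember when this triplet first appeared.
        first = self.first_seen.setdefault(triplet, now)
        if now - first >= self.min_delay:
            # A well-behaved server retried after the delay: accept.
            return "250 OK"
        # Temporary error: a legitimate server will retry later, while a
        # spammer that changes relay or return address produces a new
        # triplet and starts waiting again.
        return "451 Temporary failure, please try again later"
```

    A retry with the same relay, sender, and recipient passes once the delay has elapsed, whereas a changed return address looks like a brand-new letter and is deferred again.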

    All the methods listed below are filtering methods.

    The next group of methods has no official name. These methods decide whether a letter is spam by indirect signs, by the "behavior" of the letters.

    There are the following varieties:

    Vaccination: the server first delivers an email to only one user, who can report that it is spam (for example, by clicking a button in the interface of the email client). The server then “trains” its main filter, and the harmful email does not make it into the second wave of delivery. This is sometimes called the "voting" method.

    Trap (honeypot). This method, by the way, was proposed (whether for the first time or not, I do not know) by K. Kaspersky in his book “Notes of a computer virus researcher”. A number of "random" mailboxes that do not belong to real users are created. Spammers find such an address either because it was deliberately exposed on some forum, or simply by brute force (abc@mosglavprodsnab.com). Since real mail never arrives at such a mailbox, we can say with almost 100% certainty that whatever does arrive is spam. In practice: “In PC Magazine tests, SkyScan service showed a spam detection rate of 96% with a false positive rate of 0.48%” (http://www.lexa.ru/articles/distributed-antispam-2.html).
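
    A trap can be sketched in a few lines. The example below is only an illustration under simple assumptions: the class and method names are hypothetical, and trapped letters are remembered by an exact SHA-256 digest of the body, which a real system would likely replace with something more tolerant of small changes.

```python
import hashlib


class HoneypotFilter:
    """Treat anything sent to a trap mailbox as spam and remember it."""

    def __init__(self, trap_addresses):
        self.traps = {address.lower() for address in trap_addresses}
        self.known_spam = set()  # digests of bodies seen in trap mailboxes

    @staticmethod
    def _digest(body):
        return hashlib.sha256(body.encode("utf-8")).hexdigest()

    def receive(self, recipient, body):
        digest = self._digest(body)
        if recipient.lower() in self.traps:
            # No real user owns this mailbox, so the letter is spam.
            self.known_spam.add(digest)
            return "trapped"
        # Real users are protected from letters already seen in a trap.
        return "spam" if digest in self.known_spam else "deliver"


# The trap address is the one used as an example in the text.
mail_filter = HoneypotFilter(["abc@mosglavprodsnab.com"])
```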

    Checksum method. All mail passing through the mail system is scanned, and checksums of the letters are sent to a central server. If the flow of letters with the same checksum exceeds a certain threshold, the server considers this a sign of spam and readily reports it to mail servers in response to their requests.
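
    The central server's side of this scheme might look roughly like the sketch below. It does not reproduce the protocol of Razor, DCC, or any other real service; the class name, the threshold of 100, the whitespace normalization, and the choice of SHA-256 are assumptions for illustration.

```python
import hashlib
from collections import Counter


def checksum(body):
    """Checksum of a letter body after trivial normalization (assumed)."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class ChecksumServer:
    """Counts how often each checksum is reported by mail servers."""

    def __init__(self, threshold=100):
        self.threshold = threshold
        self.counts = Counter()

    def report(self, digest):
        # Called by a mail server for every letter it processes.
        self.counts[digest] += 1

    def is_bulk(self, digest):
        # A checksum seen more often than the threshold signals a mass mailing.
        return self.counts[digest] > self.threshold
```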

    By the way, all these methods are effective only with a distributed architecture. In particular, Spamorez uses services such as Razor and DCC to determine whether letters are being sent in bulk.

    The main way to counteract such systems is “randomization”: sending out copies of the original, each of which differs very slightly in the order or set of words. This can be handled quite successfully by computing the checksum not from the entire content but from some part of it, or from randomly selected words, or by using one of the many algorithms for fuzzy text comparison.
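
    As one possible example of a fuzzy comparison, the sketch below compares two texts by their sets of overlapping word triples ("shingles") and the Jaccard similarity of those sets. This particular technique is my choice for illustration; the text does not say which algorithm real services use, and the function names and the shingle size are assumptions.

```python
def shingles(text, k=3):
    """Return the set of k-word sequences found in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity of the shingle sets: 1.0 means identical sets."""
    set_a, set_b = shingles(a, k), shingles(b, k)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)
```

    Two copies that differ in a single word still share most of their shingles, so a threshold on the similarity can catch "randomized" copies that an exact checksum would miss.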

    Filtering based on message content.

    Heuristic method. A set of templates is created against which the content is compared, mostly by means of regular expressions. It requires incredible maintenance effort. Efficiency is about 70-80%, with a lot of false negatives.
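
    A toy version of such a template set is shown below. The patterns are built from phrases mentioned in this article; the weights, the threshold, and the function names are invented for illustration and are not taken from any real rule set.

```python
import re

# (pattern, weight) pairs; weights are hypothetical.
RULES = [
    (re.compile(r"\bclick here\b", re.IGNORECASE), 2.0),
    (re.compile(r"\bv[i1]agra\b", re.IGNORECASE), 3.0),
    (re.compile(r"\bsend an sms to \d{4}\b", re.IGNORECASE), 2.5),
    (re.compile(r"[A-Z]{10,}"), 1.0),  # long runs of capital letters
]


def heuristic_score(text):
    """Sum the weights of all rules whose pattern occurs in the text."""
    return sum(weight for pattern, weight in RULES if pattern.search(text))


def is_spam(text, threshold=3.0):
    return heuristic_score(text) >= threshold
```

    Even this tiny example hints at the maintenance cost: every new spelling trick needs a new or widened pattern.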

    Artificial intelligence methods. So far I only know that SpamAssassin assigns a score to a letter using neural network algorithms.

    Statistical methods, as the name implies, do not focus on the meaning of words in emails. That is, it does not matter in what context the word “sex” is used, whether in an advertisement for intimate goods or in the “gender” column of an applicant's profile. The methods are based on calculating the probability that the message is spam. The main calculation method is the Bayes formula (http://en.wikipedia.org/wiki/%D0%A2%D0%B5%D0%BE%D1%80%D0%B5%D0%BC%D0%B0_%D0%91%D0%B0%D0%B9%D0%B5%D1%81%D0%B0).

    The fundamental work in the field of filtering is Paul Graham's article “A Plan for Spam” (http://paulgraham.com/spam.html), as well as its continuation, paulgraham.com/better.html. (Almost any article in which the word “spam” appears refers to Graham.)

    In general terms, spam is detected as follows:
    - the text is broken into words (for an email message this should include both the subject of the letter and some of the headers). Words are a special case of the parts a text can be split into, which are called “tokens”; different ways of cutting the text make it possible, for example, to resist the "littering" of the text.
    - the frequency of occurrence of each token in spam messages is looked up (it is stored in a database created during the "training" of the filter).
    - a representative sample of tokens is selected (Graham chose the 15 most “interesting” ones, that is, those whose spam probability is farthest from a neutral 0.5).
    - the Bayes formula is used to calculate the probability that the message is spam.
    - if the probability exceeds a certain threshold (80-90%), the message is filtered out and automatically goes to the Spam folder.
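
    The steps above can be sketched as a small filter. This is a simplified illustration rather than Graham's exact algorithm: the tokenizer, the add-one smoothing, and all names are assumptions of mine. The selection step follows Graham's article in ranking tokens by how far their probability lies from 0.5.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into a set of lowercase tokens (simplistic tokenizer)."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))


class BayesFilter:
    def __init__(self):
        self.spam_counts = Counter()  # token -> number of spam messages with it
        self.ham_counts = Counter()   # token -> number of clean messages with it
        self.n_spam = 0
        self.n_ham = 0

    def train(self, text, is_spam):
        if is_spam:
            self.spam_counts.update(tokenize(text))
            self.n_spam += 1
        else:
            self.ham_counts.update(tokenize(text))
            self.n_ham += 1

    def token_probability(self, token):
        # Add-one smoothing keeps the result strictly between 0 and 1.
        spam_freq = (self.spam_counts[token] + 1) / (self.n_spam + 2)
        ham_freq = (self.ham_counts[token] + 1) / (self.n_ham + 2)
        return spam_freq / (spam_freq + ham_freq)

    def spam_probability(self, text, n_interesting=15):
        probs = [self.token_probability(t) for t in tokenize(text)]
        # Keep the tokens whose probability is farthest from neutral 0.5.
        probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
        spam_product = ham_product = 1.0
        for p in probs[:n_interesting]:
            spam_product *= p
            ham_product *= 1.0 - p
        return spam_product / (spam_product + ham_product)


bayes = BayesFilter()
bayes.train("cheap pills click here", is_spam=True)
bayes.train("click here to win money", is_spam=True)
bayes.train("meeting notes attached", is_spam=False)
bayes.train("lunch at noon tomorrow", is_spam=False)
```

    With a threshold of, say, 0.9, `bayes.spam_probability("click here for cheap money")` would send the message to the Spam folder, while a message built from words seen only in clean mail stays well below the threshold.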

    Suppose the word X occurs in 95% of messages marked as spam and the word Y in 60%, so that P(X) = 0.95 and P(Y) = 0.6. The probability that the message is spam:

    $$P(\mathrm{SPAM}) = \frac{P(X)\,P(Y)}{P(X)\,P(Y) + (1 - P(X))\,(1 - P(Y))} = \frac{0.95 \cdot 0.6}{0.95 \cdot 0.6 + 0.05 \cdot 0.4} = \frac{0.57}{0.57 + 0.02} \approx 0.966$$

    Now suppose the word X occurs in 50% of messages marked as spam and Y in 60%. The probability that the message is spam:

    $$P(\mathrm{SPAM}) = \frac{P(X)\,P(Y)}{P(X)\,P(Y) + (1 - P(X))\,(1 - P(Y))} = \frac{0.5 \cdot 0.6}{0.5 \cdot 0.6 + 0.5 \cdot 0.4} = \frac{0.3}{0.3 + 0.2} = 0.6$$
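
    Both worked examples can be checked with a small function that generalizes the formula to any number of tokens. The function name is mine; the denominator is the sum of the two products, matching the numbers 0.57 + 0.02 and 0.3 + 0.2 used in the substitutions.

```python
from math import prod


def combine(probabilities):
    """Combine per-token spam probabilities into one message probability.

    Note: the result is undefined (division by zero) if the list contains
    both an exact 0.0 and an exact 1.0.
    """
    spam = prod(probabilities)
    ham = prod(1.0 - p for p in probabilities)
    return spam / (spam + ham)


first = combine([0.95, 0.6])   # the first worked example
second = combine([0.5, 0.6])   # the second worked example
```

    The second example also shows why neutral tokens are not worth including: a token with probability 0.5 multiplies both products by the same factor, so the result is simply the probability of the other token.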

    In addition to the Bayes formula, the chi-square distribution is also used. This distribution is good because it depends only on the bins (segments of the domain) and allows you to evaluate the deviation not only from the uniform distribution but from a distribution of any kind. (This is not very clear to me myself. It would be great to hear a clear explanation.)

    Hidden Markov models are a method based on a much more complex mathematical apparatus. They are used as an auxiliary method for finding “almost” identical texts. Unfortunately, this is a double-edged sword: besides recognition tasks, the method also copes well with text generation. Apart from spam messages, generated texts are actively used, for example, on LiveJournal.

    The effectiveness of statistical methods depends heavily on what is fed to them. For example, the naive approach, in which the algorithm receives words split on spaces or line breaks, fails badly if all spaces are replaced with underscores. Or if the message says “Viagra”, then a phone number, and then three pages of text from “War and Peace” in small print.

    "Cutting" the text into parts (they are called "tokens") can be done in different ways, but the ways have one thing in common: the text is cut into parts that are combinations of words. For example, the OSB (Orthogonal Sparse Bigram) algorithm will cut the phrase “You are the Internet of the future” roughly like this, where an asterisk marks a skipped part:
    - You are the Internet *
    - You are * of the future
    - * the Internet of the future
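
    The sparse-bigram idea can be sketched as follows: every word is paired with each of the next few words, and the skipped positions are replaced with a marker. This is a simplified illustration, not the CRM114 implementation; the function name, the window size of 5, and the asterisk marker are assumptions, and the output is in pair form, so it does not look exactly like the phrase-level illustration in the text.

```python
def osb_tokens(text, window=5, skip="*"):
    """Return word pairs within a sliding window, with skip markers between."""
    words = text.split()
    tokens = []
    for i, first in enumerate(words):
        for distance in range(1, window):
            j = i + distance
            if j >= len(words):
                break
            # One marker for every word skipped between the pair.
            gap = [skip] * (distance - 1)
            tokens.append(" ".join([first, *gap, words[j]]))
    return tokens
```

    For a three-word text and a window of 3, this yields the two adjacent pairs plus one pair with a single skipped position, which are then fed to the statistical filter in place of single words.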
    You can read a little about comparing classifiers here: www.esi.uem.es/~jmgomez/papers/sigir07.pdf. It is also worth learning about CRM114, a project devoted entirely to the task of classifying texts: crm114.sourceforge.net/wiki/doku.php?id=documents

    What's next?

    You need to take the best of all the methods while getting rid of their shortcomings. For example, combine vaccination with automatic filtering.

    From the point of view of the classification problem itself, development can be seen along two independent axes. On the one hand, there is the improvement of the hardware base and the acceleration of calculations, which makes it possible to pass letters through a set of existing filters quickly and to respond to them accordingly.
    On the other hand, there is the development of the algorithms themselves. For example, adding OSB to a regular Bayesian filter increases the filter's efficiency several times over.


    Spam is ineradicable because it works. And since it works, this encourages the invention of new methods. There is no silver bullet. But there is a balance that swings in one direction or the other. We still have to maintain it...
