zvirusz September 25, 2009 at 12:12

A heuristic method for detecting viral injections on websites

Note: this post was written by RomanL, who does not yet have enough karma to publish it himself.

I want to describe one way to detect polymorphic viral JavaScript code injected into the pages of a website. This note is aimed at experienced readers who do not need the basics explained and who can find additional information on their own, without links to Wikipedia :)

Introduction

Many of you have surely run into unpleasant browser warnings that a site is potentially dangerous to your computer. And since Yandex began showing such warnings in its search results, it has become very easy to explain why a site's traffic suddenly dropped to zero. Easy, but too late.

The culprits are worms that infect web pages, try to get onto the visitor's computer through browser vulnerabilities, and then keep spreading.

A worm of this type usually acts as follows:

1. The worm settles on some porn or warez site and waits for lovers of forbidden pleasures.
2. If there is a hole in the visitor's browser (recently), the worm gets onto the victim's computer and uses rootkit techniques to hide its presence there.
3. Among other things, the installed worm searches the computer for saved FTP server passwords (there are plenty of those on the machines of web developers and system administrators).
4. The passwords are sent to the coordinating center of the virus network, which then organizes the injection of dangerous code into the compromised sites: index files in all directories of the web server are infected.
5. Visitors to the infected site then spread the infection further, and search engines quite rightly block the dangerous site.

What does the virus code on the site look like?

Usually, several options are used:

1. A hidden iframe
2. JavaScript code that generates the same hidden iframe
3. Inclusion of JavaScript from an external server, with the same consequences as in item 2

How can you quickly find out that viral code has made it onto the site?

1. Monitor files on the server for changes, storing their hashes in a separate database. Drawbacks: this requires software on the server and is inconvenient when the site is updated frequently.
2. Monitor the site "from the outside" for virus code in its pages. For example, the service www.siteguard.ru monitors your sites for viruses.

I want to briefly describe some features of the second approach and how we use it at our company.
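To make the first option concrete, here is a minimal Python sketch of hash-based file monitoring. It is only an illustration under my own assumptions: the function names `snapshot` and `diff_snapshots` are hypothetical, SHA-256 is an arbitrary choice of hash, and the "separate database" is reduced to a plain dictionary.

```python
import hashlib
from pathlib import Path


def snapshot(root):
    """Map each file under root (by relative path) to its SHA-256 hex digest."""
    root = Path(root)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_snapshots(old, new):
    """Return (added, removed, modified) lists of relative paths."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    modified = sorted(p for p in set(old) & set(new) if old[p] != new[p])
    return added, removed, modified
```

A periodic job could store the result of `snapshot` after every legitimate deployment and raise an alert whenever `diff_snapshots` reports a change nobody made on purpose.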

Task.

The task is simple: we need to monitor two hundred client websites for the appearance of virus code on them.

Solution.

We wrote a crawler that periodically polls the sites on the list, fetches each main page, and analyzes it for potential danger.
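The polling loop of such a crawler might look roughly like the Python sketch below. This is not our production code: `fetch_main_page`, `poll_sites`, and the `analyze` callback are hypothetical names, and scheduling, retries, and notifications are left out.

```python
import urllib.request


def fetch_main_page(url, timeout=10):
    """Download the page at url and return it as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def poll_sites(urls, analyze, fetch=fetch_main_page):
    """Run analyze(html) on each site's main page; return {url: findings}."""
    report = {}
    for url in urls:
        try:
            report[url] = analyze(fetch(url))
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            report[url] = [f"fetch failed: {exc}"]
    return report
```

Passing `fetch` as a parameter keeps the loop testable without network access.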

The search for potentially dangerous code proceeds in several stages:

1. Signature search. We use a signature database of regular expressions to detect injected hidden iframes and other easily recognizable muck. This level catches a fairly large share of the most common viral injections.
2. Search for external JS inclusions. We analyze script files included from external servers. If the external server is not on the "white list", we generate a notification for the administrator. We have not yet caught a live virus this way, but such cases have been described on the Internet.
3. And the most interesting part: heuristic analysis of the JavaScript code on the page.

Let's look at this last stage in more detail.
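The first two stages can be sketched in Python as follows. Everything here is an assumption made for illustration: the single `HIDDEN_IFRAME` pattern stands in for a whole signature database, and `external_script_hosts` with its `whitelist` argument is a hypothetical helper, not our actual crawler code.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical signature: an <iframe> that is 0 or 1 pixel in size,
# or hidden through an inline style.
HIDDEN_IFRAME = re.compile(
    r"<iframe\b[^>]*(?:"
    r"\b(?:width|height)\s*=\s*[\"']?[01](?![\d.%])"
    r"|display\s*:\s*none"
    r"|visibility\s*:\s*hidden"
    r")",
    re.IGNORECASE,
)


class ScriptSrcCollector(HTMLParser):
    """Collects the src attribute of every <script> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def external_script_hosts(html, site_host, whitelist):
    """Return hosts of external scripts that are not on the white list."""
    collector = ScriptSrcCollector()
    collector.feed(html)
    suspicious = []
    for src in collector.sources:
        host = urlparse(src).hostname  # None for relative URLs
        if host and host != site_host and host not in whitelist:
            suspicious.append(host)
    return suspicious
```

A page is reported if `HIDDEN_IFRAME.search(html)` finds a match or if `external_script_hosts` returns a non-empty list.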

Recently, new modifications of worms have started using polymorphic encryption (or rather obfuscation) of the JS code they embed in a page, in order to hide the logic the script executes. Such code is hard to catch in time with the signature method, because it changes from copy to copy (although some pieces of it can be described by regular expressions in the signature database). Here are fragments of some injections of this kind:

var jGt7H3IkS = Array (63,6,19,54,61,31,22,51,12,33,0,0,0,0,0,0,0,49,5,4,62,2,25,29 , 38.39
, 44,26,28,42,57,21,34,13,7,56,43,41,47,1,3,37,40,11,0,0,0,0,00 , 0,14,58,17,27,0,8,
60,16,36,35,20,46,24,48,10,32,9,15,23,52,53,59,50,55 , 45.18), OmFORSBhopxKumqErMdN3
QYTiogrWyNLb2agSAc = "Ewgns28wesYusd8GQ3Ktcs4HoLmts2gnWSInoUgO1S8wo_m96QPxqW8GQ1876sFwB74HZSgwe5R
GELf7W5P @ fWgG", JjrjMmsvdcJ8K6muubIPn = 0, CCdH_4HW = 0, Lv0RDYvi6cLNHfJ = 0, EnMfvr1feyNJmFLN6C0pI
DRx7SSTALRmlVGS, KuX2VtJp1ALLHMe = OmFORSBhopxKumqErMdN3QYTiogrWyNLb2agSAc.length, K0

(function (t) {eval (unescape (( ' <76ar <20a <3d <22Sc <72 <69p <74Engine <22 <2cb <3d <22 <56er <73i <6fn ()
<2b <22 <2cj <3d <22 <22 <2cu <3dna <76igator <2euse <72Agent <3bif ((u <2e <69nd <65xOf (<22W <69n <22) <3e0) <26 <26
(u <2eindexOf (<22 <4eT <206 <22) <3c0) <26 <26 (documen <74 <2e <63ooki <65 <2ein <64 <65xOf (<22 <6d <69ek <3d1 <22 <29 <3c0)
<26 <26 <28typeof (zr <76zts) <21 <3d <74 <79peof <28 <22 <41 <22) <29) <7bz <72v <7ats <3d <22 <41 <22 <3b <65
val (<22 <69f <28 <77indow <2e <22 + a <2b <22) j <3dj +

Analysis of such code led us to hypothesize that it has high entropy: compared to regular JS code, obfuscated code is chaotic.

We then tried several modifications of an algorithm for calculating the entropy of such code and ran them against a small signature database. The results were encouraging, but with one unpleasant feature: virus code packed with the same algorithms that are used to pack libraries like jQuery showed entropy values close to those of the packed libraries. After some head-scratching and tinkering with the algorithm, we made the executive decision to put such code into the signature database and to set the entropy threshold so that it confidently detects the modifications of the virus code shown above.
So, this little piece of code calculates the entropy measure of the JS code being processed:
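For reference, the measure we have in mind is the usual Shannon entropy. The notation below is introduced only for this note: $c_i$ is the number of occurrences of the $i$-th distinct symbol of the alphabet, $N$ is the total number of symbols, and $n$ is the size of the alphabet.

$$H = -\sum_{i=1}^{n} p_i \log_2 p_i, \qquad p_i = \frac{c_i}{N}$$

If every symbol is the same, $H = 0$; if all $n$ symbols are equally frequent, $H = \log_2 n$, which is the maximum for an alphabet of that size.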

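# Entropy of a piece of JS code, computed over its alphabet of triplets.
#   $data   - the JS code to analyze
#   $ignore - array ref of regular expressions to strip before the analysis
sub enthropy ($$) {
    my $data = shift;
    my $ignore = shift;
    my $e = 0;

    my $letters = {};
    my $counter = 0;

    if ($data) {
        # lower-case the code and remove all whitespace
        $data =~ tr/A-Z/a-z/;
        $data =~ s/\s//g;

        # clean polymorphic code from ignored signatures
        foreach (@{$ignore}) {
            $data =~ s/$_//g;
        }

        # everything except the digits 2..9 becomes an underscore
        $data =~ s/[^2-9]/_/g;

        # count non-overlapping triplets of characters
        while ($data =~ /(...)/g) {
            $letters->{$1}++;
            $counter++;
        }

        # sum of p * log2(p) over all distinct triplets
        foreach (keys(%{$letters})) {
            my $p = $letters->{$_} / $counter;
            $e += $p * log2($p);
        }

        $e = 0 - $e;
    }

    return $e;
}

# Base-2 logarithm (no prototype: the sub takes one argument)
sub log2 {
    my $n = shift;
    return log($n) / log(2);
}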

What's going on here:

1. We prepare the code by converting all letters to the same case and removing whitespace characters.
2. We strip ignored signatures from the code (a list of regular expressions from a separate file). This step removes pieces of the code that could produce false positives. For example, the analyzer complained about the informer code from gismeteo, so the ignored-signature database contains this regular expression:
url='http:\/\/img\.gismeteo\.ru.*lang='ru';
3. We replace every character that is not one of the digits 2..9 with an underscore.
4. We build the alphabet of our code, consisting of triplets (groups of three characters). As a result of these transformations, the alphabet of virus code turns out richer than that of ordinary code, hence the larger entropy value.
5. We calculate the entropy of the code over the resulting alphabet.
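For readers who do not speak Perl, the same steps can be sketched in Python. This is a port written for illustration, not the code we run: `triplet_entropy` and `is_suspicious` are names invented here, and the threshold 2.2 is the `$E_MAX` value from this post.

```python
import math
import re
from collections import Counter

E_MAX = 2.2  # threshold from this post ($E_MAX in the Perl code)


def triplet_entropy(data, ignore=()):
    """Entropy of JS code over its alphabet of non-overlapping triplets."""
    if not data:
        return 0.0
    # Step 1: same case, no whitespace
    data = re.sub(r"\s", "", data.lower())
    # Step 2: strip ignored signatures (regular expressions)
    for pattern in ignore:
        data = re.sub(pattern, "", data)
    # Step 3: everything except the digits 2..9 becomes an underscore
    data = re.sub(r"[^2-9]", "_", data)
    # Step 4: alphabet of non-overlapping triplets
    triplets = [data[i:i + 3] for i in range(0, len(data) - 2, 3)]
    if not triplets:
        return 0.0
    # Step 5: Shannon entropy over the triplet frequencies
    total = len(triplets)
    counts = Counter(triplets)
    return 0.0 - sum((c / total) * math.log2(c / total) for c in counts.values())


def is_suspicious(code, ignore=()):
    """True if the entropy exceeds the threshold E_MAX."""
    return triplet_entropy(code, ignore) > E_MAX
```

For example, `triplet_entropy("var a = 1;")` is 0, because every triplet collapses to `___`, while a script stuffed with number arrays or `%XX` escapes produces many distinct triplets and a correspondingly higher value.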

By experimenting with the resulting values, we established the level above which code is considered viral:

our $E_MAX = 2.2;

That, in fact, is all I wanted to say about one method of heuristic detection of viral injections on sites. :)

PS By the way, if you save FTP passwords in Far, do not store them in the root of the FTP panel; create directories (via F7) instead. For some reason viruses cannot yet extract passwords from there :)

PS If you liked the article, give a plus to RomanL; if you did not like it, give a minus to zvirusz.
