LogParser - a familiar look at unusual things

    When I reached for LogParser once again, I searched for its name on Habr, hoping to draw on other people's experience. The result: "Surprisingly, the search returned no results." It is genuinely surprising that such an interesting tool gets ignored, so the time has come to fill the gap. So, meet LogParser: a small but damn useful utility for SQL lovers.

    From the name alone it is hard to guess what the tool has to do with SQL. And in truth, it is as much a LogParser as it is a ChartGenerator, in the sense that it handles both jobs equally well. Overall, I would describe it as an SQL processor for heterogeneous data. The general concept is this: it takes data in some input format and converts it into tabular form (in fact, the actual parsing happens only at this stage, and only sometimes). Then, by executing an SQL query against that tabular data, it produces a result table and saves it in some output format. In short, the chain looks like: prepare input -> SQL processing -> generate output. Or, as illustrated in the documentation:
    It is time, perhaps, to move from theory to practice, for it is far more illustrative. Let's warm up with the following example. I think many have already guessed what is going on here: in familiar SQL style, we select... files from the folder X:\Folder and group them by MD5 hash to identify duplicates by content. Naturally, we discard the cases where the count of such files is 1 (i.e., there are no identical copies). We also order the duplicates found in descending order of size and output only the top 100 heaviest ones. To verify the result, look into the Duplicates.csv file. There you will find something in the following spirit:

    X:\>LogParser.exe -i:FS -o:CSV "SELECT TOP 100 HASHMD5_FILE(Path) AS Hash, COUNT(*) AS FileCount, AVG(Size) AS FileSize INTO Duplicates.csv FROM X:\Folder\*.* GROUP BY Hash HAVING FileCount > 1 AND FileSize > 0 ORDER BY FileSize DESC"

    Elements processed: 10
    Elements output: 2
    Execution time: 0.06 seconds
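    The contents of Duplicates.csv itself are not reproduced above. Judging by the column aliases in the query (Hash, FileCount, FileSize), the file would look something like this (the hash and size values here are invented purely for illustration):

    Hash,FileCount,FileSize
    0C45A4DAE8010FA1B6C9E3CEF2BC0F0C,3,1048576
    B026324C6904B2A9CB4B88D6D61C81D1,2,524288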


    The first value is the MD5 hash of the duplicates found, the second is their count, and the third is their size. Now let's decompose the example in line with the concept described earlier. The input data is determined by an input-format provider plus an address, whose form depends on the chosen provider. In our case the provider is specified by the -i:FS option (the file system), and the specific data (the folder X:\Folder) is addressed in the FROM clause of our SQL query. A registry query, for example against the \HKLM\Software branch, would look like this: LogParser.exe -i:REG "SELECT * FROM \HKLM\Software". Here the provider is REG and the address is \HKLM\Software.

    By default, LogParser offers us the following providers of source formats:

    IIS Log File Input Formats
    • IISW3C: parses IIS log files in the W3C Extended Log File Format.
    • IIS: parses IIS log files in the Microsoft IIS Log File Format.
    • BIN: parses IIS log files in the Centralized Binary Log File Format.
    • IISODBC: returns database records from the tables logged to by IIS when configured to log in the ODBC Log Format.
    • HTTPERR: parses HTTP error log files generated by Http.sys.
    • URLSCAN: parses log files generated by the URLScan IIS filter.

    Generic Text File Input Formats
    • CSV: parses comma-separated values text files.
    • TSV: parses tab-separated and space-separated values text files.
    • XML: parses XML text files.
    • W3C: parses text files in the W3C Extended Log File Format.
    • NCSA: parses web server log files in the NCSA Common, Combined, and Extended Log File Formats.
    • TEXTLINE: returns lines from generic text files.
    • TEXTWORD: returns words from generic text files.

    System Information Input Formats
    • EVT: returns events from the Windows Event Log and from Event Log backup files (.evt files).
    • FS: returns information on files and directories.
    • REG: returns information on registry values.
    • ADS: returns information on Active Directory objects.

    Special-purpose Input Formats
    • NETMON: parses network capture files created by NetMon.
    • ETW: parses Enterprise Tracing for Windows trace log files and live sessions.
    • COM: provides an interface to Custom Input Format COM Plugins.
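    To get a feel for the system providers, here is a sketch (assuming the standard Windows event log named System; the field names TimeGenerated, SourceName, EventID, and EventType come from LogParser's documented EVT schema, where EventType = 1 means an error) that pulls the ten most recent error events into a CSV file:

    LogParser.exe -i:EVT -o:CSV "SELECT TOP 10 TimeGenerated, SourceName, EventID INTO Errors.csv FROM System WHERE EventType = 1 ORDER BY TimeGenerated DESC"

    Note that the address in the FROM clause is simply the name of the event log itself.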

    That is quite a lot already. And considering that you can create your own providers, it is wonderful. Later I will return to this topic and show how to create them both "the grown-up way", with compiled assemblies, and on the fly, with plain scripts.

    The output format is defined in a similar way. With the -o:CSV option we indicated that we want the provider for CSV files, and the INTO clause of our SQL query names the file where the result will be saved. By analogy with the input providers, here are the output providers available out of the box.

    Generic Text File Output Formats
    • NAT: formats output records as readable tabulated columns.
    • CSV: formats output records as comma-separated values text.
    • TSV: formats output records as tab-separated or space-separated values text.
    • XML: formats output records as XML documents.
    • W3C: formats output records in the W3C Extended Log File Format.
    • TPL: formats output records following user-defined templates.
    • IIS: formats output records in the Microsoft IIS Log File Format.

    Special-purpose Output Formats
    • SQL: uploads output records to a table in a SQL database.
    • SYSLOG: sends output records to a Syslog server.
    • DATAGRID: displays output records in a graphical user interface.
    • CHART: creates image files containing charts.
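    Of these, DATAGRID is handy for quick interactive inspection, since it needs no output file at all. A minimal sketch (the folder path here is arbitrary; Path and Size are fields of the FS input format):

    LogParser.exe -i:FS -o:DATAGRID "SELECT TOP 20 Path, Size FROM C:\Windows\System32\*.* ORDER BY Size DESC"

    Instead of writing INTO a file, the result pops up in a simple GUI grid.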

    Let's try one more warm-up example with entirely different input and output providers: the common task of analyzing web server logs and charting the top referring sites. As you can see, LogParser chewed through almost a million and a half records in under 17 seconds, with non-trivial conditions, on an ancient Pentium D 2.8, which in my opinion is not a bad result at all. The main result, though, is in the picture:

    c:\Program Files\Log Parser 2.2>LogParser.exe -i:W3C -o:CHART "SELECT TOP 10 DISTINCT EXTRACT_TOKEN(EXTRACT_TOKEN(cs(Referer), 1, '://'), 0, '/') AS Domain, COUNT(*) AS Hits INTO Hits.gif FROM X:\ex0909.log WHERE cs(Referer) IS NOT NULL AND Domain NOT LIKE '%disele%' GROUP BY Domain ORDER BY Hits DESC" -chartType:Column3D -chartTitle:"Top 10 referrers" -groupSize:800x600

    Elements processed: 1410477
    Elements output: 10
    Execution time: 16.92 seconds


    It seems to me this is an amazing tool for database folks :), because it lets you apply SQL magic in places where we often wanted to but could not. On that note I will pause my narrative for now. Firstly, one could talk about LogParser for a very long time, far beyond the scope of a single post. Secondly, I hope this material is enough to at least mildly interest my fellow craftsmen, and to find out how much they share my enthusiasm for this utility. As soon as there are readers interested in a continuation, I will write one immediately, since a great deal remains undisclosed: the conversion mode; generating output directly, as well as via LogParser templates and XSLT; incremental mode for working with large volumes of data; creating custom input formats, with C# and JavaScript as examples; using LogParser from applications; multi-input and multi-output across multiple files; the abundance of subtleties in LogParser's SQL dialect, including its vast set of functions; the various command-line options and switches; the quirks of particular providers and their configuration. And, of course, more examples of every sort :)
