Visualizing geo-information from logs on a web map in real time



To avoid any ambiguity, let me state the essence up front. When I applied for a new job, I was given a test task that can be summed up as follows: “Write an analogue of Glow for geovisualizing the events of users entering an online store.” Simply put, we need to watch the system log for certain events and, when one occurs, display a point on the map at the location determined by the user's IP address. The purpose of the implementation is to create a pleasant-looking “toy” for presentation purposes, capable of immersing the beholder in a nirvana of harmony and aesthetic pleasure. The main condition was to use the Java technology stack, which shaped many of the decisions. In addition, it was decided to implement this as a single-page site. And since I was only superficially familiar with Java and the web (I had mostly written in C/C++), I had to learn a lot. Well, let's figure it out together.
The article is intended for interested readers and beginners, but it does not spell out simple things that can be found in the documentation or in specialized articles. The most useful resources, a link to the source code (distributed under the BSD license), and a link to the working version are given at the end of the article.


And anyway, why not use the sources of the aforementioned Glow? Firstly, they are tailored to the volumes of data that Mozilla dealt with (remember the number of Firefox installations on launch day) and to the fact that their logging system is decentralized. In our case, about 100 entries per second are written to a single log file at peak, of which only some need to be visualized. Secondly, the map in Glow is not the most pleasant to look at. And thirdly, this is a test task :)

Quick look


What is required of our mini-system?
  • Track updates to the log file (like tail -f). Note that once a day the log file is closed and archived and a new file takes its place, so we need to detect this and switch to the current log.
  • Determine the type of event for each new log entry and, if it needs to be displayed on the map as a point, resolve the coordinates of the point from the IP address contained in the entry.
  • Event data must be transmitted in real time to clients (in this case, a script in the client's browser).
  • The client script should render the information as a neat map with points on it, colored according to the type of the corresponding event.

After a little research on each item, the following was decided. A small Java daemon (it sounds funny, I know, but never mind) will monitor the log, parse the records, resolve IPs, and send the data to the server via HTTP POST. This will later make it easy to change individual parts of the system without a headache. The server will double as a servlet container, for which we will write the corresponding servlet. The client side should be some kind of map widget (map renderer) that communicates with the server asynchronously. There are several basic ways to do this (for more details, see article [1] and review [2]):
  1. Comet. The client connects to the server, and the server does not close the connection but keeps it open, which allows new data to be sent to the client immediately (push). Alternatively, use WebSocket.
  2. Frequent polling. The client polls the server for new data at a given frequency.
  3. Long polling. Something between the previous two. The client requests new data from the server, and if there is none yet, the server does not close the connection. When data arrives, it is sent to the client, which in turn immediately sends a new request.

The choice fell on long polling, since WebSocket is not supported by all browsers, and frequent polling simply eats up traffic for nothing while wasting server resources. In addition, the Jetty web server (which doubles as a servlet container) makes it possible to use the continuations technique to process long polling requests (see [1]). But wait, you may ask, where is the real time here? We are not writing an embedded system for airplanes but a neat presentation map, so a delay of 1-2 seconds between a user action and the point appearing on the observer's map is not critical, is it?
Among the map engines, Leaflet was selected as one of the most pleasant-looking, with a simple, friendly API. Also note Leaflet's good browser support.
Well, let's get started; we will solve problems as they arise.

Get data from the log


How do we monitor log updates, given that the log is periodically archived and recreated? You could use, for example, the Tailer class from the well-known Apache Commons library, but we will go our own way, which is partly similar. Our class TailReader is initialized with the directory in which the log is located, a regular expression describing the name of the log file (since it can change), and the update period, that is, the interval at which we check for new entries in the log. The class interface resembles working with standard I/O streams, but a call to nextRecord() blocks if no new entries have appeared in the log. To check for new entries without blocking, use the hasNext() method. Since the log is tracked in a separate thread (not to be confused with an I/O stream), there are start() and stop() methods for controlling the thread. If the file stream is closed (the log was sent for archiving), then after a specified number of read attempts the object decides that it is time to open a new log. The log is located according to the rules in getLogFile():
    /**
     * Return the log file currently in use.
     * @return the log file, or null if there is none
     */
    private File getLogFile() {
        File logCatalog = new File(logFileCatalog);
        File[] files = logCatalog.listFiles(new FileFilter() {
            @Override
            public boolean accept(File pathname) {
                return pathname.canRead()
                        && pathname.isFile()
                        && pathname.getName().matches(logFileNamePattern);
            }
        });
        // listFiles() returns null if the directory does not exist or cannot be read
        if (null == files || 0 == files.length)
            return null;
        if (files.length > 1)
            Arrays.sort(files, new Comparator<File>() {
                @Override
                public int compare(File o1, File o2) {
                    // Long.compare avoids the overflow of casting a long difference to int
                    return Long.compare(o1.lastModified(), o2.lastModified());
                }
            });
        // the most recently modified file is the current log
        return files[files.length - 1];
    }
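
The blocking interface described above (nextRecord(), hasNext(), start(), stop()) can be illustrated with a minimal sketch. This is not the article's TailReader: the class name TailSketch, the single fixed file path, and the "file shrank, so start from the beginning" rotation rule are all assumptions made to keep the example short. A background thread polls the file and pushes complete lines into a queue; nextRecord() simply blocks on that queue.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch of a tail -f style reader; not the article's TailReader. */
class TailSketch {
    private final Path file;
    private final long periodMillis;
    private final BlockingQueue<String> records = new LinkedBlockingQueue<>();
    private volatile boolean running;
    private Thread worker;
    private long offset; // first unread byte; touched only by the worker thread

    TailSketch(Path file, long periodMillis) {
        this.file = file;
        this.periodMillis = periodMillis;
    }

    void start() {
        running = true;
        worker = new Thread(this::pollLoop, "tail-sketch");
        worker.setDaemon(true);
        worker.start();
    }

    void stop() {
        running = false;
        if (worker != null) worker.interrupt();
    }

    /** Non-blocking check for pending records. */
    boolean hasNext() {
        return !records.isEmpty();
    }

    /** Blocks until a new record is available. */
    String nextRecord() throws InterruptedException {
        return records.take();
    }

    private void pollLoop() {
        while (running) {
            try {
                readNewLines();
            } catch (IOException e) {
                // the file may be missing while it is being rotated; retry on the next tick
            }
            try {
                Thread.sleep(periodMillis);
            } catch (InterruptedException e) {
                return; // stop() was called
            }
        }
    }

    private void readNewLines() throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long length = raf.length();
            if (length < offset) offset = 0; // assumption: a shorter file means a fresh log
            raf.seek(offset);
            byte[] chunk = new byte[(int) (length - offset)]; // sketch: assumes small increments
            raf.readFully(chunk);
            int lineStart = 0;
            for (int i = 0; i < chunk.length; i++) {
                if (chunk[i] == '\n') {
                    records.add(new String(chunk, lineStart, i - lineStart, StandardCharsets.UTF_8));
                    lineStart = i + 1;
                }
            }
            offset += lineStart; // an unfinished last line is re-read on the next poll
        }
    }
}
```

A consumer would call start() once and then loop on nextRecord(); switching to a differently named file, as getLogFile() does, is deliberately left out of this sketch.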

Now that we have learned to monitor log updates, we need to do something with them. First, we determine the type of each event, and if it needs to be displayed on the map, extract the client IP and resolve it to geographic coordinates.
The RecordParser class, as you might guess, analyzes the lines of the log file using regular expressions. The method LogEvent parse(String record) returns a simple object that encapsulates the event type and IP address, or null if the log entry does not interest us (this, by the way, is far from best practice in the Java world; it is better to use the Null Object pattern). Requests from search robots are also filtered out along the way (they are not really store users, right?).
Finally, the IpToLocationConverter class resolves IP addresses to geographic coordinates using the services of Maxmind (via its Java API) and IpGeoBase (accessed via its XML API, the logic of which is encapsulated in the package com.ecwid.geowid.daemon.resolvers). Maxmind resolves Russian addresses rather poorly, so we additionally use IpGeoBase. The Maxmind API is trivial; resolving is done through a database file stored locally. For IpGeoBase a resolver was written that caches requests to the service, for obvious reasons.
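The idea of caching requests to a remote resolver can be sketched roughly as follows. This is only an illustration, not the resolver from the article's source: the class name CachingResolverSketch, the generic lookup function, and the least-recently-used eviction policy are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical caching wrapper around any IP -> location lookup. */
class CachingResolverSketch<L> {
    private final Function<String, L> delegate;
    private final Map<String, L> cache;

    CachingResolverSketch(Function<String, L> delegate, final int maxEntries) {
        this.delegate = delegate;
        // an access-ordered LinkedHashMap evicts the least recently used entry
        this.cache = new LinkedHashMap<String, L>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, L> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached location, calling the slow delegate only on a cache miss. */
    synchronized L resolve(String ip) {
        L cached = cache.get(ip);
        if (cached != null) return cached;
        L fresh = delegate.apply(ip);
        if (fresh != null) cache.put(ip, fresh); // failed lookups are not cached
        return fresh;
    }
}
```

The delegate would be whatever performs the actual network call; repeat visitors with the same IP then cost a map lookup instead of an HTTP round trip.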
In order not to overload the server, we will send it data in batches of several items, so that the records in one batch do not differ much in time. To do this, the point objects accumulated for visualization (class Point) are stored in a buffer, an object of class PointsBuffer, and are flushed to the server in JSON format when the buffer fills up (we serialize objects using Gson).
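A rough sketch of such a buffer is given below. It is not the article's PointsBuffer: the field names lat, lng and type, the fixed capacity, and the sender callback are assumptions, and the JSON is assembled by hand only to keep the example free of external libraries (the article uses Gson).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Consumer;
import java.util.stream.Collectors;

/** Hypothetical point: the field names are assumptions, not the article's Point class. */
class PointSketch {
    final double lat;
    final double lng;
    final String type;

    PointSketch(double lat, double lng, String type) {
        this.lat = lat;
        this.lng = lng;
        this.type = type;
    }

    // Hand-rolled JSON to stay stdlib-only; assumes type needs no escaping.
    String toJson() {
        return String.format(Locale.ROOT,
                "{\"lat\":%.4f,\"lng\":%.4f,\"type\":\"%s\"}", lat, lng, type);
    }
}

/** Collects points and hands them to the sender as one JSON array when full. */
class PointsBufferSketch {
    private final int capacity;
    private final Consumer<String> sender;
    private final List<PointSketch> points = new ArrayList<>();

    PointsBufferSketch(int capacity, Consumer<String> sender) {
        this.capacity = capacity;
        this.sender = sender;
    }

    synchronized void add(PointSketch point) {
        points.add(point);
        if (points.size() >= capacity) flush();
    }

    synchronized void flush() {
        if (points.isEmpty()) return;
        String json = points.stream()
                .map(PointSketch::toJson)
                .collect(Collectors.joining(",", "[", "]"));
        points.clear();
        sender.accept(json);
    }
}
```

In the daemon, the sender would be the code that performs the HTTP POST. A real implementation would probably also flush on a timer, so that a half-empty buffer does not hold points back during quiet periods.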
The whole logic of the daemon lives in the class GeowidDaemon. The daemon settings are stored in XML (bad taste on my part; properties files or YAML would have sufficed, but I really wanted to try XML-to-object mapping). Pay attention to the event patterns in the settings, listed here as the event type followed by its regular expression:
    def  \b((?:\d{1,3}\.){3}\d{1,3})\b\s+script\.js
    mob  \b((?:\d{1,3}\.){3}\d{1,3})\b\s+mobile:
    api  \b((?:\d{1,3}\.){3}\d{1,3})\b\s+api:

Event types: def is the opening of the “regular” store, mob is the opening of the mobile store, and api is a call to the API service. The type is determined by finding in the log entry a substring that matches the corresponding regular expression, in which the IP is captured as a group.
A wonderful script for launching the daemon was found online.
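How the three patterns map a log line to an event type and an IP can be sketched as follows. The patterns are the ones from the settings; everything else is an assumption: the class names, the sample log line format, and the crude bot filter by keyword. Instead of returning null, the sketch returns an Optional, which is one way to sidestep the problem the article mentions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical event holder: event type plus the client IP from capture group 1. */
class LogEventSketch {
    final String type;
    final String ip;

    LogEventSketch(String type, String ip) {
        this.type = type;
        this.ip = ip;
    }
}

class RecordParserSketch {
    // common prefix of all three patterns: an IPv4-looking address followed by whitespace
    private static final String IP = "\\b((?:\\d{1,3}\\.){3}\\d{1,3})\\b\\s+";
    // assumption: robots are recognized by a keyword somewhere in the record
    private static final Pattern BOT =
            Pattern.compile("bot|crawler|spider", Pattern.CASE_INSENSITIVE);

    private final Map<String, Pattern> patterns = new LinkedHashMap<>();

    RecordParserSketch() {
        patterns.put("def", Pattern.compile(IP + "script\\.js"));
        patterns.put("mob", Pattern.compile(IP + "mobile:"));
        patterns.put("api", Pattern.compile(IP + "api:"));
    }

    /** Returns the event for a record, or an empty Optional if the record is of no interest. */
    Optional<LogEventSketch> parse(String record) {
        if (BOT.matcher(record).find()) return Optional.empty();
        for (Map.Entry<String, Pattern> entry : patterns.entrySet()) {
            Matcher matcher = entry.getValue().matcher(record);
            if (matcher.find()) {
                return Optional.of(new LogEventSketch(entry.getKey(), matcher.group(1)));
            }
        }
        return Optional.empty();
    }
}
```

Keeping the patterns in a map mirrors the idea of reading them from the settings file: adding a new event type means adding one more entry, not more code.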

Sharing data with clients


Let's see what the vaunted continuations in the Jetty API are all about (let's agree to use version 7 of the server). They are described excellently in the documentation [3], including code examples, and we will use them. Our servlet GeowidServlet is minimalistic: it can receive data from the daemon and hand it to clients. The following code is the most interesting in this regard:
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
        synchronized (continuations) {
            // hand the new data to every suspended client and wake it up
            for (Continuation continuation : continuations.values()) {
                continuation.setAttribute(resultAttribute, req.getParameter(requestKey));
                try {
                    continuation.resume();
                } catch (IllegalStateException e) {
                    // the continuation is no longer suspended; nothing to resume
                }
            }
            continuations.clear();
            resp.setStatus(HttpServletResponse.SC_OK);
        }
    }

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
        String reqId = req.getParameter(idParameterName);
        if (null == reqId) {
            // without an ID we cannot match the client to its continuation
            resp.sendError(HttpServletResponse.SC_BAD_REQUEST, "Request ID needed");
            logger.info("Request without ID rejected [{}]", req.getRequestURI());
            return;
        }
        // set by doPost() before it resumes the continuation
        Object result = req.getAttribute(resultAttribute);
        if (null == result) {
            Continuation continuation = ContinuationSupport.getContinuation(req);
            synchronized (continuations) {
                if (!continuations.containsKey(reqId)) {
                    // first visit: park the request until data arrives or the timeout fires
                    continuation.setTimeout(timeOut);
                    try {
                        continuation.suspend();
                        continuations.put(reqId, continuation);
                    } catch (IllegalStateException e) {
                        logger.warn("Continuation with reqID={} can't be suspended", reqId);
                        resp.sendError(HttpServletResponse.SC_INTERNAL_SERVER_ERROR);
                    }
                } else if (continuation.isExpired()) {
                    // timed out with no data: answer with an empty result
                    // (the lock on continuations is already held here)
                    continuations.remove(reqId);
                    resp.setContentType(contentType);
                    resp.getWriter().println(emptyResult);
                } else {
                    resp.sendError(HttpServletResponse.SC_BAD_REQUEST, "Request ID conflict");
                }
            }
        } else {
            resp.setContentType(contentType);
            resp.getWriter().println((String) result);
        }
    }

What's going on here?
When a client comes for new data, we check for a unique identifier in the parameters of the GET request (which, in truth, is pseudo-unique; see the function getPseudoGUID() in the client-side implementation). If the ID is absent, we turn the client away. The ID is needed to correctly identify the continuation associated with a particular client. Next, we check whether the attribute containing the data is set for this request. Naturally, if the client has come to us for the first time, there can be no data yet. Therefore, we create a continuation for it with a given timeout, suspend it, and put it in a hash table for storage. However, the continuation timeout may expire before any data arrives. This case is handled by the check if (continuation.isExpired()): when it passes, the servlet gives the client an empty JSON array and removes the continuation for that client from the table as no longer needed.
If the data attribute is set, we simply return the data to the client. Where does the data come from? From the POST request handler, of course. As soon as the daemon sends data, the servlet runs through the table of suspended continuations, sets the data attribute on each one, resumes each one (resume), and then clears the table. At this moment doGet() is re-entered for each continuation, this time with the data the user needs.
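The suspend/resume flow can also be illustrated outside a servlet container. The sketch below is only an analogy written with plain java.util.concurrent (Java 9 or newer); it does not use or reproduce the Jetty Continuation API. The class name LongPollHubSketch and its methods are assumptions: poll() plays the role of doGet() parking a client, and publish() plays the role of doPost() waking everyone up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical long-polling hub; an analogy to the servlet, not Jetty code. */
class LongPollHubSketch {
    static final String EMPTY_RESULT = "[]";

    private final Map<String, CompletableFuture<String>> waiting = new HashMap<>();
    private final long timeoutMillis;

    LongPollHubSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** GET side: park the client until data arrives or the timeout fires. */
    CompletableFuture<String> poll(String reqId) {
        synchronized (waiting) {
            if (waiting.containsKey(reqId)) {
                return CompletableFuture.failedFuture(
                        new IllegalStateException("Request ID conflict"));
            }
            CompletableFuture<String> parked = new CompletableFuture<>();
            waiting.put(reqId, parked);
            // on timeout the client gets an empty JSON array, as in the servlet
            parked.completeOnTimeout(EMPTY_RESULT, timeoutMillis, TimeUnit.MILLISECONDS)
                  .whenComplete((result, error) -> {
                      synchronized (waiting) {
                          waiting.remove(reqId, parked); // no-op if publish() already cleared it
                      }
                  });
            return parked;
        }
    }

    /** POST side: deliver the data to every parked client and clear the table. */
    void publish(String json) {
        List<CompletableFuture<String>> parked;
        synchronized (waiting) {
            parked = new ArrayList<>(waiting.values());
            waiting.clear();
        }
        for (CompletableFuture<String> client : parked) {
            client.complete(json);
        }
    }
}
```

As in the servlet, a client that receives a response (data or the empty array) is expected to call poll() again right away, which is what turns a series of ordinary requests into a near real-time stream.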
You can, for example, gauge the mysterious power of these continuations using a profiler under load. For this, the author used VisualVM and Siege. The author is a mediocre tester, so the test looked extremely artificial. The JVM “warmed up” for about an hour, settling at 15 MB of heap space. Then, using Siege, we loaded the server with 3000 parallel requests per second (I did not want to poke around in the system to raise limits on open files, etc.) for 5 minutes. The JVM consumed ~250 MB of heap space and loaded a processor core by ~10-15%. A good result for a beginner, I think.


Visualization, sir


Let me say right away: my JavaScript code may seem “uncanonical” to a professional frontend developer. That is for those who dig into my code to judge :)

So, we use Leaflet. How will we display the points on the map? Standard markers look out of place. With png or, W3C forbid, gif, you cannot get a nice picture with animated points. There are two ways:
  1. Animation through SVG. An excellent article on this topic recently appeared on Habr. Pros: Leaflet already has an excellent plugin (pay attention to the demo at the bottom of the page) that uses the excellent Raphaël library, and this library can draw SVG even in IE6 (more precisely, it falls back to VML). Cons: due to the specifics of SVG, animating it is quite resource-intensive (put yourself in the browser's place: most of the time you have to parse XML and re-render the graphics according to the changes in it).
  2. HTML5's canvas. Everyone has heard of it, and there are plenty of articles, tutorials, and libraries that simplify working with it (I especially recommend looking at www.html5canvastutorials.com and KineticJS). Pros: exactly what is needed for animation in the browser. Cons: not supported by all browsers.
