Monitoring servers with a Java job queue

    Recently I had to solve the problem of monitoring several dozen servers (surely hardly anyone has ever run into such a problem). The requirements can be described by several rules:

    1. Each server must be pinged periodically
    2. Occasionally an action requested by the user (for example, executing a command via ssh) must be performed on a server
    3. Actions on servers come in several types, and each type has its own priority
    4. Tasks (from items 1-3) for the same server must not run simultaneously
    5. Tasks can fail, for example because the connection to the server is lost; in that case you need to wait until the connection is restored and retry the scheduled task

    The first solution that most people think of is to start a dedicated thread for each server and do all of that server's work there. This is not bad, but what if the set of servers changes while monitoring is running? Starting and stopping threads on the fly is somewhat inelegant. And what if there are a thousand servers? You can probably run a thousand threads, but why do that when each thread is idle most of the time, waiting for its next ping?

    You can look at this problem from the other side and treat it as the classic producer-consumer problem. We have producers that create tasks (ping, ssh command) and consumers that execute them. Of course, there can be more than one producer and more than one consumer. Solving our producer-consumer problem in Java is not just easy but very simple with the PriorityQueue and ExecutorService classes.
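    As a rough illustration of the producer-consumer idea, here is a minimal sketch and not the code of this project. It swaps in PriorityBlockingQueue, the thread-safe relative of PriorityQueue, so that several consumer threads can share one queue. The names PrioritizedTask, ProducerConsumerSketch and drain are made up for this example, and "a lower number means a higher priority" is an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical task type: a lower priority value means "run sooner".
final class PrioritizedTask implements Comparable<PrioritizedTask> {
    final int priority;
    final String description;

    PrioritizedTask(int priority, String description) {
        this.priority = priority;
        this.description = description;
    }

    @Override
    public int compareTo(PrioritizedTask other) {
        return Integer.compare(priority, other.priority);
    }
}

final class ProducerConsumerSketch {
    // Starts `consumers` workers that empty the queue and returns the
    // task descriptions in the order in which they were processed.
    static List<String> drain(PriorityBlockingQueue<PrioritizedTask> queue, int consumers) {
        List<String> done = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(consumers);
        for (int i = 0; i < consumers; i++) {
            pool.submit(() -> {
                PrioritizedTask task;
                // poll() does not block; it returns null once the queue is empty
                while ((task = queue.poll()) != null) {
                    done.add(task.description); // a real ping or ssh call would go here
                }
            });
        }
        pool.shutdown(); // accept no new work; the submitted loops still run to the end
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done;
    }

    public static void main(String[] args) {
        // The "producer" side: tasks are simply added to the shared queue.
        PriorityBlockingQueue<PrioritizedTask> queue = new PriorityBlockingQueue<>();
        queue.add(new PrioritizedTask(5, "ping server 1"));
        queue.add(new PrioritizedTask(1, "ssh command on server 2"));
        queue.add(new PrioritizedTask(5, "ping server 3"));
        System.out.println(drain(queue, 1));
    }
}
```

    With a single consumer the ssh command is processed first because it has the smallest priority value; the relative order of the two pings is not guaranteed, since the queue does not order elements of equal priority. Note that this sketch does nothing about rule 4 (one task per server at a time); that takes a more specialized queue.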

    Let's start, as usual, with the unit test:

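        public void testOffer() {
            PollServerQueue xq = new PollServerQueue();
            // MyTask(serverId, taskId): two tasks for server 1, one for server 2
            xq.addTask(new MyTask(1, 11));
            xq.addTask(new MyTask(2, 12));
            xq.addTask(new MyTask(1, 13));
            MyTask t1 = (MyTask) xq.poll();
            assertEquals(1, t1.getServerId());
            assertEquals(11, t1.getTaskId());
            MyTask t2 = (MyTask) xq.poll();
            assertEquals(2, t2.getServerId());
            assertEquals(12, t2.getTaskId());
            // Both servers are busy now, so nothing can be handed out
            MyTask t3 = (MyTask) xq.poll();
            assertEquals(null, t3);
            // NOTE: the task for server 1 has to be marked as completed at this point;
            // the call that does so is not shown in the original snippet
            MyTask t5 = (MyTask) xq.poll();
            assertEquals(1, t5.getServerId());
            assertEquals(13, t5.getTaskId());
        }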

    In this unit test we add three tasks of type MyTask to our queue (the first constructor argument is the serverId, the second is the taskId). The poll method retrieves a task from the queue. If no task can be retrieved (for example, the queue is empty, or the only queued tasks are for servers that already have a task running), poll returns null. The test shows that once the task for serverId = 1 completes, the next task for that server can be taken from the queue.

    Hurrah! The unit test is written, so now we can write the code. We will need:
    1. A data structure (HashMap) for storing the task currently running on each server (currentTasks)
    2. A data structure (HashMap) for storing the tasks waiting for execution. Each server has its own queue (waitingTasks)
    3. A data structure (PriorityQueue) for polling the servers in turn. Each poll() call should give us a task for a different server. In short, the structure works like a revolver, except that the bullets stay in the cylinder after each shot (peekOrder)
    4. A structure (HashSet) for storing and quickly looking up the server identifiers that are in the revolver, so that you do not have to scan the revolver from the first element to the last every time (servers)
    5. A simple synchronization object (syncObject)
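    To make the roles of these five structures more concrete, here is a minimal sketch of how they might fit together. It is not the author's implementation: the class names SketchTask and ServerTaskQueueSketch and the method taskFinished are invented for this example, and an ArrayDeque that is rotated on every poll stands in for the PriorityQueue mentioned above.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical task type for this sketch only.
final class SketchTask {
    final int serverId;
    final int taskId;

    SketchTask(int serverId, int taskId) {
        this.serverId = serverId;
        this.taskId = taskId;
    }
}

final class ServerTaskQueueSketch {
    private final Map<Integer, SketchTask> currentTasks = new HashMap<>();        // running task per server
    private final Map<Integer, Queue<SketchTask>> waitingTasks = new HashMap<>(); // backlog per server
    private final ArrayDeque<Integer> peekOrder = new ArrayDeque<>();             // the "revolver"
    private final Set<Integer> servers = new HashSet<>();                         // ids already in the revolver
    private final Object syncObject = new Object();

    void addTask(SketchTask task) {
        synchronized (syncObject) {
            waitingTasks.computeIfAbsent(task.serverId, id -> new ArrayDeque<>()).add(task);
            // Set.add returns true only for a server we have not seen yet
            if (servers.add(task.serverId)) {
                peekOrder.addLast(task.serverId);
            }
        }
    }

    // Returns the next runnable task, or null if every server is busy or has nothing queued.
    SketchTask poll() {
        synchronized (syncObject) {
            for (int i = 0, n = peekOrder.size(); i < n; i++) {
                int serverId = peekOrder.pollFirst();
                peekOrder.addLast(serverId); // rotate: the bullet stays in the cylinder
                Queue<SketchTask> waiting = waitingTasks.get(serverId);
                if (currentTasks.containsKey(serverId) || waiting == null || waiting.isEmpty()) {
                    continue; // server is busy or has no backlog, try the next one
                }
                SketchTask task = waiting.poll();
                currentTasks.put(serverId, task);
                return task;
            }
            return null;
        }
    }

    // Marks the server as free again so that its next task can be handed out.
    void taskFinished(int serverId) {
        synchronized (syncObject) {
            currentTasks.remove(serverId);
        }
    }
}
```

    With tasks (1, 11), (2, 12) and (1, 13) added in that order, this sketch hands out (1, 11), then (2, 12), then null, and returns (1, 13) only after taskFinished(1) has been called. It never removes servers from the revolver and ignores task priorities; both are left out to keep the example short.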

    With these structures, the procedure for extracting a task from the queue becomes simple and short. And although the code turned out to be compact, I don't see the point of publishing it here; instead I will point you to the code on github.

    Disclaimer: the code on github is not complete. In particular, it is missing the ability to set priorities for tasks inside each server's queue, the mechanism for handling errors and returning failed tasks to the queue, and the ping code itself. As they say, the less code, the better you sleep. :)
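    For readers who want to experiment with the missing pieces, here is one possible shape for a per-server backlog with priorities and re-queuing of failed tasks. It is only a sketch under assumptions: the names RetryableTask and PerServerBacklogSketch are invented, a smaller priority value is assumed to mean "run sooner", and waiting for the connection to come back is not shown at all.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical task type: priority first, then insertion order.
final class RetryableTask {
    final int priority;   // assumption: smaller value runs first
    final long sequence;  // insertion counter, keeps FIFO order within one priority
    final String name;

    RetryableTask(int priority, long sequence, String name) {
        this.priority = priority;
        this.sequence = sequence;
        this.name = name;
    }
}

final class PerServerBacklogSketch {
    private static final Comparator<RetryableTask> ORDER =
            Comparator.comparingInt((RetryableTask t) -> t.priority)
                      .thenComparingLong(t -> t.sequence);

    private final PriorityQueue<RetryableTask> waiting = new PriorityQueue<>(ORDER);
    private long nextSequence = 0;

    void add(int priority, String name) {
        waiting.add(new RetryableTask(priority, nextSequence++, name));
    }

    // Returns the most urgent waiting task, or null if the backlog is empty.
    RetryableTask next() {
        return waiting.poll();
    }

    // A failed task keeps its original sequence number, so it goes back
    // ahead of tasks of the same priority that were added after it.
    void requeue(RetryableTask failed) {
        waiting.add(failed);
    }
}
```

    The sequence counter is there because PriorityQueue alone gives no ordering guarantee for elements of equal priority. The class is not thread-safe on its own and would have to be used under the same lock as the rest of the queue.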
