Patching Java code in production without anesthesia
This article describes the internals of one of the many tools that support service development at Odnoklassniki. Inside the company we call it “Hot Code Replace” (HCR); it is intended for fixing critical but simple bugs in production services without stopping them. This is an extremely important capability: it lets you avoid the rather cumbersome and time-consuming rollout of a new, corrected version of the failing service, the fairly long pause in availability on each host, and the resetting of caches.
Overall, it saves a lot of time and cuts the interval between detecting an error and fixing it from hours to minutes. Most often, as intended, it is used to correct minor errors in the code: for example, a programmer forgot a null check, and for some users certain actions on the site lead to an error. In other words, the fix changes a few lines inside one method. For such minor changes you no longer need to distract colleagues and wait hours for a production rollout.
You can easily fix it on:
Of course, you can make many more changes at once, add new classes, and quickly slip in edits that a manager asks for along the way, without waiting for the next release. But that is only if monsieur is a connoisseur of perversions.
Furthermore, "patches" can be stacked on top of one another indefinitely.
But this tool is not omnipotent: it is built on standard Java functionality, namely java.lang.instrument.Instrumentation and its method void redefineClasses(ClassDefinition... definitions).
Instrumentation.redefineClasses replaces previously loaded classes with new bytecode. Several classes that depend on one another can be redefined at the same time. Redefinition does not change existing instances of the classes, does not change inheritance, and does not touch instance or class fields. Only method bodies, the constant pool and attributes can change. New classes or subclasses can be added. Method signatures, instance fields and class fields cannot be changed. If you try to make incompatible changes, redefineClasses simply does not apply them and throws an exception. Keep in mind that when a class is redefined, code that is already executing is not interrupted; the new bytecode is used the next time the method is called. Therefore, if you try to fix a method that contains an endlessly long loop, the fix will not take effect until that method returns.
Put simply: you can change only the code inside methods, period.
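As an illustration of a change that fits these limits, here is a minimal sketch of the forgotten null check mentioned earlier. The UserCard class and its method are hypothetical and invented for this example; only the method body differs from the broken version, so fields, method signatures and inheritance stay the same.

```java
// Hypothetical class for illustration; not taken from the Odnoklassniki code base.
final class UserCard {
    private final String nickname; // existing field: a patch may not add, remove or rename fields

    UserCard(String nickname) {
        this.nickname = nickname;
    }

    // Broken body before the patch:
    //     return nickname.trim();   // NullPointerException for users without a nickname
    // Patched body: same name, same parameters, same return type; only the code inside changed.
    String displayName() {
        if (nickname == null) {
            return "";
        }
        return nickname.trim();
    }
}
```

Adding a new field or changing the return type of displayName() to fix the same bug would be an incompatible change and would be rejected.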
A method with a while loop is the typical example: until the method completes, the fix does not take effect.
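A minimal sketch of such a loop, using a hypothetical QueueWorker class. A frame of run() that is already on the stack keeps executing the old bytecode, while each call to handle() made from inside the loop is a new invocation and would pick up a redefined body. The 1,000-iteration cutoff exists only to keep the example finite.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical worker used only for illustration.
final class QueueWorker {
    private final AtomicBoolean running = new AtomicBoolean(true);
    private long handled;

    // If QueueWorker is redefined while this loop is spinning, the loop keeps
    // running the OLD bytecode of run() until the method returns and is called again.
    void run() {
        while (running.get()) {
            handle();
        }
    }

    // Every iteration is a new invocation, so a redefined handle() is used
    // from the next call onwards even though run() has not returned.
    void handle() {
        handled++;
        if (handled >= 1_000) {
            stop(); // keeps the example finite
        }
    }

    void stop() {
        running.set(false);
    }

    long handled() {
        return handled;
    }
}
```

One practical consequence: keeping long-lived loops thin and moving the logic into methods they call makes that logic patchable.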
The main difficulty was building a tool that works within the Odnoklassniki ecosystem and fits into all the established workflows: one that is stable, interacts transparently with all services on hundreds of hosts, and is flexible and easy to use. It has to cope with the dozens of experiments, maintenance jobs and updates that happen continuously in production.
What does installing a patch look like for a developer or admin who wants to fix a production bug through a standard, reliable procedure across dozens of servers? We will skip the process of finding and fixing the error in the code.
1. A separate branch is created in Git for the fix. Using version control matters not only for convenience but also for possible future investigations.
2. TeamCity launches the patch build. First, the project is built from the specified branch, and then the new build is compared with the one installed in production. For this I wrote a plugin for the build tool that extracts all files from the archives, compares them, and selects only the files that were changed or added. The Java compiler version must be the same for both builds, since a different compiler version produces different files and almost all project files would end up in the patch. It is very important to produce a small archive containing only the necessary files, because this significantly speeds up patch delivery to dozens of servers. The build process is suitable not only for patching project code; you can also replace a patched library in the project. When comparing the contents of two assemblies,
3. If the build succeeds, the patch is sent to a special repository, and the result window shows a key (a hash) that unambiguously identifies the patch and gives some guarantee that exactly this code goes to production.
And again: you can patch an unlimited number of times, and builds with the same version number will differ by hash.
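The archive comparison from step 2 can be sketched roughly as follows. This is not the actual build plugin: the PatchDiff class, its method names and the choice of SHA-256 content digests are assumptions made for illustration. It lists the entries of the patched build that are new or whose content differs from the production build.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.Enumeration;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper; the real plugin is not shown in the article.
final class PatchDiff {

    // Maps every file entry of the archive to a digest of its content.
    static Map<String, String> digests(Path archive)
            throws IOException, NoSuchAlgorithmException {
        Map<String, String> result = new TreeMap<>();
        try (ZipFile zip = new ZipFile(archive.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;
                }
                try (InputStream in = zip.getInputStream(entry)) {
                    byte[] hash = MessageDigest.getInstance("SHA-256").digest(in.readAllBytes());
                    result.put(entry.getName(), Base64.getEncoder().encodeToString(hash));
                }
            }
        }
        return result;
    }

    // Names of entries that were added or whose content changed in the patched build.
    static Set<String> changedOrAdded(Path productionBuild, Path patchedBuild)
            throws IOException, NoSuchAlgorithmException {
        Map<String, String> production = digests(productionBuild);
        Set<String> selected = new TreeSet<>();
        for (Map.Entry<String, String> entry : digests(patchedBuild).entrySet()) {
            if (!entry.getValue().equals(production.get(entry.getKey()))) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }
}
```

Because the comparison is byte-for-byte, any difference in compiler output shows up as a changed file, which is why both builds need the same compiler version.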
4. After that, all activity moves to the configuration service, where in the usual UI you specify which service, which hosts and which application versions need to be patched.
This abundance of parameters gives the flexibility that is essential in a large zoo of varied servers. For example, some servers run a different application version for which this patch is not needed. Or, for verification, Hot Code Replace is first run on one server or a group of servers, and only then rolled out to all instances of the application.
5. Through the configuration change, the selected services learn that a patch must be installed, along with its version and verification hash. The idea is that all services receive the “install patch” command and then act independently. Each one compares its own version and, only if the version matches and the patch hash is missing or different, downloads the patch build from the repository on its own. The download goes over HTTP, and the repository address, the number of download attempts and the wait between retries can be changed quickly.
6. Each application locally verifies the hash of the build and unpacks it. Each file is then checked against the array returned by Instrumentation.getAllLoadedClasses(): all new classes and files are written to a new temporary classpath, which is added via Instrumentation.appendToSystemClassLoaderSearch(), while the already loaded classes are read into memory and passed through redefineClasses.
7. The whole process (the arrival of the signal to patch the application, the download, verification, unpacking and applying of the patch) is logged in detail, both in the general log file and in HCR's own log, so that you can follow it quickly and easily.
8. After the patch is applied successfully, the process finishes by changing the application version to the patched one: a specially composed string that includes the patch hash is appended. If the version on some host has not changed to the expected one, we go to the Hot Code Replace log for that host and see what happened. If there were connection problems, you can safely repeat the patch command and that host will retry.
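The host-side decisions in steps 5 and 6 can be sketched as two small functions. The PatchPlan class, its method names and its parameters are hypothetical; the `loaded` array stands in for the result of Instrumentation.getAllLoadedClasses(). The sketch matches classes by name only and ignores the fact that the same name may be loaded by several class loaders.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper illustrating steps 5 and 6; not the actual HCR code.
final class PatchPlan {

    // Step 5: patch only when the application version matches the target version
    // and the requested patch hash is not the one already applied (null = none applied).
    static boolean shouldApply(String baseVersion, String appliedHash,
                               String targetVersion, String patchHash) {
        return baseVersion.equals(targetVersion) && !patchHash.equals(appliedHash);
    }

    // Step 6: split patched class names into already loaded ones (key true, to be
    // passed to redefineClasses) and new ones (key false, for the temporary classpath).
    static Map<Boolean, List<String>> splitByLoaded(Class<?>[] loaded,
                                                    List<String> patchedClassNames) {
        Set<String> loadedNames = new HashSet<>();
        for (Class<?> c : loaded) {
            loadedNames.add(c.getName());
        }
        return patchedClassNames.stream()
                .collect(Collectors.partitioningBy(loadedNames::contains));
    }
}
```

Since shouldApply depends only on the version and the hash, repeating the patch command is safe: hosts that already carry the patch do nothing.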
What problems can prevent an application from being patched? There are quite a few, and among them I would put the Instrumentation functionality in last place. So far, bad code that violates the strict conditions of redefineClasses has always been rejected by the JVM without any consequences for the running application. While redefineClasses runs, the JVM stops the application completely, but this takes a split second, so it is nothing to worry about.
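A minimal sketch of how a rejected patch can be handled without affecting the application. The Redefiner class and tryRedefine method are hypothetical names; the exception types are the ones declared or documented for Instrumentation.redefineClasses. An incompatible change is reported as UnsupportedOperationException, and malformed bytecode as a LinkageError subclass such as ClassFormatError.

```java
import java.lang.instrument.ClassDefinition;
import java.lang.instrument.Instrumentation;
import java.lang.instrument.UnmodifiableClassException;

// Hypothetical wrapper; the actual HCR code is not shown in the article.
final class Redefiner {

    // Returns true if the JVM accepted the new bytecode, false if it was rejected.
    // On rejection the class keeps its previous definition and the application keeps running.
    static boolean tryRedefine(Instrumentation inst, Class<?> target, byte[] newBytecode) {
        try {
            inst.redefineClasses(new ClassDefinition(target, newBytecode));
            return true;
        } catch (UnsupportedOperationException   // incompatible change or redefinition unsupported
                 | ClassNotFoundException        // declared by redefineClasses
                 | UnmodifiableClassException    // the class cannot be modified
                 | LinkageError e) {             // e.g. ClassFormatError for broken bytecode
            System.err.println("Patch for " + target.getName() + " rejected: " + e);
            return false;
        }
    }
}
```

A real implementation would write to the HCR log rather than System.err and would likely pass all class definitions of a patch in one call, so that interdependent classes are redefined together.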
The riskiest step is delivering the patch to the server, which is handled with additional retries. If the retries do not help, you can repeat the patch command and every host will go through the process again, but will install the patch only if necessary, i.e. if the patch has not been installed before or if the hash has changed.
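The retry behaviour can be sketched with a small generic helper. The Retry class and its parameters are hypothetical; maxAttempts and waitMillis correspond to the configurable number of download attempts and the wait between retries mentioned earlier, and the action would be the HTTP download of the patch build.

```java
import java.util.concurrent.Callable;

// Hypothetical helper; the real download code is not shown in the article.
final class Retry {

    // Runs the action up to maxAttempts times, sleeping waitMillis between failed
    // attempts, and rethrows the last failure if every attempt fails.
    static <T> T withRetries(Callable<T> action, int maxAttempts, long waitMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(waitMillis);
                }
            }
        }
        throw last;
    }
}
```

If all attempts fail, the exception reaches the caller, the host's version stays unchanged, and the patch command can simply be issued again.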
Another potential problem is a fix that corrects one error and introduces a new one. To minimize this risk, we first roll the patch out to a limited number of servers, look at the logs and graphs, and monitor the result. Only then do we roll the patch out to the remaining hosts.
What about a restart of the application or server? This is already built into the logic of all Odnoklassniki applications: the HCR module is one of the first things initialized in any application. If during initialization it finds that the application needs to be patched, the patch is applied first.
And now a little about what Hot Code Replace consists of.
- Our Java agent. A Java agent, in case anyone has forgotten, is a separate, specially built *.jar archive that the JVM picks up at application start via an additional parameter, for example: -javaagent:/path/to/lib/my-agent.jar. It is the extra capabilities of the Java agent that make the code replacement magic possible: the agent is where a java.lang.instrument.Instrumentation instance becomes available. However, I did not burden the agent with extra code, since updating the agent is a nontrivial task; it simply stores the Instrumentation instance in a static field of a utility class. Thus all manipulations can be initiated from anywhere in the application.
- Configuration service: it is responsible for the configuration of all our applications and is therefore initialized first in each application. This is where the main functionality of Hot Code Replace lives. When the application starts, or when the HCR configuration for a particular application changes, version compatibility is checked and all the manipulations described above are performed.
- TeamCity and build scripts: to conveniently create “patches” and store only the modified or added classes and resources in them.
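The agent described in the first item can be as small as the following sketch. The HcrAgent name is hypothetical, and for brevity the static field lives in the agent class itself rather than in a separate utility class. For redefineClasses to be supported, the agent jar's manifest needs Can-Redefine-Classes: true in addition to the Premain-Class attribute.

```java
import java.lang.instrument.Instrumentation;

// Hypothetical agent class. In a real agent jar it would be declared public
// and named in the Premain-Class attribute of the jar manifest.
final class HcrAgent {
    private static volatile Instrumentation instrumentation;

    private HcrAgent() {
    }

    // Called by the JVM before main() when the process is started with -javaagent:<jar>.
    // The agent does nothing except remember the Instrumentation instance.
    public static void premain(String agentArgs, Instrumentation inst) {
        instrumentation = inst;
    }

    // Gives the rest of the application access to the stored instance.
    static Instrumentation instrumentation() {
        Instrumentation inst = instrumentation;
        if (inst == null) {
            throw new IllegalStateException("JVM was started without the agent");
        }
        return inst;
    }
}
```

Keeping the agent this thin means the patching logic lives in ordinary application code and can evolve without changing the agent jar.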
What advantages does this tool give us? The first is the speed of fixing critical errors in production. From the logs I can see that colleagues are using HCR more and more often instead of waiting for releases. The next is speed of use: the application does not need to be stopped, the JVM freezes only for a split second, and all your objects stay in place and keep working.
And our developers lived freely and happily ever after, fixing their mistakes immediately and on their own, right in production, regardless of the number of servers and the workload.