Big data for big science. A lecture at Yandex

    The author of this talk has worked at the Large Hadron Collider (LHC) for 12 years, and last year he also began working at Yandex. In his lecture, Fedor talks about the general principles of the LHC, the goals of the research, the amount of data, and how this data is processed.


    Below is the transcript and the main slides.



    Why did I agree to give this talk with pleasure? Almost all my career I have been engaged in experimental high-energy physics. I have worked on more than five different detectors, starting from the 90s of the last century, and this is exceptionally interesting, exciting work: to do research at the very forefront of studying the universe.



    Experiments in high-energy physics are always large and quite expensive, and each next step in understanding is not easy to take. In particular, that is why behind each experiment there is, as a rule, a certain idea: what do we want to check? As you know, there is a relationship between theory and experiment, a cycle of learning about the world. We observe something, we try to organize our understanding into certain models. Good models have predictive power: we can predict something and then test it experimentally again. And gradually, moving along this spiral, at each turn we learn something new.

    The experiments I worked in: how did they make discoveries? Why, in fact, were the detectors built? For example, the CDF detector at the Tevatron? The Tevatron was built primarily to discover the top quark, and it was successfully discovered. By the way, as far as I remember, it was during the analysis of the top quark data that multivariate analysis was seriously used for the first time. In fact, even back then, so long ago, the work was built around machine learning.

    Another experiment where I was lucky to work is CMS at the LHC. The LHC physics program was aimed at three goals: the Higgs boson, supersymmetry, and quark-gluon plasma. The Higgs was goal number one, and it was successfully accomplished by 2012.

    When we collect new data, we walk around unknown territory and discover unexpected things. Among these was the discovery of the mixing of B mesons and anti-B mesons in the last century, which somewhat changed our understanding of the Standard Model. It was absolutely unexpected.

    In 2016, a pentaquark was observed in the LHCb experiment; states of this type had been reported before, but each time they dissolved with more data. This time, I have a feeling that the observed object will stand. A truly unique object: it is not a baryon, not a meson, but a new state of matter.



    But let's get back to the data. On this slide, I tried to show where high-energy physics data sit in the overall hierarchy. Data rules today's world; data is analyzed everywhere. I borrowed this picture from Wired, though the statistics are for 2012. What data volumes are accumulated by various services? Far ahead, at about 3,000 PB, is business correspondence: we all write a lot of emails. In second place is what we upload to Facebook; cat pictures, oddly enough, come second in volume, ahead of the serious tasks that, say, Google handles. In fourth place is the archive of all more or less known medical records, which are used to develop new techniques in medicine. And next come the LHC data, about 15 PB per year. At the same level are the data from YouTube. That is, the data of one, albeit large, scientific facility is comparable to that of a single major internet service.

    This means that, since we are not the largest, we can use the best practices of Google, Facebook, and other industrial services. Nevertheless, our data has quite specific properties that require serious adaptation of existing methods. It is interesting to note that from the 60s to the 80s high-energy physics was itself a driver of computing technology: large mainframes and data storage systems were built specifically for high-energy physics tasks.

    After that, nuclear tests were banned, and the modeling of nuclear explosions became the driver.



    Why are we doing all this? Vladimir Igorevich has already roughly covered it. In this picture I tried to sketch the boundary of the known. According to the prevailing theory, our universe arose as a result of the Big Bang, which occurred about 13 billion years ago. The universe arose from a clump of energy that began to expand; particles formed, they clustered into atoms, into molecules, into matter, and so on. This picture shows the timeline of what happened after the Big Bang.

    What happens at the stage when the particles known to us are formed: electrons, neutrons, protons? We more or less understand the physics of these processes. There is a certain boundary, which I have drawn here, that corresponds roughly to times of 10⁻¹⁰ s and to the energies now achievable at the LHC. This is practically the highest energy achievable experimentally. This is the boundary beyond which we cannot say anything definite. What is to the right of it, at lower energies, we more or less understand and have verified experimentally. What is to the left... there are different theories and ideas about it, different scientists picture it differently, and it is quite difficult to verify experimentally, directly.

    The bottom line is that today, when we do high-energy physics at the collider, we look into the very early history of the universe, down to about 10⁻¹⁰ s. This is very little: it is the time during which light travels 3 cm.



    Here are some examples of global physics questions to which we have no answers. There are a lot of such questions, and if you think about it, there are very dark spots. Dark matter and dark energy: the matter and energy that we observe is only about 5% of what we infer from various gravitational observations. It is roughly as if your apartment were 20 square meters, yet you received a heating bill for 1,000 square meters, and the bill is correct: the remaining square meters are being heated somewhere, you just don't understand where they are. In the same way, we don't understand what particles dark matter consists of.

    We are researching new properties of matter. In particular, at the colliders we directly produce new particles unknown to us. How do we do this? We literally knock them out of nothing. Everyone knows Einstein's formula E = mc², but it also works in the opposite direction: if we concentrate energy in one small region, then new matter may form, an electron-positron pair or something else. If we concentrate quite a lot of energy, rather complex particles can be born, which then instantly decay into ordinary particles known to us. Therefore, we cannot observe the new, unusual particles directly. Nevertheless, we can register the products of their decay and extract from them information about what was born.
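    For a rough feel of this arithmetic, here is a small Python sketch that converts a rest energy into a rest mass; the 125 GeV figure for the Higgs and 0.511 MeV for the electron are the commonly quoted mass values, and everything else is just unit conversion.

```python
# Rough sketch of E = m*c^2 "in reverse": how much rest mass corresponds
# to a given amount of concentrated energy.

C = 299_792_458.0           # speed of light, m/s
GEV_TO_J = 1.602176634e-10  # 1 GeV expressed in joules

def rest_mass_kg(energy_gev: float) -> float:
    """Rest mass (kg) of a particle whose rest energy is energy_gev (GeV)."""
    return energy_gev * GEV_TO_J / C**2

print(f"Higgs boson (~125 GeV):      {rest_mass_kg(125):.2e} kg")
print(f"Electron (~0.000511 GeV):    {rest_mass_kg(0.000511):.2e} kg")
```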

    This is the paradigm that we follow in experimental high-energy physics. For this, on the one hand, a high-energy particle accelerator is needed in order to concentrate energy at a certain point and create these particles. On the other hand, we must surround this point with a detector in order to record, to photograph, the decay products of the particles.



    Photography is a very good association, since nowadays most of the information in the detectors comes from silicon detectors, which are essentially the same sensors as the matrices in digital cameras. At CERN, at the LHC, we have four large digital cameras. Four experiments: two general-purpose ones, ATLAS and CMS, and two specialized ones, LHCb and ALICE.

    What kind of cameras are these? The height of the ATLAS detector is 25 meters. I tried to show it to scale. Its mass is about 7,000 tons.

    You can roughly imagine the size of an ordinary camera sensor. Here there are approximately 200 square meters of such sensors.

    In fact, there are many layers of these sensors, registering the particles that pass through them.

    The coolest thing about these experiments is their speed. They shoot at 40 million frames per second: 40 million times per second, protons in the collider collide with protons. In each of these collisions, the birth of something new is possible. It is rare, but in order not to miss it, we must read out every collision.

    We can't record that much; in reality we record about 200-500 frames per second, but that is not bad either.

    Moreover, the detectors keep shooting at this rate for many years, with a break for the winter holidays.



    On this slide, I reproduce what Vladimir Igorevich explained so well. The picture on the right is what he showed. How often do various physical processes occur as a result of a collision? When protons collide, all that arises is a clump of energy, and from this energy anything that is energetically allowed can be born. Usually something ordinary is born: pi mesons, protons, electrons.

    I have been asked: why do you produce things you don't need? Why not arrange the collision so that the Higgs boson is born, so that you don't have to read out 40 million events per second? I would say this is impossible from first principles, precisely because all we have is a collision, and its only characteristic is energy. The clump of energy knows nothing about the Higgs or anything else. Whatever is born is born randomly. And naturally, what is born often, we have already observed. We are interested in what is born rarely.

    How rare? If collisions are read out at 40 million per second, then interesting processes, such as the production of the Higgs boson, occur roughly 1 to 10 times per hour. Thus, we have to see these rare events against a background, noise from the data science point of view, whose rate is about 10 to 11 orders of magnitude higher.
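    A quick back-of-the-envelope check of these numbers in Python, taking the 40 MHz collision rate and the 1 to 10 signal events per hour quoted above at face value:

```python
# How many orders of magnitude of background do 1-10 signal events per hour
# correspond to, if collisions happen at 40 MHz?

import math

collision_rate_hz = 40e6                      # proton bunch crossings per second
collisions_per_hour = collision_rate_hz * 3600

for signal_per_hour in (1, 10):
    ratio = collisions_per_hour / signal_per_hour
    print(f"{signal_per_hour:2d} signal event(s)/hour -> 1 in {ratio:.1e} collisions "
          f"(~{math.log10(ratio):.0f} orders of magnitude)")
```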

    As an analogy: what we are trying to do is to spot a snowflake of some unprecedented shape, in the light of a flash, during a snowfall, against the background of a large snow-covered field. And honestly, I think the snowflake task I just described is much simpler.

    Where do we get the data from? The primary source of data is the detector that surrounds the proton collision point. It records what is born from the clump of energy produced when the protons collide.

    The detector consists of subdetectors. You know how, when sporting events are filmed, many different cameras are used, shooting from different points, so that the event is visible in all its diversity.

    Detectors are arranged similarly. They consist of several subdetectors, and each is configured to detect something specific.



    A segment of the CMS detector is shown here. Its subdetectors are: a silicon tracker, the very silicon sensors I talked about; then an electromagnetic calorimeter, a fairly dense medium in which particles deposit energy. Electromagnetic particles, the electron and the photon, deposit almost all of their energy there, and we are able to measure it. Next is the hadron calorimeter, which is denser than the electromagnetic one; hadrons, neutral and charged, get stuck in it.

    Then there are several layers of the muon system, which is designed to detect muons, particles that interact with matter very little. They are charged and simply leave ionization behind. In this picture, I show how a muon flies through the detector. Having been born in the center, it crosses the detector, leaves practically nothing in the calorimeters, is visible in the tracker as a sequence of points, and is observed in the muon system as a sequence of clusters.

    An electron is a charged particle, so when passing through the tracker it also leaves a track, but as I said, it gets stuck in the electromagnetic calorimeter and deposits almost all of its energy there.

    So what is an electron? It is a sequence of hits in the tracker that lie on one curve, which in turn corresponds to a particle of a certain energy. It also comes with a deposit in the electromagnetic calorimeter whose energy is consistent with the energy measured from the tracker. This is literally how the electron is identified: if we see such a correspondence between the signal in the tracker and in the electromagnetic calorimeter, then it is most likely an electron.

    A charged hadron flies through the electromagnetic calorimeter with almost no interaction and deposits almost all of its energy in the hadron calorimeter. The idea is the same: we have a track, and there is a corresponding cluster of energy in a calorimeter. We match them and can say: if the cluster is in the electromagnetic calorimeter, it is most likely an electron; if the cluster is in the hadron calorimeter, it is most likely a hadron.

    Then suppose a photon is produced. It is a neutral particle; it does not bend in a magnetic field; it flies in a straight line. It passes right through the tracker without leaving hits there, but leaves a cluster in the electromagnetic calorimeter. As a result, we see an isolated cluster. Our conclusion: if there is a cluster that a track points to, it is an electron; if there is a cluster that no track points to, it is a photon. This is how the detectors work.

    On this slide, I have summarized how particle identification occurs. Indeed, different types of particles have different signatures.
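    To make the matching logic above concrete, here is a toy sketch of such identification rules. The data structure, thresholds, and example numbers are invented purely for illustration and have nothing to do with the real CMS reconstruction code.

```python
# Toy particle identification: match a track with calorimeter clusters and
# muon-system hits, following the signatures described in the text.

from dataclasses import dataclass

@dataclass
class Candidate:
    has_track: bool      # hits in the silicon tracker lying on one curve
    ecal_energy: float   # energy in the electromagnetic calorimeter, GeV
    hcal_energy: float   # energy in the hadron calorimeter, GeV
    muon_hits: int       # clusters in the muon system

def identify(c: Candidate) -> str:
    if c.muon_hits > 0 and c.has_track:
        return "muon"             # track plus signal in the muon chambers
    if c.has_track and c.ecal_energy > c.hcal_energy:
        return "electron"         # track pointing at an ECAL cluster
    if c.has_track and c.hcal_energy > 0:
        return "charged hadron"   # track pointing at an HCAL cluster
    if not c.has_track and c.ecal_energy > 0:
        return "photon"           # ECAL cluster with no track pointing at it
    return "unknown"

print(identify(Candidate(True, 24.8, 0.6, 0)))   # -> electron
print(identify(Candidate(False, 12.1, 0.0, 0)))  # -> photon
print(identify(Candidate(True, 0.9, 18.3, 0)))   # -> charged hadron
print(identify(Candidate(True, 0.2, 0.1, 4)))    # -> muon
```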



    Nevertheless, from the detector itself we get only a raw readout. In the silicon detector, for example, we simply read signals from pixels, and so on. We have about a million illuminated cells in the detector, which are read out at 40 MHz. This is about 40 terabytes of data per second.

    At the moment we cannot process such a stream, although technically we are getting close to being able to. Therefore, the data is held in local buffers on the detector. During this time, using a small part of the information, we must decide: is this event completely uninteresting, or is it potentially interesting and should be kept for further work? We have to make this decision and suppress the overall rate by three orders of magnitude. So, as the red arrow on the right shows, we drop from the overall rate to somewhere below.
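    Where does a figure of about 40 terabytes per second come from? A rough estimate, under the simplifying assumption that each of roughly a million cells contributes about one byte per readout:

```python
# Rough data-rate estimate for the raw detector readout.

cells = 1_000_000        # illuminated detector cells (figure from the text)
readout_rate_hz = 40e6   # bunch-crossing rate
bytes_per_cell = 1       # assumed payload per cell, for the estimate only

raw_rate = cells * readout_rate_hz * bytes_per_cell   # bytes per second
print(f"Raw detector output: ~{raw_rate / 1e12:.0f} TB/s")

# The first selection step must cut this by about three orders of magnitude
# before anything can leave the on-detector buffers.
print(f"After a x1000 reduction: ~{raw_rate / 1000 / 1e9:.0f} GB/s")
```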

    This is a typical classification task. We are now actively using machine learning methods here, and Yandex is heavily involved in this; Tanya will tell you more about it later.

    Having reduced the rate by three orders of magnitude, we can exhale: we have a little more time, we can process the events better, we can reconstruct some local objects. If these objects, reconstructed in individual subdetectors, are tied together, we get global pattern recognition. In this way we suppress three more orders of magnitude, again separating possibly interesting events from completely uninteresting ones.

    What we have selected is recorded and stored in a data center. We save one event out of a million; the other events are lost forever. So it is very important that this selection has good efficiency: if we have lost something, it cannot be recovered.

    Then our events are recorded on tape, and we can work with them calmly, doing reconstruction and analysis. This path from the raw detector signals to the final physics result is nothing other than a very aggressive multi-step reduction of the dimensionality of the problem. Taking about 10 million readings from the detector, we first cluster them into neighboring hits, then reconstruct them into objects such as particle tracks, then select the objects that are decay products, and ultimately obtain the result.
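    The selection chain can be written out as plain arithmetic. The two factor-of-1000 steps below are the ones quoted above; the stage names are generic descriptions, and the exact factors in the real trigger vary from run to run.

```python
# Multi-step rate reduction: from the raw collision rate down to what is kept.

rate_hz = 40e6   # collisions per second entering the selection
stages = [
    ("first, hardware-level selection (x1000)", 1000),
    ("second, software-level selection (x1000)", 1000),
]

for name, factor in stages:
    rate_hz /= factor
    print(f"after {name}: {rate_hz:,.0f} events/s")

# Net effect: roughly one event in a million survives and is written to storage.
```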



    For this we need powerful computing resources. High-energy physics uses distributed computing. On this slide, to give an idea of the scale: the CMS collaboration uses about 120 thousand cores and about 200 petabytes of disk capacity.

    The system is distributed, so data transfer is an important component. Technically, for this we use dedicated lines, practically our own communication channels, which are used specifically for high-energy physics.



    On this slide, I wanted to compare our data transfer tasks with industrial ones, for example with Netflix, the largest streaming media provider in the United States. It is interesting because it transfers about the same amount of data per year as the LHC. But the LHC task is much more complicated: we transfer more data to far fewer consumers, so we cannot simply replicate the data, which is the approach by which Netflix solves its problem.



    Here are the resources we use. I have already talked about the computing. It is important that all these resources are used by researchers, and experiments in high-energy physics are always huge collaborations. So our resources must bring together about 10,000 people who, hopefully, work in sync. This requires appropriate technology, and the outcome of applying these resources, many years of work by tens of thousands of people, is a scientific result.



    An example is the CMS article that is usually described as the discovery of the Higgs boson. Vladimir explained about statistics, errors, and so on. You will never see a scientific article that says "discovery of something," precisely because claiming a discovery means putting a final full stop. No, we are an experimental science, we observe something. The conclusion of this article is that we observe such-and-such a boson with such-and-such mass within such-and-such limits, with such-and-such probability, and that within such-and-such parameters its properties coincide with those of the Higgs boson.



    Here is what a scientific article looks like. The same one, 36 pages. Half of them are taken up by the author list: there are 136 references to scientific papers and three thousand authors. And there are even more people who genuinely contributed to obtaining the result. Yours truly is among them.



    On this slide, I wanted to explain why we look for peaks in distributions, what the relationship is between peaks and particles. I will not dwell on this; the reason is physical. Physics is arranged in such a way that if there is a particle that decays into other particles, then the probability distribution of the decay products over kinematic parameters really does have a pole. In fact, the pole is slightly shifted, so it looks like a peak located near a parameter value equal to the rest mass of the decaying particle itself.
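    A minimal toy illustration of why this shows up as a peak: generate two-body decays of a hypothetical particle with rest mass 125 in arbitrary units, smear the measured momenta to mimic detector resolution, and histogram the invariant mass of the decay products. All numbers here are made up for the illustration; the peak sits at the parent's rest mass.

```python
# Toy invariant-mass peak: X -> a + b, with the daughters treated as massless.
# Reconstructed invariant mass: m^2 = (E_a + E_b)^2 - |p_a + p_b|^2 (c = 1).

import numpy as np

rng = np.random.default_rng(0)
M = 125.0          # true rest mass of the decaying particle, arbitrary units
n_events = 100_000

# In the rest frame of X, each daughter carries E = M/2 in a random direction.
cos_theta = rng.uniform(-1, 1, n_events)
phi = rng.uniform(0, 2 * np.pi, n_events)
sin_theta = np.sqrt(1 - cos_theta**2)
p = M / 2 * np.stack([sin_theta * np.cos(phi),
                      sin_theta * np.sin(phi),
                      cos_theta], axis=1)

# "Measure" the two daughters with 2% momentum resolution.
p_a = p * rng.normal(1.0, 0.02, (n_events, 1))
p_b = -p * rng.normal(1.0, 0.02, (n_events, 1))
E_a = np.linalg.norm(p_a, axis=1)
E_b = np.linalg.norm(p_b, axis=1)

m_inv = np.sqrt((E_a + E_b)**2 - np.linalg.norm(p_a + p_b, axis=1)**2)

# Crude text histogram: the peak appears at the parent's rest mass.
hist, edges = np.histogram(m_inv, bins=20, range=(115, 135))
for lo, hi, n in zip(edges[:-1], edges[1:], hist):
    print(f"{lo:6.1f}-{hi:6.1f}  {'#' * (n // 1000)}")
```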



    On this slide, I happily report that we found the Higgs boson back in 2012. We found this very snowflake of a very special shape.



    Here is a confirmation of what Vladimir said about interpretation. In 2012 we discovered the Higgs boson, see the picture on the left. How can we confirm that we really made a correct measurement? On the right is the same distribution in the data of the last two years, and we see the same signal there. What is the likelihood of seeing the same thing by chance in two completely independent datasets, where the new data were even collected at a different energy? Pretty small. But most importantly, the earlier result had predictive power: we predicted that in the new data the signal would be in this place, and it really is there. That's it, we can tick the box.



    At the same time, last year a very similar signal, the same kind of excess as on the right graph, was found in the region of 750 GeV, and theorists, of course, immediately confirmed that yes, there should be a particle with exactly that mass. Statistically, the right and left pictures are not very different.

    However, after new data was collected and we increased the statistics by about a factor of five, the result dissolved. This is a typical example of how similar-looking initial results can ultimately either turn into a discovery or dissolve. There are a lot of pitfalls.



    And the conclusion. The tasks being solved are unique, complex, and interesting. At the moment they are solved mainly by physicists, and professional computing expertise is in great demand in this area. Physicists understand this. At Yandex we are working on exactly this: we are trying to promote and apply computer science methods to scientific research in high-energy physics, to the experiments at the LHC. We have a special group for this.

    I am leaving my contact details. If you are interested and want to know more, I will be glad to talk, you are welcome. And thank you.
