Ignatius Kolesnichenko: “You won’t ask for money from bacteria”

    Introducing the second issue of our podcast about technology, processes, infrastructure, and people in IT companies. Today CTOcast is visiting Ignatius Kolesnichenko, technical director of iBinom (human genome analysis).


    1st part of the text version of the podcast

    Text version of the podcast (Part 2)


    Pavel Pavlov: There is a lot one could ask about algorithms, but let's start with the most obvious question. Presumably you began with existing algorithms, so it was a matter of adapting them. Was there room to create something of your own? How exactly did you build your service and your approach?

    Ignatius Kolesnichenko: I would say that for roughly half of it we use existing algorithms as is. But you need to understand that even existing algorithms have to be customized. For example, different sequencers produce data in slightly different formats and read lengths, and the algorithms work on such data with varying quality. We had to figure out which algorithms to apply to which data and what the results would be.

    As for the alignment algorithm itself: of course, at some point we had an idea and even tried to write our own, but the task turned out to be difficult and not really feasible for a startup. If you look at any well-known alignment algorithm, it is 30,000 lines of optimized C++ that has been developed for, say, the last 4 years by two or three people. Obviously, we cannot replicate that overnight.

    Pavel Pavlov: What was the main criterion for choosing the algorithms? Performance? Degree of reliability?

    Ignatius Kolesnichenko: The criterion was a compromise: we had to be satisfied with the results and really fit into the declared hour. Most algorithms fail precisely on that parameter - they are very slow. They may be good in other respects, but not in speed. By and large they are simply not very suitable for analyzing the human genome; all these algorithms are used for more than the human genome. Scientists likewise analyze bacteria and plants. A bacterial genome is, of course, not 3 billion base pairs but, say, 10 million, which makes the task much simpler.

    Alexander Astapenko: But bacteria are not a very solvent client, as I understand it, right?

    Ignatius Kolesnichenko: Naturally. You can’t ask the bacteria for money.

    Pavel Pavlov: Sorry!

    Ignatius Kolesnichenko: About our algorithms... Naturally, the first stage is the hardest, and there we use one of the available algorithms. Next comes the variant calling stage: you need to determine what mutation has occurred. Here we use both our own algorithm and an existing one; we haven't made a final choice yet. And then there is another interesting stage: say a mutation has occurred, and we need to estimate the probability that it will break the protein whose coding it is involved in. What did we do? There are already many different algorithms that say: "This mutation breaks such-and-such a protein with such-and-such probability." We took a set of these algorithms, used their results as features, and built our own machine learning on top of those features. So here our own algorithm is at work.
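    The stacking approach described here can be sketched in a few lines. Everything below is illustrative: the predictor names, scores, and weights are invented, and the actual iBinom model is not specified in the conversation - only the idea that existing predictors' outputs become features for a learned combiner.

```python
import math

# Hypothetical sketch: scores from existing mutation-effect predictors
# (names invented) become features, and a learned logistic model combines
# them into one probability that the mutation breaks the protein.
# The weights below are illustrative, not trained values.

def combined_damage_score(predictor_scores, weights, bias):
    """Logistic combination of per-predictor damage probabilities."""
    z = bias + sum(w * predictor_scores[name] for name, w in weights.items())
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid -> probability in (0, 1)

# Each external tool reports its P(mutation breaks the protein) for a variant.
scores = {"tool_a": 0.92, "tool_b": 0.85, "tool_c": 0.40}
weights = {"tool_a": 2.0, "tool_b": 1.5, "tool_c": 0.5}

print(round(combined_damage_score(scores, weights, bias=-2.0), 3))  # 0.788
```

    In a real pipeline the weights would be fit on variants with known effects; the point is only that the existing tools become features rather than final answers.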

    Pavel Pavlov: And what was harder? Finding implementations of these algorithms or, say, building the platform that carries out all the computation and analysis - the infrastructure part?

    Ignatius Kolesnichenko: These are somewhat different tasks, but it seems to me the infrastructure part was a bit harder. With the analysis of algorithms, the difficulty is doing it correctly, and it also takes a lot of computing resources. But overall that task is simpler. Whereas building an infrastructure that runs everything on Amazon, that lets users upload data and resume interrupted uploads, and that ties it all together into one service working like clockwork - that is not so simple.

    Pavel Pavlov: So both sets of problems were solved, first of all, by you, as the chief technical specialist in the company?

    Ignatius Kolesnichenko: Yes, I solved both kinds of problems. But actually, when we were just starting out, I said: "Listen, guys, I can give you 20 hours a week. But obviously we won't get far on 20 hours a week." So the first thing we did was look for a CTO who would be in charge of everything. I helped him, naturally, but more in terms of thinking through the architecture; the implementation was done without me. I myself was busy with experiments and research - choosing which algorithm we would use and how. And the whole part concerning launching things and storage on Amazon was also my task.

    Pavel Pavlov: It probably makes sense to walk through the entire workflow in order. To begin with: users upload their 2--20 GB to your S3 storage. How does that process go? What services do you use?

    Ignatius Kolesnichenko: There are no special secrets. The user uploads the data; it is proxied and lands in S3. On the back-end, of course, everything was done carefully: uploading is streamlined, and an upload can be resumed if the connection suddenly drops. After the upload, a MapReduce job is launched to process the data: it aligns all the reads and analyzes them, and at the output we have a list of mutations.
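    The resumable-upload behavior mentioned above can be illustrated with a toy sketch. The actual mechanism is not specified in the conversation (for S3, multipart upload would be the natural fit); this shows only the bookkeeping: completed chunks are recorded, so a restarted upload skips what is already stored.

```python
# Toy sketch of resumable chunked upload bookkeeping (illustrative only;
# the real service proxies uploads into S3).
CHUNK_SIZE = 4  # bytes; tiny so the example is easy to follow

def upload(data, store, completed, fail_after=None):
    """Upload data chunk by chunk into store (a dict), skipping chunks
    already recorded in completed; optionally fail after N new chunks
    to simulate a dropped connection."""
    sent = 0
    for i in range(0, len(data), CHUNK_SIZE):
        idx = i // CHUNK_SIZE
        if idx in completed:
            continue  # already stored before the connection broke
        if fail_after is not None and sent >= fail_after:
            raise ConnectionError("link dropped")
        store[idx] = data[i:i + CHUNK_SIZE]
        completed.add(idx)
        sent += 1

def assemble(store):
    return b"".join(store[i] for i in sorted(store))

data = b"ACGTACGTACGTACGT"   # stands in for a multi-gigabyte FASTQ file
store, completed = {}, set()
try:
    upload(data, store, completed, fail_after=2)  # connection drops mid-way
except ConnectionError:
    pass
upload(data, store, completed)                    # resume: only missing chunks
assert assemble(store) == data
```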

    Since there are not that many mutations (about 50 thousand for a human exome), we can analyze them locally, which is what we do. We have set up various databases against which we look up the mutations and figure out where they are located and whether they code for proteins or not. From these databases we also pull links to articles and publications about the mutations. Then we build a PDF report in which we list the 50 most significant mutations. In the same report we collect a lot of additional information about how all the algorithms performed - for example, how many mutations of each kind there are. We show it to the user so he can see what happened to his data and how we processed it.

    There is also a personal account: the user can register there, and his data is, in effect, stored with us. He can also change the settings and choose what to show in the report and what not.

    Pavel Pavlov: And in most cases is the user - that is, the medical specialist - able to handle these settings and get the result he needs?

    Ignatius Kolesnichenko: A tricky question. It seems nobody really tweaks the settings; people mostly just receive the report and work with that. The default settings are suitable in some cases and not in others. We already understand that simply outputting mutations is not enough. So we are finishing an interactive system: here are the mutations we have found, and here are symptoms of diseases. The doctor will indicate the symptoms and the diseases he is interested in, and we will show how they relate to the mutations.

    Pavel Pavlov: And this expectation is based on some kind of feedback?

    Ignatius Kolesnichenko: Yes. To be honest, every second user said that he gets our mutations and then still has to do the work of drawing a conclusion himself. We realized we needed to aggregate all this information and help the user solve his problem in one place, as quickly and easily as possible.

    Pavel Pavlov: Let me come back to Amazon. For tasks involving MapReduce and Hadoop there are separate specialized startups and cloud services that solve the problem more efficiently and, in some respects, even cheaper. By and large, if you only need MapReduce and S3 storage, then you are obviously not using the service at full capacity.

    Ignatius Kolesnichenko: No, of course not at full capacity. As far as I know, the companies that specialize in deploying Hadoop and standing up that stack are better suited to, for example, banks that already have their own 10 machines, 10 servers. They just need it deployed and configured, and help building processes around it. That's not our case: we cannot keep 10 machines running all the time, because right now that would be very expensive for us.

    Our flow is a few analyses per day, so it's easier for us to spin up a cluster of 5-10 machines several times a day, run everything on it within an hour, and tear it down. At the moment that is cheaper. Probably at some point, if we grow, it will become more profitable to keep a cluster of our own - those same 10 machines - and then it will make sense to turn to the specialized services.
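    A back-of-the-envelope calculation shows why ephemeral clusters win at this load. All numbers here are hypothetical placeholders, not actual Amazon prices or iBinom figures:

```python
# Illustrative cost comparison: always-on cluster vs. clusters spun up per
# analysis and torn down afterwards. Prices and sizes are assumed.
HOURLY_PRICE = 0.10   # $ per machine-hour (hypothetical)
MACHINES = 10
JOBS_PER_DAY = 3      # "a few analyses per day"
HOURS_PER_JOB = 1     # each analysis fits in the declared hour

always_on = HOURLY_PRICE * MACHINES * 24
ephemeral = HOURLY_PRICE * MACHINES * JOBS_PER_DAY * HOURS_PER_JOB

print(f"always-on ${always_on:.2f}/day vs ephemeral ${ephemeral:.2f}/day")
# Ephemeral stays cheaper until the cluster is busy most of the day
# (here: break-even at 24 one-hour jobs per day).
```

    At a few jobs per day the ephemeral setup costs a fraction of the always-on one; as utilization approaches round-the-clock, the advantage disappears, which matches the reasoning in the answer.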

    Pavel Pavlov: Did you manage to achieve any serious improvement in performance and thereby reduce the computation time and save money?

    Ignatius Kolesnichenko: Yes, Amazon has a nice feature called spot instances, which lets you cut spending while losing almost nothing. Essentially they sell their spare capacity at a price up to 10 times lower than the machine's usual rate.

    So we tweaked and experimented, but overall I wouldn't say we won anything significant - 10-15 percent. The main problems with Amazon are that everything starts out of the box in a setup that is, by and large, designed to run Java code, whereas all our code is written in C++, and we had some problems with binary compatibility.

    Pavel Pavlov: Can you tell us a little more about the incompatibility? I just haven't run into it before.

    Ignatius Kolesnichenko: Amazon has Elastic MapReduce - a ready-made Hadoop image - but you can't manage that image in any way. Naturally so: Amazon is responsible for keeping it working, so they configured it once in the appropriate way and handed it to you. All you can do is bring up a similar virtual machine. So you bring up a similar VM, configure something on it, send over the compiled binary - and it suddenly crashes in libc when it creates a new thread. And there's your problem!

    Pavel Pavlov: Can tracking such errors be automated? I take it this happens fairly regularly?

    Ignatius Kolesnichenko: No, it's more of a one-off thing. The problem happens when we update something on the machine. An update comes out; the machine, say, is old enough, you update it and libc gets updated along with it, and suddenly everything breaks.

    It is a technical problem, but a solvable one. There is no particular point in automation here, because we run the same thing all the time. We are not a public service where users build binaries themselves and send them to us. We have, I don't know, 5 binaries, and we just need to make sure they launch successfully and run to completion.

    I was unfortunate enough to discover that Hadoop is not very good at running native binary code like ours. It is very much geared toward being convenient for Java, and for people trying to run third-party code everything is rather inconvenient. There were problems with the interface and with how to specify the settings correctly. In the new Hadoop everything is now launched in separate containers, and there is also a Java layer that needs its own memory allocation. That is, your task runs as a child process under a Java VM: there is a container, a JVM lives in it, your program lives under the JVM, and that JVM also buffers the data it reads and writes. You need to account for all of this correctly in order to use as much of the machine's memory as possible without anything crashing or running out of memory.
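    The memory accounting described here can be made concrete with a configuration sketch. This is not the actual iBinom job - the binary name and paths are invented - but the property keys are standard Hadoop 2 (YARN) settings: the container limit has to cover the JVM heap, the JVM's own overhead and I/O buffers, and the native C++ child process.

```shell
# Hypothetical Hadoop Streaming run of a native aligner binary.
# mapreduce.map.memory.mb is the total YARN container limit; the JVM heap
# (-Xmx) is set well below it so the forked C++ process and the JVM's
# read/write buffers fit in the remainder.
hadoop jar hadoop-streaming.jar \
  -D mapreduce.map.memory.mb=4096 \
  -D "mapreduce.map.java.opts=-Xmx1024m" \
  -files align_reads \
  -mapper ./align_reads \
  -input /data/reads \
  -output /data/aligned
```

    If the container limit were set equal to the heap, the native child would be squeezed out and the container killed for exceeding its memory - exactly the failure mode being described.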

    Pavel Pavlov: So, again, if you had a static cluster - some stable configuration - would there be far fewer problems?

    Ignatius Kolesnichenko: Yes, that's true. But for now, we cannot afford a static cluster.

    Pavel Pavlov: And how long does a typical computation take, from upload to the final PDF file?

    Ignatius Kolesnichenko: It depends on the amount of data, but in general an hour. When the data is 2 GB, the post-processing that searches the various databases takes a minute; when it is 30 GB, it already takes 10-15. And, unfortunately, that process is very hard to parallelize; you can't simply put it on MapReduce, since the databases you need to search occupy tens of gigabytes. And since the cluster is not static, we can't just push those tens of gigabytes to all the machines - everything would bottleneck on the network. As it is, about 5 minutes at cluster startup already go to sending every machine our reference genome, which weighs 3 GB, plus its various indexes - about 10 GB in total.

    Pavel Pavlov: On Amazon? But surely there is a pretty serious channel there - at least 1 Gbps?

    Ignatius Kolesnichenko: It turns out to be less than that, by and large, because S3 lives in one place and the cluster comes up somewhere slightly different. It's all in one place, of course, but the network is clearly not unlimited, and moving 10 GB takes about 5 minutes. Unfortunately, it's not the case that you have a clean gigabit, divide 10 GB by 100 MB/s, and get 100 seconds. That doesn't happen; everything turns out much slower.
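    The arithmetic in this answer works out as follows, using the figures mentioned: 10 GB, a nominal ~100 MB/s, and the ~5 minutes actually observed.

```python
# Ideal vs. observed transfer time for shipping the reference data.
SIZE_MB = 10 * 1000          # 10 GB, in MB
NOMINAL_MB_S = 100           # ~1 Gbps expressed as ~100 MB/s
OBSERVED_SECONDS = 5 * 60    # ~5 minutes in practice

ideal_seconds = SIZE_MB / NOMINAL_MB_S       # the "100 seconds" that never happen
effective_mb_s = SIZE_MB / OBSERVED_SECONDS  # throughput actually achieved

print(ideal_seconds, round(effective_mb_s, 1))  # 100.0 33.3
```

    So the effective rate is roughly a third of the nominal one, consistent with "everything turns out much slower."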

    Pavel Pavlov: And does Amazon currently let you optimize such infrastructure aspects related to network topology and costs?

    Ignatius Kolesnichenko: I tried talking to support and, it seems, they don't really allow it.

    Pavel Pavlov: That is, even if you choose the same data center, availability zone, you can still drive through a bunch of routers?

    Ignatius Kolesnichenko: It is hard to tell how many routers the traffic passes through. They run an internal network, and it is all very opaque. Some IPs are visible on the outside, others on the inside. That is, inside there is an entirely separate network, and it is impossible to understand what exactly happens there. Sometimes quite different machines come up. It doesn't seem to have any effect - just an interesting fact.

    Pavel Pavlov: But do they at least try to balance the computing power of their units somehow?

    Ignatius Kolesnichenko: Yes. I can't say that any one machine turned out worse than the others, that it lags.

    Pavel Pavlov: The input that the user downloads is some kind of standardized format? Does it depend on a sequencer or something else?

    Ignatius Kolesnichenko: Rather, there are various standards. It's the classic story: some area of programming has 20 standards - so let's write a 21st that unifies all 20! And then it turns out there are simply 21 standards. Exactly the same thing happens here. Some new company, laboratory, or sequencer manufacturer says: "Listen, I thought it over and realized the old format is so-so. It has such-and-such problems. I'll go and make a new format." And the old format really does have those problems. But the other companies keep doing things the old way, and as a result there is one more format. Still, things are actually not so bad with the input data formats.

    Alexander Astapenko: Ignat, are there services now that store this data in the cloud, so that the user does not need to upload it via the web interface?

    Ignatius Kolesnichenko: Yes, that task exists. The trend now is that sequencing providers tell their customers: "Let's not attach a hard drive to the sequencer - plug this cable in instead, and we will upload everything straight to our cloud." Illumina, for example, has its own cloud where all the data is uploaded automatically. And when the doctor orders the analysis, they just send him a link. That's it - no shuffling data around.

    ABOUT THE iBinom TEAM

    Alexander Astapenko: What kind of people are on your team now? Whom are you going to hire, and in which direction will you move?

    Ignatius Kolesnichenko: Right now, for lack of money, the team has shrunk, but initially it was divided into two parts. The first part handled building the service: the web interface and its development, and so on. That was our CTO, who no longer works with us, plus another excellent web developer. In addition, we had and still have one junior developer.

    Alexander Astapenko: Can you name the technologies to make it more or less clear?

    Ignatius Kolesnichenko: The web - back-end and front-end - is on Node.js, plus Jade and CSS. There is also a bit of Python and Bash in the back-end, but that is mostly about building the report and launching things on Amazon.

    That is one part of the team. The second part, which we have had and still have, is research. The idea for the whole startup belongs to our biologist Valery, who was once also involved in Genotek. The research part of the team had 4-5 people at various points. We tried algorithms and studied new problems.

    For example, this task: given genome analyses of a mother, father, and child, answer more precisely the question of the child's mutations - determine which mutations are new in the child, absent in the parents.

    There is another interesting task concerning the mother-fetus system: we take a blood sample from a pregnant woman, and fragments of her baby's DNA float in that blood. One can try to isolate and analyze them. That is, during analysis we will see reads of the DNA of both mother and child, and in this way one can try to identify the child's mutations in advance. A very promising area, but so far, unfortunately, the quality metrics here are quite low.

    Alexander Astapenko: If we are talking about the back-end, then there were three of you?

    Ignatius Kolesnichenko: Not counting me.

    Alexander Astapenko: That is, only four people, right?

    Ignatius Kolesnichenko: Yes, but now I am left with one developer. That is, our goal now is to build some new piece of the service, to finish something. The service is up and working, it doesn't fall over; but since we have no money, we cannot significantly develop or rework it right now.

    Alexander Astapenko: So are you the official CTO in the team, the company?

    Ignatius Kolesnichenko: Officially I may not have that role. As I said, we found a CTO in June, and until we ran out of money he effectively led the entire back-end development.

    Alexander Astapenko: And now this position was given to you?

    Ignatius Kolesnichenko: Yes. The beta I launched more or less myself, and I did all the fine-tuning myself too. Formally I may not be the CTO, but in fact I perform the CTO's functions.

    Alexander Astapenko: How do you imagine CTO in your company? What role should he play in this type of project?

    Ignatius Kolesnichenko: The CTO's responsibility is to think about the architecture, about the future of the system. It requires a good understanding of how the back-end works and how it all runs on MapReduce - what the problems and difficulties are. Our CTO should also be interested in understanding the biotechnology and the algorithms, and to some extent should lead that research. And the CTO must definitely understand how it all works on Amazon and manage both back-end and front-end development.

    Pavel Pavlov: So in your understanding the CTO is focused on the technical issues the team and the project are working on. But should he also go beyond the company's boundaries, see what is happening around him, orient himself in the industry somehow?

    Ignatius Kolesnichenko: He must, of course he must. But you have to understand that it is hard - purely technically, even just in terms of time - to take all of that on yourself. There are different arrangements; as I see it, there should be 2 or 3 people who talk to each other closely, each with his own area of responsibility. In my view, the CTO's area is development: the architecture of the system, an understanding of how it works, where it will move, what the technical difficulties are. He must also, of course, keep an eye on the world around him, but that is still not his main duty.

    Pavel Pavlov: There are such roles as architect, lead developer, team lead, and so on. And at the same time there is the CTO. Do these concepts overlap for you somehow? Is one substituted for another?

    Ignatius Kolesnichenko: Probably, yes. Much also depends on the scale of the company. As long as the company is small (fewer than 10 people) and there aren't enough hands, then in my understanding the CTO should even program some things himself - some of the complicated things. At a minimum, he should read and review all the code.

    When the company grows, stratification naturally occurs: the CTO stops reading the code and starts thinking more about the architecture, the product, and its technical development. Team leads appear, and senior developers who take the CTO's old responsibilities upon themselves.

    The same happens on the scientific side, which starts splitting into groups, and a large sales group grows in the company. Scaling is simply necessary; in our current state, though, we need a CTO who can do everything a little bit.


    Alexander Astapenko: Are there any competitors? In Russia? Worldwide?

    Ignatius Kolesnichenko: There are competitors. One of the main ones is probably the desktop program CLCBio, which has been around for about 8 years. It basically solves the same problems. What are its drawbacks? First, the analysis takes quite a while: a complete human genome takes 12 hours. Second, it is quite complex and tackles a million different tasks at once, so biologists spend a lot of time just learning to work with it. But otherwise, of course, it can produce all the necessary information. It is one of the richest companies in this area today.

    Alexander Astapenko: And what about service models?

    Ignatius Kolesnichenko: There are service models. As I said, various companies are building cloud platforms for bioinformatics computing. The problem with such services is that they are not at all tailored to bioinformaticians - to scientists and doctors. They have no goal of investigating hereditary human diseases specifically; their goal is to cover the segment from raw data to mutations. Most often they go no further.

    There are also competitors doing the same thing as us, but essentially by hand. The client obtains the mutations and sends them to the company, which spends, say, a week or two analyzing the data and then sends back a report with a detailed description of the mutations and an assessment of their significance. Such companies cover the second area: understanding, from a mutation, its relationship to hereditary diseases. Here we can still develop very far, and I am not fully confident we can automatically replicate everything they do by hand - but we will naturally strive for that.

    Alexander Astapenko: And the sequencing companies?

    Ignatius Kolesnichenko: They are also thinking about cloud platforms and are building something, but in what form is not yet very clear.

    Alexander Astapenko: Suppose iBinom manages to get serious investment, and the role of technologically defining the company's future falls to you. Can you outline where you would take iBinom?

    Ignatius Kolesnichenko: We have a big new direction: transcriptome analysis and the problem of determining the type of cancer. There are hundreds or even thousands of different types of cancerous tumors, and different drugs must be used against them. This is a very hard scientific problem: people can solve it in certain specific cases, but from the algorithmic point of view it is even more complicated.

    The first stage is the same: you take the data, sequence it, and obtain the mutations, and then you go deep into the question of what those mutations affect. That is where I will need to interact with a biologist. Honestly, it is genuinely interesting for me to work with a scientist who can do something like this with existing tools, while my task is to figure out how he does it and how the tools work, and to assemble from them something unified and working.

    Alexander Astapenko: And what about the consumer? How do you plan to develop?

    Ignatius Kolesnichenko: We have an idea of reworking the web interface to make it more convenient. But that is a routine task; there are no interesting challenges there in terms of infrastructure and programming.

    There was also an idea to eliminate the gap between the user uploading the data and the analysis. Since we effectively run a single analysis, we could analyze everything on the fly while it is being uploaded. A non-trivial task for the future.
