ViktoryiaFedzkovich June 23, 2014 at 09:40

CTOcast # 2: Ignatius Kolesnichenko (iBinom - analysis of the human genome)

Introducing the second episode of the podcast about technology, processes, infrastructure, and people in IT companies. Today CTOcast's guest is Ignatius Kolesnichenko, technical director of iBinom.

Listen to podcast

A few words about our guest and iBinom:

Ignatius Kolesnichenko graduated from the Faculty of Mechanics and Mathematics of Moscow State University and from the School of Data Analysis. He has worked at Yandex since 2009. He started at Yandex.Probki (Yandex.Traffic) and for the last couple of years has been working on distributed computing. In 2013 he became a co-founder of iBinom. He teaches seminars on computational complexity.

iBinom was founded in 2013 (its founders include Andrey Afanasyev, Valery Ilyinsky, and Ignatius Kolesnichenko). The company is developing a SaaS solution for analyzing human genome data. The results of the analysis can be used by doctors without special knowledge of bioinformatics, which is what makes the iBinom service unique. A beta version of the project is now ready and is being actively tested with doctors and clinics.

Text version of the podcast (1st part)

About Olympiad programming, education and personal experience

Alexander Astapenko: Ignatius, I suggest we start not with iBinom but with you and your career. Can you tell us a little about how you became interested in programming, about working at Yandex, and how you eventually came to iBinom?

Ignatius Kolesnichenko: ... I first learned, more or less, what programming is at Lyceum No. 1511 at the Moscow Engineering Physics Institute. I had done some programming before that, but nothing serious. Once I read an announcement and thought: “Interesting, so there are programming olympiads! I should go and see what that is.” I came, had a look, and it turned out that you need to know a whole lot of things. You need to be able to program, and it turned out that until then I had not really known how ...

Alexander Astapenko: As far as I know, you had an interesting story with the olympiads. Or did that begin later, at university?

Ignatius Kolesnichenko: I began to take part in olympiads at school. ... And when applying to university, I was essentially choosing between physics and mathematics. At some point I decided on mathematics, because I really liked programming. I thought that I would come to Mekhmat and there would probably be clubs there that would train me. It turned out that Mekhmat had no programming clubs of that kind. I had to spend a long time looking for people, for a team, to start taking part somehow. But if you show proper perseverance, a team can be found and everything works out.

It's great that there are a lot of people at Mekhmat who have huge experience in olympiad programming and who won various olympiads at school. You meet them, they teach you something. You start to participate and become part of this community. Although there is no dedicated club, the mere fact that there is a community to compete and train with helps a lot. Over a couple of years we trained to the point where we began to place well in the quarter-finals and went on to the semifinals. The strongest teams, of course, made it to the finals of the ACM world programming championship. I didn't get to the finals, but it still gave me great experience.

Alexander Astapenko: Could you tell us in a nutshell, for those who are not familiar with it, what kind of problems you encountered at the olympiads? Does participating in such events give practical experience, and is it useful in real life and development?

Ignatius Kolesnichenko: I will answer the second question right away. Yes! Now about the problems. A problem in olympiad programming consists of two parts: coming up with an algorithm and implementing it. Often one of them is simple and the other is difficult. ... All problems in olympiad programming are checked automatically. Each problem has a set of tests that is hidden from you. There is a limit on the time your program may take on the tests and on the amount of memory it may use. You need to come up with and write an algorithm that fits within these limits. So, in essence, problems fall into two types: those in which it is difficult to come up with an algorithm, and those in which the algorithm does not seem very hard to find, but there are many details and you need to write and implement everything carefully.

Is there any benefit from this? Yes, of course. A good programmer needs algorithms; he must be able to estimate time, memory, and speed, and to write programs efficiently. And preferably quickly, because even if you write excellent code but spend two days on every little thing, that is no good, and you won't get much written. At some point these paths naturally diverge, and olympiad programming becomes more and more focused on constant training, on honing the writing of algorithms ...

Pavel Pavlov: What else distinguishes a good programmer from a bad one, besides understanding algorithms, algorithmic thinking, and the ability to solve such problems?

Ignatius Kolesnichenko: An olympiad background is, of course, not enough to be a good programmer, because in real life you face completely different kinds of tasks and problems.

Firstly, real programs are much larger. At olympiads, programs are 100 to 400-500 lines long. In real life you have to write systems of tens of thousands of lines that are very complex and large. Each individual part may be quite simple, but thinking through how they all interact is very difficult. Another important point is the ability to design an API. That is one part.

The second part: since the programs are large, you need to be able to keep working with them in the future. It is not a case of writing the code, seeing that it works, closing it, and forgetting about it. You need to cover the code with tests and keep in mind that someone will read it later. Olympiad programmers do not think at all about someone reading their code, so you get one-letter variables and logic that is often not split into functions, just one long sheet of code. The main thing is to write it quickly so that it works. That is fine in real life when you need to quickly write a prototype and check whether an idea works. But it is completely unsuitable for code that goes into production and needs to be maintained, developed, and so on.

These skills are loosely structured and non-trivial and, in my opinion, are acquired simply with experience. You cannot read a book and learn to cover code with tests, design a system architecture, or design its API properly. You do it once, twice, and by the third time you realize that you are now doing it fairly well.

Pavel Pavlov: You touched on the education system ... Was it difficult to find people with whom you shared interests in developing your skills and knowledge? How adequate do you think the level of education at universities and schools is? Or do you basically have to stew in some narrow circle to get this kind of knowledge?

Ignatius Kolesnichenko: Indeed, there is a problem in Russian education. We have top-level technical universities (Moscow State University, Moscow Engineering Physics Institute, Moscow Institute of Physics and Technology, Baumanka) where the programming courses, especially everything related to industrial programming, fall somewhat short of the level the industry now expects and of the level offered at European or American universities.

On the other hand, there are a lot of talented people. Russia has one of the strongest olympiad communities, and all these top universities have people involved in olympiad programming. You can join this crowd and fill the gaps in your knowledge there. They also often give lectures or simply share knowledge with each other. But, of course, this is not quite the right approach, because it produces olympiad programmers who then still have to complete their training in industry.

... But it seems that the situation is changing. In this respect there is a wonderful place, the School of Data Analysis. In addition, as far as I know, Yandex is opening a new faculty. More precisely, HSE is opening a new faculty with the support of Yandex. I have a feeling it should be very good there, but we'll see.

Alexander Astapenko: When you hire people and see that a candidate has taken part in programming competitions, is that an important factor for you?

Ignatius Kolesnichenko: It is, of course, a plus if a person has gone through olympiad programming, but it is not decisive.

Alexander Astapenko: University, Olympiad programming ... What happened next?

Ignatius Kolesnichenko: ... While studying at the School of Data Analysis, I was invited to an interview at Yandex. It was a shock at the time: I was only in my third year, yet I could already work, earn money, and even solve interesting problems. I joined as an intern and, it seems, remained one for over a year. ... After that I worked on more complex things, at Yandex.Probki (Yandex.Traffic), where we as a team rewrote the existing infrastructure. It was the first large system that I saw and took some part in. That was great.

About iBinom

Alexander Astapenko: Tell us about iBinom ... about the idea, how the project came about, and how you started.

Ignatius Kolesnichenko: It all happened very simply. I had a good friend from Mekhmat who, together with a biologist, set about creating the company Genotek. They were building a service like 23andMe: a user comes to you, you take a saliva sample, analyze it, and tell him about his predispositions to hereditary diseases. The service is, for the most part, entertainment; people come just for fun. They have some money and are willing to spend it to learn interesting new things about themselves. One evening I was talking with my friend, and he told me: “Look, we have a biologist and he has a problem ... Would you be interested in it?” The problem was to search for mutations in the genome using exome analysis.

... Roughly speaking, we discussed this in December, and in February I came back with the news that it all seemed to work: what you used to do locally in 8 hours, I can do in 30 minutes. So a prototype of sorts came out of it. Then the guys said: “Listen, we are interested in more than a prototype, of course. We have an idea of how to monetize it. Let's do it.” I thought it over and decided that I should join.

There was a conflict of interest in the sense that I worked, and still work, at Yandex, which I like, and on the other hand this was also a super-interesting problem, something completely different. It would have been silly to miss such an opportunity, so I decided to eat into my personal time and start doing two things at once. So for a little over a year now we have been building a more or less serious startup.

Alexander Astapenko: Does Genotek still exist?

Ignatius Kolesnichenko: Yes, the company exists.

Alexander Astapenko: And there is no conflict of interest there?

Ignatius Kolesnichenko: It does something a little different. There are some algorithms there too, but that is not its main specialization. A user comes to them, they need to take a saliva sample, run it on the device and analyze it there, then, of course, do some computer analysis and present the result on a nice website: “You have a predisposition to this and that ...”

Alexander Astapenko: Is it like the 23andMe that you mentioned?

Ignatius Kolesnichenko: Yes, this is the Russian analogue of 23andMe.

Alexander Astapenko: Let's talk about how iBinom works.

Ignatius Kolesnichenko: What is the human genome? The human genome is a very long sequence that consists of 23 pieces. These 23 pieces are the chromosomes. Each chromosome comes as a pair. In this sequence (about 3 billion characters in total) there are regions of interest to us. In principle all of it may be of interest, but there are special regions, the so-called coding regions, in other words, genes. There are not that many of them: not 3 billion characters but, I don't know, 50 million. And you need to find out what is going on there: whether these genes work or not, what mutations there are, and what they affect.

The first difficult task is how to read it all. A long time ago, 40-50 years ago, a simple manual method was invented for reading one piece. All of our DNA can be thought of as a long string over four letters: A, C, T, G. From the point of view of algorithms and analysis, you can treat DNA as letters and not think about nucleic acids and so on. And you need to read it. About 50 years ago we learned to do that. True, reading a thousand characters took a person a day. And we need not a thousand characters but 50 million! How to do that was completely unclear. There was a very big project, the Human Genome Project, in which a billion dollars was invested, if not more. The goal of that project was to read, assemble, and sort out the entire human genome.

But there is one more problem: we can read pieces of 50 or 100 characters from different places, but we understand very poorly where those places actually are and how to assemble the human genome from them. Where is science in this area now? It has learned to read these pieces in very large quantities and very quickly, but, again, we do not know where they come from. We take the entire human genome, all our chromosomes, break them into small pieces 100 to 1000 characters long, and read each from beginning to end. After that we have a great many such reads and need to assemble the whole genome from them. For assembly to be possible at all, this procedure is repeated many times, at least 30 and often more, so that each letter, each region of the genome, is covered many times. If we did this only once, we would have no information about how the pieces fit together and would not be able to join them. So there must be many such pieces, with high coverage. Then you need to apply some magic, a complex algorithm that assembles the entire genome from these pieces. This is still a difficult task today: assembling the genome of a new organism is very hard. This is what is called genome assembly.
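
The idea of gluing overlapping reads back together can be sketched with a toy greedy assembler. This is only an illustration under simplifying assumptions (error-free reads, no long repeats, a tiny made-up sequence); it is not the algorithm used by iBinom or by real assemblers, and the names `overlap`, `greedy_assemble`, and `min_len` are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of reads with the largest overlap.

    Returns the list of contigs left when no pair overlaps by at least
    `min_len` characters (a single contig if everything joined up).
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to join
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]
    return contigs


# Made-up reads: 6-letter windows cut from one short sequence every 2 letters.
EXAMPLE_READS = ["ATGCGT", "GCGTAC", "GTACGT", "ACGTTA", "GTTAGC"]
```

Because every letter of the toy sequence is covered by several overlapping reads, the overlaps are enough to put the pieces back in order; with a single non-overlapping pass there would be nothing to join on, which is the point made above about coverage.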

Analyzing the genome of a specific person is a little easier. Exactly the same procedure is done: small pieces of the genome are read, but this time only from the interesting, coding regions. The whole genome is usually not read, although that can also be done. Then we use the wonderful property that all people are very similar to each other: we differ by no more than 1%. So if we take some reference human genome and our reads, we do not have to assemble a new genome from the reads; we can compare the reads directly with the reference genome, find where each one occurs there, and see how it differs. That, more or less, is how human mutations and other changes in the genome are analyzed.
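
Comparing reads with a reference can be sketched as follows: place each read where it disagrees least with the reference, pile the reads up, and report positions where the majority letter differs. This is a deliberately naive illustration (brute-force placement, substitutions only, made-up data); real aligners use indexes and handle insertions and deletions, and the function names and the `max_mismatches` parameter are assumptions made for the example.

```python
from collections import Counter, defaultdict


def best_position(read: str, reference: str):
    """Return (offset, mismatches) for the placement with the fewest mismatches."""
    best_offset, best_mismatches = 0, len(read) + 1
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(1 for r, g in zip(read, window) if r != g)
        if mismatches < best_mismatches:
            best_offset, best_mismatches = offset, mismatches
    return best_offset, best_mismatches


def call_variants(reads, reference: str, max_mismatches: int = 2):
    """List (position, reference_letter, observed_letter) where reads disagree."""
    pileup = defaultdict(Counter)  # position -> counts of letters seen there
    for read in reads:
        offset, mismatches = best_position(read, reference)
        if mismatches > max_mismatches:
            continue  # read does not fit anywhere well enough; skip it
        for i, letter in enumerate(read):
            pileup[offset + i][letter] += 1
    variants = []
    for position in sorted(pileup):
        letter, _count = pileup[position].most_common(1)[0]
        if letter != reference[position]:
            variants.append((position, reference[position], letter))
    return variants


# Made-up example: the sample differs from the reference at position 6 (A -> G).
REFERENCE = "ATGCGTACGTTAGC"
SAMPLE_READS = ["ATGCGTGC", "CGTGCGTT", "GCGTTAGC"]
```

Here `call_variants(SAMPLE_READS, REFERENCE)` reports the single substitution at position 6, which every overlapping read supports.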

Alexander Astapenko: By the way, “the ideal human genome” is a very interesting phrase. It has strong echoes of the middle of the last century.

Ignatius Kolesnichenko: Not an ideal one, but a reference one ...

Alexander Astapenko: Does it depend on the current political situation?

Ignatius Kolesnichenko: No, it does not depend on the current political situation. This is a purely biological, rather bioinformatic, scientific term.

Alexander Astapenko: And it doesn't depend on skin color either?

Ignatius Kolesnichenko: In many ways it does, actually. There are different builds. Roughly speaking, people in Europe and people in Russia (as populations) are different. You can assemble an average person from Russia and an average person from Europe, and they will be slightly different. If we analyze a person who lives in Europe, it is more reasonable to compare him with the European reference than with the African one.

Alexander Astapenko: Or in Russia.

Ignatius Kolesnichenko: Or in Russia. Well, Russia is still like Europe in that sense. They are closer to each other; Africa, as far as I know, is a little further away. But there are still minor differences, and comparisons can be made. The reference genome that is used everywhere was assembled from something like 100 or 1000 completely different people. They took 100 or 1000 different people, analyzed all of them, and assembled a common genome from all their data together. If we took one specific person, he might have many hereditary diseases that are recessive or for some other reason do not show up, and we would not get a true reference. To build such a reference it is better to take as many people as possible. If we take the majority at each position, it is most likely the common allele in our population, the common letter in our population at that position. And most likely it is the good, correct one, while a rare one is wrong and can cause something.
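
The "majority at each position" idea can be shown in a few lines. This is a minimal sketch that assumes the individual sequences are already aligned and of equal length; how real reference builds are produced is far more involved, and the `consensus` function and the sample strings are made up for illustration.

```python
from collections import Counter


def consensus(sequences):
    """Return the most common letter at each position of aligned sequences."""
    if not sequences:
        raise ValueError("need at least one sequence")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    # zip(*sequences) yields one column (one genome position) at a time.
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*sequences))


# Five made-up individuals; three of them carry one rare letter each.
INDIVIDUALS = ["ATGCA", "ATGCA", "ATTCA", "ACGCA", "ATGCG"]
```

Each rare letter is outvoted at its position, so `consensus(INDIVIDUALS)` is the common sequence `ATGCA`, even though only two of the five individuals match it exactly.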

Let me tell you what the whole analysis consists of, because this is only the first part. This is the first complex technical part: the raw data produced by the device (the sequencer), which generates short sequences, the reads. You simply give this sequencer your saliva or blood, any sample, with all sorts of reagents added. It runs for, I don't know, 12 hours and writes reads to a hard drive. There are a great many of them, with high coverage. Typical data for exome analysis (analysis of human genes) is from 2-3 to 30 GB.

Alexander Astapenko: I came across a figure of 200 GB in one of your videos.

Ignatius Kolesnichenko: 200 GB is, generally speaking, a complete genome. But in practical applications the complete genome is of little interest to anyone. It is interesting mostly in a purely scientific sense.

Alexander Astapenko: So in real life it is from 2 to 30 GB, something like that?

Ignatius Kolesnichenko: More or less, yes. When we made the service, we wanted to be able to work with the full genome too, because you never know who wants to ...

Well, we found changes in the genome: for example, the reference genome has the letter A, and for some reason I have the letter G in that place. On its own this information is completely incomprehensible. What do you do with it? Here the difficult part begins again, one that from a scientific point of view is only in its infancy: understanding the mutation, what diseases it can lead to, and whether the protein it is responsible for will work or not.

Yes, let's talk a little about proteins. So we have these genes; what are genes? In terms of DNA, a gene is a subsequence. What does it do? It encodes a protein. A protein consists of amino acids, which are individual molecules, and each amino acid is encoded by three consecutive letters of our sequence. That is, if we have a sequence of length 30, we split it into consecutive triples, and each triple is responsible for one amino acid. Those 10 amino acids then form a protein. Usually there are more than 10: a gene is, I don't know, 600 nucleotides and 200 amino acids. Proteins play a huge role in our body. They are responsible for all regulation, for the functioning of the cell. A lot of proteins float around in our cells, take part in various reactions as catalysts, and do many other things. If a protein is broken, how did it break? A mutation occurred, so one amino acid in the protein was replaced by another. The protein folds, and it matters how it folds and what three-dimensional structure it ends up with. It will fold the wrong way, and nothing will work.
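
The "30 letters, 10 triples, 10 amino acids" arithmetic can be made concrete with the standard genetic code. The sketch below is a simplification in the same spirit as the explanation above (it reads the DNA letters directly and ignores transcription, introns, and start or stop handling); the example gene is made up, and `translate` and `EXAMPLE_GENE` are names chosen for illustration.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering;
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): amino_acid
               for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str) -> str:
    """Split `dna` into consecutive triples and map each to one amino acid.

    Trailing letters that do not fill a whole triple are ignored.
    """
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# A made-up 30-letter sequence: 10 triples, hence 10 amino acids.
EXAMPLE_GENE = "".join(["ATG", "GCC", "ATT", "GTA", "ATG",
                        "GGC", "CGC", "AAA", "AAG", "GGT"])

# One changed letter (index 4, C -> A) turns the triple GCC into GAC,
# so exactly one amino acid in the protein changes.
MUTATED_GENE = EXAMPLE_GENE[:4] + "A" + EXAMPLE_GENE[5:]
```

Translating both sequences gives two 10-letter proteins that differ in a single amino acid, which is the kind of change whose effect on folding is hard to predict.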

It might seem that if we have a mutation, everything will break and nothing will work, but the body is a tricky thing with a million layers of protection, so one particular mutation does not lead to anything. For context: every person has mutations, I don't remember exactly, about 1%. Nevertheless, people live somehow and do not worry. That is because there are so many layers of protection. The first layer: if something has changed in one chromosome, the second (paired) chromosome can still encode the protein, and everything will work. The second layer: suppose the protein is broken in both of our alleles and stops working; this protein may have, say, 5 different siblings that perform more or less the same function. It may be performed a little worse or more slowly, but on the whole everything will keep working. There is an even higher level: even if a whole chain breaks down, the body can most likely just live without it; it will not be as good, but it will still exist.

Alexander Astapenko: Are we talking about a cluster now, or is it about a person?

Ignatius Kolesnichenko: No, we are talking about a person now.

Alexander Astapenko: OK, go ahead!

Ignatius Kolesnichenko: This area of knowledge, understanding whether a mutation we have found will lead to some hereditary disease, is very difficult. Humanity's scientific knowledge here is very limited. We know maybe several thousand hereditary diseases and some relationships, but this knowledge is very narrow. Evidently the full body of such knowledge is 100 times larger than what has been studied so far.

So what do we do? We take what humanity knows and simply apply it in our analysis. There are many databases in which this knowledge is presented in different formats, obtained in different ways and in different laboratories. We try to aggregate them as well as we can and report which hereditary diseases the mutations we found can lead to.

What do we ultimately do for the doctor? The doctor in the clinic orders an analysis on a sequencer, and the resulting data is uploaded to the cloud. Then he says: “Guys, look, I have this person's data here, let's analyze it!” We analyze it and give the doctor the following information: we found, for example, 200 of the most interesting, important mutations, and what each of them may be responsible for. That is what we do now. But we are thinking of going a little further, since this is in fact how many doctors do their research: the doctor is not interested in looking through all 200 mutations, because he has a specific patient and is studying a specific disease, for example, of the liver. He suspects that the disease has hereditary causes. There may be, say, 10 different hereditary causes, and he does not know which one it is. So we are now finishing a system in which the doctor can say: “I want to look at this group of diseases.” Or: “My patient has these symptoms, and you know what mutations he has. Let's work out from these symptoms and mutations what disease this person has.” There are also databases that link diseases to various mutations. You just need to bring this information together and show it to the user, the doctor, so that he can understand why we reached that conclusion and how reliable it is. There. It came out a little messy, but that's roughly it.

Alexander Astapenko: Actually, almost everything is clear. One more point before we finally move on to the technical side. Say there are a couple of hundred mutations, not all of which interest the doctor. What error rate can creep in here? It's one thing when people are looking for patterns in securities trading on the stock exchange: a lot of money is at stake, but not a person's life. Here we are talking about diagnosing serious diseases, and any error is far more significant than in exchange trading. Tell us about the likelihood of such errors and whether there are ways to reduce their number.

Ignatius Kolesnichenko: Yes, there are a lot of errors. The sequencer itself makes mistakes, and that has to be corrected somehow. We also make mistakes on the algorithmic side. So, say, we find 200 critical mutations that look scary. But in real life the person is healthy and only has liver problems.

Alexander Astapenko: And you say no. To the morgue - then to the morgue! Yes?

Ignatius Kolesnichenko: These 200 mutations ... In many cases we simply do not understand why they do not actually lead to particular diseases. So for a doctor this is advisory information. He used our service and found out that there are such-and-such mutations and that they lead to a specific disease, which he may or may not have suspected. The doctor thinks: “Well, this disease is most likely the one. In principle, I can go and prescribe my patient a medicine to treat it.” But the doctor is no fool and does not simply take our word for it. Roughly speaking, he has the chain of links from the mutations to what we told him, and he can check whether it holds.

We try to state fully which databases we used and what we found, so that the doctor can see this chain. If he sees the whole chain, he can go and recheck it manually. For a specific disease the doctor is interested in, say, 3 of the mutations we reported. Checking 3 mutations is not exactly easy, of course, but quite feasible. The doctor can double-check, and then he is sure: yes, these 3 mutations really are associated with this disease. And then he takes responsibility and begins to treat the patient.

In America, for instance, NGS (next-generation sequencing) technology does not have FDA approval, so it is just additional information for the doctor. The doctor uses it to treat his patient better. As the saying goes: “Why not?” Not all doctors use it and, as far as I understand, many do not trust it. Clearly, the accuracy of the technology will increase, and at some point the doctor will no longer have to double-check everything. The methods will simply become accurate, and the amount of knowledge sufficient.

I would compare this with programming in the 1970s. Back then people could safely say, for example, that you cannot be a good programmer without knowing how the hardware works, how the mainframe works, and without knowing assembly language. Indeed, in those days that was true. If you did not know certain basic things, there was nothing for you to do, because memory was scarce, processor power was scarce, and you had to optimize everything, understand it, and know it well. But 30-40 years have passed, and now 90% of programmers do not know what assembly language is and have little idea of how it all works inside. They look at programming from a high level and solve their problems at the level at which they need to be solved, without going into details and other areas.

I believe the same will happen in bioinformatics and biotechnology, in the field of hereditary diseases. We are just at the very beginning now, and our knowledge is very scarce. A geneticist really ought to know the whole chain and understand how it works, otherwise he can make a mistake.

Pavel Pavlov: Tell me, for your part, how do you find and track sequencer errors?

Ignatius Kolesnichenko: Firstly, there are various automatic methods. But in general we try to take real analyses. We go to a clinic or laboratory that already has a person with some disease; they have analyzed the sample, found certain mutations, and verified them. We take this already verified information and find out how well our algorithm's output matches it. If it does not match, then why? How do we fix it?
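
Checking an algorithm against verified mutations can be expressed as a simple set comparison. The sketch below is only an assumption about what such a check might look like, not iBinom's validation procedure; the tuple layout `(chromosome, position, reference, observed)`, the function name, and the coordinates are all made up.

```python
def compare_with_verified(called, verified):
    """Compare the algorithm's calls with mutations a laboratory has verified.

    Only recall is computed: a call that the laboratory did not verify is
    not necessarily wrong, because the laboratory checks just a few sites.
    """
    called_set, verified_set = set(called), set(verified)
    found = called_set & verified_set
    missed = verified_set - called_set
    recall = len(found) / len(verified_set) if verified_set else None
    return {"found": sorted(found), "missed": sorted(missed), "recall": recall}


# Made-up coordinates, for illustration only.
VERIFIED = [("chr1", 100, "A", "G"), ("chr2", 200, "C", "T")]
CALLED = [("chr1", 100, "A", "G"), ("chr3", 5, "G", "A")]
```

The `missed` list is the starting point for the "if not, then why?" investigation: each verified mutation the algorithm failed to report is a concrete case to debug.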

People are, in principle, already able to do the same things we do. So what is the problem? The problem is that the laboratory has to keep a bioinformatician on staff who will launch all these algorithms by hand on some machine and wait 12 hours until it all finishes. Then he searches for the necessary information in various databases and on various sites. Our goal is, firstly, to do quickly what people already know how to do, and secondly, to aggregate all the information, present it nicely, draw various statistics, and generally wrap it all up in a sensible product.

Alexander Astapenko: Well, if we talk about the procedure itself, how does it all happen? Say I am a doctor and I have some sequenced data. In what format do I have this data (say, 30 GB)? And how does the data actually get to you so that you can begin to analyze it and search for mutations?

Ignatius Kolesnichenko: In real life the user already has the data on disk, or he downloads these 30 GB from somewhere. Then he simply uploads it to our website, and it ends up in S3. As soon as the user has uploaded it, he can start the analysis.

Alexander Astapenko: And these 30 GB are uploaded via the web interface?

Ignatius Kolesnichenko: Yes, the upload goes through the interface. That, in fact, was one of the difficult technical tasks: making it possible to resume the upload if the connection breaks.
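
One common way to make a large upload resumable is to send it in chunks and, after a failure, ask the receiving side how many bytes it already has. The sketch below simulates that in memory; it is an assumption about the general technique, not a description of iBinom's implementation or of any S3 API, and `UploadServer`, `upload`, and `fail_after` are hypothetical names.

```python
class UploadServer:
    """In-memory stand-in for the receiving side (hypothetical)."""

    def __init__(self):
        self._uploads = {}  # upload id -> bytearray of data received so far

    def received(self, upload_id: str) -> int:
        """Number of bytes already stored for this upload."""
        return len(self._uploads.get(upload_id, b""))

    def append(self, upload_id: str, offset: int, chunk: bytes) -> None:
        """Store a chunk, refusing it unless it starts where the data ends."""
        data = self._uploads.setdefault(upload_id, bytearray())
        if offset != len(data):
            raise ValueError("chunk does not start at the current end of data")
        data.extend(chunk)

    def content(self, upload_id: str) -> bytes:
        return bytes(self._uploads[upload_id])


def upload(server, upload_id, payload, chunk_size, fail_after=None):
    """Send `payload` in chunks, starting from what the server already has.

    `fail_after` simulates a dropped connection after that many chunks.
    Returns the number of bytes the server holds when the upload completes.
    """
    offset = server.received(upload_id)  # resume point
    chunks_sent = 0
    while offset < len(payload):
        if fail_after is not None and chunks_sent == fail_after:
            raise ConnectionError("connection dropped")
        chunk = payload[offset:offset + chunk_size]
        server.append(upload_id, offset, chunk)
        offset += len(chunk)
        chunks_sent += 1
    return offset
```

Calling `upload` again with the same `upload_id` after a `ConnectionError` continues from the stored offset instead of starting over, which matters when the file is tens of gigabytes.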

The text version of the podcast will continue in the coming days.
