"Mass Product": the first commercial DNA repository will be presented in 2019

    Launching the service is planning a startup Catalog. The company is developing a special installation that will allow daily recording of a terabyte of data of 500 trillion DNA molecules.

    Further we will tell about the approach used by Catalog and other recent developments in the DNA field.


    / photo University of Michigan CC

    Project details


    The classical approach to writing data to DNA involves transforming a sequence of bits — zeros and ones — into a sequence of four basic DNA bases. For example, nitrogen bases adenine (A), thymine (T), guanine (G) and cytosine (C) can be represented as follows: A = 00, T = 01, G = 10, C = 11.

    Using this approach, in 2016 Microsoft managed to “perpetuate” 200 MB of text and video in synthetic DNA molecules (which we already wrote in one of the posts ). However, this method is poorly suited for mass data recording, while being expensive.

    Instead of using millions of DNA chains, researchers from Catalog propose to generate a large number of different DNA molecules consisting of no more than 30 base pairs. Then, due to enzymatic reactions, these previously prepared “pieces” form special patterns that encode information. Thus, instead of representing a single nitrogen base, the bits are lined up in multi-dimensional matrices. And groups of molecules reflect the position of the bits in these matrices.

    Devin Leake, Head of Catalog Research, leadsfollowing analogy: “Imagine that you have a book. You can copy it manually: letter by letter. Similarly, you can write data in DNA - molecule by molecule. This approach is used in Microsoft. We propose to create a kind of "printing press", where DNA molecules will be a headset . Thus, rearranging the pre-generated molecules, we work immediately with whole words, arranging them in the right order. ”

    Using this method, researchers from the Catalog successfully recorded and recovered data in DNA. To do this, they used the poem The Road Not Taken(in one of the translations - “The Other Road”) by Robert Frost. Now the company is solving the task of scaling the platform to the needs of IT companies and government organizations.

    According to one of the founders of the Catalog, Hyunjun Park, this approach will make terabyte DNA storage commercially profitable by the beginning of 2019. However, the exact cost of the storage service that the startup will offer is not yet known.

    Similar developments


    As already noted, the creation of DNA repositories involved in Microsoft. And since 2016, researchers from the company have advanced in their developments: in February 2018, they created a “ primer library ” to organize random access to DNA. Each of the primers is “tied” to a specific chain, therefore, using the polymerase chain reaction, you can select any of them (and get access to the recorded data).


    / photo by Col Ford and Natasha de Vere CC

    The company hopes that this approach, together with a new, less error-prone algorithm for writing and reading data, will help in the future to create DNA storage of several terabytes. The plans of the IT giant to provide DNA storage as a service. Companyset a goal to implement the idea by 2020.

    Reciprocal DNA and AI


    There are no particular difficulties with recording information onto a carrier DNA: companies have come up with ways to automate. But the process of reading information is still complicated and time consuming. To solve this problem, Lifebit also plans to use AI systems. Lifebit is developing a cloud-based platform, Deploit, based on MO algorithms, which will automate the process of reading information from DNA carriers.

    Thus, machine learning will contribute to the organization of DNA storage. However, the opposite is also true - DNA molecules are used to create artificial intelligence systems. For example, researchers from Caltech are working in this area .

    The principle of operation of their neural network is based on chemical reactions, calledshifting strands (the DNA replication mechanism known in some viruses), when a thread called an incoming one displaces one of the strands of the original DNA. The “intellectual system” has already been taught to recognize numbers written by hand.

    The figure is drawn on a square plane divided into one hundred identical cells (10x10) - original pixels. Each of these cells is represented by a DNA molecule that “knows” whether there is a piece of digit on this pixel. After all the molecules are mixed in one tube, and the DNA network gives its answer with the help of fluorescent signals. The tube begins to emit a glow, the color of which depends on the recognized digit. For example, green and yellow mean five, and green and red nine.

    The plans of researchers to form a kind of memory in the neural network, so that it “remembers” the training vectors and uses them to solve other problems.

    O Catalog

    Catalog is an American startup founded in 2016 that develops data storage technologies in DNA molecules. Headquartered in Boston, Massachusetts.



    PS A couple of additional materials from the First Corporate IaaS blog:




    The main direction of our activity is the provision of cloud services:

    Virtual Infrastructure (IaaS) | PCI DSS Hosting | Cloud FZ-152 | Rent 1C in the cloud

    Also popular now: