Google Platform. 10+ years

    Storing and processing data is a task humanity has been tackling, with varying success, for thousands of years. The difficulty lies not only in the physical volume of the data (volume), but also in the speed at which the data changes (velocity) and in the variety of data sources (variety): the characteristics that Gartner analysts designated as the "3V" in their articles [11, 12].

    Computer science has recently run into the Big Data problem, and private companies, governments, and the scientific community alike are waiting for the IT industry to deliver solutions.

    Yet there is already a company in the world that has been coping with the Big Data problem, with varying success, for more than ten years. In my opinion (stating this reliably would require open data that is not freely available), no commercial or non-commercial organization operates with a larger volume of data than the company in question.

    It was this company that contributed the core ideas behind the Hadoop platform, as well as behind many components of the Hadoop ecosystem, such as HBase, Apache Giraph, and Apache Drill.

    You guessed it, this is about Google.



    Google Big Data Timeline


    Conventionally, the history of Google's "Big Data" solutions can be divided into two periods:
    • 1st period (2003-2008): a set of principles and concepts was described that is now the de facto standard for processing large volumes of data (on commodity hardware).
    • 2nd period (2009 to the present): data-processing technologies were described that will, in all likelihood, be used to solve "Big Data" tasks in the near future.

    2003-2008


    During this period, Google engineers published freely available research papers describing three systems that Google uses to solve its own problems:
    • Google File System (GFS) - a distributed file system [1];
    • Bigtable [3] - a high-performance database designed to store petabytes of data;
    • MapReduce [2] - a programming model for distributed processing of large volumes of data.
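    To give a feel for the MapReduce programming model, here is a minimal single-process sketch of the classic word-count job. The function names and the in-memory "shuffle" are illustrative only; in the real framework the map, shuffle, and reduce phases run across many machines.

```python
from collections import defaultdict

def map_fn(document):
    # Map phase: emit an intermediate (word, 1) pair for every word.
    for word in document.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    # Reduce phase: fold all partial counts for one key into a total.
    return word, sum(counts)

def map_reduce(documents):
    # Shuffle phase: group intermediate values by key, as the framework
    # would do between the map workers and the reduce workers.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

result = map_reduce(["big data", "big clusters process big data"])
print(result)  # {'big': 3, 'data': 2, 'clusters': 1, 'process': 1}
```

    The appeal of the model is that the user writes only `map_fn` and `reduce_fn`; partitioning, scheduling, and fault tolerance are the framework's job.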

    The impact of the papers Google published on the first steps of the Big Data industry is hard to overestimate.
    The best-known implementation of the concepts Google described is the Hadoop platform: the HDFS file system is modeled on GFS; the ideas behind the HBase architecture come from Bigtable; and the Hadoop MapReduce computation framework (without YARN) implements the principles embodied in Google's MapReduce framework of the same name.

    Starting in 2008, the Hadoop platform gained popularity over several years, and by 2010-2011 it had become the de facto standard for working with Big Data.

    Today Hadoop is the "locomotive" of the Big Data world and exerts an enormous influence on this segment of IT. But once it was the architectural approaches to building a Big Data platform described by Google that had just as great an influence on Hadoop.

    The Google platform itself kept evolving all this time, adapting to ever-new requirements: the search engine gained new services, including ones whose nature called for an interactive rather than a batch processing mode; the chunk sizes (the unit of storage in GFS) were ill-suited for efficiently storing all data types; and new requirements arose around geo-distribution and support for distributed transactions.

    By 2009-2010, both Google itself and the academic community had studied in sufficient detail the advantages and limitations of the set of approaches to building a Big Data platform that Google engineers had described from 2003 to 2008. The Google platform itself had also continued to develop and evolve up to 2009.

    2009-2013


    So, in the (conditional) 2nd stage of the development of Google's Big Data platform, 2009-2013, the company's researchers described the following software systems with varying degrees of detail:
    • Colossus (GFS2) - a distributed file system, the successor to GFS [10];
    • Spanner - a scalable, geo-distributed storage system with support for data versioning, the successor to Bigtable [8];
    • Dremel - a scalable system for near-real-time query processing, designed for the analysis of read-only nested data [4];
    • Percolator - a platform for incremental data processing, used to update the Google search index [9];
    • Caffeine - the infrastructure of Google's search services, built on GFS2, next-generation (iterative) MapReduce, and next-generation Bigtable [6];
    • Pregel - a scalable, fault-tolerant, distributed graph-processing system [7];
    • Photon - a scalable, fault-tolerant, geo-distributed system for processing streaming data [5].
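    Of the systems above, Pregel is the easiest to illustrate: it follows a vertex-centric, bulk-synchronous model in which each vertex receives messages, updates its value, and sends messages to neighbors, superstep after superstep, until no messages remain. Below is a minimal single-process sketch of that loop for maximum-value propagation; the names and in-memory message passing are illustrative assumptions, not Google's actual API.

```python
def pregel_max(edges, values):
    # edges: {vertex: [out-neighbors]}; values: {vertex: initial value}.
    # Superstep 0: every vertex announces its value to its neighbors.
    inbox = {v: [] for v in edges}
    for v in edges:
        for n in edges[v]:
            inbox[n].append(values[v])
    # Subsequent supersteps: a vertex wakes up only on incoming messages,
    # adopts the maximum it has seen, and forwards it only if it changed.
    # The computation halts when no messages are in flight.
    while any(inbox.values()):
        outbox = {v: [] for v in edges}
        for v, msgs in inbox.items():
            if msgs and max(msgs) > values[v]:
                values[v] = max(msgs)
                for n in edges[v]:
                    outbox[n].append(values[v])
        inbox = outbox
    return values

ring = {"a": ["b"], "b": ["c"], "c": ["a"]}
print(pregel_max(ring, {"a": 3, "b": 6, "c": 2}))  # {'a': 6, 'b': 6, 'c': 6}
```

    The global barrier between supersteps is what makes the model easy to reason about and to make fault tolerant via checkpointing.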

    In subsequent articles of the Google platform series, we will discuss most of the internal Google products listed above, with which Google successfully solves the problems of storing, structuring, and searching data, detecting spam, improving the effectiveness of ad serving in its contextual-advertising services, maintaining data consistency on the Google+ social network, and more.

    Instead of a conclusion


    Instead of a conclusion, I will quote a person who has already proven his ability to successfully predict the future of the Big Data industry, Mike Olson, CEO of Cloudera:
    If you want to know what the large-scale, high-performance data processing
    infrastructure future looks like, my advice would be to read the Google research papers that are coming out right now.
    - Mike Olson, Cloudera CEO

    List of sources used to prepare the series


    Main sources


    • [1] Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung. The Google File System. ACM SIGOPS Operating Systems Review, 2003.
    • [2] Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Proceedings of OSDI, 2004.
    • [3] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, et al. Bigtable: A Distributed Storage System for Structured Data. Proceedings of OSDI, 2006.
    • [4] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, et al. Dremel: Interactive Analysis of Web-Scale Datasets. Proceedings of the VLDB Endowment, 2010.
    • [5] Rajagopal Ananthanarayanan, Venkatesh Basker, Sumit Das, Ashish Gupta, Haifeng Jiang, Tianhao Qiu, et al. Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams, 2013.
    • [6] Our New Search Index: Caffeine. Google Official Blog.
    • [7] Grzegorz Malewicz, Matthew H. Austern, Aart J.C. Bik, James C. Dehnert, Ilan Horn, et al. Pregel: A System for Large-Scale Graph Processing. Proceedings of SIGMOD, 2010.
    • [8] James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, et al. Spanner: Google's Globally-Distributed Database. Proceedings of OSDI, 2012.
    • [9] Daniel Peng, Frank Dabek. Large-scale Incremental Processing Using Distributed Transactions and Notifications. Proceedings of OSDI, 2010.
    • [10] Andrew Fikes. Storage Architecture and Challenges. Google Faculty Summit, 2010.

    Additional sources



    Post Change History
    Commit 01 [12/23/2013]. Changed the title of the article.
    - Google Platform. 2003-2013
    + Google Platform. 10+ years
    Commit 02 [12/24/2013].
    + link to a post with a description of Colossus.
    Commit 03 [12/25/2013].
    + link to the post with the description of Spanner.
    Commit 04 [12/26/2013].
    + link to a post with a description of Dremel.
    Commit 05 [12/27/2013].
    + link to a post with a description of Photon.


    Dmitry Petukhov,
    MCP, PhD student, IT zombie,
    a man with caffeine instead of red blood cells.
