Greenplum DB

    We continue our series of articles about the technologies used in our bank's data warehouse (DWH). In this article I will try to give a brief and somewhat superficial overview of Greenplum, a PostgreSQL-based database management system that forms the core of our DWH. The article contains no installation logs, configs, and the like - even without them, the note turned out to be quite long. Instead, I will cover the general architecture of the DBMS, ways to store and load data, backups, and list a few problems we have run into during operation.



    A bit about our installations:

    • the project has been running for a little over two years;
    • 4 environments of 10 to 26 servers each;
    • DB size about 30 TB;
    • there are about 10,000 tables in the database;
    • up to 700 queries per second.

    For the details of how it all works, read on!

    1. General Architecture

    So, Greenplum (GP) is a relational DBMS with a massively parallel processing (MPP), shared-nothing architecture. To understand the principles of GP operation in detail, we need to define the main terms:

    Master instance (aka “master”) - the Postgres instance that is both the coordinator and the entry point into the cluster for users;

    Master host (“master server”) - the server on which the master instance runs;

    Secondary master instance - a Postgres instance serving as the standby master; it takes over if the primary master becomes unavailable (the switchover is manual);

    Primary segment instance ("segment") - a Postgres instance that is one of the segments. It is the segments that actually store the data, perform operations on it and (in the general case) return results to the master. In essence, a segment is an ordinary PostgreSQL 8.2.15 instance with WAL replication configured to its mirror on another server:

    /app/greenplum/greenplum-db-4.3.5.2/bin/postgres -D /data1/primary/gpseg76 -p 50004 -b 126 -z 96 --silent-mode=true -i -M quiescent -C 76 

    Mirror segment instance (“mirror”) - a Postgres instance that mirrors one of the primary segments and automatically takes over the primary role if that segment fails:

    /app/greenplum/greenplum-db-4.3.5.2/bin/postgres -D /data1/mirror/gpseg76 -p 51004 -b 186 -z 96 --silent-mode=true -i -M quiescent -C 76 

    GP only supports 1-to-1 segment replication: there can only be one mirror for each primary.

    Segment host ("segment server") - a server on which one or more segments and/or mirrors run.

    In general, a GP cluster consists of several segment servers, one master server and one standby master server, interconnected by one or more fast networks (10G, InfiniBand) usually called the interconnect:


    Fig. 1. Cluster composition and network interaction of elements. Here, the green and red lines are the isolated interconnect networks, the blue line is the external, client network.

    Using several interconnect networks allows, first, increasing the bandwidth of the channel between segments interacting with each other and, second, ensuring the fault tolerance of the cluster (if one of the networks fails, all traffic is redistributed among the remaining ones).

    When choosing the number of segment servers, it is important to pick the cluster's “CPU cores per TB of data” ratio according to the planned database load profile: the more processor cores per unit of data, the faster the cluster will perform “heavy” operations and work with compressed tables.

    When choosing the number of segments in a cluster (which in general is not tied to the number of servers in any way), remember the following:

    • all server resources are shared among all segments on that server (the load from mirrors, if they are located on the same servers, can for practical purposes be neglected);
    • each query on a single segment cannot consume more than one CPU core. This means, for example, that if a cluster consists of 32-core servers with 4 GP segments each and is used on average to process 3-4 simultaneous heavy, CPU-intensive queries, the CPUs will on average be far from optimally utilized. In this situation it is better to increase the number of segments per server to 6-8;
    • the regular backup and restore process works “out of the box” only between clusters with the same number of segments. Data backed up on a 96-segment cluster cannot be restored into a 100-segment cluster without manual hackery.

    2. Data storage

    Greenplum implements the classic data sharding scheme. Each table is represented by N+1 physical tables across all segments of the cluster, where N is the number of segments (the +1 is the table on the master, which holds no data). Each segment stores 1/N of the table's rows. The logic of splitting the table across segments is defined by the distribution key (field) - a field whose values determine which segment any given row is assigned to.

    The distribution key (a field or set of fields) is a very important concept in GP. As mentioned above, Greenplum runs at the speed of its slowest segment, so any imbalance in data volume (both within a single table and across the whole database) between segments leads to degraded cluster performance, as well as to other problems. That is why the distribution field should be chosen carefully - the distribution of value occurrences in it should be as uniform as possible. To check whether you chose the distribution key correctly, use the service field gp_segment_id that exists in every table - it contains the number of the segment on which a given row is stored.

    An important nuance: GP does not support the UPDATE of the field over which the table is distributed.
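
    A quick illustration (a sketch; the exact error text may differ between GP versions):

    db=# create table upd_test (id int, val text) distributed by (id);
    CREATE TABLE
    db=# update upd_test set id = id + 1;
    ERROR:  Cannot parallelize an UPDATE statement that updates the distribution columns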

    Consider Example 1 (here and in the examples below, the cluster consists of 96 segments):
    db=# create table distrib_test_table as select generate_series(1,20) as num_field distributed by (num_field);
    SELECT 20
    db=# select count(1),gp_segment_id from distrib_test_table group by gp_segment_id order by gp_segment_id;
     count | gp_segment_id
    -------+---------------
         1 |             4
         1 |             6
         1 |            15
         1 |            21
         1 |            23
         1 |            25
         1 |            31
         1 |            40
         1 |            42
         1 |            48
         1 |            50
         1 |            52
         1 |            65
         1 |            67
         1 |            73
         1 |            75
         1 |            77
         1 |            90
         1 |            92
         1 |            94
    db=# truncate table distrib_test_table; 
    TRUNCATE TABLE
    db=# insert into distrib_test_table values (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1);
    INSERT 0 20
    db=# select count(1),gp_segment_id from distrib_test_table group by gp_segment_id order by gp_segment_id;
     count | gp_segment_id
    -------+---------------
        20 |            42
    


    In both cases we distributed the table by the num_field field. In the first case we inserted 20 unique values into this field and, as you can see, GP spread the rows across different segments. In the second case 20 identical values were inserted into the field, and all the rows were placed on a single segment.

    If a table has no fields suitable for use as a distribution key, you can use random distribution (DISTRIBUTED RANDOMLY). The distribution key can be changed on an already created table, but the table has to be redistributed afterwards.
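
    For example (a sketch; in GP 4.3, ALTER TABLE ... SET DISTRIBUTED BY redistributes the rows as part of the operation):

    db=# create table rand_test (id int, payload text) distributed randomly;
    CREATE TABLE
    db=# -- switch to hash distribution by id; the data is redistributed across the segments:
    db=# alter table rand_test set distributed by (id);
    ALTER TABLE
    db=# -- switch back to random distribution, forcing a physical reshuffle of the rows:
    db=# alter table rand_test set with (reorganize=true) distributed randomly;
    ALTER TABLE
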
    Greenplum performs JOINs most efficiently on the distribution key: if the fields used in a JOIN are the distribution keys of both tables, the JOIN is executed locally on each segment. If this condition does not hold, GP has to either redistribute both tables by the required field, or broadcast one of the tables in full to every segment (the BROADCAST operation), and only then join the tables locally on the segments.

    Example 2: JOIN on the distribution key
    db=# create table distrib_test_table as select generate_series(1,192) as num_field, generate_series(1,192) as num_field_2 distributed by (num_field);
    SELECT 192
    db=# create table distrib_test_table_2 as select generate_series(1,1000) as num_field, generate_series(1,1000) as num_field_2 distributed by (num_field);
    SELECT 1000
    db=# explain select * from distrib_test_table sq
    db-# left join distrib_test_table_2 sq2
    db-# on sq.num_field = sq2.num_field;
    QUERY PLAN
    ------------------------------------------------------------------------------------------
     Gather Motion 96:1  (slice1; segments: 96)  (cost=20.37..42.90 rows=861 width=16)
       ->  Hash Left Join  (cost=20.37..42.90 rows=9 width=16)
             Hash Cond: sq.num_field = sq2.num_field
             ->  Seq Scan on distrib_test_table sq  (cost=0.00..9.61 rows=9 width=8)
             ->  Hash  (cost=9.61..9.61 rows=9 width=8)
                   ->  Seq Scan on distrib_test_table_2 sq2  (cost=0.00..9.61 rows=9 width=8)
    


    JOIN not on the distribution key
    db_dev=# explain select * from distrib_test_table sq
    db_dev-# left join distrib_test_table_2 sq2
    db_dev-# on sq.num_field_2 = sq2.num_field_2;
                                                   QUERY PLAN
    --------------------------------------------------------------------------------------------------------
     Gather Motion 96:1  (slice3; segments: 96)  (cost=37.59..77.34 rows=861 width=16)
       ->  Hash Left Join  (cost=37.59..77.34 rows=9 width=16)
             Hash Cond: sq.num_field_2 = sq2.num_field_2
             ->  Redistribute Motion 96:96  (slice1; segments: 96)  (cost=0.00..26.83 rows=9 width=8)
                   Hash Key: sq.num_field_2
                   ->  Seq Scan on distrib_test_table sq  (cost=0.00..9.61 rows=9 width=8)
             ->  Hash  (cost=26.83..26.83 rows=9 width=8)
                   ->  Redistribute Motion 96:96  (slice2; segments: 96)  (cost=0.00..26.83 rows=9 width=8)
                         Hash Key: sq2.num_field_2
                         ->  Seq Scan on distrib_test_table_2 sq2  (cost=0.00..9.61 rows=9 width=8)
    


    As you can see, in the second case two additional steps appear in the query plan (one for each table participating in the query): Redistribute Motion. In effect, before executing the query GP redistributes both tables across the segments, this time by the num_field_2 field rather than by the original distribution key, num_field.

    3. Client interaction

    In the general case, all client interaction with the cluster goes exclusively through the master - it is the master that answers clients, returns query results, and so on. Ordinary client users have no network access to the segment servers.

    To speed up data loading into the cluster, bulk load is used - parallel loading of data from/to the client through several segments simultaneously. Bulk load is possible only for clients that have access to the interconnect networks. Typically such clients are ETL servers and other systems that need to load large volumes of data (in Fig. 1 they are labeled ETL/Pro client).

    The gpfdist utility is used to load data onto the segments in parallel. In essence, the utility starts a web server on the remote (e.g. ETL) server, which exposes the specified directory over the gpfdist and HTTP protocols:

    /usr/local/greenplum-loaders-4.3.0.0-build-2/bin/gpfdist -d /tmp/work/gpfdist_home -p 8081 -l /tmp/work/gpfdist_home/gpfdist8081.log


    Once started, the directory and all files in it become accessible even with an ordinary wget. As an example, let's create a file in the directory served by gpfdist and read it as a regular table.

    Example 3: gpfdist operation
    # On the ETL server:
    bash# for i in {1..1000}; do echo "$i,$(cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 8 | head -n 1)"; done > /tmp/work/gpfdist_home/test_table.csv
    # Now let's create an external table and read the data from the file
    # In Greenplum DB:
    db=# create external table ext_test_table
    db-# (id integer, rand varchar(8))
    db-# location ('gpfdist://etl_hostname:8081/test_table.csv')
    db-# format 'TEXT' (delimiter ',' NULL ' ');
    CREATE EXTERNAL TABLE
    db_dev=# select * from ext_test_table limit 100;
    NOTICE:  External scan from gpfdist(s) server will utilize 64 out of 96 segment databases
     id  |   rand
    -----+----------
       1 | UWlonJHO
       2 | HTyJNA41
       3 | CBP1QSn1
       4 | 0K9y51a3
    …
    


    External web tables are created similarly, with slightly different syntax. Their peculiarity is that they work over the HTTP protocol and can consume data served by third-party web servers (Apache, etc.).
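
    A creation sketch (the host name and path here are hypothetical):

    db=# create external web table ext_web_test_table
    db-# (id integer, rand varchar(8))
    db-# location ('http://web_hostname:8080/data/test_table.csv')
    db-# format 'TEXT' (delimiter ',');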

    Worth mentioning separately is the ability to create external tables over data residing in the Hadoop distributed file system (HDFS) - a dedicated GP component, gphdfs, is responsible for this. To make it work, the Hadoop libraries must be installed on every server of the GP cluster and the path to them registered in one of the database system variables. Creating an external table that accesses data on HDFS looks something like this:

    db=# create external table hdfs_test_table
    db=# (id int, rand text) 
    db=# location('gphdfs://hadoop_name_node:8020/tmp/test_file.csv') 
    db=# format 'TEXT' (delimiter ',');
    

    Here hadoop_name_node is the address of the NameNode host, and /tmp/test_file.csv is the path to the desired file on HDFS.

    When such a table is queried, Greenplum asks the Hadoop NameNode for the location of the required data blocks on the DataNodes, which are then read from the segment servers in parallel. Naturally, all nodes of the Hadoop cluster must be reachable from the Greenplum cluster's interconnect networks. This scheme delivers a significant speedup even compared to gpfdist. Interestingly, the logic for choosing which segments read from which DataNodes is quite nontrivial: for example, GP may start pulling data from all DataNodes using only two segment servers, and when a similar query is repeated, the interaction scheme may change.

    There are also external table types that refer to files on the segment servers or a file on the master, as well as to the result of a command executed on the segment servers or on the master. By the way, the good old COPY FROM is still there and can also be used; however, compared to the approaches above, it is slower.
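
    Sketches of both variants (the hosts, paths and command here are hypothetical):

    db=# -- an external table over a file lying on one of the cluster hosts:
    db=# create external table ext_file_test_table
    db-# (id integer, rand varchar(8))
    db-# location ('file://seg_hostname/data/test_table.csv')
    db-# format 'TEXT' (delimiter ',');
    CREATE EXTERNAL TABLE
    db=# -- an external table over the output of a command run on all segments:
    db=# create external web table ext_cmd_test_table (line text)
    db-# execute 'cat /data/logs/app.log' on all
    db-# format 'TEXT';
    CREATE EXTERNAL TABLE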

    4. Reliability and redundancy

    4.1. Master redundancy

    As mentioned earlier, the GP cluster provides full master redundancy using a transaction log replication mechanism controlled by a special agent (gpsyncagent). However, automatic switchover of the master role to the standby instance is not supported. To switch to the standby master, you must:

    • make sure the primary master is stopped (the process is killed and the postmaster.pid file is absent from the master instance's working directory)
    • run gpactivatestandby -d /master_instance_directory on the standby master server
    • switch the virtual IP address to the new master's server (Greenplum has no virtual IP mechanism of its own; third-party tools have to be used).
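
    In shell terms the switchover looks roughly like this (a sketch; the data directory path is hypothetical):

    # on the old master (if it is still reachable), make sure the postmaster is down:
    bash# ps aux | grep '[p]ostgres' ; ls /data/master/gpseg-1/postmaster.pid
    # on the standby master server, promote the standby:
    bash# gpactivatestandby -d /data/master/gpseg-1
    # then move the virtual IP to the new master by external means (keepalived, etc.)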

    As you can see, the switchover is not difficult at all and, if certain risks are accepted, can be automated.

    4.2. Segment redundancy

    The segment redundancy scheme is similar to that of the master; the differences are minor. If one of the segments fails (the postgres instance stops responding to the master within the timeout), the segment is marked as failed and its mirror (in essence, an identical postgres instance) is automatically started in its place. Segment data is replicated to its mirror based on the WAL (Write-Ahead Log).
    It should be noted that the placement of segment mirrors on servers occupies a rather important place in planning a GP cluster's architecture, since GP gives full freedom in choosing where segments and their mirrors live: using a special segment placement map, they can be placed on different servers, in different directories, and made to use different ports. Consider two boundary options:


    Option 1: all the mirrors of the segments located on host N are on host N + 1.

    In this case, if one of the servers fails, the neighboring server will have twice as many working segments. As mentioned above, cluster performance equals that of the slowest segment, which means that in the event of a single server failure, database performance drops at least by half.
    However, this scheme also has its upside: while the cluster is running with a failed server, only one server becomes a point of vulnerability - the one the segments moved to.


    Option 2: all the mirrors of the segments located on host N are spread evenly across servers N + 1, N + 2 ... N + M, where M is the number of segments per server.

    In this case, if a server fails, the increased load is distributed evenly among several servers without affecting overall cluster performance too much. However, the risk of losing the whole cluster increases significantly - it is enough for any one of the M servers adjacent to the initially failed one to fail as well.

    The truth, as often happens, is somewhere in the middle - you can place the mirrors of one server's segments on several other servers, you can combine servers into fault-tolerance groups, and so on. The optimal mirror configuration should be chosen based on the cluster's specific hardware, the cost of downtime, and so on.
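
    The current placement and state of segments and mirrors can be inspected via the gp_segment_configuration system catalog (a real catalog; the exact column set may vary slightly between versions):

    db=# select content, role, preferred_role, mode, status, hostname, port
    db-# from gp_segment_configuration
    db-# order by content, role;

    Here role/preferred_role show which instance is currently primary ('p') or mirror ('m'), and mode shows the replication state (for example, 'c' for the change tracking mode described below).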

    There is one more nuance in the segment redundancy mechanism that affects cluster performance. If the mirror of one of the segments fails, the primary segment goes into change tracking mode - it logs all changes so that, when the fallen mirror is restored, they can be applied to it to obtain a fresh, consistent copy of the data. In other words, when a mirror goes down, the load that the segment left without a mirror puts on the server's disk subsystem grows significantly.

    When the cause of a segment failure (hardware problems, no space left on the storage device, etc.) is eliminated, the segment must be returned to service manually with the special gprecoverseg utility (no DBMS downtime is required). In essence, this utility copies the WAL accumulated on the segment over to the mirror and brings the fallen segment/mirror back up. A recovered primary segment initially comes back as a mirror for its own mirror, which has become primary (i.e. the mirror and the primary segment run with swapped roles). To return everything to normal, a rebalance procedure - a role swap - is needed. This procedure does not require DBMS downtime either, but during the rebalance all sessions in the database will hang.

    If the damage to the fallen segment is so severe that simply copying data from the WAL is not enough, full recovery of the segment can be used - in that case the postgresql instance is effectively recreated from scratch, and because this recovery is not incremental, it can take a long time.
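
    A recovery outline (a sketch based on the GP 4.3 utilities; treat it as an outline rather than a runbook):

    bash# gprecoverseg        # incremental recovery: apply the accumulated changes to the mirror
    bash# gpstate -m          # watch the resynchronization progress of the mirrors
    bash# gprecoverseg -r     # rebalance: return segments to their preferred roles
    bash# gprecoverseg -F     # full (non-incremental) recovery, when incremental is not enough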

    5. Performance

    Rating the performance of a Greenplum cluster is a rather loose exercise. I decided to start with the tests from this article: habrahabr.ru/post/253017, since the systems under consideration are quite similar. Since the cluster being tested is significantly more powerful (8 times, in server count alone) than the one in that article, we will take 10 times more data for the test. If you would like to see the results of other cases here, write in the comments; if possible, I will try to run the tests.

    Initial data: a cluster of 24 segment servers, each with 192 GB of memory and 40 cores. Number of primary segments in the cluster: 96.
    In the first test we create a table with 4 fields plus a primary key on one of them. Then we fill the table with data (10,000,000 rows) and run a simple SELECT with several conditions. As a reminder, the test is taken verbatim from the article about Postgres-XL.

    Test 1: SELECT with conditions
    db=# CREATE TABLE test3
    db-# (id bigint NOT NULL,
    db(# profile bigint NOT NULL,
    db(# status integer NOT NULL,
    db(# switch_date timestamp without time zone NOT NULL,
    db(# CONSTRAINT test3_id_pkey PRIMARY KEY (id) )
    db-# distributed by (id);
    NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "test3_pkey" for table "test3"
    CREATE TABLE
    db=# insert into test3 (id , profile,status, switch_date) select a, round(random()*100000), round(random()*4), now() - '1 year'::interval * round(random() * 40) from generate_series(1,10000000) a;
    INSERT 0 10000000
    db=# explain analyze  select  profile, count(status) from test3
    db=#                         where status<>2 
    db=#                         and switch_date between '1970-01-01' and '2015-01-01'  group by profile;
    Gather Motion 96:1 (slice2; segments: 96) (cost=2092.80..2092.93 rows=10 width=16)
    Rows out: 100001 rows at destination with 141 ms to first row, 169 ms to end, start offset by 0.778 ms.
    -> HashAggregate (cost=2092.80..2092.93 rows=1 width=16)
       Group By: test3.profile
       Rows out: Avg 1041.7 rows x 96 workers. Max 1061 rows (seg20) with 141 ms to end, start offset by 2.281 ms.
       Executor memory: 4233K bytes avg, 4233K bytes max (seg0).
       -> Redistribute Motion 96:96 (slice1; segments: 96) (cost=2092.45..2092.65 rows=1 width=16)
          Hash Key: test3.profile
          Rows out: Avg 53770.2 rows x 96 workers at destination. Max 54896 rows (seg20) with 71 ms to first row, 117 ms to end, start offset by 5.205 ms.
          -> HashAggregate (cost=2092.45..2092.45 rows=1 width=16)
          Group By: test3.profile
          Rows out: Avg 53770.2 rows x 96 workers. Max 54020 rows (seg69) with 71 ms to first row, 90 ms to end, start offset by 7.014 ms.
          Executor memory: 7882K bytes avg, 7882K bytes max (seg0).
          -> Seq Scan on test3 (cost=0.00..2087.04 rows=12 width=12)
             Filter: status <> 2 AND switch_date >= '1970-01-01 00:00:00'::timestamp without time zone AND switch_date <= '2015-01-01 00:00:00'::timestamp without time zone
             Rows out: Avg 77155.1 rows x 96 workers. Max 77743 rows (seg26) with 0.092 ms to first row, 31 ms to end, start offset by 7.881 ms.
    Slice statistics:
    (slice0) Executor memory: 364K bytes.
    (slice1) Executor memory: 9675K bytes avg x 96 workers, 9675K bytes max (seg0).
    (slice2) Executor memory: 4526K bytes avg x 96 workers, 4526K bytes max (seg0).
    Statement statistics:
    Memory used: 128000K bytes
    Total runtime: 175.859 ms
    


    As you can see, the query execution time was 175 ms. Now let's try an example with a JOIN on the distribution key of one table and an ordinary field of another table.

    Test 2: JOIN on the distribution key of one table and an ordinary field of another table
    db=# create table test3_1 (id bigint NOT NULL, name text, CONSTRAINT test3_1_id_pkey PRIMARY KEY (id)) distributed by (id);
    NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "test3_1_pkey" for table "test3_1"
    CREATE TABLE
    db=# insert into test3_1 (id , name) select a, md5(random()::text) from generate_series(1,100000) a;
    INSERT 0 100000
    db=# explain analyze select test3.*,test3_1.name from test3 join test3_1 on test3.profile=test3_1.id;
    -> Hash Join (cost=34.52..5099.48 rows=1128 width=60)
       Hash Cond: test3.profile = test3_1.id
       Rows out: Avg 104166.2 rows x 96 workers. Max 106093 rows (seg20) with 7.644 ms to first row, 103 ms to end, start offset by 223 ms.
       Executor memory: 74K bytes avg, 75K bytes max (seg20).
       Work_mem used: 74K bytes avg, 75K bytes max (seg20). Workfile: (0 spilling, 0 reused)
       (seg20) Hash chain length 1.0 avg, 1 max, using 1061 of 262151 buckets.
       -> Redistribute Motion 96:96 (slice1; segments: 96) (cost=0.00..3440.64 rows=1128 width=28)
          Hash Key: test3.profile
          Rows out: Avg 104166.7 rows x 96 workers at destination. Max 106093 rows (seg20) with 3.160 ms to first row, 44 ms to end, start offset by 228 ms.
          -> Seq Scan on test3 (cost=0.00..1274.88 rows=1128 width=28)
          Rows out: Avg 104166.7 rows x 96 workers. Max 104209 rows (seg66) with 0.165 ms to first row, 16 ms to end, start offset by 228 ms.
       -> Hash (cost=17.01..17.01 rows=15 width=40)
          Rows in: Avg 1041.7 rows x 96 workers. Max 1061 rows (seg20) with 1.059 ms to end, start offset by 227 ms.
          -> Seq Scan on test3_1 (cost=0.00..17.01 rows=15 width=40)
             Rows out: Avg 1041.7 rows x 96 workers. Max 1061 rows (seg20) with 0.126 ms to first row, 0.498 ms to end, start offset by 227 ms.
    Slice statistics:
    (slice0) Executor memory: 364K bytes.
    (slice1) Executor memory: 1805K bytes avg x 96 workers, 1805K bytes max (seg0).
    (slice2) Executor memory: 4710K bytes avg x 96 workers, 4710K bytes max (seg0). Work_mem: 75K bytes max.
    Statement statistics:
    Memory used: 128000K bytes
    Total runtime: 4526.065 ms
    


    The query execution time was 4.6 seconds. Whether that is a lot or a little for this volume of data is debatable and outside the scope of this article.

    6. Cluster expansion

    In the life cycle of a distributed analytical database, sooner or later a situation arises where the available disk space can no longer hold all the required data, and adding storage devices to the existing servers is either impossible or too expensive and complicated (at the very least, existing partitions would have to be expanded). Besides, adding disk capacity alone would worsen the “CPU cores per TB of data” ratio discussed in Section 1. Put simply, sooner or later new servers have to be introduced into the cluster.

    Greenplum allows adding both new servers and new segments with virtually no downtime. The sequence of actions is roughly as follows:

    • develop a segment map according to which GP will place the new segments and mirrors on the new servers;
    • back up the necessary critical data (at the very least, all metadata);
    • install the DBMS software on the new servers;
    • stop the DBMS (the next step is performed in downtime);
    • initialize the new segments with the gpexpand utility (takes 5 to 10 minutes; see the sketch after this list);
    • bring the DBMS back up (the downtime is over);
    • redistribute all tables;
    • collect statistics (analyze) on all tables.
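
    A gpexpand outline (a sketch based on the GP 4.3 utility; the file names here are hypothetical):

    bash# gpexpand -f new_hosts_file       # interactive phase: generates an expansion input file
    bash# gpexpand -i gpexpand_inputfile   # initializes the new segments (the downtime step)
    bash# gpexpand -d 60:00:00             # table redistribution phase, here limited to 60 hours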

    As you can see, although the expansion procedure as a whole is lengthy, with correct administrator actions the period of complete database unavailability will not exceed 20-30 minutes.

    7. Features of operation

    As usual, practice makes adjustments to a beautiful theory. Let me share some of the operational nuances we have identified over a long time of using GP. I should say up front that the standard nuances of postgresql (the need for vacuum, the particulars of WAL replication) are not included in this list.

    • automatic failover does not give a 100% guarantee of switching to the mirror
      Alas, this is so: especially under load, there is a risk of database processes hanging while trying to switch to the mirror. The problem is partially mitigated by reducing the segment response timeout to several minutes, but even then the risk remains. As a stopgap for hangs during switchover, you can manually kill the hung segment or restart the database.
    • Greenplum and OLTP are incompatible
      GP is an analytical database designed for a small number of concurrent queries performing heavy operations over large volumes of data. A large number (over 600 queries per second) of light single-operation queries/transactions hurts database performance because of its distributed architecture - every transaction on the master spawns N transactions on the segments. Good practice is to aggregate large numbers of UPDATEs/INSERTs into batches (see the sketch after this list).
    • lack of an incremental backup mechanism
    • its own syntax
      Although at the client level Greenplum is essentially a postgresql database, slight differences in SQL syntax mean that standard postgresql client software must be used with great care.
    • lack of the ability to mark segments as “archived”
      This drawback can be partially worked around with archive partitions placed on slow, cheap tablespaces, as well as with the ability - introduced in the latest GP version at the time of writing - to place table partitions in external sources (for example, external gphdfs tables in a Hadoop cluster).
    • Greenplum uses PostgreSQL version 8.2.15, which means you will have to forget about many modern goodies of this wonderful database.
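
    As an illustration of the batching advice above, a sketch (the table names are hypothetical): instead of a stream of single-row transactions, accumulate rows on the ETL side and load them in one statement, for example through a gpfdist external table:

    db=# -- bad: 10 000 transactions, each fanning out to all 96 segments
    db=# -- insert into fact_table values (1, 'a');  (repeated 10 000 times)
    db=# -- good: one transaction, one parallel load
    db=# insert into fact_table select * from ext_fact_stage;
    INSERT 0 10000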

    8. Conclusion

    Greenplum is a powerful and flexible tool for analytical processing of large volumes of data. It demands a slightly different approach than the rest of the enterprise-level Data Warehouse solutions (manual fine-tuning is the GP administrator's favorite tool). But with its fairly low entry threshold and its substantial compatibility with postgresql, Greenplum is a strong player in the Data Warehouse DB field.

    And finally, a small bonus: on February 17, 2015, Pivotal announced that Greenplum will soon become an open-source project as part of the Big Data Product Suite.

    UPD 10/28/2015. The source code of the database is available on github: github.com/greenplum-db/gpdb

    And in the “breaking news” section: on October 12 it became known that Dell is acquiring EMC, Pivotal's parent corporation.

    9. Links
    Official website
    How we replicate data from sources in Greenplum
    How we replicate data from one Greenplum contour to another
