The world's largest database is at Yahoo! And it runs on PostgreSQL!
Yahoo claims to have broken the world record by building the largest and most heavily loaded database in the world!
The database, launched a year ago, has reached 2 petabytes in volume. The system was built for analytics: it stores the browsing behavior of web users (reportedly, data on half a billion users is recorded each month). The Internet giant also claims that the database is not only the world's largest but also the most heavily loaded, ingesting data on 24 billion events per day.
And now for the fun part: this monster runs on a modified PostgreSQL. The technology came from Yahoo's acquisition of Mahat Technologies, a startup that built on PostgreSQL, the most advanced open-source database management system. The Postgres code was modified to cope with such huge volumes of data; one of the biggest changes is a shift to column-wise storage instead of the traditional row-by-row layout, which slows down writes to disk but gives much faster access to the data for analytical queries. The result speaks for itself: some tables in the database contain trillions of rows that don't just sit dead on disk but can be queried and processed with standard SQL, in a standard ACID-compliant environment.
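To make that trade-off concrete, here is a minimal Python sketch. It is purely illustrative: Yahoo has not published its code, and the event fields used here are invented. It shows why a column-wise layout favors analytical scans while making each write more expensive.

```python
# Toy illustration (not Yahoo's implementation): the same click events
# stored row-wise vs. column-wise, and an analytical query on one column.

# Row-oriented: each record is stored (and read) as a whole.
rows = [
    {"user_id": 1, "url": "/home",   "ms_spent": 320},
    {"user_id": 2, "url": "/search", "ms_spent": 150},
    {"user_id": 1, "url": "/mail",   "ms_spent": 980},
]

# Column-oriented: each attribute lives in its own contiguous array.
columns = {
    "user_id":  [1, 2, 1],
    "url":      ["/home", "/search", "/mail"],
    "ms_spent": [320, 150, 980],
}

# Analytical query: average time spent. The row store must touch every
# field of every record; the column store scans a single compact array.
avg_row = sum(r["ms_spent"] for r in rows) / len(rows)
avg_col = sum(columns["ms_spent"]) / len(columns["ms_spent"])
assert avg_row == avg_col  # same answer, very different I/O patterns

# Writing shows the trade-off in the other direction: one append for the
# row store, but one append per column for the column store.
rows.append({"user_id": 3, "url": "/news", "ms_spent": 45})
for name, value in (("user_id", 3), ("url", "/news"), ("ms_spent", 45)):
    columns[name].append(value)
```

On disk the effect is the same but magnified: an aggregate over one column of a trillion-row table reads only that column's pages instead of the whole table, which is exactly what an analytical workload wants.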
Yahoo's engineers expect the database to grow to 5 petabytes by next year, and they say they are ready for it. For comparison: enterprise-grade databases rarely exceed tens of terabytes. One of the largest publicly known databases in the world, that of the US Internal Revenue Service, "weighs" only 150 terabytes. eBay says it runs systems that process 10 billion rows per day, with a total data volume of 6 petabytes across those systems; its single largest system holds about 1.4 petabytes.
Note that we are talking specifically about DBMSs and the databases built on them. There are data archives with even more impressive volumes, but the data in them is practically inaccessible for analysis and processing. For example, the World Data Center for Climate in Hamburg stores more than 6 petabytes on magnetic tape, while "only" 220 terabytes are in an "active" state, managed by a DBMS running on Linux (see PDF).
"PostgreSQL continues to evolve, confirming its title as the most advanced open-source database management system," commented Nikolai Samokhvalov, a representative of Postgresmen. "Last year, Sun engineers showed the world that PostgreSQL is not inferior to Oracle in performance. At the recent PGCon 2008 international conference in Canada, NASA representatives spoke about their experience using PostgreSQL to work with large climate-observation databases. Yahoo's experience is one more clear confirmation of PostgreSQL's maturity. And this is very good news for all of us; it's just a pity that, as far as I know, Yahoo has no plans to share its best practices with the community."