Database Design Basics - PostgreSQL, Cassandra, and MongoDB Comparison

Original author: https://medium.com/@sid_7889
  • Transfer
Hello, friends. Before leaving for the second part of the May holidays, we share with you the material that we translated on the eve of the launch of a new stream at the rate of “Relational DBMS” .



Application developers spend a lot of time comparing multiple operating databases to choose the one that works best for their intended workload. Needs may include simplified data modeling, transactional guarantees, read / write performance, horizontal scaling, and fault tolerance. By tradition, the choice starts with a database category, SQL or NoSQL, since each category provides a clear set of trade-offs. High performance in terms of low latency and high throughput is usually seen as a trade-off requirement, and is therefore necessary for any database in the sample.

The purpose of this article is to help application developers make the right choice between SQL and NoSQL in the context of modeling application data. We will look at one SQL database, namely PostgreSQL and two NoSQL databases - Cassandra and MongoDB, to talk about the basics of database design, such as creating tables, populating them, reading data from the table and deleting them. In the next article, we will definitely look at indexes, transactions, JOINs, TTL directives, and JSON-based database design.

What is the difference between SQL and NoSQL?

SQL databases increase application flexibility with ACID transactional guarantees, as well as their ability to query data using JOINs in unexpected ways on top of existing normalized relational database models.

Given their monolithic / single-node architecture and the use of a master-slave replication model for redundancy, traditional SQL databases do not have two important features - linear scalability of the record (i.e., automatic splitting into several nodes) and automatic / zero data loss. This means that the amount of data received cannot exceed the maximum write throughput of one node. In addition, some temporary data loss should be taken into account during fault tolerance (in an architecture without resource sharing). Here you need to keep in mind that recent commits have not yet been reflected in the slave copy. Updates without downtime are also difficult to achieve in SQL databases.

NoSQL databases are typically distributed by nature, i.e. in them, data is divided into sections and distributed across several nodes. They require denormalization. This means that the data entered must also be copied several times in order to respond to specific requests that you send. The overall goal is to get high performance by reducing the number of shards available while reading. It follows the statement that NoSQL requires you to model your queries, while SQL requires you to model your data.

NoSQL focuses on achieving high performance in a distributed cluster, and this is the main rationale for the many database design trade-offs that include transaction loss ACIDs, JOINs, and consistent global secondary indexes.

It is believed that although NoSQL databases provide linear write scalability and high fault tolerance, the loss of transactional guarantees makes them unsuitable for mission-critical data.

The following table shows how data modeling in NoSQL differs from SQL.



SQL and NoSQL: Why are both needed?

Real-world applications with a large number of users, such as Amazon.com, Netflix, Uber, and Airbnb, carry out complex, multi-sorted tasks. For example, an e-commerce application like Amazon.com needs to store lightweight, highly critical data, such as information about users, products, orders, invoices, along with heavy but less sensitive data, such as product reviews, support messages , user activity, user reviews and recommendations. Naturally, these applications rely on at least one SQL database along with at least one NoSQL database. In interregional and global systems, NoSQL database works as a geo-distributed cache for data stored in a trusted source, SQL database, operating in any one region.

How does YugaByte DB combine SQL and NoSQL?

Built on a log-oriented mixed storage engine, auto-sharding, sharding distributed consensus replication and ACID distributed transactions (inspired by Google Spanner), YugaByte DB is the first open source database in the world to be simultaneously compatible with NoSQL (Cassandra & Redis ) and SQL (PostgreSQL). As shown in the table below, YSQL, the Cassandra-compatible YugaByte DB API, adds the concepts of single and multi-key ACID transactions and global secondary indexes to the NoSQL APIs, thereby opening the era of transactional NoSQL databases. In addition, YSQL, a PostgreSQL compliant YugaByte DB API, adds the notion of linear write scaling and automatic fault tolerance to the SQL API, introducing distributed SQL databases to the world.



As previously stated in the article “Introducing YSQL: A PostgreSQL Compatible Distributed SQL API for YugaByte DB” , the choice between SQL or NoSQL in YugaByte DB depends entirely on the characteristics of the main workload:

  • If the main workload is multi-key operations with JOINs, then when choosing YSQL, understand that your keys can be distributed across several nodes, which will lead to a higher delay and / or lower throughput than in NoSQL.
  • Otherwise, select either of the two NoSQL APIs, bearing in mind that you will get better performance as a result of queries being served from one node at a time. YugaByte DB can serve as a single operational database for real complex applications in which you need to manage multiple workloads at the same time.

The data modeling lab in the next section is based on YugaByte DB databases compatible with PostgreSQL and Cassandra API, unlike the source databases. This approach emphasizes the ease of interaction with two different APIs (on two different ports) of the same database cluster, as opposed to using completely independent clusters of two different databases.

In the following sections, we will meet the data modeling lab to illustrate the differences and some common features of the databases in question.

Data Modeling Lab Database

Installation

Given the emphasis on designing a data model (rather than complex deployment architectures), we will install the databases in Docker containers on the local computer, and then we will interact with them using their corresponding command-line shells.

Compatible with PostgreSQL & Cassandra, YugaByte DB Database

mkdir ~/yugabyte && cd ~/yugabyte
wget https://downloads.yugabyte.com/yb-docker-ctl && chmod +x yb-docker-ctl
docker pull yugabytedb/yugabyte
./yb-docker-ctl create --enable_postgres

Mongodb

docker run --name my-mongo -d mongo:latest

Access via the command line

Let's connect to the databases using the command line shell for the corresponding APIs.

PostgreSQL

psql is a command line shell for interacting with PostgreSQL. For ease of use, YugaByte DB comes with psql directly in the bin folder.

docker exec -it yb-postgres-n1 /home/yugabyte/postgres/bin/psql -p 5433 -U postgres

Cassandra

cqlsh is a command-line shell for interacting with Cassandra and its compatible databases via CQL (Cassandra query language). For ease of use, YugaByte DB comes with cqlsha catalog bin.

Note that CQL was inspired by SQL and has similar concepts to tables, rows, columns, and indexes. However, as a NoSQL language, it adds a certain set of restrictions, most of which we will also cover in other articles.

docker exec -it yb-tserver-n1 /home/yugabyte/bin/cqlsh

MongoDB

mongo is a command line shell for interacting with MongoDB. It can be found in the bin directory of the MongoDB installation.

docker exec -it my-mongo bash 
cd bin
mongo

Creating a table

Now we can interact with the database to perform various operations using the command line. Let's start by creating a table that stores information about songs written by different artists. These songs may be part of an album. Also optional attributes for the song - year of release, price, genre and rating. We need to consider additional attributes that may be needed in the future through the "tags" field. It can store semi-structured data as key-value pairs.

PostgreSQL

CREATE TABLE Music (
    Artist VARCHAR(20) NOT NULL, 
    SongTitle VARCHAR(30) NOT NULL,
    AlbumTitle VARCHAR(25),
    Year INT,
    Price FLOAT,
    Genre VARCHAR(10),
    CriticRating FLOAT,
    Tags TEXT,
    PRIMARY KEY(Artist, SongTitle)
);	

Cassandra

Creating a table in Cassandra is very similar to PostgreSQL. One of the main differences is the lack of integrity constraints (for example, NOT NULL), but this is the responsibility of the application, not the NoSQL database . The primary key consists of a section key (Artist column in the example below) and a set of clustering columns (SongTitle column in the example below). The partition key determines which partition / shard to place the row in, and the clustering columns indicate how the data should be organized inside the current shard.

CREATE KEYSPACE myapp;
USE myapp;
CREATE TABLE Music (
    Artist TEXT, 
    SongTitle TEXT,
    AlbumTitle TEXT,
    Year INT,
    Price FLOAT,
    Genre TEXT,
    CriticRating FLOAT,
    Tags TEXT,
    PRIMARY KEY(Artist, SongTitle)
);

MongoDB

MongoDB organizes data into databases (Database) (similar to Keyspace in Cassandra), where there are collections (Collections) (similar to tables) that contain documents (Documents) (similar to rows in a table). MongoDB basically does not require the definition of the original schema. The “use database” command shown below creates an instance of the database on the first call and changes the context for the newly created database. Even collections do not need to be created explicitly, they are created automatically, simply when you add the first document to a new collection. Please note that MongoDB uses the test database by default, therefore any collection-level operation without specifying a specific database will be performed in it by default.

use myNewDatabase;

Retrieving PostgreSQL table information

\d Music
Table "public.music"
    Column    |         Type          | Collation | Nullable | Default 
--------------+-----------------------+-----------+----------+--------
 artist       | character varying(20) |           | not null | 
 songtitle    | character varying(30) |           | not null | 
 albumtitle   | character varying(25) |           |          | 
 year         | integer               |           |          | 
 price        | double precision      |           |          | 
 genre        | character varying(10) |           |          | 
 criticrating | double precision      |           |          | 
 tags         | text                  |           |          | 
Indexes:
    "music_pkey" PRIMARY KEY, btree (artist, songtitle)

Cassandra

DESCRIBE TABLE MUSIC;
CREATE TABLE myapp.music (
    artist text,
    songtitle text,
    albumtitle text,
    year int,
    price float,
    genre text,
    tags text,
    PRIMARY KEY (artist, songtitle)
) WITH CLUSTERING ORDER BY (songtitle ASC)
    AND default_time_to_live = 0
    AND transactions = {'enabled': 'false'};

Mongodb

use myNewDatabase;
show collections;

Posting data to a PostgreSQL table

INSERT INTO Music 
    (Artist, SongTitle, AlbumTitle, 
    Year, Price, Genre, CriticRating, 
    Tags)
VALUES(
    'No One You Know', 'Call Me Today', 'Somewhat Famous',
    2015, 2.14, 'Country', 7.8,
    '{"Composers": ["Smith", "Jones", "Davis"],"LengthInSeconds": 214}'
);
INSERT INTO Music 
    (Artist, SongTitle, AlbumTitle, 
    Price, Genre, CriticRating)
VALUES(
    'No One You Know', 'My Dog Spot', 'Hey Now',
    1.98, 'Country', 8.4
);
INSERT INTO Music 
    (Artist, SongTitle, AlbumTitle, 
    Price, Genre)
VALUES(
    'The Acme Band', 'Look Out, World', 'The Buck Starts Here',
    0.99, 'Rock'
);
INSERT INTO Music 
    (Artist, SongTitle, AlbumTitle, 
    Price, Genre, 
    Tags)
VALUES(
    'The Acme Band', 'Still In Love', 'The Buck Starts Here',
    2.47, 'Rock', 
    '{"radioStationsPlaying": ["KHCR", "KBQX", "WTNR", "WJJH"], "tourDates": { "Seattle": "20150625", "Cleveland": "20150630"}, "rotation": Heavy}'
);

Cassandra

In general, the expression INSERTin Cassandra looks very similar to that in PostgreSQL. However, there is one big difference in semantics. In Cassandra, INSERTit is actually an operation UPSERTwhere the last values ​​are added to the string in case the string already exists.
Data entry is similar to PostgreSQL INSERTabove

MongoDB

Although MongoDB is a NoSQL database like Cassandra, its insertion operation has nothing to do with semantic behavior in Cassandra. In MongoDB, insert () has no capabilities UPSERT, which makes it look like PostgreSQL. Adding default data without _idspecifiedwill add a new document to the collection. Querying the table Perhaps the most significant difference between SQL and NoSQL in terms of querying is the use of and . SQL allows you to select multiple tables after an expression , and an expression with can be of any complexity (including operations

db.music.insert( {
artist: "No One You Know",
songTitle: "Call Me Today",
albumTitle: "Somewhat Famous",
year: 2015,
price: 2.14,
genre: "Country",
tags: {
Composers: ["Smith", "Jones", "Davis"],
LengthInSeconds: 214
}
}
);
db.music.insert( {
artist: "No One You Know",
songTitle: "My Dog Spot",
albumTitle: "Hey Now",
price: 1.98,
genre: "Country",
criticRating: 8.4
}
);
db.music.insert( {
artist: "The Acme Band",
songTitle: "Look Out, World",
albumTitle:"The Buck Starts Here",
price: 0.99,
genre: "Rock"
}
);
db.music.insert( {
artist: "The Acme Band",
songTitle: "Still In Love",
albumTitle:"The Buck Starts Here",
price: 2.47,
genre: "Rock",
tags: {
radioStationsPlaying:["KHCR", "KBQX", "WTNR", "WJJH"],
tourDates: {
Seattle: "20150625",
Cleveland: "20150630"
},
rotation: "Heavy"
}
}
);




FROMWHEREFROMWHEREJOINbetween tables). However, NoSQL tends to impose a strict restriction on FROM, and work with only one specified table, and in WHERE, the primary key must always be specified. This is due to the desire to improve the performance of NoSQL, which we talked about earlier. This desire leads to every possible reduction in any cross-tabular and cross-key interaction. It can lead to a large delay in inter-nodal communication when responding to a request and, therefore, it is best avoided in principle. For example, Cassandra requires that requests be limited to specific operators (allowed only =, IN, <, >, =>, <=) on partition keys, except when requesting a secondary index (only the = operator is allowed here).

PostgreSQL

Three examples of queries that can be easily executed by an SQL database will be given below.

  • Print all songs of the artist;
  • Print all songs of the artist that match the first part of the name;
  • List all songs of the artist that have a specific word in the title and have a price less than 1.00.

SELECT * FROM Music
WHERE Artist='No One You Know';
SELECT * FROM Music
WHERE Artist='No One You Know' AND SongTitle LIKE 'Call%';
SELECT * FROM Music
WHERE Artist='No One You Know' AND SongTitle LIKE '%Today%'
AND Price > 1.00;

Cassandra

Of the above PostgreSQL queries, only the first will work in Cassandra unchanged, since the operator LIKEcannot be applied to clustering columns such as SongTitle. In this case, only operators =and are allowed IN.

SELECT * FROM Music
WHERE Artist='No One You Know';
SELECT * FROM Music
WHERE Artist='No One You Know' AND SongTitle IN ('Call Me Today', 'My Dog Spot')
AND Price > 1.00;

MongoDB

As shown in previous examples, the main method for creating queries in MongoDB is db.collection.find () . This method explicitly contains the name of the collection ( musicin the example below), so the request for several collections is prohibited.

db.music.find( {
  artist: "No One You Know"
 } 
);
db.music.find( {
  artist: "No One You Know",
  songTitle: /Call/
 } 
);

Reading all rows of a table

Reading all rows is just a special case of the query template that we examined earlier.

PostgreSQL

SELECT * 
FROM Music;


Cassandra

Similar to the example in PostgreSQL above.


Mongodb

db.music.find( {} );

Editing data in a

PostgreSQL table PostgreSQL

provides instructions UPDATEfor modifying data. It does not have the capabilities UPSERT, so the execution of this instruction will fail if the row is no longer in the database.

UPDATE Music
SET Genre = 'Disco'
WHERE Artist = 'The Acme Band' AND SongTitle = 'Still In Love';

Cassandra

Cassandra has a UPDATEsimilar PostgreSQL. UPDATEhas the same semantics UPSERTlike INSERT.

Similar to the example in PostgreSQL above.

MongoDB
The update () operation in MongoDB can completely update an existing document or update only certain fields. By default, it updates only one document with semantics disabled UPSERT. Updating several documents and similar behavior UPSERTcan be applied by setting additional flags for the operation. For example, in the example below, the genre of a particular artist is updated by his song.

db.music.update(
  {"artist": "The Acme Band"},
  { 
    $set: {
      "genre": "Disco"
    }
  },
  {"multi": true, "upsert": true}
);

Removing data from a PostgreSQL table



DELETE FROM Music
WHERE Artist = 'The Acme Band' AND SongTitle = 'Look Out, World';

Cassandra

Similar to the example in PostgreSQL above.

MongoDB

MongoDB has two types of operations for deleting documents - deleteOne () / deleteMany () and remove () . Both types delete documents, but return different results.

db.music.deleteMany( {
        artist: "The Acme Band"
    }
);

Deleting a PostgreSQL table



DROP TABLE Music;

Cassandra

Similar to the example in PostgreSQL above.

Mongodb

db.music.drop();

Conclusion The

debate about choosing between SQL and NoSQL has been raging for over 10 years. There are two main aspects of this debate: the architecture of the database engine (monolithic, transactional SQL versus distributed, non-transactional NoSQL) and the approach to database design (data modeling in SQL versus modeling your queries in NoSQL).

With a distributed transactional database such as YugaByte DB, the debate about the database architecture can be easily dispelled. As data volumes become larger than what can be written to a single node, a fully distributed architecture that supports linear scalability of recordings with automatic sharding / rebalancing becomes necessary.

In addition, as stated in a Google Cloud articleTransactional, strictly consistent architectures are now more widely used to provide better development flexibility than non-transactional, ultimately consistent architectures.

Returning to the discussion of database design, it is fair to say that both design approaches (SQL and NoSQL) are necessary for any complex real application. The SQL "data modeling" approach allows developers to more easily meet changing business requirements, while the NoSQL "modeling query" approach allows the same developers to handle large amounts of data with low latency and high throughput. It is for this reason that YugaByte DB provides SQL and NoSQL APIs in a common kernel, rather than promoting one of the approaches. In addition, by ensuring compatibility with popular database languages, including PostgreSQL and Cassandra, YugaByte DB ensures that developers do not have to learn another language to work with a distributed, strictly consistent database engine.

In this article, we figured out how the fundamentals of database design differ in PostgreSQL, Cassandra, and MongoDB. In the following articles, we will dive into advanced design concepts such as indexes, transactions, JOINs, TTL directives, and JSON documents.

We wish you a wonderful stay the rest of the weekend and invite you to a free webinar , which will be held on May 14.

Also popular now: