Back to Home

ReplacingMergeTree in ClickHouse: Complete Guide

The article details the ReplacingMergeTree engine in ClickHouse: why it is needed to combat duplicates in unreliable delivery, how merging works by ORDER BY, the role of versioning and the FINAL modifier. Typical pitfalls, comparison with CollapsingMergeTree, and the materialized view pattern to bypass FINAL are discussed.

ReplacingMergeTree: Data Deduplication in ClickHouse
Advertisement 728x90

ReplacingMergeTree: How to Beat Duplicates in ClickHouse Without Pain

1. Why ReplacingMergeTree Is Needed β€” The Real-World Duplicate Problem

Imagine you're developing an online casino. A player clicks the "Place Bet" button β€” 1000 rubles on black. At that moment, the server processing the request suddenly crashes (overheating, network failure, who knows). The client didn't get a response and thinks: "The bet didn't go through." The player clicks again. The server recovers and accepts both requests. In the database β€” two identical bets. The player is furious: 2000 rubles were deducted instead of 1000.

This is a classic idempotency problem (from Latin idem β€” same, potens β€” capable). An operation is idempotent if repeating it yields the same result as doing it once. In the database world, we need a mechanism that figures out: "I've already seen this bet, I'll ignore the second version."

In ClickHouse, there's ReplacingMergeTree for this. It's a table engine that automatically removes duplicates during merging of data parts. But I'll warn you upfront: it's not magic β€” it has quirks we'll discuss.

Google AdInline article slot

Real-life analogy: ReplacingMergeTree is like a secretary who keeps a meeting log. People come to you with requests. Sometimes the same client brings two identical applications (e.g., missed a train and asks for a refund, then calls again with the same request). The secretary doesn't throw out duplicates at the door β€” they just put all papers in a folder. Once a day, they go through the folder and keep only the latest application from each client. If someone asks "how many applications from Ivanov?" before the sorting β€” they'll see two. After β€” one.

2. How ReplacingMergeTree Works β€” Brick by Brick

Duplicates Arise from Unreliable Delivery

ClickHouse was originally designed for large-scale analytics, where an occasional miss or duplicate isn't critical. But then people started using it for critical data β€” and got burned. ReplacingMergeTree is the answer to that pain.

Why do duplicates appear at all?

Google AdInline article slot
  • The client sent data, didn't get an acknowledgment (timeout), and sent again.
  • The queue system (Kafka, RabbitMQ) gives at-least-once guarantee β€” at least one delivery, possible repeats.
  • A bug in the ETL process (Extract, Transform, Load) β€” the pipeline was run twice.

Mechanism: Merge by ORDER BY Key

When creating a table with ReplacingMergeTree, you must specify a sorting key β€” ORDER BY (column1, column2). This is not a primary key in the classical sense (like in PostgreSQL), but a way to physically order data on disk. ClickHouse stores data in parts β€” chunks sorted by this key.

When two parts merge into one (a background process called merging), ReplacingMergeTree scans rows with the same ORDER BY key value and keeps only one. Which one? By default β€” the last one by insertion time. But you can specify a numeric version column, then the row with the maximum version value remains.

Git analogy: ReplacingMergeTree during merge behaves like Git when you resolve a conflict: out of two changes to the same file, the latest one is kept (if you don't specify a strategy explicitly). Only here the file is a row in the table, and the key is ORDER BY.

Google AdInline article slot

Versioning: How ReplacingMergeTree(version) Changes the Rules

Syntax: ReplacingMergeTree(version_column). If version_column is an integer (UInt* or DateTime), the row with the largest value remains. This gives manual control: you can explicitly specify which version "wins."

Example: we send bets with updated_at = now(). On a resend, updated_at will be slightly larger. Merge will keep the fresher one. If you don't specify version, ClickHouse picks the last one that arrived β€” which may not be the most recent by business logic, just the last physical insert. The difference matters.

3. CREATE TABLE with ReplacingMergeTree β€” Breaking It Down

-- Create a table for bets with deduplication
CREATE TABLE bets_dedup
(
    user_id    UInt64,           -- Player ID (who owns the bet)
    bet_id     String,           -- Unique bet ID (generated on the client)
    amount     Decimal(10,2),    -- Amount in rubles
    created_at DateTime,         -- Bet creation time
    updated_at DateTime          -- Last update time (for version)
)
ENGINE = ReplacingMergeTree(updated_at)   -- Engine with version by updated_at
ORDER BY (user_id, bet_id)                -- Deduplication key: (user_id, bet_id)

What's happening line by line:

  • ENGINE = ReplacingMergeTree(updated_at) β€” specifies that this is ReplacingMergeTree, and the updated_at column will be used as the version. During merge, of two rows with the same ORDER BY, the one with the larger updated_at (newer) remains. If updated_at is equal β€” the last physically inserted one remains (but it's better not to rely on that).

  • ORDER BY (user_id, bet_id) β€” the most important parameter! This set of columns defines what counts as a duplicate. Two rows are considered duplicates if they have the same values for all columns in ORDER BY. Here: a bet from user user_id with bet_id β€” is unique. If two rows with user_id=123, bet_id='abc-456' arrive β€” they'll merge into one.

Why ORDER BY and not PRIMARY KEY? In ClickHouse, PRIMARY KEY doesn't have to be unique. It's a hint for the index, while ORDER BY is the physical order on disk. ReplacingMergeTree relies on ORDER BY, even if PRIMARY KEY is shorter. If you don't specify PRIMARY KEY, it matches ORDER BY.

What if ORDER BY is too broad? For example, include amount. Then two bets with different amounts (even identical in user_id, bet_id) won't be considered duplicates β€” both will remain. Deduplication won't work. Pitfall #1 (we'll come back to it at the end).

4. Why SELECT May Return Duplicates Before Merge β€” and How to Live with It

The main nuance: ReplacingMergeTree removes duplicates only during merge of data parts. This is a background process that doesn't happen instantly. Between inserting duplicates and their physical removal, it can take from a few seconds to several hours (depending on settings and load).

What does this mean in practice?

Let's insert two duplicates:

-- First insert
INSERT INTO bets_dedup VALUES (123, 'bet-001', 1000, now(), now());

-- After 5 seconds β€” second (server didn't get confirmation and sent again)
INSERT INTO bets_dedup VALUES (123, 'bet-001', 1000, now(), now() + interval 5 second);

Now run a regular SELECT * FROM bets_dedup WHERE user_id = 123. What will we see? Two rows. Because merge hasn't happened yet. The data is in different parts. Each part is sorted by ORDER BY internally, but duplicates can be in different parts.

How to get guaranteed one row? Use FINAL:

SELECT * FROM bets_dedup FINAL WHERE user_id = 123;

FINAL forces ClickHouse to on the fly merge all parts for this query, applying the ReplacingMergeTree logic. You'll get one row β€” with the maximum updated_at (or the last by insertion time if no version).

Why is FINAL slow? ClickHouse reads all parts of the table, sorts them in memory by the ORDER BY key, removes duplicates, and only then returns the result. On large tables (billions of rows), this can take seconds or minutes. The optimizer can't use indexes efficiently β€” it has to scan a lot of data.

Advice: Don't use FINAL in real-time on large tables. Use it for:

  • Point queries for a single user_id (the index still helps).
  • Background tasks where time isn't critical (nightly reports).
  • Small tables (up to millions of rows).

For production loads, there's a better pattern β€” a materialized view without FINAL.

5. Performance of FINAL β€” When It's Acceptable, When It's Not

When FINAL is OK:

  • The table is small (up to 10–20 million rows per server).
  • You're querying a single user by index (WHERE user_id = specific).
  • You have a background aggregation once an hour, and 10 seconds of waiting is fine.
  • Exporting data once a day for a report.

When FINAL is a killer:

  • Table >100 million rows.
  • Query without filtering (SELECT * FROM table FINAL) β€” ClickHouse will read everything.
  • High-load OLTP-like scenario (tens of queries per second with FINAL).
  • Frequent updates to the same keys β€” many parts accumulate, FINAL reads them all.

Analogy: SELECT ... FINAL is like manually sorting through all papers in an archive to find the latest version of a document, instead of looking at a special "current versions log." It works, but not for every client request.

How to check if a query uses FINAL?

ClickHouse has the EXPLAIN command:

EXPLAIN SELECT * FROM bets_dedup FINAL WHERE user_id = 123;

Look for ReadFromMergeTree with the final flag. If you see it β€” the query honestly goes through parts.

6. Pattern: Background Aggregation Without FINAL via Materialized View

This is my favorite way to bypass FINAL. The idea: let ReplacingMergeTree live its life, duplicates gradually collapse in the background. For reading, we create a materialized view that is periodically rebuilt and contains already "clean" data without duplicates.

How it looks:

-- 1. Base table β€” dirty, with duplicates
CREATE TABLE bets_raw
(
    user_id UInt64,
    bet_id String,
    amount Decimal(10,2),
    created_at DateTime,
    updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY (user_id, bet_id);

-- 2. Target table β€” clean, without duplicates
CREATE TABLE bets_clean
(
    user_id UInt64,
    bet_id String,
    amount Decimal(10,2),
    created_at DateTime,
    updated_at DateTime
)
ENGINE = MergeTree()                    -- Regular MergeTree without deduplication
ORDER BY (user_id, bet_id);

-- 3. Materialized view β€” transfers data on insert
CREATE MATERIALIZED VIEW bets_mv TO bets_clean AS
SELECT
    user_id,
    argMax(amount, updated_at) AS amount,      -- take amount from row with max updated_at
    argMax(created_at, updated_at) AS created_at,
    max(updated_at) AS updated_at
FROM bets_raw
GROUP BY user_id, bet_id;   -- Group by deduplication key

Key points explained:

  • argMax(amount, updated_at) β€” an aggregate function that returns the amount value from the row with the largest updated_at. If we have duplicates with different updated_at (and different amount β€” e.g., the bet amount changed), the freshest amount remains. This is analogous to manual version control.

  • GROUP BY user_id, bet_id β€” here we explicitly say: "consider the combination user+bet ID as a duplicate." Now there's no need to wait for merge β€” each INSERT into bets_raw immediately (almost) triggers a recalculation in bets_clean via bets_mv.

  • Important limitation: Materialized views in ClickHouse process data in batches β€” each insert separately. If a single insert contains two duplicates (user_id, bet_id) β€” they'll collapse within the batch. If duplicates come in different inserts β€” bets_clean may contain temporary duplicates until bets_raw merges. For perfect cleanliness, you either need to use FINAL when reading from bets_raw, or periodically run OPTIMIZE TABLE bets_raw (forced merge).

Analogy: It's like having a draft (bets_raw) where you put all corrections, and a secretary who every 5 minutes retypes a clean copy (bets_clean) without errors. Readers only look at the clean copy β€” fast and without duplicates.

7. ReplacingMergeTree(version) with Monotonically Increasing Version β€” Update Semantics

A regular ReplacingMergeTree simply keeps the "last arrived" row. This is bad if old data can arrive after new data (e.g., due to network delays). Solution: use a version column that increases monotonically (e.g., timestamp or sequence ID).

Example: player balance table with deposit history

CREATE TABLE player_balance
(
    user_id        UInt64,
    transaction_id String,        -- Unique transaction ID (UUID)
    amount         Int64,         -- Balance change (can be negative)
    balance_after  Int64,         -- Balance after transaction
    event_time     DateTime,      -- Event time on the client
    ingestion_time DateTime       -- Insertion time into ClickHouse (version)
)
ENGINE = ReplacingMergeTree(ingestion_time)   -- Version = insertion time
ORDER BY (user_id, transaction_id);

Now even if transaction tx-001 arrives twice, but with different ingestion_time, the one inserted later (with larger ingestion_time) remains. This protects against "late duplicates" β€” when the first insert was at 12:00, the second at 12:05 (repeat), but due to a network glitch the second arrived at the server before the first. Without version, the earlier one (by insertion time) would remain β€” which could be the wrong one.

What does "monotonically increasing" mean? With each new insert, the ingestion_time value must be greater than or equal to previous ones. Use now() (current time on the ClickHouse server) or an atomic counter (e.g., from ZooKeeper). Don't rely on client time β€” clocks can jump.

8. Full Example: Deduplication of Balance Top-ups by transaction_id

Let's put it all together. We have a microservice that accepts balance top-ups from a payment system. The payment system sends webhooks (HTTP calls) β€” sometimes duplicates.

-- Step 1: Create a table for raw events
CREATE TABLE balance_events
(
    user_id        UInt64,
    transaction_id String,        -- Unique ID from payment system
    amount         Int64,         -- +1000 rub
    event_time     DateTime,      -- Time of money deduction from user
    inserted_at    DateTime DEFAULT now()  -- Automatically set on insert
)
ENGINE = ReplacingMergeTree(inserted_at)
ORDER BY (user_id, transaction_id);   -- Deduplication by pair (user, transaction)

-- Step 2: Insert data (suppose a duplicate arrives)
INSERT INTO balance_events (user_id, transaction_id, amount, event_time) 
VALUES (1, 'pay_001', 1000, '2025-06-01 10:00:00');

-- After a minute, a duplicate arrives (inserted_at will be set automatically as now() + 60 sec)
INSERT INTO balance_events (user_id, transaction_id, amount, event_time) 
VALUES (1, 'pay_001', 1000, '2025-06-01 10:00:00');

-- Step 3: Read without FINAL β€” we'll see 2 rows (but only if they haven't merged yet)
SELECT * FROM balance_events WHERE user_id = 1;
-- Result: two rows with same user_id, transaction_id, amount

-- Step 4: Read with FINAL β€” we see one row (with max inserted_at)
SELECT * FROM balance_events FINAL WHERE user_id = 1;
-- Result: one row

Why is transaction_id in ORDER BY not enough? Because two different users might have the same transaction_id (e.g., each payment system has its own counter). Adding user_id guarantees uniqueness within a user. If the system generates global UUIDs (550e8400-e29b-41d4-a716-446655440000) β€” you can use ORDER BY transaction_id alone, one UUID is enough.

9. Comparison with CollapsingMergeTree

CollapsingMergeTree is another engine for handling changes. It stores "plus" and "minus" pairs and collapses them during merge.

Key differences:

Characteristic ReplacingMergeTree CollapsingMergeTree
Mechanism Keeps one row from duplicates Collapses pairs (+1 and -1)
Purpose Deduplication of inserts Updating aggregates (e.g., shopping cart)
Version needed Optional (version column) Mandatory Sign flag (+1/-1)
Can history be stored Yes, all versions until merge No, pairs are destroyed
FINAL for reading Yes, without it duplicates are visible Yes, without it uncollapsed pairs are visible

When to choose ReplacingMergeTree:

  • You just need to remove duplicate rows.
  • You have a natural key for deduplication (transaction ID).
  • Data changes rarely (mostly inserts).

When to choose CollapsingMergeTree:

  • You frequently update an aggregated metric (e.g., "number of items in cart").
  • You only need to store the result, not the change history.

Example for CollapsingMergeTree:

CREATE TABLE cart_items
(
    user_id UInt64,
    product_id UInt64,
    quantity Int16,
    sign Int8  -- +1 (add), -1 (remove)
) ENGINE = CollapsingMergeTree(sign)
ORDER BY (user_id, product_id);

With ReplacingMergeTree, you would just overwrite the row with a new quantity version β€” but then you'd lose the change history. CollapsingMergeTree allows you to calculate the total (SUM(quantity * sign)) even without FINAL.

10. Common Pitfalls β€” and How to Avoid Them

Pitfall #1: ORDER BY Doesn't Include All Unique Fields

-- BAD: using only user_id
CREATE TABLE bets_bad ENGINE = ReplacingMergeTree ORDER BY user_id;

-- Inserted two bets for the same user with different bet_id
INSERT INTO bets_bad VALUES (1, 'bet_001', 100);
INSERT INTO bets_bad VALUES (1, 'bet_002', 200);

-- During merge they WILL MERGE into one row β€” because ORDER BY (user_id) is the same!
-- Lost bet_002.

Correct: Include in ORDER BY all columns that make a row unique β€” usually a surrogate ID (transaction_id) or a combination (user_id, bet_id).

Pitfall #2: Naive Hope for Instant Deduplication

Newbies write INSERT with a duplicate and immediately SELECT without FINAL β€” see duplicates. They get disappointed in ClickHouse. Remember: deduplication is asynchronous. If you need instant consistency β€” use FINAL or the materialized view pattern.

Pitfall #3: Using a Version That Isn't Monotonic

-- BAD: version is client time
CREATE TABLE events ENGINE = ReplacingMergeTree(client_time) ORDER BY (id);

-- Client's clock is behind, they send an old version after a new one
-- During merge, the wrong (old) row will remain

Solution: Use now() on the ClickHouse side or a hardware counter.

Pitfall #4: Optimism About FINAL on Large Data

I had a case: a developer enabled FINAL in all reports on a 2-billion-row table. Queries started timing out after 300 seconds. Had to rewrite to aggregation with GROUP BY and argMax.

Golden Rule: If you're reading more than 10% of a table via FINAL β€” you're doing something wrong. Use materialized views or rethink the architecture.

Pitfall #5: ReplacingMergeTree Without ORDER BY

ClickHouse won't let you create a table without ORDER BY. But you can specify ORDER BY tuple() (empty tuple). Then all rows in the table are considered duplicates β€” only one row will remain after the first merge. Almost never needed.

What's Next β€” Links to Related Articles

Now that you've mastered ReplacingMergeTree, here are the next topics to explore:

  1. How to optimize merges β€” settings like merge_with_ttl_timeout, number_of_free_entries_in_pool_to_lower_max_size_of_merge (sounds scary but useful).

  2. Deduplication at the INSERT level β€” the ReplicatedReplacingMergeTree engine with ZooKeeper. This is another level: duplicates are cut off immediately on insert, but at the cost of delays and complexity.

  3. Alternative: VersionedCollapsingMergeTree β€” a hybrid that supports versioning and collapsing simultaneously.

  4. Materialized views in detail β€” how to build multi-level aggregations to avoid FINAL entirely.

And finally: ReplacingMergeTree is a powerful tool, but it's not about "delete duplicates immediately." It's about "data will eventually become clean, and you work with it in the meantime." If you need strict uniqueness (like PRIMARY KEY in PostgreSQL) β€” ClickHouse isn't the best choice. But for 99% of analytical tasks with repeated inserts β€” it's a lifesaver.


← Previous:
β†’ Next: SummingMergeTree and AggregatingMergeTree: Painless Incremental Aggregation

β€” Editorial Team

Advertisement 728x90

Read Next