ReplacingMergeTree: How to Beat Duplicates in ClickHouse Without Pain
1. Why ReplacingMergeTree Is Needed β The Real-World Duplicate Problem
Imagine you're developing an online casino. A player clicks the "Place Bet" button β 1000 rubles on black. At that moment, the server processing the request suddenly crashes (overheating, network failure, who knows). The client didn't get a response and thinks: "The bet didn't go through." The player clicks again. The server recovers and accepts both requests. In the database β two identical bets. The player is furious: 2000 rubles were deducted instead of 1000.
This is a classic idempotency problem (from Latin idem β same, potens β capable). An operation is idempotent if repeating it yields the same result as doing it once. In the database world, we need a mechanism that figures out: "I've already seen this bet, I'll ignore the second version."
In ClickHouse, there's ReplacingMergeTree for this. It's a table engine that automatically removes duplicates during merging of data parts. But I'll warn you upfront: it's not magic β it has quirks we'll discuss.
Real-life analogy: ReplacingMergeTree is like a secretary who keeps a meeting log. People come to you with requests. Sometimes the same client brings two identical applications (e.g., missed a train and asks for a refund, then calls again with the same request). The secretary doesn't throw out duplicates at the door β they just put all papers in a folder. Once a day, they go through the folder and keep only the latest application from each client. If someone asks "how many applications from Ivanov?" before the sorting β they'll see two. After β one.
2. How ReplacingMergeTree Works β Brick by Brick
Duplicates Arise from Unreliable Delivery
ClickHouse was originally designed for large-scale analytics, where an occasional miss or duplicate isn't critical. But then people started using it for critical data β and got burned. ReplacingMergeTree is the answer to that pain.
Why do duplicates appear at all?
- The client sent data, didn't get an acknowledgment (timeout), and sent again.
- The queue system (Kafka, RabbitMQ) gives
at-least-onceguarantee β at least one delivery, possible repeats. - A bug in the ETL process (Extract, Transform, Load) β the pipeline was run twice.
Mechanism: Merge by ORDER BY Key
When creating a table with ReplacingMergeTree, you must specify a sorting key β ORDER BY (column1, column2). This is not a primary key in the classical sense (like in PostgreSQL), but a way to physically order data on disk. ClickHouse stores data in parts β chunks sorted by this key.
When two parts merge into one (a background process called merging), ReplacingMergeTree scans rows with the same ORDER BY key value and keeps only one. Which one? By default β the last one by insertion time. But you can specify a numeric version column, then the row with the maximum version value remains.
Git analogy: ReplacingMergeTree during merge behaves like Git when you resolve a conflict: out of two changes to the same file, the latest one is kept (if you don't specify a strategy explicitly). Only here the file is a row in the table, and the key is ORDER BY.
Versioning: How ReplacingMergeTree(version) Changes the Rules
Syntax: ReplacingMergeTree(version_column). If version_column is an integer (UInt* or DateTime), the row with the largest value remains. This gives manual control: you can explicitly specify which version "wins."
Example: we send bets with updated_at = now(). On a resend, updated_at will be slightly larger. Merge will keep the fresher one. If you don't specify version, ClickHouse picks the last one that arrived β which may not be the most recent by business logic, just the last physical insert. The difference matters.
3. CREATE TABLE with ReplacingMergeTree β Breaking It Down
-- Create a table for bets with deduplication
CREATE TABLE bets_dedup
(
user_id UInt64, -- Player ID (who owns the bet)
bet_id String, -- Unique bet ID (generated on the client)
amount Decimal(10,2), -- Amount in rubles
created_at DateTime, -- Bet creation time
updated_at DateTime -- Last update time (for version)
)
ENGINE = ReplacingMergeTree(updated_at) -- Engine with version by updated_at
ORDER BY (user_id, bet_id) -- Deduplication key: (user_id, bet_id)
What's happening line by line:
ENGINE = ReplacingMergeTree(updated_at)β specifies that this is ReplacingMergeTree, and theupdated_atcolumn will be used as the version. During merge, of two rows with the sameORDER BY, the one with the largerupdated_at(newer) remains. Ifupdated_atis equal β the last physically inserted one remains (but it's better not to rely on that).ORDER BY (user_id, bet_id)β the most important parameter! This set of columns defines what counts as a duplicate. Two rows are considered duplicates if they have the same values for all columns in ORDER BY. Here: a bet from user user_id with bet_id β is unique. If two rows withuser_id=123, bet_id='abc-456'arrive β they'll merge into one.
Why ORDER BY and not PRIMARY KEY? In ClickHouse, PRIMARY KEY doesn't have to be unique. It's a hint for the index, while ORDER BY is the physical order on disk. ReplacingMergeTree relies on ORDER BY, even if PRIMARY KEY is shorter. If you don't specify PRIMARY KEY, it matches ORDER BY.
What if ORDER BY is too broad? For example, include amount. Then two bets with different amounts (even identical in user_id, bet_id) won't be considered duplicates β both will remain. Deduplication won't work. Pitfall #1 (we'll come back to it at the end).
4. Why SELECT May Return Duplicates Before Merge β and How to Live with It
The main nuance: ReplacingMergeTree removes duplicates only during merge of data parts. This is a background process that doesn't happen instantly. Between inserting duplicates and their physical removal, it can take from a few seconds to several hours (depending on settings and load).
What does this mean in practice?
Let's insert two duplicates:
-- First insert
INSERT INTO bets_dedup VALUES (123, 'bet-001', 1000, now(), now());
-- After 5 seconds β second (server didn't get confirmation and sent again)
INSERT INTO bets_dedup VALUES (123, 'bet-001', 1000, now(), now() + interval 5 second);
Now run a regular SELECT * FROM bets_dedup WHERE user_id = 123. What will we see? Two rows. Because merge hasn't happened yet. The data is in different parts. Each part is sorted by ORDER BY internally, but duplicates can be in different parts.
How to get guaranteed one row? Use FINAL:
SELECT * FROM bets_dedup FINAL WHERE user_id = 123;
FINAL forces ClickHouse to on the fly merge all parts for this query, applying the ReplacingMergeTree logic. You'll get one row β with the maximum updated_at (or the last by insertion time if no version).
Why is FINAL slow? ClickHouse reads all parts of the table, sorts them in memory by the ORDER BY key, removes duplicates, and only then returns the result. On large tables (billions of rows), this can take seconds or minutes. The optimizer can't use indexes efficiently β it has to scan a lot of data.
Advice: Don't use FINAL in real-time on large tables. Use it for:
- Point queries for a single
user_id(the index still helps). - Background tasks where time isn't critical (nightly reports).
- Small tables (up to millions of rows).
For production loads, there's a better pattern β a materialized view without FINAL.
5. Performance of FINAL β When It's Acceptable, When It's Not
When FINAL is OK:
- The table is small (up to 10β20 million rows per server).
- You're querying a single user by index (WHERE user_id = specific).
- You have a background aggregation once an hour, and 10 seconds of waiting is fine.
- Exporting data once a day for a report.
When FINAL is a killer:
- Table >100 million rows.
- Query without filtering (SELECT * FROM table FINAL) β ClickHouse will read everything.
- High-load OLTP-like scenario (tens of queries per second with FINAL).
- Frequent updates to the same keys β many parts accumulate, FINAL reads them all.
Analogy: SELECT ... FINAL is like manually sorting through all papers in an archive to find the latest version of a document, instead of looking at a special "current versions log." It works, but not for every client request.
How to check if a query uses FINAL?
ClickHouse has the EXPLAIN command:
EXPLAIN SELECT * FROM bets_dedup FINAL WHERE user_id = 123;
Look for ReadFromMergeTree with the final flag. If you see it β the query honestly goes through parts.
6. Pattern: Background Aggregation Without FINAL via Materialized View
This is my favorite way to bypass FINAL. The idea: let ReplacingMergeTree live its life, duplicates gradually collapse in the background. For reading, we create a materialized view that is periodically rebuilt and contains already "clean" data without duplicates.
How it looks:
-- 1. Base table β dirty, with duplicates
CREATE TABLE bets_raw
(
user_id UInt64,
bet_id String,
amount Decimal(10,2),
created_at DateTime,
updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY (user_id, bet_id);
-- 2. Target table β clean, without duplicates
CREATE TABLE bets_clean
(
user_id UInt64,
bet_id String,
amount Decimal(10,2),
created_at DateTime,
updated_at DateTime
)
ENGINE = MergeTree() -- Regular MergeTree without deduplication
ORDER BY (user_id, bet_id);
-- 3. Materialized view β transfers data on insert
CREATE MATERIALIZED VIEW bets_mv TO bets_clean AS
SELECT
user_id,
argMax(amount, updated_at) AS amount, -- take amount from row with max updated_at
argMax(created_at, updated_at) AS created_at,
max(updated_at) AS updated_at
FROM bets_raw
GROUP BY user_id, bet_id; -- Group by deduplication key
Key points explained:
argMax(amount, updated_at)β an aggregate function that returns theamountvalue from the row with the largestupdated_at. If we have duplicates with differentupdated_at(and differentamountβ e.g., the bet amount changed), the freshest amount remains. This is analogous to manual version control.GROUP BY user_id, bet_idβ here we explicitly say: "consider the combination user+bet ID as a duplicate." Now there's no need to wait for merge β eachINSERTintobets_rawimmediately (almost) triggers a recalculation inbets_cleanviabets_mv.Important limitation: Materialized views in ClickHouse process data in batches β each insert separately. If a single insert contains two duplicates
(user_id, bet_id)β they'll collapse within the batch. If duplicates come in different inserts βbets_cleanmay contain temporary duplicates untilbets_rawmerges. For perfect cleanliness, you either need to useFINALwhen reading frombets_raw, or periodically runOPTIMIZE TABLE bets_raw(forced merge).
Analogy: It's like having a draft (bets_raw) where you put all corrections, and a secretary who every 5 minutes retypes a clean copy (bets_clean) without errors. Readers only look at the clean copy β fast and without duplicates.
7. ReplacingMergeTree(version) with Monotonically Increasing Version β Update Semantics
A regular ReplacingMergeTree simply keeps the "last arrived" row. This is bad if old data can arrive after new data (e.g., due to network delays). Solution: use a version column that increases monotonically (e.g., timestamp or sequence ID).
Example: player balance table with deposit history
CREATE TABLE player_balance
(
user_id UInt64,
transaction_id String, -- Unique transaction ID (UUID)
amount Int64, -- Balance change (can be negative)
balance_after Int64, -- Balance after transaction
event_time DateTime, -- Event time on the client
ingestion_time DateTime -- Insertion time into ClickHouse (version)
)
ENGINE = ReplacingMergeTree(ingestion_time) -- Version = insertion time
ORDER BY (user_id, transaction_id);
Now even if transaction tx-001 arrives twice, but with different ingestion_time, the one inserted later (with larger ingestion_time) remains. This protects against "late duplicates" β when the first insert was at 12:00, the second at 12:05 (repeat), but due to a network glitch the second arrived at the server before the first. Without version, the earlier one (by insertion time) would remain β which could be the wrong one.
What does "monotonically increasing" mean? With each new insert, the ingestion_time value must be greater than or equal to previous ones. Use now() (current time on the ClickHouse server) or an atomic counter (e.g., from ZooKeeper). Don't rely on client time β clocks can jump.
8. Full Example: Deduplication of Balance Top-ups by transaction_id
Let's put it all together. We have a microservice that accepts balance top-ups from a payment system. The payment system sends webhooks (HTTP calls) β sometimes duplicates.
-- Step 1: Create a table for raw events
CREATE TABLE balance_events
(
user_id UInt64,
transaction_id String, -- Unique ID from payment system
amount Int64, -- +1000 rub
event_time DateTime, -- Time of money deduction from user
inserted_at DateTime DEFAULT now() -- Automatically set on insert
)
ENGINE = ReplacingMergeTree(inserted_at)
ORDER BY (user_id, transaction_id); -- Deduplication by pair (user, transaction)
-- Step 2: Insert data (suppose a duplicate arrives)
INSERT INTO balance_events (user_id, transaction_id, amount, event_time)
VALUES (1, 'pay_001', 1000, '2025-06-01 10:00:00');
-- After a minute, a duplicate arrives (inserted_at will be set automatically as now() + 60 sec)
INSERT INTO balance_events (user_id, transaction_id, amount, event_time)
VALUES (1, 'pay_001', 1000, '2025-06-01 10:00:00');
-- Step 3: Read without FINAL β we'll see 2 rows (but only if they haven't merged yet)
SELECT * FROM balance_events WHERE user_id = 1;
-- Result: two rows with same user_id, transaction_id, amount
-- Step 4: Read with FINAL β we see one row (with max inserted_at)
SELECT * FROM balance_events FINAL WHERE user_id = 1;
-- Result: one row
Why is transaction_id in ORDER BY not enough? Because two different users might have the same transaction_id (e.g., each payment system has its own counter). Adding user_id guarantees uniqueness within a user. If the system generates global UUIDs (550e8400-e29b-41d4-a716-446655440000) β you can use ORDER BY transaction_id alone, one UUID is enough.
9. Comparison with CollapsingMergeTree
CollapsingMergeTree is another engine for handling changes. It stores "plus" and "minus" pairs and collapses them during merge.
Key differences:
| Characteristic | ReplacingMergeTree | CollapsingMergeTree |
|---|---|---|
| Mechanism | Keeps one row from duplicates | Collapses pairs (+1 and -1) |
| Purpose | Deduplication of inserts | Updating aggregates (e.g., shopping cart) |
| Version needed | Optional (version column) | Mandatory Sign flag (+1/-1) |
| Can history be stored | Yes, all versions until merge | No, pairs are destroyed |
| FINAL for reading | Yes, without it duplicates are visible | Yes, without it uncollapsed pairs are visible |
When to choose ReplacingMergeTree:
- You just need to remove duplicate rows.
- You have a natural key for deduplication (transaction ID).
- Data changes rarely (mostly inserts).
When to choose CollapsingMergeTree:
- You frequently update an aggregated metric (e.g., "number of items in cart").
- You only need to store the result, not the change history.
Example for CollapsingMergeTree:
CREATE TABLE cart_items
(
user_id UInt64,
product_id UInt64,
quantity Int16,
sign Int8 -- +1 (add), -1 (remove)
) ENGINE = CollapsingMergeTree(sign)
ORDER BY (user_id, product_id);
With ReplacingMergeTree, you would just overwrite the row with a new quantity version β but then you'd lose the change history. CollapsingMergeTree allows you to calculate the total (SUM(quantity * sign)) even without FINAL.
10. Common Pitfalls β and How to Avoid Them
Pitfall #1: ORDER BY Doesn't Include All Unique Fields
-- BAD: using only user_id
CREATE TABLE bets_bad ENGINE = ReplacingMergeTree ORDER BY user_id;
-- Inserted two bets for the same user with different bet_id
INSERT INTO bets_bad VALUES (1, 'bet_001', 100);
INSERT INTO bets_bad VALUES (1, 'bet_002', 200);
-- During merge they WILL MERGE into one row β because ORDER BY (user_id) is the same!
-- Lost bet_002.
Correct: Include in ORDER BY all columns that make a row unique β usually a surrogate ID (transaction_id) or a combination (user_id, bet_id).
Pitfall #2: Naive Hope for Instant Deduplication
Newbies write INSERT with a duplicate and immediately SELECT without FINAL β see duplicates. They get disappointed in ClickHouse. Remember: deduplication is asynchronous. If you need instant consistency β use FINAL or the materialized view pattern.
Pitfall #3: Using a Version That Isn't Monotonic
-- BAD: version is client time
CREATE TABLE events ENGINE = ReplacingMergeTree(client_time) ORDER BY (id);
-- Client's clock is behind, they send an old version after a new one
-- During merge, the wrong (old) row will remain
Solution: Use now() on the ClickHouse side or a hardware counter.
Pitfall #4: Optimism About FINAL on Large Data
I had a case: a developer enabled FINAL in all reports on a 2-billion-row table. Queries started timing out after 300 seconds. Had to rewrite to aggregation with GROUP BY and argMax.
Golden Rule: If you're reading more than 10% of a table via FINAL β you're doing something wrong. Use materialized views or rethink the architecture.
Pitfall #5: ReplacingMergeTree Without ORDER BY
ClickHouse won't let you create a table without ORDER BY. But you can specify ORDER BY tuple() (empty tuple). Then all rows in the table are considered duplicates β only one row will remain after the first merge. Almost never needed.
What's Next β Links to Related Articles
Now that you've mastered ReplacingMergeTree, here are the next topics to explore:
How to optimize merges β settings like
merge_with_ttl_timeout,number_of_free_entries_in_pool_to_lower_max_size_of_merge(sounds scary but useful).Deduplication at the INSERT level β the
ReplicatedReplacingMergeTreeengine with ZooKeeper. This is another level: duplicates are cut off immediately on insert, but at the cost of delays and complexity.Alternative:
VersionedCollapsingMergeTreeβ a hybrid that supports versioning and collapsing simultaneously.Materialized views in detail β how to build multi-level aggregations to avoid
FINALentirely.
And finally: ReplacingMergeTree is a powerful tool, but it's not about "delete duplicates immediately." It's about "data will eventually become clean, and you work with it in the meantime." If you need strict uniqueness (like PRIMARY KEY in PostgreSQL) β ClickHouse isn't the best choice. But for 99% of analytical tasks with repeated inserts β it's a lifesaver.
β Previous: ClickHouse Configuration: How I Set Up Production and Didn't Shoot Myself in the Foot
β Next: SummingMergeTree and AggregatingMergeTree: Painless Incremental Aggregation
β Editorial Team
No comments yet.