Tombstones, Timestamps, and Gossip: Cassandra’s Way of Staying Consistent
• Databases, Distributed Systems
Cassandra gets a lot of things “right” for large, write‑heavy, always‑on systems. The trick isn’t one feature but a few ideas working together: immutable SSTables, last‑write‑wins timestamps, explicit tombstones for deletes, background compaction, read‑repair and anti‑entropy repair, and a lightweight gossip protocol to keep the cluster in the same reality. Here’s a conversational tour of how those pieces fit—and patterns worth stealing for other software.
A quick mental model: writes, reads, and files
On write, Cassandra appends to a commit log (durability) and updates an in‑memory memtable. When the memtable fills, it flushes to disk as an immutable SSTable. Because SSTables never change in place, you never rewrite data on random disks—great for throughput and predictability.
On read, Cassandra may need to consult several SSTables for the same row or partition. It merges those views on the fly using per‑cell timestamps (and tombstones—more below) to figure out which value wins. Background compaction later makes those merges cheaper by coalescing overlapping SSTables.
Client
  |  write(k=v @ t100)
  v
Coordinator
  |---- to R1 ---->    |---- to R2 ---->    |---- to R3 ---->

Replica Rx (each):
  +------------------------------+
  |  Commit Log (append only)    |
  +------------------------------+
  |  Memtable (in-memory index)  |
  +------------------------------+
              | (flush)
              v
  +------------------------------+
  |  SSTable A (immutable)       |   ... later: A, B, C
  +------------------------------+
              |
              v
  Compaction: [A] [B] [C] ==> [D]
    - drop overwritten cells by timestamp
    - keep tombstones until gc_grace_seconds expires
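If it helps to see that write path as code, here is a minimal, self‑contained sketch in Python. It is not Cassandra’s storage engine; the class names, the flush_threshold, and the microsecond timestamps are illustrative assumptions, but the shape is the same: append to a log for durability, buffer in a memtable, and flush to tables that are never modified in place.

import time

class TinySSTable:
    """Immutable snapshot of a flushed memtable: key -> (timestamp, value)."""
    def __init__(self, rows):
        self._rows = dict(rows)              # copied once, never mutated again
    def get(self, key):
        return self._rows.get(key)

class TinyStorageEngine:
    """Illustrative write path: commit log -> memtable -> immutable SSTables."""
    def __init__(self, flush_threshold=4):
        self.commit_log = []                 # durability: append-only record of writes
        self.memtable = {}                   # recent writes, mutable and in memory
        self.sstables = []                   # flushed, immutable tables (newest last)
        self.flush_threshold = flush_threshold

    def write(self, key, value, ts=None):
        ts = ts if ts is not None else int(time.time() * 1_000_000)
        self.commit_log.append((key, ts, value))
        self.memtable[key] = (ts, value)
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        self.sstables.append(TinySSTable(self.memtable))
        self.memtable = {}                   # commit-log segments could now be recycled

    def read(self, key):
        # A read may touch the memtable plus several SSTables; newest timestamp wins.
        candidates = [t.get(key) for t in self.sstables]
        candidates.append(self.memtable.get(key))
        candidates = [c for c in candidates if c is not None]
        return max(candidates, key=lambda c: c[0])[1] if candidates else None

Notice that read() has to consult the memtable plus every flushed table that might hold the key; compaction (below) is what keeps that list short.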
Timestamps: resolving conflicts without a coordinator
In a distributed system, two replicas can receive different updates in different orders. Cassandra’s default conflict rule is simple: last‑write‑wins (LWW) based on a microsecond‑precision timestamp associated with each cell. Clients may send timestamps explicitly; otherwise the server assigns them.
Because the value with the highest timestamp wins, replicas don’t need a global lock or a central sequencer. They can accept writes independently and reconcile later during reads, repairs, or compaction. The trade‑off is that LWW is not a CRDT: if two updates are concurrent, the “later” one (by timestamp) clobbers the other rather than merging their intent. That’s usually fine for operational data, but it’s a design choice to make consciously.
Node A: k = 4 @ t100
Node B: k = 5 @ t101
Merge: k = 5 @ t101 (highest timestamp wins)
Delete at t102 writes a tombstone:
Node C: k = ⟂ @ t102 (tombstone beats any older value)
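The merge rule itself is tiny. Here is a sketch of last‑write‑wins reconciliation, with a tombstone modeled as just another cell value; the names are illustrative, not Cassandra’s internal representation.

TOMBSTONE = object()   # sentinel meaning "deleted at this timestamp"

def lww_merge(cells):
    """cells: (timestamp, value) pairs for one key, gathered from replicas/SSTables.
    Last-write-wins: the cell with the highest timestamp is the truth."""
    return max(cells, key=lambda c: c[0])

# Node A wrote 4 @ t100, node B wrote 5 @ t101, a delete landed @ t102.
winner = lww_merge([(100, 4), (101, 5), (102, TOMBSTONE)])
assert winner == (102, TOMBSTONE)   # the delete wins; the read reports "not found"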
Tombstones: deletes that safely propagate
Deletes in Cassandra don’t remove data immediately. They write a tombstone—a marker that says “this cell (or range) was deleted at time T.” There are a few flavors:
- Cell tombstones for single columns/values
- Partition/row tombstones for broader scopes
- Range tombstones for slices in wide rows
- TTL expiry also creates tombstones when time runs out
Why a marker? Because an immediate physical delete would be dangerous: another replica might still hold an older version and later propagate it back. The tombstone, with its timestamp, beats that older value during reads and repairs, preventing resurrection.
Timeline
t100: write k=4
t101: write k=5
t102: delete k → tombstone(k @ t102)
Replicas right after t102:
R1: tombstone @ t102
R2: tombstone @ t102
R3: k=5 @ t101 (missed the delete)
Read at QUORUM → merge:
tombstone @ t102 wins → return "not found"
coordinator repairs R3 with tombstone
After gc_grace_seconds and successful repairs:
compaction drops k=5 and the tombstone (safe to purge)
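The risk the marker removes is easy to demonstrate: if the delete were applied by simply erasing the key, the replica that missed it would hold the only copy left, and the next merge would bring the value back. A small sketch under the same last‑write‑wins rule as above:

TOMBSTONE = object()

def merge(cells):
    # Highest timestamp wins; an empty list means the key does not exist at all.
    return max(cells, key=lambda c: c[0]) if cells else None

# With a tombstone: R3's stale write loses to the delete marker from R1/R2.
assert merge([(102, TOMBSTONE), (101, 5)])[1] is TOMBSTONE   # read: "not found"

# Without a marker (physical delete): R1/R2 contribute nothing, so R3's old
# value is the only candidate and the delete is silently undone.
assert merge([(101, 5)]) == (101, 5)                         # resurrection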
Compaction: cleaning up once it’s safe
Compaction rewrites sets of SSTables into fewer, larger ones—dropping overwritten values and, eventually, purging tombstones. Cassandra uses strategies like Size‑Tiered, Leveled, and Time‑Window compaction to balance write amplification and read amplification for different workloads.
The key safety valve is gc_grace_seconds. Cassandra waits at least this long before discarding tombstones during compaction. That delay gives the cluster time to run repairs so that any replica missing the tombstone can learn about it. If you purge too early, an out‑of‑date replica could re‑introduce deleted data.
Before (overlapping data): [A] [B] [C] [D]
After (merged & sorted): [E]
Rules during merge:
- Keep the newest cell by timestamp
- Retain tombstones until gc_grace_seconds
- Drop fully shadowed older cells
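Those rules fit in a few lines. A hedged sketch of the merge step (timestamps treated as plain seconds here; real compaction strategies also decide which SSTables to merge and when, which this ignores):

TOMBSTONE = object()

def compact(sstables, now, gc_grace_seconds):
    """sstables: list of dicts mapping key -> (timestamp, value).
    Returns one merged table following the three rules above."""
    merged = {}
    for table in sstables:
        for key, cell in table.items():
            # Keep the newest cell per key; anything it shadows is dropped.
            if key not in merged or cell[0] > merged[key][0]:
                merged[key] = cell
    # Purge a tombstone only once it is older than gc_grace_seconds,
    # i.e. once repairs have had time to spread it to every replica.
    return {k: (ts, v) for k, (ts, v) in merged.items()
            if not (v is TOMBSTONE and now - ts > gc_grace_seconds)}

a, b, c = {"k": (100, 4)}, {"k": (101, 5)}, {"k": (102, TOMBSTONE)}
assert "k" in compact([a, b, c], now=105, gc_grace_seconds=60)   # tombstone retained
assert compact([a, b, c], now=500, gc_grace_seconds=60) == {}    # k purged entirely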
Reads and repairs: reconciling on the fly
Two mechanisms keep replicas converging:
- Read path reconciliation: a read at QUORUM (or higher) merges versions by timestamp and tombstone. If replicas disagree, the coordinator can perform a read repair to write the reconciled value to the out‑of‑date replica(s).
- Anti‑entropy repair: an explicit background task (node‑to‑node) that compares data segments and streams the differences. Modern clusters favor incremental repair to limit scope and overhead.
Together with tombstones and LWW, these make progress inevitable: newer data wins, deletes beat stale values, and compaction eventually makes the “merged” view inexpensive to read.
Coordinator issues read to 3 replicas (RF=3, CL=QUORUM)
R1 → k=5 @ t101
R2 → k=5 @ t101
R3 → k=4 @ t100 (stale)
Merge result → k=5 @ t101 (return to client)
Repair: write k=5 @ t101 to R3
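From the coordinator’s point of view that is: gather responses, pick the winner by timestamp, hand the winner back to the client, and queue writes for whichever replicas answered with something older. A sketch with made‑up names, not the real coordinator code:

TOMBSTONE = object()

def quorum_read(responses):
    """responses: replica name -> (timestamp, value) for one key.
    Returns (result_for_client, repairs), where repairs maps stale replicas
    to the winning cell that should be written back to them."""
    winner = max(responses.values(), key=lambda c: c[0])
    repairs = {r: winner for r, cell in responses.items() if cell[0] < winner[0]}
    result = None if winner[1] is TOMBSTONE else winner[1]
    return result, repairs

value, repairs = quorum_read({"R1": (101, 5), "R2": (101, 5), "R3": (100, 4)})
assert value == 5 and repairs == {"R3": (101, 5)}   # R3 gets read-repaired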
Gossip: how the cluster stays in touch
Cassandra nodes use a lightweight gossip protocol to exchange membership and liveness information. Each node periodically talks to a few peers, spreading state quickly without a central coordinator. A phi‑accrual failure detector estimates whether a node is down based on heartbeat inter‑arrival times rather than a fixed timeout.
Gossip doesn’t move your data—that’s handled by replication and streaming—but it tells the cluster who’s alive, which tokens they own, and when to reroute traffic. It’s the cluster’s shared sense of reality.
Round 0: A
Round 1: A → B
Round 2: A → C    B → D
Round 3: A → D    B → C    C → E    D → F
A few rounds spread state to every node without central coordination.
Failure suspicion uses phi accrual, not fixed timeouts.
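The reason a few rounds suffice is the exponential fan‑out: every node that already knows a piece of state pushes it to a random peer each round, so the informed set roughly doubles until it covers the cluster. A toy simulation (not Cassandra’s Gossiper, and it ignores the pull and anti‑entropy parts of each round):

import math, random

def rounds_to_spread(n_nodes, seed=1):
    """Rounds until a rumor starting at node 0 reaches all n_nodes,
    with each informed node pushing to one random peer per round."""
    random.seed(seed)
    informed, rounds = {0}, 0
    while len(informed) < n_nodes:
        for _ in list(informed):
            informed.add(random.randrange(n_nodes))   # peer may already know; that's fine
        rounds += 1
    return rounds

for n in (10, 100, 1000):
    print(f"{n:5d} nodes: {rounds_to_spread(n)} rounds (log2(n) ≈ {math.log2(n):.1f})")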
Consistency levels: choosing your trade‑offs
Per‑query consistency levels (e.g., ONE, QUORUM, ALL) let you trade availability and latency for stronger reads/writes when needed. Higher levels reduce the window where conflicting versions exist, but Cassandra’s tombstones, timestamps, and repairs mean even ONE eventually converges.
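With the DataStax Python driver, the level is set per statement. A minimal sketch, assuming a local node and an existing demo keyspace and users table (both hypothetical here):

from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])          # contact point; assumes a local node
session = cluster.connect("demo")         # hypothetical keyspace

# Cheap, highly available write: one replica acknowledging is enough.
session.execute(SimpleStatement(
    "INSERT INTO users (id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.ONE), (42, "ada"))

# Stronger read: a majority of replicas must answer, shrinking the stale window.
row = session.execute(SimpleStatement(
    "SELECT name FROM users WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM), (42,)).one()

The useful rule of thumb is overlap: when read and write replica counts sum to more than the replication factor (for example QUORUM reads plus QUORUM writes at RF=3), every read intersects at least one replica that saw the latest write.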
Stealable ideas for other systems
- Immutable storage + background merges: write fast, reconcile later.
- Explicit delete markers: avoid resurrection; purge only after a safety window.
- Simple conflict rule: LWW is easy to reason about; use CRDTs only when you truly need merge semantics.
- Health via gossip: spread liveness cheaply; avoid single‑point coordinators.
- Operator dials: compaction strategy, gc_grace_seconds, and consistency level are powerful workload levers.
None of these ideas are Cassandra‑only. They’re general patterns for building systems that stay fast under stress while remaining correct enough for the domain.
Notes and practical cautions
- Very high tombstone density can hurt reads until compaction catches up; model deletes thoughtfully.
- Don’t set gc_grace_seconds too low unless you’re confident repairs complete within that window.
- Be explicit about timestamp sources; clock skew plus client‑supplied timestamps can surprise you.