By Adham Ehab in database — 05 Nov 2024

Notes on Database ( Part 3 ) - Multi-Version Concurrency Control

Introduction

In the previous post, we covered transactions, ACID properties, and isolation levels. We discussed how databases handle concurrent transactions and the different isolation levels available. But we didn't dive into how Postgres actually implements these features.

This post will cover Multi-Version Concurrency Control (MVCC), which is the mechanism Postgres uses to implement transaction isolation. We'll see how Postgres maintains multiple versions of data, how it decides which version a transaction should see, and how it cleans up old versions.

What's MVCC?

Before we dive into MVCC, let's understand why we need it. In the previous post, we talked about locks and how they help manage concurrent access to data without partial failures. But locks have a significant drawback - they make readers wait for writers and vise versa.

Imagine you're building a content platform. You have a post with millions of views, and you want to update the post's content. With a simple locking mechanism, you'd have to:

Lock the post
Update it
Release the lock

During this time, no one can read the post. This is terrible for performance and user experience. This is where MVCC comes in.

The Core Idea

Instead of making readers wait for writers, MVCC maintains multiple versions of the same data. When you update a row, Postgres doesn't overwrite the old version. Instead, it creates a new version and keeps the old one around.

This means:

Readers can see a consistent snapshot of data without being blocked by writers
Writers can modify data without blocking readers
Different transactions can see different versions of the same data based on when they started

How MVCC Works in Postgres

Transaction IDs

Every transaction in Postgres gets a unique identifier called a Transaction ID (XID). This is just a 32-bit integer that increases monotonically. When a transaction starts, it gets an XID that's greater than all currently active transactions.

XID Lifecycle

Transaction IDs in Postgres follow a lifecycle:

Assigned when transaction starts
Active while transaction runs
Moves to committed/aborted state after completion
Eventually becomes "frozen" (more on this below)

Transaction ID Wraparound

Since XIDs are 32-bit integers, they can eventually wrap around (Resets to 0).

This can cause a problem where a new transaction that just started can have an XID smaller than the recent transactions' XID. This can cause a problem with the visibility semantics (discussed below) and mess all of the isolation levels.

Postgres handles this through treating XIDs as a circular linked list, this allows the comparison operators to handle wraparound transparently:

Counter starts at 0
Moves clockwise through safe zone (green)
Warning zone at ~1B transactions (yellow)
Critical zone requiring VACUUM (red)
Final approach to max value (blue)
Wraps back to 0

If the counter reaches the red zone, Postgres will perform a VACUUM operation to clean up old versions of data. On the other hand, if the number of transactions is small, the wraparound use reset points to reset the counter to a safe value.

Freezing

Very old transactions are marked as "frozen"
Frozen tuples are visible to all transactions
Prevents age-related visibility issues

Let's look at a simple example.

Transactions gets an XID.

-- Transaction 1 (XID: 100)
BEGIN;
UPDATE users SET status = 'active' WHERE id = 1;
COMMIT;

-- Transaction 2 (XID: 101)
BEGIN;
UPDATE users SET status = 'inactive' WHERE id = 1;
COMMIT;

Each row in Postgres has additional system columns that aren't visible by default:

xmin: XID of the transaction that created this version
xmax: XID of the transaction that deleted/updated this version
cmin: Command identifier for the transaction that created this version
cmax: Command identifier for the transaction that deleted/updated this version

Example:

BEGIN;  -- Transaction starts (XID: 100)

-- Step 1: Create user
INSERT INTO users (id, name, status) VALUES (1, 'John', 'pending');
-- Row: (1, 'John', 'pending')
-- xmin: 100 (creating transaction)
-- xmax: NULL (active version)
-- cmin: 0 (first command)
-- cmax: NULL (not deleted)

-- Step 2: Update status
UPDATE users SET status = 'active' WHERE id = 1;
-- Previous Row: (1, 'John', 'pending')
-- xmin: 100
-- xmax: 100 (marked by same transaction)
-- cmin: 0
-- cmax: 1 (updated by second command)

-- New Row: (1, 'John', 'active')
-- xmin: 100
-- xmax: NULL (active version)
-- cmin: 1 (second command)
-- cmax: NULL (not deleted)

-- Step 3: Update User name
UPDATE users SET name = 'Johnny' WHERE id = 1;
-- Previous Row: (1, 'John', 'active')
-- xmin: 100
-- xmax: 100 (marked by same transaction)
-- cmin: 1
-- cmax: 2 (deleted by third command)

-- New Row: (1, 'Johnny', 'active')
-- xmin: 100
-- xmax: NULL (active version)
-- cmin: 2 (third command)
-- cmax: NULL (not deleted)

-- Step 4: Delete user
DELETE FROM users WHERE id = 1;
-- Previous Row: (1, 'Johnny', 'active')
-- xmin: 100
-- xmax: 100 (marked by same transaction)
-- cmin: 2
-- cmax: 3 (deleted by fourth command)

COMMIT;

Command #	Operation	Data	xmin	xmax	cmin	cmax
1	INSERT	(1, 'John', 'pending')	100	NULL	0	NULL
2	UPDATE	(1, 'John', 'active')	100	100	0	1
3	UPDATE	(1, 'Johnny', 'active')	100	100	1	2
4	DELETE	(1, 'Johnny', 'active')	100	100	2	3

Transaction Snapshots

When a transaction starts, Postgres creates a snapshot of the database's state. This snapshot contains:

The transaction's own XID
All currently active transaction XIDs
The latest committed transaction XID

This information helps Postgres decide which versions of rows a transaction should see. It's like taking a photo of the database at a specific point in time.

Why Snapshots Matter

The snapshot mechanism is crucial for maintaining transaction isolation. Without it, transactions would see different data as other transactions commit changes, leading to inconsistency and unpredictable behavior. This is particularly important in applications that require strict strong consistency.

What's in a Snapshot?

A transaction snapshot in Postgres contains three essential pieces of information. These three pieces of information are used to determine which version of a row a transaction should see.

1. The Transaction's Own XID

Every snapshot knows its transaction's identifier (XID)
XID is used to determine if changes are made by the current transaction
Helps implement "read your own writes" behavior

2. All Currently Active Transaction XIDs

List of all transactions that were running when the snapshot was taken
Used to determine which changes should be invisible
Helps prevent reading "in-progress" modifications

3. Latest Committed Transaction XID

The most recent transaction that was committed when the snapshot was taken
Sets an upper bound for visible changes. Since XID is monotonically increasing, it helps quickly filter out future modifications
Helps implement "repeatable read" behavior

When a transaction starts, Postgres creates this snapshot and uses it throughout the transaction's lifetime. The snapshot never changes, even if thousands of other transactions commit while yours is running. This is what enables consistent reads.
Think of it like taking a picture of a whiteboard. Even if someone changes what's written on the physical whiteboard, your picture stays the same. This is exactly how Postgres maintains consistency - your transaction keeps seeing the database as it was when it started.

Visibility Check Rules

Now comes the interesting part - how does Postgres decide which version of a row a transaction should see?

Postgres uses the information in the snapshot to determine which version of a row a transaction should see. It uses the Visibility Algorithm.

A row version is visible if:

It was created by a transaction that committed before our snapshot. (xmin < snapshot.xmin)
It wasn't deleted by any transaction visible to our snapshot. (xmax > snapshot.xmin)

This ensures that a transaction only sees data that has been committed and not deleted by other transactions.

Vacuum Process

Those algorithms are great and very efficient but what happens to old versions that no one needs anymore? This is where VACUUM comes in.

VACUUM is the process that cleans up old versions of data. It removes dead tuples and frees up space. It's very common in the use cases where data is being updated frequently. Otherwise Postgres disk usage will grow indefinitely even if the data is deleted.

Dead Tuples

When you update or delete a row, Postgres doesn't immediately remove the old version. These old versions are called "dead tuples". They stick around because some long-running transaction might still need to see them.

-- This creates a dead tuple
UPDATE users SET status = 'inactive' WHERE id = 1;

Dead tuples take up space and can slow down queries because Postgres has to skip over them. This is where VACUUM helps.

How VACUUM works

VACUUM is a process that:

Scans tables for dead tuples.
Marks space occupied by dead tuples as reusable.
Updates indexes to remove references to dead tuples.(The most expensive part)
Updates the free space map.

Postgres also has an AutoVacuum process that runs automatically when:

A table has accumulated enough dead tuples
The table has grown significantly since the last vacuum

AutoVacuum is a background process that runs periodically and cleans up dead tuples. It's very efficient and doesn't block the main database operations, but it's not perfect. It can miss some dead tuples and it's not always able to clean up all the dead tuples. There's also a risk of running out of disk space if the vacuum process is not able to clean up dead tuples fast enough to keep up with the data growth.

Final Thoughts

MVCC is an advanced topic and takes a lot of effort in order to implement it correctly. It's not something that you can just throw together. It's a very complex system that requires a lot of thought and planning.

Recently, I've faced a problem where a database table that has very high write throughput had grown to 5x the size of the data even with AutoVacuum running. The problem was that the table was not designed to support HOT (Heap-Only Tuples) updates, specifically that the columns that was being updated frequently was part of multiple indexes.

There's also a lot of scientist and engineers who oppose the use of MVCC in the database world. They argue that it's not a good idea to have multiple versions of the same data and that it's not a good idea to have a system that's not ACID compliant.

This blog post is a very good detailed read about the problems with MVCC and how it's not a good idea to have multiple versions of the same data (in some use cases) and how it's not a good idea to have a system that's not ACID compliant.