
MD5 Explained: How It Works and Why It's Broken

How the MD5 hash algorithm works internally — Merkle–Damgård, the 64-step compression function, padding — and why MD5 is cryptographically broken yet still used for checksums.


MD5 is arguably the best-known hash function in the world, and one of the most thoroughly broken. Understanding exactly how it works, and exactly where it fails, is a useful lens on how cryptographic hashing is built and how it dies.

A Short History

MD5 (Message Digest 5) was designed by Ronald Rivest at MIT in 1991 as the successor to MD4, and specified in RFC 1321, published in April 1992. Rivest built MD5 to be a strengthened drop-in replacement for MD4, which had already shown worrying weaknesses. For most of the 1990s and 2000s, MD5 was the default hash for everything: file integrity, password storage, digital certificates, and content addressing.

That ubiquity is exactly why its eventual collapse mattered so much. MD5 was wired into protocols and infrastructure long before practical attacks arrived, and that legacy lingers to this day. If you want the broader context on what a hash function is supposed to provide, start with our pillar guide on how cryptographic hashing works.

What MD5 Produces

MD5 maps an input of arbitrary length to a fixed 128-bit (16-byte) digest, conventionally written as 32 hexadecimal characters. Internally the algorithm maintains four 32-bit state words, labeled A, B, C, and D, which together hold those 128 bits. The message is processed in 512-bit blocks (16 words of 32 bits each).

The four state words are initialized to fixed constants specified in RFC 1321. After all message blocks are absorbed, the final values of A, B, C, D — concatenated in little-endian byte order — form the digest.
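As a quick check of these sizes, the sketch below uses Python's standard hashlib module together with the four initial constants from RFC 1321. The names IV, iv_bytes, and digest are illustrative only.

```python
import hashlib
import struct

# Initial state words A, B, C, D as given in RFC 1321.
IV = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476)

# Serialising each word low-order byte first yields a simple counting pattern.
iv_bytes = struct.pack("<4I", *IV)
print(iv_bytes.hex())                 # 0123456789abcdeffedcba9876543210

digest = hashlib.md5(b"").digest()
print(len(digest))                    # 16 (bytes, i.e. 128 bits)
print(hashlib.md5(b"").hexdigest())   # d41d8cd98f00b204e9800998ecf8427e

# The digest is the final A, B, C, D packed in the same little-endian layout.
a, b, c, d = struct.unpack("<4I", digest)
assert struct.pack("<4I", a, b, c, d) == digest
```

The little-endian packing is why the initial constants look scrambled as integers but read as 01 23 45 67 ... when written out as bytes.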

The Merkle–Damgård Structure

MD5 is a classic Merkle–Damgård construction, the same skeleton shared by MD4, SHA-1, and SHA-2. The idea is to turn a fixed-input compression function into a hash that accepts arbitrary-length messages:

  • Start from a fixed initialization vector (the initial A, B, C, D).
  • Pad the message so its length is a multiple of the block size.
  • For each 512-bit block, run the compression function, mixing the block into the running state.
  • The final state is the digest.

The running state is called the chaining value. Each block's output becomes the next block's input chaining value, which is what links the whole message together into one digest. This design is elegant and efficient, but it also bakes in structural properties — like length extension — that later prove relevant to MD5's security story.
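To make the chaining value and the length-extension property concrete, here is a minimal sketch of the Merkle–Damgård loop. It assumes a deliberately toy compression function (toy_compress is hypothetical and not secure) and an arbitrary IV; only the padding layout mirrors MD5.

```python
import struct

BLOCK = 64  # bytes per block, matching MD5's 512 bits

def md_padding(total_len: int) -> bytes:
    """Padding for a message of total_len bytes: 0x80, zeros, 64-bit bit length."""
    zeros = (55 - total_len) % 64
    return b"\x80" + b"\x00" * zeros + struct.pack("<Q", (total_len * 8) % 2**64)

def toy_compress(state: int, block: bytes) -> int:
    """Hypothetical stand-in for a real compression function. NOT secure."""
    return (state * 0x100000001B3 + int.from_bytes(block, "little")) % 2**128

def md_hash(message: bytes, iv: int = 0x0123456789ABCDEF, prior_len: int = 0) -> int:
    """Iterate toy_compress over the padded message, starting from iv.

    prior_len counts bytes absorbed before `message`; it is 0 for a normal
    hash and a multiple of BLOCK when resuming from an earlier digest.
    """
    data = message + md_padding(prior_len + len(message))
    state = iv  # the chaining value
    for off in range(0, len(data), BLOCK):
        state = toy_compress(state, data[off:off + BLOCK])
    return state

# Length extension: the digest IS the final chaining value, so anyone who
# knows it (and the original length) can keep hashing without the original.
original = b"secret-key" + b"amount=10"
digest = md_hash(original)

glue = md_padding(len(original))
suffix = b"&amount=9999"
extended = md_hash(suffix, iv=digest, prior_len=len(original) + len(glue))

assert extended == md_hash(original + glue + suffix)
```

The final assertion holds for any compression function plugged into this loop, which is why length extension is a property of the construction rather than of MD5's internals.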

Padding the Message

Before processing, MD5 pads every message to a multiple of 512 bits, regardless of whether it already happens to be a multiple. The padding scheme is deterministic and always applied:

  1. Append a single 1 bit — in practice the byte 0x80.
  2. Append 0 bits until the message length is congruent to 448 modulo 512 (i.e., 64 bits short of a full block).
  3. Append the original message length, in bits, as a 64-bit little-endian integer.

Appending the length is the Merkle–Damgård strengthening step. It ensures that messages of different lengths cannot trivially collide through padding ambiguity, and it is part of why naive constructions without length encoding are weaker.
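The three padding steps above can be sketched in a few lines of Python for byte-aligned messages. The function name md5_pad is an assumption for illustration, not a library API.

```python
import struct

def md5_pad(message: bytes) -> bytes:
    """Return message plus MD5 padding (byte-aligned input assumed)."""
    bit_len = (len(message) * 8) % 2**64   # length field wraps at 64 bits
    zeros = (55 - len(message)) % 64       # zero bytes needed to reach 448 mod 512 bits
    return message + b"\x80" + b"\x00" * zeros + struct.pack("<Q", bit_len)

print(len(md5_pad(b"")))          # 64
print(len(md5_pad(b"a" * 55)))    # 64  (55 + 1 + 8 fits exactly)
print(len(md5_pad(b"a" * 56)))    # 128 (no room left for the length field)
print(len(md5_pad(b"a" * 64)))    # 128 (padding is added even to a full block)
```

The 56-byte case shows why a message can grow by a whole extra block: the 0x80 byte and the 8-byte length must both fit after the data.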

Inside the Compression Function

The heart of MD5 is its compression function, which processes one 512-bit block through 64 operations, organized as 4 rounds of 16 steps each. Each step updates one of the four state words using the other three, a slice of the message block, an additive constant, and a bitwise rotation.

The Four Round Functions

Each round uses a different nonlinear mixing function of three 32-bit words. They are commonly called F, G, H, and I, and each round applies one of them across its 16 steps:

  • F (round 1): a bitwise selection — for each bit position, choose between two inputs based on a third (a multiplexer-style function).
  • G (round 2): another selection function, with the roles of the controlling and selected inputs rearranged.
  • H (round 3): a bitwise XOR of all three inputs — the most "linear" of the four.
  • I (round 4): a function combining one input with the OR of another and the complement of the third.

The purpose of using four distinct, nonlinear functions across the rounds is diffusion and confusion: a single bit flip in the input should cascade unpredictably through all 128 output bits.
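For reference, the four functions as defined in RFC 1321 can be written directly in Python. The helper not32 and the MASK constant are illustrative names; they are needed because Python integers are unbounded, so a bitwise NOT has to be truncated to 32 bits.

```python
MASK = 0xFFFFFFFF

def not32(x: int) -> int:
    """Bitwise NOT restricted to 32 bits."""
    return ~x & MASK

def F(x: int, y: int, z: int) -> int:
    return (x & y) | (not32(x) & z)    # if x then y else z

def G(x: int, y: int, z: int) -> int:
    return (x & z) | (y & not32(z))    # if z then x else y

def H(x: int, y: int, z: int) -> int:
    return x ^ y ^ z                   # parity of the three inputs

def I(x: int, y: int, z: int) -> int:
    return y ^ (x | not32(z))

# F picks bits of y where x is 1 and bits of z where x is 0.
print(bin(F(0b1100, 0b1010, 0b0101)))  # 0b1001
```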

Constants From the Sine Function

Each of the 64 steps adds a precomputed 32-bit constant, often written as T[i]. Rivest derived these from the sine function: for step i, counted from 1 to 64 and taken in radians, the constant is the integer part of 2³² × |sin(i)|. These are textbook "nothing-up-my-sleeve" numbers: values chosen from a well-known mathematical sequence so that the designer cannot be accused of inserting a hidden trapdoor.
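The table can be regenerated rather than copied. A minimal sketch, assuming a zero-based Python list (so list index i holds the constant for step i + 1):

```python
import math

# T[i] = floor(2^32 * |sin(i + 1)|), with the argument in radians.
T = [int(abs(math.sin(i + 1)) * 2**32) & 0xFFFFFFFF for i in range(64)]

print(hex(T[0]))    # 0xd76aa478
print(hex(T[63]))   # 0xeb86d391
```

Double-precision floating point is accurate enough here; the results match the table printed in RFC 1321.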

The Message Schedule and Rotations

Within each step the function consumes one 32-bit word of the message block. The order in which the 16 message words are consumed — the message schedule — differs per round: round 1 reads them in order, while rounds 2, 3, and 4 use fixed permutations that skip through the words at different strides. This ensures every message word influences several steps in scattered positions.

Each step also applies a left rotation by an amount that cycles through a small fixed set of values, repeating in a four-step pattern within each round and changing per round. The rotation amounts and the message-word order together are tuned to spread changes rapidly across the state.
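Both schedules are small enough to write out. The sketch below uses the index formulas and rotation amounts from RFC 1321; the function names message_index and rotation_amount are illustrative, and i counts steps from 0 to 63.

```python
# Rotation amounts: one row per round, repeating every four steps.
SHIFTS = (
    (7, 12, 17, 22),   # round 1
    (5, 9, 14, 20),    # round 2
    (4, 11, 16, 23),   # round 3
    (6, 10, 15, 21),   # round 4
)

def message_index(i: int) -> int:
    """Which of the 16 message words step i consumes."""
    rnd = i // 16
    if rnd == 0:
        return i                   # in order
    if rnd == 1:
        return (5 * i + 1) % 16    # stride 5
    if rnd == 2:
        return (3 * i + 5) % 16    # stride 3
    return (7 * i) % 16            # stride 7

def rotation_amount(i: int) -> int:
    """Left-rotation amount for step i."""
    return SHIFTS[i // 16][i % 4]

for rnd in range(4):
    print([message_index(i) for i in range(16 * rnd, 16 * rnd + 16)])
```

Because 5, 3, and 7 are all coprime to 16, each round visits every message word exactly once.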

One Step, Sketched

A single MD5 step transforms the four words. Here is a general sketch; the specific constants, rotation amounts, and word indices come from RFC 1321, and every addition is taken modulo 2³²:

# State words: a, b, c, d  (rotated roles each step)
# func    : F, G, H, or I depending on the round
# M[k]    : a 32-bit message word selected by the schedule
# T[i]    : the i-th sine-derived additive constant
# s       : the rotation amount for this step

for i in 0..63:
    func = select_round_function(i)   # F, G, H, I
    k    = message_index(i)           # per-round schedule
    s    = rotation_amount(i)

    tmp  = b + leftrotate( a + func(b, c, d) + M[k] + T[i], s )

    # rotate the roles of the four words
    a, b, c, d = d, tmp, b, c

After all 64 steps, the four working words are added back into the incoming chaining value.
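Putting the pieces together, here is a compact, unoptimized Python sketch of the whole algorithm for byte-aligned input. It is meant for reading, not production use; the names compress and md5 are illustrative, and the RFC 1321 test vectors are used to check it.

```python
import math
import struct

MASK = 0xFFFFFFFF
T = [int(abs(math.sin(i + 1)) * 2**32) & MASK for i in range(64)]
SHIFTS = ((7, 12, 17, 22), (5, 9, 14, 20), (4, 11, 16, 23), (6, 10, 15, 21))
IV = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476)

def rotl(x: int, s: int) -> int:
    """Rotate a 32-bit value left by s bits."""
    x &= MASK
    return ((x << s) | (x >> (32 - s))) & MASK

def compress(state, block: bytes):
    """Mix one 64-byte block into the 4-word chaining value."""
    M = struct.unpack("<16I", block)
    a, b, c, d = state
    for i in range(64):
        rnd = i // 16
        if rnd == 0:
            f, k = (b & c) | (~b & d), i                      # F
        elif rnd == 1:
            f, k = (d & b) | (~d & c), (5 * i + 1) % 16       # G
        elif rnd == 2:
            f, k = b ^ c ^ d, (3 * i + 5) % 16                # H
        else:
            f, k = c ^ (b | (~d & MASK)), (7 * i) % 16        # I
        tmp = (b + rotl(a + f + M[k] + T[i], SHIFTS[rnd][i % 4])) & MASK
        a, b, c, d = d, tmp, b, c
    # Add the working words back into the incoming chaining value.
    return tuple((x + y) & MASK for x, y in zip(state, (a, b, c, d)))

def md5(message: bytes) -> bytes:
    """Pad, iterate the compression function, and serialise the state."""
    data = message + b"\x80" + b"\x00" * ((55 - len(message)) % 64)
    data += struct.pack("<Q", (len(message) * 8) % 2**64)
    state = IV
    for off in range(0, len(data), 64):
        state = compress(state, data[off:off + 64])
    return struct.pack("<4I", *state)

print(md5(b"").hex())      # d41d8cd98f00b204e9800998ecf8427e
print(md5(b"abc").hex())   # 900150983cd24fb0d6963f7d28e17f72
```

In real code, prefer a vetted library implementation such as hashlib; a pure-Python loop like this is orders of magnitude slower.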

Chaining and Feed-Forward

The final mechanism that makes MD5's compression function one-way is the feed-forward, a Davies–Meyer-style addition. At the end of each block:

A = A_initial + a
B = B_initial + b
C = C_initial + c
D = D_initial + d

The output of the 64 steps is added modulo 2³² to the chaining value that entered the block. This modular addition is critical: for a fixed message block, each of the 64 steps is invertible, so without the feed-forward anyone who knows the block could run the steps backward and recover the incoming chaining value. Adding the input back in destroys that invertibility and produces the irreversibility we expect from a hash. These updated A, B, C, D become the chaining value for the next block, or the digest if this was the last block.
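The invertibility claim can be checked directly. The sketch below runs the 64 steps without the feed-forward, then undoes them one at a time using the same message words. The function names run_steps and undo_steps are illustrative, and the message block is arbitrary test data.

```python
import math
import struct

MASK = 0xFFFFFFFF
T = [int(abs(math.sin(i + 1)) * 2**32) & MASK for i in range(64)]
SHIFTS = ((7, 12, 17, 22), (5, 9, 14, 20), (4, 11, 16, 23), (6, 10, 15, 21))

def rotl(x: int, s: int) -> int:
    return ((x << s) | (x >> (32 - s))) & MASK

def rotr(x: int, s: int) -> int:
    return ((x >> s) | (x << (32 - s))) & MASK

def step_params(i, b, c, d):
    """Return (round-function output, message index, rotation) for step i."""
    rnd = i // 16
    s = SHIFTS[rnd][i % 4]
    if rnd == 0:
        return (b & c) | (~b & d), i, s
    if rnd == 1:
        return (d & b) | (~d & c), (5 * i + 1) % 16, s
    if rnd == 2:
        return b ^ c ^ d, (3 * i + 5) % 16, s
    return c ^ (b | (~d & MASK)), (7 * i) % 16, s

def run_steps(state, M):
    """The 64 steps only, with NO feed-forward."""
    a, b, c, d = state
    for i in range(64):
        f, k, s = step_params(i, b, c, d)
        tmp = (b + rotl((a + f + M[k] + T[i]) & MASK, s)) & MASK
        a, b, c, d = d, tmp, b, c
    return a, b, c, d

def undo_steps(state, M):
    """Run the 64 steps backward, given the same message words."""
    a, b, c, d = state
    for i in reversed(range(64)):
        # After a step the state is (old d, tmp, old b, old c).
        old_b, old_c, old_d, tmp = c, d, a, b
        f, k, s = step_params(i, old_b, old_c, old_d)
        old_a = (rotr((tmp - old_b) & MASK, s) - f - M[k] - T[i]) & MASK
        a, b, c, d = old_a, old_b, old_c, old_d
    return a, b, c, d

IV = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476)
M = struct.unpack("<16I", bytes(range(64)))
assert undo_steps(run_steps(IV, M), M) == IV
```

Once the incoming chaining value is added to the output, the same backward walk no longer works: the starting point for undo_steps would require subtracting the very value being sought.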

Why MD5 Was Designed to Be Fast

MD5 was engineered for speed on 1990s 32-bit hardware. Its operations — modular addition, bitwise logic, rotations — map directly onto cheap CPU instructions, and 64 simple steps per block make it extremely fast. That was a virtue for checksums and integrity in an era of slow machines.

For security, raw speed is now a liability. A fast hash means an attacker can compute enormous numbers of candidates per second, which is catastrophic for password hashing and accelerates collision searches. Modern password hashing deliberately uses slow, memory-hard designs for exactly this reason — the opposite of MD5's design goals.

How MD5 Broke: The Collision Timeline

MD5's downfall is a collision story, not a preimage one. The milestones:

  • 2004: Wang et al. Xiaoyun Wang and collaborators demonstrated practical, real collisions in MD5: two distinct 128-byte inputs producing the identical digest. Their reported search took about an hour on an IBM p690 cluster, and follow-up work over the next two years cut that to under a minute on a single notebook PC. This was the first practical break of MD5's collision resistance.
  • 2007–2009: Chosen-prefix collisions and the rogue CA. Researchers extended the attack to chosen-prefix collisions, where the attacker starts from two different, freely chosen prefixes and computes suffixes that make the full messages collide. In 2008–2009 a team used this to forge a rogue Certificate Authority certificate, exploiting a CA that still signed with MD5. It was a concrete demonstration that the break threatened the real web PKI.
  • 2012: Flame malware. The Flame espionage malware used an MD5 collision to forge a Microsoft code-signing certificate, letting malicious code masquerade as a legitimate Windows update. This was a collision attack weaponized in the wild by a sophisticated actor.

Preimage vs. Collision: What's Actually Broken

It's worth being precise. As of 2026, MD5's collision resistance is dead: finding two inputs with the same hash is cheap and routine. But MD5 is not practically preimage-broken: given a target digest, there is no efficient general attack to find an input that produces it. The best published preimage attack costs roughly 2¹²³ operations, only marginally below the 2¹²⁸ cost of brute force and far beyond practical reach.

That distinction matters. Collision attacks require the attacker to control both inputs — devastating for digital signatures, certificates, and any setting where one party crafts content the other will trust. Preimage attacks would be needed to, say, reverse an arbitrary observed hash, and those remain infeasible. But you should never lean on this nuance: once collision resistance fails, the hash is unfit for any security purpose.

When (If Ever) to Use MD5 Today

The rules are simple:

  • Never use MD5 for digital signatures, certificates, password hashing, new HMAC constructions, or anything an adversary can influence. Use SHA-256 or SHA-3 for security.
  • Acceptable only for non-adversarial integrity — detecting accidental corruption of a file download, deduplication, or cache keys where no attacker is trying to engineer a collision. Even here, a non-cryptographic CRC32 checksum is often a better-fit tool for pure error detection, and SHA-256 is the safer default when in doubt.
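For the download-integrity case, a minimal sketch using Python's hashlib might look like this. The helper name file_digest_hex and its parameters are assumptions for illustration; it defaults to SHA-256 and accepts "md5" only when a publisher supplies nothing else.

```python
import hashlib
import os
import tempfile

def file_digest_hex(path: str, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large files need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Small demonstration on a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example payload\n" * 1000)
    demo_path = tmp.name

try:
    sha256_hex = file_digest_hex(demo_path)          # the safer default
    md5_hex = file_digest_hex(demo_path, "md5")      # accidental-corruption checks only
    print(len(sha256_hex), len(md5_hex))             # 64 32
finally:
    os.remove(demo_path)
```

Comparing the result against a checksum published next to the download only detects accidental corruption if that checksum came from a source the attacker cannot also modify.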

Historically, MD5 sits between its predecessor MD4 and the later SHA-1 (also collision-broken, see our SHA-1 deep dive), with the modern families coming after both. If you're weighing options, our comparison of MD5 vs SHA-256 vs SHA-3 lays out the tradeoffs.

Try It Yourself

You can compute an MD5 hash in your browser with Hash Generator. Everything runs 100% client-side via Rust compiled to WebAssembly — nothing you type is ever uploaded, which makes it safe for inspecting sensitive data or just learning how digests behave when you flip a single byte.

Conclusion

MD5 is a beautifully compact piece of engineering: a Merkle–Damgård hash built on a 64-step compression function with sine-derived constants and clever bit mixing. It is also a cautionary tale — fast by design, collision-broken in practice, and still lurking in legacy systems decades after it should have been retired. Use it only for casual integrity checks, reach for SHA-256 or SHA-3 for anything that matters, and try MD5 hashing in your browser to see the algorithm in action.
