How Cryptographic Hashing Works: A Deep Technical Guide

A deep, developer-focused guide to how cryptographic hash functions work — properties, Merkle–Damgård vs sponge constructions, the birthday bound, and where each family fits.

A cryptographic hash function is one of the smallest, most-used primitives in all of computing: it turns an arbitrary blob of bytes into a short, fixed-length fingerprint. This guide explains how that fingerprint is actually computed, why it is hard to forge, and where each major algorithm fits.

What a Cryptographic Hash Function Is

A hash function H maps a message M of any length to a fixed-size digest of n bits. SHA-256 produces 256 bits (32 bytes); SHA-1 produces 160 bits; BLAKE3 defaults to 256 bits but can emit any length you ask for.
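Those sizes are easy to check. The sketch below assumes Python and its standard hashlib module. BLAKE3 is not in the standard library, so SHAKE-128 stands in as an example of a function whose output length the caller chooses; the helper name digest_bits is made up for illustration.

```python
import hashlib

def digest_bits(name: str, data: bytes) -> int:
    """Return the digest length in bits for a fixed-output hashlib algorithm."""
    return len(hashlib.new(name, data).digest()) * 8

short = b"a"
long_ = b"a" * 1_000_000

for name in ("sha1", "sha256"):
    # Input length does not matter: the digest size is fixed by the algorithm.
    print(name, digest_bits(name, short), digest_bits(name, long_))

# An extendable-output function lets the caller pick the length (in bytes).
xof_out = hashlib.shake_128(short).digest(48)
print("shake_128", len(xof_out) * 8)
```

The one-byte and one-megabyte inputs report the same digest length for each fixed-output algorithm, while the SHAKE call returns exactly the 48 bytes requested.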

The word cryptographic is doing real work here. A plain hash table function only needs to spread keys evenly across buckets. A cryptographic hash must additionally be hard to invert and hard to find collisions for, even against an adversary who knows the algorithm completely and can spend years of compute attacking it. There are no secret keys: the function is fully public and deterministic. Its security rests entirely on computational hardness, not on hidden information.

Core Properties

A function earns the label "cryptographic hash" only if it provides a specific set of guarantees:

  • Determinism. The same input always yields the same digest, on every machine, forever. This is what makes hashes useful as identifiers.
  • Fixed-size output. A one-byte file and a one-terabyte file both reduce to the same digest length. The mapping is therefore many-to-one — collisions must exist by counting; the security claim is only that you cannot find them.
  • Preimage resistance (one-wayness). Given a digest h, it should be infeasible to find any M with H(M) = h. For an ideal n-bit hash this costs roughly 2^n work.
  • Second-preimage resistance. Given a specific message M1, it should be infeasible to find a different M2 with H(M2) = H(M1). Also about 2^n work for an ideal function.
  • Collision resistance. It should be infeasible to find any pair M1 ≠ M2 with H(M1) = H(M2). Of the three resistance properties this is the cheapest to attack; see the birthday bound below.
  • Avalanche effect. Flipping a single input bit should flip roughly half the output bits, with no statistical correlation between input and output changes. Avalanche is what makes the digest behave like a random function of the input.

The three resistance properties are related: collision resistance implies second-preimage resistance, so collision resistance is the strictest of the three requirements, and it is the one most signing and integrity schemes ultimately depend on.
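Determinism and the avalanche effect can be observed directly. The following is a minimal sketch, assuming Python's hashlib and SHA-256; the helper bit_difference and the sample message are illustrative choices, and a single run is an observation, not a statistical test.

```python
import hashlib

def bit_difference(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

message = bytearray(b"The quick brown fox jumps over the lazy dog")
flipped = bytearray(message)
flipped[0] ^= 0x01  # flip the lowest bit of the first byte

d1 = hashlib.sha256(message).digest()
d2 = hashlib.sha256(flipped).digest()

# Determinism: hashing the same bytes again gives the same digest.
same_again = hashlib.sha256(bytes(message)).digest() == d1

changed = bit_difference(d1, d2)
print(f"deterministic: {same_again}; {changed} of 256 output bits changed")
```

For a well-behaved hash the count should land near 128, give or take a couple of dozen bits.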

The Birthday Bound

Preimage and second-preimage attacks scale with 2^n. Collisions do not, because the attacker is free to choose both messages. This is the birthday paradox applied to hashing: in a set of random n-bit values, you expect a repeat after drawing only about 2^(n/2) of them, not 2^n.

Practically, an n-bit hash gives you at most about n/2 bits of collision security. SHA-256's 256-bit output yields roughly 128-bit collision resistance, which is comfortable. SHA-1's 160 bits allowed at most ~80-bit collision resistance, already a thin margin, and cryptanalysis pushed the real cost lower still (roughly 2^63 SHA-1 computations for the 2017 SHAttered attack), until researchers produced real colliding inputs. When you choose a digest size, size it for the collision case, not the preimage case.
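The 2^(n/2) figure is easy to see on a deliberately tiny hash. The sketch below assumes Python and truncates SHA-256 to 24 bits (a toy size chosen so the search finishes instantly); truncated_hash and find_collision are hypothetical helper names.

```python
import hashlib
from itertools import count

def truncated_hash(data: bytes, n_bits: int) -> int:
    """First n_bits of SHA-256(data) as an integer: a toy, deliberately short hash."""
    full = int.from_bytes(hashlib.sha256(data).digest(), "big")
    return full >> (256 - n_bits)

def find_collision(n_bits: int):
    """Hash distinct inputs until two of them share a truncated digest."""
    seen = {}  # truncated digest -> message that produced it
    for i in count():
        msg = str(i).encode()
        h = truncated_hash(msg, n_bits)
        if h in seen:
            return seen[h], msg, i + 1  # both messages, number of hashes computed
        seen[h] = msg

m1, m2, tries = find_collision(24)
print(f"{m1!r} and {m2!r} collide after {tries} hashes; 2^(24/2) = {2 ** 12}")
```

The number of hashes needed should be on the order of a few thousand, close to 2^12, rather than anywhere near the 2^24 a preimage search on the same toy hash would need.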

The Merkle–Damgård Construction

Most classic hashes — MD5, SHA-1, and the SHA-2 family — are built with the Merkle–Damgård construction. The idea is to build a hash of arbitrary-length input from a fixed-size compression function f that takes a chaining value plus one message block and returns a new chaining value.

The pieces:

  • IV (initialization vector). A fixed, standardized starting chaining value baked into the spec.
  • Message blocks. The padded message is split into equal-size blocks (512 bits for MD5/SHA-1/SHA-256).
  • Padding and length encoding. The message is padded with a single 1 bit, then 0 bits, and finally the original message length is appended as a fixed-width integer. Appending the length is Merkle–Damgård strengthening, and it is what makes the construction collision-resistant given a collision-resistant f.
  • Chaining. Each block is mixed into the running chaining value; the final chaining value is the digest.
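
In Python-like pseudocode (IV, f, and pad_and_split are left abstract):

def merkle_damgard(message):
    state = IV                            # fixed starting chaining value
    for block in pad_and_split(message):  # padded, equal-size blocks
        state = f(state, block)           # fold one block into the chaining value
    return state                          # the final chaining value is the digest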

Because the final state depends on every block in sequence, and the appended length prevents trivial padding tricks, any collision in H can be traced back to a collision in f. For a step-by-step walkthrough of a real compression function see SHA-256 Explained, and for the broken ancestors see MD5 Explained and SHA-1 Explained.
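To make the construction concrete, here is a runnable toy version, assuming Python. The padding follows the SHA-256 layout for byte-aligned messages (0x80, zero bytes, 64-bit big-endian bit length). The all-zero IV and the compression function f are stand-ins invented for illustration: f simply calls SHA-256 on the state and block, which is not how any real compression function is built.

```python
import hashlib

BLOCK_SIZE = 64  # bytes (512 bits), matching MD5/SHA-1/SHA-256
IV = bytes(32)   # hypothetical all-zero IV; real specs fix specific constants

def pad_and_split(message: bytes) -> list[bytes]:
    """Append 0x80, zero bytes, and the 64-bit big-endian bit length, then split."""
    zeros = (BLOCK_SIZE - 9 - len(message)) % BLOCK_SIZE
    length_field = (len(message) * 8).to_bytes(8, "big")
    padded = message + b"\x80" + b"\x00" * zeros + length_field
    return [padded[i:i + BLOCK_SIZE] for i in range(0, len(padded), BLOCK_SIZE)]

def f(state: bytes, block: bytes) -> bytes:
    """Toy compression function: a stand-in only, not SHA-256's real f."""
    return hashlib.sha256(state + block).digest()

def toy_md_hash(message: bytes) -> bytes:
    state = IV
    for block in pad_and_split(message):
        state = f(state, block)  # each block updates the chaining value
    return state                 # the final chaining value is the digest

print(len(pad_and_split(b"a" * 55)), len(pad_and_split(b"a" * 56)))
print(toy_md_hash(b"hello").hex())
```

A 55-byte message still fits in one block after padding, while a 56-byte message spills into a second block, because the 0x80 byte and the 8-byte length field need 9 bytes of room.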

Length-Extension Attacks

Plain Merkle–Damgård has a structural flaw: the digest is the final internal state. If an attacker knows H(M) and the length of M — but not M itself — they can resume the chaining loop from that state and compute H(M ‖ padding ‖ M2) for an extension M2 of their choosing, without ever knowing M.
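The attack can be reproduced on a toy Merkle–Damgård hash. The sketch below assumes Python; the hash (toy_hash), its compression function f, the all-zero IV, the 16-byte secret, and the messages are all invented for illustration. The attacker's side uses only the tag, the public message, and the secret's length.

```python
import hashlib

BLOCK = 64        # bytes per block, as in SHA-256
IV = bytes(32)    # hypothetical IV for this toy hash

def md_padding(msg_len: int) -> bytes:
    """Padding for a msg_len-byte message: 0x80, zeros, 64-bit big-endian bit length."""
    zeros = (BLOCK - 9 - msg_len) % BLOCK
    return b"\x80" + b"\x00" * zeros + (msg_len * 8).to_bytes(8, "big")

def f(state: bytes, block: bytes) -> bytes:
    """Toy compression function (a stand-in, not a real design)."""
    return hashlib.sha256(state + block).digest()

def toy_hash(message: bytes, state: bytes = IV, prior_len: int = 0) -> bytes:
    """Merkle-Damgard over f. state and prior_len allow resuming from a known digest."""
    data = message + md_padding(prior_len + len(message))
    for i in range(0, len(data), BLOCK):
        state = f(state, data[i:i + BLOCK])
    return state

# Server side: naive secret-prefix tag H(secret || message).
secret = b"k" * 16
msg = b"amount=10"
tag = toy_hash(secret + msg)

# Attacker side: knows tag, msg, and len(secret), but not secret itself.
orig_len = 16 + len(msg)
glue = md_padding(orig_len)            # the padding the server's hash appended
extension = b"&amount=9999"
forged_msg = msg + glue + extension
forged_tag = toy_hash(extension, state=tag, prior_len=orig_len + len(glue))

# The server would accept the forgery.
print(toy_hash(secret + forged_msg) == forged_tag)
```

The forged tag matches because resuming from the published digest is indistinguishable from having hashed the secret-prefixed message all along.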

This breaks naive "secret-prefix" authentication like H(secret ‖ message). The standard fix is HMAC, which nests two hash calls with a keyed inner and outer pass so the raw internal state never leaks. SHA-2's truncated variants (SHA-512/256) and the sponge-based SHA-3 are not vulnerable to length extension, because the value you see is not the full internal state.
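In practice you rarely build HMAC by hand. A minimal sketch using Python's standard hmac module follows; the key and messages are placeholders, and sign and verify are hypothetical wrapper names.

```python
import hashlib
import hmac

def sign(key: bytes, message: bytes) -> bytes:
    """Compute an HMAC-SHA-256 tag over message."""
    return hmac.new(key, message, hashlib.sha256).digest()

def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(sign(key, message), tag)

key = b"example-key-do-not-reuse"  # placeholder; real keys should be random
tag = sign(key, b"amount=10")

print(verify(key, b"amount=10", tag))     # True
print(verify(key, b"amount=9999", tag))   # False
```

Comparing tags with hmac.compare_digest rather than == avoids leaking, through timing, how many leading bytes matched.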

The Sponge Construction

SHA-3 (Keccak) abandoned Merkle–Damgård entirely in favor of a sponge. The internal state is a single large array of b bits, split conceptually into a rate r (the part that touches the message) and a capacity c (the part that never directly interacts with input, where b = r + c). The capacity is the security parameter.

A sponge runs in two phases:

  1. Absorb. XOR each message block into the rate portion of the state, then apply a fixed permutation that stirs the entire b-bit state.
  2. Squeeze. Read output from the rate, applying the permutation again between reads, until you have enough bits.

Because the capacity is never exposed in the output, sponges resist length extension natively and can produce variable-length output (extendable-output functions, or XOFs, like SHAKE). Full mechanics are covered in SHA-3 and Keccak Explained.
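The absorb and squeeze phases can be sketched in a few lines. This is a toy, assuming Python: the state is 32 bytes with an 8-byte rate and 24-byte capacity, the padding is a simplified byte-level variant, and the mixing step mix reuses SHA-256 as a stand-in. SHA-256 is not a permutation, and real Keccak uses its own dedicated permutation over a 1600-bit state, so treat this only as an illustration of the data flow.

```python
import hashlib

RATE = 8                  # bytes per absorbed or squeezed block (r)
CAPACITY = 24             # bytes never written by input or read as output (c)
WIDTH = RATE + CAPACITY   # b = r + c

def mix(state: bytes) -> bytes:
    """Stand-in for the fixed permutation: maps 32 bytes to 32 bytes."""
    return hashlib.sha256(state).digest()

def toy_sponge(message: bytes, out_len: int) -> bytes:
    # Simplified padding: 0x01, zeros up to a block boundary, top bit of last byte set.
    padded = bytearray(message) + b"\x01"
    padded += b"\x00" * (-len(padded) % RATE)
    padded[-1] |= 0x80

    state = bytes(WIDTH)
    # Absorb: XOR each block into the rate part only, then mix the whole state.
    for i in range(0, len(padded), RATE):
        block = padded[i:i + RATE]
        rate_part = bytes(s ^ m for s, m in zip(state[:RATE], block))
        state = mix(rate_part + state[RATE:])

    # Squeeze: read from the rate part only, mixing between reads.
    out = b""
    while len(out) < out_len:
        out += state[:RATE]
        state = mix(state)
    return out[:out_len]

print(toy_sponge(b"abc", 20).hex())
print(toy_sponge(b"abc", 40).hex())
```

The 40-byte output begins with the 20-byte output: asking for more bits just keeps squeezing the same state, which is how an XOF offers arbitrary output lengths.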

Tree and Parallel Hashing

Both Merkle–Damgård and the basic sponge are inherently sequential: block i+1 cannot start until block i finishes. That leaves the extra cores and SIMD lanes of modern hardware idle. Tree hashing fixes this by hashing independent chunks in parallel as the leaves of a Merkle tree, then combining the leaf hashes up to a single root.

BLAKE3 is the flagship example: it splits input into chunks, hashes them concurrently across cores and vector lanes, and merges the results. The result is enormous throughput while keeping cryptographic strength, plus built-in keyed-hash and key-derivation modes. See BLAKE2 and BLAKE3 Explained. Merkle trees are also the backbone of integrity in Git, Bitcoin, and content-addressed storage.
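A generic Merkle-tree hash shows the shape of the idea. The sketch below assumes Python and SHA-256. It is not BLAKE3's actual tree mode: the chunk size, the 0x00/0x01 domain-separation prefixes, and the rule of carrying an unpaired node up unchanged are illustrative choices. Whether the thread pool gives a real speedup depends on the runtime and chunk size.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1024  # bytes per leaf; an illustrative choice

def leaf_hash(chunk: bytes) -> bytes:
    # A 0x00 prefix marks leaves so they cannot be confused with interior nodes.
    return hashlib.sha256(b"\x00" + chunk).digest()

def parent_hash(left: bytes, right: bytes) -> bytes:
    # A 0x01 prefix marks interior nodes.
    return hashlib.sha256(b"\x01" + left + right).digest()

def merkle_root(data: bytes) -> bytes:
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)] or [b""]
    # Leaves are independent of each other, so they can be hashed concurrently.
    with ThreadPoolExecutor() as pool:
        level = list(pool.map(leaf_hash, chunks))
    # Combine pairs level by level; an unpaired last node is carried up unchanged.
    while len(level) > 1:
        nxt = [parent_hash(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
    return level[0]

data = bytes(range(256)) * 20  # 5120 bytes, so five leaves
print(merkle_root(data).hex())
```

Only the leaf pass is parallel here; the combining pass touches far less data, since each level halves the number of nodes.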

The Major Families at a Glance

  • MD5 / SHA-1 — Merkle–Damgård, broken. MD5 has practical collisions and is trivial to attack; SHA-1 has demonstrated chosen-prefix collisions. Use only for non-security checksums, never for signatures or passwords.
  • SHA-2 (SHA-224/256/384/512) — Merkle–Damgård, still strong and the default for most TLS, signing, and integrity work in 2026. No practical breaks.
  • SHA-3 (Keccak) — sponge, a structurally different backup standard with a comfortable security margin and native length-extension resistance.
  • BLAKE (BLAKE2, BLAKE3) — fast, modern, secure. BLAKE3 adds parallel tree hashing. Excellent when speed matters and you control both ends.

For a head-to-head on output size, speed, and structure, read MD5 vs SHA-256 vs SHA-3.

Applications

  • File integrity. Publish a digest alongside a download; recompute it after transfer to detect corruption or tampering. You can hash any text or file in your browser right now to verify a checksum.
  • Deduplication and content addressing. Identical content yields identical digests, so systems store one copy keyed by hash. Git names every object by its hash.
  • Commitments and Merkle trees. Publish H(secret) now to bind yourself to a value you reveal later; Merkle trees let you prove one item belongs to a large set with a short proof.
  • Message authentication. HMAC combines a hash with a secret key to authenticate messages without the length-extension pitfall.
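The file-integrity case is the one most developers implement first. A minimal sketch, assuming Python's hashlib and SHA-256 as the published algorithm; sha256_stream and matches are hypothetical helper names, and an in-memory stream stands in for a downloaded file.

```python
import hashlib
import hmac
import io
from typing import BinaryIO

def sha256_stream(stream: BinaryIO, chunk_size: int = 65536) -> str:
    """Hash a binary stream incrementally so a large file never sits fully in memory."""
    h = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        h.update(chunk)
    return h.hexdigest()

def matches(stream: BinaryIO, expected_hex: str) -> bool:
    """Compare the computed digest with a published one (case-insensitive hex)."""
    return hmac.compare_digest(sha256_stream(stream), expected_hex.lower())

payload = b"release-1.0 contents" * 10_000
published = hashlib.sha256(payload).hexdigest()  # what the publisher would post

print(matches(io.BytesIO(payload), published))             # True
print(matches(io.BytesIO(payload + b"\x00"), published))   # False
```

For a real file, the same function works on a handle opened in binary mode, for example `with open(path, "rb") as fh: matches(fh, published)`.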

Why Password Storage Needs Slow Hashes

A fast hash is exactly wrong for passwords. SHA-256 is engineered to be fast, and a modern GPU can compute billions of SHA-256 hashes per second, so an attacker who steals a database of fast-hashed passwords can brute-force common ones almost instantly. Password storage needs deliberately slow functions with tunable cost, ideally memory-hard ones: PBKDF2 and bcrypt are slow but not memory-hard, scrypt is memory-hard, and so is the current recommendation, Argon2. For the full decision tree, read Password Hashing Done Right. And avoid legacy schemes like NTLM, which is just unsalted, fast hashing.
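Argon2 is not in Python's standard library, so the sketch below uses scrypt from hashlib to show the general pattern: a random per-password salt, tunable cost parameters, and a constant-time comparison. The cost values are illustrative assumptions, not a tuning recommendation, and hashlib.scrypt requires a Python build linked against a sufficiently recent OpenSSL.

```python
import hashlib
import hmac
import os

# Illustrative cost parameters (assumptions): roughly 16 MiB of memory per hash.
N, R, P = 2 ** 14, 8, 1
MAXMEM = 64 * 1024 * 1024  # upper bound on memory scrypt is allowed to use

def _derive(password: str, salt: bytes) -> bytes:
    return hashlib.scrypt(password.encode(), salt=salt, n=N, r=R, p=P,
                          maxmem=MAXMEM, dklen=32)

def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, derived key); store both alongside the cost parameters."""
    salt = os.urandom(16)  # unique per password, so equal passwords hash differently
    return salt, _derive(password, salt)

def check_password(password: str, salt: bytes, key: bytes) -> bool:
    """Re-derive with the stored salt and compare in constant time."""
    return hmac.compare_digest(_derive(password, salt), key)

salt, key = hash_password("correct horse battery staple")
print(check_password("correct horse battery staple", salt, key))  # True
print(check_password("wrong guess", salt, key))                   # False
```

Raising N increases both time and memory per guess, which is what blunts GPU brute force.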

Non-Cryptographic Hashes Are Not Security

Functions like CRC32 and Adler-32 are error-detection codes, not cryptographic hashes. They are fast and excellent at catching accidental bit flips from a noisy network or disk, but they offer zero resistance to a deliberate attacker — forging input with a target CRC is trivial linear algebra. Never use a checksum where you need tamper resistance. See CRC32 and Checksums Explained for what they are good for.
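The linear structure behind that claim is visible in a few lines. The sketch below assumes Python's zlib.crc32; the three 16-byte messages are arbitrary examples. For equal-length inputs, the CRC of the XOR of three messages equals the XOR of their three CRCs, a relation an attacker can exploit and one a cryptographic hash should not exhibit.

```python
import hashlib
import zlib

def xor_bytes(*parts: bytes) -> bytes:
    """XOR equal-length byte strings together."""
    out = bytearray(len(parts[0]))
    for part in parts:
        for i, byte in enumerate(part):
            out[i] ^= byte
    return bytes(out)

a = b"pay alice 000010"
b = b"pay mallory 9999"
c = b"0123456789abcdef"  # all three are 16 bytes long

combined = xor_bytes(a, b, c)

# CRC32 of the XOR is predictable from the individual CRCs.
crc_holds = zlib.crc32(combined) == zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(c)

# The same relation on SHA-256 digests.
sha = lambda m: hashlib.sha256(m).digest()
sha_holds = sha(combined) == xor_bytes(sha(a), sha(b), sha(c))

print(f"CRC32 relation holds: {crc_holds}; SHA-256 relation holds: {sha_holds}")
```

The relation is expected to hold for CRC32 and fail for SHA-256; that predictability is why a CRC can be steered to any target value with simple algebra.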

Conclusion

Cryptographic hashing reduces to a few durable ideas: a deterministic, fixed-size, avalanche-rich fingerprint; security measured in preimage and (halved, via the birthday bound) collision resistance; and two dominant constructions — sequential Merkle–Damgård and the sponge — joined now by parallel tree hashing. Pick SHA-2, SHA-3, or BLAKE3 for general use, HMAC for authentication, and a slow memory-hard function for passwords. To see any of this in action, hash text or files entirely in your browser — the computation runs 100% client-side in Rust-compiled WebAssembly, so nothing you enter is ever uploaded.
