The Stutter In Scroll

When you open any major feed-based website and resize the browser window, the browser stops everything it's doing to recalculate the geometry of the page, because some JavaScript asked a question the browser can only answer by running its full layout engine right now, blocking everything else.

That blocked moment is what creates the stutter you feel when scrolling through a long Reddit thread or resizing a Gmail window. Code asking the browser a question the browser takes too long to answer.

The question is: how tall is this text block?

It sounds like nothing. It's expensive because of how browsers work internally. When JavaScript asks for an element's height via getBoundingClientRect() or offsetHeight, the browser can only guarantee a correct answer if its layout calculation is current. Anything could have changed since it last ran. So it throws away its cached layout, reruns the entire thing from scratch, answers the question, then unblocks JavaScript. The whole sequence takes 1 to 5 milliseconds per call. Paul Irish from the Chrome team documented over 30 JavaScript properties that trigger this - developers hit them constantly without realizing it.

One call? Fine. But this question gets asked in loops. A comment feed with 500 items asks it 500 times. Sequentially. On every window resize. That's potentially 2,500 milliseconds of blocked JavaScript, which is 156 dropped frames.
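
The arithmetic above is worth making concrete. A toy cost model, using the per-call and per-frame figures from the text (5ms per forced reflow, a 16ms frame budget at 60fps):

```typescript
// Back-of-envelope cost of forced reflow in a measurement loop.
// Both constants come from the figures quoted above; real costs vary.
const REFLOW_MS_PER_CALL = 5;
const FRAME_BUDGET_MS = 16;

function blockedMs(itemCount: number): number {
  return itemCount * REFLOW_MS_PER_CALL;
}

function droppedFrames(itemCount: number): number {
  return Math.floor(blockedMs(itemCount) / FRAME_BUDGET_MS);
}

console.log(blockedMs(500));     // 2500ms of blocked JavaScript
console.log(droppedFrames(500)); // 156 dropped frames
```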


Why Does Any App Need Text Heights At All

Twitter can't load your entire timeline into the browser at once. It would consume gigabytes of memory and crash. So every feed-based app you use, including Twitter, Reddit, Gmail, Slack, and YouTube, does something clever: it only renders the 15 to 20 items currently on screen. As you scroll, it quietly destroys posts that slid off the top and builds new ones at the bottom. The trick is invisible when it works. This is called virtual scrolling.

But to position items correctly, the app needs to know each item's height before rendering it. And tweet heights aren't predictable. A two-word reply might be 60px. A long reply to that reply might be 220px. Same font, same container width, completely different heights depending on how many words wrapped onto new lines.

And when you resize the browser window, the container gets narrower or wider, which changes where every line breaks, which changes how tall every tweet is. Every item in the feed needs to be remeasured. So the app asks the browser. And the browser charges 2ms per question. It's the price of asking a question the browser can only answer one expensive way.


"But I Build React Apps and I've Never Seen This"

Fair. Most React developers will never hit this problem, for very specific reasons.

The reflow issue only bites when you're measuring heights in a loop. If you're rendering a list and just letting the browser lay it out - no height queries, no positioning math - reflow never enters the picture. The browser does its thing, you don't interfere, everything's fine.

If you've used react-window or react-virtuoso for a long list, those libraries are handling height measurement internally. react-window's simplest API asks you to pass a fixed itemSize prop - a hardcoded pixel height for every row. No measurement, no reflow, but also no variable-height items. The more advanced dynamic-height modes do measure, they just do it once per item on first render and cache the result. The reflow happens. You just never see it because the library hid it, and at the scale most apps run at, a one-time cost is acceptable.

Scale is the key variable. With 80 items in a list, the total reflow budget across all of them might be 160ms, spread across multiple frames, completely imperceptible. The problem Pretext solves surfaces when hundreds of items need remeasuring on every window resize, continuously, as the user drags the browser edge.

The library solves a problem that only exists when three conditions land together: a large number of items, variable and unpredictable heights, and repeated measurement on resize or scroll. Most apps never hit all three. When they do, at scale, the standard tools stop being good enough. That's the real target Pretext is aimed at.


The Fix Is - Arithmetic

Cheng Lou built react-motion (21,800 GitHub stars, the physics-based animation library that underpins roughly half the animated React UIs built between 2015 and 2020). He worked on the React team at Facebook. He now runs the UI stack at Midjourney, a platform serving over 16 million users. And before Pretext, he built an earlier version of this same idea called text-layout, archived it when he found a cleaner approach, and started over.

His answer to the layout reflow problem is to stop asking the browser for text heights entirely.

Text height is pure arithmetic. Lines in a paragraph times the height of each line equals the total height. The browser doesn't need to be involved in that. The only hard part is figuring out how many lines a block of text wraps into at a given container width. And that depends on one thing: how wide each word is.

Pretext gets word widths from the browser's Canvas API. The Canvas 2D context has a method called measureText() that returns a word's pixel width directly from the font engine. Unlike DOM reads, Canvas measurement doesn't trigger layout reflow. The browser just answers immediately, no layout engine required.

Pretext's prepare() function takes your text and a font string, breaks the text into words, measures each word's width via Canvas, and caches the results. For 500 comment-length texts, this costs about 19 ms total. Paid once when content loads.

Then layout(prepared, containerWidth, lineHeight) runs whenever you need a height. It walks the cached widths, counts how many words fit per line at the given container width, and multiplies by line height. Pure maths. No Canvas. No DOM. No browser involvement at all.
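
A minimal sketch of that prepare/layout split. Everything here is illustrative: the real Pretext API and internals differ, and the word-measuring callback is a hypothetical parameter standing in for the Canvas measurer, so the sketch runs without a browser:

```typescript
// Cached measurements produced by the prepare step (simplified shape).
type Prepared = { wordWidths: number[]; spaceWidth: number };

// prepare: measure each word once and cache the widths. `measure` is a
// stand-in for a Canvas-backed measurer (CanvasRenderingContext2D.measureText).
function prepare(text: string, measure: (word: string) => number): Prepared {
  const words = text.split(/\s+/).filter(Boolean);
  return { wordWidths: words.map(measure), spaceWidth: measure(" ") };
}

// layout: greedy line fill over the cached widths. No DOM, no Canvas -
// just arithmetic, so it is cheap to rerun on every resize.
function layout(prep: Prepared, containerWidth: number, lineHeight: number): number {
  let lines = 1;
  let lineWidth = 0;
  for (const w of prep.wordWidths) {
    const needed = lineWidth === 0 ? w : lineWidth + prep.spaceWidth + w;
    if (needed > containerWidth && lineWidth > 0) {
      lines += 1;     // word doesn't fit: start a new line
      lineWidth = w;
    } else {
      lineWidth = needed;
    }
  }
  return lines * lineHeight;
}
```

Injecting the measurer keeps the layout math testable anywhere; a browser build would wrap measureText() with the right font string and cache per font.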

For 500 texts: 0.09 milliseconds. Window resize? Call layout() again. 0.09ms. The stutter is gone because the thing that caused it is gone.


AI Co-Developed This Library

Open the Pretext contributor list on GitHub. You will find Claude, listed with actual commits. To understand what actually happened, you need to understand the specific problem that required it.

Pretext doesn't just need to work for English. It needs to predict, with pixel-level accuracy, where every browser on earth will wrap every line of text in every language. That's the hard part.

Every language has different rules for where a line can break. English breaks between words. Chinese and Japanese break between almost any two characters, but specific punctuation marks are forbidden from starting or ending a line. Japanese calls these kinsoku rules. Thai has no spaces at all - the only signals for valid break points are invisible zero-width characters embedded in the source text. Arabic flows right-to-left, and certain punctuation clusters must stay attached to neighboring words or the text's meaning changes. Myanmar has religious punctuation marks that can't be separated from the word they follow. None of this is in the Unicode spec in a form you can just read and implement. You find these rules by testing actual text and watching where your predictions diverge from what the browser does.

So Cheng built the accuracy workflow first, before building the fix. The workflow does this: render a corpus of real text in an actual browser at hundreds of different container widths, capture the real DOM heights, run Pretext on the same text at the same widths, compare the two, flag every width where they disagree.

That comparison produces a list of failures. Each failure has a language, a width, and a line count difference. And then someone has to figure out why.

The repo has a file called AGENTS.md. It describes every command available, every source file and what it owns, every constraint to respect before changing anything. It reads like a very thorough onboarding document for a new engineer who can run commands and edit files but needs to know where everything lives.

The commands the AI used look like this:

bun run accuracy-check runs the full browser sweep, comparing Pretext predictions against real DOM measurements across all languages at all tested widths. The output is a table of mismatches.

bun run corpus-sweep --id=arabic --samples=9 --font='20px Noto Naskh Arabic' takes the Arabic corpus and tests it at 9 different container widths, reporting exactly which widths produce a mismatch and by how many lines.

bun run corpus-taxonomy --id=arabic 300 450 600 classifies the mismatches at those widths into categories. Edge-fit problems, glue policy problems, boundary discovery problems, shaping context problems. Each category points to a different place in the codebase.

bun run probe-check --text='بكشء،ٍ' --width=527 --lang=ar isolates one specific piece of Arabic text at one specific pixel width and renders it in the real browser to see exactly what the browser does with it.

The text cluster بكشء،ٍ is an Arabic word followed by an Arabic comma followed by a tanwin kasratan diacritic mark. The standard text segmenter, Intl.Segmenter, splits this into two pieces at the punctuation boundary. The browser treats the entire cluster as one unbreakable unit. So when Pretext was predicting line breaks, it sometimes split the cluster across two lines. The browser kept it together. Different line count. Different height. Mismatch at 527px.

The fix was a preprocessing rule in src/analysis.ts: when a trailing punctuation cluster includes combining diacritic marks, merge the whole thing into a single segment before any layout calculations run. The browser confirmed it. The AI moved to the next language.

Japanese brought a different class of problem. Kana iteration marks (ゝ ゞ ヽ ヾ) are characters that mean "repeat the previous kana." Intl.Segmenter emits them as standalone word-like segments. The browser won't start a line with them, because they're meaningless without the character they follow. Pretext was letting them start lines. Fix: add them to the CJK line-start-prohibited character set.

Chinese had an opening punctuation issue. The corner bracket that opens a Chinese quotation, when it lands at the end of a CJK segment, gets moved by the browser to the beginning of the next segment. Chinese typography doesn't let a line end with an opening bracket. Pretext wasn't doing this, so brackets were landing at line ends that browsers never produce. Fix: carry trailing opening-punctuation clusters forward onto the next CJK segment during preprocessing.
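
All three fixes have the same shape: a preprocessing pass that glues segments together before any line-breaking math runs. A simplified sketch of that shape (the character set and segment representation here are loose assumptions, not Pretext's actual code in src/analysis.ts):

```typescript
// Characters that must not start a line (tiny illustrative subset of the
// CJK line-start-prohibited set: kana iteration marks, closing punctuation).
const LINE_START_PROHIBITED = new Set(["ゝ", "ゞ", "ヽ", "ヾ", "。", "、"]);

// Merge any segment that can't start a line into the segment before it,
// so the line breaker is never offered an illegal break point.
function mergeProhibitedStarts(segments: string[]): string[] {
  const out: string[] = [];
  for (const seg of segments) {
    if (out.length > 0 && LINE_START_PROHIBITED.has(seg[0])) {
      out[out.length - 1] += seg; // glue onto the previous segment
    } else {
      out.push(seg);
    }
  }
  return out;
}
```

The Arabic diacritic merge and the Chinese bracket carry-forward are variations on the same move: rewrite the segment list until every boundary left is one the browser would actually break at.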

And those are just three of the cases. The full accuracy run covered Arabic, Japanese (two corpora: 羅生門 and 蜘蛛の糸), Chinese (祝福 and 故鄉), Korean, Thai, Hindi, Hebrew, Urdu, Khmer from a Cambodian folklore anthology, and two Myanmar prose texts. Each language tested across Chrome, Safari, and Firefox at multiple widths. 7,680 test cases total.

A developer doing this alone would have spent months in the loop of write code, deploy, open browser, check, write code again. The AI compressed that loop substantially. Run accuracy-check, get the failing widths, run corpus-taxonomy to categorize the failure class, run probe-check to isolate the specific text cluster, edit analysis.ts, verify in the browser, move to the next language. The AGENTS.md file exists because Cheng built this as a sustained workflow, not a one-time experiment. The commands are structured to produce machine-readable output. The file tells the AI which source files own which behavior. It's software architecture written to include an AI as a contributor from the start.

What happened here is an AI reading browser output across nine languages, identifying specific Unicode clusters the segmenter was handling wrong, writing preprocessing rules that fixed them, and contributing that code to a shipped library. That's a different thing.


How This Compares to Karpathy's AutoResearch (And Where They Diverge)

Around the same time Pretext was circulating on GitHub, Andrej Karpathy published a project called AutoResearch that got 21,000 stars in days and 8.6 million views on his announcement post. The two projects use AI agents in structurally similar ways. But the differences tell you something interesting about where agentic development actually works and where it hits a wall.

AutoResearch works like this. You give an AI agent a small LLM training setup - about 630 lines of Python - and one instruction: improve the validation score. The agent reads the code, proposes a change (maybe a different learning rate, maybe a different architecture depth), runs a 5-minute training, checks whether the validation metric improved, keeps the commit if yes, reverts it via git reset if no, then starts the next cycle. No human in the loop. Karpathy's goal was stated plainly: "The goal is to engineer your agents to make faster research progress indefinitely and without any of your own involvement." Shopify's CEO tried it overnight and reported a 19% performance improvement after 37 autonomous experiments. (Fortune, March 2026)

The surface structure looks identical to what Cheng did. Both projects have a human-written markdown file that instructs the AI (AutoResearch uses program.md, Pretext uses AGENTS.md). Both run tests, compare results against a target, and loop. Both compress work that would take a human days into hours.

But the loop mechanic is fundamentally different.

AutoResearch is a search problem. The metric is a single number - validation bits-per-byte. Lower is better. The agent doesn't need to understand why a change worked. It just needs to know whether it did. If trying a wider attention layer drops the metric, keep it. If it doesn't, discard it. The agent can try things semi-randomly and still make progress because "did it improve?" is a binary question with a clear answer every 5 minutes.
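
That keep-or-revert loop is plain hill climbing. A toy version, purely illustrative (AutoResearch mutates real training code; this just perturbs a number):

```typescript
// Keep-or-revert search loop: propose a change, evaluate it, keep it only
// if the metric improved. `evaluate` plays the role of the 5-minute training
// run; `propose` plays the role of the agent's code edit.
function hillClimb(
  evaluate: (x: number) => number, // lower is better, like validation loss
  propose: (x: number) => number,
  start: number,
  steps: number
): number {
  let current = start;
  let best = evaluate(current);
  for (let i = 0; i < steps; i++) {
    const candidate = propose(current); // "edit the code"
    const score = evaluate(candidate);  // "run the training"
    if (score < best) {
      current = candidate; // keep the commit
      best = score;
    }                      // else: git reset, try again
  }
  return current;
}
```

The point of the toy: nowhere does the loop ask *why* a change helped. That indifference is exactly what Pretext's debugging loop cannot afford.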

Pretext is a debugging problem. The metric - does our line count match the browser's line count - is also binary per test case. But when it fails, random changes to the code won't fix it. The failure happens because Arabic punctuation clusters follow Unicode combining rules that aren't documented anywhere obvious, or because Japanese iteration marks are assigned the wrong break class, or because Chinese opening brackets have a typographic rule that predates any web standard. The AI can't randomly mutate analysis.ts and stumble onto a fix. It has to understand what the browser is doing with that specific text at that specific width and write a preprocessing rule that matches it.

That's why Cheng's workflow has the corpus-taxonomy and probe-check commands. They exist to force the AI into diagnosis before it writes any code. AutoResearch skips diagnosis entirely because the metric rewards any successful change, whatever its cause. Pretext can't do that. A change that fixes Arabic at 527px might break Thai at 400px. The AI has to understand the failure to fix it without creating new ones.

The other structural difference: in Pretext, Cheng stays in the loop. AutoResearch is designed to run while the human sleeps. Pretext has Cheng verifying each fix against the browser before moving on, making judgment calls about changes that help one browser and hurt another (the Myanmar case documented in AGENTS.md is a good example - a fix that improved Chrome accuracy hurt Safari, so Cheng didn't ship it). AutoResearch doesn't have this problem because there's one metric across one environment. Cheng has three browsers and nine language families, and "correct" sometimes means different things depending on which browser you ask.

What both projects share is the architectural insight Karpathy put into words: the human's job is to design the environment and define what "better" means. The agent's job is to iterate inside that environment toward the target. In AutoResearch, the environment is a GPU and a training script. In Pretext, it's a browser and a text corpus. Both humans wrote a markdown file, set up an evaluation harness, and got out of the way - partially.

Karpathy called AutoResearch "the final boss battle" for AI labs and framed the next step as running swarms of parallel agents collaborating like a research community. Cheng's approach is quieter. One AI, one human, one language at a time. But the result - 7,680 tests, 9 languages, 100% accuracy - is the kind of number that a single developer working alone wouldn't have chased, because the testing infrastructure needed to get there would have cost more time than the problem seemed worth.


What You Can Build Now That Didn't Work Before

A company called aiia.ro built a demo where a dragon follows your cursor through a full page of text. As the cursor moves, every line of text reflows around the dragon's body in real time, splitting on both sides, flowing into every gap, then closing again. Sixty frames per second. No DOM reads at all. The full-page layout computation runs under half a millisecond per frame.

That's Pretext's layoutNextLine() function in action. You pass a different maximum width to each individual line. Lines beside the dragon: narrower. Lines away from it: full width. Pretext calculates the breaks for each line independently using cached word widths, with no browser contact at any point in the animation.

Chat bubbles have a problem that messaging apps have accepted rather than fixed. When the last line of a message bubble is two words, those two words sit inside a full-width container with a lot of empty space beside them. It looks cheap. The correct version is a bubble that's only as wide as its widest line. CSS can't compute that. There's no CSS property that asks "what's the narrowest width that still fits this text without adding extra lines?" But Pretext's walkLineRanges() does it by binary-searching widths and checking line counts, all using arithmetic on cached data, in under a millisecond for a full conversation. iMessage, Telegram, WhatsApp - every major messaging interface has this problem. Pretext makes solving it free.
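
The binary search itself is simple once line counting is cheap. A sketch, where lineCount is a stand-in for a Pretext-style layout pass over cached word widths (not the real walkLineRanges() API):

```typescript
// Find the narrowest container width that keeps the line count unchanged.
// Assumes lineCount is monotone: narrower widths never produce fewer lines.
function tightestWidth(
  lineCount: (width: number) => number,
  maxWidth: number
): number {
  const target = lineCount(maxWidth);
  let lo = 1;
  let hi = maxWidth;
  while (lo < hi) {
    const mid = Math.floor((lo + hi) / 2);
    if (lineCount(mid) === target) {
      hi = mid; // still the same number of lines; try narrower
    } else {
      lo = mid + 1; // too narrow; a line was added
    }
  }
  return lo;
}
```

Each probe is arithmetic on cached widths, so even a dozen binary-search steps per bubble stay far under a millisecond.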

For virtual scrolling specifically: call prepare() once when content loads. Cache the prepared objects. Call layout() whenever you need heights. Never measure text via the DOM again. Window resize becomes a layout() pass instead of a DOM read pass. The stutter was caused by the DOM reads. With them gone, the stutter goes with them.
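
Once heights are cheap, the rest of virtual scrolling is bookkeeping. A sketch of the positioning side, assuming the heights array came from a Pretext-style layout() pass (here they are just given):

```typescript
// Prefix sums over item heights: tops[i] is item i's y offset,
// and the last entry is the total scrollable height.
function offsets(heights: number[]): number[] {
  const tops = [0];
  for (const h of heights) tops.push(tops[tops.length - 1] + h);
  return tops;
}

// Linear scan for the visible window (a real implementation would
// binary-search tops). Returns inclusive item indices to render.
function visibleRange(
  tops: number[],
  scrollTop: number,
  viewportHeight: number
): [number, number] {
  let first = 0;
  while (first + 1 < tops.length - 1 && tops[first + 1] <= scrollTop) first++;
  let last = first;
  while (last + 1 < tops.length - 1 && tops[last + 1] < scrollTop + viewportHeight) last++;
  return [first, last];
}
```

On resize, rerun layout() per item, rebuild the prefix sums, and the scroll position maps cleanly onto the new geometry - no DOM read anywhere in the path.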


Not Every Website Needs Pretext

Pretext is not a drop-in upgrade for every website. Cheng Lou says this in the README and he means it.

The 0.09ms figure is the hot path after prepare() has already run. The prepare() call itself costs about 19ms for 500 texts. For a blog, a marketing site, anything where text gets measured once and doesn't change, Pretext adds overhead rather than removing it. The performance gains are specifically for applications that measure text heights repeatedly, at scale, in tight loops: virtualized feeds, real-time resize handling, animation-driven layouts. If your app doesn't do those things, the browser's native layout engine is fine and this library won't help you.

The library is also still early. The AGENTS.md file, the same one that guided the AI's work, documents several unresolved problems openly. Myanmar has a case where a fix that improves Chrome accuracy hurts Safari, so Cheng hasn't shipped it. Chinese shows a one-line discrepancy at narrow widths that appears to depend on whether the user has PingFang SC or Songti SC installed. Urdu at 300px container width is off by two lines in both Chrome and Safari, and the root cause isn't identified yet.

What the library is: production-quality for common text scenarios in 9+ languages, built with a development methodology that's genuinely new, published free, with the browser bugs it uncovered filed against the browsers themselves. Cheng Lou also wrote 1,055 lines of public research notes in RESEARCH.md documenting every dead end, every failed approach, every fix that worked on one browser and broke another.

Very rare to see an open source project published with the research trail left in it.