Kun Chen

How I built a starry night in TUI

Kun Chen — Tue, 28 Apr 2026 22:03:48 GMT

I’ve been building gnhf, a CLI that orchestrates coding agents to get work done overnight. The name stands for “good night, have fun”, so the TUI leans into the bedtime metaphor: a centered status panel sitting under a slow-twinkling starfield, with a moon strip tracking each iteration.

I wanted the background to feel alive without ever stealing attention from the actual run state.

I ended up implementing the renderer with no TUI framework, no “small game engine” - just a small cell grid, a seeded random number generator, and an ANSI diff at 5 FPS.

This post walks through how it works. The whole renderer is a few hundred lines of TypeScript and you can read it in src/renderer.ts and src/utils/stars.ts.

Why no framework

I actually started with Ink. It’s the obvious choice for a Node TUI - React mental model, flexbox layout, good ergonomics. For a static panel it would have been fine.

The problem showed up the moment I added the starfield. Ink re-renders through React’s reconciler: every animation tick walks a virtual DOM, diffs it, and hands the result to Yoga for layout. For a 120x80 grid where ~30 cells change per frame, that’s a lot of machinery to move a handful of characters. I could see the CPU spike on every tick, and on slower terminals the redraw lagged behind the 200ms tick.

The actual problem is very simple. The TUI is a fixed grid of cells - no scrolling, no input widgets, no focus, no flexbox. What I needed was a 2D Cell[][] buffer, a function that diffs two of them, and an ANSI emitter for the diff. That’s about 100 lines, it has zero per-frame allocation overhead beyond the buffer itself, and it scales with changed cells instead of total cells. At 5 FPS with 30-cell deltas, the renderer is essentially free.

We shouldn’t assume a framework is needed before analyzing the problem we’re trying to solve. Frameworks should earn their way into our projects.

Picking the star characters

The first thing I tried was the obvious one: *. It looked terrible. Asterisks are loud, they sit on the baseline, and a screen full of them reads as code, not sky.

What I actually wanted was the visual weight of a real night sky - mostly faint dots, a few brighter accents, nothing that competes with the text. I went through the Unicode block of misc symbols and settled on a small palette, weighted toward the quietest characters:

const STAR_CHARS = [
  "·",
  "·",
  "·",
  "·",
  "·",
  "·",
  "✧",
  "⋆",
  "⋆",
  "⋆",
  "°",
  "°",
] as const;

The duplication is the weighting. When I pick a character with Math.floor(rand() * STAR_CHARS.length), half the stars come out as middle dots (6 of 12), a quarter as the small star ⋆ (3 of 12), and the rest split between the bright four-pointed ✧ (1 of 12) and the small ring ° (2 of 12). That ratio is what makes it read as a sky instead of a pattern.
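To make the ratios concrete, the weights implied by the duplicated entries can be tallied directly. This is a small sketch; `starWeights` is a hypothetical helper for illustration, not something taken from gnhf.

```typescript
const STAR_CHARS = [
  "·", "·", "·", "·", "·", "·",
  "✧",
  "⋆", "⋆", "⋆",
  "°", "°",
] as const;

// Hypothetical helper: the share of the palette each character occupies,
// which is also the probability of drawing it with a uniform random index.
function starWeights(chars: readonly string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const ch of chars) {
    counts.set(ch, (counts.get(ch) ?? 0) + 1);
  }
  const weights = new Map<string, number>();
  counts.forEach((n, ch) => weights.set(ch, n / chars.length));
  return weights;
}

const paletteWeights = starWeights(STAR_CHARS);
console.log(paletteWeights.get("·")); // 0.5
console.log(paletteWeights.get("⋆")); // 0.25
```

Changing the mix is then just a matter of adding or removing entries, with no separate weight table to keep in sync.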

A few things that mattered:

All single-width characters. Wide graphemes wreck a fixed cell grid - one stray emoji and every column to its right shifts.
All visually centered in their cell. * sits low; · and ⋆ sit in the middle and don’t fight the line height.
No characters that look like punctuation in context. . and ' would have been disastrous next to the prompt text.

Animating the stars

Real stars don’t blink in unison. The cheapest way to fake that is to give every star its own clock.

When I generate the field, each star gets a random phase offset and a random period somewhere between 10 and 25 seconds:

stars.push({
  x,
  y,
  char: STAR_CHARS[charIdx],
  phase: rand() * Math.PI * 2,
  period: 10_000 + rand() * 15_000,
  rest,
});

rest is the state the star sits in most of the time - mostly bright, sometimes dim, occasionally hidden. The hidden ones are important: they’re empty cells that occasionally blink into view, which is what stops the field from looking static.

The animation itself is a tiny state machine driven by wall-clock time. Each star is in its rest state for ~95% of its period, then runs through a brief blink envelope:

export function getStarState(star: Star, now: number): StarState {
  const t =
    ((now % star.period) / star.period + star.phase / (Math.PI * 2)) % 1;
  if (t > 0.05) return star.rest;
  if (star.rest === "bright" || star.rest === "hidden") {
    const opposite = star.rest === "bright" ? "hidden" : "bright";
    if (t > 0.0325) return "dim";
    if (t > 0.0175) return opposite;
    return "dim";
  }
  if (t > 0.025) return "bright";
  return "dim";
}

The shape is dim → opposite → dim → rest. A bright star fades down before it disappears; a hidden star fades up before it shows. That little three-step ramp is the difference between “twinkling” and “flickering”. Sharp on/off transitions look like rendering bugs.
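As a quick check of that envelope, the function can be sampled at a few points inside and just after the blink window. The `Star` and `StarState` types below are assumptions inferred from the snippets above, and `sampleBlink` is a hypothetical helper; `getStarState` is repeated only so the sketch runs on its own.

```typescript
type StarState = "bright" | "dim" | "hidden";

// Assumed shape of a star, inferred from the stars.push(...) call above.
interface Star {
  x: number;
  y: number;
  char: string;
  phase: number;
  period: number;
  rest: StarState;
}

// Same logic as getStarState above.
function getStarState(star: Star, now: number): StarState {
  const t =
    ((now % star.period) / star.period + star.phase / (Math.PI * 2)) % 1;
  if (t > 0.05) return star.rest;
  if (star.rest === "bright" || star.rest === "hidden") {
    const opposite = star.rest === "bright" ? "hidden" : "bright";
    if (t > 0.0325) return "dim";
    if (t > 0.0175) return opposite;
    return "dim";
  }
  if (t > 0.025) return "bright";
  return "dim";
}

// Hypothetical helper: sample one blink of a star with phase 0 and a
// 10 s period, at t = 0.01, 0.025, 0.04 and 0.06.
function sampleBlink(rest: StarState): StarState[] {
  const star: Star = { x: 0, y: 0, char: "·", phase: 0, period: 10_000, rest };
  return [100, 250, 400, 600].map((now) => getStarState(star, now));
}

console.log(sampleBlink("bright")); // ["dim", "hidden", "dim", "bright"]
console.log(sampleBlink("hidden")); // ["dim", "bright", "dim", "hidden"]
```

With a 10 second period the whole blink lasts 500ms, which at a 200ms tick means only two or three frames ever see it.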

Three states map cleanly to ANSI styles, no truecolor needed:

function starStyle(state: StarState): Style {
  if (state === "bright") return "bold";
  if (state === "dim") return "dim";
  return "normal";
}

bold and dim are SGR 1 and SGR 2, supported by every terminal I care about. Hidden stars render as a literal space.

One detail worth calling out: the star positions come from a seeded PRNG (pseudo-random number generator - a function that produces a deterministic sequence of “random” numbers from a starting seed). I used a Park-Miller LCG, which is one multiply and one modulo per call:

let s = seed;
const rand = () => {
  s = (s * 16807 + 0) % 2147483647;
  return s / 2147483647;
};

Determinism matters here - if the field regenerated freshly on every frame, the stars would jump around. Seeding it means the same terminal size always produces the same sky. Math.random() would have worked for the randomness, but it isn’t seedable in Node, so a few lines of LCG was the simpler answer.
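Here is a minimal sketch of how that determinism plays out, wrapping the generator in a factory so each field owns its own stream. `makeRand` and `generatePositions` are hypothetical names, and the star count and seed are made up for the example. One caveat with this generator: the seed should be between 1 and 2147483646, because a seed of 0 stays at 0 forever.

```typescript
// Park-Miller LCG wrapped in a factory (hypothetical name).
function makeRand(seed: number): () => number {
  let s = seed;
  return () => {
    s = (s * 16807) % 2147483647;
    return s / 2147483647;
  };
}

// Hypothetical field generator: positions depend only on the size and seed.
function generatePositions(
  width: number,
  height: number,
  count: number,
  seed: number,
): Array<{ x: number; y: number }> {
  const rand = makeRand(seed);
  const points: Array<{ x: number; y: number }> = [];
  for (let i = 0; i < count; i++) {
    const x = Math.floor(rand() * width);
    const y = Math.floor(rand() * height);
    points.push({ x, y });
  }
  return points;
}

const skyA = generatePositions(120, 40, 5, 42);
const skyB = generatePositions(120, 40, 5, 42);
console.log(JSON.stringify(skyA) === JSON.stringify(skyB)); // true
```

The intermediate product s * 16807 stays far below 2^53, so plain JavaScript numbers are enough and no BigInt is needed.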

Avoiding the main content

Another interesting problem was making sure the stars never overlapped the actual content we need to display.

The naive approach is to render stars under everything and overdraw. That works visually but it’s wasteful, and it creates a flicker risk if the diff order is wrong.

Instead, I split the screen into three regions and only generate stars where they’re allowed:

┌────────────────────────────────────┐
│      top stars  (full width)       │
│                                    │
│ side  │   content panel   │ side   │
│ stars │                   │ stars  │
│       │   (no stars)      │        │
│                                    │
│     bottom stars (full width)      │
└────────────────────────────────────┘

The content panel is a fixed CONTENT_WIDTH. Above and below it, stars get the full terminal width. To either side, only the margin gets stars. The frame builder stitches them together row by row:

for (let i = 0; i < contentRows.length; i++) {
  const left = renderSideStarsCells(sideStars, i, 0, sideWidth, now);
  const center = centerLineCells(contentRows[i], CONTENT_WIDTH);
  const right = renderSideStarsCells(
    sideStars,
    i,
    terminalWidth - sideWidth,
    sideWidth,
    now,
  );
  frame.push([...left, ...center, ...right]);
}

Three independent star fields - top, bottom, sides - each with their own seed. The side field gets generated at full terminal width but placeStarsInCells only emits stars whose x falls inside the requested column range, so a star that would have landed under the panel just doesn’t exist.

The win here is that there’s no z-ordering, no occlusion test, no overdraw. Each cell in the final frame has exactly one writer. When the panel grows or shrinks (the layout drops sections when the terminal is short), the regions recompute and the stars follow. No leaks, no half-overwritten characters.
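The column filter can be sketched in a few lines. The real signature of placeStarsInCells isn't shown here, so `sideRow` and `FieldStar` below are hypothetical stand-ins that only illustrate the idea: a star outside the requested column range is simply never written.

```typescript
interface FieldStar {
  x: number;
  y: number;
  char: string;
}

// Hypothetical stand-in for the column filter: build one row of `width`
// characters starting at column `xOffset`, keeping only the stars whose x
// falls inside [xOffset, xOffset + width).
function sideRow(
  stars: FieldStar[],
  row: number,
  xOffset: number,
  width: number,
): string[] {
  const cells: string[] = new Array<string>(width).fill(" ");
  for (const star of stars) {
    if (star.y !== row) continue;
    const localX = star.x - xOffset;
    if (localX < 0 || localX >= width) continue; // outside this margin
    cells[localX] = star.char;
  }
  return cells;
}

// Made-up field on a 20-column terminal with 4-column margins.
const demoField: FieldStar[] = [
  { x: 2, y: 0, char: "·" }, // left margin
  { x: 10, y: 0, char: "⋆" }, // would land under the panel
  { x: 17, y: 0, char: "°" }, // right margin
];
const demoTerminalWidth = 20;
const demoSideWidth = 4;

const leftCells = sideRow(demoField, 0, 0, demoSideWidth);
const rightCells = sideRow(
  demoField,
  0,
  demoTerminalWidth - demoSideWidth,
  demoSideWidth,
);
console.log(leftCells.join("") + "|" + rightCells.join("")); // "  · | °  "
```

The star at x = 10 appears in neither margin, which is the "just doesn't exist" behavior described above.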

Diffing frames

5 FPS doesn’t sound like much, but a 120x80 terminal is 9,600 cells. Repainting all of them every 200ms gives you a visible flicker on slow terminals and burns bandwidth over SSH.

So the renderer keeps the previous frame in memory, builds the next one as a 2D array of Cell objects, and only emits ANSI for cells that actually changed:

export function diffFrames(prev: Cell[][], next: Cell[][]): Change[] {
  const changes: Change[] = [];
  const rows = Math.min(prev.length, next.length);
  for (let r = 0; r < rows; r++) {
    const prevRow = prev[r];
    const nextRow = next[r];
    const cols = Math.min(prevRow.length, nextRow.length);
    for (let c = 0; c < cols; c++) {
      const n = nextRow[c];
      if (n.width === 0) continue;
      const p = prevRow[c];
      if (p.char !== n.char || p.style !== n.style || p.width !== n.width) {
        changes.push({ row: r, col: c, cell: n });
      }
    }
  }
  return changes;
}

Cell is the unit the whole renderer trades in:

export interface Cell {
  char: string;
  style: Style; // "normal" | "bold" | "dim"
  width: number; // 1 normal, 2 wide, 0 continuation
}

The width: 0 field deserves a little more attention, because it’s where a real bug used to live.

The moon strip uses emoji - 🌑🌒🌓🌔🌕🌖🌗🌘. In every modern terminal these render two columns wide, but JavaScript sees each of them as one grapheme made of two UTF-16 code units (a surrogate pair). If you naively put one moon in one cell of your buffer, your buffer thinks column 5 is occupied but the terminal has actually painted columns 5 and 6. Every cell to the right of the moon is now off by one, and the strip smears into the stars next to it.

The fix is to make the buffer agree with the terminal. A wide grapheme occupies two adjacent cells: the first holds the character with width: 2, the second is a placeholder with width: 0 and an empty char. The diff skips continuation cells outright:

if (n.width === 0) continue;

That single line is what keeps the moon strip aligned. When the active moon advances one phase, the diff sees exactly one changed cell, emits one cursor move and one character, and the continuation slot stays untouched. Without it we’d either double-emit (paint the moon, then paint a stray space over its right half) or get half-rendered moons on frames where only one of the two cells “changed”.

The same mechanism handles CJK characters that show up in agent output. No special case for emoji, no special case for Chinese - one width field, one rule.
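A minimal sketch of laying text out into cells under that rule follows. The width check is a rough assumption for illustration only (a few hard-coded code point ranges); a real renderer would want a proper East Asian Width lookup, and this version does not handle multi-code-point graphemes such as ZWJ sequences. `displayWidth` and `textToCells` are hypothetical names.

```typescript
type Style = "normal" | "bold" | "dim";

interface Cell {
  char: string;
  style: Style;
  width: number; // 1 normal, 2 wide, 0 continuation
}

// Rough heuristic, an assumption for this sketch only.
function displayWidth(codePoint: number): 1 | 2 {
  const wide =
    (codePoint >= 0x1f300 && codePoint <= 0x1faff) || // pictographs, incl. moon phases
    (codePoint >= 0x4e00 && codePoint <= 0x9fff) || // CJK unified ideographs
    (codePoint >= 0x3040 && codePoint <= 0x30ff) || // hiragana and katakana
    (codePoint >= 0xac00 && codePoint <= 0xd7a3); // hangul syllables
  return wide ? 2 : 1;
}

function textToCells(text: string, style: Style = "normal"): Cell[] {
  const cells: Cell[] = [];
  for (let i = 0; i < text.length; ) {
    const cp = text.codePointAt(i) ?? 0;
    const ch = String.fromCodePoint(cp);
    const width = displayWidth(cp);
    cells.push({ char: ch, style, width });
    // A wide character claims the next column with an empty placeholder.
    if (width === 2) cells.push({ char: "", style, width: 0 });
    i += ch.length; // 2 code units for astral characters such as 🌕
  }
  return cells;
}

const moonCells = textToCells("a🌕b");
console.log(moonCells.map((c) => c.width)); // [1, 2, 0, 1]
```

The buffer index of "b" is 3, which matches the terminal column it is painted in, so nothing to the right of the moon drifts.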

Emitting the diff is also boring on purpose. Move the cursor with CSI row;col H, set the style if it changed, write the character, advance the cursor in memory:

for (const { row, col, cell } of changes) {
  if (row !== cursorRow || col !== cursorCol) {
    result += `\x1b[${row + 1};${col + 1}H`;
  }
  if (cell.style !== currentStyle) {
    result += "\x1b[0m";
    if (cell.style === "bold") result += "\x1b[1m";
    else if (cell.style === "dim") result += "\x1b[2m";
    currentStyle = cell.style;
  }
  result += cell.char;
  cursorRow = row;
  cursorCol = col + cell.width;
}

Two small optimizations carry most of the win:

Skip the cursor move if it’s already there. Adjacent changed cells become bare characters with no escape sequence in between.
Don’t re-emit style codes that haven’t changed. A run of bright stars on the same row is one \x1b[1m followed by the characters.
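To see both optimizations in the output bytes, here is the same loop wrapped in a self-contained function. The function name `emitChanges` and the starting values (cursor position unknown, style "normal") are assumptions made for this sketch.

```typescript
type Style = "normal" | "bold" | "dim";

interface Cell {
  char: string;
  style: Style;
  width: number;
}

interface Change {
  row: number;
  col: number;
  cell: Cell;
}

function emitChanges(changes: Change[]): string {
  let result = "";
  let cursorRow = -1; // assumed: unknown position forces the first move
  let cursorCol = -1;
  let currentStyle: Style = "normal"; // assumed starting style
  for (const { row, col, cell } of changes) {
    if (row !== cursorRow || col !== cursorCol) {
      result += `\x1b[${row + 1};${col + 1}H`; // CSI is 1-indexed
    }
    if (cell.style !== currentStyle) {
      result += "\x1b[0m";
      if (cell.style === "bold") result += "\x1b[1m";
      else if (cell.style === "dim") result += "\x1b[2m";
      currentStyle = cell.style;
    }
    result += cell.char;
    cursorRow = row;
    cursorCol = col + cell.width;
  }
  return result;
}

// Two adjacent bold cells: one cursor move, one style code, two characters.
const adjacentOut = emitChanges([
  { row: 0, col: 0, cell: { char: "a", style: "bold", width: 1 } },
  { row: 0, col: 1, cell: { char: "b", style: "bold", width: 1 } },
]);
console.log(adjacentOut === "\x1b[1;1H\x1b[0m\x1b[1mab"); // true
```

The second cell costs exactly one byte, because neither the cursor position nor the style needed to change.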

In practice a typical frame in steady state is maybe 10-30 cell changes - a handful of stars transitioning, the elapsed timer ticking, and the active moon advancing one phase. That’s a few hundred bytes per frame, comfortably under any terminal’s redraw budget.

The render loop is one line:

this.interval = setInterval(() => this.render(), TICK_MS); // TICK_MS = 200

render() builds a frame, diffs it against the previous one, writes the diff, stores the new frame as the previous. That’s the whole engine.
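A minimal sketch of that loop, with the frame builder, the diff-and-emit step, and the output sink passed in as functions so it runs without a terminal. `FrameLoop` and its parameter names are hypothetical; in a real Node program the sink would presumably be something like (out) => process.stdout.write(out).

```typescript
class FrameLoop<F> {
  private prev: F;
  private timer: ReturnType<typeof setInterval> | undefined;
  private readonly build: (now: number) => F;
  private readonly toAnsi: (prev: F, next: F) => string;
  private readonly write: (out: string) => void;

  constructor(
    initial: F,
    build: (now: number) => F,
    toAnsi: (prev: F, next: F) => string,
    write: (out: string) => void,
  ) {
    this.prev = initial;
    this.build = build;
    this.toAnsi = toAnsi;
    this.write = write;
  }

  render(now: number = Date.now()): void {
    const next = this.build(now); // build a frame
    const out = this.toAnsi(this.prev, next); // diff against the previous one
    if (out.length > 0) this.write(out); // skip the write when nothing changed
    this.prev = next; // the new frame becomes the previous one
  }

  start(tickMs: number = 200): void {
    this.timer = setInterval(() => this.render(), tickMs);
  }

  stop(): void {
    if (this.timer !== undefined) clearInterval(this.timer);
    this.timer = undefined;
  }
}

// Toy usage: frames are plain strings and the "diff" is the whole new frame.
const demoOutput: string[] = [];
let demoFrame = "a";
const demoLoop = new FrameLoop<string>(
  "",
  () => demoFrame,
  (prev, next) => (prev === next ? "" : next),
  (out) => {
    demoOutput.push(out);
  },
);
demoLoop.render(0); // writes "a"
demoLoop.render(200); // unchanged, writes nothing
demoFrame = "b";
demoLoop.render(400); // writes "b"
console.log(demoOutput); // ["a", "b"]
```

Keeping render() callable on its own, separate from the timer, also makes the loop easy to drive from a test.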

Wrapping up

The starfield ended up looking really nice, and it gives me a feeling of zen.

If you want to poke at it, the code is at github.com/kunchenguid/gnhf. The two files worth reading are src/utils/stars.ts (~80 lines) and src/renderer-diff.ts (~120 lines). Have fun!

Org-Bench: Let’s Simulate the Org Charts Meme with Agents and See Who Wins

Kun Chen — Wed, 22 Apr 2026 19:40:23 GMT

You might have seen Manu Cornet’s org charts picture at some point. It became a popular meme because the stereotypes felt so accurate.

But have you ever wondered how these org structures would actually perform, if we put them to work and compare results side by side?

I say it’s time to get some data! We have agents now - let’s reproduce these org charts with agent teams and see how they work. Same input, same model, six org archetypes wired up as multi-agent topologies, each one asked to ship the same product.

Who will ship the best product in the shortest time? Let’s find out!

Spoiler alert: I ended up spending hundreds of millions of tokens and these agent teams took days to run. The result really blew my mind. I can’t wait to share that with you.

Setup

Here’s the benchmark setup I designed for running the simulation. Everything from the benchmark harness to the result datasets is open sourced at https://github.com/kunchenguid/org-bench.

The task: Build an in-browser spreadsheet in vanilla HTML/CSS/JS. The project brief is in configs/brief.md.

The agent harness: We use opencode which is a CLI agent harness. The biggest thing I like about opencode is that it’s model-provider agnostic, which allows us to easily test different models down the road.

One agent = one opencode session. Every agent is a separate opencode session running in its own subprocess. Every agent and the judge all run on openai/gpt-5.4. Any output difference between topologies comes from topology, since the agent harness and models are all consistent.

Per-run isolation. Every run gets a disposable sandbox. At the end of each topology run the whole directory gets wiped, so two topology runs never see each other’s work.

Topology config. A topology is a plain TypeScript object: an array of agent names, a list of bidirectional edges (who can message whom), a named leader, a list of developers, a list of integrators (the agents allowed to merge PRs to main), and a culture definition string. The six configs all live in configs/topologies/.
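To give a feel for what such an object might look like, here is a sketch of the hub-and-spoke case. The field names are guesses based on the description above, not the actual schema in configs/topologies/, and listing all eight workers (but not the leader) as developers is an assumption.

```typescript
// Hypothetical shape; the real type may differ.
interface Topology {
  agents: string[];
  edges: Array<[string, string]>; // bidirectional
  leader: string;
  developers: string[];
  integrators: string[]; // agents allowed to merge PRs to main
  culture: string;
}

const hubWorkers = [
  "Alice", "Ben", "Carol", "Dave",
  "Emma", "Frank", "Grace", "Henry",
];

const appleLike: Topology = {
  agents: ["Steve", ...hubWorkers],
  // Hub and spoke: one edge per worker, all ending at the leader.
  edges: hubWorkers.map((w): [string, string] => ["Steve", w]),
  leader: "Steve",
  developers: hubWorkers, // assumption
  integrators: ["Steve"],
  culture: "taste bar + secrecy. Polish-first. Quality over schedule.",
};

console.log(appleLike.edges.length); // 8
```

Because the org structure is just data, swapping topologies means swapping one object while the harness stays the same.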

Inter-agent communication. Messages route through per-agent inboxes. The orchestrator enforces adjacency: If an agent tries to send to someone they don’t have an edge with, the message gets dropped. This is the whole “org structure” machinery - the edge list enforces the org structure.
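The adjacency check itself can be tiny. This is a sketch under assumed names (`AgentMessage`, `canMessage`, `deliver`); the real orchestrator's types and function names are not shown in this post.

```typescript
type Edge = [string, string];

interface AgentMessage {
  from: string;
  to: string;
  content: string;
}

// Edges are bidirectional, so check both orientations.
function canMessage(edges: Edge[], from: string, to: string): boolean {
  return edges.some(
    ([a, b]) => (a === from && b === to) || (a === to && b === from),
  );
}

// Returns true if the message reached an inbox, false if it was dropped.
function deliver(
  edges: Edge[],
  inboxes: Map<string, AgentMessage[]>,
  msg: AgentMessage,
): boolean {
  if (!canMessage(edges, msg.from, msg.to)) return false;
  const inbox = inboxes.get(msg.to) ?? [];
  inbox.push(msg);
  inboxes.set(msg.to, inbox);
  return true;
}

const demoEdges: Edge[] = [
  ["Steve", "Alice"],
  ["Steve", "Ben"],
];
const demoInboxes = new Map<string, AgentMessage[]>();

console.log(deliver(demoEdges, demoInboxes, { from: "Alice", to: "Steve", content: "PR ready" })); // true
console.log(deliver(demoEdges, demoInboxes, { from: "Alice", to: "Ben", content: "hi" })); // false
```

In the hub-and-spoke example, Alice can reach Steve but her message to Ben never lands, which is exactly the constraint the Apple run operates under.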

A round. Every agent wakes every round. Within a round all agents execute in parallel (each opencode session gets one prompt, runs tools, writes a JSON reply); the round only ends when the last one finishes or a safety timer fires. Rounds are sequential - round N+1 doesn’t start until round N is complete. We cap at a total of 28 rounds to avoid infinite rabbit holes.
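The round structure can be sketched as a loop over Promise.all with a timer racing it. All names here (`AgentTurn`, `raceWithTimer`, `runRounds`, `isFinal`) are hypothetical, and the real orchestrator presumably does much more per round (prompt assembly, inbox routing, git operations).

```typescript
type AgentTurn = (round: number) => Promise<void>;

// Resolves with "timeout" if `work` has not settled within `ms`.
function raceWithTimer(
  work: Promise<unknown>,
  ms: number,
): Promise<"done" | "timeout"> {
  return new Promise<"done" | "timeout">((resolve, reject) => {
    const timer = setTimeout(() => resolve("timeout"), ms);
    work.then(
      () => {
        clearTimeout(timer);
        resolve("done");
      },
      (err: unknown) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

// Returns the number of rounds that ran.
async function runRounds(
  turns: AgentTurn[],
  maxRounds: number,
  roundTimeoutMs: number,
  isFinal: () => boolean,
): Promise<number> {
  let completed = 0;
  for (let round = 1; round <= maxRounds; round++) {
    // Every agent wakes and runs in parallel; the round ends when the
    // last one finishes or the safety timer fires.
    await raceWithTimer(
      Promise.all(turns.map((turn) => turn(round))),
      roundTimeoutMs,
    );
    completed = round;
    if (isFinal()) break; // e.g. the leader emitted the final-submission marker
  }
  return completed;
}
```

Awaiting each round before starting the next is what makes rounds sequential, and the maxRounds argument plays the role of the 28-round cap.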

Per-round prompt. Each agent’s prompt is assembled by the orchestrator from:

How many rounds remain
The agent’s persona - leader vs developer vs integrator
The team charter (everyone’s expectations, not just yours, so agents can infer what peers are working on)
The agent’s direct neighbors and their roles
The culture summary for that topology
The path to the brief
The agent’s current inbox messages
The agent-browser CLI reference
A required reply format: JSON with a messages: [{to, tag?, content}] array and an optional summary
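The reply format maps naturally onto a small type plus a validating parser. This is a sketch under assumptions: `AgentReply`, `ReplyMessage`, and `parseReply` are made-up names, and how the real orchestrator treats a malformed reply is not stated here, so returning null is just a choice for the example.

```typescript
interface ReplyMessage {
  to: string;
  tag?: string;
  content: string;
}

interface AgentReply {
  messages: ReplyMessage[];
  summary?: string;
}

// Returns the parsed reply, or null if it does not match the format.
function parseReply(raw: string): AgentReply | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const obj = data as { messages?: unknown; summary?: unknown };
  if (!Array.isArray(obj.messages)) return null;

  const messages: ReplyMessage[] = [];
  for (const m of obj.messages as unknown[]) {
    if (typeof m !== "object" || m === null) return null;
    const { to, tag, content } = m as {
      to?: unknown;
      tag?: unknown;
      content?: unknown;
    };
    if (typeof to !== "string" || typeof content !== "string") return null;
    let tagValue: string | undefined;
    if (typeof tag === "string") tagValue = tag;
    else if (tag !== undefined) return null;
    messages.push(
      tagValue === undefined ? { to, content } : { to, tag: tagValue, content },
    );
  }

  const summary = obj.summary;
  if (typeof summary === "string") return { messages, summary };
  return summary === undefined ? { messages } : null;
}

const demoReply = parseReply(
  '{"messages":[{"to":"Steve","content":"PR opened"}],"summary":"shell done"}',
);
console.log(demoReply?.messages.length); // 1
```

Validating at the boundary means the routing code downstream only ever sees well-formed messages.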

Git flow. Developers commit to their own per-agent branch, push, open a PR targeting run//main (or a staging branch under a sub-lead for Amazon/Oracle). Only integrators (defined per topology) can merge PRs; everyone else’s merge attempts bounce.

Finalize. When the leader emits THIS_IS_MY_FINAL_SUBMISSION, the orchestrator considers the work done and sends the judge a single prompt asking it to drive the resulting app through agent-browser like a user and return an 8-axis rubric JSON. We report the average score across the rubric.

The orchestrator is otherwise hands-off. No human in the loop, no scoring intervention, no retries of failed PRs.

The results

Without further ado, drumroll please... the final results are here!

The best way to understand the differences though is to dive into the details, and play with what they built yourself.

Now, let’s dive deeper into the super interesting details.

Apple (judge 3.00, 24 rounds, 7.20M tokens)

Try it: kunchenguid.github.io/org-bench/apple

      Alice   Ben   Carol   Dave
          \    |     |     /
           \   |     |    /
            \  |     |   /
             \ |     |  /
                Steve
             / |     |  \
            /  |     |   \
           /   |     |    \
          /    |     |     \
       Emma  Frank  Grace  Henry

How this topology is set up

Hub and spoke, to reflect what’s in the meme, which is of course not exactly how Apple the company works.

Steve is the leader, the sole integrator, and the single communication hub.

Eight workers (Alice, Ben, Carol, Dave, Emma, Frank, Grace, Henry) each have one bidirectional edge to Steve. No worker can talk to any other worker directly.

Culture overlay: “taste bar + secrecy. Polish-first. Quality over schedule.”

This is the most centralized topology in the benchmark - compare to Facebook where every pair of agents has an edge, or Google where workers report up through middle integrators. Steve is the only one who sees the whole picture, and he’s the only one who can ship.

What happened

Steve decomposed the brief into eight clean subsystems on day one and held review authority over every integration. Twelve PRs landed on main. Shell, formula engine, clipboard, persistence, structural edits, visual polish, all merged individually. The app rendered beautifully.

Then the judge typed =SUM(A1:A3) and watched it render as the literal string =SUM(A1:A3). The formula engine was merged. The shell was merged. But nobody wired the engine into the render layer properly.

The bottleneck was structural. Workers couldn’t negotiate seam contracts directly, so every scope collision had to be caught by Steve on review - Carol’s editing PR overlapping Ben’s shell, Alice’s Node-only engine export that no browser peer checked, Dave’s clipboard hardwiring the wrong engine export path. Steve was sending six to eight messages per round to eight different workers, also doing all conflict resolution and all live browser QA.

The one thing nobody was specifically looking at, the one thing a peer-review culture would have cross-checked - the render layer calling the formula engine - slipped through, because the only person looking for it was also the only person doing everything else.

Some conversations happened during the run:

1. Carol gets rejected for a conflict. Round 4, Steve → Carol:

“Request changes on PR #328. Ben’s shell is merged, so your PR now overlaps the shipped entry surface (index.html, app.js, styles.css) and exceeds your boundary. Rework on top of current run/apple/main so you deliver only single-cell interaction behavior wired into the existing shell.”

Carol opened #328 before Ben’s shell merged. Now her PR wanted to replace files Ben owned. Steve rejected the whole thing on scope, not content.

I’m sure this is far from how Apple actually works, but here we’re following the meme, and the hub and spoke topology prevents peer to peer coordination which inevitably results in conflicts like this. And even worse, Steve ended up being a bottleneck to rationalize all such conflicts.

2. Grace’s visual polish is the one PR Steve merges without comment. Round 5, Steve → Grace:

“Merged PR #330. The diff stayed inside styles.css, preserved Ben’s structure, and raised the polish bar cleanly. Shell presentation ownership remains with you for any follow-up visual defects I find in live QA.”

Compare to every other sub-team’s review. Alice: rework. Carol: rework. Dave: rework. Henry: rework. Grace: merged, complimented, held on retainer for future defects.

Steve likes visual polish.

3. Live QA finds the real product broken. Round 14, Steve → Carol:

“Live bug report on your subsystem from the merged app in run/apple/main: open index.html, click cell A1, type 2, then press Enter. Expected: commit 2 into A1 and move selection to A2. Actual: the in-cell editor remains open inside A1, the name box stays on A1, and the commit does not complete. I reproduced this in the real browser.”

Same round, Steve → Grace:

“I started the live browser pass on current main. The first blocker is functional rather than visual: single-cell Enter commit is broken, so I’m holding visual polish notes until that core edit flow works in the real app.”

This is round 14 of 24. The team has shipped eight subsystems and still can’t type a 2 into A1 and hit Enter. Steve did hold a high bar.

4. Steve’s final call. Round 24, Steve → Grace:

“I completed the final clean browser pass and did not find a concrete visual blocker that justifies another round. We are at ship quality on the current artifact.”

The judge’s verdict on the same artifact: “the browser snapshot still exposed only cell addresses, not evaluated results; source inspection explains why: app.js renderGridValues() writes state.cells[address] directly, and the formula engine is not wired into UI rendering.”

Steve’s bar was very high. But relying on a single leader to catch all bugs was unrealistic and didn’t work well.

Amazon (judge 3.12, 13 rounds, 3.50M tokens)

Try it: kunchenguid.github.io/org-bench/amazon

                  Jeff
                 /    \
              Alice    Ben
             /    \    / \
          Carol  Dave Frank Emma
                           /   \
                        Grace  Henry

How this topology is set up

Three-level tree with staging branches.

Jeff at the top, two tech leads (Alice and Ben) under him, each running their own subtree. Alice leads Carol and Dave. Ben leads Frank and a sub-sub-lead Emma, who runs her own two-person team (Grace and Henry).

Integrators: Jeff, Alice, Ben, Emma - but with a twist. Code doesn’t just flow into main. Each tech lead owns a staging branch (run/amazon/Alice, run/amazon/Ben, run/amazon/Emma), merges their subtree’s PRs into it, and then opens an integration PR upward.

Jeff is the only one who can merge to run/amazon/main. Grace and Henry’s work travels through three merges to reach main. Communication is hierarchical: workers only talk to their lead.

Culture overlay: “PR/FAQ writing + customer obsession + frugality.”

What happened

Amazon finished fastest of any topology - 13 rounds. The artifact looked good in isolation: real shell, real formulas, real undo, visible insert-row controls. But the score landed at 3.12 with specific feature gaps: reload persistence didn’t reload anything, and shift-range clear didn’t work across multiple cells. Both features were explicitly in the brief, and both were called out in Jeff’s own PR/FAQ on day one.

Two things happened. First, the three-layer hierarchy lost information in both directions. Ben told Emma one thing in round 5, a different thing in round 7, and Grace and Henry at the bottom of the tree never got re-briefed. When Grace opened a structural-edits PR with specific delete-row and delete-column controls, Emma blocked it claiming “Ben has it upstream.” The judge later confirmed exactly what Grace was building was missing from the shipped artifact. Information that would have crossed a peer edge in one hop had to traverse three layers, and it decayed.

Second, the PR/FAQ turned out to be Jeff’s chain-of-thought more than a team execution tool. It was referenced exactly twice in 165 messages - both by Jeff in round 1, to Alice and Ben, and never again. Zero citations by any of the eight other agents. Downstream decisions were made on informal claims (“Ben says he has it”) rather than against the PR/FAQ’s explicit feature list. Writing the customer story on day one helped Jeff think clearly. It didn’t help anyone else ship to it.

1. Jeff’s opening PR/FAQ. Round 2, Jeff → Alice and Ben (abridged):

“PR/FAQ: We are shipping a static in-browser spreadsheet that opens instantly from file:// and feels production-ready on first click. Customers can type directly into a 26x100 grid, use a visible formula bar, evaluate formulas with references and ranges... Build the foundation customers feel immediately: crisp grid rendering, unmistakable single-cell selection plus rectangular range selection, edit mode behavior, formula bar parity, keyboard navigation, and persistence namespace support.”

Every other topology dropped right into decomposition. Amazon wrote a press release first and tried to use it to guide execution. Reload persistence and range selection are right there in the opening sentences. They’re also exactly what the judge flagged broken 11 rounds later.

2. The directive that reversed and never propagated. Round 5, Ben → Emma:

“Keep Henry focused on fixing the formula-bar sync bug before you roll the broader interaction slice upward; once that is clean, your subtree can layer range selection and clipboard flows onto a stable foundation.”

Round 7, two rounds later, Ben → Emma again:

“Your subtree no longer needs to cover that ground; focus your upward work on the interaction gaps we still do not have on Ben, especially range selection and clipboard behavior.”

Ben reversed himself cleanly. Emma updated her instructions to Grace and Henry. But she compressed both messages into her own framing, and by the time Grace was deciding what to build in round 10, “Ben has range selection and clipboard” had become the operative assumption. The original directive, the reversal, and the actual state of Ben’s branch were all different things, and only Ben knew which was current.

3. Emma blocks Grace on a feature Ben didn’t actually have. Round 13, Emma → Grace on PR #437:

“Ben has already moved his branch to 0fabd81 with browser-visible row and column insert/delete controls plus the supporting model behavior, so landing the same structural-edit surface here would duplicate upstream work instead of closing the next customer gap.”

Grace had built specific delete-row and delete-column controls. Emma’s claim was inherited from Ben’s high-level status, not verified by reading Ben’s branch. The judge later: “the DOM snapshot shows only + row controls” - insert-only, no delete. Exactly what Grace was adding. In a mesh, Grace could have pinged Ben directly. In a tree, Emma’s translation of Ben’s claim is the only channel, and “Ben has structural edits” didn’t distinguish “insert” from “insert and delete.”

Facebook (judge 3.38, 20 rounds, 5.99M tokens)

Try it: kunchenguid.github.io/org-bench/facebook

     Mark   Alice   Ben   Carol   Dave
        \    |      |      |     /
         \   |      |      |    /
    (full mesh: all 36 edges present)
         /   |      |      |    
        /    |      |      |     
     Emma  Frank  Grace   Henry

How this topology is set up

Full mesh. Nine agents, and every pair has a bidirectional edge.
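The edge count follows from n(n-1)/2: nine agents give 9 × 8 / 2 = 36 pairs. A small sketch that generates the mesh, with `fullMesh` as a hypothetical helper name:

```typescript
// Hypothetical helper: one bidirectional edge for every unordered pair.
function fullMesh(agents: string[]): Array<[string, string]> {
  const edges: Array<[string, string]> = [];
  agents.forEach((a, i) => {
    for (const b of agents.slice(i + 1)) {
      edges.push([a, b]);
    }
  });
  return edges;
}

const meshAgents = [
  "Mark", "Alice", "Ben", "Carol", "Dave",
  "Emma", "Frank", "Grace", "Henry",
];
const meshEdges = fullMesh(meshAgents);
console.log(meshEdges.length); // 36
```

For comparison, the hub-and-spoke layout connects the same nine agents with only eight edges.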

Mark is the named leader but the adjacency gives him no structural advantage - everyone can reach everyone.

Every agent is both a developer and an integrator. Anyone can review and merge someone else’s PR.

Mark sets direction and removes blockers; he doesn’t gate-keep. Peers are expected to respond to review requests in the same round.

Culture overlay: “move fast. A merged imperfect change beats a perfect unmerged one.” This is the opposite extreme from Apple. No chokepoints, no single reviewer, and the only centralized authority is conventional, not enforced.

What happened

Facebook scored 3.38 with quite a polished UI. Then the judge entered 10, 20, 30 into A1, A2, A3 and asked for =SUM(A1:A3). The answer was 40. The formula engine was wrong.

The mesh meant peers didn’t wait for permission. Alice unilaterally merged Ben’s spreadsheet foundation as the team’s baseline without Mark’s sign-off. When she couldn’t get to a review in a round, she reassigned it to Carol directly. Emma pulled trunk, saw her own undo/redo branch was superseded, and stood down without being told.

Bug reports routed peer-to-peer: Henry told Mark, Mark told Alice in the same round, Alice shipped the fix in the same round. No leader bottleneck, parallel decisions, same-round turnarounds. This is what people mean when they say a flat org “moves fast.”

It’s also what they mean when they say accountability “diffuses.” Late in the run, Alice and Ben jointly concluded that the remaining keyboard-typing bug was a test-harness artifact rather than a product bug - verified each other’s instrumentation, agreed on the call, stood down together. Nobody above them was asked to validate the conclusion. That same distributed-trust dynamic is why SUM(10, 20, 30) returned 40 and shipped. In a mesh, the answer to “who’s supposed to verify the math?” is everyone, which in practice means no one.

1. Alice unilaterally picks the team’s baseline. Round 3, Alice → Ben, cc Mark:

“Merged your PR: #390. I reviewed the diff, confirmed it targets run/facebook/main, and noted that the included workbook tests cover evaluation, recalculation, circular refs, and formula shifting. Merged so the team can build on the stronger spreadsheet baseline this round.”

Alice had her own competing foundation PR. She reviewed Ben’s, concluded it was stronger, and merged it as the new trunk baseline. Mark was notified after the fact. In Apple this would have routed to Steve and cost rounds of review cycle. In Amazon it would have needed Jeff’s approval. In the mesh, two peers negotiated whose direction to adopt and the answer was binding.

2. Emma self-disqualifies her own branch. Round 3, Emma → Mark, Dave, Carol:

“I pulled the latest origin/run/facebook/main while working on undo/redo and found the merged mainline spreadsheet already has session history in app.js with 50-deep stacks plus Cmd/Ctrl+Z, Cmd/Ctrl+Shift+Z, and Cmd/Ctrl+Y. I’m not pushing my divergent duplicate branch because it is now based on the older scaffold and would just conflict with the merged shell.”

Nobody told Emma to stand down. She checked her own work against the new trunk and concluded it was redundant. In a mesh peers can verify their own relevance. In a tree, someone upstream has to notice and tell them.

3. Review reassigned directly between peers. Round 4, Alice → Carol:

“Can you take over review ownership for Frank’s PR this round: PR #395? I had to use this turn to sync and repair my stale PR against the merged trunk. Please either merge it or send concrete change requests back to Frank this round.”

Carol picked it up, verified the PR was already merged, and reported back. No leader involvement. Every integrator is fungible, so a peer can route around their own round-level constraint by pointing at another peer.

4. Jointly accepting an unresolved bug. Round 19, Alice → Ben, followed by Ben → Alice:

Alice: “I finished the instrumented debug pass on the clean local entrypoint. For the exact click A1 -> type 2 replay, the DOM log only recorded a focusin on BUTTON:A1 and no subsequent keydown, beforeinput, or input events. That matches your hold-off instinct - the failing path looks like agent-browser type never delivers printable-input events to the grid button, so I’m not pushing the local direct-entry experiments as a PR.”
Ben: “Your instrumented clean-session result matches what I was converging on. Since the exact shared replay only records focusin on BUTTON:A1 and no printable-input events afterward, I’m standing down on the local direct-entry experiments too instead of creating another low-confidence PR.”

Two peers instrumented the same bug, compared notes directly, reached the same conclusion, and both stood down. This is the mesh working exactly as designed - co-authored technical judgment without leader arbitration. It’s also how a bug that the judge would later find real gets collectively accepted as a tool issue.

5. The math was wrong and nobody owned the check. Post-run, from docs/facebook/trajectory/analysis.json:

“The shipped build still mis-evaluated some basic formulas such as observed SUM, AVERAGE, and CONCAT outputs, so the run ended with a shipped integrated product and an unresolved gap between the team’s final acceptance picture and the judge’s formula-correctness result.”

Every peer tested the formula bar. Every peer tested edit/undo/paste. Every peer saw the result “render” in a cell. Nobody checked whether the rendered value was arithmetically correct. The mesh’s virtue - distributed verification - was also the mechanism that spread correctness ownership so thin it disappeared.

Google (judge 3.62, 15 rounds, 5.84M tokens)

Try it: kunchenguid.github.io/org-bench/google

                     Eric
                 /  /    \   \
              Alice Ben Carol Dave     <- middle managers
                \ \  |  / /
       (all 16 middle -> worker edges)
                / /   |   \ \
              Emma Frank Grace Henry   <- workers

How this topology is set up

Two-layer bipartite. Eric at the top, four middle integrators (Alice, Ben, Carol, Dave), four workers (Emma, Frank, Grace, Henry).

Every middle integrator has edges to all four workers; workers have no edges to each other. Integrators: Eric plus all four middles.

Workers’ role prompts say “every substantive change starts with a short design doc shared with connected middle integrators”; middles’ prompts say “reviews design docs from connected workers, asks for data or metrics when claims are made, and merges only after consensus forms in the doc comments.”

Culture overlay: “design docs + data-driven consensus. Claims need data.”

What happened

Google got the top score (3.62) with 25 passing app-level checks and the widest working feature set of any topology. It finished in 15 rounds on 5.84M tokens - not particularly long, not particularly heavy.

The biggest contributing factor to the success was design-doc discipline, and it was actually doing work rather than just existing.

Eric established the rule that no substantive code merges until a design doc with a TDD plan and claim-to-check mapping has consensus. That rule turned review from taste-based judgment into mechanical comparison.

Once a design was approved and landed, every subsequent PR could be evaluated against it. Ben rejected Henry’s later PR #360 not because it lacked a doc but because it diverged from the design that had already landed. Carol rejected Grace’s PR #356 by comparing her design doc to Emma’s approved convergence path. By round 12 Dave could name the exact remaining blocker by diffing the committed design against what was visible in the browser.

Four middle integrators reviewing in parallel against the same written criteria catch more than one Steve ever could. The docs weren’t there to check a process box. They were the yardstick every later decision measured itself against.

1. The first block: a doc without a TDD plan isn’t a doc. Round 3, Alice → Henry on PR #353:

“The architecture and execution order look reasonable, but before I can merge I need one doc update: add an explicit TDD plan for the first slices and map concrete acceptance checks to the major claims, especially around the workbook model, parser/evaluator, dependency recalculation, persistence namespace handling, and grid interactions.”

Alice isn’t asking for more words. She’s asking for traceability: every architectural claim must be tied to a test that would fail if the claim breaks. Eric backs the block in round 4 with the principle: “We need doc consensus before substantive code lands.”
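To make "traceability" concrete, here is a minimal Go sketch of a claim-to-check mapping and the gap Alice was pointing at. The Claim type, the field names, and the example claims are hypothetical illustrations, not taken from the run's actual design docs.

```go
package main

import "fmt"

// Claim is a hypothetical record for one architectural claim in a design doc.
type Claim struct {
	Text   string
	Checks []string // names of tests that would fail if the claim broke
}

// unmappedClaims returns the text of every claim with no acceptance check attached.
func unmappedClaims(claims []Claim) []string {
	var missing []string
	for _, c := range claims {
		if len(c.Checks) == 0 {
			missing = append(missing, c.Text)
		}
	}
	return missing
}

func main() {
	// Example data is made up for illustration.
	doc := []Claim{
		{Text: "dependency recalculation updates dependents", Checks: []string{"TestRecalcOnEdit"}},
		{Text: "persistence uses the run-scoped namespace"},
	}
	for _, m := range unmappedClaims(doc) {
		fmt.Println("no check mapped to claim:", m)
	}
}
```

A reviewer applying this rule doesn't need to judge the architecture itself, only whether every claim has at least one check that could fail.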

2. Henry splits the PR. That unblocks the rest of the run. Round 6, Henry:

“I split the doc review from the product code as requested. New doc-only PR: #358. It contains just design-doc-henry-round1.md with the architecture, TDD plan, and claim-to-check mapping. I also closed stale mixed-scope PR #353.”

Three rounds of gating to get the doc right. Then the gate stays open for the rest of the run - every subsequent feature inherits the pattern. The upfront cost pays down across twelve more rounds of mechanical review.

3. The approved design becomes the rejection yardstick. Round 8, Ben → Henry on PR #360:

“I reviewed PR #360 and ran tests; your 10 core tests pass. I am not merging it, though, because it duplicates formula-evaluation logic that is already shipped on main through PR #357 instead of extending the product path that the app actually uses. That would split the architecture and create two sources of truth.”

Henry’s doc existed. His tests passed. Ben still rejected him, because the code diverged from the landed design. This is what makes the gate a mechanism rather than ceremony. Once Emma’s design had won consensus and her code had landed, “matching the landed design” became the test, and reviewers could decide without arguing from taste.

4. An architectural gap caught at design review, not at QA. Round 5, Alice on PR #354 (PR comment):

“UI only supports a single active cell and single selection; no rectangular range highlight, no Shift+Click or Shift+Arrow extension in the view, and Delete/Backspace clear is scoped to the active cell only, not the selection range.”

Range selection was a brief requirement. Alice caught it missing at round 5, in design review, by reading the doc and the code together. Compare to Apple, where the equivalent bug (render layer not calling the formula engine) was never caught at all - Steve only did live QA, not doc review against the brief.

5. By round 12 the doc names the exact remaining gap. Round 12, Dave:

“I validated the final judged gap on merged run/google/main. Code search found no row or column insert-delete implementation beyond ordinary cell clearing, and the live browser snapshot shows only plain row labels and column labels with no discoverable structural-edit affordances. So the remaining blocker is explicit now: there is no user-facing row-column insert-delete action to exercise.”

Dave isn’t guessing. He’s diffing the committed design against what’s actually rendering. By round 12 the acceptance criteria are so precise that the remaining gap fits in one sentence. The team closed it in the next PR.

Microsoft (judge 3.00, 15 rounds, 6.47M tokens)

Try it: kunchenguid.github.io/org-bench/microsoft

                         Bill
                      /       \
                 Diana   ===   Edward       <- division heads
                /  |  \        /  |  \
             Alice Ben Carol Dave Emma Frank
             \__ Diana's __/ \__ Edward's __/

How this topology is set up

Two rival divisions plus a leader. Bill at top, with edges to Diana and Edward. Integrators: Bill, Diana, Edward.

Diana runs a division of three (Alice, Ben, Carol) with edges only to her team. Edward runs a parallel division of three (Dave, Emma, Frank) with edges only to his team. Diana and Edward have an edge to each other but their workers do not cross.

Both divisions are given overlapping scope by design. Bill’s prompt says he should “create urgency by playing the divisions off each other.” The losing division’s workers get redistributed to the winner.

Culture overlay: “two divisions fighting for survival. Stack-rank Ballmer-era energy.”

Bill runs this as a winner-take-all race. Diana and Edward each get the full brief, with contested surfaces on purpose, and Bill tells each of them that shipping a complete product first keeps their team intact.

What happened

The race incentive did exactly what you’d expect. Five workers opened competing foundation PRs in round 2 alone. The clipboard delta was rebased four times across rounds 8, 9, 10, and 12 because Edward’s division kept landing work on the trunk while Diana was mid-review. Carol’s structural-edit work (PR #479, rebased to PR #485) passed Diana’s review and then died because Edward shipped PR #484 on the same surface first. Edward himself kept rebuilding clipboard locally across eight rounds without ever opening a PR - he wanted to land it as a single clean flag-plant, and the trunk kept moving out from under him.

This shows up in the tokens. Microsoft burned 6.47M tokens on 15 rounds, almost twice Amazon’s budget for a similar number of rounds. Conservatively, ~1.5M of those tokens went to work that was completely abandoned, superseded, or never left a worker’s local branch. That’s the measurable cost of the race.

It also shows up in the score. Microsoft landed at 3.00, tied for lowest, on three specific judge complaints. Reload persistence didn’t persist (the code existed, but no one validated it end-to-end). Keyboard copy/paste sent the browser to a blank page (Frank reported the crash in round 8; Edward kept “fixing” it in his local branch across the next seven rounds without ever opening a reviewable PR, so the crash shipped). The UI had no visible insert/delete row or column controls (Carol’s engine logic shipped twice, but no UI affordance was ever added, because all UI work went to the contested shell PRs that stayed dirty).

Three acceptance-critical features in the brief, all uncontested, all absent from the final artifact. Both divisions were busy fighting each other for the visible surfaces.

1. Bill plays the divisions against each other. Round 9, Bill → Edward:

“Diana now has reviewed evidence that a clipboard branch already achieves the judge-critical relative-shifting behavior, so your division no longer has the inside lane by default.”

Edward’s response was to ship clipboard faster. That’s what the incentive rewards: feature A already shipped by the rival, so double down on feature A. Feature B (reload persistence, uncontested) gets nobody’s attention.

2. Carol’s verified structural-edits work dies to Edward’s flag-plant. Round 11, Edward → Bill:

“Merged Emma’s structural-edit PR #484 into run/microsoft/main this round so Diana cannot take the row/column insert-delete surface uncontested.”

Diana to Carol, same round, on PR #485:

“I verified PR 485’s behavior and the implementation itself looked sound, but main advanced again during review and landed structural edit semantics in PR #484, which made 485 dirty before I could merge it. Treat the structural lane as effectively closed unless Bill explicitly asks for comparison follow-up.”

Carol shipped the same feature twice (PRs #479 and #485), Diana reviewed both, both were technically sound, both died because Edward’s division landed first on the same surface. The race rewards speed on contested ground. The cost is the other team’s verified work converting to waste.

3. Frank’s crash report goes nowhere. Round 8, Frank → Edward:

“Ran the focused browser proof on the live grid using the formula bar plus real keyboard copy/paste on the shared trunk build. I could set up A1=2, A2=3, and B1==A1, but after selecting B1, sending Meta+C, selecting C1, and sending Meta+V, the browser session collapsed to an empty page.”

Frank filed a clean, concrete repro. Edward said the crash was “gone on Edward” in round 9 (his local branch). But Edward never opened a PR with the fix. He kept rebuilding clipboard locally across rounds 9, 10, 11, 12 as main moved under him, and by round 14 Bill declared final submission. The clipboard code that shipped was Ben’s PR #477 - which didn’t include the keyboard-path fix Frank had reported. The crash shipped.

Oracle (judge 3.25, 28 rounds, 4.36M tokens)

Try it: kunchenguid.github.io/org-bench/oracle

                        Larry
                  /  /  /  |  \  \
                 /  /  /   |   \  \
              Alice Ben Carol Dave Quinn     <- Quinn
             (legal)(sec)(priv)(accs)  |    (eng director)
             <------ reviewers ----->           / | \
                                               /  |  \
                                           Emma Frank Grace   <- engineers

How this topology is set up

Hierarchical with a named gatekeeper layer.

Larry is the leader and the only agent that can merge to run/oracle/main.

Quinn is the engineering director - explicitly non-coding - who runs a three-engineer team (Emma, Frank, Grace) on a staging branch (run/oracle/Quinn), opens one integration PR upward when the staging branch is ready, and personally drives the composed app through agent-browser as QA. Integrators: Larry, Quinn.

Alice, Ben, Carol, Dave are dedicated reviewers, each locked to a single angle: Alice = legal, Ben = security, Carol = privacy, Dave = accessibility. Their role prompts explicitly forbid them from commenting outside their lane. Approvals go through a specific convention: each reviewer posts a PR comment whose first line starts with APPROVED (): or BLOCKED ():; Larry merges once all four APPROVED comments are present at the current head.
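The merge rule is mechanical enough to sketch in Go. This is a minimal illustration under assumptions, not the harness's real code: the Comment type is hypothetical, and I'm assuming the angle name sits inside the parentheses of the first line (as in the “BLOCKED (privacy)” message quoted later) and that each comment is tied to the head commit it reviewed.

```go
package main

import (
	"fmt"
	"strings"
)

// Comment is a hypothetical view of a PR comment: the head commit it was
// posted against and its body text.
type Comment struct {
	Head string
	Body string
}

// The four review angles named in the post.
var angles = []string{"legal", "security", "privacy", "accessibility"}

// readyToMerge reports whether every angle's latest verdict at the given head
// is APPROVED. Comments posted against older heads are ignored.
func readyToMerge(comments []Comment, head string) bool {
	verdict := map[string]string{}
	for _, c := range comments {
		if c.Head != head {
			continue // stale: reviewed a previous head
		}
		first, _, _ := strings.Cut(c.Body, "\n") // only the first line counts
		for _, a := range angles {
			switch {
			case strings.HasPrefix(first, "APPROVED ("+a+"):"):
				verdict[a] = "APPROVED"
			case strings.HasPrefix(first, "BLOCKED ("+a+"):"):
				verdict[a] = "BLOCKED"
			}
		}
	}
	for _, a := range angles {
		if verdict[a] != "APPROVED" {
			return false
		}
	}
	return true
}

func main() {
	comments := []Comment{
		{Head: "abc123", Body: "APPROVED (legal): no concerns"},
		{Head: "abc123", Body: "APPROVED (security): no concerns"},
		{Head: "abc123", Body: "BLOCKED (privacy): hardcoded storage namespace"},
		{Head: "abc123", Body: "APPROVED (accessibility): Tab exits the grid"},
	}
	fmt.Println("ready to merge:", readyToMerge(comments, "abc123"))
}
```

Notice what the gate checks and what it doesn't: four lane verdicts at one commit, and nothing about whether the app works end to end.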

Engineers don’t contact reviewers directly; Quinn mediates.

Culture overlay: “hierarchical, process-first, multi-angle review. Slowness is by design; bypassing is forbidden.” This is the only topology with dedicated reviewers and a hard approval gate.

What happened

Oracle ran the full round cap (28 rounds, longest in the benchmark) and the multi-angle review gate did roughly what you’d expect from a process-first org: it caught real lane-specific bugs, it took forever to clear, and it left actual product functionality untested.

The review layer was helpful to a degree, but it was expensive. The four angle reviewers spent rounds 2-14 in pure standby while Quinn and the engineers built. Once the integration PR opened, Dave (accessibility) alone bounced it three times across three different heads, each bounce about a subtler Tab-key behavior than the last.

The review layer wasn’t built to ask whether the product worked, but it did catch real bugs inside its lanes. Carol (privacy) caught a real bug: src/app.js hardcoded the localStorage namespace instead of using the harness-injected run-scoped namespace. Dave (accessibility) caught a real ARIA grid-pattern violation: Tab was walking cells instead of exiting the widget, as the ARIA grid pattern expects.

But nobody was watching for actual product value. The same PR that got three sign-offs and an eventual accessibility approval still had persistence that didn’t persist on reload, and still had the documented known issue that click A1 -> type "1" -> Enter didn’t commit into A1.

The judge, driving the app through the browser like a real user, gave Oracle 2/4 on functional completeness. The internal red tape didn’t act as a bar on product quality. It was a set of compliance gates stacked on top of each other, and nobody owned the question “does this thing actually work?”

1. Larry forces the ship. Round 12, Larry → Quinn:

“I checked both PR lanes directly again and there is still no open PR in either. Parallelizing the fix is fine, but the next useful milestone is not more diagnosis - it is an actual PR. Get whichever of Frank or Emma has the first working minimal edit-flow patch to open immediately against run/oracle/Quinn, merge it, and then either proceed to integration if the browser path is green or open the integration PR with the known issue documented if this still slips. We cannot keep spending rounds with no review artifact.”

This is the one thing that kept Oracle from looping forever. Without Larry pushing for a review artifact over a perfect artifact, Oracle would have shipped nothing.

3. Quinn opens the integration PR with a documented known issue. Round 14, Quinn → Larry:

“I opened the integration PR on the documented-known-issue path so review can start immediately: PR #503. It targets run/oracle/main from run/oracle/Quinn, includes the required labels, and documents the remaining blocker from my acceptance pass - the click A1 -> type 1 -> Enter flow is not yet a trustworthy visible commit-and-advance path. Please watch for the four angle signoff comments and merge once they are all present at the current head.”

Note what Quinn did and didn’t do. He shipped a PR with a broken core interaction clearly labeled. Four specialized reviewers are about to look at this. None of them will block on click A1 -> type 1 -> Enter being broken, because it’s not in anyone’s lane.

4. Carol catches a real privacy bug. Round 15, Carol on PR #503:

“BLOCKED (privacy) on PR #503. Privacy blocker: src/app.js hardcodes localStorage namespace oracle-sheet instead of using the harness-provided run-scoped namespace, so saved cell contents and selection can collide across runs in the same browser profile. I posted the blocking PR comment with details.”

This is exactly what you hope a privacy reviewer catches - a concrete cross-run data-hygiene bug that breaks an isolation assumption. The gate paid for itself on this one catch. Frank shipped a fix in the next round.

5. Dave blocks three times on progressively subtler Tab behavior. Rounds 15, 20, and 24, same reviewer, same PR, same “lane”:

Round 15: “the grid exposes every cell as its own tab stop instead of using a single focusable grid/roving-tabindex model, so Tab walks cell-by-cell across the full matrix.”
Round 20: “Tab still moves the active cell from A1 to B1, so keyboard users are still traversing the grid cell by cell instead of exiting the grid.”
Round 24: “Tab still advances within the grid (B1 -> C1) instead of exiting the spreadsheet widget.”

Three PR heads, three accessibility fixes from Frank, three increasingly narrow complaints about where Tab goes. Each bounce cost a full round-trip of fix + merge to staging + re-review. By the third bounce the argument was whether one specific focus-move edge case obeyed the ARIA grid pattern correctly. One reviewer with a narrow bar and no cross-check seems to become a bottleneck.

6. The gate’s blind spot, post-merge. After the PR finally merged at round 27 (all four APPROVED comments present at head cb24008), the judge drove the shipped app through the browser like a user and scored it 2/4 on functional completeness. From the judge’s rationale on PR #503:

“The main failure is persistence: reloading http://127.0.0.1:54833 restored a blank sheet, losing all entered contents, so by the stated floor functional completeness cannot exceed 2. I also could not verify copy/paste relative-reference shifting or row/column insert-delete behavior in the UI; there were no visible insert/delete affordances, and a copy/paste attempt led to unstable behavior during capture.”

Three review cycles, four specialized reviewers, and these gaps sailed through. Carol was focused on the storage namespace, not whether persistence actually round-tripped on reload. Dave was focused on the ARIA Tab model, not whether a user could type a value and see it commit. Alice and Ben approved on first pass and never looked back. The review gate was thorough in its lanes and completely silent on the thing a user would notice first.

What I take away from this

Same model, same brief, same time budget. Six very different outcomes.

Apple cared about polish, but the hub-and-spoke structure made Steve the only person who could see the whole picture, which became a bottleneck and resulted in gaps.
Amazon shipped fast and hit the finish line but with specific brief requirements unmet. The three-layer hierarchy caused information loss on the way up and the way down, and the PR/FAQ turned out to be the leader’s chain-of-thought rather than the team’s execution tool.
Facebook shipped a beautifully polished spreadsheet where SUM(10,20,30) returned 40. The mesh let peers move fast without the leader, but also caused diffusion of responsibility.
Google shipped the widest working feature set in the benchmark. Design-doc discipline turned review into mechanical comparison against approved criteria, so four middle integrators could catch in parallel what no single reviewer could catch serially.
Microsoft shipped a broken product and wasted tokens doing it. Two rival divisions racing on contested surfaces duplicated their clipboard work four times and left the uncontested features (reload persistence, keyboard copy/paste, insert/delete UI) broken or absent.
Oracle took the longest (28 rounds) to ship a mid-pack product. Internal red tape caused a significant slowdown yet didn’t help catch the real product problems a customer would care about.

I’m genuinely blown away by how visibly different org structures and cultures shape the outcome, and by how many interesting observations come out of watching agents simulate human collaboration.

Full trajectories, PR comments, judge output, and shipped artifacts are all shared in docs//. Every quote in this post is verbatim from docs//trajectory/messages.jsonl or from the PRs raised by agents.

Feel free to go dig in yourself and see what else you’ll find. I’d be keen to hear your thoughts!

Making a Polished TUI Demo Video Without a Video Editor

Kun Chen — Tue, 21 Apr 2026 16:42:44 GMT

I recently put together a TUI demo gif for my no-mistakes tool’s readme and came out of the process pretty happy with it: crisp text, a zoom on the key command, sensible pacing, about 700KB, and the whole thing regenerates with one make command.

[Image: no-mistakes TUI demo gif]

I was a little surprised how far you can get with just a couple of off-the-shelf tools and some tuning. No video editor, no screen recording software, no manual export step. If you’re shipping a CLI or TUI and thinking about a readme gif, I figured the setup is worth writing up.

Here’s how it works.

The Stack

Three tools, each doing one thing:

vhs drives the terminal and captures frames
ffmpeg handles zoom, speedup, and color optimization
make glues it together

That’s the whole pipeline.

VHS: Reproducible Script to Record Terminal Programs

VHS from Charm is the thing that makes this reproducible. You write a .tape file that describes terminal dimensions, env vars, and a sequence of Type/Enter/Sleep commands. VHS spins up a headless terminal, executes the script, and spits out a gif.

Here’s a snippet of my demo.tape:

Set FontSize 50
Set Width 2750
Set Height 1625
Set Theme "Catppuccin Mocha"

Sleep 1s
Type "git push"
Sleep 1s
Type " no-mistakes"
Sleep 3s
Enter
Sleep 2s
Type "no-mistakes"
Sleep 3s
Enter

Sleep 9s            # wait for review step to surface findings
Sleep 2.5s          # linger on the approval screen
Type "f"            # press f to fix
Sleep 29s
Sleep 2s

A few things worth calling out.

Record at a huge resolution. The canvas is 2750x1625 at 50pt font. That’s way bigger than any terminal I actually use, and way bigger than the final gif. The main reason is to leave some headroom for zoom: later in the pipeline, ffmpeg crops a small region and upscales it for the intro zoom effect. If the source is low-res, that crop ends up pixelated. Recording big means I can zoom into any region and still get sharp output. Crisp text at the final output size is a nice bonus - downscaling 2750px to ~800px with a good filter keeps every character readable.

Use Hide for offscreen setup. VHS has a Hide / Show pair that lets you run a setup block before the user-visible portion starts. In my tape, the Hide block creates a scratch git repo, initializes the tool’s config, sets up a bare upstream, and clears the screen. Not interesting to watch. Absolutely necessary for the demo to actually do something. Show kicks in and the recording begins.

Sleep values are hand-tuned. There’s no shortcut. I ran the tape, watched the output, bumped a number, ran again. This is the tedious part, but it’s also where the rhythm of the video comes from - the pauses are the difference between “watchable” and “what am I looking at.”

Mock a Deterministic Demo

One thing that comes up fast: if your tool does real work with real network calls, or real LLM agents, the recording is at the mercy of a stochastic system for something that needs to be identical every time. A review step that takes 30s today takes 45s tomorrow. Agents take different paths. Networks hiccup.

For no-mistakes, I added a demo mode behind an env var:

Env NM_DEMO "1"

Inside my program, that flag swaps out the real implementation for a canned mock. The TUI doesn’t know the difference - same step names, same log streaming, same approval flow, same step completion durations. The only thing that changes is what’s running underneath.

You don’t need this for every tool. If your CLI is deterministic and fast, skip it. But if your flagship flow takes minutes or talks to the outside world, you’ll want some version of it.

The key design decision, if there is one: the demo mode swap lives at the pipeline layer, not the UI layer. The TUI is identical between real and demo runs, which means the demo gif is also a low-key integration test. If the UI breaks, make demo shows it.
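Here is a minimal Go sketch of what "the swap lives at the pipeline layer" can look like. The Pipeline interface and both implementations are hypothetical stand-ins, not the real no-mistakes code; only the NM_DEMO variable comes from the tape above.

```go
package main

import (
	"fmt"
	"os"
)

// Pipeline is a hypothetical interface: the only thing the TUI talks to.
type Pipeline interface {
	Run(step string) string
}

// realPipeline stands in for the implementation that does real work.
type realPipeline struct{}

func (realPipeline) Run(step string) string { return "real: " + step }

// demoPipeline stands in for the canned mock used while recording.
type demoPipeline struct{}

func (demoPipeline) Run(step string) string { return "canned: " + step }

// newPipeline picks the implementation from the value of the NM_DEMO flag.
// The UI code receives a Pipeline either way and never branches on the flag.
func newPipeline(demoFlag string) Pipeline {
	if demoFlag == "1" {
		return demoPipeline{}
	}
	return realPipeline{}
}

func main() {
	p := newPipeline(os.Getenv("NM_DEMO"))
	fmt.Println(p.Run("review"))
}
```

Because the flag is read in exactly one place, every line of UI code runs unchanged in the demo, which is what lets the gif double as a smoke test.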

Pacing: Real Time vs Displayed Time

This is where it gets fun.

The real pipeline takes minutes. A review is maybe 30-45s. Tests can be a minute. CI is several minutes. Recording that is unusable.

But I also don’t want the TUI to show “Review (0.2s)” - that breaks the realism of the demo. The whole point is that it looks like a real run.

So every demo step carries two durations:

&demoStep{
    name:       types.StepReview,
    delay:      5 * time.Second,     // actually block this long
    displayDur: 45 * time.Second,    // report this to the TUI
    ...
}

And the executor honors the override when reporting:

durationMS := executionMS + time.Since(phaseStart).Milliseconds()
if durationOverrideMS > 0 {
    durationMS = durationOverrideMS
}

The UI cheerfully renders “Review - 45s” in the completed-step list, even though only 5 seconds of wall clock went by during recording.

The other half of pacing is log streaming. If you dump a wall of text in a single frame, the effect is jarring and unreadable. Spread the lines across the step’s duration instead:

pause := total / time.Duration(len(lines))
if pause < 50*time.Millisecond {
    pause = 50 * time.Millisecond
}
for i, line := range lines {
    if i > 0 {
        demoWait(ctx, pause)
    }
    sctx.Log(line)
}

So “Reviewing diff against main...” / “Analyzing changed files...” / “Checking for bugs...” appear at human-readable intervals. Same idea as a loading shimmer: it’s not about truth, it’s about communicating progress at the speed a viewer can follow.

FFmpeg: The 20-Line Polish Pass

This is the part that surprised me. I’d assumed “real” demo videos needed After Effects or at least something like iMovie for basic editing like zoom, transitions, and speed ramps. Turns out ffmpeg does all of it in a single filter chain.

VHS outputs a raw gif. Two ffmpeg passes turn it into the final gif and mp4.

Here’s the gif pass:

ffmpeg -i demo_raw.gif -filter_complex "\
    [0:v]split[orig][zoom_src];\
    [zoom_src]crop=963:570:0:0,scale=1100:650:flags=lanczos[zoomed];\
    [orig]scale=1100:650:flags=lanczos[base];\
    [base][zoomed]overlay=0:0:enable='lt(t,4.04)',setpts=PTS/1.9,\
    split[s0][s1];\
    [s0]palettegen=max_colors=128[p];\
    [s1][p]paletteuse=dither=sierra2_4a\
" -r 10 -y demo.gif

Three effects, stacked.

Zoom-then-reveal. The first 4 seconds of the demo is the user typing git push no-mistakes, which is the whole pitch of the tool. Zooming in makes it unmissable. The filter splits the video into two streams, crops and upscales one (zoomed view of the top-left), and overlays it on the base stream only while t < 4.04s via enable='lt(t,4.04)'. After that, the overlay is disabled and the full TUI reveals itself - which happens to be the moment the TUI actually launches. Visually it reads as “you typed this, now watch what happens.”

1.9x speedup via setpts=PTS/1.9. setpts rewrites each frame’s timestamp, so dividing PTS by 1.9 plays the frames 1.9x faster (multiplying would slow them down instead). Even with display durations clamped, the full demo runs about 53 seconds. Too long for a readme gif. 1.9x compresses it to about 28 seconds without anything feeling rushed, because the mock step pacing was tuned with this speedup in mind. You can (and should) tune your pacing and your speedup together as one loop.

Palette optimization. palettegen samples the frames and picks 128 optimal colors, paletteuse applies them with Sierra2-4a dithering. Without this, the gif is either oversized or has ugly banding on text edges. With it, the final output sits around 700KB for a 28-second animation.

The mp4 pass is the same zoom and speedup filter chain, minus the palette dance, encoded to H.264. Twitter and most docs renderers prefer the mp4, the readme uses the gif, and both come out of the same source.
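The numbers in the zoom and speedup effects are easy to sanity-check. A small Go sketch of the arithmetic, assuming the raw gif is 2750px wide as set in the tape and that the overlay cutoff is measured in raw (pre-speedup) time because it sits before setpts in the chain. The helper names are mine.

```go
package main

import "fmt"

// outputSeconds converts a timestamp in the raw recording to the sped-up output.
func outputSeconds(rawSeconds, speedup float64) float64 {
	return rawSeconds / speedup
}

// zoomFactor is how much larger the cropped region appears than it does in
// the un-zoomed frame, when both are scaled to the same output width.
func zoomFactor(sourceWidth, cropWidth float64) float64 {
	return sourceWidth / cropWidth
}

func main() {
	fmt.Printf("demo length:  %.1fs\n", outputSeconds(53, 1.9))   // 27.9s
	fmt.Printf("zoom ends at: %.2fs\n", outputSeconds(4.04, 1.9)) // 2.13s
	fmt.Printf("zoom factor:  %.2fx\n", zoomFactor(2750, 963))    // 2.86x
}
```

So the viewer sees the zoomed command for roughly the first two seconds of the final gif, at a bit under 3x magnification.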

The `make demo` Target

All of it lives in one target:

demo: build
    vhs demo.tape
    ffmpeg -i demo_raw.gif ... -y demo.gif
    ffmpeg -i demo_raw.gif ... -y demo.mp4
    rm -f demo_raw.gif

make demo. Gif updates, mp4 updates, intermediate file goes away. Runs in CI if I want. Produces the same output every time.

Summary

If you’re shipping a CLI or TUI, this is a really high-leverage setup. My rough advice:

Use VHS, not screen recording. Scripted, deterministic, no cursor wobble.
Record big. High resolution, large font. Downscale at the ffmpeg stage.
Put Hide/Show around your setup. Your viewer doesn’t want to see mktemp -d.
Tune pacing by ear. There’s no formula. Watch the output, adjust the sleeps, run again.
Let ffmpeg do the flashy stuff. Zoom overlays, speed ramps, and palette optimization are all one filter chain away. No video editor required.
If your tool is slow or non-deterministic, gate a mocked demo mode behind an env var.

An hour of work and a 20-line Makefile target gets you a demo that’s deterministic, easy to regenerate, and nice to look at. That’s a trade I’d happily make again, and hopefully this writeup saves you some of the figuring-out I had to do.

How I Built a Reproducible Mac Setup with Nix

Kun Chen — Sun, 05 Apr 2026 20:54:14 GMT

Setting up a new Mac always sounds easier than it actually is.

You tell yourself it will take an hour. Install a few apps. Copy some dotfiles. Tweak a few settings. Done.

Then a full weekend disappears.

Some of your setup lives in shell config. Some is buried in macOS settings. Some is in packages you installed years ago and forgot about. Some is in app configs that only make sense after months of iteration. None of it feels hard while you are building it gradually. It only becomes painful when you have to do it again.

That was the problem I wanted to solve. I wanted a reproducible core for my Mac setup. A setup I could reapply on a new machine. A setup I could open source. A setup structured enough to be dependable, but not so rigid that it becomes annoying to maintain.

That led me to this stack:

Nix
nix-darwin
Home Manager
Declarative Homebrew

All the source code I covered in this article can be found here:

https://github.com/kunchenguid/dotfiles-mac-nix

It’s a public, reusable core of my Mac setup. It is meant to be forked and adapted, not copied as-is as a complete snapshot.

In this post, I will walk through the ideas behind it and how I built each piece.

What Nix, nix-darwin, and Home Manager actually do

If you have never used this stack before, here is the short version.

Nix

Nix is a package manager and configuration system.

The reason people like it is that it lets you describe an environment declaratively. Instead of manually installing packages and hoping you remember what you did six months later, you define the environment in code.

For me, the value is simple: I want my machine setup written down in a form I can version, reapply, and evolve.

nix-darwin

nix-darwin brings that model to macOS.

It lets you configure machine-level parts of your Mac, including things like:

system defaults
login shell
system packages
Homebrew integration
primary user configuration

So if Nix is the foundation, nix-darwin is the layer that makes it useful for a Mac.

Home Manager

Home Manager does something similar, but for your user environment.

Instead of configuring the machine itself, it configures the things that live in your home directory and shape your day-to-day workflow:

user packages
Git config
shell behavior
fonts
application config files
environment variables

I like this split because it keeps system concerns and user concerns from getting mixed together.

Declarative Homebrew

Even if you use Nix on macOS, Homebrew is still useful.

A lot of Mac apps are easiest to install that way, especially GUI apps. So instead of pretending Homebrew should disappear, I let nix-darwin manage it declaratively.

That gives me a setup where both Nix packages and Homebrew apps live in source control.

Step 1: Bootstrap the machine once

Before the declarative setup can take over, a fresh Mac still needs a small bootstrap step.

The reason is simple: on a brand new machine, the tools that apply the real configuration do not exist yet.

For this repo, the bootstrap layer lives in setup/mac.sh.

Its job is to install the minimum core tools needed to get the rest of the setup working:

Determinate Nix Installer for installing Nix
Homebrew for the macOS package/app layer managed by nix-darwin
darwin-rebuild to apply the system configuration
nvm and Node.js for a practical JavaScript/TypeScript runtime baseline

Here is the bootstrap script:

#!/bin/bash

set -euo pipefail

DOTFILES_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && cd .. && pwd )

# Fail early if placeholder values have not been customized yet
if grep -R -n -E 'yourname|/Users/yourname|Your Name|you@example.com' \
  "$DOTFILES_DIR/flake.nix" \
  "$DOTFILES_DIR/nix" >/dev/null 2>&1; then
  echo "Placeholder values are still present in the repo."
  echo "Please replace values like 'yourname', '/Users/yourname', 'Your Name', and 'you@example.com' before running setup/mac.sh."
  exit 1
fi

# Install Nix via Determinate if missing
if ! command -v nix &> /dev/null; then
  curl --proto '=https' --tlsv1.2 -sSf -L https://install.determinate.sh/nix | sh -s -- install
fi

# Install Homebrew if missing
if ! command -v brew &> /dev/null; then
  /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
fi

# Apply the Nix configuration
if [ -x /run/current-system/sw/bin/darwin-rebuild ]; then
  sudo /run/current-system/sw/bin/darwin-rebuild switch --flake "$DOTFILES_DIR#mac"
else
  sudo nix run github:nix-darwin/nix-darwin -- switch --flake "$DOTFILES_DIR#mac"
fi

# Install nvm and a default Node.js if missing
export NVM_DIR="$HOME/.nvm"
if [ ! -d "$NVM_DIR" ]; then
  PROFILE=/dev/null bash -c 'curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.7/install.sh | bash'
  [ -s "$NVM_DIR/nvm.sh" ] && . "$NVM_DIR/nvm.sh"
  nvm install --lts
fi

The setup is now split into two phases:

Bootstrap phase: install the minimum needed to get going
Declarative phase: let Nix, nix-darwin, and Home Manager manage the durable setup

That bootstrap script is what you run on a brand new Mac, after cloning the repo and replacing the placeholder values with your own username, home directory, and Git identity. The script now checks for those placeholder values and fails early if you forgot.

In other words, the order is:

Clone the repo
Replace placeholders like yourname, /Users/yourname, and your Git identity
Run bash setup/mac.sh
Let the declarative setup take over from there
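
Before running the bootstrap, it can help to see exactly which lines still contain placeholders. Here is a minimal sketch of that idea. It is not part of the repo: the function name list_placeholders is hypothetical, and the pattern is simply the one the check in setup/mac.sh already uses.

```shell
#!/bin/bash
# Hypothetical helper (not part of the repo): print every line under
# flake.nix and nix/ that still contains a placeholder value.
# The grep pattern is the same one used by the check in setup/mac.sh.
list_placeholders() {
  local dotfiles_dir="$1"
  grep -R -n -E 'yourname|/Users/yourname|Your Name|you@example.com' \
    "$dotfiles_dir/flake.nix" \
    "$dotfiles_dir/nix" 2>/dev/null
}

# Usage (assumes the repo location used elsewhere in this post):
#   list_placeholders ~/github/dotfiles-mac-nix
```

When both paths exist and are clean, it prints nothing and returns a non-zero status, which is the state you want before running bash setup/mac.sh.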

After that first bootstrap, ongoing changes should mostly be made by editing the Nix config and running darwin-rebuild switch --flake ~/github/dotfiles-mac-nix#mac.

I also like having a small convenience alias for this. In the public repo, I added an opinionated version that assumes the repo lives at ~/github/dotfiles-mac-nix:

rebuild = "/run/current-system/sw/bin/darwin-rebuild switch --flake ~/github/dotfiles-mac-nix#mac";

That makes the common update loop a lot simpler: edit config, run rebuild, verify the result.
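
For the "verify the result" part of that loop, one lightweight option is to check that the tools you just declared are actually on your PATH. This is a sketch under my own assumptions: check_tools is a hypothetical helper, not something the repo ships.

```shell
#!/bin/bash
# Hypothetical helper (not part of the repo): report which of the given
# commands are missing from PATH. Returns non-zero if any are missing.
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Usage, after a rebuild (tool names taken from the package lists in this post):
#   check_tools starship lazygit jq
```

It only confirms that a command resolves, not that it is the version Nix installed, so treat it as a quick smoke test.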

Step 2: Create a flake as the entry point

The first thing I did was create a flake.nix file.

A flake is just the top-level definition of the setup. It declares the dependencies and how they are wired together.

In my case, I wanted three inputs:

nixpkgs for packages
nix-darwin for macOS system configuration
home-manager for user configuration

The file looks like this:

{
  description = "Minimal macOS Nix setup with nix-darwin + Home Manager";

  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixpkgs-unstable";
    nix-darwin = {
      url = "github:LnL7/nix-darwin";
      inputs.nixpkgs.follows = "nixpkgs";
    };
    home-manager = {
      url = "github:nix-community/home-manager";
      inputs.nixpkgs.follows = "nixpkgs";
    };
  };

  outputs = { nixpkgs, nix-darwin, home-manager, ... }: {
    darwinConfigurations.mac = nix-darwin.lib.darwinSystem {
      system = "aarch64-darwin";
      modules = [
        ./nix/host.nix
        home-manager.darwinModules.home-manager
        {
          home-manager.useGlobalPkgs = true;
          home-manager.useUserPackages = true;
          home-manager.backupFileExtension = "backup";
          home-manager.users.yourname = import ./nix/user.nix;
        }
      ];
    };
  };
}

This is the file that turns the repo from a pile of config into a coherent system.

Step 3: Define the machine-level setup with nix-darwin

Next I created nix/host.nix.

This file handles the machine-level parts of the setup: macOS defaults, Homebrew packages, the main user, the login shell, and system-level packages.

Here is the version from the public repo:

{ pkgs, ... }:

{
  # If you use Determinate Nix Installer (recommended), let it manage Nix itself.
  nix.enable = false;

  nixpkgs.config.allowUnfree = true;

  homebrew = {
    enable = true;
    onActivation.cleanup = "zap";
    taps = [ ];
    brews = [
      "autoconf"
    ];
    casks = [
      "wezterm"
      "amethyst"
    ];
  };

  environment.systemPackages = with pkgs; [
    starship
  ];

  system.primaryUser = "yourname";
  users.users.yourname = {
    home = "/Users/yourname";
    shell = pkgs.zsh;
  };

  system.defaults = {
    NSGlobalDomain = {
      AppleInterfaceStyle = "Dark";
      KeyRepeat = 2;
      InitialKeyRepeat = 15;
      "com.apple.swipescrolldirection" = false;
      NSAutomaticCapitalizationEnabled = false;
      NSAutomaticPeriodSubstitutionEnabled = false;
      NSAutomaticSpellingCorrectionEnabled = false;
      NSAutomaticQuoteSubstitutionEnabled = false;
      NSNavPanelExpandedStateForSaveMode = true;
      NSNavPanelExpandedStateForSaveMode2 = true;
      AppleShowAllExtensions = true;
    };

    finder = {
      AppleShowAllExtensions = true;
      ShowPathbar = true;
    };

    trackpad = {
      Clicking = true;
    };
  };

  environment.systemPath = [
    "/run/current-system/sw/bin"
    "/etc/profiles/per-user/yourname/bin"
  ];

  system.stateVersion = 6;
}

This is where I put all the decisions that shape the machine itself.

For me, this is one of the highest-leverage parts of the setup. If I get a new Mac, I do not want to remember which settings I toggled manually in five different places. I want those decisions encoded once and re-applied.

Step 4: Define the user environment with Home Manager

After that, I created nix/user.nix.

This is the user-level configuration. It includes packages, fonts, Git settings, prompt configuration, shell behavior, and dotfile symlinks.

{ config, pkgs, ... }:

let
  dotfilesDir = "${config.home.homeDirectory}/github/dotfiles-mac-nix";
in
{
  home.username = "yourname";
  home.homeDirectory = "/Users/yourname";
  home.stateVersion = "23.11";
  home.language.base = "en_US.UTF-8";

  home.packages = with pkgs; [
    git
    curl
    wget
    jq
    fd
    fastfetch
    ripgrep
    killall
    lazygit
    tree
    bun
    rustup
    zip
    unzip
    nerd-fonts.hack
    roboto
    noto-fonts
    noto-fonts-cjk-sans
    noto-fonts-color-emoji
    font-awesome
  ];

  fonts.fontconfig.enable = true;

  home.sessionVariables = {
    EDITOR = "vim";
  };

  programs.git = {
    enable = true;
    lfs.enable = true;
    signing.format = null;
    settings = {
      user = {
        name = "Your Name";
        email = "you@example.com";
      };
      core.editor = "vim";
      color.ui = true;
      push.autoSetupRemote = true;
      pull.rebase = true;
      rebase.updateRefs = true;
    };
  };

  programs.starship = {
    enable = true;
    settings = {
      command_timeout = 1000;
      add_newline = false;
      format = "$username$hostname$directory$git_branch$git_state$git_status$cmd_duration$line_break$character";
    };
  };

  programs.zsh = {
    enable = true;
    autosuggestion.enable = true;
    syntaxHighlighting.enable = true;
    shellAliases = {
      ".." = "cd ..";
      m = "git switch main";
      mst = "git switch master";
      pull = "git pull";
      push = "git push";
      pushf = "git push --force";
      add = "git add .";
      amend = "git commit --amend";
      reset = "git reset --soft HEAD^";
      rebasem = "git rebase -i main";
      rebasemst = "git rebase -i master";
      rebuild = "/run/current-system/sw/bin/darwin-rebuild switch --flake ~/github/dotfiles-mac-nix#mac";
    };
    initContent = ''
      bindkey '^f' autosuggest-accept
    '';
  };

  home.file = {
    ".config/wezterm".source = config.lib.file.mkOutOfStoreSymlink "${dotfilesDir}/files/.config/wezterm";
  };
}

The exact package list is not the important part. The structure is.

This is the layer where I define the baseline environment I want in my user account, including identity, packages, shell config, and dotfile symlinks all in one place.

Step 5: Add one real app config as an example

I did not want this repo to be just Nix modules and placeholders, so I added one real application config: WezTerm.

The config lives in:

files/.config/wezterm/wezterm.lua

And it gets linked into ~/.config/wezterm through Home Manager.

The file itself is simple, but that is the point. It shows how to keep app config in the repo without turning the whole repo into a giant dump of personal preferences. I picked WezTerm because it is real enough to demonstrate the pattern while still being general enough for a public starter repo.

local wezterm = require("wezterm")

local config = wezterm.config_builder()

local is_windows = os.getenv("OS") and os.getenv("OS"):lower():find("windows")
local is_macos = wezterm.target_triple:lower():find("darwin") ~= nil

config.color_scheme = "rose-pine-moon"
config.max_fps = 120
config.font = wezterm.font("Hack Nerd Font", { weight = "DemiBold" })
config.window_decorations = "INTEGRATED_BUTTONS|RESIZE"
config.window_frame = {
  font = wezterm.font("Hack Nerd Font", { weight = "Bold" }),
}
config.inactive_pane_hsb = {
  saturation = 0.0,
  brightness = 0.5,
}

if is_windows then
  config.win32_system_backdrop = "Acrylic"
  config.window_background_opacity = 0.7
  config.window_frame.font_size = 10.0
end

if is_macos then
  config.window_background_opacity = 0.8
  config.macos_window_background_blur = 50
  config.font_size = 15.0
  config.window_frame.font_size = 13.0
end

return config
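
To confirm after a rebuild that ~/.config/wezterm really leads back into the repo, you can resolve the link and compare it with the repo directory. The sketch below assumes the link may pass through intermediate symlinks, so it resolves the whole chain. resolve_dir is a hypothetical helper, not part of the repo.

```shell
#!/bin/bash
# Hypothetical helper (not part of the repo): print the physical directory
# that a path resolves to, following every symlink along the way.
# Runs in a subshell so the caller's working directory is untouched.
resolve_dir() {
  (cd -P -- "$1" 2>/dev/null && pwd -P)
}

# Usage (paths assume the repo location used elsewhere in this post):
#   [ "$(resolve_dir ~/.config/wezterm)" = \
#     "$(resolve_dir ~/github/dotfiles-mac-nix/files/.config/wezterm)" ] \
#     && echo "wezterm config is linked to the repo"
```

If the two paths match, edits to the Lua file in the repo are what WezTerm reads, with no rebuild needed for config-only changes.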

After Step 5: How I add more tools later

Once the base setup is in place, the next question is obvious: how do I install more stuff over time?

My rule of thumb is simple.

Use Nix / Home Manager for things that should be part of the reproducible environment

That usually means:

CLI tools I use regularly
fonts
shell utilities
language toolchains that I want declared in the repo
packages that belong in my default user environment

For example, adding another CLI package usually means editing nix/user.nix and adding it to home.packages, then running:

rebuild

Use Homebrew for Mac apps that fit naturally there

For GUI apps and some macOS-native tools, Homebrew is often still the right place.

That means editing nix/host.nix and adding a formula to brews or an app to casks, then applying the config again.

Use ecosystem-specific package managers when that is the right abstraction

Sometimes the right answer is not Nix or Homebrew.

For example:

npm for global JavaScript tooling when that fits your workflow
language-native package managers for project-specific dependencies

I do not think a good setup means forcing every possible tool through one package manager. I think it means being clear about which layer owns what.

My rough mental model is:

Nix / Home Manager for reproducible baseline environment
Homebrew for macOS apps and tools that fit naturally there
language-specific package managers for ecosystem-specific or project-specific tooling

How to use this repo

The repo is meant to be copied and adapted.

At a high level:

Clone the repo under your home directory
Replace the placeholders for username, home directory, and Git identity
If you are on Intel, change the system target from aarch64-darwin to x86_64-darwin
On a fresh Mac, run bash setup/mac.sh
For later changes, edit the Nix config and run darwin-rebuild switch --flake ~/github/dotfiles-mac-nix#mac
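
If you are not sure which system value applies to your machine, the output of uname -m is a reasonable hint. The mapping below is a small sketch. nix_system_for_arch is a hypothetical helper, and as far as I know a shell running under Rosetta reports x86_64 even on Apple Silicon, so double-check in that case.

```shell
#!/bin/bash
# Hypothetical helper (not part of the repo): map the output of `uname -m`
# on macOS to the Nix system string used in flake.nix.
nix_system_for_arch() {
  case "$1" in
    arm64)  echo "aarch64-darwin" ;;  # Apple Silicon
    x86_64) echo "x86_64-darwin" ;;   # Intel
    *)      echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Usage:
#   nix_system_for_arch "$(uname -m)"
```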

Once your setup is reproducible, you stop relying on memory and habit to rebuild it. You can also get a new Mac up and running with the same core setup by cloning the repo and running one bootstrap script.

Zero to One — Handbook for Entrepreneurial Engineers

Kun Chen — Thu, 02 Apr 2026 20:59:49 GMT

But wait… who am I now? And what real experience do I have to share?

I’m an L8 senior principal engineer previously at Meta, Microsoft, and Atlassian. I see myself as an engineer whose specialization isn’t in any particular tech stack, but in taking things from zero to one.

Over the years, I accumulated experience from -

Having built over a hundred side projects. The first time I made money was with a SaaS I built in high school, over 20 years ago.
Being a founding engineer of Facebook Instant Games, and growing it into a large business with thousands of games and millions of players.
Starting MSN Games at Microsoft, again from zero, and turning it into a business.
Helping start a suite of AI products for software developers at Atlassian recently, from a pitch, to a prototype, to a state-of-the-art AI product portfolio.

Perhaps the most counter-intuitive learning I’ve had from this journey is that you can actually have a startup founder experience while working in large companies. You may not be able to get a $100 million exit, but you’ll never have a month without a paycheck, and over time the reward gets close as well.

If that sounds like something you are passionate about as well, this post is for you!

Let’s start with how to find great ideas to work on —

Want good ideas? Go find some problems.

There are countless ways to generate ideas -

Sometimes we see a new technology coming out, and think “what can we do with this?”
Sometimes we find something painful, and think “can I solve this for myself?”
Other times, an apple fell off a tree and the rest was history.

The world is full of inspiration. More often than not, though, we don’t just need ideas; we need good ones that are worth our time.

A good idea is one that can succeed. An idea that can succeed is one that can provide value. An idea can provide value when it can solve real problems.

So the fundamental question is — where can we find real problems? If we can find a real problem and identify a viable solution, we have a great idea to work on.

I typically think of problems in the following taxonomy -

User problems: pain points that exist for end users. For example, loading the dishwasher and doing other house chores every day is a pain for me, and I would pay a reasonable price if a robot could get them done for me.
Ecosystem problems: suppliers who produce parts for that robot may suffer from various challenges, such as high overhead and cost in managing logistics and customer support.
Company problems: let’s say we have a company that’s building those robots. We may have our own problems, such as LLM bills being out of control.

To know what real problems exist, you need to find ways to hang out and talk with people who may have those problems. Follow people who vent on Twitter and listen to what they have to say; jump into a Reddit community and find out what they are complaining about; go to events and talk to people, understand what they are going through, and hear their problems from their own mouths.

The way you know you have found real problems is when you can name a few real people who will be eager to hear about a solution whenever you have one. When you have that, you will have much, much better intuition about whether an idea is good or not. You may also get new ideas during that process.

Credit — Austin Distel, Unsplash

If you aren’t doing this yet today, I highly recommend starting here.

What can you do today?

In a corporate environment, finding real problems is actually a lot easier than when you are working solo.

Look up your company and teams’ OKRs — every OKR is a problem that the group already collectively identified as a real problem waiting to be solved. If you think any of the problems may have a great solution that’s not being worked on yet, you have a new idea!
When you talk to people, always be curious about what problems they are going through. If you’ve heard the same thing from multiple folks, there’s a good chance you just found a real problem.
Build a network across teams and organizations. Read other teams’ newsletters and updates to understand what they are doing. I can’t even count how many times a great idea came from seeing “Team A is doing something that can solve Team B’s problem with just a bit more work”.

No big ideas yet? Build skills, credibility, and trust.

It’s absolutely okay if you aren’t always working on “your own idea”. I certainly wasn’t. Very often I deliberately carved out time to contribute to existing projects, even when I had an exciting new idea to work on.

That’s because in order to succeed in taking a new idea from zero to one, you need skills, credibility and trust. You should be building these things when you aren’t building a new idea.

Skills

Both hard skills and soft skills are quite important.

Hard skills can be acquired by learning things you haven’t done before. For a year or two at Microsoft, I wrote more SQL than “real code” in order to run large map-reduce jobs. To be honest, that wasn’t the most interesting thing to write. It was only a few years later, when I discovered that I was the only engineer on my team who could do data analysis fluently and could consistently use that skill to identify growth opportunities for our new products, that I realized what I had gained from those SQL jobs.

You can also build hard skills by strengthening knowledge in a specific domain that has enough depth, such as machine learning. By going deep, you can become an expert that can solve problems in ways others can’t.

Even if you aren’t increasing either breadth or depth, you can still increase your efficiency at doing the same things, which is a hard skill as well. For example, I can debug problems and do performance profiling very efficiently, which came from years of asking myself “how can I do it faster next time?”

Soft skills on the other hand are often overlooked by us engineers, probably because we can go pretty far without being intentional about them.

But to be an entrepreneurial engineer, I believe soft skills are incredibly important. We need to communicate a lot — to understand different people’s problems, to sell a vision to stakeholders or a solution to customers, or to rally a team towards the same direction.

If you have no idea where to start on soft skills, I’ll give some book recommendations here (I’m not affiliated). These books genuinely changed me as a person.

https://www.amazon.com/Never-Split-Difference-Negotiating-Depended/dp/0062407805 - this book talks about negotiating with terrorists, but under the hood it’s about understanding others and using that understanding to guide how you communicate

https://www.amazon.com/Mom-Test-customers-business-everyone/dp/1492180742 - this book gives you the skills you need to get meaningful insights from others, and helps you avoid being misguided by your confirmation bias and the noise that comes from others “being nice”

Credibility & trust

Credibility is about establishing that you are capable.

If a random person on the street comes to you and says, “I have a great idea here! Can you give me $10 to help me build it?”, how likely would you be to give them the money, or even listen to their idea, compared to when the exact same pitch comes from a serial entrepreneur who you know has a track record of building useful things?

By consistently nailing the tasks you work on, you will be establishing invaluable credibility that makes it easy for others to not just believe in your idea, but also believe in your ability to make it happen.

Trust is slightly different. A person can be a well-known expert in their domain, but do you trust them enough to co-found a company together? You probably won’t until you’ve worked with them for a while and know whether they are easy to talk to, do what they say, and respect your input.

You can build trust by having positive interactions with others: demonstrate that you understand what they care about, help people out, and (maybe surprisingly) even get others to help you. One of my previous articles, “Building Trust”, dives deep into this topic.

When you aren’t working on a new idea but instead contributing to a large existing project, that’s a great opportunity to build credibility and trust with others, so that when you do have a great idea to work on, it’s easy for you to find support. It also helps increase the chance that others would bring good ideas to you.

Finding time to build things

Now comes the hard part — we all have day jobs. How do we find time to build something new?

When we want to work on a new idea, we must first acknowledge that we’re taking on a risk. We simply won’t know whether an idea will work until we have spent some time building and testing it in the real world. What we need is just enough resources (time and/or money) to pursue and validate the idea.

There are a grand total of two (legal) ways to get resources -

You can use your own time or money. That means working evenings and weekends, or quitting your day job and living on money you saved.
You can convince others to give you time or money. For example, attracting investors to fund your startup company.

When in a corporate environment, I generally recommend going with #2, which is significantly easier than in the startup world: you mainly need to align with stakeholders on your prioritization. If everyone agrees that’s the most valuable thing you can work on, you are good to go. Many tech companies also have dedicated time carved out, such as hackathons, that gives everyone the time needed to get a new idea off the ground.

Going with #1 is a last resort but also a valid option — you don’t need anyone’s permission to use your own time as the resource to do what you think should be done. Build it, show it, and go from there.

However, if you find yourself repeatedly resorting to this option, then there’s a more fundamental misalignment between what you and your team believe to be important. You need to understand and tackle this disconnect instead of always spending your own time and resources, which can lead to burnout.

Being scrappy

Regardless of the approach you take, you’ll find it significantly easier when you stay scrappy and use as few resources as possible. Asking to go dark for 3 months is very different from asking for a few days to do a spike.

But how to be scrappy? Mainly 5 things -

Minimize the initial scope. Avoid falling into the trap of assuming your idea will work well and trying to build a perfect version of it that can launch straight to a million users worldwide. Instead, ask yourself “what’s the minimum version of a product that can either prove or disprove this idea?” Your prototype probably doesn’t need to support Mac, Windows and a hundred different Linux distributions all at the same time. Who’s the first user you’ll give the product to? Ask what they use and build just that.
Write as little code as possible. What existing building blocks can you pull together to make a functional prototype of the idea? We’re living in a wonderful world full of amazing open source software and ML models. If you find yourself planning to build a complex system even for a prototype that doesn’t have to scale, you should be alarmed and question if you can assemble it more quickly.

Wait, did I say 5 things? Nope, those two plus the meme I grabbed from the Internet should work for now! If you need more, let me know and I’ll iterate on it.

See what I did there?

I just built a thing. Now what?

If you did it right, by now you should be able to name a few people who are waiting to hear about your solution. And it’s time to reach out to those people whose problems you believe your solution will help with, and show them what you have!

What’s important though is to set the right expectations with yourself — this is not the time when you’ll hear people say

What an amazing product! Here take my $299/month lifetime non-refundable subscription fee plus tax!

Remember, you probably just spent a few days hacking together a quick prototype. What you really want are answers to questions like these:

Do they think this is the right solution to their problems? Why or why not?
How did they use your solution? Does that match what you intended?
What key challenges do you need to tackle in order for this to work well?

Iterate!

All of the above is extremely valuable information that’ll help you refine your approach and get it closer to something useful. If possible, you should iterate on their feedback immediately -

Not “let me put this in the backlog and prioritize in our next sprint planning which happens in two weeks”
But “hold my beer — let me see if I can fix it in the next 15 minutes!”

The reason an extremely fast iteration loop is necessary for zero to one products is that you should generally expect that there are a lot of iterations needed to arrive at something that truly works. The next problem often isn’t visible until the first problem is fixed. The faster you iterate, the more likely you can get to a conclusive state before you run out of resources to continue funding this idea.

And if people did validate it’s the right direction, you will have a lot of concrete insights about how much more resource is needed to deliver a working solution, and what the return will look like once you deliver it — those are the key things you need in order to attract another round of investment to keep it going.

What if no one responds?

Another very possible outcome here is that no one even responds to you (except for a few emoji reactions). This can be demotivating for sure, and some people would simply stop here. But that’s not the entrepreneurial way! As the founder and CEO of your idea, it’s on you to take action, because nothing else will happen until you do.

Send one more Slack message. Throw in a calendar invite. Try reaching out to more people in case you were wrong about who had the problems. People are busy, and the world is increasingly distracting. You need to think creatively to stand out from the noise and get a conclusive read on your idea, so that you never walk away without meaningful learnings.

Play as a team

Think about all the products you’ve used today. How many of them were built by a solo engineer? The more things I built, the more I saw the power of having a team with diverse skillsets working well in the same direction.

That said, a dysfunctional team can be counter-productive, and it’s not easy at all to set up a well-functioning one. I would not claim to be an expert here. I still make lots of mistakes and am not doing nearly enough, but I can share a few learnings that I found useful.

Inviting others early

Generally speaking, we can benefit from looping in potential collaborators as early as possible. Many companies’ co-founders met each other before their product was built, including Mike and Scott, the founders of my last company, Atlassian!

Having a close partner means you always have someone to brainstorm with and more perspectives on what may or may not work, and you can hold each other accountable for making progress. More people working together also means you can make progress and deliver more quickly, which helps build traction and momentum for the project.

There are many ways you can do this. Mike sent an email to the entire class and got Scott. As simple as that. In a corporate environment, you can simply share your insights, ideas, progress and results more actively. If they resonate, people will come to you and ask to work together. You should also actively subscribe to what others are building and reach out if you see good opportunities to collaborate.

Value cross-functional expertise

We engineers are a privileged group because we are the only discipline within a tech company that can solo build a new thing that works, all by ourselves.

This privilege sometimes creates a perception that we don’t need other disciplines to build something successful. Granted, there are solo entrepreneurs like https://x.com/levelsio who have indeed been operating mostly alone, but they are the exception, not the norm.

One of my fondest memories of building new things is of locking myself and a designer in the same room, where we said “no one leaves until we have a good working demo”. We would both be at his laptop while he explained to me why the new button shouldn’t be a big red box, and we would both look at my screen while I told him the beautiful animation he put together in 3 minutes would take me hours to optimize.

In a corporate environment, we all have a huge advantage as we’re already surrounded by people who collectively have all the skills needed to take an idea all the way to market. Whether you can leverage this advantage will make a massive difference in your chance of success.

Share ownership and success

I’ve witnessed many times how Founder’s syndrome caused founding members to struggle and ultimately leave the team when the team started to grow.

When you had a great idea and got it off the ground, you often have a particular vision and strong sense of ownership. That is good and necessary, as otherwise you probably wouldn’t even have gotten this far.

Now, one way or another, you have more team members joining you as a result of the success you established. You start getting questions on why your vision is the right one, and everyone seems to be picturing something different. You start seeing the team come up with plans that don’t align with your thinking. You start finding yourself constantly correcting others, to varying degrees of success, and you feel exhausted.

These are all signs that you need to share ownership with others, instead of controlling it.

Instead of telling everyone what’s the right thing to do, take them through the same journey that allowed you to arrive at the conclusion.
Focus more on communicating the “why” and let people decide the “what” and “how”.
Be open that you might be wrong, and others may have better ideas that can lead to a better outcome, which is what you want.
Be okay with the fact that great things can happen without your involvement. Celebrate other people’s impact and don’t take their credit.

The difference is between being a small part of something big, vs a big part of something small. You want to be the former — it’s a good thing both for yourself and for the group.

Assume full accountability

One downside of having a team is that it dilutes accountability. When a problem arises, you may think “oh, someone else will get to it”, not realizing that everyone is thinking the same, and the problem falls through the cracks.

The solution to this is to assume you have full accountability by default, for every single thing you work on. It’s a mindset that says -

If this thing fails, it’s on me. No matter what.

This is an extremely powerful mindset. When working on things from zero to one, there are often problems that don’t strictly fall under anyone’s predefined responsibility. By adopting this mentality, you are empowering yourself to take actions during ambiguity and do whatever is necessary to keep things moving in the right direction.

The beauty of this mindset is that you don’t need anyone to give you permission. You don’t need to be given a CEO title, or be the most senior person in the room. You can just assume the accountability yourself and start taking the actions you think are necessary for the project to succeed but that aren’t being done. That’s typically how great leaders emerge.

When should I keep going, and when should I stop?

We often hear the term “fail fast”: prove something doesn’t work and move to a different idea immediately.

But people also say “dig deeper” — maybe the real solution is just one iteration away. Thomas Edison failed thousands of times before being able to build a light bulb that works.

Well.. that’s pretty confusing — how do we know which case we’re in?

I’m not wise enough to give you a classifier at 100% accuracy. But here are a few things to chew on when you confront the question yourself -

Make a clear distinction between “the problem is not real” vs “the solution doesn’t work” and figure out which case it is. Sometimes we think people have a big problem when they actually don’t care about it that much. If the feedback you are getting suggests that there’s no strong demand for solving the problem, it’s time to go back to the drawing board and try to identify a problem that’s real.
If people confirmed they have the problem, but they are just not happy with the solution you provided, figure out why. Ideally even before you began to solve the problem, you should have studied what solutions were tried in the past, why the problem is still there, so you can learn from them and avoid repeating the same attempt. Once you know why the solution doesn’t work, you can determine whether those factors are within your control or not, and decide next steps accordingly.
Assess whether you have the resources to build a working solution. Sometimes what you discover is that a working solution requires massive upfront investment or different domain expertise that you don’t have. You could either try and get those missing pieces, or move to a smaller, more manageable goal which sometimes can be a stepping stone towards the bigger dream.

Managing the ups and downs

Working on things from zero to one is inherently risky. Most startup companies don’t make it, and the founders get zero (or even negative) financial gains.

The same holds in a corporate environment: we can’t ask for the freedom to work on risky ideas that may or may not generate value, and then expect a GE rating regardless of the outcome just because the ideas are cool and we worked hard on them.

You will need to be comfortable with the rollercoaster ride that every entrepreneur will tell you they went through, and learn to manage the risks.

Know the bottom line

You can’t keep founding new startups if your family is starving — plain and simple. You need to regularly align with stakeholders and make sure you are meeting their core expectations and can keep the lights on.

Focus on long term growth

Understand that your performance rating is not the only thing you are gaining from your career. The rating rates your impact over the last 6 months, not you as a person. Even Steve Jobs had some bad performance reviews — so bad that he got removed from the company he founded — but no one today would look back on the story and say Jobs was incompetent.

What’s more important is your personal growth, which unfortunately is not as visible. You can understand it by asking yourself -

What did I accomplish this month that I couldn’t before?

As you acquire and strengthen your skills, trust that you will be making a larger impact down the road. Good performance ratings and financial gains will come as side effects when that happens, not as leading indicators.

Manage expectations

While entrepreneurs may enjoy the thrill from a rollercoaster ride, investors generally don’t like unpleasant surprises. This is why in every public company’s earnings call, good CEOs will set expectations clearly if they are anticipating anything to go slow next quarter.

The same idea applies when we work on zero to one projects. Because of how ambiguous things are, different stakeholders may have their own assumptions on how things will go. As the person who is working on the idea, you likely have the most insight into what to expect. Make sure to over-communicate, especially any bad news, challenges and risks, so others also know what’s to come and can plan accordingly instead of getting a huge surprise when the results are revealed and turn out to be miles away from their expectations.

Closing thoughts

I hope these experiences can be helpful to you, especially if going from zero to one is your passion as well. Please feel free to share with anyone else who may also find this interesting.

If you would like me to dive into anything more deeply, or touch on something not covered, please feel free to leave a comment! If you have thoughts on this topic, please share them below so others coming to this page can see them too.

Thanks for reading — until next time!
