Keep Working: A One-Button Prompt Deck Where Every Card Earned Its Place

Keep Working is the button you wish lived next to Send. At minute 47 of a long LLM session you know the model needs another push, but you are too tired to write a good one. Keep Working serves that push as a card: press Space for the next prompt, S for the same category one rung sharper, C to copy, F to flip the card and read why that prompt works. Every card has a shareable permalink, and the first card is server-rendered, so the page is useful from the very first paint.

You can open the live product here: keepworking.urbanodx.com.

The Quality Bar: Evals Before Shipping

The deck looks bigger than it is, on purpose. It ships in two tiers:

Core: 4 categories, about 86 prompts. Every core category beat both "are you sure?" and "review your last answer for errors" in a blind head-to-head judged by three independent models. That result rests on 184 blind judgments.

Experimental: 12 categories, about 275 prompts. Promising but unproven, or shown to make results worse. Available only by explicit opt-in.
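The blind head-to-head behind the core tier can be sketched as a small tally. This is a minimal illustration, not the project's actual eval harness: the record fields (baseline, judge, winner), the function names, and the 0.5 win-rate threshold are all assumptions made for the example.

```python
from collections import defaultdict

# The two trivial baselines named in the article.
BASELINES = ("are you sure?", "review your last answer for errors")


def win_rates(judgments):
    """Map each baseline to the candidate's share of wins against it.

    Each judgment is a dict with hypothetical keys:
    "baseline" (which baseline was compared), "judge" (which model judged),
    and "winner" ("candidate" or "baseline").
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for j in judgments:
        totals[j["baseline"]] += 1
        if j["winner"] == "candidate":
            wins[j["baseline"]] += 1
    return {b: wins[b] / totals[b] for b in totals}


def beats_all_baselines(judgments, baselines=BASELINES, threshold=0.5):
    """True only if the candidate's win rate exceeds the (assumed)
    threshold against every baseline; a baseline with no judgments fails."""
    rates = win_rates(judgments)
    return all(rates.get(b, 0.0) > threshold for b in baselines)
```

Requiring a win against every baseline, rather than an average across them, means a category that only beats the weaker baseline cannot be promoted.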

The default draw uses only the core tier. Promoting a category from experimental to core requires a fresh evaluation, and the eval history is kept in the repo, including the iterations that regressed the deck before it stabilized. The honest finding behind the product: most prompt advice fails a blind test against trivial baselines, so the default deck contains only what passed.
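A core-only default with explicit opt-in might look like the sketch below. The prompt shape (a dict with a "tier" key) and the include_experimental parameter are hypothetical names chosen for illustration; the real service may model tiers differently.

```python
import random


def draw(prompts, include_experimental=False, rng=random):
    """Pick one prompt at random.

    By default only prompts whose (hypothetical) "tier" field is "core"
    are eligible; experimental prompts join the pool only when the caller
    opts in explicitly.
    """
    pool = [
        p for p in prompts
        if p["tier"] == "core" or include_experimental
    ]
    if not pool:
        raise ValueError("no prompts available for this draw")
    return rng.choice(pool)
```

Making the opt-in a parameter that defaults to False keeps the safe behavior as the path of least resistance: a caller has to ask for unproven prompts on purpose.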

One Dataset, Three Surfaces

Keep Working is a small FastAPI service. The web app, the JSON API, and the CLI all read the same prompts file, so every surface stays in sync. The file is mounted read-only into the container instead of being built into the image, so the deck can be hot-swapped without a rebuild. Interactive API docs ship with the service at /docs, and /health reports prompt and category counts for uptime monitoring.
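As a rough sketch of the single-dataset idea, the helpers below load one prompts file and derive the counts a health check could report. The JSON layout (a list of objects with a "category" key), the function names, and the payload keys are assumptions; the sketch uses only the standard library and leaves out the FastAPI route that would return this payload from /health.

```python
import json
from pathlib import Path


def load_prompts(path):
    """Read the shared prompts file.

    Assumes (hypothetically) that the file is a JSON list of objects,
    each with at least a "category" key.
    """
    return json.loads(Path(path).read_text(encoding="utf-8"))


def health_payload(prompts):
    """Summarize the deck for uptime monitoring: total prompt count
    and the number of distinct categories."""
    categories = {p["category"] for p in prompts}
    return {
        "status": "ok",
        "prompts": len(prompts),
        "categories": len(categories),
    }
```

If the web app, the API, and the CLI all call the same loader, a swapped-in prompts file changes every surface at once, and a monitor can see the change in the reported counts.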

What It Demonstrates

Evaluation discipline at the smallest possible scale. The same eval-set thinking we apply to client AI workflows: if a pattern cannot beat a baseline in a blind test, it does not ship as a default. We wrote about the workflow version of this in human review in AI workflows.

API-first product design. One canonical dataset feeding a web UI, a documented JSON API, and a CLI, with health reporting from day one.

Small is shippable. A one-button product with honest evals is more useful than a sprawling prompt library nobody has tested.

If you want this discipline applied to your own AI workflow, that is literally our job: a fixed-scope AI Workflow Teardown tells you which steps of your process would survive a blind evaluation, before you build anything.