LLM prompt and workflow evaluation template
Use this template to test whether an LLM workflow is reliable enough for a real pilot.
Evaluation dimensions
The goal is not a perfect prompt but a workflow that behaves predictably enough for human-reviewed use. Assess the workflow on each of these dimensions:
- Task success
- Evidence quality
- Latency
- Reviewer acceptance
- Fallback behavior
- Audit logging
What to test
Evaluate the whole workflow, not just the prompt text. The prompt may be fine while the retrieval, data shape, UI, or review path is weak.
- Representative examples
- Edge cases
- Bad input
- Missing source evidence
- Reviewer corrections
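One way to keep a test set covering the categories above is as structured records, so gaps in coverage are easy to spot. The sketch below is a minimal illustration in Python. The `EvalCase` fields, the category labels, and the `coverage` helper are all hypothetical names chosen for this example, not part of the template itself.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical category labels mirroring the "What to test" list.
CATEGORIES = {
    "representative",
    "edge_case",
    "bad_input",
    "missing_evidence",
    "reviewer_correction",
}


@dataclass
class EvalCase:
    case_id: str
    category: str
    input_text: str
    expected_behavior: str  # what a reviewer would accept, in plain words
    # Identifiers of source documents; left empty for missing-evidence cases.
    source_ids: list[str] = field(default_factory=list)

    def __post_init__(self):
        # Reject typos in category names early, when the case is created.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")


def coverage(cases):
    """Count cases per category, listing categories with no cases as 0."""
    counts = Counter(case.category for case in cases)
    return {category: counts.get(category, 0) for category in sorted(CATEGORIES)}
```

Calling `coverage(cases)` before a run shows which categories still have no cases, which is usually the first thing to fix in a small test set.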
What to measure
A useful evaluation has a small test set, expected behavior, reviewer notes, and a decision about whether the workflow is ready for a pilot.
- Pass/fail criteria
- Outputs accepted with edits
- Rejected outputs
- Latency range
- Fallback rate
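The measures above can be computed from a simple result log. The following sketch assumes each logged result carries a reviewer verdict, a latency in seconds, and a flag for whether the fallback path was used. The `Result` record, its field names, and the three verdict strings are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Result:
    case_id: str
    verdict: str        # assumed values: "accepted", "accepted_with_edit", "rejected"
    latency_s: float    # end-to-end latency in seconds
    used_fallback: bool


def summarize(results):
    """Return verdict rates, fallback rate, and latency range for a run."""
    total = len(results)
    if total == 0:
        raise ValueError("no results to summarize")

    def rate(predicate):
        # Share of results for which the predicate holds.
        return sum(1 for r in results if predicate(r)) / total

    latencies = [r.latency_s for r in results]
    return {
        "accepted_rate": rate(lambda r: r.verdict == "accepted"),
        "accepted_with_edit_rate": rate(lambda r: r.verdict == "accepted_with_edit"),
        "rejected_rate": rate(lambda r: r.verdict == "rejected"),
        "fallback_rate": rate(lambda r: r.used_fallback),
        "latency_min_s": min(latencies),
        "latency_max_s": max(latencies),
    }
```

Reporting the latency range as a minimum and maximum keeps the sketch short; with a larger test set, percentiles would likely be more informative.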
Evaluation template output
- Test set: Representative examples, edge cases, and known failure cases.
- Scoring rubric: Criteria for success, evidence quality, reviewer effort, and fallback behavior.
- Result log: Outputs, reviewer notes, accepted edits, rejected responses, and prompt versions.
- Pilot gate: A recommendation to pilot, revise, narrow, or stop the workflow.
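The pilot gate can be written down as an explicit rule so the recommendation is repeatable across runs. The sketch below is one possible rule, not a prescribed one. The function name `pilot_gate`, the input shape, and every threshold (10% rejected per category, 50% rejected overall, 10 seconds maximum latency) are hypothetical placeholders that a team would replace with its own criteria.

```python
def pilot_gate(
    counts,
    latency_max_s,
    *,
    max_rejected=0.10,   # hypothetical per-category rejection limit
    stop_rejected=0.50,  # hypothetical overall rejection rate that means "stop"
    max_latency_s=10.0,  # hypothetical latency ceiling in seconds
):
    """Return "pilot", "revise", "narrow", or "stop".

    counts maps a test category to a (rejected, total) pair.
    """
    total = sum(t for _, t in counts.values())
    if total == 0:
        raise ValueError("no results to evaluate")

    overall = sum(r for r, _ in counts.values()) / total
    if overall >= stop_rejected:
        return "stop"

    # Categories whose rejection rate exceeds the per-category limit.
    failing = [cat for cat, (r, t) in counts.items() if t and r / t > max_rejected]

    if not failing and latency_max_s <= max_latency_s:
        return "pilot"
    if latency_max_s > max_latency_s:
        # Narrowing the scope would not fix a slow workflow.
        return "revise"
    if len(failing) * 2 < len(counts):
        # Failures confined to a minority of categories: consider dropping them.
        return "narrow"
    return "revise"
```

For example, a run where only the bad-input category fails while the others pass would return "narrow", suggesting the workflow be piloted on a smaller scope rather than reworked entirely.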
Frequently asked questions
- How many examples do we need? Start small: 20-50 representative cases can reveal many workflow problems before a larger evaluation.
- Should evaluation include latency? Yes. A correct answer that is too slow may still fail as a product workflow.