Support Email Triage With LLM Review
Support queues often hide product and operations pain. Feature requests, bug reports, billing issues, onboarding questions, and urgent account problems arrive in the same inbox or helpdesk.
LLM triage can help if it is designed as decision support rather than final decision-making. The difference is operational, not philosophical: the model reads, classifies, and drafts; the human approves and sends. That structure is what lets a support team adopt AI without exposing the company's voice to a model's bad day.
A Narrow First Scope
The first sprint should classify messages, extract key details, and prepare a response draft for review.
Classify message type, urgency, product area, and customer tier
Extract account IDs, error text, plan details, and requested actions
Detect missing information
Route tickets to the right queue
Draft replies in the product's tone
The support or success team still approves the reply. The system reduces reading, routing, and drafting time.
A reasonable reference architecture for the first sprint:
```
inbound: Gmail API / IMAP / Zendesk webhook / Intercom webhook
→ normalize to Ticket { id, channel, from, subject, body, attachments, customer_id? }
→ enrich: look up customer in CRM/billing, attach plan and recent activity
→ LLM classification (JSON schema):
{
type: "bug" | "billing" | "onboarding" | "feature_request" | "urgent",
urgency: "low" | "normal" | "high" | "critical",
product_area: "...",
customer_tier: "free" | "pro" | "enterprise",
missing_info: ["account_id" | "error_message" | ...],
suggested_queue: "tier1" | "billing" | "engineering" | "csm",
confidence: { type: 0.93, urgency: 0.71, ... }
}
→ RAG step: retrieve top knowledge-base articles, prior tickets, recent release notes
→ LLM draft (with citations to retrieved sources, tone profile, signature)
→ human review queue in a small Next.js admin or directly inside Zendesk macros
→ on approve: send via the original channel, write back to CRM/Zendesk,
log full run for analytics
```
Two design choices that matter:
Customer context before classification. The same ticket from a free user and an enterprise user is not the same ticket. Enriching first improves both routing and draft quality.
Draft must cite. Every claim in the draft (refund policy, feature availability, troubleshooting step) is grounded in a knowledge-base article or release note. The reviewer sees the citation and can verify in seconds.
What Makes It Safe
Customer communication affects trust, so the workflow should show source text, confidence, and the reason for each suggested action. Low-confidence messages should stay in the manual queue.
Specific guardrails worth implementing from the start:
No outbound messages without a human approve action. Even a single click is enough — the point is auditability.
Tone profile per brand. Defined as a short style guide passed in the prompt, with a few approved/rejected reply examples. Without it, the model defaults to a generic helpdesk voice.
Forbidden topics list. Refunds above a threshold, legal language, security disclosures, and pricing exceptions are routed to a stricter queue regardless of confidence.
PII redaction at the boundary. Mask credit-card-like patterns, government IDs, and OAuth tokens before sending text to the model.
Per-customer rate limit on automated drafts. Even with human approval, the same customer should not receive five AI-drafted replies in an hour.
The product team should also get visibility into repeated bugs, confusing UX, and documentation gaps. The classification data is a free product-research stream: weekly clusters of "missing_info: account_id" point at an onboarding bug; spikes in "feature_request" for the same area inform the roadmap.
What To Measure
Track first response time, backlog size, routing accuracy, draft acceptance rate, escalation rate, and recurring issue categories.
Concrete targets that have held up in real pilots:
First-response time. Median cut by 40-70% in the first month, mostly from instant draft preparation.
Routing accuracy. 90%+ for the four largest categories, measured against the team's own reclassification.
Draft acceptance rate. 40-60% accepted with no edit, another 20-30% accepted with light edits, the rest rewritten. The "no edit" rate is the cleanest signal of trust.
Escalation rate. Should not increase. If it does, the model is missing nuance and the prompt or routing policy needs adjustment.
Reviewer time per ticket. Median reading + decision time. The win is here, not in the absolute number of tickets handled.
If the first sprint works, the same pattern can expand to in-app chat, knowledge-base suggestions, CRM updates, and product feedback summaries. The same triage stack — classify, enrich, draft, review, log — is the substrate for all of them.