A five-phase plan for moving from a flooded shared inbox to a calm, human-approved, AI-assisted support triage system. Built for solo founders and ops leads on teams of two to five.
For founders and ops leads handling support without a dedicated team · Reading time: 18 minutes · Five phases · Verified against tool versions current as of 2026-05-15
Sort the inbound before you touch it — then human-approve what matters.
Roadmap overview · five phases
Before you start
If your support inbox has crossed the threshold where you no longer reliably reply within a day, you are not failing — you have outgrown manual handling. The right next step is not "answer faster". It is to sort the inbound before you touch it, draft replies in batches, and keep yourself in the loop only where your judgement is the value. This roadmap walks the five phases that get you there without handing your voice over to a model.
What you will have at the end
A categorised inbox where every incoming ticket has a tag and a priority within two minutes of arrival.
A draft-reply system that produces a first pass in your voice for the four or five categories that repeat.
A human-approval step that catches refunds, edge cases, and anything emotionally loaded before they go out.
An honest measurement of how much time you actually save — and where the system breaks down.
What you need before phase 1
A working {your helpdesk} or shared inbox (Front, Help Scout, Missive, Gmail with shared labels, etc.) with the last 60 days of tickets accessible.
A paid ChatGPT or Claude account on a plan that supports custom instructions or projects. Free tiers will hit rate limits.
Roughly four to six hours of focused time across one week. Two hours for setup, the rest spread over the first three days of running the system.
Three integrations, one decision point — everything else is wiring.
Before
Wall-of-inbox, every morning
~47 tickets cleared in a 3-hour morning block
Priorities reshuffle every refresh.
Refunds and password resets compete for the same attention.
Two of every ten replies go out a day late.
After
Pre-sorted, draft-ready, gated
~30 minutes for the same ticket volume, with human-approved sends
Each ticket arrives pre-tagged within seconds.
Safe categories arrive with a draft you read and send.
Refunds wait in a review queue — never auto-sent.
Illustrative range, not benchmark — your numbers will vary by category mix and team size.
The roadmap
Five phases, sequenced so that each one is shippable on its own. If you stop after phase 2, you have a tagged inbox and your replies are still manual — that alone reclaims a meaningful amount of time. Phases 3 through 5 layer in drafting, human approval, and measurement. The decision gates between phases are real stop points, not formalities.
Phase 01
Sort 60 days of past tickets into your real categories
Before any AI touches inbound mail, you need to know what the inbound actually looks like. Most teams discover that four or five categories cover 80 to 90 percent of tickets — and that two of those categories were a surprise.
Export → cluster → rename → ship
Tools you will use
ChatGPT or Claude — for clustering ticket subjects into candidate categories (an example prompt follows this list).
{your helpdesk} or {your shared inbox} — to export the last 60 days of subject lines plus first-message bodies.
A spreadsheet — for the final category list and example tickets per category.
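A minimal clustering prompt for that first step, assuming you paste the export directly into the chat. The ticket count and line format below are illustrative, not requirements; use whatever your export produces.

```text
Below are ~300 support tickets from the last 60 days, one per line, as
"subject | first two sentences of the first message".

Group them into the smallest set of categories that covers at least
80 percent of tickets; aim for 4 to 7. For each category give:
1. A short working name (I will rename these myself).
2. The approximate share of tickets it covers.
3. Three representative subjects, quoted verbatim.

Put anything that does not fit into a final "unclear" bucket rather
than forcing it into a category.
```

The "unclear" bucket is usually where the surprise categories surface.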
Time + cost estimate
60 to 90 minutes. No additional cost beyond your existing helpdesk and AI subscription.
What you ship at the end
A category list — usually four to seven items — with two or three real example tickets pasted under each. This document is the source of truth for every later phase. If a phase 3 draft is wrong, the cause is almost always a missing or muddled category here.
Common failure modes
Over-categorising. If you end up with twelve categories, you are slicing too thin. Merge anything with fewer than three examples in 60 days; it is not yet a category, it is a one-off.
Mistaking subject lines for content. Two tickets titled "Question about billing" can be entirely different problems. Read the first message body, not just the subject, before assigning a category.
Letting the model name your categories. The model will suggest generic labels ("Technical Issues", "Account Help"). Rename them to language your team already uses. Your categories should sound like you, not like a CRM.
Decision gate before phase 2
Open three random tickets from the past week and tag each one using your new category list. If you can do this in under 30 seconds per ticket without ambiguity, proceed. If you stall on two or more, the category list is wrong — go back and refine it before automating anything.
Phase 02
Tag inbound tickets automatically on arrival
Now the AI does one job and one job only: read each new ticket and tag it with a category from your phase-1 list. No replies, no drafts. Just sorting. This is the smallest possible AI surface area and the easiest to verify.
Trigger → single call → label → sample-audit
Tools you will use
Zapier or Make — to trigger on new ticket and call the AI.
ChatGPT or Claude API — for the actual categorisation call. The API path is required here; the chat UI cannot be wired into your helpdesk reliably.
{your helpdesk} — to receive the tag back as a label or custom field.
Time + cost estimate
90 to 120 minutes to set up. Ongoing cost: roughly $0.001 to $0.005 per ticket on the cheapest current model tier. A team taking 500 tickets a month spends under $5.
What you ship at the end
Every new ticket arrives in your helpdesk with a category tag attached within seconds. Existing routing rules (e.g. assign refund tickets to the founder, assign technical questions to the lead engineer) now run on those tags.
Common failure modes
The model invents a category. Constrain the prompt to "Pick exactly one of the following labels. If none fit, return UNCATEGORISED." Then have a human review UNCATEGORISED tickets daily for the first two weeks — these are how you discover missing categories. A sketch of the constrained call follows this list.
Multilingual tickets get mis-tagged. If a meaningful share of inbound is not in English, test the prompt in each language. Most current models handle major European languages well, but the model confuses tone-laden categories (e.g. "Complaint") more often outside English.
Long-thread tickets time out. Pass only the first inbound message to the categoriser, not the full thread. Subsequent replies do not change the category.
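A minimal sketch of the categorisation call the Zapier or Make step would run, using the OpenAI Python SDK as one example. The label list, model name, and truncation length are placeholders for your own.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LABELS = ["Billing", "Shipping status", "Password reset",
          "Feature question", "Refund request"]  # your phase-1 list

def categorise(first_message: str) -> str:
    """Return exactly one phase-1 label, or UNCATEGORISED."""
    prompt = (
        "Pick exactly one of the following labels for this support "
        "ticket. If none fit, return UNCATEGORISED. Reply with the "
        f"label only.\n\nLabels: {', '.join(LABELS)}\n\n"
        f"Ticket:\n{first_message[:2000]}"  # first message only, truncated
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # a cheap tier; swap for your provider's
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # sorting wants determinism, not creativity
    )
    label = resp.choices[0].message.content.strip()
    # Guard in code too: if the model invents a label anyway, fall back.
    return label if label in LABELS else "UNCATEGORISED"
```

The final line is the belt-and-braces version of the prompt constraint: even if the model drifts, nothing outside your list ever reaches the helpdesk.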
Decision gate before phase 3
Run phase 2 for five business days. At day five, manually audit a random sample of 30 tagged tickets. If 27 or more are correctly categorised (90 percent), proceed. If fewer, the issue is almost always category overlap from phase 1, not model error — return to phase 1.
Phase 03
Draft replies for the two safest categories only
Resist the urge to draft replies for every category at once. Pick the two categories where the right reply is most formulaic — typically password resets, shipping status, or basic feature questions. Refunds and complaints stay manual until phase 4.
Tagged → drafted → unsent → human-sent
Tools you will use
ChatGPT or Claude with custom instructions (or a Project / GPT) that contains: your brand voice notes, your three to five most common phrasings, and the policy for each safe category.
{your helpdesk} — to receive the draft as an internal note or unsent reply, not a sent reply.
A short brand-voice document — three paragraphs is enough. Cover: greeting style, sign-off style, two phrases you would never use.
Time + cost estimate
Two to three hours for the brand-voice doc and prompt iteration. Ongoing cost: roughly $0.005 to $0.02 per drafted reply, depending on model and length.
What you ship at the end
For tickets in your two safest categories, a draft reply appears in your helpdesk as an unsent reply within seconds. A human still hits send. Nothing goes out without a person reading it.
Common failure modes
The drafts sound generic. The brand-voice doc is too short or too abstract. Add three real examples of replies you have sent recently — the model learns from concrete examples, not adjectives like "friendly" or "professional".
The drafts hallucinate policy. The model invents a refund window or a feature that does not exist. Fix: every category prompt must explicitly state "If the customer asks about anything not listed in the policies below, do not invent an answer. Reply with: I will check on this and get back to you within a business day." Let the human handle the unknowns; the drafting sketch after this list bakes this guardrail in.
The drafts auto-send. Double-check the integration: drafts go to the internal-note field or unsent-draft field. They never trigger a send. A single misconfigured Zap can send 200 wrong replies in an hour.
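A sketch of the drafting call with the guardrail above baked in. The policy text, file name, and model are illustrative; the non-negotiable part is the final comment.

```python
from openai import OpenAI

client = OpenAI()

BRAND_VOICE = open("brand_voice.md").read()  # your three-paragraph voice doc

POLICIES = {  # illustrative: replace with your real per-category policies
    "Password reset": "Resets are self-serve at /reset; links expire "
                      "after 60 minutes.",
    "Shipping status": "Orders ship within 2 business days; tracking is "
                       "emailed at dispatch.",
}

GUARDRAIL = (
    "If the customer asks about anything not listed in the policies "
    "above, do not invent an answer. Reply with: 'I will check on this "
    "and get back to you within a business day.'"
)

def draft_reply(category: str, ticket_body: str) -> str:
    prompt = (
        f"Brand voice:\n{BRAND_VOICE}\n\n"
        f"Policies for '{category}':\n{POLICIES[category]}\n\n"
        f"{GUARDRAIL}\n\nDraft a reply to this ticket:\n{ticket_body}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    draft = resp.choices[0].message.content
    # Write this to your helpdesk's internal-note or unsent-draft field.
    # Never wire it to a send endpoint.
    return draft
```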
Decision gate before phase 4
Run phase 3 for five business days. Track edit rate: how often you ship the draft as-is, how often you edit, how often you discard. If edit rate is under 30 percent and discard rate is under 10 percent, the drafts are good enough — proceed. Higher, and the prompt or brand-voice doc needs another pass.
Phase 04
Add the human-approval queue for sensitive categories
Refunds, complaints, churn-risk threads, and anything from a customer who is upset — these get drafted in the same way as phase 3 but flow into a review queue, not directly into the helpdesk reply field. A human reads each one, edits as needed, and only then sends.
Sensitive → drafted with guardrails → queued → human-edited send
Tools you will use
The same ChatGPT or Claude API integration from phase 3.
A review surface — a Slack channel, a shared Notion page, or your helpdesk's internal-notes field. Whichever your team already checks daily (a Slack-webhook sketch follows this list).
A second prompt that includes explicit refund-language guardrails: never promise a refund timeline you do not control, never blame the customer, never escalate tone.
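If Slack is the review surface, a minimal queueing sketch looks like this. It assumes a standard Slack incoming webhook; the function name and message layout are illustrative.

```python
import requests

# Create an incoming webhook in Slack and paste its URL here.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def queue_for_review(ticket_url: str, category: str, draft: str) -> None:
    """Post a sensitive-category draft to the review channel instead of
    writing it anywhere near the helpdesk reply field."""
    message = (
        f"*{category}* ticket needs review\n"
        f"Ticket: {ticket_url}\n\n"
        f"Proposed draft:\n>>> {draft}"
    )
    resp = requests.post(SLACK_WEBHOOK_URL, json={"text": message},
                         timeout=10)
    # Fail loudly: a silently dropped refund draft is an unanswered refund.
    resp.raise_for_status()
```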
Time + cost estimate
One to two hours for the second prompt and the queue setup. Ongoing cost: same per-ticket as phase 3, plus the human review time — typically 30 to 90 seconds per sensitive ticket.
What you ship at the end
Every refund, complaint, or upset-customer ticket arrives in the review queue with a drafted reply and the original ticket attached. A human reads both, edits or rewrites, and sends. No sensitive reply leaves the system without explicit human approval.
Common failure modes
The queue becomes its own inbox. If the review queue accumulates faster than you clear it, you have just moved the problem. Cap the queue size — if it crosses a threshold, sensitive replies fall back to manual handling until you catch up. The system is allowed to step back.
Reviewers stop reading the draft. After three weeks, humans start trusting the draft and skim. Build a weekly audit where one reviewer reads five sent sensitive replies in full and grades them on tone and accuracy. Catches drift before customers do.
The model softens too much. Models trained on broad data have a tendency to over-apologise. If your brand voice is direct and factual (per Klem HQ's example), instruct the prompt explicitly: "Do not apologise unless we are at fault. Acknowledge the customer's concern in one sentence, then move to the substantive reply."
Decision gate before phase 5
Run phase 4 for two weeks. Pull ten randomly-selected sensitive replies that were sent during this period. Read them as if you were the customer. If you would be satisfied with all ten, proceed. If even one would have damaged a relationship, the prompt or the review process needs more work — do not move to measurement until trust in the system is real.
Phase 05
Measure honestly and adjust
By the end of phase 4 you have a working system. The remaining question is whether it is actually saving time and preserving — or improving — customer satisfaction. Most teams skip this phase and end up with a system that feels efficient but is quietly slipping. Measure for a month before deciding what to scale.
Metrics → review → drift check → adjust
Tools you will use
{your helpdesk} reporting — for response time and resolution time, broken down by category.
A short customer-side measurement — a one-question survey on resolved tickets, or a quarterly NPS, whichever you already run.
A spreadsheet or {your notes tool} page tracking the four numbers below.
Time + cost estimate
30 minutes a week for the first month. After that, 30 minutes a month is enough.
What you ship at the end
A monthly check that tracks four numbers: total tickets handled, your time spent per ticket (sampled), edit-rate on AI drafts, and customer satisfaction on resolved tickets. If any of these drift in the wrong direction for two months running, you adjust — usually by tightening a prompt or rolling a category back to manual.
Common failure modes
Measuring only volume. Tickets-per-hour up does not mean the system is working. Pair it with customer satisfaction. If satisfaction is flat or down while volume is up, you are processing more, not helping more.
Ignoring slow drift. A draft prompt that worked in month one slowly degrades as your product changes, your customer base shifts, or the underlying model updates. The weekly audit from phase 4 plus this monthly check are how you catch drift before customers do.
Scaling categories too fast. Adding a sixth or seventh draft category before the first four are stable doubles the failure surface. Add one category at a time, run it for two weeks at draft-only, then move it to send.
Ongoing decision gate
If your measured edit rate climbs above 50 percent on any category, that category is no longer safe for automated drafting. Roll it back to manual. The system is allowed to lose ground; pretending it still works costs more than admitting it does not.
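A minimal sketch of this check, assuming you log one row per drafted reply as you go. The file name and column values are placeholders; any spreadsheet export in this shape works.

```python
import csv
from collections import Counter, defaultdict

# Assumed log format, one row per drafted reply:
#   date,category,outcome   with outcome in {sent_as_is, edited, discarded}
tallies: dict[str, Counter] = defaultdict(Counter)
with open("draft_log.csv", newline="") as f:
    for row in csv.DictReader(f):
        tallies[row["category"]][row["outcome"]] += 1

for category, counts in sorted(tallies.items()):
    total = sum(counts.values())
    edited = counts["edited"] / total
    discarded = counts["discarded"] / total
    flag = "  <- roll back to manual" if edited > 0.50 else ""
    print(f"{category}: edited {edited:.0%}, discarded {discarded:.0%}, "
          f"n={total}{flag}")
```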
What the AI cannot do
These are specific limits as of 2026-05. Treat them as the failure modes you would otherwise discover the hard way.
Honest limits
It cannot reliably read tone in short messages. A two-line ticket that says "this is fine" can be sincere or seething. The model will usually pick the literal reading. The human-approval gate in phase 4 is the only reliable catch.
It cannot know your private context. If a customer references an off-thread conversation, a Slack message from your team, or a previous refund that lives in your billing tool, the AI cannot see it. Drafts will sound confidently wrong on these tickets. Keep them manual or add the context explicitly to the prompt.
It cannot remember the customer between tickets. Unless you wire in retrieval from your CRM, each draft starts from zero. A returning customer with a known issue history will receive a draft that treats them as new. Either add CRM context to the prompt (a sketch of the prompt assembly follows this section) or route returning-customer tickets to manual.
It cannot make policy decisions. "Should we refund this?" is a judgement call about precedent, customer value, and how much grace the situation deserves. The model will produce a plausible-sounding answer that is not anchored in your actual policy. Refunds always need a human.
It cannot detect when it is wrong. Hallucinated policies, made-up shipping dates, invented features — the model returns these with the same confidence as correct answers. The phase 1 category list, the explicit "do not invent answers" prompt instruction, and the weekly audit are layered defences against this. Removing any one of them lets failures through.
Three sequential checks; failing any one keeps a human in the loop.
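If you do wire in CRM context rather than routing returning customers to manual, a minimal sketch of the prompt assembly looks like this. fetch_history is a stub you replace with your real CRM or billing lookup; everything here is illustrative.

```python
from dataclasses import dataclass

@dataclass
class Event:
    date: str
    summary: str

def fetch_history(email: str) -> list[Event]:
    """Stub: replace with a real lookup against your CRM or billing
    tool (past tickets, refunds, plan changes)."""
    return []

def prompt_with_history(ticket_body: str, customer_email: str) -> str:
    history = fetch_history(customer_email)[-5:]  # last five events only
    context = ("\n".join(f"- {e.date}: {e.summary}" for e in history)
               or "No prior history on record.")
    return (
        "Known history with this customer. Treat it as ground truth and "
        f"do not invent beyond it:\n{context}\n\n"
        f"Draft a reply to this ticket:\n{ticket_body}"
    )
```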
After you finish
The system needs maintenance, not because AI is fragile but because your product, your customers, and the categories that describe their tickets all change. The cadence below is what holds up over twelve months.
Maintenance cadence
Weekly — One reviewer reads five sent sensitive replies in full. Grade tone and accuracy. Note any drift in a running log.
Monthly — Pull the four numbers from phase 5. Compare to last month. If anything has moved meaningfully, find the cause before changing the prompt.
Quarterly — Re-run phase 1 with the most recent 60 days. New categories often appear when a product line ships or a customer segment shifts. Update the category list, then propagate to the tagging and drafting prompts.
On model upgrades — When your AI provider releases a new model and you switch, re-run the phase 2 audit (30 tickets) and the phase 3 edit-rate measurement (one week). Behaviour changes across model versions in ways the release notes rarely capture.