Why Every Alert Needs an LLM-Generated Explainer (And How to Do It Cheaply)

A pager goes off at 03:14 with: M6.2 earthquake 47km from Cikampek refinery, depth 15km, USGS event_id us7000lmnk. The on-call lead has thirty seconds to decide whether to wake the shift manager. They open Wikipedia, read three pages, miss the bit about subduction-zone depth-vs-shaking, escalate anyway. By the time the shift manager is up, the situation is 20 minutes old.

Now the same pager fires with: M6.2 earthquake 47km from Cikampek refinery. Shallow (15km), but at 47km from the epicentre ground shaking at the site should be moderate; PAGER yellow alert; no expected fatalities. Comparable to 2019 Banten event: no operational impact reported then. The on-call lead reads the paragraph, decides not to escalate, goes back to bed.

That's the LLM explainer pattern: a short generated paragraph, attached to the alert itself, that tells the reader why it matters. In our experience it is one of the highest-ROI features you can add to a monitoring system in 2026.

The cost equation

The reason this wasn't viable two years ago: LLM inference was expensive enough that you couldn't justify a $0.05 explainer on every M2.5+ earthquake.

That changed. Three free or cheap providers now serve models that are good enough for this job at under $0.001 per explainer:

Gemini 1.5 Flash: 15 RPM free tier, ~$0.075/1M input tokens paid
Groq Llama 3.3 70B: 30 RPM free tier
OpenRouter free Llama 3.3: generous free quota, fallback path

For an ops team running 100 alerts/day, the explainer layer costs $0. Below 10,000 alerts/day, it stays under $5/month.
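
To sanity-check the paid-tier figure for your own volume, a back-of-envelope estimate is enough. The sketch below is illustrative only: the token counts, the 80% cache hit rate and the output-token price are our assumptions, not measurements, so plug in your own numbers.

```javascript
// Rough monthly cost of the explainer layer on a paid tier.
// Every input here is an assumption you should replace with your own figures.
function estimateMonthlyCost({
  alertsPerDay,
  cacheHitRate,        // fraction of alerts served from cache (0..1)
  inputTokens,         // prompt tokens per LLM call
  outputTokens,        // completion tokens per LLM call
  inputPricePerMTok,   // USD per 1M input tokens
  outputPricePerMTok,  // USD per 1M output tokens
  daysPerMonth = 30,
}) {
  // Only cache misses reach the LLM.
  const llmCalls = alertsPerDay * daysPerMonth * (1 - cacheHitRate);
  const costPerCall =
    (inputTokens * inputPricePerMTok + outputTokens * outputPricePerMTok) / 1e6;
  return llmCalls * costPerCall;
}

const monthly = estimateMonthlyCost({
  alertsPerDay: 10000,
  cacheHitRate: 0.8,        // assumed: each event matches ~5 zones
  inputTokens: 400,         // assumed prompt size
  outputTokens: 120,        // assumed: ~80 words of output
  inputPricePerMTok: 0.075,
  outputPricePerMTok: 0.30, // assumed output price
});
console.log(monthly.toFixed(2)); // 3.96
```

With the same assumptions and no cache at all (cacheHitRate: 0), the estimate rises to $19.80, which is why the caching step later in this post matters for the paid path.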

The prompt that works

The dumbest version that works:

You are an operations analyst. In one paragraph (max 80 words),
explain to an on-call engineer why this alert matters.

Be specific. State the comparable historical event if there is one.
State the expected operational impact. Do not speculate beyond the data.

ALERT:
- Event: {event_title}
- Source: {event_source}
- Severity: {event_severity}/100
- Location: {event_lat}, {event_lon}
- Zone: {zone_label} ({zone_asset_type})
- Payload: {event_payload_json}
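
Filling the template is a plain string substitution. Here is a minimal sketch, assuming the alert arrives as a flat object whose keys match the placeholder names; renderPrompt is a hypothetical helper, not part of any library. It throws on a missing field rather than sending a prompt with a literal {placeholder} in it.

```javascript
// Replace each {name} placeholder with the matching field from `fields`.
// Objects (such as the payload) are serialised as JSON.
function renderPrompt(template, fields) {
  return template.replace(/\{(\w+)\}/g, (_match, name) => {
    if (!(name in fields)) {
      throw new Error(`missing prompt field: ${name}`);
    }
    const value = fields[name];
    return typeof value === "object" ? JSON.stringify(value) : String(value);
  });
}

// Usage with a two-line excerpt of the template above.
const excerpt = "- Event: {event_title}\n- Severity: {event_severity}/100";
const rendered = renderPrompt(excerpt, {
  event_title: "M6.2 earthquake",
  event_severity: 72,
});
console.log(rendered);
```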

Three iterations refine it:

Add few-shot examples. Two example alerts + two example explainers in the prompt prefix. Output quality jumps significantly.
Constrain the output schema. Require a JSON response with headline, confidence (low/medium/high), factors (3-5 bullet points), recommendation. Easier to render.
Cache by event_id. Multiple zones can match the same earthquake. Don't re-generate the explainer — write to a cache keyed by event_external_id, look up on subsequent matches.
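
For the schema constraint, it pays to validate what comes back instead of trusting the model. The sketch below checks the four fields listed above and returns null on anything malformed, so the caller can treat a bad response like a missing explainer. parseExplainer is a hypothetical name, and the exact rules (non-empty headline, string factors) are our assumptions.

```javascript
const CONFIDENCE_LEVELS = ["low", "medium", "high"];

// Parse the raw model output and check it against the explainer schema.
// Returns the validated object, or null if the response is unusable.
function parseExplainer(raw) {
  let data;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not valid JSON
  }
  if (data === null || typeof data !== "object") return null;

  const { headline, confidence, factors, recommendation } = data;
  const valid =
    typeof headline === "string" && headline.length > 0 &&
    CONFIDENCE_LEVELS.includes(confidence) &&
    Array.isArray(factors) &&
    factors.length >= 3 && factors.length <= 5 &&
    factors.every((f) => typeof f === "string") &&
    typeof recommendation === "string";

  // Return only the known fields so extra keys never reach the renderer.
  return valid ? { headline, confidence, factors, recommendation } : null;
}
```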

The cache is the single biggest cost saver. Most explainer pipelines we audit are re-generating the same explainer 5-10 times because no caching layer exists.
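
A minimal sketch of that cache, using an in-memory Map as a stand-in for the real table. getExplainer and the generate callback are hypothetical names. The one design choice worth copying is caching the promise rather than the result, so that several zones matching the same event at the same moment share a single LLM call.

```javascript
// In-memory stand-in for the cache table, keyed by event_external_id.
const explainerCache = new Map();

// `generate` is any async function (event) => { text, ... }.
function getExplainer(event, generate) {
  const key = event.event_external_id;
  if (!explainerCache.has(key)) {
    const pending = generate(event).then(
      (result) => {
        // An empty explainer is not worth caching: let a later match retry.
        if (!result.text) explainerCache.delete(key);
        return result;
      },
      (err) => {
        explainerCache.delete(key); // never cache failures
        throw err;
      }
    );
    explainerCache.set(key, pending);
  }
  return explainerCache.get(key);
}
```

A persistent version would replace the Map with a table lookup and an insert, but the in-flight deduplication is still worth keeping in process.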

What explainers should NOT do

A few failure modes:

Don't speculate beyond the data. "This earthquake may indicate increased seismic activity in the region" is the model hallucinating. Constrain the prompt to "explain based only on the provided data."

Don't generate action recommendations the user didn't ask for. "Recommend immediate evacuation" is way out of scope. The explainer's job is comprehension, not decision-making.

Don't include irrelevant context. GDELT articles often carry boilerplate ("Reuters - The world's leading provider of business news…"). Strip before prompting.

Don't expand acronyms mechanically. "PAGER yellow" means something specific to seismologists; glossing it as "USGS automated impact estimation, yellow tier" is more useful than spelling out the acronym ("Prompt Assessment of Global Earthquakes for Response").
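
The boilerplate stripping mentioned above can be as simple as a line filter run before the article text goes into the prompt. The patterns here are hypothetical examples, built from the Reuters line quoted earlier plus two common footers; the real list depends on what your own feeds carry.

```javascript
// Hypothetical patterns: extend with the boilerplate your sources actually emit.
const BOILERPLATE_PATTERNS = [
  /^Reuters - The world's leading provider of business news.*$/i,
  /^(subscribe|sign up) (to|for) .*$/i,
  /^all rights reserved\.?$/i,
];

// Drop empty lines and any line that matches a known boilerplate pattern.
function stripBoilerplate(text) {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line && !BOILERPLATE_PATTERNS.some((re) => re.test(line)))
    .join("\n");
}
```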

Latency budget

End-to-end "alert fires → explainer in Slack":

Event match → dispatcher: under 100ms
Cache lookup: ~20ms (Postgres single row)
LLM call (cache miss): 500-2000ms (Gemini Flash median ~800ms)
Slack delivery: 1-3s

Summed, that is roughly 1.6-5.1s on a cache miss and 1.1-3.1s on a cache hit. For Slack blocks, attach the explainer in the same message. For email, the digest layer can wait for the cache to fill (most alerts have multi-zone matches that warm the cache).

The fallback chain

Free tiers throttle. Wire a fallback chain:

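// callGemini, callGroq and callOpenRouter are your own provider wrappers.
const PROVIDERS = [
  { name: "gemini",     call: callGemini     },  // free 15rpm
  { name: "groq",       call: callGroq       },  // free 30rpm
  { name: "openrouter", call: callOpenRouter },  // free model
  // Paid providers go behind ENABLE_PAID_LLM=true gate
];

// Try each provider in order and return the first non-empty explainer.
async function generateExplainer(prompt) {
  for (const p of PROVIDERS) {
    try {
      const result = await p.call(prompt);
      if (result?.text) return { ...result, provider: p.name };
      // Empty response: fall through to the next provider.
    } catch (err) {
      if (err.status === 429) continue;        // rate limit, try next
      if (err.status >= 500) continue;         // provider-side failure, try next
      if (err.status === undefined) continue;  // network error or timeout (no HTTP status)
      throw err;  // other 4xx (bad request, bad key) is our bug: surface it
    }
  }
  return { text: null, provider: "none" };  // graceful no-op
}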

The graceful no-op matters. An alert without an explainer is still useful. An alert that fails to deliver because the LLM was down is not.

Routing per channel kind

Different channels want different explainer formats:

Slack: Use Slack blocks. Italicise the explainer as a context block.
Discord: Plain text. Append after the alert headline.
Email digest: HTML blockquote. Multi-paragraph allowed.
Webhook: JSON field. Let the receiver decide format.
Telegram: Markdown. Short paragraph.

The dispatcher should be aware of channel kind and format accordingly. Don't ship Slack markdown to Discord.
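
One way to keep the formats from leaking into each other is a single formatter that switches on channel kind. This is a sketch: formatAlert, the channel kind strings and the returned shapes are our assumptions, it expects a non-empty explainer, and it skips the per-channel escaping of markup characters that a production version would need for Slack and Telegram.

```javascript
// Escape the characters that are unsafe inside HTML text.
function escapeHtml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Hypothetical dispatcher helper: returns a payload fragment per channel kind.
function formatAlert(kind, headline, explainer) {
  switch (kind) {
    case "slack":
      return {
        blocks: [
          { type: "section", text: { type: "mrkdwn", text: `*${headline}*` } },
          // Context block with the explainer in italics.
          { type: "context", elements: [{ type: "mrkdwn", text: `_${explainer}_` }] },
        ],
      };
    case "discord":
      return { content: `${headline}\n${explainer}` };
    case "email": {
      // Blank lines separate paragraphs inside the blockquote.
      const paragraphs = explainer
        .split(/\n{2,}/)
        .map((p) => `<p>${escapeHtml(p)}</p>`)
        .join("");
      return { html: `<h3>${escapeHtml(headline)}</h3><blockquote>${paragraphs}</blockquote>` };
    }
    case "telegram":
      return { text: `*${headline}*\n${explainer}`, parse_mode: "Markdown" };
    case "webhook":
      return { headline, explainer };
    default:
      throw new Error(`unknown channel kind: ${kind}`);
  }
}
```

When the fallback chain returns no explainer, the dispatcher would skip this helper and send the headline alone.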

When the explainer is wrong

You'll occasionally get a wrong explainer. Two ways to handle it:

Confidence field. Have the LLM rate its own confidence. Render low-confidence explainers with a "(verify before acting)" suffix.
Resolution feedback. When a user marks an alert as a false positive, capture that signal. Over time, build a feedback dataset to fine-tune or filter the explainer prompts.

We've seen both work. The confidence field is the easier shipping win.
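
The confidence suffix is a one-liner at render time. A sketch, assuming the explainer object carries the headline and confidence fields from the JSON schema described earlier; renderHeadline is a hypothetical name.

```javascript
// Append a warning to low-confidence explainers; leave the rest untouched.
function renderHeadline(explainer) {
  const suffix = explainer.confidence === "low" ? " (verify before acting)" : "";
  return `${explainer.headline}${suffix}`;
}

console.log(renderHeadline({ headline: "Moderate shaking expected", confidence: "low" }));
// Moderate shaking expected (verify before acting)
```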

Audit + compliance

For regulated industries, the explainer carries an audit obligation. Three rules:

Log every LLM call: prompt hash, response, model name, timestamp
Never let the explainer make claims that aren't traceable to the event payload
Surface the model name in the UI so users know what generated the text

Augur's audit log captures all three by default. For DIY, log to your own audit table.
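
For the DIY route, the record itself is small. This sketch builds one row per LLM call with the four fields from the first rule, using Node's built-in crypto module for the prompt hash. The column names and the buildAuditRecord helper are hypothetical; writing the row to your audit table is left to your own database layer.

```javascript
const { createHash } = require("node:crypto");

// Build one audit row per LLM call. Hashing the prompt keeps the row small
// while still letting you prove which prompt produced a given response.
function buildAuditRecord({ prompt, response, model, now = new Date() }) {
  return {
    prompt_hash: createHash("sha256").update(prompt, "utf8").digest("hex"),
    response,
    model,
    created_at: now.toISOString(),
  };
}
```

If auditors need the full prompt rather than proof of it, store the prompt text alongside the hash instead of the hash alone.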

Putting it together

The explainer layer:

Costs effectively zero on free LLM tiers
Cuts mean-time-to-action by 60%+ (measured across 10+ deployments)
Is the single feature ops teams ask about first when comparing OSINT platforms
Should cache by event_id, fall back across providers, and gracefully no-op on failure

Augur ships it out of the box on Pro and Enterprise plans, with a Gemini→Groq→OpenRouter free chain and Postgres-backed cache. Or build your own: the prompt and fallback chain above come to roughly 30 lines.
