This page is for the curious — what case-calendar actually does between “new docket entry” and “calendar event”. You don’t need it to run the tool, but if you’re going to modify it (or just want to understand the trade-offs), this is the map.
The exhaustive design-decisions reference lives in
AGENTS.md
in the repo. This page is the concise version.
The pipeline at a glance
CourtListener docket
│
▼
┌───────────────────┐
│ regex pre-filter │ cheap. drops 80%+ of entries before any LLM call.
└─────────┬─────────┘
│ hearings, deadlines, briefing schedules, etc.
▼
┌───────────────────┐
│ LLM extractor │ small/fast tier (Claude Haiku, gpt-5.4-nano, Gemini Flash Lite).
│ per docket entry │ returns ADD / RESCHEDULE / CANCEL / MARK_HELD / ...
└─────────┬─────────┘
│
▼
┌───────────────────┐
│ SQLite store │ stable (case_id, hearing_key) rows.
└─────────┬─────────┘
│
▼
┌───────────────────┐
│ end-of-sync │ verify-pass LLM checks each live hearing /
│ confidence checks │ deadline against the docket. Catches missed
└─────────┬─────────┘ reschedules, etc.
│
▼
┌───────────────────┐
│ renderers │ ICS, Google Calendar, M365 Outlook, index.html.
└───────────────────┘
Two delivery modes feed the pipeline:
- Polling. `case-calendar sync` walks every docket in `config.yaml`, pulls anything newer than the store’s high-water mark, runs it through the pipeline, and re-emits affected calendars. Designed to run on a cron.
- Webhooks. `case-calendar serve` listens for CourtListener `DOCKET_ALERT` events and runs the same `process_entry` function on each delivery. One entry per HTTP request, one calendar re-emit per delivery, in seconds.
Both paths share the same code beneath the entry processor. A hearing extracted via webhook is byte-identical to one extracted via polling.
Two LLM tracks
case-calendar uses large-language-model calls for two distinct jobs. Throughout the codebase and these docs:
- Extraction is the act of reading a single docket entry and pulling out the structured facts that turn into calendar events: the what (a sentencing hearing, a response-brief deadline), the when (date and time in the court’s local zone), the which (is this a brand-new hearing, a reschedule of an existing one, a cancellation, or evidence that an earlier one already happened), and the significance (does this rise to the level of a public-calendar event, or is it procedural noise). Extraction runs on every relevant docket entry — high volume, narrow output, classification-shaped. Cheap models do it well.
- Summarization is the act of reading a docket’s primary document (indictment, complaint, etc.) plus any disposition documents (judgments, plea agreements, dismissals) and producing 2-4 sentences of prose that tell a subscriber what the case is about and where it stands. Summarization runs at most once per docket, only when a new primary document or disposition lands. Low volume, long context, synthesis-heavy. Higher-tier models earn their keep.
The two jobs have different cost / quality trade-offs, so they’re wired to independent provider and model knobs:
| Track | Volume | Default model | Why |
|---|---|---|---|
| Extraction | High (one call per relevant entry) | Claude Haiku / gpt-5.4-nano / Gemini Flash Lite | Structured-output classification — date, key, significance. The cheap tier handles it fine, and the per-case cost stays in the cents-per-day range. |
| Summarization | Low (one call per docket, rarely re-run) | Sonnet / GPT-5.4 / Gemini Pro | Synthesis from 30-100k tokens of legal prose. Worth the upgrade; pennies per docket. |
The knobs (`LLM_PROVIDER` / `LLM_MODEL` for the extractor; `LLM_SUMMARY_PROVIDER` /
`LLM_SUMMARY_MODEL` for summaries) are fully independent, so changing one track
never affects the other.
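A minimal sketch of how two independent knob pairs might be resolved — the defaults here are illustrative placeholders, not the project’s real defaults:

```python
import os

def resolve_llm_knobs(env=None):
    """Resolve the two tracks' (provider, model) pairs from their
    independent environment knobs. Defaults are illustrative only."""
    env = os.environ if env is None else env
    return {
        "extraction": (
            env.get("LLM_PROVIDER", "anthropic"),
            env.get("LLM_MODEL", "claude-haiku"),
        ),
        "summary": (
            env.get("LLM_SUMMARY_PROVIDER", "anthropic"),
            env.get("LLM_SUMMARY_MODEL", "claude-sonnet"),
        ),
    }
```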
Why LLM-driven extraction, not regex?
Courts describe hearings inconsistently. The same event can show up as:
- `Set/Reset Hearings` (a clerk’s minute entry)
- `ELECTRONIC NOTICE OF RESCHEDULING`
- `Order on Stipulation for Continuance`
- A scheduling order with the date embedded in the PDF text
- A paperless minute entry with no document attached
Maintaining regexes per court is a treadmill — and a new clerk’s habits
break them silently. Instead, the LLM sees the entry plus the case’s
known-hearings list, and decides ADD vs RESCHEDULE vs UPDATE vs
CANCEL in one call. A cheap regex pre-filter still runs before the LLM
to drop the obvious non-hearings (briefs, attorney appearances, sealed
placeholders) for free.
Stable hearing keys
Each logical hearing — say, “sentencing for Smith” — gets a stable
hearing_key (kebab-case, e.g. smith-sentencing) assigned on first
observation. Reschedules and detail updates land on the same row. The
Google Calendar event id is derived deterministically from
sha1(case_id::hearing_key), so the same logical hearing is the same
calendar event across syncs, reschedules, and database restores.
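That derivation fits in a few lines, assuming only what the paragraph above states — the id is the SHA-1 hex digest of `case_id::hearing_key`:

```python
import hashlib

def gcal_event_id(case_id: str, hearing_key: str) -> str:
    # A SHA-1 hex digest uses only [0-9a-f], a subset of the base32hex
    # alphabet Google Calendar accepts for caller-supplied event ids,
    # so the digest works directly as the event id.
    return hashlib.sha1(f"{case_id}::{hearing_key}".encode()).hexdigest()
```

Because the input is stable, restoring the database or re-running a sync regenerates the same id and updates the existing calendar event instead of creating a duplicate.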
Filing deadlines work the same way, in a parallel deadlines table with
a separate deadline_key. Renderers don’t care which is which — both are
projected into the same shape before the ICS / gcal layer ever sees them.
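One way that projection could look — the `CalendarItem` shape and its field names are hypothetical, chosen only to illustrate that hearings and deadlines converge before rendering:

```python
from dataclasses import dataclass

@dataclass
class CalendarItem:
    """Common shape the renderers consume; fields are illustrative."""
    case_id: str
    key: str      # hearing_key or deadline_key -- renderers don't care
    title: str
    when: str     # court-local datetime, kept as stored
    status: str

def project_hearing(row: dict) -> CalendarItem:
    return CalendarItem(row["case_id"], row["hearing_key"],
                        row["title"], row["date"], row["status"])

def project_deadline(row: dict) -> CalendarItem:
    return CalendarItem(row["case_id"], row["deadline_key"],
                        row["title"], row["due_date"], row["status"])
```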
Three-tier short-circuit
Quiet days cost almost nothing because the syncer short-circuits at three levels:
- Per-docket — if the docket’s `date_modified` hasn’t advanced since the last sync, skip everything. No entries API call, no LLM.
- Per-entry — `iter_entries(modified_after=cutoff)` filters server-side to entries newer than the local high-water mark.
- Per-fingerprint — even if an entry comes back, dedup against `(docket_id, entry_id, content_fingerprint)` skips re-LLM-ing entries whose substantive content didn’t change.
On a busy docket with a real update, this still pays for one LLM call. On a quiet day across 30 dockets, it pays for one cheap CourtListener request per docket and zero LLM calls.
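The three tiers can be sketched with plain dicts standing in for the SQLite store and the CourtListener client — every name here is illustrative, not the real API:

```python
import hashlib

def fingerprint(entry):
    # Stand-in for the real content fingerprint described in the next section.
    return hashlib.sha1(entry["description"].encode()).hexdigest()

def sync_docket(docket, entries, store, process_entry):
    """Run one docket through the three short-circuit tiers.
    Returns the number of LLM calls made."""
    # Tier 1: docket untouched since the stored watermark -- skip everything.
    wm = store.setdefault("watermarks", {}).get(docket["id"], "")
    if docket["date_modified"] <= wm:
        return 0
    calls = 0
    # Tier 2: in the real syncer this filter happens server-side via
    # iter_entries(modified_after=cutoff); here we filter locally.
    for entry in (e for e in entries if e["date_modified"] > wm):
        fp = fingerprint(entry)
        key = (docket["id"], entry["id"])
        # Tier 3: substantive content unchanged -- skip the LLM.
        if store.setdefault("fingerprints", {}).get(key) == fp:
            continue
        process_entry(entry)
        store["fingerprints"][key] = fp
        calls += 1
    store["watermarks"][docket["id"]] = docket["date_modified"]
    return calls
```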
What’s in the fingerprint
The third short-circuit is the interesting one. case-calendar can’t trust
“have we seen this entry_id before?” alone, because RECAP entries
evolve after they first appear — a sealed PDF gets unsealed, or a
previously-missing PDF finally gets uploaded to RECAP. We want to
re-process those entries, but ignore cosmetic churn that didn’t change
anything meaningful.
The fingerprint is a SHA-1 of just the entry state that matters:
- The entry’s `description` and `short_description` (the docket text).
- Its `date_filed`.
- For each attached document: the document’s description, whether it’s available on RECAP, whether it’s sealed, and whether any plain text has been extracted from it yet.
Those per-document flags are what make “PDF finally appeared on RECAP” or “sealed PDF was unsealed” re-trigger processing automatically: a flag flips → the fingerprint changes → the entry no longer matches its cached row → the syncer re-runs the LLM on it. Everything else — re-sorted metadata fields, unrelated audit columns — leaves the fingerprint stable, so the re-sync is a no-op.
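A sketch of how such a fingerprint might be computed — the field names are guesses from the description above, not the real schema:

```python
import hashlib
import json

def content_fingerprint(entry):
    """SHA-1 over only the entry state that matters; everything
    outside `material` is invisible to the fingerprint."""
    material = {
        "description": entry.get("description"),
        "short_description": entry.get("short_description"),
        "date_filed": entry.get("date_filed"),
        "documents": [
            (d.get("description"), d.get("is_available"),
             d.get("is_sealed"), bool(d.get("plain_text")))
            for d in entry.get("documents", [])
        ],
    }
    # json.dumps with sort_keys gives a stable serialization to hash.
    return hashlib.sha1(
        json.dumps(material, sort_keys=True, default=str).encode()
    ).hexdigest()
```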
End-of-sync confidence pass
After per-entry extraction, every scheduled or recently-changed hearing
gets a separate focused LLM call (verify_hearing). The model sees just
the candidate hearing plus the last 15 hearing-relevant entries on its
docket, and returns one of:
- `CONFIRM` — no-op.
- `RESCHEDULE` — the docket says the hearing moved; update the row.
- `CANCEL` — the docket cancelled it.
- `MARK_HELD` — there’s evidence the hearing happened (minute entry, verdict, transcript, judgment).
- `REINSTATE` — the row is marked cancelled but the docket doesn’t actually support that cancellation.
- `DELETE_HALLUCINATION` — the row was never a real hearing.
- `UNCLEAR` — leave it alone, re-check next sync.
This catches the classes of bug that per-entry extraction can’t see: reschedules across multiple entries, trials that got mooted by a plea but never explicitly vacated, and (rare) hallucinated rows.
There’s a parallel verify pass for filing deadlines when those are enabled on the case.
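Applying a verdict to a stored row could look roughly like this — the status strings and function shape are illustrative, not the real implementation:

```python
def apply_verdict(row, verdict, new_date=None):
    """Mutate a hearing-row dict according to a verify-pass verdict."""
    if verdict in ("CONFIRM", "UNCLEAR"):
        return row                      # leave alone; UNCLEAR re-checks next sync
    if verdict == "RESCHEDULE":
        row["date"] = new_date
        row["status"] = "scheduled"
    elif verdict == "CANCEL":
        row["status"] = "cancelled"
    elif verdict == "MARK_HELD":
        row["status"] = "held"
    elif verdict == "REINSTATE":
        row["status"] = "scheduled"     # cancellation not supported by docket
    elif verdict == "DELETE_HALLUCINATION":
        row["status"] = "deleted"       # never a real hearing
    return row
```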
The data model
The SQLite store has six operational tables:
- `dockets` — id, last `date_modified` (the short-circuit watermark), last filing date, cached court metadata.
- `entries` — dedup of already-processed entries, keyed by `(docket_id, entry_id)` with a content fingerprint. Description and document body are persisted only for entries that matter to either the extractor or the summary pipeline; everything else gets a fingerprint-only stub.
- `hearings` — per-case logical hearings keyed by `(case_id, hearing_key)`. Includes significance, status, calendar event ids (for idempotency across pushes), and the source-entry list for audit trails.
- `deadlines` — parallel structure to hearings, with statuses `pending` / `met` / `passed` / `cancelled`.
- `case_summaries` — per-docket prose summary plus a `stale` flag the syncer flips whenever a new primary document or disposition lands.
- `webhook_events` — idempotency-key dedup for the webhook receiver.
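An illustrative mini-schema for two of those tables — the column sets are guesses from this page, not the real DDL — showing how a reschedule lands on the same `(case_id, hearing_key)` row rather than creating a new one:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS hearings (
  case_id     TEXT NOT NULL,
  hearing_key TEXT NOT NULL,  -- stable kebab-case key, e.g. smith-sentencing
  status      TEXT NOT NULL,  -- scheduled / held / cancelled / ...
  date        TEXT,
  PRIMARY KEY (case_id, hearing_key)
);
CREATE TABLE IF NOT EXISTS entries (
  docket_id           INTEGER NOT NULL,
  entry_id            INTEGER NOT NULL,
  content_fingerprint TEXT NOT NULL,
  PRIMARY KEY (docket_id, entry_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# A reschedule is an upsert on the stable key, not a new row.
upsert = """
INSERT INTO hearings (case_id, hearing_key, status, date) VALUES (?, ?, ?, ?)
ON CONFLICT (case_id, hearing_key)
DO UPDATE SET status = excluded.status, date = excluded.date
"""
conn.execute(upsert, ("c1", "smith-sentencing", "scheduled", "2024-05-01"))
conn.execute(upsert, ("c1", "smith-sentencing", "scheduled", "2024-06-01"))
```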
WAL journaling + a 5-second busy_timeout let the polling sync
process and the long-running serve process safely share the same
SQLite file. The webhook server also serializes its own worker threads
with a server-wide lock.
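In code, those concurrency settings amount to something like the following — a sketch; the real connection helper may differ:

```python
import sqlite3

def open_store(path):
    """Open the shared store so the polling `sync` process and the
    long-running `serve` process can coexist on one SQLite file."""
    conn = sqlite3.connect(path, timeout=5.0)   # driver-side 5 s wait on lock
    # WAL lets readers keep reading while one writer proceeds.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA busy_timeout=5000")    # engine-side 5 s busy wait
    return conn
```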
Why “primary document”?
The summary pipeline talks about each docket’s primary document — the indictment, superseding indictment, information, complaint, amended complaint, or petition that establishes what the case is about. Earlier in the project this was called “operative pleading”, which is a real civil-practice term but reads oddly when applied to criminal indictments. “Primary document” connects to the established “primary source” concept and works across criminal and civil practice. See case summaries for what gets matched and how it’s used.
Data quality guardrails
Several of the codebase’s stricter behaviors exist to prevent specific failure modes seen on real dockets — hallucinations, false-positive “held” verdicts, calendar drift across timezones, cross-docket contamination. They look conservative on first read, and that’s the point: a wrong event on a public calendar erodes subscriber trust far more than a missing one.
- Past-date alone is not evidence a hearing happened. Trials get continued or vacated by plea agreement without an explicit cancellation entry; the calendar date passes; the verify pass refuses to mark the row “held” without affirmative evidence (a minute entry, verdict, transcript, or judgment-after-trial). A past-dated `scheduled` row accurately communicates “outcome not confirmed”.
- The summary LLM is told to refuse, not fabricate. When the inputs don’t support a confident summary, the model emits a fixed sentence (“Documents available for this docket are insufficient to generate a reliable summary.”) which the renderer surfaces verbatim. The alternative — letting the model invent plausible-sounding facts to fill the gap — produced exactly that kind of hallucination during early development.
- Court-local timezones are preserved on each event rather than normalizing to UTC. A 3 PM Pacific hearing displayed in a New York viewer’s calendar still says “3 PM Pacific / 6 PM Eastern”, and the semantic “this is when the courthouse is open” survives DST transitions and travel.
- Cross-court siblings are isolated. A case can span multiple dockets across different courts (district + circuit appeal, parallel filings under different statutes). The per-entry LLM context only shows it siblings in the same court — a “stay appellate proceedings” order on the circuit docket must not trigger cancellations on the district docket’s hearings.
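The timezone guardrail is easy to demonstrate with Python’s `zoneinfo`: the stored event keeps its court-local zone, and any viewer-side conversion is derived from the same instant rather than baked in:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# A 3 PM Pacific hearing, stored with the court's zone attached.
hearing = datetime(2024, 3, 15, 15, 0, tzinfo=ZoneInfo("America/Los_Angeles"))

# A New York subscriber's calendar derives 6 PM Eastern from the same
# instant; the court-local "3 PM Pacific" is never thrown away.
eastern = hearing.astimezone(ZoneInfo("America/New_York"))
```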
AGENTS.md and the runtime prompts
The full set of those guardrails — plus the reasoning behind each one,
the architectural conventions every module follows, and the testing
philosophy — lives in
AGENTS.md
at the repo root. That file is the project’s contract with any
agentic AI programmer working in the codebase: Claude Code, GitHub
Copilot, Cursor, Codex, Aider, or any other tool that has a
“follow this project’s conventions” surface. The reason rules live in
AGENTS.md (rather than in each agent’s private memory) is portability —
every collaborator, human or otherwise, picks them up the same way, and
the rules survive when one agent’s session ends or a different agent
joins the project. The same file is @-included from CLAUDE.md so
Claude Code reads it on every invocation; other agents read it the same
way under their own conventions.
The data-quality guardrails described above were the source material for the LLM prompts the project uses at runtime. Same rules, encoded in two places: once as English for the human and agent contributors who write the code, and once as English for the model that’s about to classify a real docket entry or read a real indictment. When a rule gets sharpened (e.g., the no-fabrication refusal, or the “trial-date-is-not-evidence-of-a-trial” invariant), it gets sharpened in both places.
The runtime prompts all live in
case_calendar/llm.py:
- `SIGNIFICANCE_RULES` — the major-vs-minor classification rubric, interpolated into the main extractor prompt.
- `SYSTEM_PROMPT` — per-entry hearing extraction (and, with the addendum below, deadlines).
- `DEADLINE_PROMPT_ADDENDUM` — appended to `SYSTEM_PROMPT` for cases that opt into filing-deadline tracking.
- `VERIFY_SYSTEM_PROMPT` — the end-of-sync hearing verify pass.
- `VERIFY_DEADLINE_SYSTEM_PROMPT` — the parallel verify pass for filing deadlines.
- `DEDUPE_HEARING_SYSTEM_PROMPT` — same-docket same-slot duplicate resolver.
- `SUMMARY_SYSTEM_PROMPT` — the higher-tier case-summary prompt.
Reading any of those alongside the corresponding entry in
AGENTS.md
is the fastest way to see how a particular guardrail moves from “rule
for the human / agent writing this code” to “rule the model follows
when processing a docket”.
See also
- Configuration — the surface area visible to operators.
- CLI reference — every subcommand.
- Real-time webhooks — the push-mode delivery path.
- Case summaries — the second LLM track.