This page is for contributors: how to get a working development environment, run the app against real dockets, run the tests, and iterate on the LLM-driven parts of the pipeline without spending a fortune doing it. If you only want to use Case Calendar, start at Installation instead.
Prerequisites
- Python 3.13 or newer.
- uv for dependency management. It creates a project-local virtual environment and pins exact versions, so everyone — and CI — runs the same thing.
- A CourtListener API token (free account at courtlistener.com).
- One LLM API key — Anthropic, Google (Gemini), or OpenAI. The recommended default is a split: Gemini for extraction, Anthropic for summaries; see Installation for why.
- Optional but recommended: poppler and tesseract for the local OCR fallback. Without them the pipeline still runs — it just skips PDFs whose text it can’t extract any other way and retries them on a later sync. See Installation → local OCR tools for the install commands.
Get the code
git clone https://github.com/seanthegeek/case-calendar
cd case-calendar
uv sync --extra test --extra lint
uv sync reads pyproject.toml and installs everything into
.venv/. The two extras pull in the test and lint toolchains
(pytest + coverage, and the version-pinned Ruff) so you can run the full
check suite. Prefix every command with uv run and uv handles activation
for you.
The test suite needs nothing else. If you want to run the model-comparison
benchmark against the committed, frozen input snapshot, fetch it once with
git lfs install && git lfs pull (it’s a Git LFS object — see
model-comparison/README.md).
Configure secrets
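Copy the example env file, then fill in the required secrets: your CourtListener token and at least one LLM API key (the lines below show the recommended Gemini + Anthropic split):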
cp .env.example .env
COURTLISTENER_TOKEN=... # from your CourtListener profile page
GEMINI_API_KEY=...
ANTHROPIC_API_KEY=...
The CLI loads .env automatically before any module reads an environment
variable. Nothing else is required to run a sync — Google Calendar and
Microsoft 365 push are opt-in and need their own one-time OAuth
(setup gcal / setup m365).
Pick what to track
You have two starting points:
- config.example.yaml → config.yaml: the full template, documented inline. Copy it and edit the cases: list to the dockets you want.
  cp config.example.yaml config.yaml
- config.dev.yaml: a checked-in dev config covering only the cases that have driven a documented regression in one of the LLM-driven layers (extractor, verify pass, dedupe sweeps, summary pipeline), each annotated with the failure mode it exercises. Use it with -c config.dev.yaml on any command; it’s the fast inner loop for prompt and model work (see Iterating on prompts and models cheaply).
config.yaml is gitignored (it’s your personal caseload). config.dev.yaml
is tracked, because the dockets in it are public CourtListener records and the
dev config is useful to everyone working on the project.
First run, from scratch
uv run case-calendar -c config.dev.yaml sync
The first sync pulls each configured docket from CourtListener, runs the
regex pre-filter, sends the surviving entries through the LLM extractor, and
writes the resulting hearings and deadlines into the SQLite store at
store_path. It then renders the ICS files (and the index page, if
index_path is set) for every affected calendar. You’ll see one progress
line per case, then one per calendar written:
[us-v-moucka] dockets_skipped=0 entries_seen=42 processed=11 actions=8
[cybercrime (dev)] wrote 14 events -> out/dev/cybercrime.ics
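To make the pre-filter step concrete, here is a minimal sketch of the idea: a cheap regex pass decides which docket entries are worth an LLM call at all. The pattern, the DocketEntry shape, and the prefilter name are assumptions made up for illustration; the project's real pre-filter may match on very different terms.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern, NOT the project's real pre-filter terms.
SCHEDULING_HINT = re.compile(
    r"\b(hearing|deadline|due by|set for|sentencing|arraignment|trial)\b",
    re.IGNORECASE,
)


@dataclass(frozen=True)
class DocketEntry:
    """Illustrative stand-in for one docket entry."""

    entry_number: int
    description: str


def prefilter(entries: list[DocketEntry]) -> list[DocketEntry]:
    """Keep only entries whose text hints at a date worth extracting."""
    return [e for e in entries if SCHEDULING_HINT.search(e.description)]


entries = [
    DocketEntry(1, "NOTICE of Attorney Appearance"),
    DocketEntry(2, "ORDER: Sentencing set for 3/14 at 10:00 AM"),
    DocketEntry(3, "Motion response due by 4/1"),
]
survivors = prefilter(entries)
survivor_numbers = [e.entry_number for e in survivors]
print(survivor_numbers)  # only these entries would be sent to the LLM
```

The point of the design is cost: the regex pass is free, so every entry it drops is an LLM call that never gets billed.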
Useful commands once the store is warm:
uv run case-calendar -c config.dev.yaml show # dump current hearings + deadlines
uv run case-calendar -c config.dev.yaml summarize # generate AI case summaries (opt-in)
uv run case-calendar -c config.dev.yaml emit # re-render ICS/index without a CourtListener pull
uv run case-calendar -c config.dev.yaml serve # run the webhook receiver instead of polling
The store is the source of truth; emit re-renders from it for free, so you
can iterate on the renderers without re-syncing.
Run the tests
The suite is hermetic — no real HTTP, no real LLM, no real Google /
Microsoft Graph, no real keyring. Every external dependency is stubbed or
monkey-patched, and an autouse fixture strips any real *_API_KEY from your
shell so a test can never hit a live provider by accident.
uv run pytest # full suite, ~600 tests in ~25s
uv run pytest tests/test_sync_integration.py # one file
uv run pytest -k verify # by keyword
uv run pytest --cov=case_calendar --cov-branch --cov-report=term-missing
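The key-stripping guard mentioned above can be pictured roughly as follows. This is a sketch under assumptions, not the project's actual conftest: the function names are hypothetical, and the real fixture may well use pytest's monkeypatch instead. The generator is written without the decorator so the snippet runs on the standard library alone.

```python
import os
from collections.abc import Iterator, MutableMapping


def scrub_api_keys(environ: MutableMapping[str, str]) -> dict[str, str]:
    """Remove every *_API_KEY entry in place; return what was removed."""
    removed = {k: v for k, v in environ.items() if k.endswith("_API_KEY")}
    for name in removed:
        del environ[name]
    return removed


def no_live_api_keys() -> Iterator[None]:
    """Fixture body (hypothetical name).

    In a conftest.py this would carry @pytest.fixture(autouse=True),
    so it wraps every test without being requested.
    """
    removed = scrub_api_keys(os.environ)
    try:
        yield  # the test runs here, with no provider keys visible
    finally:
        os.environ.update(removed)  # restore the developer's shell values
```

Because the fixture is autouse, a newly added test gets the protection without opting in, which is what makes the "can never hit a live provider" guarantee hold.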
Every behavior change ships with the test that proves it. CI runs the full suite with branch coverage on every push and pull request and fails the build under 90% project coverage — but 90% is a floor, not a target. The local rule is stronger: no commit should reduce coverage at the module level. Run the coverage command above before declaring a change done and confirm the modules you touched held or gained coverage.
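One way to check the module-level rule mechanically is to diff two coverage.py JSON reports, one from before your change and one from after. The sketch below assumes the standard JSON report layout (a top-level "files" mapping whose entries carry summary.percent_covered); the helper names and the sample numbers are illustrative only, and the project does not necessarily ship such a script.

```python
def module_coverage(report: dict) -> dict[str, float]:
    """Map each measured file to its percent covered."""
    return {
        path: data["summary"]["percent_covered"]
        for path, data in report["files"].items()
    }


def regressions(before: dict, after: dict) -> dict[str, tuple[float, float]]:
    """Return {file: (old_pct, new_pct)} for every module whose coverage dropped."""
    old, new = module_coverage(before), module_coverage(after)
    return {
        path: (old[path], new[path])
        for path in old.keys() & new.keys()
        if new[path] < old[path]
    }


# Made-up numbers in the shape of two parsed JSON reports.
before = {"files": {
    "case_calendar/store.py": {"summary": {"percent_covered": 97.5}},
    "case_calendar/pdf.py": {"summary": {"percent_covered": 91.0}},
}}
after = {"files": {
    "case_calendar/store.py": {"summary": {"percent_covered": 97.5}},
    "case_calendar/pdf.py": {"summary": {"percent_covered": 88.2}},
}}
dropped = regressions(before, after)
print(dropped)
```

With real data you would load each report with json.load after producing it with a JSON coverage report; an empty result means no touched module lost coverage, even if the project total stayed above the 90% floor either way.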
Test files mirror the modules they cover (tests/test_store.py ↔
case_calendar/store.py), so a coverage gap is one file away from its fix.
Cross-module flows live in tests/test_sync_integration.py and
tests/test_serve.py.
Lint, format, and type-check
These are exactly what CI runs, so run them before you push:
uv run ruff check . # lint
uv run ruff format --check . # formatting (drop --check to apply)
PYRIGHT_PYTHON_FORCE_VERSION=latest uv run pyright # static type check
Ruff is version-pinned in pyproject.toml so local and CI never disagree on
formatting. The pyright env var overrides the wrapper’s pinned release so you
type-check against the latest pyright, as CI does.
Iterating on prompts and models cheaply
The extractor and summary prompts live in
case_calendar/llm.py; the per-prompt rules are
reproduced in LLM prompts. The unit tests pin prompt
structure, but they can’t tell you whether a wording change actually
improves what the model extracts — for that you have to run the real model
against real dockets. Doing that against your whole caseload on every tweak is
expensive, so the project gives you three levers, cheapest first.
1. The dev config
The dev config is the cheapest lever: -c config.dev.yaml
exercises only the ~18 regression cases instead of a full caseload, so a prompt
change meant to fix one of those failure modes is checked against exactly the
cases that surfaced it.
2. The provider-comparison harness
model-comparison/build_provider_stores.py
builds a complete store + rendered output per LLM provider from the same
cached CourtListener data, so you can compare cost and output side by side
before changing a default. Point it at the dev config to keep it cheap:
# Plumbing check with synthetic tokens — no API calls, no spend:
uv run python model-comparison/build_provider_stores.py --config config.dev.yaml --fake
# Real build of one provider column against the dev config:
uv run python model-comparison/build_provider_stores.py --config config.dev.yaml --variants anthropic
It copies the store and never mutates the live file, replays the real
pipeline (extractor + verify/dedupe sweeps + summaries), and prints a
per-provider, per-track cost report. See
model-comparison/SCORECARD.md for the
analysis behind the current default-provider choice.
3. The persistent LLM-response cache
The harness keeps a content-addressed cache of every LLM response on disk
(data/llm-cache.sqlite), on by default. Every call runs at temperature=0 by
default, for hosted providers and for local (Ollama) models alike, which forward the
same greedy pin. A response is therefore treated as a pure function of its request, so
the cache keys on the full request (provider, model, prompts, max_tokens, temperature)
and replays any identical call for free. The opt-in Ollama sampling knobs are the one
wrinkle: if you set OLLAMA_TEMPERATURE / OLLAMA_SEED (see
Sampling and determinism), the cache key
folds them in for the ollama provider, so a sampled run re-keys rather than
replaying a greedy entry. Set OLLAMA_SEED for the most reproducible sampled
benchmark, but note that on GPU a seed is necessary, not sufficient: it
reliably reproduces the extraction track on a non-reasoning model, while summaries and
a reasoning model’s output can still diverge run-to-run. (The cache still replays its
own stored response; that response just may not equal a fresh call.)
The payoff is automatic per-track scoping: after a first build warms the
cache, a second build following a single-track prompt tweak re-bills only
that track — a summary-prompt edit replays every extraction and verify call
from cache and pays only for the summaries. The end-of-run log prints
per-column hit/miss counts so you can see it working. Pass --no-llm-cache
for a guaranteed-fresh build, or delete the sidecar file to invalidate every
entry.
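The request-keyed cache and its per-track scoping can be sketched in a few lines. This is an illustration under assumptions, not the harness's code: the cache_key and ResponseCache names, the table layout, and the placeholder model names are all hypothetical. Only the idea (hash the full request, replay on an identical key) comes from the description above.

```python
import hashlib
import json
import sqlite3
from collections.abc import Callable


def cache_key(provider: str, model: str, system: str, user: str,
              max_tokens: int, temperature: float) -> str:
    """Hash the full request; changing any field yields a different key."""
    payload = json.dumps(
        {"provider": provider, "model": model, "system": system,
         "user": user, "max_tokens": max_tokens, "temperature": temperature},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResponseCache:
    """Hypothetical content-addressed cache backed by SQLite."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS responses "
            "(key TEXT PRIMARY KEY, body TEXT NOT NULL)"
        )
        self.hits = 0
        self.misses = 0

    def get_or_call(self, key: str, call: Callable[[], str]) -> str:
        row = self.db.execute(
            "SELECT body FROM responses WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            self.hits += 1  # replayed for free
            return row[0]
        self.misses += 1  # a billed call
        body = call()
        self.db.execute(
            "INSERT INTO responses (key, body) VALUES (?, ?)", (key, body)
        )
        self.db.commit()
        return body


billed: list[str] = []


def fake_llm(tag: str) -> Callable[[], str]:
    """Stand-in for a provider call; records each time it is 'billed'."""
    def call() -> str:
        billed.append(tag)
        return f"response for {tag}"
    return call


cache = ResponseCache()
extract = cache_key("gemini", "model-a", "extract v1", "docket text", 1024, 0.0)
summary_v1 = cache_key("anthropic", "model-b", "summarize v1", "docket text", 1024, 0.0)
cache.get_or_call(extract, fake_llm("extract"))        # first build: miss
cache.get_or_call(summary_v1, fake_llm("summary v1"))  # first build: miss

# Second build after editing only the summary prompt.
summary_v2 = cache_key("anthropic", "model-b", "summarize v2", "docket text", 1024, 0.0)
cache.get_or_call(extract, fake_llm("extract again"))  # unchanged request: hit
cache.get_or_call(summary_v2, fake_llm("summary v2"))  # new key: miss
print(billed, cache.hits, cache.misses)
```

The extraction request never changed, so its key is identical on the second build and the stored response is replayed; only the edited summary prompt produces a new key and a new billed call.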
Reserve a full-caseload build (-c config.yaml, no dev config) for the final
check before you commit a prompt or model change.
Where things live
case_calendar/        the package
  cli.py              subcommands; the shared emit pipeline
  courtlistener.py    REST v4 client (retry/backoff, pagination)
  sync.py             per-case orchestration (extract → verify → dedupe)
  llm.py              domain prompts + extraction / summary entry points
  llmkit/             provider-agnostic LLM call layer + token telemetry
  summary.py          per-docket case-summary pipeline
  store.py            SQLite state
  pdf.py              PDF text extraction with OCR fallback
  calendars/          ICS, Google Calendar, Microsoft 365, index.html renderers
  serve.py            webhook receiver
tests/                mirror the modules they cover; hermetic
docs/                 these pages
model-comparison/     the provider-comparison harness + scoring
scripts/              one-shot maintenance + deployment scripts
Architecture walks the pipeline end to end.
Deployment scripts (scripts/)
Alongside the one-shot maintenance scripts, scripts/ holds the
deployment-management wrappers used to run the public deployment — they run
syncs on the production host, move the SQLite store between dev and prod, and
apply upgrades. They’re committed (not host-specific) because none of them
hardcode a server address, login user, or install path.
Each one reads its connection settings from .env through
scripts/_prod-env.sh. Those values are deployment-identifying, so they live
in the gitignored .env and are deliberately not mirrored into
.env.example (which documents only what a fresh checkout needs). Set these in
your own .env if you adopt the scripts:
| Variable | What it is |
|---|---|
| CC_PROD_HOST | prod server hostname or IP for ssh / scp |
| CC_PROD_SSH_USER | ssh login user on the prod host (e.g. root) |
| CC_PROD_APP_DIR | absolute path to the install on prod (e.g. /opt/case-calendar) |
| CC_PROD_SERVICE | systemd unit name + unix service account (assumed identical) |
| CC_PROD_STAGE_DIR | a directory on prod writable by the ssh user, used to stage scp drops |
scripts/_prod-env.sh loads and validates these — failing loudly if any is
missing — and is sourced by every deployment script. Its header comment is the
authoritative reference for each variable; each script’s header documents the
subset it uses.
The scripts, all run from the repo root:
- sync-prod: run case-calendar sync on prod over ssh, streaming output back to your terminal. Forwards extra args (--case …, --force-summaries). Doesn’t stop the service; sync and serve coexist safely under WAL journaling.
- sync-via-prod: push your local config.yaml to prod, sync there (so prod’s CourtListener token does the work, not dev’s), restart the service to pick up the new config, then pull the resulting store back to dev. The dev → prod config / prod → dev DB direction is intentional. --prune also runs prune --apply for dockets you removed from config; all other args forward to sync.
- pull-prod-db: copy prod’s case-calendar.sqlite down to dev (DB only, no prod-side sync). Uses SQLite’s online-backup API so prod’s serve process keeps handling webhooks during the copy, and backs up the local store first.
- push-db-to-prod: the inverse. Upload the local store + .env + config.yaml to prod, pull code, re-emit, and restart. Prompts for confirmation first, because it overwrites prod state.
- upgrade-prod: apt package upgrades, push the local .env, pull code, uv sync, re-emit, restart, and flag if a reboot is required. Leaves the prod DB and config.yaml untouched.
Conventions for changes
The project’s rules for human and AI contributors alike live in
agent-docs/CONVENTIONS.md
(linked from AGENTS.md) — read it before your first pull
request. The ones that catch newcomers most often:
- Every behavior change ships with its test. Adding a branch adds a test; fixing a bug adds the test that fails on the old code; changing behavior updates the tests that asserted the old behavior.
- Spell “CourtListener” in full everywhere: code, comments, commits, docs. The only allowed abbreviation is the lowercase cl parameter name for a client object.
- Format and lint with the version-pinned Ruff, and type-check with pyright. CI runs both and fails on any deviation; see Lint, format, and type-check for the exact commands and why the version pins matter.
- Modern type annotations throughout, with TypedDict for structured results. The project supports every currently-supported Python version.
- Module-level loggers (logger = logging.getLogger(__name__), one per module), and project-defined errors subclass RuntimeError (not bare Exception) so callers can catch project failures without sweeping in unrelated bugs.
Next steps
- Architecture: how the pipeline fits together.
- Configuration: every config.yaml option.
- CLI reference: every subcommand and flag.
- Cost: what the LLM and CourtListener APIs actually cost.