Skip to the content.

Local models with Ollama

Case Calendar can run its LLM work — the per-entry hearing/deadline extractor, the case-summary writer, or both — on local open models through Ollama instead of a hosted API. Ollama serves an OpenAI-compatible endpoint, so it plugs in as a fourth provider alongside Anthropic, OpenAI, and Gemini.

Why you might want it:

The trade-offs are real and worth reading before you rely on it.

Model names and hardware move fast. The families, sizes, and VRAM figures below were checked in June 2026; treat specific version numbers as examples and confirm current tags in the Ollama library.

← Back to docs

The two tracks pull in different directions

Everything on this page comes back to one distinction, because the extractor and the summarizer ask very different things of a model:

The project’s benchmark, the scorecard returned opposite verdicts for the two tracks, and the recommendations below follow it:

gpt-oss:20b is ~13 GB resident (MXFP4 4-bit, 20B parameters, OpenAI open-weights), so it fits a 16 GB card with room for a working context window. Its reasoning is level-based (it can’t run away the way a boolean-thinking model can), which makes it stable under the bounded thinking budget. The recommended production setup when using local models is therefore the hybrid: run extraction on gpt-oss:20b and send only summaries to a hosted provider — or don’t enable summaries at all. An all-local setup still works (it’s the default when you select LLM_PROVIDER=ollama) and is fine for testing or a private calendar; just know its summaries won’t be publication quality. The two tracks are configured independently — see Configure.

Model provenance and honesty

This matters specifically because Case Calendar is often used to track cybercrime and espionage prosecutions — including cases that name foreign states as the adversary. A model’s training can bias how faithfully it describes a case that is politically sensitive to the country that built it.

Independent evaluations have documented that China-developed open models are aligned to suppress or slant content sensitive to the Chinese government — for example, Promptfoo’s analysis found DeepSeek-R1 refused a large share of China-controversy prompts and returned a strong pro-government slant when it did answer, and similar audits report Qwen models producing falsehoods on topics like Tiananmen while clearly possessing the underlying knowledge. The open weights are less filtered than the vendors’ hosted chatbots, but the alignment is still present. It is most likely to surface exactly where this project cares most: a summary of an indictment alleging Chinese state-linked hacking or economic espionage.

The risk is not uniform across the two tracks:

Note what the project’s own guards do and don’t cover: the documents-only rule, the grounding guard, and the rest catch fabrication and ground dates / figures — but they do not catch omission, softening, or slanting of what a document says. A summary that quietly under-states the charges is still fully grounded in the document text, so it sails straight past them. The guards don’t neutralize this concern.

Recommendation: don’t put a China-developed model on the summary track for an adversary-nation caseload. For production that question is already settled the same way for every local model — keep summaries on a hosted Western model (no local summary cleared the benchmark’s publication bar) — and if you run local summaries anyway for testing or a private calendar, use a Western open model (gpt-oss, Gemma). Use whatever you like for extraction. And verify it yourself rather than taking anyone’s word — that’s what the benchmark harness is for: run a real China-nexus indictment through two models and read the summaries side by side.

Reading a model tag

Ollama tags look like gemma4:31b or, spelled out, gemma4:31b-it-q4_K_M. The parts:

Hardware requirements

The two numbers that decide what runs: model weights and context (KV cache). At the common q4_K_M quantization, weights are roughly 0.55 GB per billion parameters (about a 70–75% reduction from full precision). Context is extra on top — a 32K-token window can add several GB, and the summary track is what makes that window necessary. Ollama will spill what doesn’t fit to CPU/RAM and keep working, just slower, so “fits in VRAM” is about speed, not a hard yes/no.

VRAM by model size (Q4_K_M)

Model size Weights Comfortable with 32K context
7–8B ≈5–6 GB 8–12 GB
12–14B ≈10–12 GB 16 GB
24–27B ≈16–20 GB 24 GB (or 16 GB at reduced context)
31–32B ≈18–22 GB 32 GB — won’t fit a useful window on 24 GB
70B ≈40 GB 32 GB with minimal CPU offload, or 2× 24 GB

The “32K context” column is a reference point. The window this project recommends is 128K (see Context window), whose KV cache is roughly 4× the 32K figure — so a model that only just fits 32K on a card will overrun it at 128K. This is exactly why gemma4:31b (20 GB of weights) is not a 24 GB model for summaries: there’s no room left for the cache at the window summaries need.

Size a model by its actual download, not its parameter count — the multimodal Gemma 4 variants are heavier than the per-billion rule suggests. The default gemma4:e4b is 9.6 GB (it carries vision/audio weights on top of its text params), gemma4:e2b is 7.2 GB, gemma4:26b is 18 GB, and gemma4:31b is 20 GB.

By GPU tier

The VRAM tier is what matters; the cards are examples across vendors.

VRAM Nvidia AMD (ROCm) Intel Arc What runs
8 GB RTX 4060 / 5060 RX 7600 Arc A750 too small for the recommended gpt-oss:20b (~13 GB) — run extraction hosted; gemma4:e2b (7.2 GB) fits at a small window but is an unmeasured last resort
12 GB RTX 3060 12GB / 5070 RX 7700 XT Arc B580 tight for gpt-oss:20b — prefer hosted extraction; gemma4:e4b (9.6 GB) fits but measured at nearly twice gpt-oss’s deviation
16 GB RTX 4060 Ti 16GB / 5070 Ti / 5080 RX 7800 XT / 9070 XT Arc A770 runs gpt-oss:20b with room for a working window — the recommended local setup
24 GB RTX 3090 / 4090 RX 7900 XTX runs gpt-oss:20b with room to spare — and that’s still the right model: the dense 24–30B models crawl here (3–6× slower, no better scores), and gemma4:31b’s ~20 GB of weights crowd out the KV cache at a summary-sized window. See Context window
32 GB RTX 5090 fits gemma4:31b (or a 70B with minimal offload) at a real window — but no model this size has been benchmarked here, so there is no measured recommendation to upgrade

A single consumer card tops out around 32B, and only on the biggest one — a 32 GB RTX 5090 (the current NVIDIA line jumps from the 16 GB RTX 5080 straight to the 5090; AMD has no 32 GB consumer card at all — the line tops out at the 24 GB RX 7900 XTX, the card the local benchmark ran on). More VRAM is currently not a recommended spend for this project, on the measured evidence: extraction is already best-served by gpt-oss:20b (which fits 16 GB), and for summaries the verdict that no local model is production-ready was reached at the quality level, not the VRAM level — the hybrid (hosted summaries) path is the quality upgrade. A 32 GB+ card would reopen the larger-models question, but reopening it means benchmarking, not assuming.

Software support varies by vendor — VRAM is necessary but not sufficient. Nvidia (CUDA) is the most plug-and-play. AMD (ROCm) works well on recent Radeon cards — the 24 GB RX 7900 XTX is a strong-value pick for running the default gpt-oss:20b locally — but doesn’t cover every card. It’s smoothest on bare Linux; on Windows, native Ollama already drives the Radeon GPU directly, while running ROCm inside WSL2 needs a WSL-specific runtime (see below). Intel Arc runs local models (through llama.cpp’s Vulkan backend or Intel’s own tooling) but native Ollama support is still experimental in 2026 and more hands-on — confirm it works before buying a card for this. Higher-VRAM workstation cards exist beyond the consumer lineup (AMD Radeon Pro W7900 48 GB, Intel Arc Pro), where the limiter is software maturity, not memory. Always check your specific card against Ollama’s hardware support page.

Windows (WSL2)

On Windows, Case Calendar itself runs inside WSL2 (it’s a Linux app, and that’s how this project is developed). Where you run Ollama depends on your GPU vendor.

Nvidia is nearly transparent in WSL2: the Windows driver exposes CUDA to WSL2 automatically (do not install a Linux GPU driver inside the distro), and inference runs within a few percent of bare-metal Linux — so run Ollama in WSL2 too.

AMD’s simplest path is the opposite: run Ollama natively on Windows, where its ROCm/Vulkan build already drives recent Radeon GPUs, and point Case Calendar at it from WSL2 with OLLAMA_BASE_URL=http://WINDOWS_HOST_IP:11434 (the WSL default-gateway IP). Running Ollama inside WSL2 on AMD is officially supported too, but takes an extra step the Nvidia path doesn’t: AMD’s ROCDXG (librocdxg) runtime, which reaches the Windows GPU driver through Microsoft’s DXCore interface (/dev/dxg) and supersedes the older roc4wsl method. It needs a current WSL-capable Adrenalin driver (26.2.2 at the time of writing) and Ubuntu 22.04/24.04, and is installed from the librocdxg Quickstart — not the standard Linux ROCm packages. See AMD’s WSL how-to for Radeon and Ryzen.

Measured on an RX 7900 XTX (24 GB), June 2026 — on AMD, in-WSL2 ROCm works and is genuinely GPU-accelerated, but native Windows is faster. Running Ollama inside WSL2 needs the ROCm backend the stock Linux install omits: drop the matching ollama-linux-amd64-rocm bundle into /usr/local/lib/ollama/ and set HSA_ENABLE_DXG_DETECTION=1 on the service so the runtime enumerates the GPU through /dev/dxg. Done right the model loads fully on the GPU (ollama ps shows 100% GPU) and inference runs at real GPU speed — not a CPU fallback. But a controlled llama3.2:3b batch (identical prompts, each config run alone on the card) puts native Windows ahead on every axis:

  in-WSL2 ROCm-dxg native Windows
Extraction latency (9.3K-token prompt → JSON) 3.8 s/call 3.4 s/call (~12% faster)
Generation throughput 122 tok/s 150 tok/s (~23% faster)

The penalty is metric-dependent. Prompt processing (one big batched pass) is barely affected, so extraction — prompt-heavy with tiny outputs — is only ~12% slower in WSL2. Generation suffers more (~20–23% here; an earlier gemma4:e4b read showed closer to ~40%), consistent with the WSL2 GPU path adding per-call overhead: each token is a separate GPU kernel launch routed through the /dev/dxg paravirtualization layer, so a generation-heavy workload pays that cost once per token. (Caveat on measurement: run each config alone and confirm no leftover process is mid-inference — a shared GPU or a stray client produces contention artifacts that look like a property of the path but aren’t. In this benchmark an apparent sustained high-CPU pin turned out to be exactly that — a stray client left running by the harness, not the WSL2 path — so the in-WSL2 host-CPU overhead is not characterized here, and a one-off 6 tok/s reading was likewise contention, not the real ~122 tok/s rate.)

A second native-only advantage is memory headroom: native Windows could load gemma4:31b’s weights entirely in VRAM, which the WSL2 path could not — the DXG/DXCore layer’s memory overhead leaves too little room for the compute buffer, so a realistic (10K+ token) prompt to the 31b failed with an out-of-memory HTTP 500 or a load timeout though the weights themselves loaded. Either way, 24 GB has no room for a summary-sized context window on top of 31b’s 20 GB of weights (see Hardware) — WSL2 just hits the wall sooner. This is why the recommended local model (gpt-oss:20b) stays well under that ceiling; the dense 31b would need a 32 GB+ card, and nothing that size has been benchmarked.

Net: on AMD, native Windows is the faster host, especially for the generation-heavy summary track. But in-WSL2 ROCm is a perfectly usable fallback for the extraction track (the local half of the recommended hybrid), where the penalty is only ~12%. Point Case Calendar at whichever you run via OLLAMA_BASE_URL (the line above for native Windows; http://127.0.0.1:11434 for an in-WSL2 server). The cleanest AMD numbers of all — native ROCm on Linux, with no WSL layer at all — are a separate measurement this Windows-host section doesn’t cover.

Intel Arc is the roughest path here: possible via Intel’s tooling but layered on already-experimental Ollama support. (Multi-GPU tensor parallelism via NCCL doesn’t work in WSL2, but that only affects multi-card rigs — single-GPU inference is unaffected.)

Apple Silicon (Macs)

Apple Silicon shares one pool of unified memory between CPU and GPU, so usable model memory is roughly 65–75% of total RAM (an M-series Mac with 32 GB can run a 24–32B model; 64 GB+ opens up larger ones). Recent Ollama ships MLX-optimized builds for these chips — the -mlx tags in the library (e.g. gemma4:e2b-mlx, gemma4:31b-mlx), which are text-only and a good match for this project. Plain (GGUF) tags also run on Metal; MLX builds are tuned for Apple’s framework and tend to be faster on the same hardware.

Configure

Ollama is opt-in only. The hosted providers auto-detect from their API keys; Ollama has no key, so you select it explicitly in .env. No config.yaml change is needed.

Hybrid — local extraction, hosted summaries. This is the recommended production setup: local extraction is benchmark-competitive, local summaries are not (no local model cleared the publication bar), and it’s also the answer to the honesty concern above:

LLM_EXTRACTION_PROVIDER=ollama   # local extraction — defaults to gpt-oss:20b
# summary track left to its hosted default (e.g. Anthropic Sonnet)

Everything local — both tracks. Fine for testing or a private calendar; the summary prose won’t be publication quality. The default model is gpt-oss:20b for extraction and summaries, so the minimal config is one line:

LLM_PROVIDER=ollama
# Defaults to gpt-oss:20b for both tracks (~13 GB, fits a 16 GB card) — the
# only local model the benchmark recommends. Cards too small for it are
# better served by hosted extraction (see Recommended models).

On startup, sync / serve / summarize log which model each track resolved to (extraction LLM: provider=ollama model=… / summary LLM: …), so you can confirm the wiring at a glance.

Install and pull

# Install Ollama: https://ollama.com/download
ollama pull gpt-oss:20b     # the default, benchmark-recommended local model
ollama serve                # usually already running as a service

Context window

Local models default to a small context window — Ollama allocates 4K on cards under 24 GB of VRAM and 32K on 24–48 GB — and silently truncates anything longer instead of erroring. Extraction sends one short entry at a time, so it’s rarely affected. Summaries are — they feed tens of thousands of tokens into one call, so with too small a window the model would otherwise summarize only the first few pages.

Case Calendar guards against this rather than publishing half-baked output. It learns the window (OLLAMA_NUM_CTX if set, otherwise the model’s maximum via Ollama’s /api/show) and, if a prompt won’t fit — checked before the call, and again afterward against the tokens the server reports it actually read — it refuses instead of emitting a truncated result: extraction skips the entry (logged, retried next sync), the verify/dedupe passes return a no-op, and a summary stores a short “this docket’s documents are too large to summarize within the model’s configured context window” message on the index. Each refusal logs a WARNING naming the remedy. So the fix below is about avoiding the refusal, not preventing silent corruption.

Set the window to 128K (131072) — the full window the default model (gpt-oss:20b) supports. Ollama’s own defaults (above) are too small for summaries — it recommends at least 64K for large-context work — so you have to raise it. 128K comfortably covers Case Calendar’s summary prompts — the per-document char budgets bound how large one gets — and the context-overflow guard cleanly refuses anything that still wouldn’t fit, so there’s no downside to using the model’s full window.

There are three places to set it; any one works:

  1. In Ollama itself (simplest, applies to every model) — set the OLLAMA_CONTEXT_LENGTH environment variable on the server, or move the context-length slider in the Ollama desktop app’s settings. See the Ollama context length docs:

    OLLAMA_CONTEXT_LENGTH=131072 ollama serve
    
  2. OLLAMA_NUM_CTX in Case Calendar’s .env (quick) — Case Calendar forwards it per request as the native options.num_ctx:

    OLLAMA_NUM_CTX=131072
    

    On Ollama this is reliable — Case Calendar talks to Ollama’s native /api/chat endpoint. On a non-Ollama OpenAI-compatible server (see Other servers) it rides through the OpenAI extra_body instead, and whether that server honors it varies, so verify it took effect or use option 1 or 3.

  3. A Modelfile PARAMETER num_ctx (most reliable) — bake the window into a derived model and use that name:

    FROM gpt-oss:20b
    PARAMETER num_ctx 131072
    
    ollama create casecal-gptoss -f ./Modelfile
    # then: LLM_MODEL=casecal-gptoss
    

Confirm it took effect with ollama ps, which shows each loaded model’s allocated context and whether it’s running on GPU or CPU. A bigger window costs more VRAM (the KV cache in Hardware): 128K fits a 24 GB card running gpt-oss:20b comfortably; on a smaller card, drop it (64K is the floor Ollama recommends for large-context work). Don’t exceed what your hardware can hold — see the warning below.

Don’t set the window bigger than your computer can run. Picking a 256K window (in OLLAMA_NUM_CTX, a Modelfile, or the Ollama desktop app) doesn’t mean your GPU has the memory for it. If it doesn’t, one of two things happens: Ollama keeps going by using your regular system memory — the answers are still correct, but it gets much slower — or, if there isn’t enough memory at all, the request fails with an out-of-memory error.

That out-of-memory error is a different problem from a prompt being too big, and the fix is the opposite: make the window smaller (or free up memory), not bigger. Case Calendar tells the two apart. When it runs out of memory it writes a clear warning saying so — it does not call it “documents too large.” It also won’t publish anything broken: it skips the entry, or leaves the summary blank. But the same error will happen on every run until you shrink the window or add memory. So pick a window size your card can actually run.

Thinking models

Some open models (Qwen3, DeepSeek, Gemma, GLM) are thinking models: they emit a hidden chain of reasoning before the answer. On real Ollama (the native /api/chat endpoint, the only one that exposes the think control), Case Calendar lets a thinking model think on every track — extraction, verify, dedupe, and summaries alike — with a bounded output budget (max_tokens plus a reasoning headroom, default 8192, set by OLLAMA_THINK_BUDGET). No configuration; it keys off the model’s reported thinking capability. (The budget was originally unbounded, but that let runaway-prone models hang on large inputs — the runaways described below are why it’s now bounded.)

This reverses an earlier design that turned thinking off for the high-volume tracks to save time. That was a mistake: suppressing a weak model’s reasoning made it re-emit the known hearings/deadlines it was shown back as spurious actions — a single dense entry could dump twenty-plus phantom deadlines. Letting the model reason fixes that and is measurably more accurate for extraction — in the benchmark, letting gemma4:e4b think cut its deviation by 36% (see the scorecard). Note this inverts for summaries: in the blind grading, thinking made every local boolean-thinker’s summaries worse (gemma’s markup broke, qwen’s reasoning ran away, glm hung) — one more reason local summaries aren’t the production path, and a reason to set OLLAMA_FORCE_NO_THINK if you generate them anyway.

The budget is a runaway guard, not a cost lever — local inference has no per-token cost, so the bound exists only to make a model that won’t stop truncate cleanly (the entry is skipped) in seconds, rather than generating until the per-call timeout. It’s sized generously (default 8192 tokens of headroom on top of the answer allowance): a disciplined thinker finishes well under it and is never touched — measured on this project’s benchmark, gemma and gpt-oss top out around 1,500–1,900 generated tokens per extraction call, far below the cap. Raise OLLAMA_THINK_BUDGET only if a model’s reasoning is genuinely cut off mid-thought. The real cost is wall-clock — a thinking model is slower per call (gemma ~20s, qwen3.5:9b ~41s, vs ~7s not thinking), and a heavy thinker is simply slow on big backfills.

Watch for runaways on large inputs. A thinking model can occasionally fail to stop — it keeps generating until the request hits Case Calendar’s per-call timeout, so that one entry or summary is left incomplete and the call spent the full timeout doing it. It is worst on the largest thinking models (≈30B) writing a summary of a busy docket at a 128K window on a 24 GB card, where the model is also near the VRAM ceiling — there it can time out on most calls, which makes it impractical for that track on that hardware (in this project’s benchmark a 30B thinker timed out on every summary this way; a 29.9B one on half). Smaller thinkers (≤ ~9–20B) and the level-based gpt-oss complete cleanly. Per-entry extraction is usually safe because each input is small, but a single unusually dense entry can trigger the same stall. Runaways are also far more likely under greedy decoding, which Case Calendar no longer sends to local models — see Sampling and determinism.

Escape hatch — OLLAMA_FORCE_NO_THINK. Set it to any non-empty value to force reasoning off for the run: Case Calendar sends an explicit think=false and a bounded output budget, so the model answers without a reasoning trace — much faster, and no runaway. On the extraction track the trade is accuracy: this is the very suppression that made weak models re-emit known context (above), so reach for it there only when a thinking model is too slow or unstable on your hardware, and check the result against your own dockets. For summaries the benchmark found the opposite — thinking made every local boolean-thinker’s summaries worse — so if a run is summary-only, forcing thinking off is the better setting. It is a no-op for gpt-oss, whose reasoning is level-based and can’t be disabled by a boolean (see the note just below).

gpt-oss is the exception: its reasoning can’t be turned off — Ollama ignores a boolean think and tunes the trace by level (low / medium / high) instead. Its deepest trace is too slow for high-volume extraction, so Case Calendar sends low on the extract/verify/dedupe tracks and high for summaries (both bounded by the same output budget as any other thinker). Override the level with OLLAMA_THINK_LEVEL (low / medium / high) to set gpt-oss’s reasoning depth on every track at once — to trade speed against accuracy, or to compare the levels on your own dockets.

On a non-Ollama OpenAI-compatible server there is no think control at all, so a heavy thinker can still overrun on extraction — prefer a non-thinking model there, or run thinking models on real Ollama.

Sampling and determinism

Case Calendar runs every provider — hosted and local — at temperature=0 (greedy decoding) by default, for reproducible output. Greedy is off-spec for local thinking models: the Qwen3 model card says outright, “do not use greedy decoding, as it can lead to performance degradation and endless repetitions,” and the DeepSeek-R1 card recommends a temperature of 0.5–0.7 “to prevent endless repetitions” — and greedy is the main cause of the runaways above. Greedy is nonetheless the default because a re-benchmark of the recommended local models under in-spec sampling found it no better (gpt-oss:20b unchanged within its run-to-run noise, gemma4:e4b marginally worse — see the scorecard), so switching would buy no accuracy while invalidating every greedy-measured benchmark row. In-spec sampling is therefore an opt-in escape hatch — most useful if you insist on running one of the runaway-prone models (Qwen3, DeepSeek-R1) and want to escape its runaways:

What in-spec sampling does and doesn’t fix: it removes a real cause of the runaways, but it does not make the unstable models reliable. Even fully in-spec, Qwen3 still runs away on this task intermittently (a fixed seed makes the outcome reproducible, but it can lock in a runaway just as easily as a clean answer), and DeepSeek-R1 still emits enormous JSON for tiny answers. gpt-oss:20b, whose reasoning is level-based and inherently bounded, remains the recommended local extractor — and it neither runs away under greedy nor improves under sampling, which is why greedy stays the default.

Other OpenAI-compatible servers

The provider is named ollama, but it also drives any local OpenAI-compatible server. Case Calendar auto-detects which it’s talking to: a real Ollama server is driven through its native /api/chat endpoint (the only one that supports thinking control), while LM Studio, llama.cpp’s server, vLLM, and similar — which speak the OpenAI API but have no /api/chat — fall back to /v1/chat/completions and work with zero code change, just without thinking control. For example, to drive LM Studio (handy on Apple Silicon, with first-class MLX support and its own model browser):

LLM_PROVIDER=ollama
OLLAMA_BASE_URL=http://localhost:1234/v1   # LM Studio's default port
LLM_MODEL=gemma-4-e4b
Variable Default Purpose
OLLAMA_BASE_URL http://localhost:11434 Where the server listens — the host root (Ollama’s native endpoint). Point it at a remote box, or a non-Ollama OpenAI-compatible server like LM Studio (http://host:1234/v1).
OLLAMA_NUM_CTX (unset) Context window in tokens (see above).
OLLAMA_TEMPERATURE (unset) Pin the sampling temperature. Unset = the default: greedy (temperature=0), same as the hosted providers. Set it (e.g. 0.6) to take a runaway-prone model out of greedy — an opt-in escape hatch, not the default. See Sampling and determinism.
OLLAMA_SEED (unset) Pin the RNG seed to make a sampled run reproducible (a low temperature alone is not deterministic). Necessary but not sufficient on GPU — reliable for schema-constrained extraction on a non-reasoning model, not guaranteed for summaries or a reasoning model. Unset = fresh random seed each run. See Sampling and determinism.
OLLAMA_FORCE_NO_THINK (unset) Force reasoning OFF on a real-Ollama thinking model (sends think=false, bounded budget) — the escape hatch for a runaway or too-slow thinker, at an accuracy cost. No-op on gpt-oss and on non-Ollama servers. See Thinking models.
OLLAMA_THINK_BUDGET 8192 Reasoning headroom (tokens) added on top of the answer allowance for a thinking model on real Ollama. It’s a runaway guard, not a cost lever (local inference is free); raise it only if a model’s reasoning is being cut off mid-thought.

Schema-enforced JSON extraction is always on — there is no toggle. On real Ollama it uses the native format field (a hard GBNF grammar), automatically once real Ollama is detected (via /api/show), whether or not OLLAMA_BASE_URL carries a /v1 suffix; it measurably improved gpt-oss:20b extraction accuracy in the benchmark. On a non-Ollama OpenAI-compatible server it rides the json_schema response format, which vLLM and LM Studio’s llama.cpp engine both accept and enforce at current versions (verified from source, June 2026).

Cost reporting

Local inference has no per-token API charge, so the cost estimator reports cost_est=$0.0000 for every local call and the run TOTAL stays at $0.00. (This measures API billing only — your hardware and electricity aren’t free, but they aren’t per-token either, which is the thing the estimator models.) The real token counts are still logged on every llm-tokens line, so you can see how much work each call did.

Benchmarking a local model

The model-comparison harness builds a full store per model from the same cached CourtListener data, so you can compare a local model’s extracted hearings/deadlines and summaries against the hosted providers (and against each other) without re-spending on the hosted side — it caches their responses.

Ollama isn’t in the default comparison set — it needs a running server and a pulled model, which a clean run can’t assume — so add it as an extra column:

# Compare local Gemma 4 extraction against the committed default set:
uv run python model-comparison/build_provider_stores.py \
  --extra-variant ollama:gemma4:e4b \
  --validate

# Trust check: read one China-nexus indictment summarized two ways.
uv run python model-comparison/build_provider_stores.py \
  --extra-variant ollama:gemma4:31b \
  --extra-variant ollama:qwen3:32b \
  --case <a-china-nexus-case-id>

The spec is ollama:<extraction-model>[,<summary-model>] (the optional distinct summary model is comma-separated, since an Ollama model tag like gemma4:e4b already contains a colon). The harness pins each column’s models and replays the real pipeline, so the resulting store, ICS, and index page reflect exactly what that local model produces — including, in the second example, whether a China-developed model omits or softens anything a Western model states plainly.

Caveats

These aren’t blockers, but they’re where local differs from hosted — and exactly what you’d want to measure when tuning:

See also