Why We Switched Winston from Local AI to Claude Sonnet — And What We Learned
Winston — our AI-powered research tool for the Focal Point Publications archive — just went through a major upgrade. We tried to make it work with a fully self-hosted local model. We failed. Here's what happened.
The Setup
Winston searches 111,000+ document chunks from David Irving's personal archive: published books, personal diaries, WWII signals intelligence files, trial transcripts, letters, and articles. When a user asks a question, the system retrieves relevant documents via vector search, reranks them for relevance, and feeds them to an AI model that synthesises an answer with inline source citations.
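In code terms, a query flows through the system roughly as below. This is a minimal outline, not Winston's actual code; the four stage functions are hypothetical stand-ins for the components described in the rest of this post.

```python
# Outline of a query's path through Winston. The four stage functions
# (embed, search, rerank, synthesise) are hypothetical stand-ins, passed
# in as callables so the skeleton stays independent of any one backend.

def ask(question: str, embed, search, rerank, synthesise) -> str:
    query_vec = embed(question)                  # embed the user's question
    candidates = search(question, query_vec)     # vector + keyword retrieval
    sources = rerank(question, candidates)[:10]  # keep the most relevant chunks
    return synthesise(question, sources)         # cited answer from the LLM
```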
The server is a Hetzner dedicated box with an NVIDIA RTX 4000 SFF Ada — a solid 20 GB VRAM card. Our local model was Qwen 3.5 27B, running via Ollama.
What Went Wrong
Problem 1: The Model Didn't Fit
The Qwen 3.5 27B model needs roughly 22 GB of VRAM to run. Our GPU has 20 GB. Ollama handled the shortfall by partially offloading layers to CPU RAM, which meant:
- Response times of 60–120 seconds per query
- GPU utilisation hovering at ~50% while the CPU bottlenecked generation
- Users staring at a loading spinner for two minutes
We profiled it properly. Embedding took 0.3 seconds. Vector search took 0.1 seconds. The LLM generation was the entire bottleneck.
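The numbers came from straightforward wall-clock timing around each stage. A minimal sketch of that kind of timing harness, not Winston's actual profiling code, with the stage calls left as comments:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label: str):
    """Print the wall-clock time of the wrapped pipeline stage."""
    start = time.perf_counter()
    yield
    print(f"{label}: {time.perf_counter() - start:.2f}s")

# Wrap each stage to see where the time goes, e.g.:
#   with timed("embedding"):     vec = embed(question)
#   with timed("vector search"): hits = collection.query(...)
#   with timed("generation"):    answer = llm.generate(question, sources)
```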
Problem 2: The Smaller Model Was Worse, Not Just Slower
We tried Qwen 3.5 9B, which fits comfortably in 20 GB of VRAM. Response times dropped to 15–20 seconds. Good progress.
But the answer quality was terrible.
Ask it "How did Irving reconstruct Hitler's daily routine?" and it would pull sources about Auschwitz trial transcripts and start writing confidently about gas chambers. Ask "Who was Hitler?" and it would veer into inflammatory editorialising, cherry-picking the most provocative quotes from sources while ignoring the actual question.
The 9B model simply wasn't smart enough to:
- Stay focused on the question asked
- Ignore irrelevant sources in its context
- Synthesise across multiple documents without hallucinating
- Handle sensitive historical content with appropriate nuance
Problem 3: The Retrieval Pipeline Was Masking the Issue
Initially, we blamed the model. But when we dug deeper, we found the retrieval pipeline was part of the problem:
- 57% of the archive is trial transcripts — they dominated every search result
- The embedding model (nomic-embed-text) matched "daily routine" to literal diary entries about Irving's breakfast, not his methodology for reconstructing Hitler's schedule
- The cross-encoder reranker (ms-marco-MiniLM) wasn't penalising off-topic document types
We fixed the retrieval with four changes (a code sketch follows the list):
- Larger candidate pool: 40 vector + 25 keyword results instead of 20 + 15
- Type-aware scoring: Trial transcripts penalised for non-trial queries, books boosted for historical and methodology questions
- Intent-based supplementary search: Automatically injects book-filtered results when the query is about Irving's historical claims
- Cross-encoder dominance: Reduced the base vector score to a tiebreaker, letting the reranker make the final call
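Here's the promised sketch of the scoring logic. The weights, intent labels, and document types below are illustrative assumptions, not our production values; the shape of the computation is the point: the cross-encoder dominates, type-aware adjustments push results around, and the vector score only breaks ties.

```python
def final_score(doc_type: str, intent: str, rerank_score: float, vector_score: float) -> float:
    """Type-aware scoring sketch: cross-encoder dominates, vector score is a tiebreaker.

    Weights and labels are illustrative, not production values.
    """
    score = rerank_score
    if doc_type == "trial_transcript" and intent != "trial":
        score -= 0.5                    # stop transcripts dominating every result
    if doc_type == "book" and intent in ("historical", "methodology"):
        score += 0.3                    # boost books for historical/methodology questions
    return score + 0.01 * vector_score  # base vector score reduced to a tiebreaker

# Example: a book chunk scored for a methodology question
print(final_score("book", "methodology", rerank_score=0.82, vector_score=0.65))
```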
These changes helped significantly — the right documents were now reaching the model. But the 9B model still produced poor answers from good sources.
What Worked
We switched the default model to Anthropic's Claude Sonnet 4.
The difference was immediate and dramatic. Same retrieval pipeline, same sources, but:
- "Who was Rudolf Hess?" → A focused, source-grounded answer about Hess as Hitler's deputy and his 1941 flight to Scotland
- "How did Irving reconstruct Hitler's daily routine?" → Discussion of Irving's filing-card system and archival methodology
- Response times: 5–15 seconds via API
Sonnet excels at exactly what this task requires: following complex instructions ("answer ONLY from these sources"), synthesising across multiple documents, citing properly, and handling controversial historical content without editorialising.
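The generation step itself is a single API call. A hedged sketch using Anthropic's Python SDK; the model ID and prompt wording are illustrative, not Winston's exact values:

```python
# Sketch of the generation call via Anthropic's Python SDK.
# Model ID and prompt wording are illustrative, not Winston's exact values.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def generate_answer(question: str, sources: list[str]) -> str:
    numbered = "\n\n".join(f"[{i + 1}] {s}" for i, s in enumerate(sources))
    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system="Answer ONLY from the numbered sources provided. "
               "Cite sources inline as [n]. If the sources do not "
               "answer the question, say so.",
        messages=[{"role": "user",
                   "content": f"Sources:\n{numbered}\n\nQuestion: {question}"}],
    )
    return message.content[0].text
```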
The Current Architecture
Search pipeline (all local, self-hosted; a minimal sketch follows the list):
- Embedding: nomic-embed-text via Ollama
- Vector DB: ChromaDB (111,533 chunks)
- Full-text: Meilisearch (152,513 documents)
- Reranking: ms-marco-MiniLM cross-encoder
- Type-aware scoring with intent detection
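For concreteness, here's a hedged sketch of how those pieces wire together with their standard Python clients. The collection, index, and field names (`archive_chunks`, `archive`, `text`) are illustrative assumptions, and the query embedding comes from nomic-embed-text via Ollama as described above.

```python
import chromadb
import meilisearch
import ollama
from sentence_transformers import CrossEncoder

chroma = chromadb.PersistentClient(path="./chroma")   # local vector DB
collection = chroma.get_collection("archive_chunks")  # collection name illustrative
meili = meilisearch.Client("http://localhost:7700")   # local full-text index
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def hybrid_search(question: str) -> list[str]:
    # Embed the query locally with nomic-embed-text via Ollama
    query_vec = ollama.embeddings(model="nomic-embed-text", prompt=question)["embedding"]
    # Larger candidate pool: 40 vector + 25 keyword results
    vec_hits = collection.query(query_embeddings=[query_vec], n_results=40)
    kw_hits = meili.index("archive").search(question, {"limit": 25})
    docs = vec_hits["documents"][0] + [h["text"] for h in kw_hits["hits"]]
    # Cross-encoder makes the final relevance call
    scores = reranker.predict([(question, d) for d in docs])
    return [d for _, d in sorted(zip(scores, docs), key=lambda p: p[0], reverse=True)]
```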
Generation (API):
- Default: Claude Sonnet 4 (~$0.02/query)
- Admin mode: Claude Opus 4 (~$0.07/query)
- Free fallback: Qwen 3.5 9B (local GPU)
The local infrastructure still handles everything except the final answer generation. Embedding, search, reranking, and source selection all run on our hardware at zero marginal cost.
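The free-fallback logic is equally simple in outline. A sketch, assuming the `ollama` Python client and an illustrative local model tag (real Ollama tags differ):

```python
import anthropic
import ollama

def generate(prompt: str) -> str:
    """Cloud-first generation with a free local fallback. Model names illustrative."""
    try:
        client = anthropic.Anthropic()
        msg = client.messages.create(
            model="claude-sonnet-4-20250514",  # default: Claude Sonnet 4
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return msg.content[0].text
    except anthropic.APIError:
        # Free fallback: local Qwen on our own GPU via Ollama
        resp = ollama.chat(model="qwen-9b",    # illustrative tag
                           messages=[{"role": "user", "content": prompt}])
        return resp["message"]["content"]
```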
Cost Reality
| Monthly queries | Approx. API cost (Sonnet, ~$0.02/query) |
|---|---|
| 100 | $2 |
| 1,000 | $20 |
| 10,000 | $200 |
For a research tool serving a niche audience of historians and enthusiasts, this is entirely manageable. The quality difference over a free local model is not incremental — it's categorical.
Will We Go Back to Local?
We're watching the space closely. The honest assessment:
- No local model that fits in 20 GB of VRAM matches Sonnet today for this specific task (multi-document synthesis with strict source grounding)
- Q3–Q4 2026 could change this — Qwen 4, Llama 4, and Gemma 4 are all expected, and the 15–25B parameter class is improving fast
- A GPU upgrade (RTX 5090 at 32 GB, or an A6000 at 48 GB) would let us run 30–70B models fully in VRAM, which would close the gap significantly
We'll re-evaluate when the next generation of open models drops. For now, the hybrid approach — local search, cloud generation — gives us the best of both worlds: data sovereignty for the archive, and state-of-the-art answer quality for users.
Try It
The archive contains David Irving's complete published works, personal diaries (1978–2007), trial transcripts, CSDIC intelligence files, and thousands of articles and letters. Ask it anything.