Why We Switched Winston from Local AI to Claude Sonnet — And What We Learned
Winston — our AI-powered research tool for the Focal Point Publications archive — just went through a major upgrade. We tried to make it work with a fully self-hosted local model. We failed. Here's what happened.
The Setup
Winston searches 111,000+ document chunks from David Irving's personal archive: published books, personal diaries, WWII signals intelligence files, trial transcripts, letters, and articles. When a user asks a question, the system retrieves relevant documents via vector search, reranks them for relevance, and feeds them to an AI model that synthesises an answer with inline source citations.
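In code terms, a query flows through the system roughly as below. This is a minimal outline, not Winston's actual code; the four stage functions are hypothetical stand-ins for the components described in the rest of this post.

```python
# Outline of a query's path through Winston. The four stage functions
# (embed, search, rerank, synthesise) are hypothetical stand-ins, passed
# in as callables so the skeleton stays independent of any one backend.

def ask(question: str, embed, search, rerank, synthesise) -> str:
    query_vec = embed(question)                  # embed the user's question
    candidates = search(question, query_vec)     # vector + keyword retrieval
    sources = rerank(question, candidates)[:10]  # keep the most relevant chunks
    return synthesise(question, sources)         # cited answer from the LLM
```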
The server is a Hetzner dedicated box with an NVIDIA RTX 4000 SFF Ada — a solid 20 GB VRAM card. Our local model was Qwen 3.5 27B, running via Ollama.
What Went Wrong
Problem 1: The Model Didn't Fit
The Qwen 3.5 27B model needs roughly 22 GB of VRAM to run. Our GPU has 20 GB. Ollama handled the shortfall by partially offloading layers to CPU RAM, which meant:
- Response times of 60–120 seconds per query
- GPU utilisation hovering at ~50% while the CPU bottlenecked generation
- Users staring at a loading spinner for two minutes
We profiled it properly. Embedding took 0.3 seconds. Vector search took 0.1 seconds. The LLM generation was the entire bottleneck.
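The numbers came from straightforward wall-clock timing around each stage. A minimal sketch of that kind of timing harness, not Winston's actual profiling code, with the stage calls left as comments:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label: str):
    """Print the wall-clock time of the wrapped pipeline stage."""
    start = time.perf_counter()
    yield
    print(f"{label}: {time.perf_counter() - start:.2f}s")

# Wrap each stage to see where the time goes, e.g.:
#   with timed("embedding"):     vec = embed(question)
#   with timed("vector search"): hits = collection.query(...)
#   with timed("generation"):    answer = llm.generate(question, sources)
```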
Problem 2: The Smaller Model Was Worse, Not Just Slower
We tried Qwen 3.5 9B, which fits comfortably in 20 GB of VRAM. Response times dropped to 15–20 seconds. Good progress.
But the answer quality was terrible.
Ask it "How did Irving reconstruct Hitler's daily routine?" and it would pull sources about Auschwitz trial transcripts and start writing confidently about gas chambers. Ask "Who was Hitler?" and it would veer into inflammatory editorialising, cherry-picking the most provocative quotes from sources while ignoring the actual question.
The 9B model simply wasn't smart enough to:
- Stay focused on the question asked
- Ignore irrelevant sources in its context
- Synthesise across multiple documents without hallucinating
- Handle sensitive historical content with appropriate nuance
Problem 3: The Retrieval Pipeline Was Masking the Issue
Initially, we blamed the model. But when we dug deeper, we found the retrieval pipeline was part of the problem:
- 57% of the archive is trial transcripts — they dominated every search result
- The embedding model (nomic-embed-text) matched "daily routine" to literal diary entries about Irving's breakfast, not his methodology for reconstructing Hitler's schedule
- The cross-encoder reranker (ms-marco-MiniLM) wasn't penalising off-topic document types
We fixed the retrieval with four changes (a code sketch follows the list):
- Larger candidate pool: 40 vector + 25 keyword results instead of 20 + 15
- Type-aware scoring: Trial transcripts penalised for non-trial queries, books boosted for historical and methodology questions
- Intent-based supplementary search: Automatically injects book-filtered results when the query is about Irving's historical claims
- Cross-encoder dominance: Reduced the base vector score to a tiebreaker, letting the reranker make the final call
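Here's the promised sketch of the scoring logic. The weights, intent labels, and document types below are illustrative assumptions, not our production values; the shape of the computation is the point: the cross-encoder dominates, type-aware adjustments push results around, and the vector score only breaks ties.

```python
def final_score(doc_type: str, intent: str, rerank_score: float, vector_score: float) -> float:
    """Type-aware scoring sketch: cross-encoder dominates, vector score is a tiebreaker.

    Weights and labels are illustrative, not production values.
    """
    score = rerank_score
    if doc_type == "trial_transcript" and intent != "trial":
        score -= 0.5                    # stop transcripts dominating every result
    if doc_type == "book" and intent in ("historical", "methodology"):
        score += 0.3                    # boost books for historical/methodology questions
    return score + 0.01 * vector_score  # base vector score reduced to a tiebreaker

# Example: a book chunk scored for a methodology question
print(final_score("book", "methodology", rerank_score=0.82, vector_score=0.65))
```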
These changes helped significantly — the right documents were now reaching the model. But the 9B model still produced poor answers from good sources.
What Worked
We switched the default model to Anthropic's Claude Sonnet 4.
The difference was immediate and dramatic. Same retrieval pipeline, same sources, but:
- "Who was Rudolf Hess?" → A focused, source-grounded answer about Hess as Hitler's deputy and his 1941 flight to Scotland
- "How did Irving reconstruct Hitler's daily routine?" → Discussion of Irving's filing-card system and archival methodology
- Response times: 5–15 seconds via API
Sonnet excels at exactly what this task requires: following complex instructions ("answer ONLY from these sources"), synthesising across multiple documents, citing properly, and handling controversial historical content without editorialising.
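The generation step itself is a single API call. A hedged sketch using Anthropic's Python SDK; the model ID and prompt wording are illustrative, not Winston's exact values:

```python
# Sketch of the generation call via Anthropic's Python SDK.
# Model ID and prompt wording are illustrative, not Winston's exact values.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def generate_answer(question: str, sources: list[str]) -> str:
    numbered = "\n\n".join(f"[{i + 1}] {s}" for i, s in enumerate(sources))
    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system="Answer ONLY from the numbered sources provided. "
               "Cite sources inline as [n]. If the sources do not "
               "answer the question, say so.",
        messages=[{"role": "user",
                   "content": f"Sources:\n{numbered}\n\nQuestion: {question}"}],
    )
    return message.content[0].text
```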
The Current Architecture
Search pipeline (all local, self-hosted; a minimal sketch follows the list):
- Embedding: nomic-embed-text via Ollama
- Vector DB: ChromaDB (111,533 chunks)
- Full-text: Meilisearch (152,513 documents)
- Reranking: ms-marco-MiniLM cross-encoder
- Type-aware scoring with intent detection
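For concreteness, here's a hedged sketch of how those pieces wire together with their standard Python clients. The collection, index, and field names (`archive_chunks`, `archive`, `text`) are illustrative assumptions, and the query embedding comes from nomic-embed-text via Ollama as described above.

```python
import chromadb
import meilisearch
import ollama
from sentence_transformers import CrossEncoder

chroma = chromadb.PersistentClient(path="./chroma")   # local vector DB
collection = chroma.get_collection("archive_chunks")  # collection name illustrative
meili = meilisearch.Client("http://localhost:7700")   # local full-text index
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def hybrid_search(question: str) -> list[str]:
    # Embed the query locally with nomic-embed-text via Ollama
    query_vec = ollama.embeddings(model="nomic-embed-text", prompt=question)["embedding"]
    # Larger candidate pool: 40 vector + 25 keyword results
    vec_hits = collection.query(query_embeddings=[query_vec], n_results=40)
    kw_hits = meili.index("archive").search(question, {"limit": 25})
    docs = vec_hits["documents"][0] + [h["text"] for h in kw_hits["hits"]]
    # Cross-encoder makes the final relevance call
    scores = reranker.predict([(question, d) for d in docs])
    return [d for _, d in sorted(zip(scores, docs), key=lambda p: p[0], reverse=True)]
```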
Generation (API):
- Default: Claude Sonnet 4 (~$0.02/query)
- Admin mode: Claude Opus 4 (~$0.07/query)
- Free fallback: Qwen 3.5 9B (local GPU)
The local infrastructure still handles everything except the final answer generation. Embedding, search, reranking, and source selection all run on our hardware at zero marginal cost.
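The free-fallback logic is equally simple in outline. A sketch, assuming the `ollama` Python client and an illustrative local model tag (real Ollama tags differ):

```python
import anthropic
import ollama

def generate(prompt: str) -> str:
    """Cloud-first generation with a free local fallback. Model names illustrative."""
    try:
        client = anthropic.Anthropic()
        msg = client.messages.create(
            model="claude-sonnet-4-20250514",  # default: Claude Sonnet 4
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return msg.content[0].text
    except anthropic.APIError:
        # Free fallback: local Qwen on our own GPU via Ollama
        resp = ollama.chat(model="qwen-9b",    # illustrative tag
                           messages=[{"role": "user", "content": prompt}])
        return resp["message"]["content"]
```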
Cost Reality
| Monthly queries | Approx. API cost (Sonnet, ~$0.02/query) |
|---|---|
| 100 | $2 |
| 1,000 | $20 |
| 10,000 | $200 |
For a research tool serving a niche audience of historians and enthusiasts, this is entirely manageable. The quality difference over a free local model is not incremental — it's categorical.
Will We Go Back to Local?
We're watching the space closely. The honest assessment:
- No local model that fits in 20 GB of VRAM matches Sonnet today for this specific task (multi-document synthesis with strict source grounding)
- Q3–Q4 2026 could change this — Qwen 4, Llama 4, and Gemma 4 are all expected, and the 15–25B parameter class is improving fast
- A GPU upgrade (RTX 5090 at 32 GB, or an A6000 at 48 GB) would let us run 30–70B models fully in VRAM, which would close the gap significantly
We'll re-evaluate when the next generation of open models drops. For now, the hybrid approach — local search, cloud generation — gives us the best of both worlds: data sovereignty for the archive, and state-of-the-art answer quality for users.
Try It
The archive contains David Irving's complete published works, personal diaries (1978–2007), trial transcripts, CSDIC intelligence files, and thousands of articles and letters. Ask it anything.