WINSTON INSIGHTS

Why We Switched Winston from Local AI to Claude Sonnet — And What We Learned

Winston — our AI-powered research tool for the Focal Point Publications archive — just went through a major upgrade. We tried to make it work with a fully self-hosted local model. We failed. Here's what happened.

The Setup

Winston searches 111,000+ document chunks from David Irving's personal archive: published books, personal diaries, WWII signals intelligence files, trial transcripts, letters, and articles. When a user asks a question, the system retrieves relevant documents via vector search, reranks them for relevance, and feeds them to an AI model that synthesises an answer with inline source citations.
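The pipeline described above can be sketched in a few lines. The function names and the toy three-document corpus are illustrative, not Winston's actual code:

```python
# Sketch of the retrieve -> rerank -> generate flow described above
# (names and corpus are illustrative, not the real codebase).

def vector_search(query, top_k=50):
    # Stand-in for a real embedding + ANN lookup over 111k chunks.
    corpus = [
        {"id": 1, "text": "diary entry on Hitler's daily routine"},
        {"id": 2, "text": "Auschwitz trial transcript excerpt"},
        {"id": 3, "text": "CSDIC signals intelligence summary"},
    ]
    # Crude relevance proxy: shared lowercase words with the query.
    def score(doc):
        return len(set(query.lower().split()) & set(doc["text"].lower().split()))
    return sorted(corpus, key=score, reverse=True)[:top_k]

def rerank(query, docs, keep=2):
    # A real system would run a cross-encoder here; we just truncate.
    return docs[:keep]

def generate_answer(query, sources):
    # Placeholder for the LLM call; cites source ids inline.
    cites = ", ".join(f"[{d['id']}]" for d in sources)
    return f"Answer to '{query}' drawn from sources {cites}"

docs = vector_search("daily routine")
print(generate_answer("daily routine", rerank("daily routine", docs)))
```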

The server is a Hetzner dedicated box with an NVIDIA RTX 4000 SFF Ada — a solid 20 GB VRAM card. Our local model was Qwen 3.5 27B, running via Ollama.

What Went Wrong

Problem 1: The Model Didn't Fit

The Qwen 3.5 27B model needs roughly 22 GB to run. Our GPU has 20 GB. Ollama handled this by partially offloading layers to CPU RAM, which meant token generation slowed to a crawl.

We profiled it properly. Embedding took 0.3 seconds. Vector search took 0.1 seconds. The LLM generation was the entire bottleneck.
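The VRAM squeeze is easy to reproduce with back-of-envelope arithmetic. The bits-per-weight and overhead figures below are assumptions (actual usage depends on the quantization and context length), but they show why 27B spills past a 20 GB card while 9B fits:

```python
def vram_estimate_gb(params_b, bits_per_weight=4.5, overhead_gb=5.0):
    # Rough rule of thumb: weights at the quantized width, plus a flat
    # allowance for KV cache, activations, and runtime buffers.
    weights_gb = params_b * 1e9 * bits_per_weight / 8 / 1e9
    return weights_gb + overhead_gb

# A 27B model quantized to roughly 4-5 bits lands past the 20 GB mark,
# which is why the card ends up spilling layers to CPU RAM.
print(f"27B: ~{vram_estimate_gb(27):.1f} GB")
print(f" 9B: ~{vram_estimate_gb(9):.1f} GB")   # comfortably inside 20 GB
```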

Problem 2: The Smaller Model Was Worse, Not Just Slower

We tried Qwen 3.5 9B — it fits comfortably in 20 GB VRAM with room to spare. Response times dropped to 15–20 seconds. Good progress.

But the answer quality was terrible.

Ask it "How did Irving reconstruct Hitler's daily routine?" and it would pull sources about Auschwitz trial transcripts and start writing confidently about gas chambers. Ask "Who was Hitler?" and it would veer into inflammatory editorialising, cherry-picking the most provocative quotes from sources while ignoring the actual question.

The 9B model simply wasn't smart enough to:

  1. Stay focused on the question asked
  2. Ignore irrelevant sources in its context
  3. Synthesise across multiple documents without hallucinating
  4. Handle sensitive historical content with appropriate nuance

Problem 3: The Retrieval Pipeline Was Masking the Issue

Initially, we blamed the model. But when we dug deeper, we found the retrieval pipeline was part of the problem: irrelevant chunks were reaching the model alongside the genuinely relevant ones, which is exactly how an Auschwitz transcript ended up in an answer about Hitler's daily routine.

Tuning the retrieval stage helped significantly: the right documents were now reaching the model. But the 9B model still produced poor answers from good sources.
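One generic version of this kind of retrieval fix is simply refusing to pass low-relevance hits to the model at all. The threshold and scores below are illustrative, not Winston's actual values:

```python
# Drop chunks below a relevance floor so off-topic sources never
# reach the model's context window (values are illustrative).

def filter_sources(scored_chunks, min_score=0.5, max_sources=8):
    kept = [c for c in scored_chunks if c["score"] >= min_score]
    kept.sort(key=lambda c: c["score"], reverse=True)
    return kept[:max_sources]

hits = [
    {"text": "diary entry on Hitler's schedule", "score": 0.91},
    {"text": "unrelated trial transcript page",  "score": 0.22},
    {"text": "adjutant's memoir on the routine", "score": 0.78},
]
print([h["score"] for h in filter_sources(hits)])  # the 0.22 chunk is gone
```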

What Worked

We switched the default to Claude Sonnet 4 by Anthropic.

The difference was immediate and dramatic. Same retrieval pipeline, same sources, but the answers stayed on the question asked, cited correctly, and ignored irrelevant chunks instead of chasing them.

Sonnet excels at exactly what this task requires: following complex instructions ("answer ONLY from these sources"), synthesising across multiple documents, citing properly, and handling controversial historical content without editorialising.
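A minimal sketch of the kind of grounding prompt described above (the exact wording is illustrative, not Winston's production prompt):

```python
# Build a source-grounded prompt: numbered sources, strict instructions,
# and an explicit out for unanswerable questions (wording illustrative).

def build_prompt(question, sources):
    numbered = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(sources))
    return (
        "Answer the question using ONLY the numbered sources below.\n"
        "Cite sources inline as [n]. If the sources do not answer the\n"
        "question, say so rather than speculating.\n\n"
        f"Sources:\n{numbered}\n\nQuestion: {question}"
    )

print(build_prompt("How did Irving reconstruct Hitler's daily routine?",
                   ["Diary entry, 14 May 1942", "Adjutant interview notes"]))
```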

The Current Architecture

Search pipeline (all local, self-hosted):

  1. Embed the query
  2. Vector search across the 111,000+ document chunks
  3. Rerank the hits for relevance
  4. Select the sources to cite

Generation (API): Claude Sonnet 4 synthesises the answer from the selected sources, with inline citations.

The local infrastructure still handles everything except the final answer generation. Embedding, search, reranking, and source selection all run on our hardware at zero marginal cost.
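The division of labour can be summarised as a simple stage map. The stage names are illustrative, and the per-query figure comes from the cost table below:

```python
# Which stages run locally at zero marginal cost, and which call out
# to the API (stage names illustrative; $0.02/query from the cost table).
PIPELINE = {
    "embed":         {"runs_on": "local GPU", "marginal_cost": 0.0},
    "vector_search": {"runs_on": "local",     "marginal_cost": 0.0},
    "rerank":        {"runs_on": "local GPU", "marginal_cost": 0.0},
    "generate":      {"runs_on": "Claude Sonnet API", "marginal_cost": 0.02},
}
local_stages = [k for k, v in PIPELINE.items() if v["marginal_cost"] == 0.0]
print(local_stages)  # everything except generation is free at the margin
```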

Cost Reality

  Monthly queries    Cost
  100                $2
  1,000              $20
  10,000             $200
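The table works out to a flat ~$0.02 per query. As a sanity check, that is consistent with Sonnet-class per-token pricing; the token counts and prices below are assumed figures, not measured ones:

```python
# Sanity check on the ~$0.02/query figure, using assumed per-token
# prices in the ballpark of Sonnet-class models ($3/M in, $15/M out).
in_tokens, out_tokens = 4000, 600   # retrieved sources + a long answer
cost = in_tokens / 1e6 * 3 + out_tokens / 1e6 * 15
print(f"${cost:.3f} per query")     # roughly two cents
```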

For a research tool serving a niche audience of historians and enthusiasts, this is entirely manageable. The quality difference over a free local model is not incremental — it's categorical.

Will We Go Back to Local?

We're watching the space closely. The honest assessment: open local models are not yet good enough for this task on this hardware, and the gap is in reasoning quality, not just speed.

We'll re-evaluate when the next generation of open models drops. For now, the hybrid approach — local search, cloud generation — gives us the best of both worlds: data sovereignty for the archive, and state-of-the-art answer quality for users.

Try It

ask.winston.study

The archive contains David Irving's complete published works, personal diaries (1978–2007), trial transcripts, CSDIC intelligence files, and thousands of articles and letters. Ask it anything.
