Purpose: Document the human architectural decisions, trade-offs, and design thinking behind QWED
Core Philosophy
“Don’t reduce hallucinations. Make them irrelevant.”

Large Language Models (LLMs) are probabilistic and will always hallucinate. Rather than trying to fix the LLM, QWED verifies its outputs using deterministic solvers. The LLM becomes an untrusted translator, not a trusted computer.
Architecture: The Untrusted Translator Pattern
The Problem
Traditional approach: Trust the LLM to compute correctly
- Train it better (fine-tuning)
- Give it more context (RAG)
- Ask it to check itself (LLM-as-judge)

All of these still leave the final answer probabilistic.
The QWED Solution
Separation of concerns:
- LLM: Translates natural language → formal specification
- Symbolic Solver: Verifies the formal specification deterministically
Design Decision 1: Multiple Specialized Engines
The Trade-off
Option A: Single general-purpose verifier
- Pros: Simpler architecture, one dependency
- Cons: No general verifier matches domain-specific tools

Option B: Multiple specialized engines
- Pros: Each domain uses decades of specialized research
- Cons: More complex architecture, multiple dependencies
The Decision
Chosen: Multiple specialized engines

Engines:
- SymPy (math): Symbolic algebra, calculus, linear algebra
- Z3 (logic): SAT/SMT solving, formal verification
- AST (code): Static analysis, syntax validation
- SQLGlot (SQL): Query parsing and validation
- NLI + TF-IDF (facts): Grounding against source documents
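The engine-per-domain pattern can be sketched as a registry that routes each request to a specialized verifier. The names below are illustrative, not QWED's actual API; only the stdlib AST engine is shown concretely, standing in for SymPy, Z3, and the rest.

```python
import ast

def verify_code_syntax(source: str) -> bool:
    """Code engine: deterministic syntax check via Python's own parser."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

# One specialized engine per domain (SymPy, Z3, SQLGlot, ... in the real
# system; only the stdlib AST engine is wired up in this sketch).
ENGINES = {
    "code": verify_code_syntax,
}

def verify(domain: str, payload: str) -> bool:
    engine = ENGINES[domain]  # KeyError signals an unsupported domain
    return engine(payload)

print(verify("code", "def f(x): return x + 1"))  # True
print(verify("code", "def f(x) return x"))       # False
```

The registry keeps each engine independent, so adding a domain means adding one entry rather than touching a monolithic verifier.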
Design Decision 2: TF-IDF for Fact Grounding
The Trade-off
Option A: Vector embeddings (semantic similarity)
- Pros: Captures semantic meaning, state-of-the-art
- Cons: Probabilistic, can match opposite meanings

Option B: TF-IDF (lexical evidence retrieval)
- Pros: Deterministic, reproducible, searches for evidence
- Cons: Doesn’t capture semantic similarity
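The determinism argument can be shown with a minimal TF-IDF sketch (illustrative, not QWED's implementation): the same corpus and query always produce the same scores, so evidence retrieval is reproducible, unlike probabilistic embedding similarity.

```python
import math
from collections import Counter

def tfidf_scores(query: str, docs: list[str]) -> list[float]:
    """Score each document against the query with smoothed TF-IDF."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)

    def idf(term: str) -> float:
        df = sum(term in doc for doc in tokenized)
        return math.log((1 + n) / (1 + df)) + 1  # smoothed IDF

    q_terms = query.lower().split()
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        scores.append(sum(tf[t] / len(doc) * idf(t) for t in q_terms))
    return scores

docs = ["the drug was approved in 2021", "the weather was mild in 2021"]
print(tfidf_scores("drug approved", docs))  # first doc scores higher
```

Running this twice on the same inputs yields byte-identical scores; an embedding model gives you a similarity number, not this kind of auditable lexical evidence.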
The Decision
Chosen: TF-IDF for evidence retrieval. The problem with embeddings: they can assign high similarity to sentences with opposite meanings (“The drug is safe” vs. “The drug is not safe”), which is exactly the failure mode a verifier must rule out.

Design Decision 3: LLM-as-Translator vs LLM-as-Judge
The Trade-off
Option A: LLM-as-judge (e.g., GPT-4 checks GPT-3.5)
- Pros: Uses latest models, easy to implement
- Cons: Both models probabilistic → recursive hallucination

Option B: Symbolic solver as judge
- Pros: Deterministic, mathematical proof
- Cons: Limited to verifiable domains
The Decision
Chosen: Symbolic solver as judge

Comparison:

| Approach | Judge | Guarantee | Example |
|---|---|---|---|
| LLM-as-judge | GPT-4 | Probabilistic | “Probably correct” |
| QWED | SymPy/Z3 | Mathematical proof | “Provably correct” |
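The untrusted-translator pattern can be sketched with the standard library alone (function names are illustrative, not QWED's actual API): the LLM's arithmetic claim is treated as an assertion, and a deterministic evaluator, not another LLM, acts as the judge.

```python
import ast
import operator

# Whitelisted arithmetic operators; anything else is rejected.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def evaluate(node):
    """Deterministically evaluate a restricted arithmetic AST."""
    if isinstance(node, ast.Expression):
        return evaluate(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
    raise ValueError("unsupported expression")

def verify_math(expression: str, claimed: float) -> bool:
    # The expression is the LLM's translation; the claim is checked, never trusted.
    return evaluate(ast.parse(expression, mode="eval")) == claimed

print(verify_math("17 * 23", 391))  # True: 17 * 23 really is 391
print(verify_math("17 * 23", 381))  # False: a plausible-looking hallucination
```

The judge here is a few lines of deterministic code; in QWED that role is filled by SymPy and Z3, which carry the same guarantee at much greater power.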
Design Decision 4: API Design Philosophy
The Trade-off
Option A: Low-level API (expose all solver internals)
- Pros: Maximum flexibility
- Cons: Steep learning curve, users need solver expertise

Option B: High-level, developer-friendly API
- Pros: Easy to use, hides complexity
- Cons: Less control over solver behavior
The Decision
Chosen: High-level, developer-friendly API

Design principles:
- Single method per domain: verify_math(), verify_logic(), verify_code()
- Smart routing: Auto-detect domain from query
- Rich error messages: Explain what failed and why
- Fallback values: Graceful degradation on verification failure
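Two of these principles, smart routing and fallback values, can be sketched as follows. The heuristics and names are assumptions for illustration, not QWED's actual implementation.

```python
import re

def detect_domain(query: str) -> str:
    """Smart routing: auto-detect the verification domain from the query text."""
    if re.search(r"\d\s*[-+*/]\s*\d", query):          # looks like arithmetic
        return "math"
    if re.match(r"\s*(select|insert|update|delete)\b", query, re.IGNORECASE):
        return "sql"
    return "facts"                                      # default domain

def verify_with_fallback(engine, payload, fallback):
    """Fallback values: graceful degradation instead of an unhandled crash."""
    try:
        return engine(payload)
    except Exception:
        return fallback

print(detect_domain("what is 2 + 2"))           # math
print(detect_domain("SELECT name FROM users"))  # sql
print(verify_with_fallback(lambda p: 1 / 0, "x", fallback="unverified"))
```

The point of the wrapper is that verification failure degrades to a user-chosen safe value rather than taking down the calling application.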
Design Decision 5: Integration Patterns
The Trade-off
Option A: Standalone verification service
- Pros: Clean separation
- Cons: Hard to integrate with existing LLM workflows

Option B: Native framework integrations
- Pros: Drops into LangChain, LlamaIndex, etc.
- Cons: More maintenance, framework dependencies
The Decision
Chosen: Native integrations with popular frameworks

Integrations:
- LangChain: QWEDTool as a native LangChain tool
- LlamaIndex: Query engine wrapper
- Direct API: For custom workflows
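The integration pattern behind these wrappers can be sketched generically: most agent frameworks, LangChain included, accept a named callable plus a description, so a verifier drops in as a tool. The QWEDTool stand-in below is purely illustrative and does not mirror the package's real class signature.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class QWEDTool:
    """Minimal tool shape: a name, a description for the agent, a callable."""
    name: str
    description: str
    func: Callable[[str], str]

    def run(self, query: str) -> str:
        return self.func(query)

def toy_verifier(query: str) -> str:
    # Stand-in for a real QWED verification call.
    return "verified" if "=" in query else "no claim found"

tool = QWEDTool(
    name="qwed_verify",
    description="Deterministically verify a math or logic claim",
    func=toy_verifier,
)
print(tool.run("2 + 2 = 4"))  # verified
```

Because the tool is just a described callable, the same verifier backs the LangChain tool, the LlamaIndex wrapper, and the direct API without duplication.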
Design Decision 6: Error Handling Philosophy
The Trade-off
Option A: Fail fast (reject on any error)
- Pros: Safe, conservative
- Cons: Breaks user experience

Option B: Graceful degradation
- Pros: Better UX, transparent failures
- Cons: Users might miss errors
The Decision
Chosen: Graceful degradation with comprehensive logging

Pattern:
- Audit trails: Log every verification attempt
- Retry with exponential backoff: Handle transient LLM errors
- Alert systems: Notify on repeated failures
- Fallback values: User-defined safe defaults
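The retry and fallback parts of this pattern can be sketched as below; names and the audit-trail print are illustrative, not QWED's actual logging.

```python
import time

def with_retries(fn, max_attempts=3, base_delay=0.01, fallback=None):
    """Retry transient failures with exponential backoff, then fall back."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            print(f"attempt {attempt + 1} failed: {exc}")   # audit trail
            if attempt == max_attempts - 1:
                return fallback                             # graceful degradation
            time.sleep(base_delay * 2 ** attempt)           # 0.01s, 0.02s, 0.04s, ...

state = {"calls": 0}
def flaky():
    """Simulated transient LLM error that succeeds on the third call."""
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("transient LLM error")
    return "verified"

print(with_retries(flaky))  # succeeds on the third attempt: verified
```

Every failed attempt is logged before the retry, so the audit trail survives even when the caller only ever sees the fallback value.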
Design Decision 7: PII Masking Strategy
The Trade-off
Option A: No PII handling (user responsibility)
- Pros: Simpler architecture
- Cons: Privacy risk, compliance issues

Option B: Built-in PII masking
- Pros: HIPAA/GDPR compliance, privacy by default
- Cons: Performance overhead, false positives
The Decision
Chosen: Optional built-in PII masking

Implementation:
- Uses Microsoft Presidio for entity detection
- Masks before sending to LLM
- Unmasks in verification step
- Logs PII detection events
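The mask-before-LLM / unmask-after-verification flow can be illustrated with a regex stand-in. The real implementation uses Microsoft Presidio for entity detection; the pattern and placeholder format below are assumptions made only to show the round trip.

```python
import re

# Toy email detector standing in for Presidio's analyzers.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def mask(text: str):
    """Replace each detected entity with a placeholder; keep the mapping."""
    mapping = {}

    def repl(match):
        key = f"<PII_{len(mapping)}>"
        mapping[key] = match.group(0)
        return key

    return EMAIL.sub(repl, text), mapping

def unmask(text: str, mapping: dict) -> str:
    """Restore the original entities in the verification step."""
    for key, value in mapping.items():
        text = text.replace(key, value)
    return text

masked, mapping = mask("Contact alice@example.com about the invoice")
print(masked)                   # placeholder text is what the LLM sees
print(unmask(masked, mapping))  # original restored for verification
```

Only placeholders cross the LLM boundary; the mapping stays local, which is what makes the privacy-by-default claim hold.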
Design Decision 8: Calculator vs Wikipedia Analogy
The Mental Model
Why this matters: Helps users understand QWED’s scope

Calculator (Deterministic):
- Input: 2 + 2
- Output: 4 (always)
- Verifiable: ✅

Wikipedia (Subjective):
- Input: “Who is the president?”
- Output: Depends on edit history
- Verifiable: ❌ (subjective, changes)

QWED is a calculator, not a Wikipedia:
- ✅ Math: Provable
- ✅ Logic: Provable
- ✅ Code syntax: Provable
- ❌ Opinions: Not provable
- ❌ Creative content: Not provable
What QWED Is NOT
Rejected Design Paths
1. Novel Research
- Not claiming: New verification algorithms
- Claiming: Practical integration of existing solvers

2. LLM Improvement
- Not claiming: Making LLMs more accurate
- Claiming: Catching LLM errors deterministically

3. AI Safety Solution
- Not claiming: Solving jailbreaks, toxicity, bias
- Claiming: Verifying mathematical/logical correctness
Development Transparency
AI Assistance
This project was developed with assistance from generative AI tools.

Tools used:
- Antigravity IDE with Claude Sonnet 4.5
- Scope: Code implementation, documentation, test generation

Human-led decisions:
- All architectural decisions documented above
- Trade-off analysis and solver selection
- API design and integration patterns
- Error handling philosophy
- Domain modeling and verification workflows
Lessons Learned
What Worked
- Specialized engines: Better than trying to build one general verifier
- Developer-friendly API: Easier adoption than low-level solver access
- TF-IDF for grounding: Deterministic evidence > semantic vibes
- Open development: Community contributions improved design
What We’d Change
- Earlier framework integrations: Should have built LangChain support sooner
- More domain tutorials: Users needed more examples per domain
- Clearer scope communication: “Calculator not Wikipedia” analogy should be upfront
References
Academic validation for design decisions:

TF-IDF for fact verification:
- FEVER Shared Task (2018): Fact verification benchmark
- AAAI 2019: Combining fact extraction and verification
- Neurocomputing 2023: Information retrieval for fact-checking

Neurosymbolic AI:
- Kautz (2020): The third AI summer
- Marcus (2020): The next decade in AI

LLM verification challenges:
- Ji et al. (2023): Survey of hallucination in NLP
- OpenAI (2023): GPT-4 technical report
Contributing to Design
Have suggestions on design improvements? Open an issue on GitHub to discuss!

Questions welcome:
- Why this solver over alternatives?
- Trade-offs we should reconsider?
- New design patterns?