Purpose: Document the human architectural decisions, trade-offs, and design thinking behind QWED
Core Philosophy
“Don’t reduce hallucinations. Make them irrelevant.”

Large Language Models (LLMs) are probabilistic and will always hallucinate. Rather than trying to fix the LLM, QWED verifies its outputs using deterministic solvers. The LLM becomes an untrusted translator, not a trusted computer.
Architecture: The Untrusted Translator Pattern
The Problem
Traditional approach: Trust the LLM to compute correctly
- Train it better (fine-tuning)
- Give it more context (RAG)
- Ask it to check itself (LLM-as-judge)

All of these still leave the final answer probabilistic.
The QWED Solution
Separation of concerns:
- LLM: Translates natural language → formal specification
- Symbolic Solver: Verifies the formal specification deterministically
Design Decision 1: Multiple Specialized Engines
The Trade-off
Option A: Single general-purpose verifier
- Pros: Simpler architecture, one dependency
- Cons: No general verifier matches domain-specific tools

Option B: Multiple specialized engines
- Pros: Each domain uses decades of specialized research
- Cons: More complex architecture, multiple dependencies
The Decision
Chosen: Multiple specialized engines

Engines:
- SymPy (math): Symbolic algebra, calculus, linear algebra
- Z3 (logic): SAT/SMT solving, formal verification
- AST (code): Static analysis, syntax validation
- SQLGlot (SQL): Query parsing and validation
- NLI + TF-IDF (facts): Grounding against source documents
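The engine-per-domain pattern can be sketched as a registry that routes each request to a specialized verifier. The names below are illustrative, not QWED's actual API; only the stdlib AST engine is shown concretely, standing in for SymPy, Z3, and the rest.

```python
import ast

def verify_code_syntax(source: str) -> bool:
    """Code engine: deterministic syntax check via Python's own parser."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

# One specialized engine per domain (SymPy, Z3, SQLGlot, ... in the real
# system; only the stdlib AST engine is wired up in this sketch).
ENGINES = {
    "code": verify_code_syntax,
}

def verify(domain: str, payload: str) -> bool:
    engine = ENGINES[domain]  # KeyError signals an unsupported domain
    return engine(payload)

print(verify("code", "def f(x): return x + 1"))  # True
print(verify("code", "def f(x) return x"))       # False
```

The registry keeps each engine independent, so adding a domain means adding one entry rather than touching a monolithic verifier.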
Design Decision 2: TF-IDF for Fact Grounding
The Trade-off
Option A: Vector embeddings (semantic similarity)
- Pros: Captures semantic meaning, state-of-the-art
- Cons: Probabilistic, can match opposite meanings

Option B: TF-IDF (lexical evidence retrieval)
- Pros: Deterministic, reproducible, searches for evidence
- Cons: Doesn’t capture semantic similarity
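The determinism argument can be shown with a minimal TF-IDF sketch (illustrative, not QWED's implementation): the same corpus and query always produce the same scores, so evidence retrieval is reproducible, unlike probabilistic embedding similarity.

```python
import math
from collections import Counter

def tfidf_scores(query: str, docs: list[str]) -> list[float]:
    """Score each document against the query with smoothed TF-IDF."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)

    def idf(term: str) -> float:
        df = sum(term in doc for doc in tokenized)
        return math.log((1 + n) / (1 + df)) + 1  # smoothed IDF

    q_terms = query.lower().split()
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        scores.append(sum(tf[t] / len(doc) * idf(t) for t in q_terms))
    return scores

docs = ["the drug was approved in 2021", "the weather was mild in 2021"]
print(tfidf_scores("drug approved", docs))  # first doc scores higher
```

Running this twice on the same inputs yields byte-identical scores; an embedding model gives you a similarity number, not this kind of auditable lexical evidence.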
The Decision
Chosen: TF-IDF for evidence retrieval. The problem with embeddings: they can assign high similarity to sentences with opposite meanings (“The drug is safe” vs. “The drug is not safe”), which is exactly the failure mode a verifier must rule out.

Design Decision 3: LLM-as-Translator vs LLM-as-Judge
The Trade-off
Option A: LLM-as-judge (e.g., GPT-4 checks GPT-3.5)
- Pros: Uses latest models, easy to implement
- Cons: Both models probabilistic → recursive hallucination

Option B: Symbolic solver as judge
- Pros: Deterministic, mathematical proof
- Cons: Limited to verifiable domains
The Decision
Chosen: Symbolic solver as judge

Comparison:

| Approach | Judge | Guarantee | Example |
|---|---|---|---|
| LLM-as-judge | GPT-4 | Probabilistic | “Probably correct” |
| QWED | SymPy/Z3 | Mathematical proof | “Provably correct” |
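The untrusted-translator pattern can be sketched with the standard library alone (function names are illustrative, not QWED's actual API): the LLM's arithmetic claim is treated as an assertion, and a deterministic evaluator, not another LLM, acts as the judge.

```python
import ast
import operator

# Whitelisted arithmetic operators; anything else is rejected.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def evaluate(node):
    """Deterministically evaluate a restricted arithmetic AST."""
    if isinstance(node, ast.Expression):
        return evaluate(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
    raise ValueError("unsupported expression")

def verify_math(expression: str, claimed: float) -> bool:
    # The expression is the LLM's translation; the claim is checked, never trusted.
    return evaluate(ast.parse(expression, mode="eval")) == claimed

print(verify_math("17 * 23", 391))  # True: 17 * 23 really is 391
print(verify_math("17 * 23", 381))  # False: a plausible-looking hallucination
```

The judge here is a few lines of deterministic code; in QWED that role is filled by SymPy and Z3, which carry the same guarantee at much greater power.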
Design Decision 4: API Design Philosophy
The Trade-off
Option A: Low-level API (expose all solver internals)
- Pros: Maximum flexibility
- Cons: Steep learning curve, users need solver expertise

Option B: High-level, developer-friendly API
- Pros: Easy to use, hides complexity
- Cons: Less control over solver behavior
The Decision
Chosen: High-level, developer-friendly API

Design principles:
- Single method per domain: verify_math(), verify_logic(), verify_code()
- Smart routing: Auto-detect domain from query
- Rich error messages: Explain what failed and why
- Fallback values: Graceful degradation on verification failure
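Two of these principles, smart routing and fallback values, can be sketched as follows. The heuristics and names are assumptions for illustration, not QWED's actual implementation.

```python
import re

def detect_domain(query: str) -> str:
    """Smart routing: auto-detect the verification domain from the query text."""
    if re.search(r"\d\s*[-+*/]\s*\d", query):          # looks like arithmetic
        return "math"
    if re.match(r"\s*(select|insert|update|delete)\b", query, re.IGNORECASE):
        return "sql"
    return "facts"                                      # default domain

def verify_with_fallback(engine, payload, fallback):
    """Fallback values: graceful degradation instead of an unhandled crash."""
    try:
        return engine(payload)
    except Exception:
        return fallback

print(detect_domain("what is 2 + 2"))           # math
print(detect_domain("SELECT name FROM users"))  # sql
print(verify_with_fallback(lambda p: 1 / 0, "x", fallback="unverified"))
```

The point of the wrapper is that verification failure degrades to a user-chosen safe value rather than taking down the calling application.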
Design Decision 5: Integration Patterns
The Trade-off
Option A: Standalone verification service
- Pros: Clean separation
- Cons: Hard to integrate with existing LLM workflows

Option B: Native framework integrations
- Pros: Drops into LangChain, LlamaIndex, etc.
- Cons: More maintenance, framework dependencies
The Decision
Chosen: Native integrations with popular frameworks

Integrations:
- LangChain: QWEDTool as a native LangChain tool
- LlamaIndex: Query engine wrapper
- Direct API: For custom workflows
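The integration pattern behind these wrappers can be sketched generically: most agent frameworks, LangChain included, accept a named callable plus a description, so a verifier drops in as a tool. The QWEDTool stand-in below is purely illustrative and does not mirror the package's real class signature.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class QWEDTool:
    """Minimal tool shape: a name, a description for the agent, a callable."""
    name: str
    description: str
    func: Callable[[str], str]

    def run(self, query: str) -> str:
        return self.func(query)

def toy_verifier(query: str) -> str:
    # Stand-in for a real QWED verification call.
    return "verified" if "=" in query else "no claim found"

tool = QWEDTool(
    name="qwed_verify",
    description="Deterministically verify a math or logic claim",
    func=toy_verifier,
)
print(tool.run("2 + 2 = 4"))  # verified
```

Because the tool is just a described callable, the same verifier backs the LangChain tool, the LlamaIndex wrapper, and the direct API without duplication.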
Design Decision 6: Error Handling Philosophy
The Trade-off
Option A: Fail fast (reject on any error)
- Pros: Safe, conservative
- Cons: Breaks user experience

Option B: Graceful degradation
- Pros: Better UX, transparent failures
- Cons: Users might miss errors
The Decision
Chosen: Graceful degradation with comprehensive logging

Pattern:
- Audit trails: Log every verification attempt
- Retry with exponential backoff: Handle transient LLM errors
- Alert systems: Notify on repeated failures
- Fallback values: User-defined safe defaults
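The retry and fallback parts of this pattern can be sketched as below; names and the audit-trail print are illustrative, not QWED's actual logging.

```python
import time

def with_retries(fn, max_attempts=3, base_delay=0.01, fallback=None):
    """Retry transient failures with exponential backoff, then fall back."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            print(f"attempt {attempt + 1} failed: {exc}")   # audit trail
            if attempt == max_attempts - 1:
                return fallback                             # graceful degradation
            time.sleep(base_delay * 2 ** attempt)           # 0.01s, 0.02s, 0.04s, ...

state = {"calls": 0}
def flaky():
    """Simulated transient LLM error that succeeds on the third call."""
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("transient LLM error")
    return "verified"

print(with_retries(flaky))  # succeeds on the third attempt: verified
```

Every failed attempt is logged before the retry, so the audit trail survives even when the caller only ever sees the fallback value.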
Design Decision 7: PII Masking Strategy
The Trade-off
Option A: No PII handling (user responsibility)
- Pros: Simpler architecture
- Cons: Privacy risk, compliance issues

Option B: Built-in PII masking
- Pros: HIPAA/GDPR compliance, privacy by default
- Cons: Performance overhead, false positives
The Decision
Chosen: Optional built-in PII masking

Implementation:
- Uses Microsoft Presidio for entity detection
- Masks before sending to LLM
- Unmasks in verification step
- Logs PII detection events
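The mask-before-LLM / unmask-after-verification flow can be illustrated with a regex stand-in. The real implementation uses Microsoft Presidio for entity detection; the pattern and placeholder format below are assumptions made only to show the round trip.

```python
import re

# Toy email detector standing in for Presidio's analyzers.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def mask(text: str):
    """Replace each detected entity with a placeholder; keep the mapping."""
    mapping = {}

    def repl(match):
        key = f"<PII_{len(mapping)}>"
        mapping[key] = match.group(0)
        return key

    return EMAIL.sub(repl, text), mapping

def unmask(text: str, mapping: dict) -> str:
    """Restore the original entities in the verification step."""
    for key, value in mapping.items():
        text = text.replace(key, value)
    return text

masked, mapping = mask("Contact alice@example.com about the invoice")
print(masked)                   # placeholder text is what the LLM sees
print(unmask(masked, mapping))  # original restored for verification
```

Only placeholders cross the LLM boundary; the mapping stays local, which is what makes the privacy-by-default claim hold.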
Design Decision 8: Calculator vs Wikipedia Analogy
The Mental Model
Why this matters: Helps users understand QWED’s scope

Calculator (Deterministic):
- Input: 2 + 2
- Output: 4 (always)
- Verifiable: ✅

Wikipedia (Subjective):
- Input: “Who is the president?”
- Output: Depends on edit history
- Verifiable: ❌ (subjective, changes)

QWED is a calculator, not a Wikipedia:
- ✅ Math: Provable
- ✅ Logic: Provable
- ✅ Code syntax: Provable
- ❌ Opinions: Not provable
- ❌ Creative content: Not provable
What QWED Is NOT
Rejected Design Paths
1. Novel Research
- Not claiming: New verification algorithms
- Claiming: Practical integration of existing solvers

2. LLM Improvement
- Not claiming: Making LLMs more accurate
- Claiming: Catching LLM errors deterministically

3. AI Safety Solution
- Not claiming: Solving jailbreaks, toxicity, bias
- Claiming: Verifying mathematical/logical correctness
Development Transparency
AI Assistance
This project was developed with assistance from generative AI tools.

Tools used:
- Antigravity IDE with Claude Sonnet 4.5
- Scope: Code implementation, documentation, test generation

Human-led decisions:
- All architectural decisions documented above
- Trade-off analysis and solver selection
- API design and integration patterns
- Error handling philosophy
- Domain modeling and verification workflows
Lessons Learned
What Worked
- Specialized engines: Better than trying to build one general verifier
- Developer-friendly API: Easier adoption than low-level solver access
- TF-IDF for grounding: Deterministic evidence > semantic vibes
- Open development: Community contributions improved design
What We’d Change
- Earlier framework integrations: Should have built LangChain support sooner
- More domain tutorials: Users needed more examples per domain
- Clearer scope communication: “Calculator not Wikipedia” analogy should be upfront
References
Academic validation for design decisions:

TF-IDF for fact verification:
- FEVER Shared Task (2018): Fact verification benchmark
- AAAI 2019: Combining fact extraction and verification
- Neurocomputing 2023: Information retrieval for fact-checking

Neurosymbolic AI:
- Kautz (2020): The third AI summer
- Marcus (2020): The next decade in AI

LLM verification challenges:
- Ji et al. (2023): Survey of hallucination in NLP
- OpenAI (2023): GPT-4 technical report
Contributing to Design
Have suggestions on design improvements? Open an issue on GitHub to discuss!

Questions welcome:
- Why this solver over alternatives?
- Trade-offs we should reconsider?
- New design patterns?