> ## Documentation Index
> Fetch the complete documentation index at: https://docs.qwedai.com/llms.txt
> Use this file to discover all available pages before exploring further.

# QWED design decisions

> QWED's core design philosophy and the Untrusted Translator pattern for deterministic verification of LLM outputs.

**Last updated:** January 2026\
**Purpose:** Document the human architectural decisions, trade-offs, and design thinking behind QWED

***

## Core philosophy

> **"Don't reduce hallucinations. Make them irrelevant."**

Large Language Models (LLMs) are probabilistic and will always hallucinate. Rather than trying to fix the LLM, QWED verifies its outputs using deterministic solvers. The LLM becomes an **untrusted translator**, not a trusted computer.

***

## Architecture: the untrusted translator pattern

### The problem

**Traditional approach:** Trust the LLM to compute correctly

* Train it better (fine-tuning)
* Give it more context (RAG)
* Ask it to check itself (LLM-as-judge)

**Why this fails:** LLMs are fundamentally probabilistic. No amount of training guarantees correctness.

### The QWED solution

**Separation of concerns:**

1. **LLM:** Translates natural language → formal specification
2. **Symbolic Solver:** Verifies the formal specification deterministically

```
User: "What is the derivative of x²?"
  ↓
LLM: Translates to → diff(x**2, x)
  ↓
SymPy: Computes → 2*x (proven correct)
  ↓
QWED: Returns verified result
```

**Key insight:** LLMs are good at translation, terrible at computation. Use them for what they're good at.

***

## Design decision 1: multiple specialized engines

### The trade-off

**Option A: Single general-purpose verifier**

* Pros: Simpler architecture, one dependency
* Cons: No general verifier matches domain-specific tools

**Option B: Multiple domain-specific engines** ← **CHOSEN**

* Pros: Each domain uses decades of specialized research
* Cons: More complex architecture, multiple dependencies

### The decision

**Chosen:** Multiple specialized engines

**Engines:**

* **SymPy** (math): Symbolic algebra, calculus, linear algebra
* **Z3** (logic): SAT/SMT solving, formal verification
* **AST** (code): Static analysis, syntax validation
* **SQLGlot** (SQL): Query parsing and validation
* **NLI + TF-IDF** (facts): Grounding against source documents

**Rationale:** Each domain has specialized tools refined over decades. No general-purpose verifier can match the precision of SymPy for calculus or Z3 for logic. Better to integrate experts than build a generalist.

***

## Design decision 2: TF-IDF for fact grounding

### The trade-off

**Option A: Vector embeddings (semantic similarity)**

* Pros: Captures semantic meaning; uses current embedding methods
* Cons: Probabilistic, can match opposite meanings

**Option B: TF-IDF (lexical matching)** ← **CHOSEN**

* Pros: Deterministic, reproducible, searches for evidence
* Cons: Doesn't capture semantic similarity

### The decision

**Chosen:** TF-IDF for evidence retrieval

**Example of the problem with embeddings:**

```python theme={null}
query1 = "The Earth orbits the Sun"
query2 = "The Sun orbits the Earth"

# Embeddings similarity: 0.92 (HIGH!)
# Meaning: Completely opposite
```

**TF-IDF advantage:** Searches for actual word evidence, not "vibes"

**Academic validation:** 10+ papers (2018-2025) from AAAI, ACL, Neurocomputing use TF-IDF for fact verification (FEVER dataset, etc.)

**Use case:** For verification, you need **evidence-based grounding**, not semantic similarity.

***

## Design decision 3: LLM-as-translator vs LLM-as-judge

### The trade-off

**Option A: LLM-as-judge** (e.g., GPT-4 checks GPT-3.5)

* Pros: Uses latest models, easy to implement
* Cons: Both models probabilistic → recursive hallucination

**Option B: Solver-as-judge** ← **CHOSEN**

* Pros: Deterministic, mathematical proof
* Cons: Limited to verifiable domains

### The decision

**Chosen:** Symbolic solver as judge

**Comparison:**

| Approach     | Judge    | Guarantee          | Example            |
| ------------ | -------- | ------------------ | ------------------ |
| LLM-as-judge | GPT-4    | Probabilistic      | "Probably correct" |
| QWED         | SymPy/Z3 | Mathematical proof | "Provably correct" |

**Rationale:** If both judge and generator are probabilistic, errors compound. Need deterministic judge for verification.

**Limitation acknowledged:** Only works for mathematically verifiable domains (math, logic, syntax). Not for creative writing or subjective content.

***

## Design decision 4: API design philosophy

### The trade-off

**Option A: Low-level API** (expose all solver internals)

* Pros: Maximum flexibility
* Cons: Steep learning curve, requires solver expertise

**Option B: High-level developer-friendly API** ← **CHOSEN**

* Pros: Easy to use, hides complexity
* Cons: Less control over solver behavior

### The decision

**Chosen:** High-level, developer-friendly API

**Design principles:**

1. **Single method per domain:** `verify_math()`, `verify_logic()`, `verify_code()`
2. **Domain-aware routing:** Auto-detect domain from query
3. **Rich error messages:** Explain what failed and why
4. **Fallback values:** Graceful degradation on verification failure

**Example:**

```python theme={null}
# User doesn't need to know about SymPy internals
result = client.verify_math("What is 2+2?")
print(result.verified)  # True
print(result.value)  # 4
```

**Rationale:** You shouldn't need to learn SymPy, Z3, and AST separately. QWED provides a unified interface.

***

## Design decision 5: integration patterns

### The trade-off

**Option A: Standalone verification service**

* Pros: Clean separation
* Cons: Hard to integrate with existing LLM workflows

**Option B: Framework integrations** ← **CHOSEN**

* Pros: Drops into LangChain, LlamaIndex, etc.
* Cons: More maintenance, framework dependencies

### The decision

**Chosen:** Native integrations with popular frameworks

**Integrations:**

* **LangChain:** `QWEDTool` as native LangChain tool
* **LlamaIndex:** Query engine wrapper
* **Direct API:** For custom workflows

**Rationale:** Developers already use LangChain and LlamaIndex. QWED integrates with both rather than forcing workflow changes.

***

## Design decision 6: error handling philosophy

### The trade-off

**Option A: Fail fast** (reject on any error)

* Pros: Safe, conservative
* Cons: Breaks user experience

**Option B: Graceful degradation with audit trails** ← **CHOSEN**

* Pros: Better UX, transparent failures
* Cons: May miss errors

### The decision

**Chosen:** Graceful degradation with comprehensive logging

**Pattern:**

```python theme={null}
try:
    result = verify_math(expression=query)
except VerificationError as e:
    log_error(e, audit_trail=True)
    return fallback_value  # User-defined
```

**Features:**

* **Audit trails:** Log every verification attempt
* **Retry with exponential backoff:** Handle transient LLM errors
* **Alert systems:** Notify on repeated failures
* **Fallback values:** User-defined safe defaults

**Rationale:** Production systems need resilience. Total failure is worse than controlled degradation with visibility.

***

## Design decision 7: PII masking strategy

### The trade-off

**Option A: No PII handling** (user responsibility)

* Pros: Simpler architecture
* Cons: Privacy risk, compliance issues

**Option B: Built-in PII masking** ← **CHOSEN**

* Pros: HIPAA/GDPR compliance, privacy by default
* Cons: Performance overhead, false positives

### The decision

**Chosen:** Optional built-in PII masking

**Implementation:**

* Uses Microsoft Presidio for entity detection
* Masks before sending to LLM
* Unmasks in verification step
* Logs PII detection events

**Rationale:** Healthcare, finance, legal sectors require PII protection. Making it built-in lowers adoption barrier for regulated industries.

***

## Design decision 8: calculator vs Wikipedia analogy

### The mental model

**Why this matters:** This analogy helps you understand QWED's scope.

**Calculator (Deterministic):**

* Input: 2 + 2
* Output: 4 (always)
* Verifiable: ✅

**Wikipedia (Knowledge Base):**

* Input: "Who is the president?"
* Output: Depends on edit history
* Verifiable: ❌ (subjective, changes)

**QWED is a calculator, not Wikipedia:**

* ✅ Math: Provable
* ✅ Logic: Provable
* ✅ Code syntax: Provable
* ❌ Opinions: Not provable
* ❌ Creative content: Not provable

***

## What QWED is not

### Rejected design paths

**1. Novel Research**

* **Not claiming:** New verification algorithms
* **Claiming:** Practical integration of existing solvers

**2. LLM Improvement**

* **Not claiming:** Making LLMs more accurate
* **Claiming:** Catching LLM errors deterministically

**3. General AI Safety**

* **Not claiming:** Solving jailbreaks, toxicity, bias
* **Claiming:** Verifying mathematical/logical correctness

***

## Development transparency

### AI assistance

This project was developed with assistance from generative AI tools:

**Tools used:**

* **Antigravity IDE** with **Claude Sonnet 4.5**
* **Scope:** Code implementation, documentation, test generation

**Human contributions:**

* All architectural decisions documented above
* Trade-off analysis and solver selection
* API design and integration patterns
* Error handling philosophy
* Domain modeling and verification workflows

**Validation:** All AI-generated code was reviewed, benchmarked, and validated against the QWED test suite.

***

## Lessons learned

### What worked

1. **Specialized engines:** Better than trying to build one general verifier
2. **Developer-friendly API:** Easier adoption than low-level solver access
3. **TF-IDF for grounding:** Deterministic evidence > semantic vibes
4. **Open development:** Community contributions improved design

### What we'd change

1. **Earlier framework integrations:** Should have built LangChain support sooner
2. **More domain tutorials:** Users requested more examples per domain
3. **Clearer scope communication:** "Calculator not Wikipedia" analogy should be upfront

***

## References

**Academic validation for design decisions:**

1. **TF-IDF for fact verification:**
   * FEVER Shared Task (2018): Fact verification benchmark
   * AAAI 2019: Combining fact extraction and verification
   * Neurocomputing 2023: Information retrieval for fact-checking

2. **Neurosymbolic AI:**
   * Kautz (2020): The third AI summer
   * Marcus (2020): The next decade in AI

3. **LLM verification challenges:**
   * Ji et al. (2023): Survey of hallucination in NLP
   * OpenAI (2023): GPT-4 technical report

***

## Contributing to design

Have suggestions on design improvements? Open an issue on GitHub to discuss!

**Questions welcome:**

* Why this solver over alternatives?
* Trade-offs we should reconsider?
* New design patterns?

GitHub Discussions: [https://github.com/QWED-AI/qwed-verification/discussions](https://github.com/QWED-AI/qwed-verification/discussions)
