Determinism guarantee - QWED Documentation

TL;DR: QWED uses deterministic solvers (SymPy, Z3, AST, SQLGlot) wherever possible. When LLM fallback is required, the response is explicitly marked as HEURISTIC.

Engine classification

Every QWED verification engine falls into one of three categories:

Category	Engines	Technology	Reproducible?
100% Symbolic	Math, Logic, Code, SQL, Schema, Taint	SymPy, Z3, AST, SQLGlot	✅ Yes
Hybrid	Fact, Stats	TF-IDF / Sandbox → LLM fallback	⚠️ Conditional
Heuristic	Image, Consensus, Reasoning	VLM / Multi-LLM voting	❌ No

Detailed breakdown

Engine	Mode	Primary Technology	LLM Fallback?
Math	Symbolic	SymPy (symbolic algebra)	❌ Never
Logic	Symbolic	Z3 Theorem Prover (SMT)	❌ Never
Code	Symbolic	Python AST + Bandit	❌ Never
SQL	Symbolic	SQLGlot AST parser	❌ Never
Schema	Symbolic	Pydantic + SymPy	❌ Never
Taint	Symbolic	Data Flow Analysis (AST)	❌ Never
Fact	Hybrid	TF-IDF similarity	✅ When TF-IDF confidence < threshold
Stats	Hybrid	Wasm/Docker sandbox	✅ When sandbox execution fails
Image	Heuristic	Vision LLM (GPT-4V, Claude)	✅ Always
Consensus	Heuristic	Multi-LLM voting	✅ Always
Reasoning	Heuristic	Chain-of-thought LLM	✅ Always

How to know which mode was used

Every API response includes a verification_mode field:

{
  "status": "VERIFIED",
  "verified": true,
  "engine": "math",
  "verification_mode": "SYMBOLIC",
  "result": { ... }
}

`verification_mode` values

Value	Meaning	Trust Level
`SYMBOLIC`	Result proven by deterministic solver (SymPy/Z3/AST)	🟢 100% reproducible
`HEURISTIC`	Result from LLM fallback or multi-model voting	🟡 Best-effort, not guaranteed

Why this matters

For regulated industries

Banks, healthcare, and legal systems require audit trails and reproducibility:

✅ SYMBOLIC results can be independently verified
⚠️ HEURISTIC results should be flagged for human review

For production systems

result = client.verify_math("2+2=4")

if result.verification_mode == "SYMBOLIC":
    # Safe to use without human review
    process_result(result)
elif result.verification_mode == "HEURISTIC":
    # Flag for human review or additional validation
    queue_for_review(result)

Trust boundaries in translated queries

When a natural language query is verified through the math pipeline, the response status is INCONCLUSIVE — even when the expression evaluation itself is fully deterministic. This is because the LLM translation step (natural language → expression) is not formally verified. Each response includes a trust_boundary object that makes this explicit:

result = client.verify("What is 15% of 200?")

print(result.status)  # "INCONCLUSIVE"
print(result.trust_boundary["deterministic_expression_evaluation"])  # True
print(result.trust_boundary["query_semantics_verified"])  # False

Use the trust_boundary fields to decide how to handle the result in your application. For example, you might accept the computed answer while logging that the translation was not formally proven. See Math Engine — Trust boundary for the full field reference.

Design philosophy

“Probabilistic systems should not be trusted with deterministic tasks.”

QWED treats LLMs as untrusted translators:

LLM converts natural language → formal specification
Deterministic solver verifies the formal specification
If solver cannot verify, result is marked HEURISTIC
If the translation itself cannot be formally verified, result is marked INCONCLUSIVE with a trust_boundary

┌─────────────────────────────────────────────────────────┐
│                    QWED Protocol                        │
├─────────────────────────────────────────────────────────┤
│  User Query → LLM (Translator) → DSL → Solver → Result │
│                                                         │
│  Solver = SymPy/Z3/AST  →  SYMBOLIC                     │
│  Solver = LLM Fallback  →  HEURISTIC                    │
│  Translation unverified  →  INCONCLUSIVE + trust_boundary│
└─────────────────────────────────────────────────────────┘

Trust boundaries in the natural language pipeline

When a query enters through POST /verify/natural_language, an LLM first translates the user’s natural language into a formal expression. Even when the downstream solver (e.g., SymPy) evaluates that expression deterministically, the overall result depends on whether the LLM correctly interpreted the user’s intent — and that step is not deterministic. To make this distinction explicit, the natural language math pipeline now:

Returns INCONCLUSIVE as the top-level status instead of VERIFIED, even when the expression evaluation succeeds.
Includes a trust_boundary object in the response that describes exactly what was and was not proven.

{
  "trust_boundary": {
    "query_interpretation_source": "llm_translation",
    "query_semantics_verified": false,
    "verification_scope": "translated_expression_only",
    "deterministic_expression_evaluation": true,
    "formal_proof": false,
    "translation_claim_self_consistent": true,
    "provider_used": "openai",
    "overall_status": "INCONCLUSIVE"
  }
}

This means a VERIFIED result from the Math engine’s direct endpoint (POST /verify/math) carries stronger guarantees than the same calculation routed through the natural language pipeline. See the Math engine documentation for the full field reference.

Numerical sampling fallback

The identity verification fallback (numerical sampling at fixed test points) now fails closed — returning BLOCKED with is_equivalent: false and confidence: 0.0. Previously this returned UNKNOWN with is_equivalent: null (and before that, LIKELY_EQUIVALENT with confidence: 0.99). Matching at a handful of sample points does not constitute a formal proof, so the engine now rejects the result outright rather than leaving the outcome ambiguous.

Frequently asked questions

Q: Can I filter responses by verification_mode?

A: Yes! Use the require_symbolic option in your request:

result = client.verify(
    query="Calculate NPV",
    options={"require_symbolic": True}
)
# Fails if solver cannot verify deterministically

Q: Why is Fact verification sometimes HEURISTIC?

A: Fact verification uses TF-IDF (deterministic) to find evidence. If TF-IDF confidence is below threshold, it falls back to NLI (LLM-based), marking the result as HEURISTIC.

Q: Is Consensus useful if it’s HEURISTIC?

A: Yes! Consensus detects disagreement between multiple LLMs. While not deterministic, it catches cases where models are uncertain — useful as a safety net, not a formal verifier.

​Engine classification

​Detailed breakdown

​How to know which mode was used

​verification_mode values

​Why this matters

​For regulated industries

​For production systems

​Trust boundaries in translated queries

​Design philosophy

​Trust boundaries in the natural language pipeline

​Numerical sampling fallback

​Frequently asked questions

​Q: Can I filter responses by verification_mode?

​Q: Why is Fact verification sometimes HEURISTIC?

​Q: Is Consensus useful if it’s HEURISTIC?

​See also