The core problem
Verification requires two things:
- LLM translates the query (natural language → structured reasoning)
- Symbolic verifier proves the answer (SymPy, Z3, AST)
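The split between the two roles can be sketched without any LLM call. The snippet below is a minimal, standard-library stand-in for the symbolic step (a real setup would rely on SymPy or Z3, as listed above). It assumes the LLM has already translated the query into a list of polynomial coefficients, then checks a proposed antiderivative by differentiating it. The function names and the coefficient-list representation are illustrative assumptions, not QWED's API.

```python
def differentiate(coeffs):
    """Differentiate a polynomial given as [c0, c1, c2, ...] (c_k multiplies x**k)."""
    return [k * c for k, c in enumerate(coeffs)][1:] or [0]

def strip_trailing_zeros(coeffs):
    """Normalize so that [0, 2] and [0, 2, 0] compare equal."""
    coeffs = list(coeffs)
    while len(coeffs) > 1 and coeffs[-1] == 0:
        coeffs.pop()
    return coeffs

def is_antiderivative(candidate, integrand):
    """True if d/dx(candidate) == integrand; the constant term C drops out."""
    return strip_trailing_zeros(differentiate(candidate)) == strip_trailing_zeros(integrand)

# Step 1 (assumed done by the LLM): "What is the integral of 2x?" -> integrand 2x
integrand = [0, 2]        # 0 + 2*x

# Step 2: deterministic check of two proposed answers
good_answer = [0, 0, 1]   # x**2 (+ C)
bad_answer = [0, 0, 2]    # 2*x**2, a plausible slip

print(is_antiderivative(good_answer, integrand))  # True
print(is_antiderivative(bad_answer, integrand))   # False
```

The point of the design is that the check never trusts the model: a wrong answer fails the derivative test regardless of how confident the LLM sounded.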
LLM accuracy comparison
Math verification example
Query: “What is the integral of 2x?”
| Model | Type | Accuracy | Typical Response |
|---|---|---|---|
| GPT-4o-mini | Cloud | ~95% | “x² + C” ✅ |
| Claude 3 Haiku | Cloud | ~93% | “x² + C” ✅ |
| Llama 3 8B | Local | ~75% | Sometimes “x² + C” ✅, sometimes “2x²/2” ❌ |
| Mistral 7B | Local | ~70% | Inconsistent, may confuse derivative/integral |
Why this matters
When QWED verifies:
When to use each
| Use Case | Local LLM (Ollama) | Cloud LLM (OpenAI/Anthropic) |
|---|---|---|
| Development/Testing | ✅ Free, fast iteration | ⚠️ Costs add up |
| Production (Critical) | ❌ Lower accuracy | ✅ Recommended |
| Privacy-Sensitive Data | ✅ 100% local + PII masking | ⚠️ Use with PII masking |
| Cost-Sensitive | ✅ $0/month | ⚠️ ~$5-50/month |
| High-Stakes Decisions | ❌ Risk of errors | ✅ Recommended |
QWED's hybrid approach
Best Practice: Use both strategically
Development setup (free)
Use for: Prototyping, experimentation, learning
Production setup (reliable)
Use for: Production, high-stakes decisions
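One way to keep the two setups side by side is to select the provider per environment. The sketch below is a hypothetical illustration: the `LLMSettings` class, the preset names, and the `APP_ENV` variable are assumptions made for this example and are not taken from QWED's configuration format. Only the idea of a switchable `provider` value comes from this page.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class LLMSettings:
    provider: str  # assumed values: "ollama" (local) or "openai" (cloud)
    model: str

# Hypothetical presets mirroring the development and production setups above.
PRESETS = {
    "development": LLMSettings(provider="ollama", model="llama3"),
    "production": LLMSettings(provider="openai", model="gpt-4o-mini"),
}

def settings_for(environment: str) -> LLMSettings:
    """Return the preset for an environment, defaulting to the free local setup."""
    return PRESETS.get(environment, PRESETS["development"])

# APP_ENV is an assumed environment variable name, not a QWED setting.
current = settings_for(os.environ.get("APP_ENV", "development"))
print(current.provider, current.model)
```

Defaulting to the local preset means an unconfigured machine never sends data to a paid API by accident.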
Cost analysis
Local LLM (Ollama)
- Setup: 10 minutes (download model)
- Monthly Cost: $0
- Accuracy: 70-80% on math/logic
- Privacy: 100% local
- Best for: Development, testing, learning
Cloud LLM (OpenAI GPT-4o-mini)
- Setup: 2 minutes (API key)
- Monthly Cost: $5-10 (with caching)
- Accuracy: 90-95% on math/logic
- Privacy: Use PII masking
- Best for: Production, critical tasks
With QWED caching (cost savings)
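How much caching saves depends on how often queries repeat. The following is a rough back-of-the-envelope sketch. The request volume, per-request price, and hit rate are illustrative assumptions, not measured QWED figures or published provider prices.

```python
def monthly_cost(requests: int, cost_per_request: float, cache_hit_rate: float) -> float:
    """Only cache misses reach the paid API, so cost scales with (1 - hit rate)."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    billable_requests = requests * (1.0 - cache_hit_rate)
    return billable_requests * cost_per_request

# Assumed workload: 100,000 requests per month at $0.0002 per request.
without_cache = monthly_cost(100_000, 0.0002, cache_hit_rate=0.0)
with_cache = monthly_cost(100_000, 0.0002, cache_hit_rate=0.5)

print(f"no cache: ${without_cache:.2f}")
print(f"50% hits: ${with_cache:.2f}")
```

Under these assumptions, a 50% hit rate halves the bill. The real saving depends entirely on how repetitive the workload is.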
Privacy considerations
Local LLM advantages
✅ Private - data stays on your machine
✅ No API keys - no third-party access
✅ Compliance - easier GDPR/HIPAA compliance
Cloud LLM with PII masking
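The idea behind masking is to replace sensitive values with placeholders before the prompt leaves the machine. The sketch below illustrates that idea with two regular expressions. The patterns and the `mask_pii` helper are assumptions for this example, not QWED's masking implementation, and real deployments need far broader coverage than email addresses and US-style SSNs.

```python
import re

# Illustrative patterns only; production masking needs a vetted PII detector.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def mask_pii(text: str) -> str:
    """Replace each match with a placeholder such as <EMAIL>."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

prompt = "Patient jane.doe@example.com (SSN 123-45-6789) owes 15% of $200. How much?"
print(mask_pii(prompt))
# Patient <EMAIL> (SSN <SSN>) owes 15% of $200. How much?
```

The numbers needed for the calculation survive the masking, so the cloud model can still answer the question without seeing who it is about.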
Recommendation by use case
Healthcare (HIPAA)
Finance (PCI-DSS)
Enterprise (general)
The QWED advantage
Even with local LLMs, QWED catches errors!
Scenario: local LLM makes a mistake
- More failures = worse UX
- Cloud LLMs = fewer verification failures = better UX
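To make the scenario concrete, here is a minimal sketch of a deterministic check rejecting a wrong answer. It uses an arithmetic question as a stand-in and evaluates the expression by walking Python's `ast` tree instead of calling `eval()`. The `safe_eval` and `verify` helpers are hypothetical names for this example, and the translation of the question into an expression is assumed to have been done by the LLM.

```python
import ast
import operator

# Whitelisted operators: any other node type in the expression is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def safe_eval(expression: str) -> float:
    """Evaluate a plain arithmetic expression by walking its AST (no eval())."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval"))

def verify(expression: str, llm_answer: float, tol: float = 1e-9) -> bool:
    """Accept the LLM's answer only if it matches the deterministic result."""
    return abs(safe_eval(expression) - llm_answer) <= tol

# "What is 15% of 200?" -> "0.15 * 200" is assumed to come from the LLM.
print(verify("0.15 * 200", 30))  # True: the correct answer is accepted
print(verify("0.15 * 200", 35))  # False: the wrong answer is rejected
```

A rejected answer is safe but still costs the user a retry, which is why the failure rate of the underlying model matters for UX.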
Accuracy in practice
From QWED internal testing:
| Domain | Local LLM (Llama 3 8B) | Cloud LLM (GPT-4o-mini) |
|---|---|---|
| Basic Math | 85% | 98% |
| Calculus | 75% | 95% |
| Logic (SAT) | 70% | 93% |
| Code Security | 80% | 96% |
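These percentages translate directly into how often a user sees a rejected answer. The short calculation below converts them into wrong answers per 1,000 queries, under the simplifying assumption that the verifier rejects every incorrect answer and that each query is answered once.

```python
# Accuracy figures copied from the table above (percent correct): (local, cloud).
accuracy = {
    "Basic Math": (85, 98),
    "Calculus": (75, 95),
    "Logic (SAT)": (70, 93),
    "Code Security": (80, 96),
}

def rejected_per_1000(accuracy_percent: int) -> int:
    """Wrong answers per 1,000 queries, assuming the verifier rejects every one."""
    return (100 - accuracy_percent) * 10

for domain, (local, cloud) in accuracy.items():
    print(f"{domain}: local {rejected_per_1000(local)}, cloud {rejected_per_1000(cloud)}")
```

For calculus, for example, that is 250 rejections per 1,000 queries with the local model against 50 with the cloud model.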
Bottom line
Start with local LLM
Scale to cloud LLM
Related documentation
- LLM configuration guide - complete LLM setup
- PII masking guide - privacy protection
- Caching guide - cost savings
FAQ
Q: Can I use Llama 3 70B instead of GPT-4?
A: Yes! Larger local models (70B+) approach cloud accuracy but require significant hardware (40GB+ VRAM).
Q: Is Ollama really free?
A: Yes! Fully open source. You just need hardware to run it.
Q: What about Google Gemini?
A: QWED supports Gemini! Similar accuracy to GPT-4/Claude.
Q: Can I switch between local and cloud?
A: Absolutely! Change the `provider` parameter anytime.
Q: Do I need PII masking with local LLMs?
A: Not necessarily, but it's still good practice for audit trails.
The choice is yours - QWED works with both!
Recommendation: Start local (free), scale to cloud (reliable) when it matters.