Monitoring QWED in Production

Track QWED's performance, errors, and usage in your production environment.

Key Metrics to Track

1. Verification Success Rate

What to track:

Percentage of successful verifications
Number of failed verifications
Failure reasons

Example logging:

from datadog import statsd

result = qwed.verify(query)

if result.verified:
    statsd.increment('qwed.verification.success')
else:
    statsd.increment('qwed.verification.failure')
    statsd.increment(f'qwed.failure.{result.reason}')

Target: >99% success rate
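The target is easier to check when the success rate and the failure reasons come from the same tally. The following is a minimal sketch that assumes an in-process counter; the class `VerificationStats` and its method names are hypothetical and are not part of QWED or Datadog.

```python
from collections import Counter


class VerificationStats:
    """Hypothetical in-process tally of verification outcomes."""

    def __init__(self):
        self.successes = 0
        self.failures = 0
        self.reasons = Counter()  # failure reason -> count

    def record(self, verified, reason=None):
        """Record one verification outcome."""
        if verified:
            self.successes += 1
        else:
            self.failures += 1
            self.reasons[reason or "unknown"] += 1

    def success_rate(self):
        """Return the success percentage, or None if nothing was recorded."""
        total = self.successes + self.failures
        if total == 0:
            return None
        return 100.0 * self.successes / total

    def meets_target(self, target=99.0):
        """Return True if the success rate is above the target percentage."""
        rate = self.success_rate()
        return rate is not None and rate > target
```

In practice you would call `stats.record(result.verified, result.reason)` next to the `statsd` calls and let your metrics backend do the aggregation; the sketch only makes the arithmetic behind the target explicit.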

2. Response Time

What to track:

Average response time
p50, p95, p99 latency
Slow queries

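import logging
import time

from datadog import statsd

logger = logging.getLogger('qwed')

# perf_counter() is monotonic, so it is safer than time.time() for measuring durations
start = time.perf_counter()
result = qwed.verify(query)
duration = time.perf_counter() - start

# Report latency to monitoring, in milliseconds
statsd.timing('qwed.response_time', duration * 1000)

if duration > 5:  # Slow-query threshold, in seconds
    logger.warning(f"Slow QWED query: {duration:.2f}s")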

Target: p95 < 3 seconds
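Percentiles are normally computed by the metrics backend, but it can help to see how p50, p95, and p99 are derived from raw samples. This is a sketch using only the standard library; the function names `latency_percentiles` and `meets_latency_target` are illustrative, not part of QWED.

```python
import statistics


def latency_percentiles(durations_s):
    """Return p50, p95, and p99 (in seconds) for a list of durations."""
    if len(durations_s) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns the 99 cut points p1..p99
    cuts = statistics.quantiles(durations_s, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def meets_latency_target(durations_s, p95_target_s=3.0):
    """Return True if the p95 latency is below the target."""
    return latency_percentiles(durations_s)["p95"] < p95_target_s
```

A few slow outliers barely move the p50 but show up in the p95 and p99, which is why the target is stated on p95 and not on the average.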

3. API Quota Usage

What to track:

Daily API calls
Remaining quota
Quota usage trend

# After each call
current_quota = client.get_quota_status()

statsd.gauge('qwed.quota.used', current_quota.used)
statsd.gauge('qwed.quota.remaining', current_quota.remaining)

if current_quota.remaining < 1000:
    alert_team("QWED quota low!")
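For the usage trend, a simple linear projection is often enough to give early warning. The sketch below assumes you already store daily call counts and that the quota does not reset within the projection window; `days_until_exhausted` is a hypothetical helper, not a QWED API.

```python
def days_until_exhausted(daily_usage, remaining):
    """Estimate how many days of quota are left from recent daily call counts.

    Returns None when there is no usage to extrapolate from.
    """
    if not daily_usage:
        return None
    avg_per_day = sum(daily_usage) / len(daily_usage)
    if avg_per_day <= 0:
        return None
    return remaining / avg_per_day
```

Alerting on the projected days remaining, in addition to the fixed `remaining < 1000` threshold, gives more lead time when traffic is growing.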

4. Error Rates

What to track:

Network errors
Timeout errors
Authentication errors
Validation errors

from qwed.exceptions import *

try:
    result = qwed.verify(query)
except TimeoutError:
    statsd.increment('qwed.error.timeout')
except AuthenticationError:
    statsd.increment('qwed.error.auth')
    alert_team("QWED auth failure!")
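except Exception as e:
    # Catch-all for anything not handled above (for example network or validation errors)
    statsd.increment('qwed.error.unknown')
    logger.exception(f"QWED error: {e}")  # exception() also records the traceback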

Target: Error rate < 0.1%
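To cover all four error categories without a long chain of `except` clauses, the mapping from exception type to metric name can live in one place. This sketch uses built-in exception classes as stand-ins, because the exact classes exported by `qwed.exceptions` are not listed here; replace them with the ones your client raises. The function name `error_metric` is hypothetical.

```python
def error_metric(exc):
    """Map an exception to a metric name; the first matching category wins."""
    # Stand-in built-in exceptions; swap in the classes your QWED client raises.
    categories = (
        (TimeoutError, "timeout"),
        (ConnectionError, "network"),
        (PermissionError, "auth"),
        (ValueError, "validation"),
    )
    for exc_type, name in categories:
        if isinstance(exc, exc_type):
            return f"qwed.error.{name}"
    return "qwed.error.unknown"
```

With this in place, a single `except Exception as e:` handler can call `statsd.increment(error_metric(e))` and then re-raise or alert as needed.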

Monitoring Dashboard Example

Grafana Dashboard

{
  "dashboard": {
    "title": "QWED Monitoring",
    "panels": [
      {
        "title": "Verification Success Rate",
        "targets": [
          {
            "expr": "rate(qwed_verification_success_total[5m]) / rate(qwed_verification_total[5m]) * 100"
          }
        ]
      },
      {
        "title": "Response Time (p95)",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(qwed_response_time_bucket[5m])) by (le))"
          }
        ]
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "sum(rate(qwed_error_total[5m])) / sum(rate(qwed_verification_total[5m])) * 100"
          }
        ]
      }
    ]
  }
}

Alerting Rules

Critical Alerts

1. High Error Rate

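- alert: QWEDHighErrorRate
  # Ratio of errors to total verifications; a bare rate() is errors per second, not a percentage
  expr: sum(rate(qwed_error_total[5m])) / sum(rate(qwed_verification_total[5m])) > 0.01
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "QWED error rate above 1%"
    description: "Error rate: {{ $value | humanizePercentage }}"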

2. Slow Response Time

- alert: QWEDSlowResponses
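  # The threshold of 5 assumes the histogram records seconds; use 5000 if it records milliseconds
  expr: histogram_quantile(0.95, sum(rate(qwed_response_time_bucket[5m])) by (le)) > 5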
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "QWED p95 latency > 5s"

3. Quota Low

- alert: QWEDQuotaLow
  expr: qwed_quota_remaining < 1000
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: "QWED quota running low"
    description: "Remaining: {{ $value }}"

Logging Best Practices

Structured Logging

import json
import logging
import time

logger = logging.getLogger('qwed')

def verify_with_logging(query, user_id):
    log_data = {
        'timestamp': time.time(),
        'user_id': user_id,
        'query': query[:100],  # Truncate
    }
    
    try:
        start = time.time()
        result = qwed.verify(query)
        duration = time.time() - start
        
        log_data.update({
            'verified': result.verified,
            'duration_ms': int(duration * 1000),
            'status': 'success'
        })
        
        logger.info(json.dumps(log_data))
        return result
        
    except Exception as e:
        log_data.update({
            'status': 'error',
            'error': str(e)
        })
        logger.error(json.dumps(log_data))
        raise
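An alternative to calling `json.dumps` at every log site is to move the serialization into a formatter, so every record from the `qwed` logger comes out as one JSON object. This is a sketch built on the standard `logging` module; the class name `JsonFormatter` and the `qwed` key used with `extra` are assumptions made for illustration.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "timestamp": record.created,  # seconds since the epoch
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"qwed": {...}} become an attribute on the record
        payload.update(getattr(record, "qwed", {}))
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        # default=str keeps unusual values (dates, exceptions) from breaking serialization
        return json.dumps(payload, default=str)
```

To use it, attach the formatter to a handler with `handler.setFormatter(JsonFormatter())` and log with `logger.info("verification", extra={"qwed": {"verified": True, "duration_ms": 120}})`.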

Health Checks

Endpoint Health Check

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/health/qwed')
def qwed_health():
    try:
        # Test QWED connection
        result = qwed.verify("2+2=4", timeout=5)
        
        if result.verified:
            return jsonify({
                'status': 'healthy',
                'qwed': 'operational'
            }), 200
        else:
            return jsonify({
                'status': 'degraded',
                'qwed': 'verification_failed'
            }), 503
            
    except Exception as e:
        return jsonify({
            'status': 'unhealthy',
            'qwed': 'error',
            'error': str(e)
        }), 503
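The health logic is easier to unit test when it is separated from the web framework. The sketch below assumes only that the verifier is a callable returning an object with a `verified` attribute, as in the examples above; `check_qwed` and `fake_verify` are hypothetical names.

```python
from types import SimpleNamespace


def check_qwed(verify, probe="2+2=4"):
    """Run a probe through `verify` and return (payload, http_status)."""
    try:
        result = verify(probe)
    except Exception as exc:
        return {"status": "unhealthy", "qwed": "error", "error": str(exc)}, 503
    if result.verified:
        return {"status": "healthy", "qwed": "operational"}, 200
    return {"status": "degraded", "qwed": "verification_failed"}, 503


def fake_verify(query):
    """Stand-in verifier that exercises the check without network access."""
    return SimpleNamespace(verified=True)


payload, status = check_qwed(fake_verify)
```

Inside the Flask route, the body then reduces to calling `check_qwed` with the real verifier and returning `jsonify(payload), status`.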

Troubleshooting Alerts

When alerts fire:

1. Check the dashboard: review the metrics.
2. Check the logs: look for errors.
3. Test manually: run a test script.
4. Contact support if the issue persists.

Next: Troubleshooting Guide
