Consensus

Multi-model verification for high-stakes decisions.

Overview

Consensus sends the same prompt to multiple LLM providers in parallel, compares the responses for semantic similarity, and only returns a result when the models agree. If the models disagree, the request is flagged for human review or rejected outright.

This matters most in domains where a single hallucination can cause real harm, such as medical advice and financial analysis.

If two or three independent models all produce the same answer, a shared hallucination is far less likely than an error from any single model, because every model would have to make the same mistake. Consensus trades cost for confidence.
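To put rough numbers on that trade-off, here is a back-of-the-envelope sketch in Python. It assumes each model errs independently at a known rate; the 5% error rate and the function name are illustrative assumptions, not measured values. Real models share training data, so their errors are correlated and the true figure is likely higher than this estimate.

```python
def all_wrong_probability(error_rates):
    """Probability that every model is wrong at once, assuming independent errors.

    Under that assumption this is an upper bound on a *shared* hallucination,
    because the wrong answers would also have to match each other.
    """
    prob = 1.0
    for rate in error_rates:
        prob *= rate
    return prob


# Illustrative numbers only: three models, each wrong 5% of the time.
single = 0.05
all_three = all_wrong_probability([single, single, single])
print(f"one model wrong:  {single:.4%}")
print(f"all three wrong:  {all_three:.4%}")
```

Under these assumptions the combined error probability is the product of the individual rates, which is why adding a second or third model helps far more than a marginal improvement to any single one.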

Configuration

Create a consensus configuration via the API. Each config specifies which models to fan out to, the similarity threshold, and the decision logic.

curl -X POST http://localhost:4200/api/consensus/configs \
  -H "Authorization: Bearer sy_admin_..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "medical-verification",
    "models": ["gpt-4o", "claude-sonnet-4-20250514", "gemini-1.5-pro"],
    "threshold": 0.85,
    "decision": "majority",
    "on_disagreement": "reject",
    "timeout_ms": 30000
  }'

Configuration fields:

| Field | Type | Description |
| --- | --- | --- |
| name | string | Unique identifier for this config |
| models | string[] | List of models to query in parallel |
| threshold | float | Similarity score required for agreement (0.0–1.0) |
| decision | string | unanimous or majority |
| on_disagreement | string | reject, flag, or return_best |
| timeout_ms | int | Max wait time for all models to respond |

Or configure via stockyard.yaml:

# stockyard.yaml
apps:
  consensus:
    configs:
      - name: medical-verification
        models: [gpt-4o, claude-sonnet-4-20250514, gemini-1.5-pro]
        threshold: 0.85
        decision: majority
        on_disagreement: reject
        timeout_ms: 30000
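
Whichever route you use, it can help to sanity-check a config before submitting it. The following is a minimal client-side sketch, not part of Stockyard: the ConsensusConfig class and its validate method are hypothetical, the allowed values come from the field table above, and the "at least two models" rule is an assumption based on the need to compare responses.

```python
from dataclasses import dataclass

DECISIONS = {"unanimous", "majority"}
ON_DISAGREEMENT = {"reject", "flag", "return_best"}


@dataclass
class ConsensusConfig:
    """Hypothetical client-side mirror of the documented config fields."""
    name: str
    models: list
    threshold: float
    decision: str
    on_disagreement: str
    timeout_ms: int

    def validate(self):
        """Return a list of problems; an empty list means the config looks sane."""
        problems = []
        if not self.name:
            problems.append("name must be non-empty")
        if len(self.models) < 2:
            # Assumption: comparing responses needs at least two models.
            problems.append("models needs at least two entries to compare")
        if not 0.0 <= self.threshold <= 1.0:
            problems.append("threshold must be between 0.0 and 1.0")
        if self.decision not in DECISIONS:
            problems.append(f"decision must be one of {sorted(DECISIONS)}")
        if self.on_disagreement not in ON_DISAGREEMENT:
            problems.append(f"on_disagreement must be one of {sorted(ON_DISAGREEMENT)}")
        if self.timeout_ms <= 0:
            problems.append("timeout_ms must be positive")
        return problems


config = ConsensusConfig(
    name="medical-verification",
    models=["gpt-4o", "claude-sonnet-4-20250514", "gemini-1.5-pro"],
    threshold=0.85,
    decision="majority",
    on_disagreement="reject",
    timeout_ms=30000,
)
print(config.validate())
```

Returning a list of problems rather than raising on the first one lets a caller report every mistake in a config at once.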

How It Works

1. Fan-out. When a request hits the proxy with a consensus config attached, Stockyard sends the identical prompt to every model in the config simultaneously. All requests execute in parallel to minimize latency.

2. Similarity comparison. Once all responses arrive (or the timeout is reached), Stockyard computes pairwise semantic similarity between every response pair. The similarity engine uses embedding-based comparison, not simple string matching.

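3. Decision logic. The decision setting determines how many models must agree: unanimous requires every model to agree, while majority requires more than half of them. Two responses count as agreeing when their similarity score meets the threshold.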

If consensus is reached, the response from the highest-ranked model is returned. If not, the on_disagreement action fires.
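
The three steps can be sketched end to end in Python. This is an illustration under stated assumptions, not Stockyard's implementation: call_model is a hypothetical stub that returns canned toy embeddings, cosine similarity stands in for whatever metric the similarity engine actually uses, and the reading of unanimous and majority in has_consensus is an assumed interpretation.

```python
import asyncio
import math
from itertools import combinations


async def call_model(model, prompt):
    """Hypothetical stand-in for a provider call; returns (text, embedding)."""
    await asyncio.sleep(0.01)  # simulate network latency
    canned = {  # toy embeddings, for illustration only
        "model-a": [1.0, 0.0, 0.0],
        "model-b": [0.9, 0.1, 0.0],
        "model-c": [0.0, 1.0, 0.0],
    }
    return f"[{model}] answer to: {prompt}", canned[model]


async def fan_out(models, prompt, timeout_ms):
    """Step 1: send the same prompt to every model concurrently."""
    tasks = {m: asyncio.create_task(call_model(m, prompt)) for m in models}
    done, pending = await asyncio.wait(tasks.values(), timeout=timeout_ms / 1000)
    for task in pending:
        task.cancel()  # stop waiting on providers that missed the timeout
    return {m: t.result() for m, t in tasks.items()
            if t in done and t.exception() is None}


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def pairwise_similarity(embeddings):
    """Step 2: score every unordered pair of responses (keys are sorted pairs)."""
    return {(m1, m2): cosine_similarity(embeddings[m1], embeddings[m2])
            for m1, m2 in combinations(sorted(embeddings), 2)}


def has_consensus(models, scores, threshold, decision):
    """Step 3: apply the decision mode (assumed interpretation).

    unanimous: every pair of models meets the threshold.
    majority:  some group of more than half the models agrees pairwise.
    Expects a score for every pair, i.e. every model responded in time.
    """
    def all_agree(group):
        return all(scores[tuple(sorted(pair))] >= threshold
                   for pair in combinations(group, 2))

    if decision == "unanimous":
        return all_agree(models)
    if decision == "majority":
        needed = len(models) // 2 + 1
        return any(all_agree(group) for group in combinations(models, needed))
    raise ValueError(f"unknown decision mode: {decision}")


models = ["model-a", "model-b", "model-c"]
responses = asyncio.run(fan_out(models, "ping", timeout_ms=30000))
scores = pairwise_similarity({m: emb for m, (_, emb) in responses.items()})
print(has_consensus(models, scores, 0.85, "unanimous"))  # False
print(has_consensus(models, scores, 0.85, "majority"))   # True
```

With the toy vectors, model-a and model-b point in nearly the same direction while model-c does not, so two of three models agree: enough for majority, not for unanimous. Handling of partial results after a timeout is left out of the sketch.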

Making a Consensus Request

Add the X-Stockyard-Consensus header to any chat completion request:

curl http://localhost:4200/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "X-Stockyard-Consensus: medical-verification" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Is ibuprofen safe to take with warfarin?"}]
  }'

The model field in the request body is ignored when consensus is active — the config's model list takes precedence.
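
The same request can be built from Python using only the standard library. This is a sketch that mirrors the curl call above; the helper names build_consensus_request and send are hypothetical, and the response is returned as parsed JSON without assuming any particular shape, since the consensus response format is not described here.

```python
import json
import os
import urllib.request

BASE_URL = "http://localhost:4200"  # same local proxy address as the curl example


def build_consensus_request(config_name, messages, api_key):
    """Build a chat completion request routed through a consensus config."""
    # The model field is still sent, but the config's model list takes precedence.
    body = json.dumps({"model": "gpt-4o", "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "X-Stockyard-Consensus": config_name,
            "Content-Type": "application/json",
        },
    )


def send(request, timeout_s=30):
    """Send the request and return the parsed JSON body."""
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)


request = build_consensus_request(
    "medical-verification",
    [{"role": "user", "content": "Is ibuprofen safe to take with warfarin?"}],
    os.environ.get("OPENAI_API_KEY", ""),
)
print(request.get_method(), request.full_url)
```

Call send(request) once the proxy is running. Building and sending are kept separate so the headers can be inspected or logged before any traffic leaves the client.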

Tuning the Threshold

The threshold value controls how similar responses must be to count as agreement:

| Threshold | Behavior | Use Case |
| --- | --- | --- |
| 0.95 | Near-identical responses required | Factual lookups, yes/no answers |
| 0.85 | Same meaning, different wording OK | Medical advice, financial analysis |
| 0.70 | Broadly similar conclusions | Creative tasks, summaries |

Tip: Start with 0.85 and adjust based on your disagreement rate. Check /api/consensus/stats to see how often consensus fails for a given config.
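
To see how sensitive the agreement rate is to the threshold, you can replay recorded similarity scores against candidate values. The sketch below uses made-up scores and a hypothetical agreement_rate helper. It assumes unanimous-style agreement, where a request passes only if its weakest pairwise similarity meets the threshold.

```python
def agreement_rate(min_pair_scores, threshold):
    """Fraction of requests whose weakest pairwise similarity meets the threshold."""
    passed = sum(1 for score in min_pair_scores if score >= threshold)
    return passed / len(min_pair_scores)


# Made-up lowest pairwise similarity per request, for illustration only.
min_pair_scores = [0.97, 0.93, 0.91, 0.88, 0.86, 0.84, 0.79, 0.72, 0.96, 0.90]

for threshold in (0.95, 0.85, 0.70):
    print(threshold, agreement_rate(min_pair_scores, threshold))
```

On this sample, moving from 0.95 to 0.85 raises the agreement rate from 20% to 70%, which shows how much a small threshold change can shift the volume of rejected or flagged requests.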

Cost Implications

Consensus multiplies your per-request cost by the number of models in the config. A 3-model consensus config costs approximately 3x per request. For most teams, this is reserved for high-stakes paths — not every request.
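
The effect of limiting consensus to part of your traffic is simple arithmetic. The sketch below uses a hypothetical blended_cost_multiplier helper and assumes every model costs about the same per request, the same simplification behind the "approximately 3x" figure.

```python
def blended_cost_multiplier(models_in_config, consensus_fraction):
    """Average cost multiplier when only part of the traffic uses consensus.

    Consensus requests cost models_in_config times a normal request;
    all other requests cost 1x.
    """
    if not 0.0 <= consensus_fraction <= 1.0:
        raise ValueError("consensus_fraction must be between 0.0 and 1.0")
    return consensus_fraction * models_in_config + (1.0 - consensus_fraction)


print(blended_cost_multiplier(3, 1.0))  # every request verified: 3.0
print(round(blended_cost_multiplier(3, 0.1), 2))  # 10% of requests verified
```

Routing 10% of requests through a 3-model config raises the average cost by about 20%, compared with 200% when every request is verified.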

Strategies to manage cost:

Cost tracking: Observe tracks each sub-request individually, so you get full visibility into consensus costs via the cost dashboard.

API Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/consensus/configs | List all consensus configurations |
| POST | /api/consensus/configs | Create a new consensus config |
| GET | /api/consensus/configs/:id | Get a specific config by ID |
| PUT | /api/consensus/configs/:id | Update an existing config |
| DELETE | /api/consensus/configs/:id | Delete a config |
| GET | /api/consensus/stats | Agreement rates and latency stats |
| GET | /api/consensus/history | Recent consensus decisions with details |

Example: Medical Chatbot

A medical chatbot needs to verify drug interaction answers before showing them to patients. Here is the full setup:

# 1. Create the consensus config
curl -X POST http://localhost:4200/api/consensus/configs \
  -H "Authorization: Bearer sy_admin_..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "drug-interactions",
    "models": ["gpt-4o", "claude-sonnet-4-20250514", "gemini-1.5-pro"],
    "threshold": 0.90,
    "decision": "unanimous",
    "on_disagreement": "reject",
    "timeout_ms": 15000
  }'

# 2. Send a verified request
curl http://localhost:4200/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "X-Stockyard-Consensus: drug-interactions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [
      {"role": "system", "content": "You are a pharmacist assistant."},
      {"role": "user", "content": "Can I take ibuprofen with warfarin?"}
    ]
  }'

# 3. Check consensus stats
curl "http://localhost:4200/api/consensus/stats?config=drug-interactions" \
  -H "Authorization: Bearer sy_admin_..."

Response:

{
  "config": "drug-interactions",
  "total_requests": 1247,
  "consensus_reached": 1198,
  "consensus_rate": 0.961,
  "avg_latency_ms": 2840,
  "disagreements": 49,
  "avg_similarity": 0.93
}

With a 96% consensus rate, the system catches the 4% of responses where models disagree — exactly the cases most likely to contain errors.
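
A stats payload like this is easy to monitor from a script. The sketch below parses the response shown above and flags a config whose disagreement rate exceeds a chosen limit. The helper names and the 5% default limit are illustrative assumptions, not Stockyard settings.

```python
import json

# The stats payload from the example above, pasted verbatim.
stats = json.loads("""
{
  "config": "drug-interactions",
  "total_requests": 1247,
  "consensus_reached": 1198,
  "consensus_rate": 0.961,
  "avg_latency_ms": 2840,
  "disagreements": 49,
  "avg_similarity": 0.93
}
""")


def disagreement_rate(stats):
    """Share of requests where the models failed to reach consensus."""
    return stats["disagreements"] / stats["total_requests"]


def needs_attention(stats, max_disagreement_rate=0.05):
    """Flag a config whose disagreement rate exceeds the chosen limit."""
    return disagreement_rate(stats) > max_disagreement_rate


print(f"disagreement rate: {disagreement_rate(stats):.1%}")
print("needs attention:", needs_attention(stats))
```

A rising disagreement rate can mean the threshold is too strict for the kind of prompt being sent, or that one model in the config has started answering differently. Either case is worth reviewing before changing the threshold.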