Consensus
Multi-model verification for high-stakes decisions.
Overview
Consensus sends the same prompt to multiple LLM providers in parallel, compares the responses for semantic similarity, and returns a result only when the models agree. If the models disagree, the request is handled according to the config's on_disagreement setting, for example flagged for human review or rejected outright.
This is critical for domains where a single hallucination can cause real harm:
- Medical: Drug interactions, dosage recommendations, diagnostic suggestions
- Financial: Investment advice, regulatory interpretations, risk assessments
- Legal: Contract analysis, compliance guidance, case law citations
When two or three independently trained models produce the same answer, a shared hallucination is much less likely than a single-model error, although models trained on overlapping data can still share mistakes. Consensus trades cost for confidence.
Configuration
Create a consensus configuration via the API. Each config specifies which models to fan out to, the similarity threshold, and the decision logic.
curl -X POST http://localhost:4200/api/consensus/configs \
  -H "Authorization: Bearer sy_admin_..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "medical-verification",
    "models": ["gpt-4o", "claude-sonnet-4-20250514", "gemini-1.5-pro"],
    "threshold": 0.85,
    "decision": "majority",
    "on_disagreement": "reject",
    "timeout_ms": 30000
  }'
Configuration fields:
| Field | Type | Description |
|---|---|---|
| name | string | Unique identifier for this config |
| models | string[] | List of models to query in parallel |
| threshold | float | Similarity score required for agreement (0.0–1.0) |
| decision | string | unanimous or majority |
| on_disagreement | string | reject, flag, or return_best |
| timeout_ms | int | Max wait time for all models to respond, in milliseconds |
Or configure via stockyard.yaml:
# stockyard.yaml
apps:
  consensus:
    configs:
      - name: medical-verification
        models: [gpt-4o, claude-sonnet-4-20250514, gemini-1.5-pro]
        threshold: 0.85
        decision: majority
        on_disagreement: reject
        timeout_ms: 30000
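The field constraints in the table above can be checked before a config is submitted. The following is a minimal Python sketch, not part of Stockyard: the function name `validate_consensus_config` is hypothetical, and the rule that a config needs at least two models is an assumption based on the fact that agreement requires something to compare against.

```python
from typing import Any

# Allowed values taken from the configuration field table.
ALLOWED_DECISIONS = ("unanimous", "majority")
ALLOWED_ON_DISAGREEMENT = ("reject", "flag", "return_best")


def validate_consensus_config(config: dict[str, Any]) -> list[str]:
    """Return a list of problems found in a consensus config (empty if none)."""
    errors: list[str] = []

    name = config.get("name")
    if not isinstance(name, str) or not name:
        errors.append("name must be a non-empty string")

    models = config.get("models")
    if not isinstance(models, list) or not all(isinstance(m, str) for m in models):
        errors.append("models must be a list of strings")
    elif len(models) < 2:
        # Assumption: at least two models are needed to measure agreement.
        errors.append("models must list at least two models")

    threshold = config.get("threshold")
    if isinstance(threshold, bool) or not isinstance(threshold, (int, float)):
        errors.append("threshold must be a number")
    elif not 0.0 <= threshold <= 1.0:
        errors.append("threshold must be between 0.0 and 1.0")

    if config.get("decision") not in ALLOWED_DECISIONS:
        errors.append("decision must be 'unanimous' or 'majority'")

    if config.get("on_disagreement") not in ALLOWED_ON_DISAGREEMENT:
        errors.append("on_disagreement must be 'reject', 'flag', or 'return_best'")

    timeout_ms = config.get("timeout_ms")
    if isinstance(timeout_ms, bool) or not isinstance(timeout_ms, int) or timeout_ms <= 0:
        errors.append("timeout_ms must be a positive integer")

    return errors
```

Calling it on the medical-verification config shown earlier should return an empty list; a threshold of 1.5 or an unknown decision value would each add one message.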
How It Works
1. Fan-out. When a request hits the proxy with a consensus config attached, Stockyard sends the identical prompt to every model in the config simultaneously. All requests execute in parallel to minimize latency.
2. Similarity comparison. Once all responses arrive (or the timeout is reached), Stockyard computes pairwise semantic similarity between every response pair. The similarity engine uses embedding-based comparison, not simple string matching.
3. Decision logic. Based on the decision setting:
- unanimous: All models must agree above the threshold. Strictest mode.
- majority: A majority of models must agree above the threshold. More tolerant of one outlier.
If consensus is reached, the response from the highest-ranked model is returned. If not, the on_disagreement action fires.
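The comparison and decision steps can be illustrated with a small Python sketch. It is not Stockyard's implementation: it assumes each response has already been turned into an embedding vector, uses cosine similarity as the pairwise score, and interprets "agree" as every pair in a group meeting the threshold. The function names are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def agreeing_group(embeddings: dict[str, list[float]], threshold: float) -> list[str]:
    """Largest group of models whose responses all pairwise meet the threshold.

    Brute force over subsets, which is fine for the two or three models
    a consensus config typically lists.
    """
    models = list(embeddings)
    for size in range(len(models), 1, -1):
        for group in combinations(models, size):
            if all(
                cosine_similarity(embeddings[a], embeddings[b]) >= threshold
                for a, b in combinations(group, 2)
            ):
                return list(group)
    return []


def consensus_reached(
    embeddings: dict[str, list[float]], threshold: float, decision: str
) -> bool:
    """Apply the 'unanimous' or 'majority' rule to the agreeing group."""
    total = len(embeddings)
    agreeing = len(agreeing_group(embeddings, threshold))
    if decision == "unanimous":
        return total >= 2 and agreeing == total
    if decision == "majority":
        return agreeing > total / 2
    raise ValueError(f"unknown decision mode: {decision!r}")
```

With three toy embeddings where two point in nearly the same direction and the third is orthogonal, `majority` passes at a threshold of 0.85 while `unanimous` does not, which matches the "tolerant of one outlier" description.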
Making a Consensus Request
Add the X-Stockyard-Consensus header to any chat completion request:
curl http://localhost:4200/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "X-Stockyard-Consensus: medical-verification" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Is ibuprofen safe to take with warfarin?"}]
  }'
The model field in the request body is ignored when consensus is active — the config's model list takes precedence.
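The same request can be built from application code. This is a minimal sketch using only Python's standard library; the URL and header names come from the curl example above, while `build_consensus_request` and `send` are hypothetical helper names, and the shape of the response body is not assumed.

```python
import json
import urllib.request

# URL taken from the curl example; adjust for your deployment.
STOCKYARD_URL = "http://localhost:4200/v1/chat/completions"


def build_consensus_request(
    config_name: str,
    messages: list[dict[str, str]],
    api_key: str,
    model: str = "gpt-4o",
) -> urllib.request.Request:
    """Build a chat completion request that opts in to a consensus config.

    The model field is still sent for compatibility, but the config's
    model list takes precedence when consensus is active.
    """
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        STOCKYARD_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "X-Stockyard-Consensus": config_name,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout_s: float = 35.0) -> dict:
    """Send the request and return the parsed JSON body.

    The client timeout is set slightly above the config's timeout_ms
    (30000 in the example) so the proxy can answer first.
    """
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the header logic easy to unit test without a running proxy.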
Tuning the Threshold
The threshold value controls how similar responses must be to count as agreement:
| Threshold | Behavior | Use Case |
|---|---|---|
| 0.95 | Near-identical responses required | Factual lookups, yes/no answers |
| 0.85 | Same meaning, different wording OK | Medical advice, financial analysis |
| 0.70 | Broadly similar conclusions | Creative tasks, summaries |
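One way to reason about these thresholds is to replay similarity scores from past requests and see how many would have failed at each level. The sketch below assumes you have recorded, for each request, the lowest pairwise similarity among the responses (how you collect those scores is up to you), and that a request fails when that lowest score falls below the threshold, as in unanimous mode. The function names are hypothetical.

```python
def disagreement_rate(min_similarities: list[float], threshold: float) -> float:
    """Fraction of past requests whose lowest pairwise similarity is below threshold."""
    if not min_similarities:
        return 0.0
    failures = sum(1 for score in min_similarities if score < threshold)
    return failures / len(min_similarities)


def sweep_thresholds(
    min_similarities: list[float],
    thresholds: tuple[float, ...] = (0.70, 0.85, 0.95),
) -> dict[float, float]:
    """Disagreement rate at each candidate threshold."""
    return {t: disagreement_rate(min_similarities, t) for t in thresholds}
```

For example, with recorded scores `[0.97, 0.91, 0.88, 0.80, 0.65]`, one of five requests fails at 0.70, two fail at 0.85, and four fail at 0.95.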
Start with 0.85 and adjust based on your disagreement rate. Check /api/consensus/stats to see how often consensus fails for a given config.
Cost Implications
Consensus multiplies your per-request cost by the number of models in the config. A 3-model consensus config costs approximately 3x per request. For most teams, this is reserved for high-stakes paths — not every request.
Strategies to manage cost:
- Use consensus only on specific routes or prompt categories
- Mix expensive and cheap models (e.g., GPT-4o + two smaller models)
- Enable caching — repeated identical prompts only trigger consensus once
- Set a costcap module limit to prevent runaway spend
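The cost arithmetic can be made concrete with a short Python sketch. The per-model prices and model names below are placeholders, not real pricing, and the sketch assumes a cache hit costs nothing because no model is called.

```python
def consensus_cost(per_model_cost: dict[str, float], cache_hit_rate: float = 0.0) -> float:
    """Expected cost of one consensus request.

    Every model in the config is billed on a cache miss; a cache hit is
    assumed to cost nothing.
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0.0 and 1.0")
    full_cost = sum(per_model_cost.values())
    return full_cost * (1.0 - cache_hit_rate)


# Hypothetical per-request prices in dollars, for illustration only.
three_large = {"large-a": 0.010, "large-b": 0.010, "large-c": 0.010}
mixed = {"large-a": 0.010, "small-a": 0.001, "small-b": 0.001}
```

Under these placeholder prices, three large models cost about 3x a single large-model request, the mixed config costs about 1.2x, and a 50% cache hit rate halves either figure.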
API Endpoints
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/consensus/configs | List all consensus configurations |
| POST | /api/consensus/configs | Create a new consensus config |
| GET | /api/consensus/configs/:id | Get a specific config by ID |
| PUT | /api/consensus/configs/:id | Update an existing config |
| DELETE | /api/consensus/configs/:id | Delete a config |
| GET | /api/consensus/stats | Agreement rates and latency stats |
| GET | /api/consensus/history | Recent consensus decisions with details |
Example: Medical Chatbot
A medical chatbot needs to verify drug interaction answers before showing them to patients. Here is the full setup:
# 1. Create the consensus config
curl -X POST http://localhost:4200/api/consensus/configs \
  -H "Authorization: Bearer sy_admin_..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "drug-interactions",
    "models": ["gpt-4o", "claude-sonnet-4-20250514", "gemini-1.5-pro"],
    "threshold": 0.90,
    "decision": "unanimous",
    "on_disagreement": "reject",
    "timeout_ms": 15000
  }'

# 2. Send a verified request
curl http://localhost:4200/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "X-Stockyard-Consensus: drug-interactions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [
      {"role": "system", "content": "You are a pharmacist assistant."},
      {"role": "user", "content": "Can I take ibuprofen with warfarin?"}
    ]
  }'

# 3. Check consensus stats (URL quoted so the shell does not interpret "?")
curl "http://localhost:4200/api/consensus/stats?config=drug-interactions" \
  -H "Authorization: Bearer sy_admin_..."
{
  "config": "drug-interactions",
  "total_requests": 1247,
  "consensus_reached": 1198,
  "consensus_rate": 0.961,
  "avg_latency_ms": 2840,
  "disagreements": 49,
  "avg_similarity": 0.93
}
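With a consensus rate of about 96% (1,198 of 1,247 requests), the remaining 4% (49 requests) are the cases where the models disagreed, which are the responses most likely to contain errors. Because on_disagreement is set to reject, those requests are rejected instead of being shown to patients.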
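The rates can be recomputed from the raw counts as a sanity check. This Python sketch works on the example stats payload shown above; the helper name `summarize_stats` is hypothetical, and only the fields that appear in that payload are used.

```python
import json

# Example payload copied from the stats response above.
STATS_JSON = """
{
  "config": "drug-interactions",
  "total_requests": 1247,
  "consensus_reached": 1198,
  "consensus_rate": 0.961,
  "avg_latency_ms": 2840,
  "disagreements": 49,
  "avg_similarity": 0.93
}
"""


def summarize_stats(stats: dict) -> dict:
    """Recompute agreement and disagreement rates from the raw counts."""
    total = stats["total_requests"]
    reached = stats["consensus_reached"]
    disagreements = stats["disagreements"]
    if total == 0:
        return {"consensus_rate": 0.0, "disagreement_rate": 0.0, "counts_add_up": True}
    return {
        "consensus_rate": round(reached / total, 3),
        "disagreement_rate": round(disagreements / total, 3),
        # Every request should be counted as either agreement or disagreement.
        "counts_add_up": reached + disagreements == total,
    }


summary = summarize_stats(json.loads(STATS_JSON))
```

A rising disagreement rate for a config is a signal to revisit its threshold or its model list.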