Consensus

Multi-model verification for high-stakes decisions.

Overview

Consensus sends the same prompt to multiple LLM providers in parallel, compares the responses for semantic similarity, and only returns a result when the models agree. If the models disagree, the request is flagged for human review or rejected outright.

This is critical for domains where a single hallucination can cause real harm:

Medical: Drug interactions, dosage recommendations, diagnostic suggestions
Financial: Investment advice, regulatory interpretations, risk assessments
Legal: Contract analysis, compliance guidance, case law citations

If two or three independent models all produce the same answer, that answer is far less likely to be a hallucination, because every model would have to make the same mistake. Consensus trades cost for confidence.

Configuration

Create a consensus configuration via the API. Each config specifies which models to fan out to, the similarity threshold, and the decision logic.

curl -X POST http://localhost:4200/api/consensus/configs \
  -H "Authorization: Bearer sy_admin_..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "medical-verification",
    "models": ["gpt-4o", "claude-sonnet-4-20250514", "gemini-1.5-pro"],
    "threshold": 0.85,
    "decision": "majority",
    "on_disagreement": "reject",
    "timeout_ms": 30000
  }'

Configuration fields:

Field	Type	Description
`name`	string	Unique identifier for this config
`models`	string[]	List of models to query in parallel
`threshold`	float	Similarity score required for agreement (0.0–1.0)
`decision`	string	`unanimous` or `majority`
`on_disagreement`	string	`reject`, `flag`, or `return_best`
`timeout_ms`	int	Max wait time, in milliseconds, for all models to respond

Or configure via stockyard.yaml:

# stockyard.yaml
apps:
  consensus:
    configs:
      - name: medical-verification
        models: [gpt-4o, claude-sonnet-4-20250514, gemini-1.5-pro]
        threshold: 0.85
        decision: majority
        on_disagreement: reject
        timeout_ms: 30000
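
The field table above can also be read as a set of validation rules. The following Python sketch checks a config before it is submitted. It is a client-side illustration only, not Stockyard's own validation: the `ConsensusConfig` class and `parse_config` helper are hypothetical names, and the rule that a config needs at least two models is an assumption.

```python
from dataclasses import dataclass

# Allowed values taken from the configuration field table.
DECISIONS = {"unanimous", "majority"}
ON_DISAGREEMENT = {"reject", "flag", "return_best"}


@dataclass(frozen=True)
class ConsensusConfig:
    """Hypothetical client-side mirror of a consensus config."""
    name: str
    models: tuple
    threshold: float
    decision: str
    on_disagreement: str
    timeout_ms: int

    def __post_init__(self):
        if not self.name:
            raise ValueError("name must be non-empty")
        # Assumption: comparing responses needs at least two models.
        if len(self.models) < 2:
            raise ValueError("consensus needs at least two models")
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be between 0.0 and 1.0")
        if self.decision not in DECISIONS:
            raise ValueError(f"unknown decision: {self.decision!r}")
        if self.on_disagreement not in ON_DISAGREEMENT:
            raise ValueError(f"unknown on_disagreement: {self.on_disagreement!r}")
        if self.timeout_ms <= 0:
            raise ValueError("timeout_ms must be positive")


def parse_config(raw: dict) -> ConsensusConfig:
    """Build a validated config from a dict shaped like the JSON body above."""
    return ConsensusConfig(
        name=raw["name"],
        models=tuple(raw["models"]),
        threshold=float(raw["threshold"]),
        decision=raw["decision"],
        on_disagreement=raw["on_disagreement"],
        timeout_ms=int(raw["timeout_ms"]),
    )
```

Running a check like this locally catches typos such as `"decision": "majorty"` before the request reaches the API.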

How It Works

1. Fan-out. When a request hits the proxy with a consensus config attached, Stockyard sends the identical prompt to every model in the config simultaneously. Because the requests run in parallel, total latency is governed by the slowest model rather than by the sum of all calls.

2. Similarity comparison. Once all responses arrive (or the timeout is reached), Stockyard computes pairwise semantic similarity between every response pair. The similarity engine uses embedding-based comparison, not simple string matching.

3. Decision logic. Based on the decision setting:

unanimous — All models must agree above the threshold. Strictest mode.
majority — A majority of models must agree above the threshold. More tolerant of one outlier.

If consensus is reached, the response from the highest-ranked model is returned. If not, the on_disagreement action fires.
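
The comparison and decision steps above can be sketched in Python. This is a minimal illustration, not Stockyard's implementation. It assumes each response has already been turned into an embedding vector, that similarity is cosine similarity, and that "agreement" means every pair inside a group of models scores at or above the threshold. The function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def largest_agreeing_group(embeddings, threshold):
    """Return the largest group of models whose every pair scores >= threshold.

    Brute force over subsets, which is fine for the handful of models
    a consensus config typically lists.
    """
    names = list(embeddings)
    for size in range(len(names), 0, -1):
        for group in combinations(names, size):
            if all(
                cosine(embeddings[a], embeddings[b]) >= threshold
                for a, b in combinations(group, 2)
            ):
                return group
    return ()


def has_consensus(embeddings, threshold, decision):
    """Apply the `unanimous` or `majority` rule to a dict of model -> embedding."""
    if len(embeddings) < 2:
        raise ValueError("need at least two responses to compare")
    agreeing = len(largest_agreeing_group(embeddings, threshold))
    if decision == "unanimous":
        return agreeing == len(embeddings)
    if decision == "majority":
        return agreeing > len(embeddings) / 2
    raise ValueError(f"unknown decision: {decision!r}")


# Toy 2-D embeddings: models "a" and "b" nearly agree, "c" is an outlier.
toy = {"a": [1.0, 0.0], "b": [0.99, 0.14], "c": [0.0, 1.0]}
print(has_consensus(toy, 0.85, "majority"))   # True: a and b form a 2-of-3 majority
print(has_consensus(toy, 0.85, "unanimous"))  # False: c disagrees with both
```

Under these assumptions, `majority` tolerates the single outlier while `unanimous` fails as soon as any pair falls below the threshold.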

Making a Consensus Request

Add the X-Stockyard-Consensus header to any chat completion request:

curl http://localhost:4200/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "X-Stockyard-Consensus: medical-verification" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Is ibuprofen safe to take with warfarin?"}]
  }'

The model field in the request body is ignored when consensus is active — the config's model list takes precedence.
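
For clients that do not shell out to curl, the same request can be assembled with Python's standard library. This sketch only builds the request object. The helper name `build_consensus_request` is hypothetical, and the base URL and placeholder `model` value are copied from the curl example above.

```python
import json
import urllib.request


def build_consensus_request(config_name, messages, api_key,
                            base_url="http://localhost:4200"):
    """Build a chat completion request that opts in to a consensus config."""
    body = {
        # Ignored while consensus is active, but kept so the payload
        # remains a valid chat completion request.
        "model": "gpt-4o",
        "messages": messages,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "X-Stockyard-Consensus": config_name,
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_consensus_request(
    "medical-verification",
    [{"role": "user", "content": "Is ibuprofen safe to take with warfarin?"}],
    api_key="YOUR_API_KEY",
)
# To send it against a running proxy:
# with urllib.request.urlopen(req, timeout=35) as resp:
#     print(json.load(resp))
```

The client-side timeout in the commented call is set slightly above the config's `timeout_ms` so that the proxy, not the client, decides when to give up. That margin is a design suggestion, not a documented requirement.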

Tuning the Threshold

The threshold value controls how similar responses must be to count as agreement:

Threshold	Behavior	Use Case
`0.95`	Near-identical responses required	Factual lookups, yes/no answers
`0.85`	Same meaning, different wording OK	Medical advice, financial analysis
`0.70`	Broadly similar conclusions	Creative tasks, summaries

Tip: Start with 0.85 and adjust based on your disagreement rate. Check /api/consensus/stats to see how often consensus fails for a given config.
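
One way to act on this tip is to turn the stats payload into a disagreement rate and compare it against a target. The sketch below assumes the payload has the `total_requests` and `disagreements` fields shown in the sample response in the Medical Chatbot example. The helper names and the 5% target are illustrative assumptions, not recommended values.

```python
def disagreement_rate(stats: dict) -> float:
    """Fraction of requests where consensus was not reached."""
    total = stats["total_requests"]
    if total == 0:
        return 0.0
    return stats["disagreements"] / total


def exceeds_target(stats: dict, max_rate: float = 0.05) -> bool:
    """True when the config disagrees more often than the chosen target."""
    return disagreement_rate(stats) > max_rate


# Figures from the sample stats response for the drug-interactions config.
sample = {"total_requests": 1247, "disagreements": 49}
print(f"{disagreement_rate(sample):.1%}")  # 3.9%
print(exceeds_target(sample))              # False
```

A rate well above the target may indicate that the threshold is too strict for the kind of answers the config handles. A rate near zero may indicate that it is too loose to catch meaningful differences.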

Cost Implications

Consensus multiplies your per-request cost by roughly the number of models in the config, because every model processes the full prompt. A 3-model config of similarly priced models costs approximately 3x per request. For most teams, this is reserved for high-stakes paths, not every request.

Strategies to manage cost:

Use consensus only on specific routes or prompt categories
Mix expensive and cheap models (e.g., GPT-4o + two smaller models)
Enable caching — repeated identical prompts only trigger consensus once
Set a costcap module limit to prevent runaway spend
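
The effect of mixing expensive and cheap models can be checked with simple arithmetic. In the sketch below, the per-request costs are made-up units chosen only to show the calculation, and the function names are hypothetical.

```python
def consensus_cost(per_model_cost: dict) -> float:
    """Total cost of one consensus request: every model is called once."""
    return sum(per_model_cost.values())


def cost_multiplier(per_model_cost: dict, baseline_model: str) -> float:
    """Consensus cost relative to calling only the baseline model."""
    return consensus_cost(per_model_cost) / per_model_cost[baseline_model]


# Hypothetical per-request costs in arbitrary units.
three_large = {"large-a": 10.0, "large-b": 10.0, "large-c": 10.0}
one_large_two_small = {"large-a": 10.0, "small-a": 1.0, "small-b": 1.0}

print(cost_multiplier(three_large, "large-a"))          # 3.0
print(cost_multiplier(one_large_two_small, "large-a"))  # 1.2
```

With these illustrative numbers, replacing two of the three large models with smaller ones lowers the overhead from 3x to 1.2x of a single large-model call. Whether the smaller models agree often enough to keep the consensus rate acceptable has to be measured separately.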

Cost tracking: Observe tracks each sub-request individually, so you get full visibility into consensus costs via the cost dashboard.

API Endpoints

Method	Endpoint	Description
GET	`/api/consensus/configs`	List all consensus configurations
POST	`/api/consensus/configs`	Create a new consensus config
GET	`/api/consensus/configs/:id`	Get a specific config by ID
PUT	`/api/consensus/configs/:id`	Update an existing config
DELETE	`/api/consensus/configs/:id`	Delete a config
GET	`/api/consensus/stats`	Agreement rates and latency stats
GET	`/api/consensus/history`	Recent consensus decisions with details

Example: Medical Chatbot

A medical chatbot needs to verify drug interaction answers before showing them to patients. Here is the full setup:

# 1. Create the consensus config
curl -X POST http://localhost:4200/api/consensus/configs \
  -H "Authorization: Bearer sy_admin_..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "drug-interactions",
    "models": ["gpt-4o", "claude-sonnet-4-20250514", "gemini-1.5-pro"],
    "threshold": 0.90,
    "decision": "unanimous",
    "on_disagreement": "reject",
    "timeout_ms": 15000
  }'

# 2. Send a verified request
curl http://localhost:4200/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "X-Stockyard-Consensus: drug-interactions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [
      {"role": "system", "content": "You are a pharmacist assistant."},
      {"role": "user", "content": "Can I take ibuprofen with warfarin?"}
    ]
  }'

# 3. Check consensus stats
curl "http://localhost:4200/api/consensus/stats?config=drug-interactions" \
  -H "Authorization: Bearer sy_admin_..."

Example response:

{
  "config": "drug-interactions",
  "total_requests": 1247,
  "consensus_reached": 1198,
  "consensus_rate": 0.961,
  "avg_latency_ms": 2840,
  "disagreements": 49,
  "avg_similarity": 0.93
}

With a consensus rate of about 96%, the system rejects the roughly 4% of requests (49 of 1,247) where the models disagree, which are the cases most likely to contain errors.
