Prompt CI
Run prompt regression tests in your CI pipeline with golden datasets.
Overview
Stockyard’s golden dataset testing lets you define test suites with expected criteria, run them against your configured models, and fail CI builds when prompts regress.
1. Create a Test Suite
# Create a test suite
curl -X POST "$STOCKYARD_URL/api/studio/test-suites" \
  -H "X-Admin-Key: $STOCKYARD_ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name": "customer-support", "model": "gpt-4o"}'
# Add test cases (replace ts_abc123 with your suite's ID)
curl -X POST "$STOCKYARD_URL/api/studio/test-suites/ts_abc123/cases" \
  -H "X-Admin-Key: $STOCKYARD_ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "How do I reset my password?",
    "expected_criteria": {
      "contains": ["password", "reset", "email"],
      "min_length": 50,
      "not_contains": ["error", "cannot"]
    }
  }'
2. Add the GitHub Action
Copy .github/workflows/prompt-test.yml to your repository:
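name: Prompt Regression Test
on:
  pull_request:
    branches: [main]
jobs:
  prompt-test:
    runs-on: ubuntu-latest
    steps:
      - name: Run Stockyard Test Suites
        env:
          STOCKYARD_URL: ${{ secrets.STOCKYARD_URL }}
          STOCKYARD_ADMIN_KEY: ${{ secrets.STOCKYARD_ADMIN_KEY }}
          SUITE_ID: ts_abc123  # replace with your test suite ID
        run: |
          # Trigger test run
          RUN=$(curl -sf -X POST \
            -H "X-Admin-Key: $STOCKYARD_ADMIN_KEY" \
            "$STOCKYARD_URL/api/studio/test-suites/$SUITE_ID/run")
          RUN_ID=$(echo "$RUN" | jq -r '.run_id')
          # Poll for results (up to 60 attempts, 5 seconds apart)
          STATUS=""
          for i in $(seq 1 60); do
            sleep 5
            RESULT=$(curl -sf -H "X-Admin-Key: $STOCKYARD_ADMIN_KEY" \
              "$STOCKYARD_URL/api/studio/test-runs/$RUN_ID")
            STATUS=$(echo "$RESULT" | jq -r '.status')
            if [ "$STATUS" = "completed" ]; then break; fi
          done
          # Fail if the run never completed
          if [ "$STATUS" != "completed" ]; then
            echo "Test run $RUN_ID did not complete in time" >&2
            exit 1
          fi
          # Fail the build if any test case failed.
          # An if-block is used because a bare `[ ... ] && exit 1` as the
          # last command makes the step exit non-zero when nothing failed.
          FAILED=$(echo "$RESULT" | jq -r '.failed')
          if [ "$FAILED" -gt 0 ]; then
            echo "$FAILED test case(s) failed" >&2
            exit 1
          fi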
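The polling loop in the workflow above can also be factored into a small helper that reports a timeout explicitly. The sketch below is illustrative only: `poll_until_completed` and `fetch_status` are hypothetical names, not part of Stockyard, and the status command is passed in as an argument so the loop can be exercised without a live instance.

```shell
# poll_until_completed FETCH_CMD ATTEMPTS DELAY_SECONDS
# Hypothetical helper: runs FETCH_CMD (which must print the run status)
# until it prints "completed" or ATTEMPTS is exhausted.
# Returns 0 on completion, 1 on timeout.
poll_until_completed() {
  fetch_cmd=$1
  attempts=$2
  delay=$3
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # A failed fetch is treated as "not completed yet"
    status=$("$fetch_cmd") || status=""
    if [ "$status" = "completed" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "run did not complete after $attempts attempts" >&2
  return 1
}

# Possible use inside the workflow step (fetch_status is hypothetical):
#   fetch_status() {
#     curl -sf -H "X-Admin-Key: $STOCKYARD_ADMIN_KEY" \
#       "$STOCKYARD_URL/api/studio/test-runs/$RUN_ID" | jq -r '.status'
#   }
#   poll_until_completed fetch_status 60 5 || exit 1
```

Separating "timed out" from "completed with failures" keeps the CI log unambiguous about why a build was rejected.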
3. Expected Criteria
Each test case defines criteria the LLM response must meet:
{
  "contains": ["keyword1", "keyword2"],   // Response must contain these
  "not_contains": ["banned", "error"],    // Response must not contain these
  "min_length": 50,                       // Minimum character length
  "max_tokens": 500                       // Maximum estimated tokens
}
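To make the criteria concrete, here is a minimal local approximation of three of the checks. It is a sketch under stated assumptions, not Stockyard's implementation: `check_response` is a hypothetical function, matching is assumed to be a case-insensitive substring test, and `min_length` is assumed to count characters. `max_tokens` is left out because the source does not say how tokens are estimated.

```shell
# check_response RESPONSE MIN_LENGTH CONTAINS NOT_CONTAINS
# CONTAINS and NOT_CONTAINS are comma-separated, lowercase keywords.
# Assumptions (not confirmed by the Stockyard docs): case-insensitive
# substring matching; min_length counts characters.
# Prints "ok" and returns 0 when all checks pass; otherwise prints the
# first failing check and returns 1.
check_response() {
  response=$1
  min_length=$2
  lower=$(printf '%s' "$response" | tr '[:upper:]' '[:lower:]')

  if [ "${#response}" -lt "$min_length" ]; then
    echo "too short: ${#response} < $min_length"
    return 1
  fi

  old_ifs=$IFS
  IFS=','
  set -f   # disable globbing while splitting the lists on commas
  for word in $3; do
    case $lower in
      *"$word"*) ;;
      *) IFS=$old_ifs; set +f; echo "missing: $word"; return 1 ;;
    esac
  done
  for word in $4; do
    case $lower in
      *"$word"*) IFS=$old_ifs; set +f; echo "forbidden: $word"; return 1 ;;
    esac
  done
  IFS=$old_ifs
  set +f
  echo "ok"
}
```

For example, `check_response "$ANSWER" 50 "password,reset,email" "error,cannot"` mirrors the criteria of the password-reset test case from step 1, which can help when drafting criteria before uploading them.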
4. Repository Secrets
Set these in your GitHub repository settings:
STOCKYARD_URL — Your Stockyard instance URL
STOCKYARD_ADMIN_KEY — Admin API key
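Because GitHub Actions substitutes an empty string for a secret that has not been defined, a missing secret otherwise surfaces only as a confusing curl failure. A preflight check along these lines could run at the top of the workflow step; `require_env` is a hypothetical helper, not something Stockyard provides.

```shell
# require_env NAME...
# Hypothetical preflight check: returns 1 and names the first variable
# that is unset or empty; returns 0 when all are set.
require_env() {
  for name in "$@"; do
    # Indirect lookup of the variable called $name (names are trusted input)
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      return 1
    fi
  done
}

# Possible use at the start of the run step:
#   require_env STOCKYARD_URL STOCKYARD_ADMIN_KEY || exit 1
```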
Tip: Combine with drift detection (GET /api/observe/drift) to catch model behavior changes automatically.