Prompt CI

Run prompt regression tests in your CI pipeline with golden datasets.

Overview

Stockyard’s golden dataset testing lets you define test suites with expected criteria, run them against your configured models, and fail CI builds when prompts regress.

1. Create a Test Suite

# Create a test suite
curl -X POST "$STOCKYARD_URL/api/studio/test-suites" \
  -H "X-Admin-Key: $STOCKYARD_ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name": "customer-support", "model": "gpt-4o"}'

# Add test cases (replace ts_abc123 with your suite's ID)
curl -X POST "$STOCKYARD_URL/api/studio/test-suites/ts_abc123/cases" \
  -H "X-Admin-Key: $STOCKYARD_ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "How do I reset my password?",
    "expected_criteria": {
      "contains": ["password", "reset", "email"],
      "min_length": 50,
      "not_contains": ["error", "cannot"]
    }
  }'

2. Add the GitHub Action

Copy .github/workflows/prompt-test.yml to your repository:

name: Prompt Regression Test
on:
  pull_request:
    branches: [main]

jobs:
  prompt-test:
    runs-on: ubuntu-latest
    steps:
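      - name: Run Stockyard Test Suites
        env:
          STOCKYARD_URL: ${{ secrets.STOCKYARD_URL }}
          STOCKYARD_ADMIN_KEY: ${{ secrets.STOCKYARD_ADMIN_KEY }}
          SUITE_ID: ts_abc123  # replace with the ID of your test suite
        run: |
          # Trigger test run
          RUN=$(curl -sf -X POST \
            -H "X-Admin-Key: $STOCKYARD_ADMIN_KEY" \
            "$STOCKYARD_URL/api/studio/test-suites/$SUITE_ID/run")
          RUN_ID=$(echo "$RUN" | jq -r '.run_id')

          # Poll for results (up to 60 attempts, 5 seconds apart)
          STATUS=""
          for i in $(seq 1 60); do
            sleep 5
            RESULT=$(curl -sf -H "X-Admin-Key: $STOCKYARD_ADMIN_KEY" \
              "$STOCKYARD_URL/api/studio/test-runs/$RUN_ID")
            STATUS=$(echo "$RESULT" | jq -r '.status')
            if [ "$STATUS" = "completed" ]; then break; fi
          done

          # Fail the build if the run never completed
          if [ "$STATUS" != "completed" ]; then
            echo "Test run $RUN_ID did not complete in time" >&2
            exit 1
          fi

          # Fail the build if any test case failed.
          # An explicit "if" is used so the step exits 0 when nothing failed.
          FAILED=$(echo "$RESULT" | jq -r '.failed')
          if [ "$FAILED" -gt 0 ]; then
            echo "$FAILED test case(s) failed" >&2
            exit 1
          fi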
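
The polling logic in the workflow can also be factored into a small reusable function. The following is a minimal sketch and not part of Stockyard: poll_until_completed and its arguments are hypothetical names, and it assumes the fetch command passed in prints the run's status string on stdout (for example, by wrapping the curl and jq calls shown in the workflow).

```shell
# Hypothetical helper; the function name and arguments are illustrative only.
# Usage: poll_until_completed FETCH_CMD MAX_ATTEMPTS DELAY_SECONDS
# FETCH_CMD is the name of a command or function that prints the run status.
# Returns 0 once the status is "completed", 1 if attempts run out first.
poll_until_completed() {
  local fetch_cmd="$1" max_attempts="$2" delay="$3"
  local attempt=1 run_status

  while [ "$attempt" -le "$max_attempts" ]; do
    # A failed fetch is treated as "not completed yet" rather than fatal.
    run_status=$("$fetch_cmd") || run_status=""
    if [ "$run_status" = "completed" ]; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done

  return 1
}
```

In a CI step, the caller might define a wrapper function around the test-runs request and pass its name as FETCH_CMD, then exit non-zero when poll_until_completed returns 1. Treating a failed fetch as retryable is a design choice of this sketch; the workflow itself stops on the first failed request.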

3. Expected Criteria

Each test case defines criteria the LLM response must meet:

{
  "contains": ["keyword1", "keyword2"],  // Response must contain these
  "not_contains": ["banned", "error"],   // Response must not contain these
  "min_length": 50,                      // Minimum character length
  "max_tokens": 500                      // Maximum estimated tokens
}
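
To make the four fields concrete, here is a minimal local sketch of how a response could be checked against them. It is not Stockyard's implementation: check_criteria is a hypothetical helper, and it assumes case-insensitive substring matching and a rough estimate of one token per four characters. Stockyard's actual matching and token estimation may differ.

```shell
# Hypothetical local pre-check; not Stockyard's implementation.
# Assumptions: keywords match as case-insensitive substrings, and tokens
# are estimated as ceil(characters / 4).
# Usage: check_criteria RESPONSE MIN_LENGTH MAX_TOKENS CONTAINS NOT_CONTAINS
# CONTAINS and NOT_CONTAINS are comma-separated keyword lists
# (keywords containing glob characters such as * are not handled).
check_criteria() {
  local response="$1" min_length="$2" max_tokens="$3"
  local contains not_contains lower word length est_tokens old_ifs
  local result=0

  # Lowercase everything so matching ignores case.
  lower=$(printf '%s' "$response" | tr '[:upper:]' '[:lower:]')
  contains=$(printf '%s' "$4" | tr '[:upper:]' '[:lower:]')
  not_contains=$(printf '%s' "$5" | tr '[:upper:]' '[:lower:]')

  old_ifs=$IFS
  IFS=','
  for word in $contains; do
    case "$lower" in
      *"$word"*) ;;
      *) echo "missing required keyword: $word"; result=1 ;;
    esac
  done
  for word in $not_contains; do
    case "$lower" in
      *"$word"*) echo "found banned keyword: $word"; result=1 ;;
    esac
  done
  IFS=$old_ifs

  length=${#response}
  if [ "$length" -lt "$min_length" ]; then
    echo "too short: $length < $min_length characters"
    result=1
  fi

  est_tokens=$(( (length + 3) / 4 ))
  if [ "$est_tokens" -gt "$max_tokens" ]; then
    echo "too long: about $est_tokens tokens > $max_tokens"
    result=1
  fi

  return "$result"
}
```

A call such as check_criteria "$RESPONSE" 50 500 "password,reset,email" "error,cannot" returns 0 when every criterion is met and prints one line per violated criterion otherwise.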

4. Repository Secrets

Set these in your GitHub repository settings:

STOCKYARD_URL       — Your Stockyard instance URL
STOCKYARD_ADMIN_KEY — Admin API key

Tip: Combine with drift detection (GET /api/observe/drift) to catch model behavior changes automatically.