CI/CD Integration Guide for Prompt Engineering¶
Overview¶
Prompts are code. Like application code, they benefit from automated testing, version control, and continuous integration. This guide shows how to integrate prompt evaluation into your CI/CD pipeline using GitHub Actions, building on the evaluation methodology from Module 5, §5.4 and the prompt registry schema from prompts/shared/.
1. What to Automate¶
| Check | Purpose | Blocking? | Complexity |
|---|---|---|---|
| Frontmatter lint | Ensure .prompt.md files have required YAML fields (mode, description, version) | ✅ Yes | Low |
| Markdown link check | Catch broken links across the curriculum and prompt files | ✅ Yes | Low |
| Token budget estimate | Warn if a prompt exceeds a configurable token threshold | ⚠️ Advisory | Low |
| Prompt regression tests | Run prompts against a test suite and check output quality | ✅ Yes (for production prompts) | Medium |
| Schema validation | Validate prompt metadata against a JSON schema | ✅ Yes | Low |
| Security scan | Check prompts for accidental secret leaks or injection-vulnerable patterns | ✅ Yes | Low |
This repository already implements the first three in .github/workflows/lint-markdown.yml. This guide covers the remaining three.
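If you need to replicate the frontmatter lint in another repository, it can be as small as parsing the YAML block at the top of each .prompt.md file and checking for the fields listed in the table above. The sketch below is illustrative only, not the repository's actual scripts/lint-prompt-frontmatter.sh; the field set and paths are assumptions to adjust for your registry.
lint_frontmatter.py (illustrative sketch):
#!/usr/bin/env python3
# Illustrative frontmatter lint sketch (not the repository's real check).
# Requires: pip install pyyaml
import sys
from pathlib import Path

import yaml

REQUIRED_FIELDS = {"mode", "description", "version"}  # assumed field list

errors = 0
for path in Path("prompts").rglob("*.prompt.md"):
    text = path.read_text()
    if not text.startswith("---"):
        print(f"{path}: missing YAML frontmatter")
        errors += 1
        continue
    # Frontmatter is the block between the first two '---' delimiters
    frontmatter = yaml.safe_load(text.split("---", 2)[1]) or {}
    missing = REQUIRED_FIELDS - set(frontmatter)
    if missing:
        print(f"{path}: missing fields: {', '.join(sorted(missing))}")
        errors += 1

sys.exit(1 if errors else 0)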
2. Prompt Regression Testing¶
Concept¶
A prompt regression test is analogous to a unit test for code. You maintain a set of (input, expected behavior) pairs and run them against your prompt after every change. If outputs degrade, the CI pipeline fails.
Test Suite Structure¶
Create a tests/ directory alongside your prompts:
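Assuming one prompt per directory (the code-review/ directory name below is illustrative), the resulting layout looks like this:
prompts/
  code-review/
    review-code.prompt.md    # the prompt under test
    tests/
      test_suite.json        # test cases and assertions
      evaluate.py            # the runner shown below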
test_suite.json:
{
  "prompt_file": "review-code.prompt.md",
  "model": "gpt-4o",
  "temperature": 0.0,
  "test_cases": [
    {
      "id": "tc-001",
      "description": "Missing type hints",
      "input": "def add(a, b):\n return a + b",
      "assertions": [
        {"type": "contains", "value": "type hint"},
        {"type": "format", "value": "numbered list"}
      ]
    },
    {
      "id": "tc-002",
      "description": "Empty function",
      "input": "def process():\n pass",
      "assertions": [
        {"type": "contains", "value": "implementation"},
        {"type": "min_length", "value": 50}
      ]
    }
  ]
}
Evaluation Script¶
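The runner below supports six assertion types: contains, not_contains, min_length, max_length, format (with numbered list, json, and markdown variants), and regex. Assertions with an unknown type are reported as skipped rather than failed.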
evaluate.py:
#!/usr/bin/env python3
"""
Prompt regression test runner.

Usage:
    python evaluate.py test_suite.json

Exit codes:
    0 — all tests passed
    1 — one or more tests failed
"""
import json
import os
import re
import sys

from openai import OpenAI


def load_test_suite(path: str) -> dict:
    with open(path) as f:
        return json.load(f)


def load_prompt(prompt_path: str) -> str:
    with open(prompt_path) as f:
        content = f.read()
    # Strip YAML frontmatter
    if content.startswith("---"):
        end = content.index("---", 3) + 3
        content = content[end:].strip()
    return content


def run_prompt(client: OpenAI, prompt: str, user_input: str,
               model: str, temperature: float) -> str:
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": user_input},
        ],
    )
    return response.choices[0].message.content or ""


def check_assertion(output: str, assertion: dict) -> tuple[bool, str]:
    atype = assertion["type"]
    value = assertion["value"]
    if atype == "contains":
        passed = value.lower() in output.lower()
        return passed, f"contains '{value}': {'PASS' if passed else 'FAIL'}"
    if atype == "not_contains":
        passed = value.lower() not in output.lower()
        return passed, f"not_contains '{value}': {'PASS' if passed else 'FAIL'}"
    if atype == "min_length":
        passed = len(output) >= int(value)
        return passed, f"min_length {value}: {'PASS' if passed else 'FAIL'} (actual: {len(output)})"
    if atype == "max_length":
        passed = len(output) <= int(value)
        return passed, f"max_length {value}: {'PASS' if passed else 'FAIL'} (actual: {len(output)})"
    if atype == "format":
        if value == "numbered list":
            passed = bool(re.search(r"^\d+\.", output, re.MULTILINE))
        elif value == "json":
            try:
                json.loads(output)
                passed = True
            except json.JSONDecodeError:
                passed = False
        elif value == "markdown":
            passed = bool(re.search(r"^#{1,6}\s", output, re.MULTILINE))
        else:
            passed = True  # unknown format, skip
        return passed, f"format '{value}': {'PASS' if passed else 'FAIL'}"
    if atype == "regex":
        passed = bool(re.search(value, output, re.IGNORECASE | re.MULTILINE))
        return passed, f"regex '{value}': {'PASS' if passed else 'FAIL'}"
    return True, f"unknown assertion type '{atype}': SKIPPED"


def main():
    if len(sys.argv) < 2:
        print("Usage: python evaluate.py test_suite.json")
        sys.exit(1)

    suite_path = sys.argv[1]
    suite = load_test_suite(suite_path)

    # Resolve prompt path relative to test suite
    suite_dir = os.path.dirname(os.path.abspath(suite_path))
    prompt_path = os.path.join(suite_dir, "..", suite["prompt_file"])
    prompt = load_prompt(prompt_path)

    client = OpenAI()
    model = suite.get("model", "gpt-4o")
    temperature = suite.get("temperature", 0.0)

    total = 0
    passed = 0
    failed = 0
    for case in suite["test_cases"]:
        total += 1
        print(f"\n--- Test {case['id']}: {case['description']} ---")
        output = run_prompt(client, prompt, case["input"], model, temperature)
        print(f"Output preview: {output[:200]}...")
        case_passed = True
        for assertion in case["assertions"]:
            ok, msg = check_assertion(output, assertion)
            print(f"  {msg}")
            if not ok:
                case_passed = False
        if case_passed:
            passed += 1
            print("  → PASSED")
        else:
            failed += 1
            print("  → FAILED")

    print(f"\n{'='*50}")
    print(f"Results: {passed}/{total} passed, {failed} failed")
    sys.exit(0 if failed == 0 else 1)


if __name__ == "__main__":
    main()
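The script exits 0 only when every test case passes, so CI can use the exit code directly as a gate. To run it locally, export OPENAI_API_KEY (the OpenAI() client reads it from the environment) and invoke python evaluate.py test_suite.json from the tests/ directory.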
GitHub Actions Workflow¶
Add this job to your workflow to run prompt regression tests on every PR that modifies prompt files:
# .github/workflows/prompt-regression.yml
name: Prompt Regression Tests

on:
  pull_request:
    paths:
      - 'prompts/**/*.prompt.md'
      - 'prompts/**/tests/**'

jobs:
  prompt-regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install dependencies
        run: pip install openai
      - name: Run prompt regression tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          shopt -s globstar nullglob
          failed=0
          for suite in prompts/**/tests/test_suite.json; do
            echo "Running: $suite"
            python "$(dirname "$suite")/evaluate.py" "$suite" || failed=1
          done
          exit $failed
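Note that the job reads OPENAI_API_KEY from the repository's Actions secrets. Pull requests from forks do not receive repository secrets by default, so expect this job to fail or skip on external contributions unless you add an explicit approval step.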
3. Schema Validation¶
If you adopt the prompt registry schema (see prompt-registry.schema.json, if your repository includes one), add a job that validates each prompt's frontmatter against it:
schema-validation:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-python@v5
      with:
        python-version: '3.12'
    - name: Install jsonschema
      run: pip install jsonschema pyyaml
    - name: Validate prompt frontmatter
      run: |
        shopt -s globstar nullglob
        python scripts/validate-prompt-schema.py prompts/**/*.prompt.md
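The job above assumes a scripts/validate-prompt-schema.py helper, which is not shown in this guide. A minimal sketch of what it could look like, assuming the frontmatter is YAML and the registry schema lives at prompts/shared/prompt-registry.schema.json (that path, and the helper's internals, are assumptions):
validate-prompt-schema.py (illustrative sketch):
#!/usr/bin/env python3
# Sketch of an assumed scripts/validate-prompt-schema.py implementation.
# Validates each prompt's YAML frontmatter against the registry JSON schema.
# Requires: pip install jsonschema pyyaml
import json
import sys
from pathlib import Path

import yaml
from jsonschema import Draft202012Validator

SCHEMA_PATH = Path("prompts/shared/prompt-registry.schema.json")  # assumed location


def frontmatter(path: Path) -> dict:
    """Return the YAML frontmatter of a .prompt.md file as a dict."""
    text = path.read_text()
    if not text.startswith("---"):
        return {}
    return yaml.safe_load(text.split("---", 2)[1]) or {}


def main(files: list[str]) -> int:
    validator = Draft202012Validator(json.loads(SCHEMA_PATH.read_text()))
    failures = 0
    for name in files:
        errors = sorted(validator.iter_errors(frontmatter(Path(name))), key=str)
        for err in errors:
            print(f"{name}: {err.message}")
        failures += bool(errors)
    print(f"{len(files)} files checked, {failures} with schema violations")
    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))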
4. Security Scanning for Prompts¶
Prompt files can accidentally contain secrets (API keys pasted into examples, hardcoded credentials) or other patterns that signal sensitive content. A lightweight grep-based scan catches the most common cases:
prompt-security-scan:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: Scan for sensitive patterns
      run: |
        echo "Scanning prompt files for potential secrets..."
        patterns=(
          'sk-[a-zA-Z0-9]{20,}'      # OpenAI API keys
          'ghp_[a-zA-Z0-9]{36}'      # GitHub PATs
          'AKIA[0-9A-Z]{16}'         # AWS access keys
          'password\s*[:=]\s*\S+'    # Hardcoded passwords
        )
        failed=0
        for pattern in "${patterns[@]}"; do
          if grep -rPn "$pattern" prompts/ learn/; then
            echo "FAIL: Found sensitive pattern: $pattern"
            failed=1
          fi
        done
        if [ $failed -eq 0 ]; then
          echo "PASS: No sensitive patterns found."
        fi
        exit $failed
5. Complete Workflow Example¶
Here is a complete .github/workflows/prompt-ci.yml that combines all checks:
name: Prompt CI

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'learn/**'
      - 'scripts/**'
  push:
    branches: [main]
    paths:
      - 'prompts/**'
      - 'learn/**'

jobs:
  # --- Lint checks (fast, no API needed) ---
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Lint prompt frontmatter
        run: bash scripts/lint-prompt-frontmatter.sh
      - name: Check markdown links
        uses: gaurav-nelson/github-action-markdown-link-check@v1
        with:
          use-quiet-mode: 'yes'

  # --- Token budget (fast, no API needed) ---
  token-budget:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Estimate token budgets
        run: |
          for f in $(find prompts -name '*.prompt.md'); do
            chars=$(wc -c < "$f")
            tokens=$((chars / 4))
            if [ "$tokens" -gt 4000 ]; then
              echo "WARNING: $f ≈ $tokens tokens (exceeds 4000)"
            else
              echo "OK: $f ≈ $tokens tokens"
            fi
          done

  # --- Security scan (fast, no API needed) ---
  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Scan for secrets in prompts
        run: |
          patterns='(sk-[a-zA-Z0-9]{20,}|ghp_[a-zA-Z0-9]{36}|AKIA[0-9A-Z]{16})'
          if grep -rPn "$patterns" prompts/ learn/; then
            echo "::error::Potential secrets found in prompt files"
            exit 1
          fi
          echo "No secrets found."

  # --- Regression tests (requires API key, runs only on prompt changes) ---
  regression:
    if: github.event_name == 'pull_request'
    needs: [lint, security]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install openai
      - name: Run prompt regression tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          found=0
          for suite in $(find prompts -name 'test_suite.json' 2>/dev/null); do
            dir=$(dirname "$suite")
            if [ -f "$dir/evaluate.py" ]; then
              echo "Running: $suite"
              python "$dir/evaluate.py" "$suite" || found=1
            fi
          done
          if [ $found -eq 0 ] && [ "$(find prompts -name 'test_suite.json' | wc -l)" -eq 0 ]; then
            echo "No test suites found — skipping regression tests."
          fi
          exit $found
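The chars/4 estimate in the token-budget job is deliberately rough. For exact counts, the tiktoken library tokenizes text with the same encodings the OpenAI models use. A minimal sketch, reusing the 4000-token threshold from the workflow above; keeping the check advisory (as in the table in section 1) is a choice, and the filename is illustrative:
count_tokens.py (illustrative sketch):
#!/usr/bin/env python3
# Sketch: exact token counts with tiktoken instead of the chars/4 heuristic.
# Requires: pip install tiktoken
import sys
from pathlib import Path

import tiktoken

BUDGET = 4000  # same threshold as the workflow above
enc = tiktoken.get_encoding("o200k_base")  # the encoding used by gpt-4o

over = 0
for path in sorted(Path("prompts").rglob("*.prompt.md")):
    tokens = len(enc.encode(path.read_text()))
    status = "WARNING" if tokens > BUDGET else "OK"
    print(f"{status}: {path} = {tokens} tokens")
    over += tokens > BUDGET

# Keep the check advisory (always exit 0), or make it blocking by exiting 1 when over > 0.
sys.exit(0)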
6. Best Practices¶
- Separate fast checks from slow checks. Lint, link-check, and security scans run in seconds; regression tests call the LLM API and take minutes. Run them in parallel, and gate regression tests behind the fast checks.
- Use temperature 0 for regression tests. Non-determinism is the enemy of CI. Set temperature: 0 so outputs are as reproducible as possible (though not perfectly: model updates can still change outputs).
- Test behavior, not exact text. Assert on structural properties (contains keyword, correct format, within length range) rather than exact string matches. This makes tests robust to minor variations in model output.
- Budget API costs. Limit the number of test cases to what you can afford per PR; 5–10 cases per prompt is typically sufficient for catching regressions.
- Version your test suites with your prompts. When you change a prompt, update the corresponding test suite in the same commit.
- Cache evaluation results. For long-running test suites, consider caching outputs for unchanged (prompt + input) pairs to avoid redundant API calls; a minimal sketch follows below.
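A minimal caching sketch, assuming cached outputs are stored as JSON files named by a SHA-256 hash of the prompt, input, model, and temperature (the cache directory and helper names are illustrative):
cache_outputs.py (illustrative sketch):
#!/usr/bin/env python3
# Sketch: cache LLM outputs for unchanged (prompt + input) pairs.
# Cache entries are JSON files named by a SHA-256 of the request parameters.
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".prompt-cache")  # illustrative location


def cache_key(prompt: str, user_input: str, model: str, temperature: float) -> str:
    payload = json.dumps([prompt, user_input, model, temperature], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def cached_run(prompt, user_input, model, temperature, run_fn):
    """Return a cached output if present, otherwise call run_fn and store the result."""
    CACHE_DIR.mkdir(exist_ok=True)
    entry = CACHE_DIR / f"{cache_key(prompt, user_input, model, temperature)}.json"
    if entry.exists():
        return json.loads(entry.read_text())["output"]
    output = run_fn(prompt, user_input, model, temperature)
    entry.write_text(json.dumps({"output": output}))
    return output
Wired into evaluate.py, run_prompt would be called only on cache misses; in GitHub Actions, the actions/cache action can persist the cache directory between runs so unchanged cases cost nothing.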
Connection to This Repository¶
- The existing workflow at .github/workflows/lint-markdown.yml implements the frontmatter lint and markdown link checks.
- The evaluation methodology is described in Module 5, §5.4 and prompts/shared/evaluation-template.md.
- The lab at learn/labs/lab_04_evaluation_pipeline.py demonstrates the evaluation approach interactively.
- The prompt registry schema (Item 18) provides the validation target for schema checks.