## Manual Testing Strategy
Systematic manual testing is essential before Skill deployment.
### Test Case Design
- Positive cases: Typical, common scenarios; verify the task can be completed
- Edge cases: Empty input, very long input, special characters, multilingual
- Negative cases: Invalid input, insufficient permissions, unavailable dependencies; verify error handling
- Cross-scenario: Same Skill in different contexts and models
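The case categories above can be sketched as a small case table. This is a minimal sketch: `run_skill` is a hypothetical stand-in for the real Skill invocation, and the inputs are illustrative.

```python
# Sketch of a case table covering positive, edge, and negative inputs.
# `run_skill` is a hypothetical stand-in for the real Skill invocation.
def run_skill(text) -> dict:
    # Placeholder logic; replace with the actual Skill call.
    if not isinstance(text, str) or not text.strip():
        return {"error": "invalid or empty input"}
    return {"result": text.upper()}

CASES = [
    # (case name, input, error expected?)
    ("positive",           "summarize this report", False),
    ("edge-empty",         "",                      True),
    ("edge-long",          "x" * 10_000,            False),
    ("edge-multilingual",  "总结这份报告",           False),
    ("negative-nonstring", 1234,                    True),
]

def run_cases() -> list:
    # Return the names of cases whose error behavior differs from expectation.
    return [name for name, text, expect_error in CASES
            if ("error" in run_skill(text)) != expect_error]
```

In a real suite, each tuple would become one parametrized test so failures report the case name individually.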
### Checklist
- [ ] Output format matches expectations (JSON, Markdown, etc.)
- [ ] Key information is accurate and complete
- [ ] Error messages are clear and actionable
- [ ] No hallucination, overreach, or sensitive data leakage
- [ ] Context is correct across multi-turn dialogue
### Recording and Reproducibility
- Save test inputs, outputs, and model config for reproducibility
- Add regression cases for anomalies to catch regressions after fixes
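A minimal sketch of such a record, assuming a simple one-JSON-file-per-case layout; the field names are illustrative, not a fixed format.

```python
# Sketch: persist each test run so anomalies can be replayed later.
# Field names and layout are illustrative assumptions.
import json
import pathlib
import time

def save_record(record_dir: str, case_id: str, inp: str, out: str,
                model_config: dict) -> pathlib.Path:
    record = {
        "case_id": case_id,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "input": inp,
        "output": out,
        "model_config": model_config,  # model name, temperature, seed, ...
    }
    path = pathlib.Path(record_dir) / f"{case_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(record, ensure_ascii=False, indent=2))
    return path
```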
## Automated Validation
### Structured Output Validation
For structured output such as JSON or YAML, validate against a schema:

```python
import jsonschema

# `output` is the Skill's response, already parsed into a Python object.
schema = {"type": "object", "properties": {"result": {"type": "string"}}, "required": ["result"]}
jsonschema.validate(output, schema)  # raises jsonschema.ValidationError on mismatch
```
### Key Content Assertions
- For known inputs, assert expected keywords or structure in output
- Use regex or simple NLP to check semantic relevance
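A minimal sketch of such an assertion, assuming (for illustration) that the Skill must emit a `Summary:`-prefixed line containing known keywords:

```python
# Sketch: keyword and structure checks for a known input's output.
# The "Summary:" prefix and keyword list are illustrative assumptions.
import re

def check_summary(output: str, keywords=("revenue", "12%")) -> bool:
    # Structural check: output must start with the required prefix.
    if not re.match(r"Summary:\s", output):
        return False
    # All expected keywords must appear (case-insensitive).
    return all(re.search(re.escape(kw), output, re.IGNORECASE) for kw in keywords)
```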
### LLM-as-Judge
For open-ended output, use another LLM to score:
- Correctness: Does it answer the question?
- Completeness: Are key points missing?
- Format: Does it follow requirements?
- Safety: Does it contain inappropriate content?
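A sketch of the judge side of this setup. `build_judge_prompt` and the JSON scoring format are assumptions; the actual judge-model call (not shown) would wrap whatever API you use.

```python
# Sketch of an LLM-as-Judge rubric: build a scoring prompt, then parse
# the judge's JSON reply. Prompt wording and format are assumptions.
import json

RUBRIC = ["correctness", "completeness", "format", "safety"]

def build_judge_prompt(question: str, answer: str) -> str:
    return (
        "Score the answer on each criterion from 1 to 5 and reply with a "
        f"JSON object containing the keys {RUBRIC}.\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )

def parse_scores(judge_reply: str) -> dict:
    # Fail loudly if the judge omitted any rubric criterion.
    scores = json.loads(judge_reply)
    assert set(RUBRIC) <= set(scores), "judge reply missing criteria"
    return scores
```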
### Regression Test Set
- Build a "golden dataset": input + expected output (or expected features)
- Run after each change; compare actual vs. expected
- For non-deterministic output, compare key fields or use fuzzy matching
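For fuzzy matching, the standard library's `difflib` is one option; the 0.8 threshold below is an illustrative default to tune per Skill.

```python
# Sketch: compare actual output to a golden expectation with fuzzy
# matching, tolerating minor non-deterministic wording differences.
from difflib import SequenceMatcher

def fuzzy_match(actual: str, expected: str, threshold: float = 0.8) -> bool:
    return SequenceMatcher(None, actual, expected).ratio() >= threshold
```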
## CI Integration
### GitHub Actions Example
```yaml
- name: Run Skill tests
  run: |
    python -m pytest tests/skills/ -v
    python scripts/validate_skills.py
```
### Workflow Design
- PR trigger: Run tests on every PR; require pass before merge
- Scheduled: Periodically run regression with latest models to monitor model drift
- Gate: Block merge and notify owners when critical Skill pass rate falls below threshold
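The PR and scheduled triggers above might look like this in the workflow file; the cron time is illustrative.

```yaml
on:
  pull_request:          # run tests on every PR
  schedule:
    - cron: "0 6 * * 1"  # weekly regression run against the latest models
```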
### Test Environment
- Use test API keys and mock external services to avoid affecting production
- Fix model version or seed for reproducibility
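A sketch of mocking the external model call with the standard library's `unittest.mock`; `SkillClient` and its methods are hypothetical stand-ins for your Skill's real dependency.

```python
# Sketch: replace the external model call with a mock so tests never
# touch production. `SkillClient` and its methods are hypothetical.
from unittest.mock import patch

class SkillClient:
    def call_model(self, prompt: str) -> str:
        # Real API call; must never run inside tests.
        raise RuntimeError("real API must not be called in tests")

    def summarize(self, text: str) -> str:
        return self.call_model(f"Summarize: {text}")

client = SkillClient()
# Patch the dependency with a canned response for the test's duration.
with patch.object(SkillClient, "call_model", return_value="mocked summary"):
    result = client.summarize("long report")
# result == "mocked summary"
```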
## Summary
Skill quality assurance combines manual testing with automated validation: design test cases, validate structured output against schemas, score open-ended output with LLM-as-Judge, and maintain a regression set, then integrate everything into CI to systematically improve Skill reliability and maintainability.