skill-eval

Category: Ops & Delivery | Uploader: aws-samples | Downloads: 0 | Version: v1.0 (Latest)

Evaluate AI Agent Skills across safety, quality, reliability, and cost efficiency. Audit for security issues (secrets, injection, unsafe installs), test functional correctness with-skill vs without-skill, measure trigger precision, classify cost-efficiency tradeoffs, track version lifecycle, and generate unified grades. Use when evaluating a skill before installing, auditing marketplace skills, proving your skill works with automated tests, setting up CI/CD quality gates, or comparing two skill versions. NOT for: evaluating full agent systems, testing non-skill plugins, runtime performance benchmarking, or monitoring production agent behavior.

Source: https://github.com/aws-samples/sample-agent-skill-eval (GitHub)

Directory Structure

📁 .github/
- 📁 workflows/
  - 📄 ci.yml 772 B
  - 📄 skill-eval.yml 5.0 KB
📁 docs/
- 📁 reviews/
  - 📄 PHASE2_REVIEW.md 10.4 KB
  - 📄 REVIEW.md 10.5 KB
- 📄 concepts.md 9.3 KB
- 📄 tutorial.md 9.2 KB
📁 evals/
- 📁 files/
  - 📁 good-skill/
    - 📁 scripts/
      - 📄 process.py 481 B
    - 📄 SKILL.md 404 B
  - 📁 insecure-installer/
    - 📁 scripts/
      - 📄 installer.py 776 B
    - 📄 SKILL.md 933 B
  - 📁 over-permissioned/
    - 📁 scripts/
      - 📄 organize.sh 392 B
    - 📄 SKILL.md 1.1 KB
- 📄 benchmark.json 31.0 KB
- 📄 eval_queries.json 1.1 KB
- 📄 evals.json 2.6 KB
📁 examples/
- 📁 data-analysis/
  - 📁 evals/
    - 📁 files/
      - 📄 sales.csv 769 B
    - 📄 benchmark.json 31.6 KB
    - 📄 eval_queries.json 998 B
    - 📄 evals.json 3.1 KB
  - 📁 scripts/
    - 📄 analyze_csv.py 3.8 KB
  - 📄 README.md 8.1 KB
  - 📄 SKILL.md 1.9 KB
- 📁 f-to-a-improvement/
  - 📁 after/
    - 📁 scripts/
      - 📄 organize.py 3.2 KB
    - 📄 SKILL.md 971 B
  - 📁 before/
    - 📁 scripts/
      - 📄 organize.py 1.2 KB
    - 📄 SKILL.md 776 B
  - 📁 v2/
    - 📁 scripts/
      - 📄 organize.py 1.7 KB
    - 📄 SKILL.md 462 B
  - 📄 README.md 2.5 KB
- 📁 golden-dataset/
  - 📁 bad-skills/
    - 📁 insecure-installer/
      - 📁 scripts/
        - 📄 installer.py 776 B
      - 📄 SKILL.md 933 B
    - 📁 over-permissioned/
      - 📁 scripts/
        - 📄 organize.sh 392 B
      - 📄 SKILL.md 1.1 KB
    - 📁 poor-structure/
      - 📁 scripts/
        - 📄 script.py 113 B
      - 📄 SKILL.md 177 B
    - 📁 sloppy-weather/
      - 📁 evals/
        - 📄 eval_queries.json 477 B
        - 📄 evals.json 869 B
      - 📁 scripts/
        - 📄 weather.py 409 B
      - 📄 SKILL.md 402 B
  - 📄 README.md 1.7 KB
- 📁 golden-evals/
  - 📄 README.md 4.7 KB
- 📁 lifecycle-demo/
  - 📁 results/
    - 📄 audit-v1.html 5.9 KB
    - 📄 audit-v2.html 4.5 KB
    - 📄 audit-v3.html 5.3 KB
    - 📄 lifecycle-func-v1.json 22.2 KB
    - 📄 lifecycle-func-v2.json 20.9 KB
    - 📄 lifecycle-func-v3.json 21.9 KB
    - 📄 lifecycle-trigger-v1.json 2.7 KB
    - 📄 lifecycle-trigger-v2.json 2.7 KB
    - 📄 lifecycle-trigger-v3.json 2.7 KB
  - 📁 v1/
    - 📁 evals/
      - 📄 eval_queries.json 655 B
      - 📄 evals.json 2.2 KB
    - 📁 scripts/
      - 📄 check_pr.py 1.4 KB
    - 📄 SKILL.md 433 B
  - 📁 v2/
    - 📁 evals/
      - 📄 eval_queries.json 655 B
      - 📄 evals.json 2.2 KB
    - 📁 references/
      - 📄 naming-rules.md 3.6 KB
    - 📁 scripts/
      - 📄 check_pr.py 3.3 KB
    - 📄 SKILL.md 2.5 KB
  - 📁 v3/
    - 📁 evals/
      - 📄 eval_queries.json 655 B
      - 📄 evals.json 2.2 KB
    - 📁 references/
      - 📄 naming-rules.md 3.6 KB
    - 📁 scripts/
      - 📄 check_pr.py 3.3 KB
      - 📄 update_rules.py 1.6 KB
    - 📄 SKILL.md 2.7 KB
  - 📄 ground-truth.md 4.8 KB
  - 📄 README.md 1.9 KB
- 📁 real-skill-audits/
  - 📄 README.md 6.1 KB
- 📁 self-eval/
  - 📄 ground-truth.md 5.1 KB
  - 📄 meta-func-data-analysis.json 31.4 KB
  - 📄 meta-func-good-skill.json 8.9 KB
  - 📄 meta-func-sloppy-weather.json 9.6 KB
  - 📄 meta-trigger-data-analysis.json 3.3 KB
  - 📄 meta-trigger-good-skill.json 1.8 KB
  - 📄 meta-trigger-sloppy-weather.json 2.1 KB
  - 📄 README.md 2.2 KB
  - 📄 RESULTS.md 8.8 KB
📁 references/
- 📄 cli-reference.md 4.9 KB
- 📄 security-checklist.md 6.0 KB
- 📄 security-checks.md 2.1 KB
📁 skill_eval/
- 📁 audit/
  - 📄 __init__.py 0 B
  - 📄 permission_analyzer.py 8.8 KB
  - 📄 security_scan.py 28.9 KB
  - 📄 structure_check.py 20.3 KB
- 📄 __init__.py 0 B
- 📄 _claude.py 2.0 KB
- 📄 agent_runner.py 11.9 KB
- 📄 cli.py 19.4 KB
- 📄 compare.py 13.6 KB
- 📄 config.py 6.4 KB
- 📄 cost.py 4.4 KB
- 📄 eval_schemas.py 5.2 KB
- 📄 explanations.py 3.1 KB
- 📄 functional.py 20.8 KB
- 📄 grading.py 11.2 KB
- 📄 html_report.py 13.6 KB
- 📄 init.py 3.9 KB
- 📄 lifecycle.py 10.6 KB
- 📄 regression.py 12.6 KB
- 📄 report.py 3.4 KB
- 📄 schemas.py 3.4 KB
- 📄 trigger.py 14.8 KB
- 📄 unified_report.py 12.6 KB
📁 tests/
- 📁 fixtures/
  - 📁 bad-skill/
    - 📁 scripts/
      - 📄 evil.py 1.5 KB
    - 📄 SKILL.md 909 B
  - 📁 clawhub-skills/
    - 📁 nano-pdf/
      - 📄 SKILL.md 765 B
    - 📁 slack/
      - 📄 SKILL.md 2.3 KB
    - 📁 weather/
      - 📄 SKILL.md 1.1 KB
  - 📁 empty-dir/
    - 📄 .gitkeep 0 B
  - 📁 eval-skill/
    - 📁 evals/
      - 📁 files/
        - 📄 sample.csv 72 B
      - 📄 eval_queries.json 368 B
      - 📄 evals.json 1.1 KB
    - 📄 SKILL.md 548 B
  - 📁 good-skill/
    - 📁 evals/
      - 📄 eval_queries.json 381 B
      - 📄 evals.json 841 B
    - 📁 scripts/
      - 📄 process.py 481 B
    - 📄 SKILL.md 404 B
  - 📁 mcp-skill/
    - 📁 scripts/
      - 📄 setup.py 367 B
    - 📄 SKILL.md 707 B
  - 📁 no-frontmatter/
    - 📄 SKILL.md 74 B
  - 📁 scoped-skill/
    - 📁 references/
      - 📄 security-docs.md 243 B
    - 📁 scripts/
      - 📄 process.py 110 B
    - 📁 tests/
      - 📄 test_bad_patterns.py 518 B
    - 📄 SKILL.md 198 B
- 📄 __init__.py 0 B
- 📄 test_agent_runner.py 18.2 KB
- 📄 test_clawhub_fixtures.py 11.0 KB
- 📄 test_cli.py 10.5 KB
- 📄 test_compare.py 12.5 KB
- 📄 test_config.py 6.9 KB
- 📄 test_cost.py 5.5 KB
- 📄 test_eval_schemas.py 11.6 KB
- 📄 test_functional.py 26.7 KB
- 📄 test_golden_bad_skills.py 6.8 KB
- 📄 test_golden_dataset.py 4.7 KB
- 📄 test_grading.py 17.0 KB
- 📄 test_html_report.py 6.2 KB
- 📄 test_init.py 7.9 KB
- 📄 test_lifecycle.py 15.7 KB
- 📄 test_permission_analyzer.py 4.4 KB
- 📄 test_regression.py 13.8 KB
- 📄 test_security_scan.py 25.2 KB
- 📄 test_structure_check.py 10.2 KB
- 📄 test_trigger.py 24.0 KB
- 📄 test_unified_report.py 29.4 KB
📄 .gitignore 100 B
📄 AGENTS.md 9.6 KB
📄 CODE_OF_CONDUCT.md 309 B
📄 CONTRIBUTING.md 3.1 KB
📄 demo.sh 3.9 KB
📄 LICENSE 947 B
📄 pyproject.toml 765 B
📄 README.md 8.3 KB
📄 SKILL.md 3.1 KB

SKILL.md

---
name: skill-eval
description: "Evaluate AI Agent Skills across safety, quality, reliability, and cost efficiency. Audit for security issues (secrets, injection, unsafe installs), test functional correctness with-skill vs without-skill, measure trigger precision, classify cost-efficiency tradeoffs, track version lifecycle, and generate unified grades. Use when evaluating a skill before installing, auditing marketplace skills, proving your skill works with automated tests, setting up CI/CD quality gates, or comparing two skill versions. NOT for: evaluating full agent systems, testing non-skill plugins, runtime performance benchmarking, or monitoring production agent behavior."
---

# Skill Eval — Agent Skill Evaluation Framework

Evaluate Agent Skills across four dimensions: safety (audit), quality (functional), reliability (trigger), and cost efficiency (Pareto classification).

## Quick Start

```bash
skill-eval audit /path/to/skill          # Is it safe?
skill-eval report /path/to/skill         # Full grade (audit + functional + trigger)
skill-eval functional /path/to/skill     # Quality: with-skill vs without-skill
skill-eval trigger /path/to/skill        # Reliability: activation precision
```

## Decision Tree

- **"Is this skill safe?"** → `skill-eval audit <path>`
- **"Full evaluation with grade"** → `skill-eval report <path>`
- **"Full repo security review"** → `skill-eval audit <path> --include-all`
- **"Write eval cases"** → `skill-eval init <path>`, then edit `evals/`
- **"Compare two versions"** → `skill-eval compare <old> <new>`
- **"Check for regressions"** → `skill-eval snapshot <path>`, then `skill-eval regression <path>`
- **"Track changes"** → `skill-eval lifecycle <path> --save --label v1.0`

## Commands

| Command | Purpose |
|---------|---------|
| `audit` | Security & structure scan (secrets, permissions, spec compliance) |
| `functional` | Quality eval — runs prompts with and without skill, grades output |
| `trigger` | Reliability eval — tests activation precision for relevant/irrelevant queries |
| `report` | Unified grade combining audit (40%) + functional (40%) + trigger (20%) |
| `compare` | Side-by-side comparison of two skills on the same eval cases |
| `snapshot` | Save current audit as regression baseline |
| `regression` | Check for score regressions against baseline |
| `lifecycle` | Version tracking and change detection |
| `init` | Generate eval scaffold from SKILL.md frontmatter |

For detailed flags and examples, see `references/cli-reference.md`.
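
The `report` row above gives the weights for the unified grade: audit 40%, functional 40%, trigger 20%. The arithmetic can be sketched as below. This is a minimal illustration, not the tool's code: the function name `unified_score`, the 0-100 scale for each dimension, and the error on a missing dimension are all assumptions.

```python
# Weights come from the `report` row of the Commands table.
# Everything else here (names, 0-100 scale, error handling) is hypothetical.
WEIGHTS = {"audit": 0.40, "functional": 0.40, "trigger": 0.20}


def unified_score(scores: dict[str, float]) -> float:
    """Combine per-dimension scores (assumed 0-100) into one weighted score."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


example = {"audit": 90.0, "functional": 80.0, "trigger": 70.0}
print(round(unified_score(example), 1))  # 0.4*90 + 0.4*80 + 0.2*70 = 82.0
```

How the real `report` command behaves when a skill has no functional or trigger evals is not stated here, so the sketch simply rejects incomplete input.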

## Eval File Format

Functional evals (`evals/evals.json`):
```json
[{"id": "case-1", "prompt": "...", "assertions": ["contains 'expected'"], "files": ["files/input.csv"]}]
```
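
Before running a functional eval, it can help to sanity-check a file in the shape shown above. The following is a rough sketch under stated assumptions: it treats `id`, `prompt`, and `assertions` as required and `files` as optional, which the example suggests but does not confirm. The helper `check_cases` is hypothetical and is not the validation in `skill_eval/eval_schemas.py`.

```python
import json

# Assumption: these three keys are required and `files` is optional.
REQUIRED_KEYS = {"id", "prompt", "assertions"}


def check_cases(raw: str) -> list[str]:
    """Return human-readable problems found in an evals.json payload."""
    cases = json.loads(raw)
    if not isinstance(cases, list):
        return ["top level must be a JSON array of cases"]

    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        if not isinstance(case, dict):
            problems.append(f"case {index}: not an object")
            continue
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            problems.append(f"case {index}: missing {sorted(missing)}")
        case_id = case.get("id")
        if case_id is not None:
            if case_id in seen_ids:
                problems.append(f"case {index}: duplicate id {case_id!r}")
            seen_ids.add(case_id)
    return problems


sample = '[{"id": "case-1", "prompt": "...", "assertions": ["contains \'expected\'"]}]'
print(check_cases(sample))  # an empty list means no problems were found
```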

Trigger queries (`evals/eval_queries.json`):
```json
[{"query": "relevant question", "should_trigger": true}, {"query": "unrelated question", "should_trigger": false}]
```
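
Given queries labelled with `should_trigger` as above, activation quality can be summarised with the standard precision and recall definitions. The sketch below assumes each query yields an observed yes/no activation. How `skill-eval trigger` actually observes activation and which metrics it reports are not described here, so `trigger_metrics` and its input shape are hypothetical.

```python
def trigger_metrics(results: list[tuple[bool, bool]]) -> dict[str, float]:
    """Compute precision and recall from (should_trigger, did_trigger) pairs."""
    true_pos = sum(1 for should, did in results if should and did)
    false_pos = sum(1 for should, did in results if not should and did)
    false_neg = sum(1 for should, did in results if should and not did)

    # Guard against division by zero when nothing triggered or nothing should have.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return {"precision": precision, "recall": recall}


# One correct activation, one missed activation, one false activation, one correct skip.
observed = [(True, True), (True, False), (False, True), (False, False)]
print(trigger_metrics(observed))  # {'precision': 0.5, 'recall': 0.5}
```

Precision falls when the skill activates on unrelated queries; recall falls when it stays silent on relevant ones.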

## Scoring

Grades: A (90+), B (80-89), C (70-79), D (60-69), F (<60). Each finding deducts points by severity: CRITICAL −25, WARNING −10, INFO −2.

For the full security check reference and OWASP mapping, see `references/security-checks.md`.
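
The grade bands and deductions in this section can be worked through as follows. The deduction values and grade thresholds come from the text above. Starting from 100 and flooring the result at 0 are assumptions, and `audit_score` and `letter_grade` are hypothetical names rather than functions from `skill_eval/grading.py`.

```python
# Deduction values and grade bands are taken from the Scoring section.
DEDUCTIONS = {"CRITICAL": 25, "WARNING": 10, "INFO": 2}


def audit_score(severities: list[str]) -> int:
    """Subtract each finding's deduction from 100 (assumed start), floored at 0 (assumed)."""
    total = sum(DEDUCTIONS[severity] for severity in severities)
    return max(0, 100 - total)


def letter_grade(score: float) -> str:
    """Map a numeric score to the A-F bands listed above."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:
        return "D"
    return "F"


findings = ["CRITICAL", "WARNING", "INFO"]
score = audit_score(findings)
print(score, letter_grade(score))  # 100 - 25 - 10 - 2 = 63, which falls in the D band
```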
