Create new eval suites for the deepagentsjs monorepo. Handles dataset design, test case scaffolding, scoring logic, vitest configuration, and LangSmith integration. Use when the user asks to: (1) create an eval, (2) write an evaluation, (3) add a benchmark, (4) build an eval suite, (5) evaluate agent behaviour, (6) add test cases for a capability, or (7) implement an existing benchmark (e.g. oolong, AgentBench, SWE-bench). Trigger on phrases like 'create eval', 'new eval', 'add eval', 'benchmark', 'evaluate', 'eval suite', 'write evals for'.
Create a trigger evaluation setup for a toolkit skill. Use when the user wants to test whether a skill's description triggers correctly, set up eval workspaces, or generate trigger test queries for a skill. Also use when the user says 'create eval', 'test triggers', or 'eval skill', or wants to measure skill triggering accuracy.
Diagnose and test Claude Code skills against Anthropic's 7 principles. Scans SKILL.md files, checks 8 rules (gotchas, description, allowed-tools, file-size, structure, frontmatter, conflicts, usage-hooks), classifies skill types, generates prescriptions, and runs eval tests. Use when checking skill quality, auditing skills, testing skills, or before publishing skills. Triggers on "스킬 점검", "스킬 진단", "스킬 테스트", "check skills", "audit skills", "test skills", "skill health", "pulser", "pulser eval".
Conducts a structured gap assessment of an organization's readiness against ISO 42001:2023 (AI Management System standard). Runs an interview-style evaluation across all mandatory clauses (4-10) and applicable Annex A controls. Produces a scored gap assessment report saved to the vault, a draft Statement of Applicability, and a prioritized list of gaps to address before certification. Requires a vault created by /setup-iso42001-vault.
Conducts a HIPAA compliance interview. Processes one NIST 800-53 control at a time: reads the official NIST assessment method and asks relevant questions. Covers vendors (SA-9), risk (RA-3), training (AT-2), and all other interview-only controls.