Create new eval suites for the deepagentsjs monorepo. Handles dataset design, test case scaffolding, scoring logic, vitest configuration, and LangSmith integration. Use when the user asks to: (1) create an eval, (2) write an evaluation, (3) add a benchmark, (4) build an eval suite, (5) evaluate agent behaviour, (6) add test cases for a capability, or (7) implement an existing benchmark (e.g. oolong, AgentBench, SWE-bench). Trigger on phrases like 'create eval', 'new eval', 'add eval', 'benchmark', 'evaluate', 'eval suite', 'write evals for'.
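As a rough sketch of the test-case scaffolding and scoring logic this covers, a minimal vitest suite might look like the following (the `runAgent` import, its path, and the inline dataset are assumptions for illustration, not the monorepo's actual API; LangSmith wiring is omitted):

```ts
// evals/basic-qa.eval.test.ts — illustrative only.
import { describe, it, expect } from "vitest";
import { runAgent } from "../src/agent"; // hypothetical entry point

// Tiny inline dataset; in practice this would live in a LangSmith dataset.
const cases = [
  { input: "What is 2 + 2?", expected: "4" },
  { input: "Name the capital of France.", expected: "Paris" },
];

// Containment scorer: 1 if the expected answer appears in the output, else 0.
function score(output: string, expected: string): number {
  return output.includes(expected) ? 1 : 0;
}

describe("basic QA eval suite", () => {
  it.each(cases)("answers: $input", async ({ input, expected }) => {
    const output = await runAgent(input);
    expect(score(output, expected)).toBe(1);
  });
});
```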
dangerous eval pattern
Create a trigger-evaluation setup for a toolkit skill. Use when the user wants to test whether a skill's description triggers correctly, set up eval workspaces, or generate trigger test queries for a skill. Also use when the user says 'create eval', 'test triggers', 'eval skill', or wants to measure skill-triggering accuracy.
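A minimal sketch of what such trigger test queries might look like, assuming a simple query/expected-trigger pairing (the queries, file name, and accuracy helper are illustrative, not the skill's actual format):

```ts
// trigger-cases.ts — illustrative shape for trigger eval data.
interface TriggerCase {
  query: string;          // what the user says
  shouldTrigger: boolean; // whether the skill under test is expected to fire
}

// Positive cases mirror the trigger phrases; negatives probe near-misses.
export const triggerCases: TriggerCase[] = [
  { query: "create an eval for my summarizer skill", shouldTrigger: true },
  { query: "test triggers for the pdf skill", shouldTrigger: true },
  { query: "summarize this PDF for me", shouldTrigger: false },
];

// Triggering accuracy: fraction of cases where observed routing matches expectation.
export function triggerAccuracy(observed: boolean[]): number {
  const hits = observed.filter((o, i) => o === triggerCases[i].shouldTrigger).length;
  return hits / triggerCases.length;
}
```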
Diagnose and test Claude Code skills against Anthropic's 7 principles. Scans SKILL.md files, checks 8 rules (gotchas, description, allowed-tools, file-size, structure, frontmatter, conflicts, usage-hooks), classifies skill types, generates prescriptions, and runs eval tests. Use when checking skill quality, auditing skills, testing skills, or before publishing skills. Triggers on "스킬 점검" (check skills), "스킬 진단" (diagnose skills), "스킬 테스트" (test skills), "check skills", "audit skills", "test skills", "skill health", "pulser", "pulser eval".
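A rough sketch of the kind of file-level check such a diagnosis could run over a SKILL.md (the thresholds and rule details below are assumptions, not the skill's actual criteria):

```ts
// check-skill.ts — illustrative SKILL.md checks; thresholds are assumed.
import { readFileSync, statSync } from "node:fs";

interface Finding { rule: string; ok: boolean; detail: string }

export function checkSkill(path: string): Finding[] {
  const text = readFileSync(path, "utf8");
  const findings: Finding[] = [];

  // frontmatter: the file should open with a YAML block delimited by ---
  const fm = /^---\n([\s\S]*?)\n---/.exec(text);
  findings.push({ rule: "frontmatter", ok: fm !== null, detail: fm ? "present" : "missing YAML frontmatter" });

  // description: frontmatter should carry a non-empty description field
  const desc = fm ? /^description:\s*(.+)$/m.exec(fm[1]) : null;
  findings.push({ rule: "description", ok: Boolean(desc?.[1]?.trim()), detail: desc ? "present" : "missing or empty" });

  // file-size: keep SKILL.md small (the 10 KB threshold is an assumption)
  const kb = statSync(path).size / 1024;
  findings.push({ rule: "file-size", ok: kb <= 10, detail: `${kb.toFixed(1)} KB` });

  return findings;
}
```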
Set up evals for an agent codebase or check eval status after changes. Determines readiness, identifies what can be tested, and prepares the environment. Use when starting evals for the first time, returning after a code change, or figuring out what to do next. Also triggered by "set up evals", "is my agent ready?", "eval status", "what should I do next?", "init eval", "evaluate my agent", "test my agent", "help me eval this", "get started with evals", "where do I start", "how do I test this agent", "check my setup". This is the default entry point; use it whenever a user wants to evaluate an agent and you're unsure which skill to start with.