Create new eval suites for the deepagentsjs monorepo. Handles dataset design, test case scaffolding, scoring logic, vitest configuration, and LangSmith integration. Use when the user asks to: (1) create an eval, (2) write an evaluation, (3) add a benchmark, (4) build an eval suite, (5) evaluate agent behaviour, (6) add test cases for a capability, or (7) implement an existing benchmark (e.g. oolong, AgentBench, SWE-bench). Trigger on phrases like 'create eval', 'new eval', 'add eval', 'benchmark', 'evaluate', 'eval suite', 'write evals for'.
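For concreteness, one generated test case might look like the sketch below: plain vitest over a tiny inline dataset, with `runAgent` standing in for the monorepo's real agent entry point and a simple substring scorer in place of LangSmith-reported feedback (both are assumptions, not the actual API).

```ts
import { describe, expect, it } from "vitest";

interface EvalCase {
  input: string;
  mustContain: string[]; // substrings a passing answer should include
}

const dataset: EvalCase[] = [
  { input: "Summarize the README", mustContain: ["install", "usage"] },
];

// Stub standing in for the monorepo's real agent entry point (assumption).
async function runAgent(input: string): Promise<string> {
  return `install and usage notes for: ${input}`;
}

describe("summarization eval", () => {
  it.each(dataset)("handles: $input", async ({ input, mustContain }) => {
    const answer = (await runAgent(input)).toLowerCase();
    // Naive containment scorer; a real suite might report this score
    // to LangSmith as feedback rather than hard-asserting.
    const score =
      mustContain.filter((s) => answer.includes(s)).length / mustContain.length;
    expect(score).toBeGreaterThanOrEqual(0.5);
  });
});
```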
Expertise in evaluating AWS accounts for compliance: which checks are meaningful, which SCF controls they map to, and how to interpret AWS CLI output.
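A single check from such a skill might be shaped like this sketch: a real AWS CLI call wrapped in a check object, with the SCF control ID shown purely as an illustrative mapping (credentials and the threshold are assumptions).

```ts
import { execFileSync } from "node:child_process";

interface Check {
  id: string;
  scfControl: string; // illustrative mapping, not an authoritative one
  run(): boolean;
}

const passwordPolicyCheck: Check = {
  id: "iam-password-min-length",
  scfControl: "IAC-10 (illustrative)",
  run() {
    // Requires configured AWS credentials; the CLI errors with
    // NoSuchEntity if the account has no password policy at all.
    const out = execFileSync(
      "aws",
      ["iam", "get-account-password-policy", "--output", "json"],
      { encoding: "utf8" },
    );
    const policy = JSON.parse(out).PasswordPolicy;
    return policy.MinimumPasswordLength >= 14; // assumed threshold
  },
};

console.log(passwordPolicyCheck.id, passwordPolicyCheck.run() ? "PASS" : "FAIL");
```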
- 📁 examples/
- 📄 README.md
- 📄 SKILL.md
Qualify trade show leads from badge scans, booth notes, or voice memos into scored CRM-ready cards. "Score my booth leads" / "给展会线索打分" / "Leads qualifizieren" / "リードを評価する" / "calificar leads de feria". Keywords: 展会线索 / 资质审核 / 线索分级 / Leadqualifizierung / Messeleads / 展示会リード評価 / calificación de leads.
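A minimal sketch of the scoring step, assuming a three-signal rubric with weights invented for illustration (no fixed CRM schema is implied):

```ts
// Assumed lead shape; real cards would carry more CRM fields.
interface BoothLead {
  name: string;
  hasBudgetSignal: boolean;
  isDecisionMaker: boolean;
  requestedFollowUp: boolean;
  notes: string;
}

function scoreLead(lead: BoothLead): number {
  let score = 0;
  if (lead.hasBudgetSignal) score += 40;
  if (lead.isDecisionMaker) score += 35;
  if (lead.requestedFollowUp) score += 25;
  return score; // 0-100; thresholds map to hot/warm/cold tiers
}

const lead: BoothLead = {
  name: "A. Example",
  hasBudgetSignal: true,
  isDecisionMaker: false,
  requestedFollowUp: true,
  notes: "Asked about pricing at the booth",
};
console.log(scoreLead(lead)); // 65 -> "warm"
```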
This skill should be used when the user asks "will AI affect my job", "is my role at risk from AI", "AI impact on my career", "will my job be automated", "how will AI change my role", "is my role safe from automation", "should I be worried about AI", or "what jobs are AI replacing". Performs live research into whether the user's current or target role faces material AI disruption in the next 12 months, then delivers a frank assessment with a 6-month mitigation plan.
Create a trigger-evaluation setup for a toolkit skill. Use when the user wants to test whether a skill's description triggers correctly, set up eval workspaces, or generate trigger test queries for a skill. Also use when the user says 'create eval', 'test triggers', or 'eval skill', or wants to measure skill-triggering accuracy.
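Generated trigger test queries reduce to a small table like the one below; the keyword matcher is a cheap stand-in for the model-side routing actually under test, and the sample phrases are assumptions.

```ts
interface TriggerCase {
  query: string;
  expectTrigger: boolean;
}

const cases: TriggerCase[] = [
  { query: "create eval for my skill", expectTrigger: true },
  { query: "test triggers on the summarizer skill", expectTrigger: true },
  { query: "what's the weather today?", expectTrigger: false },
];

// Stand-in for the model's routing decision (assumption).
const triggerPhrases = ["create eval", "test triggers", "eval skill"];

function wouldTrigger(query: string): boolean {
  const q = query.toLowerCase();
  return triggerPhrases.some((p) => q.includes(p));
}

const correct = cases.filter((c) => wouldTrigger(c.query) === c.expectTrigger);
console.log(`trigger accuracy: ${correct.length}/${cases.length}`);
```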
- 📁 scripts/
- 📁 subskills/
- 📁 templates/
- 📄 .gitignore
- 📄 AGENTS.md
- 📄 CLAUDE.md
Personal AI tutor — generates learning paths, sends daily tasks via Telegram, evaluates progress, and adapts to the learner.
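The daily-task push can be as small as the sketch below, using Telegram's standard Bot API `sendMessage` endpoint (Node 18+ for global `fetch`); the environment variable names and message format are assumptions.

```ts
async function sendDailyTask(task: string): Promise<void> {
  // Assumed env var names; the endpoint itself is Telegram's documented Bot API.
  const token = process.env.TELEGRAM_BOT_TOKEN;
  const chatId = process.env.TELEGRAM_CHAT_ID;
  const res = await fetch(`https://api.telegram.org/bot${token}/sendMessage`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ chat_id: chatId, text: `Today's task: ${task}` }),
  });
  if (!res.ok) throw new Error(`Telegram API error: ${res.status}`);
}

sendDailyTask("Review yesterday's flashcards, then attempt exercise 3.").catch(
  console.error,
);
```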
Classify and score business risks so agents produce consistent, comparable assessments.
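One way to pin down "consistent, comparable" is a fixed likelihood × impact matrix; the 1-5 scales and band thresholds below are assumptions for illustration.

```ts
type Rating = "low" | "medium" | "high" | "critical";

// Two agents scoring the same risk land on the same rating because
// the banding is deterministic.
function rateRisk(likelihood: number, impact: number): Rating {
  const score = likelihood * impact; // both inputs on an assumed 1-5 scale
  if (score >= 20) return "critical";
  if (score >= 12) return "high";
  if (score >= 6) return "medium";
  return "low";
}

console.log(rateRisk(4, 5)); // "critical"
console.log(rateRisk(2, 2)); // "low"
```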
Diagnose and test Claude Code skills against Anthropic's 7 principles. Scans SKILL.md files, checks 8 rules (gotchas, description, allowed-tools, file-size, structure, frontmatter, conflicts, usage-hooks), classifies skill types, generates prescriptions, and runs eval tests. Use when checking skill quality, auditing skills, testing skills, or before publishing skills. Triggers on "스킬 점검", "스킬 진단", "스킬 테스트", "check skills", "audit skills", "test skills", "skill health", "pulser", "pulser eval".
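Two of the eight rules, sketched as pure functions over the SKILL.md source; the rule names follow the list above, but the implementations and thresholds are assumptions.

```ts
import { readFileSync } from "node:fs";

interface Rule {
  name: string;
  check(source: string): boolean;
}

const rules: Rule[] = [
  {
    name: "frontmatter",
    // Assumed shape: a YAML frontmatter block delimited by "---".
    check: (src) => src.startsWith("---") && src.indexOf("---", 3) > 0,
  },
  {
    name: "description",
    // Assumed threshold: a description field that is present and non-trivial.
    check: (src) => /description:\s*.{20,}/.test(src),
  },
];

const src = readFileSync("SKILL.md", "utf8");
for (const rule of rules) {
  console.log(rule.name, rule.check(src) ? "PASS" : "FAIL");
}
```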
Set up evals for an agent codebase or check eval status after changes. Determines readiness, identifies what can be tested, and prepares the environment. Use when starting evals for the first time, returning after a code change, or figuring out what to do next. Also triggered by "set up evals", "is my agent ready?", "eval status", "what should I do next?", "init eval", "evaluate my agent", "test my agent", "help me eval this", "get started with evals", "where do I start", "how do I test this agent", "check my setup". This is the default entry point; use it whenever a user wants to evaluate an agent and you're unsure which skill to start with.
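The readiness decision reduces to a checklist with a suggested next step, roughly like this sketch (the specific checks assume a typical Node agent repo and an `evals/` directory convention):

```ts
import { existsSync } from "node:fs";

interface Readiness {
  ready: boolean;
  nextStep: string;
}

function checkEvalReadiness(repoRoot: string): Readiness {
  if (!existsSync(`${repoRoot}/package.json`)) {
    return { ready: false, nextStep: "Run from the repo root." };
  }
  if (!existsSync(`${repoRoot}/evals`)) {
    return {
      ready: false,
      nextStep: "Scaffold an evals/ directory with a first dataset.",
    };
  }
  return { ready: true, nextStep: "Run the existing suite and review scores." };
}

console.log(checkEvalReadiness(process.cwd()));
```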
Deep product audit. A brutally honest assessment of what you're building, who it's for, the biggest strategic gap, and the question you're avoiding. Produces AUDIT.md.
- 📁 assets/
- 📁 references/
- 📄 SKILL.md
Conducts a structured gap assessment of an organization's readiness against ISO 42001:2023 (AI Management System standard). Runs an interview-style evaluation across all mandatory clauses (4-10) and applicable Annex A controls. Produces a scored gap assessment report saved to the vault, a draft Statement of Applicability, and a prioritized list of gaps to address before certification. Requires a vault created by /setup-iso42001-vault.
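The scored report might aggregate per-clause maturity roughly as sketched below; the 0-3 maturity scale and report layout are assumptions, while the clause numbers are ISO 42001's own.

```ts
interface ClauseScore {
  clause: string; // e.g. "4 Context of the organization"
  score: 0 | 1 | 2 | 3; // assumed scale: 0 = absent ... 3 = fully implemented
  gap: string;
}

function summarize(scores: ClauseScore[]): string {
  const total = scores.reduce((sum, s) => sum + s.score, 0);
  const max = scores.length * 3;
  // Worst-scoring clauses first, so the report doubles as the priority list.
  const worstFirst = [...scores].sort((a, b) => a.score - b.score);
  return [
    `Readiness: ${total}/${max}`,
    ...worstFirst.map((s) => `Clause ${s.clause}: ${s.score}/3 - ${s.gap}`),
  ].join("\n");
}

console.log(
  summarize([
    { clause: "4", score: 2, gap: "Interested parties not documented" },
    { clause: "6", score: 0, gap: "No AI risk assessment process" },
  ]),
);
```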