langchain-ai
from GitHub
Research & Analysis
Create new eval suites for the deepagentsjs monorepo. Handles dataset design, test case scaffolding, scoring logic, vitest configuration, and LangSmith integration. Use when the user asks to: (1) create an eval, (2) write an evaluation, (3) add a benchmark, (4) build an eval suite, (5) evaluate agent behaviour, (6) add test cases for a capability, or (7) implement an existing benchmark (e.g. oolong, AgentBench, SWE-bench). Trigger on phrases like 'create eval', 'new eval', 'add eval', 'benchmark', 'evaluate', 'eval suite', 'write evals for'.
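A minimal sketch of the scoring logic such an eval suite might use, assuming a hypothetical dataset shape and an exact-match scorer (the names `EvalCase`, `exactMatch`, and `suiteScore` are illustrative, not the skill's actual output):

```typescript
// Hypothetical shape of one eval dataset entry (illustrative only).
interface EvalCase {
  input: string;
  expected: string;
}

// A simple binary scorer: 1 for an exact (trimmed) match, 0 otherwise.
function exactMatch(output: string, expected: string): number {
  return output.trim() === expected.trim() ? 1 : 0;
}

// Aggregate a suite score as the mean of per-case scores, given a
// `run` function that produces the agent's output for an input.
function suiteScore(cases: EvalCase[], run: (input: string) => string): number {
  if (cases.length === 0) return 0;
  const total = cases.reduce(
    (sum, c) => sum + exactMatch(run(c.input), c.expected),
    0,
  );
  return total / cases.length;
}
```

In practice each case would become a vitest test and the scores would be logged to LangSmith, but the dataset-plus-scorer split above is the core structure.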
andreahaku
from GitHub
Development & Programming
Autonomous self-improving loop for Claude Code skills. Reads a target skill's SKILL.md, runs it multiple times against binary eval assertions, scores the output, and iteratively mutates the skill instructions to maximize the pass rate. Use when the user says "improve skill", "optimize skill", "auto-improve", "run self-improvement loop", "make skill better", "eval my skill", "test and improve skill", "autoresearch skill", "skill-improver", "run evals on skill", or wants to autonomously improve a skill overnight.
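The run–score–mutate loop described above can be sketched roughly as follows; `runSkill` (executing the skill with a given SKILL.md body) and `mutate` (rewriting the instructions) are hypothetical stand-ins for what the real skill delegates to Claude:

```typescript
// A binary eval assertion over one skill run's output (illustrative).
type Assertion = (output: string) => boolean;

// Fraction of assertions that pass, averaged over several runs
// to smooth out nondeterminism in the skill's output.
function passRate(
  runSkill: (instructions: string) => string,
  instructions: string,
  assertions: Assertion[],
  runs: number,
): number {
  let passed = 0;
  for (let i = 0; i < runs; i++) {
    const output = runSkill(instructions);
    passed += assertions.filter((a) => a(output)).length;
  }
  return passed / (runs * assertions.length);
}

// Hill-climbing loop: propose a mutated SKILL.md, keep it only if the
// pass rate improves, and return the best instructions found.
function improve(
  runSkill: (instructions: string) => string,
  mutate: (instructions: string) => string,
  instructions: string,
  assertions: Assertion[],
  iterations: number,
  runs = 3,
): { instructions: string; score: number } {
  let best = instructions;
  let bestScore = passRate(runSkill, best, assertions, runs);
  for (let i = 0; i < iterations; i++) {
    const candidate = mutate(best);
    const score = passRate(runSkill, candidate, assertions, runs);
    if (score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return { instructions: best, score: bestScore };
}
```

Keeping the assertions binary makes the score a simple pass fraction, which is what lets the loop compare candidate instruction sets without a human in the loop.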