Create new eval suites for the deepagentsjs monorepo. Handles dataset design, test case scaffolding, scoring logic, vitest configuration, and LangSmith integration. Use when the user asks to: (1) create an eval, (2) write an evaluation, (3) add a benchmark, (4) build an eval suite, (5) evaluate agent behaviour, (6) add test cases for a capability, or (7) implement an existing benchmark (e.g. oolong, AgentBench, SWE-bench). Trigger on phrases like 'create eval', 'new eval', 'add eval', 'benchmark', 'evaluate', 'eval suite', 'write evals for'.
Create coding agent benchmarks for evaluation with nasde. Use this skill when the user wants to:

- Create a new benchmark project (a set of tasks for evaluating coding agents)
- Add tasks to an existing benchmark
- Create or modify agent variants (configurations that control agent behavior)
- Set up assessment dimensions and scoring criteria
- Verify that a new benchmark's Docker environment and tests work

Even if the user doesn't say "benchmark" — if they're talking about creating coding challenges for AI agents or setting up evaluation criteria, this skill applies.

---

# NASDE Benchmark Creator

Create and configure coding agent benchmarks for evaluation with `nasde`. A benchmark is a set of coding tasks that AI agents solve inside isolated Docker containers, scored both by functional tests (pass/fail) and by an LLM-as-a-Judge architecture assessment.

## Step 1: Understand what to evaluate

Before creating files, clarify with the user:

- What programming language/framework? (determines the Dockerfile base image)
- What kind of coding challenges? (feature implementation, refactoring, bug fixing, etc.)
- What source repository should the agent work on? (git URL cloned in the Dockerfile)
- What quality dimensions should be assessed? (these are benchmark-specific, not hardcoded)

## Step 2: Scaffold or create the project

For a new benchmark, run:

```bash
nasde init my-benchmark --name my-benchmark
```

This creates the base structure. Then customize the generated files. For adding tasks to an existing benchmark, skip to Step 4.

## Step 3: Define assessment dimensions

Edit `assessment_dimensions.json`. Each benchmark has its OWN dimensions — design them for what matters in this benchmark's domain.
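As a loose illustration only, a domain-specific `assessment_dimensions.json` might look like the sketch below. The actual schema is whatever `nasde init` generates, so the field names (`id`, `description`, `weight`) and the weighting scheme here are assumptions, not the tool's documented format.

```json
[
  {
    "id": "code_quality",
    "description": "Is the change readable, idiomatic, and consistent with the existing codebase?",
    "weight": 0.4
  },
  {
    "id": "architecture",
    "description": "Does the solution respect existing module boundaries and avoid unnecessary coupling?",
    "weight": 0.3
  },
  {
    "id": "test_coverage",
    "description": "Are new behaviors covered by meaningful tests?",
    "weight": 0.3
  }
]
```

Whatever the real schema, keep the dimensions specific to this benchmark's domain rather than reusing generic quality criteria.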
Add a new simulation benchmark to the VLA evaluation harness. Use this skill whenever the user wants to integrate, create, or add a new benchmark or simulation environment — e.g. 'add ManiSkill3', 'integrate OmniGibson', 'hook up a new sim'. Also use when they ask how benchmarks are structured or want to understand the benchmark interface.
Add a new SWE benchmark task from a real GitHub bug-fix. Use when the user provides a GitHub issue or PR URL and wants to add it to the bench-swe pipeline.
Write benchmark scripts for EmbodiChain modules, following the project's conventions.
Academic paper search and analysis service. When the user's request involves any of the following academic scenarios, this skill MUST be used instead of web-search: searching for papers, finding papers on ArXiv/PubMed/PapersWithCode, looking up SOTA leaderboards and benchmark results, citation analysis, generating blog-style paper explainers, finding GitHub repositories related to a paper, or getting trending paper recommendations. Keywords: arxiv, paper, papers, academic, scholar, research, 论文, 学术, 搜索论文, 找论文, SOTA, benchmark, MMLU, citation, 引用, 博客, blog, PapersWithCode, HuggingFace.
Before installing or using a skill, check its independent benchmark report on SkillTester.ai. Trigger this skill when the user is about to install a third-party skill, or when the user explicitly says `Check this skill <skill_url>`. Resolve the provided URL to SKILL.md, extract name and description, query the server by name, and return the benchmark result when the description is either an exact match or a high-overlap near match that likely represents a newer skill revision.
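For illustration, the "exact match or high-overlap near match" rule could be approximated with a token-overlap check like the Python sketch below. The 0.8 threshold and the helper names are assumptions made for this sketch, not part of SkillTester.ai's documented behavior.

```python
import re


def description_overlap(a: str, b: str) -> float:
    """Token-level Jaccard overlap between two skill descriptions."""
    tokens_a = set(re.findall(r"[a-z0-9]+", a.lower()))
    tokens_b = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def is_match(local_description: str, reported_description: str,
             threshold: float = 0.8) -> bool:
    """Accept an exact match, or a high-overlap near match that likely
    represents a newer revision of the same skill.

    NOTE: the 0.8 threshold is illustrative, not a documented value.
    """
    if local_description.strip() == reported_description.strip():
        return True
    return description_overlap(local_description, reported_description) >= threshold
```

A near-match check like this lets the lookup still return a benchmark report when the installed skill's description has been lightly revised since it was last tested.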