Create new eval suites for the deepagentsjs monorepo. Handles dataset design, test case scaffolding, scoring logic, vitest configuration, and LangSmith integration. Use when the user asks to: (1) create an eval, (2) write an evaluation, (3) add a benchmark, (4) build an eval suite, (5) evaluate agent behaviour, (6) add test cases for a capability, or (7) implement an existing benchmark (e.g. oolong, AgentBench, SWE-bench). Trigger on phrases like 'create eval', 'new eval', 'add eval', 'benchmark', 'evaluate', 'eval suite', 'write evals for'.
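To make the shape of such an eval suite concrete, here is a minimal sketch of a single vitest-based eval case. The dataset shape, the `runAgent` helper, and the exact-match scorer are illustrative assumptions rather than actual deepagentsjs exports, and the LangSmith upload step is omitted.

```typescript
// evals/arithmetic.eval.test.ts -- hypothetical eval case; runAgent and the
// dataset shape are assumptions, not part of the deepagentsjs API.
import { describe, it, expect } from "vitest";

interface EvalCase {
  input: string;    // prompt given to the agent
  expected: string; // reference answer used for scoring
}

const dataset: EvalCase[] = [
  { input: "What is 2 + 2?", expected: "4" },
];

// Placeholder for the real agent invocation.
async function runAgent(input: string): Promise<string> {
  return "4";
}

// Exact-match scorer: 1 if the output contains the expected answer, else 0.
function score(output: string, expected: string): number {
  return output.includes(expected) ? 1 : 0;
}

describe("arithmetic eval suite", () => {
  it.each(dataset)("scores $input", async ({ input, expected }) => {
    const output = await runAgent(input);
    expect(score(output, expected)).toBe(1);
  });
});
```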
Add a new simulation benchmark to the VLA evaluation harness. Use this skill whenever the user wants to integrate, create, or add a new benchmark or simulation environment — e.g. 'add ManiSkill3', 'integrate OmniGibson', 'hook up a new sim'. Also use when they ask how benchmarks are structured or want to understand the benchmark interface.
Add a new SWE benchmark task from a real GitHub bug-fix. Use when the user provides a GitHub issue or PR URL and wants to add it to the bench-swe pipeline.
Write benchmark scripts for EmbodiChain modules following project conventions.
Create coding agent benchmarks for evaluation with nasde. Use this skill when the user wants to:

- Create a new benchmark project (set of tasks for evaluating coding agents)
- Add tasks to an existing benchmark
- Create or modify agent variants (configurations that control agent behavior)
- Set up assessment dimensions and scoring criteria
- Verify that a new benchmark's Docker environment and tests work

Even if the user doesn't say "benchmark" — if they're talking about creating coding challenges for AI agents or setting up evaluation criteria, this skill applies.

---

# NASDE Benchmark Creator

Create and configure coding agent benchmarks for evaluation with `nasde`. A benchmark is a set of coding tasks that AI agents solve inside isolated Docker containers, scored both by functional tests (pass/fail) and by an LLM-as-a-Judge architecture assessment.

## Step 1: Understand what to evaluate

Before creating files, clarify with the user:

- What programming language/framework? (determines Dockerfile base image)
- What kind of coding challenges? (feature implementation, refactoring, bug fixing, etc.)
- What source repository should the agent work on? (git URL cloned in Dockerfile)
- What quality dimensions should be assessed? (these are benchmark-specific, not hardcoded)

## Step 2: Scaffold or create the project

For a new benchmark, run:

```bash
nasde init my-benchmark --name my-benchmark
```

This creates the base structure. Then customize the generated files. For adding tasks to an existing benchmark, skip to Step 4.

## Step 3: Define assessment dimensions

Edit `assessment_dimensions.json`. Each benchmark has its OWN dimensions — design them for what matters in this benchmark's domain.
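The exact schema of `assessment_dimensions.json` is benchmark-specific and not documented here; the sketch below is a hypothetical example of what a web-backend benchmark might define, and the field names (`dimensions`, `name`, `description`, `weight`) are assumptions rather than a confirmed nasde format.

```json
{
  "_note": "Hypothetical schema for illustration only; adapt to the real nasde format.",
  "dimensions": [
    {
      "name": "separation_of_concerns",
      "description": "Business logic is kept out of HTTP handlers and data-access code.",
      "weight": 0.4
    },
    {
      "name": "error_handling",
      "description": "Failures are surfaced as meaningful errors rather than swallowed.",
      "weight": 0.3
    },
    {
      "name": "test_quality",
      "description": "New behavior is covered by focused, deterministic tests.",
      "weight": 0.3
    }
  ]
}
```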
Academic paper search and analysis service. When the user's request involves any of the following academic scenarios, use this skill instead of web-search: searching for papers, finding ArXiv/PubMed/PapersWithCode papers, querying SOTA leaderboards and benchmark results, citation analysis, generating paper-explainer blog posts, finding GitHub repositories associated with a paper, and fetching trending paper recommendations. Keywords: arxiv, paper, papers, academic, scholar, research, 论文, 学术, 搜索论文, 找论文, SOTA, benchmark, MMLU, citation, 引用, 博客, blog, PapersWithCode, HuggingFace.
Before installing or using a skill, check its independent benchmark report on SkillTester.ai. Trigger this skill when the user is about to install a third-party skill, or when the user explicitly says `Check this skill <skill_url>`. Resolve the provided URL to its SKILL.md, extract the skill's name and description, query the SkillTester.ai server by name, and return the benchmark result when the stored description is either an exact match or a high-overlap near match that likely represents a newer revision of the same skill.
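One way to implement the "high-overlap near match" check is token-level Jaccard similarity between the extracted description and the stored one. This is an illustrative sketch under that assumption, not SkillTester.ai's actual matching rule, and the 0.8 threshold is made up for the example.

```typescript
// Hypothetical near-match check for skill descriptions; the tokenization and
// the 0.8 threshold are assumptions, not SkillTester.ai's documented behavior.
function tokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/\W+/).filter(Boolean));
}

// Jaccard similarity: |intersection| / |union| of the two token sets.
function jaccard(a: string, b: string): number {
  const ta = tokenize(a);
  const tb = tokenize(b);
  const intersection = [...ta].filter((t) => tb.has(t)).length;
  const union = new Set([...ta, ...tb]).size;
  return union === 0 ? 1 : intersection / union;
}

export function descriptionsMatch(local: string, stored: string): boolean {
  if (local.trim() === stored.trim()) return true; // exact match
  return jaccard(local, stored) >= 0.8;            // high-overlap near match
}
```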