nasde-benchmark-creator
Create coding agent benchmarks for evaluation with nasde. Use this skill when the user wants to:

- Create a new benchmark project (set of tasks for evaluating coding agents)
- Add tasks to an existing benchmark
- Create or modify agent variants (configurations that control agent behavior)
- Set up assessment dimensions and scoring criteria
- Verify that a new benchmark's Docker environment and tests work

Even if the user doesn't say "benchmark", this skill applies if they're talking about creating coding challenges for AI agents or setting up evaluation criteria.

---

# NASDE Benchmark Creator

Create and configure coding agent benchmarks for evaluation with `nasde`. A benchmark is a set of coding tasks that AI agents solve inside isolated Docker containers, scored both by functional tests (pass/fail) and by an LLM-as-a-Judge architecture assessment.

## Step 1: Understand what to evaluate

Before creating files, clarify with the user:

- What programming language/framework? (determines Dockerfile base image)
- What kind of coding challenges? (feature implementation, refactoring, bug fixing, etc.)
- What source repository should the agent work on? (git URL cloned in Dockerfile)
- What quality dimensions should be assessed? (these are benchmark-specific, not hardcoded)

## Step 2: Scaffold or create the project

For a new benchmark, run:

```bash
nasde init my-benchmark --name my-benchmark
```

This creates the base structure. Then customize the generated files.

For adding tasks to an existing benchmark, skip to Step 4.

## Step 3: Define assessment dimensions

Edit `assessment_dimensions.json`. Each benchmark has its OWN dimensions; design them for what matters in this benchmark's domain.
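Hand-editing JSON makes it easy to leave a trailing comma or an unbalanced brace, so a quick syntax check after editing may be worthwhile. The sketch below is not part of `nasde`: `check_json` is a hypothetical helper, and it assumes `python3` is available on the PATH. It confirms only that the file parses as JSON, not that it matches whatever structure `nasde` expects.

```bash
# Hypothetical helper (not a nasde command): succeeds only if the file parses as JSON.
# Assumes python3 is on the PATH; json.tool is part of the Python standard library.
check_json() {
  python3 -m json.tool "$1" > /dev/null
}

# Run from the benchmark project root after editing the dimensions file.
if [ -f assessment_dimensions.json ]; then
  check_json assessment_dimensions.json && echo "assessment_dimensions.json: valid JSON"
fi
```

If the file is malformed, `json.tool` prints the line and column of the first parse error to stderr, which is usually enough to locate the problem.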
Source: GitHub, https://github.com/NoesisVision/nasde-toolkit