shelvick
from GitHub
Research & Analysis
Coordinates LiveBench benchmark runs. Reads a pre-built manifest, dispatches one solver per question by sequential index using batch_async, then scores all answers via score-run.sh and produces the benchmark report. Use when running LiveBench evaluations. Do NOT use for MMLU-Pro or general tasks.
Add a new SWE benchmark task from a real GitHub bug-fix. Use when the user provides a GitHub issue or PR URL and wants to add it to the bench-swe pipeline.
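The LiveBench coordination flow above (read a manifest, dispatch one solver per question by sequential index, then score) can be sketched as follows. This is a minimal illustrative sketch, not the actual implementation: `solve_question`, the manifest shape, and the stub answers are all assumptions; the real pipeline uses batch_async for dispatch and shells out to score-run.sh for scoring.

```python
import asyncio

async def solve_question(index: int, question: dict) -> dict:
    # Placeholder solver; the real coordinator dispatches one LLM solver
    # per question here (via batch_async in the actual pipeline).
    await asyncio.sleep(0)
    return {"index": index, "answer": f"stub:{question['id']}"}

async def run_benchmark(manifest: dict) -> list:
    # One solver per question, keyed by its sequential index in the manifest.
    return list(await asyncio.gather(
        *(solve_question(i, q) for i, q in enumerate(manifest["questions"]))
    ))

# Hypothetical pre-built manifest; the real one is read from disk.
manifest = {"questions": [{"id": "q1"}, {"id": "q2"}]}
answers = asyncio.run(run_benchmark(manifest))
# The real pipeline would now invoke score-run.sh over the collected answers
# and assemble the benchmark report from its output.
```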
Chekhovin
from GitHub
Data & AI
Search and analyze LLM benchmark results within a fixed benchmark universe, then produce evidence-based model strength-and-weakness reports or domain-leader summaries. Use when comparing a model across benchmarks, ranking the best models by domain, explaining what a benchmark measures, checking predecessor-vs-current progress, or writing benchmark reports that must prioritize exact model version, evaluation date, benchmark variant, score semantics, sub-scores, and benchmark defect warnings. Works with browser, web, and multimodal extraction for text, table, canvas, or image-only leaderboards.
sublimotion
from GitHub
Operations & Delivery
Plan, execute, and analyze LLM serving benchmarks across vLLM and SGLang configurations. Use when user says "run benchmarks", "benchmark config", "QPS sweep", "compare serving configs", "custbench", "benchmark analysis", or "generate benchmark report". Do NOT use for writing Terraform (use terraform-automation), deployment validation (use deployment-orchestrator), or GPU hardware diagnostics (use gpu-infra-troubleshooting).
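A QPS sweep of the kind this agent plans can be sketched as a grid over serving configs and load levels. Everything here is an assumption for illustration: the config fields, QPS levels, and the synthetic latency in `run_point` are not the actual custbench schema; a real harness would drive load against live vLLM or SGLang servers and record measured percentiles.

```python
from itertools import product

# Illustrative serving configs (hypothetical fields, not the real schema).
configs = [
    {"engine": "vllm", "tensor_parallel": 1},
    {"engine": "sglang", "tensor_parallel": 1},
]
qps_levels = [1, 4, 16]

def run_point(config: dict, qps: int) -> dict:
    # Placeholder measurement; a real harness would generate requests at
    # `qps` against a running server and collect latency percentiles.
    return {"engine": config["engine"], "qps": qps, "p99_ms": 100.0 / qps}

# Sweep the full grid: every config at every QPS level.
results = [run_point(c, q) for c, q in product(configs, qps_levels)]
```

Collecting one record per (config, QPS) point makes the downstream comparison and report generation a straightforward group-by over `engine`.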