Automatically evaluate and compare multiple AI models or agents without pre-existing test data. Generates test queries from a task description, collects responses from all target endpoints, auto-generates evaluation rubrics, runs pairwise comparisons via a judge model, and produces win-rate rankings with reports and charts. Supports checkpoint resume, incremental endpoint addition, and judge model hot-swap. Use when the user asks to compare, benchmark, or rank multiple models or agents on a custom task, or run an arena-style evaluation. --- # Auto Arena Skill End-to-end automated model comparison using the OpenJudge `AutoArenaPipeline`: 1. **Generate queries** — LLM creates diverse test queries from task description 2. **Collect responses** — query all target endpoints concurrently 3. **Generate rubrics** — LLM produces evaluation criteria from task + sample queries 4. **Pairwise evaluation** — judge model compares every model pair (with position-bias swap) 5. **Analyze & rank** — compute win rates, win matrix, and rankings 6. **Report & charts** — Markdown report + win-rate bar chart + optional matrix heatmap ## Prerequisites ```bash # Install OpenJudge pip install py-openjudge # Extra dependency for auto_arena (chart generation) pip install matplotlib ``` ## Gather from user before running | Info | Required? | Notes | |------|-----------|-------| | Task description | Yes | What the models/agents should do (set in config YAML) | | Target endpoints | Yes | At least 2 OpenAI-compatible endpoints to compare | | Judge endpoint | Yes | Strong model for pairwise evaluation (e.g. `gpt-4`, `qwen-max`) | | API keys | Yes | Env vars: `OPENAI_API_KEY`, `DASHSCOPE_API_KEY`, etc. | | Number of queries | No | Default: `20` | | Seed queries | No | Example queries to guide generation style | | System prompts | No | Per-endpoint system prompts | | Output directory | No | Default: `./evaluation_results` | | Report language | No | `"zh"` (default) or `"en"` | ## Quick start ### CLI `
- 📁 api/
- 📁 commands/
- 📄 __init__.py
- 📄 builder_fee.py
- 📄 config.py
Autonomous Hyperliquid trading — 14 strategies (MM, momentum, arbitrage, LLM) with APEX multi-slot orchestrator, REFLECT performance review, DSL trailing stops, and builder fee revenue collection.
需求池管理。用户随时抛出想法/痛点,AI 负责追问、整理、合并、归档到需求池文件。用户准备开新版本时,协助从池中筛选。痛点驱动,不做提前排期。
Create grouped detection narratives that tie individual rules into coherent threat stories. Covers Splunk Analytic Stories, Elastic detection rule groups, and Sentinel analytics grouping.
- 📁 rules/
- 📄 AGENTS.md
- 📄 metadata.json
- 📄 README.md
MUST USE when reviewing ClickHouse schemas, queries, or configurations. Contains 28 rules that MUST be checked before providing recommendations. Always read relevant rule files and cite specific rules in responses.
Analyzes and fixes A2A Transport Compatibility Kit (TCK) issues by understanding the specification, reproducing the failure, implementing the fix, and validating it works.
- 📁 agents/
- 📁 evals/
- 📁 references/
- 📄 SKILL.md
Use when creating cloned voices with Alibaba Cloud Model Studio CosyVoice customization models, especially cosyvoice-v3.5-plus or cosyvoice-v3.5-flash, from reference audio and then reusing the returned voice_id in later TTS calls.
Audit and score blog posts on a 5-category 100-point scoring system covering content quality, SEO optimization, E-E-A-T signals, technical elements, and AI citation readiness. Includes AI content detection (burstiness, phrase flagging, vocabulary diversity). Supports export formats (markdown, JSON, table) and batch analysis with sorting. Generates prioritized recommendations (Critical/High/Medium/Low) with specific fixes. Works with any format (MDX, markdown, HTML, URL). Use when user says "analyze blog", "audit blog", "blog score", "check blog quality", "blog review", "rate this blog", "blog health check". --- # Blog Analyzer -- Quality Audit & Scoring Scores blog posts on a 0-100 scale across 5 categories and provides prioritized improvement recommendations. Includes AI content detection analysis. Works with local files or published URLs.
Follow this loop every time. Jumping straight to "fix" without reproducing
Explains ev-node architecture, components, and internal workings. Use when the user asks how ev-node works, wants to understand the block package, DA layer, sequencing, namespaces, or needs architecture explanations. Covers block production, syncing, DA submission, forced inclusion, single vs based sequencer, and censorship resistance.
- 📁 reference/
- 📁 scripts/
- 📄 SKILL.md
High-performance data analysis using Polars - load, transform, aggregate, visualize and export tabular data. Use for CSV/JSON/Parquet processing, statistical analysis, time series, and creating charts.
- 📁 .claude/
- 📁 .github/
- 📁 docs/
- 📄 .gitignore
- 📄 AGENTS.md
- 📄 Cargo.lock
> description: "wx-cli — 从本地微信数据库查询聊天记录、联系人、会话、收藏等。用户提到微信聊天记录、联系人、消息历史、群成员、收藏内容时,使用此 skill 安装并调用 wx-cli。"