skill-evaluator

Everyone's writing skills. Nobody knows if they're any good.

This skill scores yours — against a rubric reverse-engineered from the 9 most-starred Claude Code skill repos on GitHub.


How it works

Most skill feedback is vibes. This isn't.

The rubric was built by analyzing what actually separates top-tier skills (obra/superpowers 229k★, affaan-m/ECC 216k★, anthropics/skills 151k★) from the rest. Seven dimensions. 100 points. A tier you can point to.

## Skill Evaluation: karpathy-guidelines

Scope: Single skill
Category: Guideline skill

| Dimension               | Score  | Max     |
|-------------------------|--------|---------|
| Trigger Clarity         | 16     | 20      |
| Instruction Specificity | 20     | 24      |
| Reference Density       | 5      | 8       |
| Verifiability           | 3      | 5       |
| Tradeoff Transparency   | 16     | 18      |
| Portability             | 14     | 18      |
| Maintenance Maturity    | 4      | 7       |
| **Total**               | **78** | **100** |

### Tier: Gold
Strong behavioral guidelines, but thin on bundled reference data and evals.

### Top 3 Improvements
1. **Reference Density**: Add a `references/` folder with lookup tables or code examples...
2. **Verifiability**: Define an output spec or add test prompts to an `evals/` folder...
3. **Trigger Clarity**: Add "do NOT use when..." conditions to the description...

Install

/plugin install skill-evaluator

Or clone and point at it manually:

git clone https://github.com/jedobe/skill-evaluator

Usage

Ask Claude to evaluate any skill — by file path, GitHub URL, or pasted content:

evaluate ~/.claude/skills/my-skill/SKILL.md
evaluate this skill: [paste SKILL.md here]

The 7 dimensions

Weights adapt to the skill category. Tool skills produce structured output or automate a task; Guideline skills shape how the model behaves. Both total 100.

| # | Dimension               | Tool | Guideline | The question it answers |
|---|-------------------------|------|-----------|-------------------------|
| 1 | Trigger Clarity         | 20   | 20        | Does the description tell the model when to invoke, not just what it does? |
| 2 | Instruction Specificity | 15   | 24        | Is there a concrete procedure, or just a description of desired output? |
| 3 | Reference Density       | 15   | 8         | Is supporting data bundled in, or does the model rely on training alone? |
| 4 | Verifiability           | 15   | 5         | Is there a defined output spec, eval suite, or success criteria? |
| 5 | Tradeoff Transparency   | 10   | 18        | Does the skill honestly state its limits and when NOT to use it? |
| 6 | Portability             | 15   | 18        | Zero-dep? Multi-harness? No hardcoded paths? |
| 7 | Maintenance Maturity    | 10   | 7         | License, version, CHANGELOG: does it look maintained? |

Tiers: Elite (85+) · Gold (70–84) · Silver (50–69) · Bronze (0–49)
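To make the arithmetic concrete, here is a minimal Python sketch of how per-dimension scores could be totaled and mapped to a tier. It is not the skill's actual implementation (the skill is a prompt that Claude follows, not a script). The names `WEIGHTS`, `TIERS`, `tier_for`, and `score_skill` are hypothetical, and the only data used is the weight table and tier thresholds above plus the sample scores from the karpathy-guidelines report.

```python
# Maximum points per dimension, copied from the weight table above.
# The dictionary layout and all names here are illustrative assumptions.
WEIGHTS = {
    "tool": {
        "Trigger Clarity": 20,
        "Instruction Specificity": 15,
        "Reference Density": 15,
        "Verifiability": 15,
        "Tradeoff Transparency": 10,
        "Portability": 15,
        "Maintenance Maturity": 10,
    },
    "guideline": {
        "Trigger Clarity": 20,
        "Instruction Specificity": 24,
        "Reference Density": 8,
        "Verifiability": 5,
        "Tradeoff Transparency": 18,
        "Portability": 18,
        "Maintenance Maturity": 7,
    },
}

# Lower bound of each tier, highest first: Elite 85+, Gold 70-84,
# Silver 50-69, Bronze 0-49.
TIERS = [(85, "Elite"), (70, "Gold"), (50, "Silver"), (0, "Bronze")]


def tier_for(total):
    """Return the tier name for a total score between 0 and 100."""
    for floor, name in TIERS:
        if total >= floor:
            return name
    raise ValueError(f"total must be non-negative, got {total}")


def score_skill(category, scores):
    """Validate per-dimension scores against the category maxima,
    then return (total, tier)."""
    weights = WEIGHTS[category]
    if set(scores) != set(weights):
        raise ValueError("scores must cover exactly the seven dimensions")
    for dimension, points in scores.items():
        if not 0 <= points <= weights[dimension]:
            raise ValueError(f"{dimension}: {points} is outside 0..{weights[dimension]}")
    total = sum(scores.values())
    return total, tier_for(total)


# Scores from the sample karpathy-guidelines report (a Guideline skill).
example = {
    "Trigger Clarity": 16,
    "Instruction Specificity": 20,
    "Reference Density": 5,
    "Verifiability": 3,
    "Tradeoff Transparency": 16,
    "Portability": 14,
    "Maintenance Maturity": 4,
}

print(score_skill("guideline", example))  # (78, 'Gold')
```

Both weight columns sum to 100, so totals from Tool and Guideline skills land on the same scale and can share one set of tier thresholds.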


Calibration

Scores are grounded in real repos. A few reference points:

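| Skill                               | Stars | Score | Tier  |
|-------------------------------------|-------|-------|-------|
| anthropics/skills (skill-creator)   |       | ~88   | Elite |
| JuliusBrussee/caveman               | 73k★  | ~85   | Elite |
| multica-ai/andrej-karpathy-skills   | 176k★ | ~78   | Gold  |
| OthmanAdi/planning-with-files       | 23k★  | ~76   | Gold  |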

If your skill scores 85+, it's in genuinely rare company.

Note: skill-evaluator itself is not in this table. Meta-skills (tools that evaluate other tools) don't fit the rubric — the dimensions were designed for task-performing skills. Scoring a rubric tool against its own rubric is circular.


Why this exists

The skill ecosystem is growing fast. There's no shared standard for what "good" looks like — so most feedback is either "looks fine" or a wall of subjective opinions.

This rubric is an attempt to make that judgment concrete, reproducible, and grounded in what the community has already validated with stars.


License

MIT
