AI Agent Skills: Bridging the Gap Between Foundation Models and Real-World Performance

Jonathan H. Westover, PhD
May 7

Abstract: Artificial intelligence agents powered by large language models have evolved from experimental prototypes into production systems tackling complex, multi-step tasks across professional domains. Yet a fundamental tension persists: foundation models provide broad capabilities but lack the procedural knowledge required for specialized workflows. This article examines Agent Skills—structured packages of domain-specific procedural knowledge that augment AI agents at inference time without model modification. Drawing on recent benchmark research evaluating 7,308 agent trajectories across 84 professional tasks, we analyze how Skills improve performance, when they fail, and what design principles distinguish effective augmentation from ineffective overhead. Evidence reveals that curated Skills improve task completion rates by an average of 16.2 percentage points, with effects varying dramatically by domain (from +4.5pp in software engineering to +51.9pp in healthcare). However, models cannot reliably generate their own procedural knowledge, and comprehensive documentation often underperforms focused guidance. These findings establish Skills efficacy as context-dependent rather than universal, with practical implications for practitioners deploying AI agents and researchers designing augmentation strategies.

The transformation of large language models from text generators into autonomous agents represents one of the most consequential shifts in applied artificial intelligence. Commercial systems such as Claude Code from Anthropic, Gemini CLI from Google, and Codex from OpenAI now enable developers to delegate complex, multi-step workflows to AI agents operating in terminal environments (Anthropic, 2025b; Google, 2025; OpenAI, 2025). These agents coordinate tool use, execute code, process structured data, and synthesize outputs across extended interaction sequences—capabilities that would have seemed implausible just years ago.

Yet beneath this progress lies a stubborn challenge: foundation models trained on broad corpora excel at general reasoning but stumble on domain-specific procedural knowledge. An agent may understand that a financial analysis requires calculating weighted averages but lack the specific Excel formulas, regulatory reporting conventions, or data transformation pipelines that practitioners use daily. Fine-tuning addresses this gap but at considerable cost: it requires labeled domain data and computational resources, and it can sacrifice the model's general-purpose utility.

Agent Skills offer an architectural alternative. Rather than modifying model weights, Skills inject structured procedural knowledge at inference time through modular packages containing instructions, code templates, worked examples, and reference documentation (Anthropic, 2025a). This approach mirrors successful computing paradigms: foundation models provide base capabilities analogous to CPUs; agent harnesses orchestrate context and tools like operating systems; and Skills extend competence to specialized domains like application software.

Skills ecosystems have proliferated rapidly. Community repositories now host thousands of user-contributed Skills spanning software engineering, data analysis, healthcare workflows, and enterprise operations. Yet despite this adoption, systematic evidence on Skills efficacy has remained sparse. Practitioners face fundamental questions: How much do Skills actually help compared to baseline context augmentation? Which Skills components—instructions versus code versus examples—drive improvements? When do Skills fail despite being present? Can models generate their own effective Skills?

Recent benchmark research provides the first large-scale answers. The SkillsBench evaluation framework tested seven agent-model configurations across 84 professional tasks under three conditions: no Skills, curated Skills, and self-generated Skills (Li et al., 2026). The results reveal both the promise and limitations of Skills as an augmentation strategy, with practical implications for deployment and design.

The Agent Skills Landscape

Defining Agent Skills in the AI Agent Context

An Agent Skill satisfies four definitional criteria that distinguish it from adjacent augmentation paradigms:

Procedural content. Skills encode how-to guidance—standard operating procedures, workflows, and task-specific heuristics—rather than factual retrieval. A Skill might specify the step-by-step process for creating Excel pivot tables programmatically or outline the USGS-standard methodology for flood frequency analysis. This procedural focus distinguishes Skills from retrieval-augmented generation (RAG), which provides declarative facts rather than procedural guidance (Lewis et al., 2020).

Task-class applicability. Skills target classes of problems rather than single instances. A well-designed Skill for financial modeling applies to multiple valuation scenarios, not just the specific example in the Skill documentation. This reusability distinguishes Skills from few-shot examples, which demonstrate specific input-output mappings without generalizable procedures (Brown et al., 2020).

Structured components. Each Skill comprises a required SKILL.md file containing natural language instructions, plus optional executable scripts, code templates, and reference documentation. This modular packaging enables version control, sharing, and composition across agent harnesses. The structured format distinguishes Skills from system prompts, which lack explicit resource bundling.

Portability. Skills are file-system artifacts that can be versioned, shared, and used across different Skills-compatible agent harnesses without modification. This portability distinguishes Skills from tool documentation, which describes API capabilities but doesn't provide portable procedural guidance (Schick et al., 2023; Qin et al., 2024).

Skills occupy a distinct position in the augmentation design space. Unlike RAG, which retrieves relevant documents, Skills package complete procedural workflows. Unlike tool use, which extends agent actions, Skills guide the orchestration of those actions. Unlike fine-tuning, which modifies model weights, Skills preserve model generality while adding domain competence.
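The structured-components and portability criteria above can be made concrete with a small loader. The following is a minimal sketch, not any harness's actual implementation: it assumes the SKILL.md file opens with a "---"-delimited block of "key: value" lines carrying name and description, and the Skill class, parse_frontmatter, load_skill, and the pivot-tables example directory are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class Skill:
    """In-memory view of one Skill directory (hypothetical structure)."""
    name: str
    description: str
    instructions: str
    resources: list[str] = field(default_factory=list)


def parse_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split an assumed '---'-delimited 'key: value' header from the body."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta: dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return meta, "\n".join(lines[i + 1:]).strip()
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}, text  # no closing delimiter: treat everything as body


def load_skill(skill_dir: Path) -> Skill:
    """Load the required SKILL.md plus any optional bundled files."""
    skill_md = skill_dir / "SKILL.md"
    if not skill_md.is_file():
        raise FileNotFoundError(f"{skill_dir} has no SKILL.md")
    meta, body = parse_frontmatter(skill_md.read_text(encoding="utf-8"))
    resources = sorted(
        p.relative_to(skill_dir).as_posix()
        for p in skill_dir.rglob("*")
        if p.is_file() and p.name != "SKILL.md"
    )
    return Skill(
        name=meta.get("name", skill_dir.name),
        description=meta.get("description", ""),
        instructions=body,
        resources=resources,
    )


def demo() -> Skill:
    """Build a throwaway Skill directory and load it back."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "pivot-tables"
        (root / "scripts").mkdir(parents=True)
        (root / "SKILL.md").write_text(
            "---\n"
            "name: pivot-tables\n"
            "description: Build Excel pivot tables programmatically\n"
            "---\n"
            "1. Load the workbook.\n"
            "2. Define row and value fields.\n",
            encoding="utf-8",
        )
        (root / "scripts" / "make_pivot.py").write_text("# placeholder\n", encoding="utf-8")
        return load_skill(root)
```

Because the Skill is just a directory of files, the same artifact can be copied, versioned, or shared without any change to the loader, which is the practical meaning of portability here.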

State of Practice: The Skills Ecosystem

Analysis of the public Skills ecosystem reveals both adoption patterns and quality challenges at scale. Aggregating Skills from open-source repositories (12,847 Skills), community marketplaces including Smithery.ai and skillmp.com (28,412 Skills), and corporate partner contributions (5,891 Skills) yields 47,150 unique Skills after deduplication (Li et al., 2026).

The ecosystem exhibits log-normal size distribution with median Skill size of approximately 1,569 tokens (roughly 2.3 KB), though the largest Skills in the top 1% exceed 50 KB and typically include extensive code resources. Domain coverage concentrates on Software Development (38% of Skills), Data Analysis (22%), DevOps/Infrastructure (15%), and Writing/Documentation (12%), with remaining domains comprising 13% of the corpus.

Temporal dynamics show exponential growth, with daily Skill creation remaining modest through late 2025 before surging to a peak of 18,904 new Skills in January 2026. This trajectory mirrors the adoption curve of other developer tool ecosystems, suggesting Skills have crossed an inflection point from early adoption to mainstream practice.

Yet quality remains heterogeneous. Using a 12-point rubric evaluating completeness (presence of required components), clarity (readability and organization), specificity (actionable versus vague guidance), and examples (presence and quality), the ecosystem mean quality score is 6.2 out of 12 (SD=2.8). This substantial variance indicates room for improvement in Skills authoring practices and highlights the gap between ecosystem-representative Skills and the high-quality artifacts needed for reliable augmentation.
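To illustrate how a rubric of this kind produces a score, here is a hedged sketch. The source names the four dimensions and the 12-point total but not the per-dimension scale, so the equal split of 0 to 3 points per dimension is an assumption, as are the function names rubric_score and summarize.

```python
from statistics import mean, pstdev

DIMENSIONS = ("completeness", "clarity", "specificity", "examples")
MAX_PER_DIMENSION = 3  # assumption: 4 dimensions x 3 points = 12 points


def rubric_score(ratings: dict[str, int]) -> int:
    """Sum per-dimension ratings into a 0-12 quality score."""
    missing = [d for d in DIMENSIONS if d not in ratings]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for dim in DIMENSIONS:
        if not 0 <= ratings[dim] <= MAX_PER_DIMENSION:
            raise ValueError(f"{dim} must be between 0 and {MAX_PER_DIMENSION}")
    return sum(ratings[dim] for dim in DIMENSIONS)


def summarize(scores: list[int]) -> tuple[float, float]:
    """Return (mean, population standard deviation) for a set of scores."""
    return mean(scores), pstdev(scores)
```

A Skill rated 3 for completeness, 2 for clarity, 1 for specificity, and 0 for examples would score 6 under this assumed scale, close to the reported ecosystem mean.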

The ecosystem is heavily documentation-centric: markdown files dominate file-type distributions, followed by scripting languages (Python, shell scripts) and configuration formats (JSON, YAML). Most Skills contain very few files (median of one, concentrated below five), suggesting practitioners favor lightweight procedural guidance over comprehensive resource bundles.

Organizational and Individual Consequences of Skills Adoption

Organizational Performance Impacts

Systematic evaluation across 84 professional tasks reveals that curated Skills improve agent task completion rates by an average of 16.2 percentage points, a substantial but highly variable benefit (Li et al., 2026). This aggregate masks dramatic domain-level heterogeneity. Healthcare tasks show the largest improvement (+51.9pp), followed by Manufacturing (+41.9pp), Cybersecurity (+23.2pp), and Natural Science (+21.9pp). Conversely, Mathematics (+6.0pp) and Software Engineering (+4.5pp) show minimal gains.

This pattern suggests an important principle: Skills efficacy tends to vary inversely with foundation model pretraining coverage. Domains requiring specialized procedural knowledge that is underrepresented in web-scale pretraining data (clinical data harmonization, manufacturing workflows, regulatory compliance procedures) benefit most from external augmentation. Domains with strong pretraining coverage, where models already possess relevant knowledge, show smaller marginal returns.

Task-level analysis identifies specific procedural gaps that Skills address. Tasks showing the largest improvements include:

mario-coin-counting (+85.7pp, from 2.9% to 88.6%): Computer vision workflows for object counting in video frames
sales-pivot-analysis (+85.7pp, from 0% to 85.7%): Programmatic Excel pivot table creation using domain-specific APIs
flood-risk-analysis (+77.1pp, from 2.9% to 80.0%): USGS-standard flood frequency analysis using Log-Pearson Type III distribution
sec-financial-report (+75.0pp, from 0% to 75.0%): SEC EDGAR filing retrieval and regulatory format interpretation

These cases share a common pattern: success depends on concrete procedures and format-specific details—API calling conventions, statistical methodology standards, regulatory filing structures—that Skills encode explicitly but foundation models cannot reliably infer.
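The deltas quoted for these tasks are percentage-point differences between pass rates, not relative percent changes. A short sketch using only the figures listed above makes the distinction explicit; the function names are illustrative.

```python
from typing import Optional


def pp_delta(baseline_pct: float, with_skills_pct: float) -> float:
    """Percentage-point difference between two pass rates."""
    return round(with_skills_pct - baseline_pct, 1)


def relative_change(baseline_pct: float, with_skills_pct: float) -> Optional[float]:
    """Relative change in percent; undefined when the baseline is zero."""
    if baseline_pct == 0:
        return None
    return round(100 * (with_skills_pct - baseline_pct) / baseline_pct, 1)


# (baseline %, with curated Skills %) as reported for the four tasks above
TASKS = {
    "mario-coin-counting": (2.9, 88.6),
    "sales-pivot-analysis": (0.0, 85.7),
    "flood-risk-analysis": (2.9, 80.0),
    "sec-financial-report": (0.0, 75.0),
}

deltas = {name: pp_delta(base, skilled) for name, (base, skilled) in TASKS.items()}
```

Percentage points stay well defined even for tasks with a 0% baseline, where a relative change cannot be computed, which is one reason to report gains in pp.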

Conversely, 16 of 84 tasks show negative Skills deltas, indicating that augmentation sometimes hurts performance. Failed cases include taxonomy-tree-merge (–39.3pp), energy-ac-optimal-power-flow (–14.3pp), and trend-anomaly-causal-inference (–12.9pp). Trajectory analysis suggests Skills may introduce conflicting guidance or unnecessary complexity for tasks models already handle well, particularly when Skill documentation is verbose or poorly aligned with the specific task instance.

Individual Wellbeing and User Experience Impacts

Beyond organizational metrics, Skills adoption affects the experience of practitioners working alongside AI agents. Evidence from trajectory analysis and failure mode classification reveals both benefits and frustrations.

Reduced cognitive load for routine procedural tasks. When Skills successfully encode standard operating procedures, practitioners can delegate routine but procedurally complex workflows—generating financial reports, processing clinical lab data, managing software dependencies—with confidence that agents will follow established conventions. This offloading effect is most pronounced in domains with stable, well-documented procedures.

Frustration with inconsistent Skill utilization. Harness-level analysis reveals that agent reliability in discovering and applying relevant Skills varies substantially. Claude Code exhibits high Skills utilization, with agents consistently retrieving and invoking provided Skills. Codex CLI, however, frequently acknowledges Skills content but proceeds to implement solutions independently, creating user frustration when agents ignore procedural guidance that was deliberately provided.

Trust calibration challenges. Skills create a transparency paradox. When agents successfully apply Skills, users gain confidence in the procedural correctness of outputs. Yet when agents fail despite Skills being present, users face difficult attribution: did the Skill contain incorrect guidance, did the agent misapply correct guidance, or was the task fundamentally beyond agent capability? This attribution ambiguity complicates trust calibration and error recovery.

Skill authoring as hidden labor. High-quality Skills require domain expertise, careful documentation, and validation across multiple task instances—effort that often falls on practitioners rather than being systematically captured by organizations. The ecosystem quality mean of 6.2/12 suggests much Skills authoring happens informally, without the review and iteration that produces reliable artifacts.

Evidence-Based Organizational Responses

Table 1: AI Agent Skills Performance and Domain Impact Analysis

Domain	Task Name	Baseline Completion Rate	With Curated Skills	Performance Delta	Procedural Gaps Addressed	Specific Success Factors
Computer Vision / Gaming	mario-coin-counting	2.9%	88.6%	+85.7pp	Computer vision workflows for object counting in video frames.	Explicitly encoding API calling conventions and step-by-step procedures.
Data Analysis	sales-pivot-analysis	0%	85.7%	+85.7pp	Programmatic Excel pivot table creation using domain-specific APIs.	Providing format-specific details that models cannot reliably infer.
Natural Science / Hydrology	flood-risk-analysis	2.9%	80.0%	+77.1pp	USGS-standard flood frequency analysis using Log-Pearson Type III distribution.	Encoding statistical methodology standards and specific procedural heuristics.
Finance / Regulatory	sec-financial-report	0%	75.0%	+75.0pp	SEC EDGAR filing retrieval and regulatory format interpretation.	Explicitly encoding regulatory filing structures and specialized procedures.
Healthcare	n/a	n/a	n/a	+51.9pp	Clinical data harmonization and specialized workflows underrepresented in pretraining.	Documenting procedures rarely appearing in public data; specialized regulatory compliance.
Manufacturing	n/a	n/a	n/a	+41.9pp	Manufacturing workflows and specialized technical procedures.	Codifying physics-based domain knowledge and factory-specific constraints.
Cybersecurity	n/a	n/a	n/a	+23.2pp	n/a	n/a
Natural Science	n/a	n/a	n/a	+21.9pp	Scientific data processing and engineering simulations.	n/a
Mathematics	n/a	n/a	n/a	+6.0pp	General reasoning tasks where models already possess strong priors.	Minimal marginal returns due to high pretraining coverage.
Software Engineering	n/a	n/a	n/a	+4.5pp	General-purpose coding where models have high pretraining coverage.	Minimal marginal returns due to existing model competence.

Organizations deploying AI agents with Skills can adopt several evidence-based strategies to maximize benefits while managing risks and limitations.

Targeted Skills Development for High-Impact Domains

Focus Skills investment on procedurally complex, low-pretraining-coverage domains. The healthcare (+51.9pp) and manufacturing (+41.9pp) improvements demonstrate that Skills deliver maximum value where foundation models lack relevant procedural knowledge. Organizations should prioritize Skills development for:

Domain-specific regulatory compliance (healthcare privacy, financial reporting, safety protocols)
Specialized technical workflows (scientific data processing, engineering simulations, quality control procedures)
Format-specific operations (legacy system data extraction, proprietary tool APIs, industry-standard file formats)

Conversely, domains with strong model priors—mathematics, general-purpose software engineering—show minimal marginal returns and may not justify Skills investment.

Eli Lilly. The pharmaceutical manufacturer developed a suite of clinical trial Skills encoding FDA regulatory submission requirements, protocol-specific data collection procedures, and adverse event reporting workflows. By documenting procedures that rarely appear in public pretraining data, the Skills enabled agents to assist clinical operations staff with routine documentation tasks while maintaining regulatory compliance. The organization reported 40% reduction in documentation turnaround time for routine protocol amendments.

Tesla. Manufacturing engineering teams created Skills for production line optimization, encoding constraint-based scheduling heuristics, equipment maintenance protocols, and quality control inspection procedures specific to electric vehicle assembly. These procedurally complex workflows, which combine domain physics with factory-specific constraints, showed substantial agent performance improvements. The Skills approach preserved manufacturing engineers' expertise in portable, version-controlled artifacts rather than tacit knowledge.

Focused Skills Design Over Comprehensive Documentation

Optimize for 2–3 modular Skills per task; avoid comprehensive documentation bundles. Quantitative analysis reveals a non-monotonic relationship between Skills quantity and performance. Tasks with 2–3 Skills show optimal improvement (+18.6pp), while tasks with 4+ Skills provide only +5.9pp benefit—evidence of cognitive overhead and conflicting guidance when Skills proliferate (Li et al., 2026).

Similarly, Skill complexity analysis shows that detailed (focused, actionable) and compact Skills improve performance by +18.8pp and +17.1pp respectively, while comprehensive Skills hurt performance (–2.9pp). This pattern suggests that agents struggle to extract relevant information from lengthy Skills content, and that overly elaborate documentation can consume context budget without providing actionable guidance.

Effective Skills share common design characteristics:

Stepwise procedural guidance: Numbered lists of concrete actions rather than high-level principles
At least one working example: Executable code or worked demonstration, not just abstract description
Explicit success criteria: Clear definition of correct outputs, including format specifications and quality thresholds
Minimal, parametrized test cases: Focused validation rather than exhaustive test suites
Task-class generality: Procedures applicable to multiple similar tasks, not single-instance solutions
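Some of these characteristics can be checked mechanically before a human ever reviews the Skill. The sketch below is a rough heuristic under stated assumptions, not a validated linter: it assumes the Skill body is markdown, treats a fenced block as evidence of a working example, and looks for the phrases "success criteria" or "expected output"; the lint_skill name and the specific patterns are illustrative. Task-class generality and test-case quality are not checked because they need human judgment.

```python
import re

FENCE = "`" * 3  # markdown code-fence marker, built this way to avoid nesting fences


def lint_skill(text: str) -> dict[str, bool]:
    """Heuristic checks for three of the design characteristics listed above."""
    numbered = re.findall(r"^\s*\d+[.)]\s+\S", text, flags=re.MULTILINE)
    return {
        # at least two numbered steps suggests stepwise guidance
        "numbered_steps": len(numbered) >= 2,
        # an opening and closing fence suggests at least one code example
        "working_example": text.count(FENCE) >= 2,
        # an explicit statement of what a correct result looks like
        "success_criteria": re.search(
            r"success criteria|expected output", text, flags=re.IGNORECASE
        ) is not None,
    }
```

A vague Skill such as "Use pandas for data processing." fails all three checks, while a short Skill with two numbered steps, one fenced example, and an expected-output line passes them.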

Stripe. The payments platform engineering team developed compact Skills for API integration workflows, each focused on a single procedural pattern—webhook verification, idempotency handling, dispute evidence submission. By constraining each Skill to 2–3 key procedures with working code examples, agents successfully applied Skills across diverse integration scenarios. The focused approach outperformed previous attempts at comprehensive API documentation that overwhelmed agent context windows.

Harness-Specific Skills Integration Strategies

Adapt Skills injection and formatting to harness-specific capabilities and constraints. Evidence reveals substantial variation in how different agent harnesses discover, retrieve, and apply Skills:

Claude Code: Native Skills integration with automatic discovery and retrieval. Skills should leverage structured frontmatter (name, description) for discoverability. Performance data shows Claude Code + Opus 4.5 achieves the highest Skills uplift (+23.3pp), reflecting tight integration between harness and model training.

Gemini CLI: Explicit skill-activation interface requiring agents to invoke the activate_skill tool. Skills should include clear descriptions that help agents recognize relevance. The interface design reduces passive Skills leakage but requires agents to proactively identify applicable Skills.

Codex CLI: Skills acknowledgment without consistent utilization. Evidence shows agents frequently reference Skills content but implement independent solutions. Organizations using Codex should consider complementary strategies—embedding key procedures directly in task instructions rather than relying solely on Skills discovery.

Shopify. The e-commerce platform's developer tooling team maintains harness-specific Skills variants for the same procedural workflows. Claude Code Skills use rich frontmatter and modular resources; Codex Skills embed critical procedures directly in SKILL.md rather than separate resources; Gemini CLI Skills include explicit relevance descriptions. This harness-aware approach improved cross-platform Skills effectiveness by 12pp compared to one-size-fits-all artifacts.

Model Scale and Skills Compensation Effects

Leverage Skills to partially substitute for model scale on procedural tasks. Comparative analysis across the Claude model family demonstrates that smaller models with Skills can match or exceed larger models without them. Claude Haiku 4.5 with Skills (27.7% pass rate) substantially outperforms Haiku without Skills (11.0%) and also exceeds Claude Opus 4.5 without Skills (22.0%). This pattern suggests Skills can partially compensate for model capacity limitations on procedurally structured tasks.

Cost-performance analysis reveals practical implications. At standard API pricing, Gemini 3 Flash with Skills achieves a 48.7% pass rate at approximately $0.57 per task, while Gemini 3 Pro with Skills achieves 41.2% at a higher per-task cost. Flash's higher token consumption (2.3× more input tokens than Pro) is more than offset by 4× lower per-token pricing, making Flash 47% cheaper per task while delivering superior performance.

These findings suggest a tiered deployment strategy:

Complex, novel tasks: Deploy largest models with comprehensive Skills
Routine, procedurally structured tasks: Deploy smaller models with focused Skills
Exploratory tasks without established Skills: Deploy larger models without Skills rather than attempting on-the-fly Skills generation
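The three tiers above amount to a simple routing rule. The following sketch encodes it directly; the Task fields, the Tier names, and the choose_tier function are hypothetical, and how a real system decides that a task is "routine" or that curated Skills exist is left as an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    LARGE_WITH_SKILLS = "largest model + comprehensive Skills"
    SMALL_WITH_SKILLS = "smaller model + focused Skills"
    LARGE_NO_SKILLS = "larger model, no Skills"


@dataclass(frozen=True)
class Task:
    description: str
    is_routine: bool          # assumption: labeled upstream by a human or classifier
    has_curated_skills: bool  # assumption: looked up in a Skills registry


def choose_tier(task: Task) -> Tier:
    """Map a task to one of the three deployment tiers listed above."""
    if not task.has_curated_skills:
        # exploratory work: do not fall back to on-the-fly Skills generation
        return Tier.LARGE_NO_SKILLS
    if task.is_routine:
        return Tier.SMALL_WITH_SKILLS
    return Tier.LARGE_WITH_SKILLS
```

Checking for curated Skills first reflects the ordering implied by the list: without established Skills, neither Skills-based tier applies regardless of how routine the task is.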

Salesforce. The CRM platform's Service Cloud team deployed a tiered agent strategy for customer support automation. Routine procedural tasks (password resets, data exports, report generation) use Haiku + Skills at 40% of the cost of Opus-based agents. Complex troubleshooting and novel integration scenarios escalate to Opus + Skills. The tiered approach reduced agent inference costs by 60% while maintaining service quality for 80% of support workflows.

Self-Generated Skills: When Not to Rely on Agent Autonomy

Avoid self-generated Skills for production workflows; models cannot reliably author the procedural knowledge they benefit from consuming. Evaluation of the self-generated Skills condition, in which agents are prompted to generate relevant procedural knowledge before solving tasks, reveals consistently poor results. Self-generated Skills change performance by –1.3pp on average relative to the no-Skills baseline, with only one model (Opus 4.6) showing marginal improvement (+1.4pp). Codex + GPT-5.2 shows substantial degradation (–5.6pp), suggesting generated Skills actively interfere with task completion (Li et al., 2026).

Trajectory analysis identifies two failure modes:

Imprecise procedural generation. Models identify that domain-specific knowledge is needed but generate incomplete or vague procedures. For example, self-generated Skills for data processing often list "use pandas for data processing" without specific API patterns, column naming conventions, or error handling—procedural details that curated Skills would specify.

Failure to recognize specialization needs. For high domain-knowledge tasks (manufacturing, financial modeling), models often fail to recognize the need for specialized Skills entirely, attempting solutions with general-purpose approaches rather than domain-specific methodologies.

These patterns reveal a fundamental asymmetry: models benefit from consuming human-curated procedural knowledge but cannot reliably generate equivalent procedural knowledge themselves. This finding has direct implications for Skills authoring workflows—organizations should invest in human curation rather than attempting automated Skills generation.

GitHub. The code hosting platform initially experimented with agent-generated Skills for repository operations, hoping to reduce manual curation effort. Performance analysis showed self-generated Skills underperformed both curated Skills and no Skills baselines. The organization pivoted to a community contribution model where domain experts author Skills with maintainer review, achieving substantially higher quality and agent success rates.

Building Long-Term Skills Capability and Governance

Beyond immediate deployment tactics, organizations must develop sustained capabilities for Skills authoring, curation, validation, and governance.

Skills Quality Assurance and Validation Frameworks

Establish systematic Skills validation before production deployment. The ecosystem quality mean of 6.2/12 underscores the need for rigorous review processes. High-performing organizations implement multi-stage validation:

Automated validation. Structural checks verify required files, correct directory layout, and valid syntax. Oracle execution confirms that reference solutions achieve 100% test pass rates using the Skill's documented procedures. Leakage audits detect Skills that encode task-specific solutions rather than generalizable procedures.

Human review. Domain experts evaluate data validity, task realism, oracle quality, and Skill utility. Reviewers execute benchmark experiments with and without Skills across multiple agents to confirm meaningful signal about Skill efficacy rather than artificial difficulty.

Acceptance criteria. Effective validation frameworks establish explicit thresholds—Skills must improve agent performance by ≥10pp on at least one model-harness configuration, avoid negative deltas on >10% of configurations, and pass expert review on procedural accuracy and generalizability.
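The acceptance thresholds above translate into a small gate function. This is a sketch under the stated thresholds only: the configuration labels are made-up placeholders, expert review is reduced to a boolean, and passes_acceptance is an illustrative name rather than part of any benchmark tooling.

```python
def passes_acceptance(
    deltas_pp: dict[str, float],
    expert_approved: bool,
    min_gain_pp: float = 10.0,
    max_negative_share: float = 0.10,
) -> bool:
    """Apply the three acceptance criteria to per-configuration deltas (in pp)."""
    if not deltas_pp:
        return False  # no evidence, no acceptance
    values = list(deltas_pp.values())
    # criterion 1: at least one configuration gains min_gain_pp or more
    has_meaningful_gain = any(delta >= min_gain_pp for delta in values)
    # criterion 2: negative deltas on no more than 10% of configurations
    negative_share = sum(delta < 0 for delta in values) / len(values)
    # criterion 3: expert sign-off on accuracy and generalizability
    return expert_approved and has_meaningful_gain and negative_share <= max_negative_share


# Hypothetical results for one Skill across three placeholder configurations
example = {"harness-a/model-x": 14.0, "harness-b/model-y": 3.5, "harness-c/model-z": 0.0}
```

Note that with only a handful of configurations, a single negative delta already exceeds the 10% share, so the second criterion is strict for small evaluation matrices.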

Capital One. The financial services firm's AI engineering team built an automated Skills validation pipeline that gates production deployment. Each Skill undergoes structural validation, automated testing with reference implementations, and evaluation across three agent configurations. Skills that fail to meet minimum improvement thresholds or show frequent negative deltas trigger expert review. This systematic process reduced production Skills failures by 80% compared to ad-hoc curation.

Continuous Learning and Skills Evolution Systems

Implement feedback loops that capture Skills failure patterns and drive iterative improvement. Production deployment generates rich signal about when Skills help, when they fail, and what procedural gaps remain. Organizations can instrument agent trajectories to capture:

Skills discovery failures: Tasks where relevant Skills existed but agents didn't recognize or retrieve them
Skills application failures: Agents retrieved Skills but implemented incorrect procedures
Skills quality failures: Agents followed Skills guidance but produced incorrect outputs due to flawed Skills content
Skills coverage gaps: New task patterns where no relevant Skills exist

This instrumentation enables data-driven Skills prioritization—identifying high-impact Skills to develop, low-quality Skills to revise, and discovery/retrieval mechanisms to improve.
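One way to operationalize the four failure categories is to tag each failed trajectory and count the tags. The sketch below assumes the four boolean signals are already available per trajectory, which in practice may require log parsing or human labeling; the Trajectory fields and the classify function are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Failure(Enum):
    COVERAGE_GAP = "no relevant Skill exists"
    DISCOVERY = "relevant Skill existed but was not retrieved"
    APPLICATION = "Skill retrieved but procedure not followed correctly"
    QUALITY = "guidance followed but output incorrect"


@dataclass(frozen=True)
class Trajectory:
    succeeded: bool
    relevant_skill_exists: bool
    skill_retrieved: bool
    followed_guidance: bool


def classify(t: Trajectory) -> Optional[Failure]:
    """Assign a failed trajectory to one of the four categories above."""
    if t.succeeded:
        return None
    if not t.relevant_skill_exists:
        return Failure.COVERAGE_GAP
    if not t.skill_retrieved:
        return Failure.DISCOVERY
    if not t.followed_guidance:
        return Failure.APPLICATION
    return Failure.QUALITY


def failure_counts(trajectories: list[Trajectory]) -> Counter:
    """Tally failure categories, skipping successful trajectories."""
    return Counter(f for f in map(classify, trajectories) if f is not None)
```

The final QUALITY branch is a default attribution: a task that fails even though guidance was followed may also be beyond the agent's capability, so counts in that bucket are best treated as candidates for expert review rather than confirmed Skill defects.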

Microsoft. Azure DevOps teams implemented systematic trajectory analysis across their agent-assisted CI/CD workflows. Monthly reviews identify procedural patterns where agents consistently fail despite Skills being present, triggering expert revision or new Skills development. This continuous improvement cycle increased agent-assisted task success rates by 25pp over 12 months, with Skills playing a central role in codifying expert feedback.

Distributed Skills Authoring and Contribution Models

Leverage domain experts for Skills authoring through structured contribution workflows. The most effective Skills encode the procedural knowledge of domain specialists—practitioners who understand not just what to do but how expert humans actually perform workflows efficiently. Organizations can implement contribution models that:

Incentivize contribution: Recognize Skills authoring as legitimate technical work, not volunteer effort
Lower barriers to entry: Provide Skills templates, authoring guidelines, and automated validation feedback
Enable community review: Implement pull-request workflows where maintainers validate Skills quality, catch leakage, and ensure generalizability
Version and share: Treat Skills as versioned artifacts with changelog documentation and deprecation policies

Stack Overflow. The developer Q&A platform created an open Skills contribution program where community experts author Skills for common development workflows. A maintainer team reviews submissions using the same quality rubric applied to benchmark construction: completeness, clarity, specificity, and examples. High-quality Skills receive community recognition and are promoted in the Skills marketplace. This distributed model scaled Skills coverage to 180+ specialized development workflows while maintaining quality standards.

Conclusion

Agent Skills represent a pragmatic architectural response to the gap between foundation model capabilities and domain-specific procedural requirements. By packaging procedural knowledge in modular, portable artifacts, Skills extend agent competence to specialized workflows without the cost and rigidity of fine-tuning. Large-scale evidence from systematic evaluation establishes both the promise and limitations of this approach.

The promise is substantial but context-dependent. Curated Skills improve agent performance by an average of 16.2 percentage points, with the largest gains in domains requiring specialized procedural knowledge underrepresented in foundation model pretraining—healthcare workflows (+51.9pp), manufacturing operations (+41.9pp), and regulatory compliance procedures. Focused Skills with 2–3 modules outperform comprehensive documentation, and smaller models with Skills can match larger models without them on procedurally structured tasks, enabling cost-effective deployment strategies.

Yet Skills are not a universal solution. Software engineering and mathematics tasks, where models already possess strong priors, show minimal marginal returns (+4.5pp and +6.0pp). Sixteen of 84 tasks show negative Skills deltas, suggesting poorly designed or misaligned Skills can introduce overhead without benefit. Most critically, models cannot reliably generate their own effective Skills—self-generated Skills provide negligible or negative benefit on average, demonstrating that procedural knowledge authoring remains a distinctly human capability.

These findings establish several actionable principles for practitioners:

Target Skills investment strategically: Prioritize domains with specialized procedures underrepresented in pretraining data
Design focused, modular Skills: Avoid comprehensive documentation; optimize for 2–3 actionable procedures per Skill
Adapt to harness-specific integration: Skills effectiveness depends on how agent harnesses discover and apply procedural guidance
Invest in human curation: Avoid relying on agent-generated Skills; systematic human authoring and review produces reliable artifacts
Implement continuous validation: Establish quality assurance processes that gate production deployment and drive iterative improvement

For researchers, Skills open several promising directions. The relationship between Skills quantity, complexity, and performance suggests optimization opportunities—can we predict which procedural components drive improvements? How do Skills compose when multiple relevant artifacts exist? What retrieval and ranking mechanisms maximize Skills utilization? The asymmetry between Skills consumption (effective) and Skills generation (ineffective) invites investigation into what procedural structures models can reason about versus produce. And the domain-level variance in Skills efficacy demands deeper understanding of how pretraining coverage interacts with augmentation strategies.

Skills efficacy is not universal but context-dependent, determined by domain characteristics, Skills design quality, and harness-agent integration. Organizations that approach Skills as carefully curated procedural artifacts—designed, validated, and evolved through systematic processes—can achieve meaningful agent capability extensions. Those that treat Skills as unstructured documentation or assume models can generate their own guidance will likely see minimal returns. The evidence establishes Skills as a viable augmentation strategy, but one that requires intentional investment in authoring, curation, and continuous improvement to realize its potential.

References

Anthropic. (2024). Introducing the model context protocol.
Anthropic. (2025a). Equipping agents for the real world with agent skills. Anthropic Engineering Blog.
Anthropic. (2025b). Claude code: An agentic coding tool.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.
Google. (2025). Gemini CLI: An open-source AI agent that brings the power of Gemini directly into your terminal.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., ... & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459–9474.
Li, X., Chen, W., Liu, Y., Zheng, S., Chen, X., He, Y., ... & Lee, H. (2026). SkillsBench: Benchmarking how well agent skills work across diverse tasks. arXiv preprint arXiv:2602.12670.
Merrill, M. A., Shaw, A. G., Carlini, N., Li, B., Raj, H., Bercovich, I., ... & Schmidt, L. (2026). Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces. arXiv preprint arXiv:2601.11868.
OpenAI. (2025). Codex CLI: Lightweight coding agent that runs in your terminal.
Pan, M. Z., Cemri, M., Agrawal, L. A., Yang, S., Chopra, B., Tiwari, R., ... & Lee, H. (2025). Why do multiagent systems fail? In ICLR 2025 Workshop on Building Trust in Language Models and Applications.
Qin, Y., Liang, S., Ye, Y., Zhu, K., Yan, L., Lu, Y., ... & Sun, M. (2024). ToolLLM: Facilitating large language models to master 16000+ real-world APIs. In International Conference on Learning Representations (pp. 9695–9717).
Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Hambro, E., ... & Scialom, T. (2023). Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36, 68539–68551.
Sumers, T., Yao, S., Narasimhan, K. R., & Griffiths, T. L. (2023). Cognitive architectures for language agents. Transactions on Machine Learning Research.

Jonathan H. Westover, PhD is Chief Research Officer (Nexus Institute for Work and AI); Associate Dean and Director of HR Academic Programs (WGU); Professor, Organizational Leadership (UVU); OD/HR/Leadership Consultant (Human Capital Innovations).

Suggested Citation: Westover, J. H. (2026). AI Agent Skills: Bridging the Gap Between Foundation Models and Real-World Performance. Human Capital Leadership Review, 33(4). doi.org/10.70175/hclreview.2020.33.4.6