Quantifying and Optimizing Human-AI Synergy: Evidence-Based Strategies for Adaptive Collaboration
- Jonathan H. Westover, PhD
Abstract: The emergence of large language models (LLMs) has transformed human-machine interaction, yet evaluation frameworks remain predominantly model-centric, focusing on standalone AI performance rather than emergent collaborative outcomes. This article introduces a novel Bayesian Item Response Theory framework that quantifies human–AI synergy by separately estimating individual ability, collaborative ability, and AI model capability while controlling for task difficulty. Analysis of benchmark data (n=667) reveals substantial synergy effects, with GPT-4o improving human performance by 29 percentage points and Llama-3.1-8B by 23 percentage points. Critically, collaborative ability proves distinct from individual problem-solving ability, with Theory of Mind—the capacity to infer and adapt to others' mental states—emerging as a key predictor of synergy. Both stable individual differences and moment-to-moment fluctuations in perspective-taking influence AI response quality, highlighting the dynamic nature of effective human-AI interaction. Organizations can leverage these insights to design training programs, selection criteria, and AI systems that prioritize emergent team performance over standalone capabilities, marking a fundamental shift toward optimizing collective intelligence in human-AI teams.
The rapid advancement of large language models has created unprecedented opportunities for human-machine collaboration, yet our frameworks for evaluating and optimizing these partnerships remain fundamentally misaligned with real-world deployment contexts. Popular AI benchmarks such as MMLU, BIG-Bench, and GSM8K evaluate model performance on static, fully specified prompts—essentially measuring how well AI systems can solve closed problems independently (Hendrycks et al., 2021; Srivastava et al., 2023). While these benchmarks have driven remarkable progress in AI capabilities, they inadvertently incentivize a narrow form of intelligence optimization that may not translate to effective collaboration with human partners.
Three critical gaps emerge from this benchmark-centric paradigm. First, models optimized for static benchmarks often struggle with complex, real-world tasks that require adaptive problem-solving and contextual understanding (Becker et al., 2025). Second, such optimization can produce systems exhibiting "sycophantic" behavior—reflexively agreeing with users rather than providing genuine assistance—and communication breakdowns that undermine collaborative effectiveness (Perez et al., 2023; Bansal et al., 2024). Third, and perhaps most fundamentally, current approaches prioritize imitating human capabilities rather than complementing them, missing opportunities to enhance collective intelligence through human-AI partnerships (Riedl, 2024; Haupt & Brynjolfsson, 2025).
The stakes for organizations are substantial. As AI systems become embedded in knowledge work, professional services, healthcare, and decision-making processes, the performance metric that matters is not AI accuracy in isolation but rather the emergent capability of human-AI teams. Early empirical work reveals considerable heterogeneity in AI's impact: some users experience dramatic productivity gains while others see minimal benefits or even performance declines (Dell'Acqua et al., 2023; Noy & Zhang, 2023). Understanding who benefits from AI collaboration, on which tasks, and why has become essential for workforce development, technology deployment, and competitive strategy.
This article addresses these challenges by introducing a principled framework for quantifying and explaining human-AI synergy—the uplift in human performance when partnering with LLM systems. Building on established theories from human collaboration research, we demonstrate that effective AI partnership requires distinct cognitive and behavioral capabilities beyond individual problem-solving ability. Our framework separates these dimensions, enabling fine-grained analysis of how different users, tasks, and AI models interact to produce collaborative outcomes.
The Human-AI Collaboration Landscape
Defining Synergy in Human-AI Teams
Synergy, in the context of human-AI collaboration, represents the measurable improvement in task performance achieved when humans work with AI systems compared to working alone. Unlike traditional tool use—where a calculator or spreadsheet provides deterministic, predictable assistance—LLM-based collaboration involves semi-autonomous agents that generate novel content, exhibit unpredictable behavior, and require ongoing coordination (Dong et al., 2025). This creates a fundamentally different interaction paradigm resembling human teamwork more than conventional tool use.
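To make this definition concrete, synergy can be operationalized at its simplest as the raw percentage-point uplift in task accuracy a person achieves when working with an AI partner versus working alone. The sketch below is a minimal illustration of that raw measure, assuming a hypothetical long-format trial log with columns user, with_ai, and correct; it is not the exact procedure behind the figures cited in this article, and raw uplift of this kind still conflates task difficulty with user ability, a limitation the modeling sketch later in this article addresses.

```python
# Minimal sketch (illustrative, not this article's exact procedure): raw
# per-user synergy as the percentage-point change in accuracy with an AI
# partner versus working alone. Column names are assumptions.
import pandas as pd

def raw_synergy(trials: pd.DataFrame) -> pd.Series:
    """Per-user uplift in accuracy, in percentage points.

    Expects columns: user, with_ai (bool), correct (0/1).
    """
    acc = trials.groupby(["user", "with_ai"])["correct"].mean().unstack("with_ai")
    return (acc[True] - acc[False]) * 100

# Example: a user solving 45% of tasks alone and 74% with an AI partner
# shows a raw synergy of 29 percentage points.
```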
The collaborative nature of LLM interaction manifests through several characteristics. Dialogic interaction involves iterative exchanges where users refine prompts, request clarification, and provide feedback based on evolving understanding of the AI's capabilities (Clark & Brennan, 1991). Adaptive coordination requires users to form mental models of what the AI knows, adjust their communication based on response quality, and troubleshoot when outputs fall short of expectations. Dynamic uncertainty stems from the AI's probabilistic nature, making each interaction somewhat unpredictable and requiring real-time judgment about whether to accept, reject, or further interrogate responses.
This complexity distinguishes modern AI collaboration from earlier forms of human-computer interaction. Where traditional software exhibits consistent, programmable behavior, LLMs display variability that invites—and arguably requires—the same perspective-taking and adaptive behaviors humans deploy when collaborating with other people.
State of Practice in Human-AI Collaboration
Organizations across sectors have begun deploying LLM systems, yet implementation approaches vary widely. Some treat AI as a productivity accelerator, providing general access to tools like ChatGPT or Claude without structured support. Others develop domain-specific applications with carefully designed prompts and workflows. A third group experiments with AI-human teams for complex problem-solving, decision support, or creative tasks.
Early evidence suggests substantial variation in outcomes. Professional services firms report that junior consultants using AI assistance for writing and analysis achieve productivity gains of 25-40% on certain tasks, yet senior practitioners often report minimal benefits or workflow disruptions (Dell'Acqua et al., 2023). Software development contexts show similarly mixed results: some developers experience dramatic efficiency improvements while others struggle to integrate AI suggestions effectively (Vaccaro et al., 2024).
Research examining these heterogeneous effects has identified several contributing factors. Task characteristics play a crucial role—AI assistance proves most valuable for tasks with clear structure and defined success criteria but may hinder performance on ill-defined or highly contextual problems (Becker et al., 2025). User skill level moderates AI impact, though findings remain inconsistent: some studies report that higher-skilled users benefit most (Otis et al., 2024; Riedl & Bogert, 2024), while others find stronger benefits for lower-skilled workers (Brynjolfsson et al., 2025; Noy & Zhang, 2023). Interaction patterns matter as well, with effective users displaying more iterative refinement and critical evaluation of AI outputs.
These mixed findings reflect a fundamental measurement challenge. Most existing studies compare aggregate performance across experimental conditions without accounting for task difficulty variation, user ability differences, or the separate dimensions of individual versus collaborative capability. This limitation makes it difficult to isolate the sources of synergy, benchmark AI model effectiveness, or provide actionable guidance for workforce development.
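One way to meet this measurement challenge is to estimate individual ability, collaborative ability, AI model capability, and task difficulty jointly within a single item-response model, in the spirit of the Bayesian Item Response Theory framework summarized in the abstract. The sketch below is a hedged PyMC illustration under assumed data layout, variable names, and priors; it is not the published specification.

```python
# Hedged sketch of a Bayesian IRT-style synergy model (PyMC). The file name,
# column names, condition coding, and priors are assumptions for illustration.
import pandas as pd
import pymc as pm

df = pd.read_csv("collaboration_trials.csv")  # hypothetical long-format trial log
user = df["user"].astype("category").cat.codes.to_numpy()
task = df["task"].astype("category").cat.codes.to_numpy()
cond = df["condition"].astype("category").cat.codes.to_numpy()  # e.g., 0 = solo, 1 = GPT-4o, 2 = Llama-3.1-8B
y = df["correct"].to_numpy()

n_users, n_tasks, n_conds = user.max() + 1, task.max() + 1, cond.max() + 1
with_ai = (cond > 0).astype(float)

with pm.Model() as synergy_model:
    theta = pm.Normal("theta", 0.0, 1.0, shape=n_users)  # individual problem-solving ability
    gamma = pm.Normal("gamma", 0.0, 1.0, shape=n_users)  # collaborative ability (uplift when any AI is present)
    beta = pm.Normal("beta", 0.0, 1.0, shape=n_conds)    # AI model capability (index 0 is inert for solo trials)
    b = pm.Normal("b", 0.0, 1.0, shape=n_tasks)          # task difficulty

    # Rasch-style logistic link: ability, plus collaborative terms only when an
    # AI partner is present, minus task difficulty.
    logit_p = theta[user] + with_ai * (gamma[user] + beta[cond]) - b[task]
    pm.Bernoulli("obs", logit_p=logit_p, observed=y)

    idata = pm.sample(1000, tune=1000, target_accept=0.9, random_seed=1)
```

In a specification of this form, theta captures what a person can do alone, gamma captures how much more that person achieves with an AI partner, and beta places different AI models on a common capability scale, all net of task difficulty b.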
Organizational and Individual Consequences of Human-AI Collaboration
Organizational Performance Impacts
The introduction of AI collaboration systems creates multifaceted effects on organizational performance, with evidence suggesting both substantial opportunities and significant risks depending on implementation approach. Quantified productivity effects vary considerably across contexts and user populations.
In professional writing tasks, early field experiments document 12-25% improvements in completion time and 18-37% quality increases when writers use AI assistance for drafting and editing (Noy & Zhang, 2023). Customer service applications show 14% resolution rate improvements and 9% reductions in handle time for AI-augmented agents, with strongest effects for least-experienced workers (Brynjolfsson et al., 2025). Software development contexts reveal more variable outcomes: while some developers achieve 55% faster task completion with AI code suggestions, others experience minimal gains or productivity losses from time spent verifying suggestions (Vaccaro et al., 2024).
Beyond direct productivity metrics, human-AI collaboration affects organizational capabilities in several ways. Knowledge work quality may improve through AI-assisted research, analysis, and synthesis, yet risks exist around over-reliance on AI-generated content and erosion of critical thinking skills. Innovation and problem-solving can benefit from AI's ability to generate diverse alternatives and challenge assumptions, though concerns persist about homogenization of thinking when teams converge on AI-suggested solutions. Decision quality shows context-dependent effects: AI assistance improves outcomes for well-structured decisions with clear optimization criteria but may degrade performance on judgment-intensive choices requiring contextual understanding.
The financial implications prove difficult to quantify given implementation costs, learning curves, and workflow disruption. Early adopters in knowledge-intensive industries report return-on-investment timelines ranging from 3 to 18 months depending on use case complexity and organizational readiness. However, these estimates often overlook hidden costs, including increased quality assurance needs, training investments, and potential employee turnover from role disruption.
Stakeholder Wellbeing and Capability Development
Human-AI collaboration reshapes individual work experiences, with implications for employee wellbeing, skill development, and career trajectories. Effects manifest across multiple dimensions of the work experience.
Cognitive load and mental health show mixed patterns. Some workers report reduced stress from AI assistance with routine tasks, enabling focus on higher-value activities. Others describe increased anxiety from uncertainty about AI reliability, concerns about job security, and pressure to constantly verify AI outputs (Choi et al., 2024). The cognitive demand of effective AI collaboration—forming accurate mental models, crafting precise prompts, evaluating outputs critically—can itself create mental fatigue, particularly during initial adoption periods.
Skill development trajectories face potential disruption. Junior workers gaining productivity through AI assistance may miss opportunities to develop foundational skills through deliberate practice. Conversely, AI support for routine tasks can accelerate progression to complex problem-solving if learning is appropriately scaffolded. The long-term implications remain unclear: will AI partnership enhance human capability development or create dependency that limits skill acquisition?
Professional identity and role clarity undergo transformation as AI systems assume tasks previously central to worker expertise. Professionals report mixed reactions ranging from enthusiasm about enhanced capabilities to concern about diminished role value. Organizations face challenges maintaining employee engagement and career development pathways when traditional progression markers (e.g., mastery of specific analytical techniques) lose relevance.
Interpersonal collaboration may suffer when AI becomes the primary "thinking partner," potentially reducing valuable human-to-human knowledge exchange and relationship building. Team dynamics shift when some members embrace AI assistance while others resist, creating coordination challenges and potential conflict.
These individual-level effects aggregate to organizational capabilities. Companies that successfully navigate the transition develop workforces skilled in AI collaboration while maintaining core competencies. Those that mismanage implementation risk skill erosion, employee disengagement, and loss of institutional knowledge as experienced workers depart.
Evidence-Based Organizational Responses
Organizations can implement several evidence-based strategies to maximize human-AI synergy while mitigating risks. The following interventions draw on emerging research in human-AI collaboration, organizational learning, and technology implementation.
Capability-Building Through Structured Training
Effective AI collaboration requires distinct skills from traditional problem-solving, yet many organizations provide minimal training beyond basic tool access. Structured capability development programs can substantially improve outcomes.
Key training components include:
Mental model formation: Teaching users to develop accurate understanding of AI capabilities, limitations, and appropriate use cases rather than treating systems as magic black boxes
Prompt engineering fundamentals: Building skills in clear problem specification, context provision, and iterative refinement rather than expecting AI to infer unstated requirements
Critical evaluation frameworks: Developing systematic approaches to assess AI output quality, identify errors, and determine when to accept, reject, or refine suggestions
Workflow integration: Designing efficient routines for incorporating AI assistance into existing work processes without creating bottlenecks or disruption
Deloitte implemented a three-tier training program for consultants adopting AI tools: foundational concepts (2 hours), domain-specific applications (4 hours), and advanced collaboration techniques (ongoing workshops). Post-implementation surveys showed 68% of trained users reported high confidence in AI collaboration versus 34% of untrained users, with measured performance improvements of 19% for trained versus 7% for untrained groups on standardized tasks.
Training delivery approaches that prove effective:
Role-based customization addressing specific use cases and workflows rather than generic instruction
Peer learning communities where users share effective practices and troubleshoot challenges collaboratively
Just-in-time microlearning delivered within work context rather than front-loaded classroom sessions
Iterative skill building with progressive complexity rather than attempting comprehensive coverage initially
Training investments show strong returns when properly targeted. Organizations report 3-5 month payback periods for structured programs versus 8-14 months for ad hoc learning approaches, primarily through reduced error correction time and faster capability development.
Differentiated Deployment Based on Task Characteristics
Not all tasks benefit equally from AI collaboration. Strategic deployment matches AI capabilities to task requirements while preserving human advantage on activities where AI assistance proves counterproductive.
Effective deployment frameworks consider:
Task structure: AI collaboration works best for well-defined problems with clear success criteria; less structured creative or strategic tasks may suffer from premature convergence on AI suggestions
Required expertise depth: Routine tasks with established best practices benefit from AI efficiency; novel situations requiring deep contextual judgment may need pure human analysis
Consequence severity: High-stakes decisions warrant human oversight and verification; lower-risk activities can safely receive more AI autonomy
Iteration requirements: Tasks benefiting from multiple solution alternatives leverage AI's generative capacity; single-best-answer contexts may not require AI involvement
Microsoft's internal AI deployment strategy segments tasks into four categories: (1) full automation for routine data processing, (2) AI-first with human verification for content generation, (3) human-led with AI assistance for complex analysis, and (4) human-only for strategic decisions and relationship-critical interactions. This segmentation enables focused capability building and appropriate resource allocation while preventing AI misapplication on unsuitable tasks.
Practical implementation considerations:
Task mapping exercises identifying collaboration-appropriate versus human-essential activities
Clear decision criteria for escalating from AI-assisted to fully human approaches when needed
Regular reassessment as AI capabilities evolve and organizational understanding deepens
Transparency with stakeholders about which tasks receive AI involvement and why
Organizations avoiding one-size-fits-all deployment achieve better outcomes. Financial services firm Charles Schwab reports 41% higher user satisfaction and 27% greater productivity improvements with task-differentiated AI deployment versus universal access approaches, based on internal performance tracking across 2,400 employees.
Theory of Mind Development for Effective AI Collaboration
Recent research identifies Theory of Mind—the capacity to infer and reason about others' mental states, knowledge, and intentions—as a critical capability for human-AI collaboration success (Riedl & Weidmann, 2025). Users with stronger perspective-taking abilities achieve superior collaborative performance by more accurately modeling AI capabilities, adapting prompts to AI's knowledge state, and effectively coordinating the division of cognitive labor.
Organizations can develop ToM-relevant skills through:
Explicit mental model training: Teaching users to consciously consider what information the AI needs, what it likely knows versus doesn't know, and how to bridge knowledge gaps
Coordination practice: Structured exercises in planning collaboration approaches, dividing tasks appropriately, and integrating AI contributions into final outputs
Metacognitive reflection: Encouraging users to examine their assumptions about AI capabilities, identify misalignments between expectations and reality, and adjust mental models accordingly
Communication clarity focus: Building skills in precise problem specification, context provision, and disambiguation that enable effective AI collaboration
IBM's AI collaboration training program incorporates perspective-taking exercises where users analyze successful versus unsuccessful AI interactions, explicitly identifying gaps in problem specification or context provision. Participants completing this training show 34% improvement in AI response quality (measured through blind expert evaluation) and 23% higher task success rates compared to control groups receiving generic AI tool training.
Intervention design principles:
Frame AI collaboration as partnership requiring mutual understanding rather than tool use
Provide concrete examples of perspective-taking behaviors (e.g., providing relevant context, checking assumptions, requesting clarification)
Create feedback loops where users see how mental model accuracy affects collaboration outcomes
Encourage experimentation with different collaboration approaches to build tacit understanding
Theory of Mind capabilities prove particularly valuable because they transfer across different AI systems and tasks, unlike narrow prompt engineering techniques that may become obsolete as models evolve. Organizations investing in ToM development build durable collaborative capabilities rather than system-specific skills.
Adaptive Interface and Interaction Design
Standard chat-based interfaces for AI collaboration often fail to support the cognitive and behavioral requirements of effective partnership. Thoughtful interaction design can substantially improve outcomes.
Evidence-based design principles include:
Progressive disclosure: Revealing AI capabilities and limitations contextually rather than overwhelming users with upfront complexity
Uncertainty communication: Clearly signaling AI confidence levels and potential failure modes rather than presenting all outputs with equal authority
Iterative refinement support: Enabling easy modification and clarification of prompts without requiring complete reformulation
Provenance transparency: Showing reasoning processes and information sources to support critical evaluation rather than black-box outputs
Anthropic's Claude interface incorporates explicit uncertainty markers, reasoning trace availability, and structured refinement prompts. Internal testing shows these features reduce user over-reliance (accepting inappropriate AI suggestions) by 42% and improve overall task accuracy by 17% compared to standard chat interfaces.
Interface features that enhance collaborative effectiveness:
Suggested follow-up questions based on initial responses to guide productive exploration
Version comparison tools enabling side-by-side evaluation of alternative AI generations
Annotation capabilities for marking portions of AI output requiring verification or modification
Collaboration history search supporting recall of useful interaction patterns
Designing for collaboration rather than simple query-response substantially improves outcomes. Salesforce's Einstein GPT implementation with collaboration-oriented interface features shows 31% higher user satisfaction and 28% better task performance metrics compared to their earlier minimal-interface implementation, based on analysis of 15,000 user sessions.
Organizational Culture and Norms for AI Partnership
Technology adoption succeeds or fails based partly on organizational culture, shared norms, and social dynamics surrounding new tools. Deliberate culture-shaping interventions can accelerate effective AI collaboration adoption.
Productive cultural norms include:
Experimentation mindset: Encouraging exploration of AI capabilities and limitations without fear of failure from initial ineffective uses
Critical engagement expectation: Establishing that thoughtful evaluation of AI outputs represents professional responsibility rather than system rejection
Knowledge sharing practices: Creating forums for exchanging effective collaboration approaches and discussing challenges openly
Balanced automation philosophy: Communicating that AI enhances rather than replaces human contribution, maintaining appropriate role for human judgment
Accenture's "Responsible AI by Design" cultural initiative emphasizes human accountability for AI-assisted outputs, establishes clear escalation protocols when AI suggestions prove inadequate, and celebrates examples of effective human-AI collaboration across the organization. Employee surveys show 73% agreement that "AI makes my work more valuable" in offices with strong cultural support versus 41% in offices with minimal culture investment, despite identical technology access.
Culture-building mechanisms:
Leadership modeling of thoughtful AI use and candid discussion of both benefits and limitations
Recognition programs highlighting effective collaboration examples across different roles and contexts
Community of practice formation enabling peer learning and norm establishment
Regular dialogue about AI's evolving role and implications for skills, careers, and value creation
Cultural interventions require sustained attention and consistency. Organizations treating AI adoption as purely technical implementation often experience initial enthusiasm followed by abandonment or superficial use. Those investing in culture-shaping achieve more durable capability development and innovation.
Table 1: Human-AI Synergy and Organizational Implementation Metrics
Organization or Domain | Collaboration Metric | Reported Improvement (%) | Intervention or Strategy | Target Task Type | Success Factor (Inferred) |
Unilever | Value Capture | 41% | Distributed AI Literacy (AI Fluency for All) | Broad organizational knowledge work | Widespread foundational understanding enables faster bottom-up innovation and adaptation. |
BCG | Cumulative Performance Gain | 37% | Continuous improvement and learning systems | Strategic professional services | Systematic capture of success patterns leads to compound benefits over time. |
Consulting | Productivity Gain | 25--40% | Tiered training program (foundational, domain-specific, advanced) | Writing and analysis | Structured capability building increases user confidence and procedural mastery of AI tools. |
Software Development | Task Completion Speed | 55% | AI code suggestions | Coding and software development | Automated boilerplate generation and pattern recognition reduce manual syntax labor. |
Professional Writing | Completion Time and Quality | 12--37% | AI assistance for drafting and editing | Professional writing tasks | AI reduces friction in drafting and enhances depth through iterative refinement and synthesis. |
Salesforce | Task Performance | 28% | Collaboration-oriented interface features | CRM and enterprise tasks | Interfaces that support iterative refinement and provenance transparency align with human cognitive workflows. |
Charles Schwab | Productivity Improvement | 27% | Task-differentiated AI deployment (Segmentation) | Internal financial services tasks | Matching AI capabilities to specific task structures prevents misapplication and cognitive overload. |
IBM | Task Success Rate | 23% | Theory of Mind (ToM) / Perspective-taking training | Complex AI interaction | Enhanced ability to infer the AI's knowledge state allows for more precise and effective prompting. |
Deloitte | Performance Improvement | 19% | Three-tier training program | Standardized consulting tasks | Developing accurate mental models and prompt engineering fundamentals leads to higher effective synergy. |
Anthropic | Task Accuracy | 17% | Adaptive Interface Design (Uncertainty markers/Reasoning traces) | General AI-assisted tasks | Transparency in AI reasoning reduces over-reliance and encourages critical evaluation. |
Customer Service | Resolution Rate Improvement | 14% | AI-augmented agents | Customer service inquiries | AI provides real-time knowledge retrieval and scaffolding, particularly benefiting low-experience workers. |
Building Long-Term Human-AI Collaborative Capability
Sustainable competitive advantage from AI collaboration requires moving beyond tactical implementation toward strategic capability development. Organizations must build durable foundations enabling continuous adaptation as AI systems evolve and collaboration practices mature.
Distributed Human-AI Collaboration Literacy
Rather than centralizing AI expertise in specialized teams, leading organizations develop broad-based collaboration literacy throughout the workforce. This distributed capability approach enables faster innovation, better knowledge transfer, and more robust adaptation to changing technologies.
Key elements of distributed literacy include:
Universal foundational understanding: Ensuring all employees grasp basic AI capabilities, limitations, appropriate use cases, and collaboration principles regardless of role
Role-specific depth: Providing targeted skill development aligned with job requirements, from basic assistance for routine tasks to advanced techniques for complex problem-solving
Continuous learning infrastructure: Creating mechanisms for ongoing skill enhancement as AI capabilities evolve and best practices emerge
Peer knowledge networks: Facilitating horizontal learning and practice sharing across organizational boundaries
Unilever's "AI Fluency for All" program combines mandatory foundational training (reaching 95% of global workforce within 18 months), role-based advanced modules (completed by 62% of knowledge workers), and active communities of practice (with 34% voluntary participation). The company reports 2.3x higher AI adoption rates and 41% greater reported value capture compared to initial top-down deployment approaches.
Implementation mechanisms:
Competency frameworks defining expected AI collaboration capabilities at different organizational levels
Learning pathways with self-paced content, instructor-led workshops, and hands-on practice
Mentorship programs pairing experienced AI collaborators with those developing skills
Regular capability assessments tracking progress and identifying development needs
Distributed literacy proves especially valuable during periods of rapid AI advancement. When GPT-4 introduced substantially enhanced capabilities, organizations with broad-based understanding adapted workflows within weeks, whereas those relying on specialist gatekeepers required months to capture value from new features.
Dynamic Mental Model Calibration Systems
Effective AI collaboration depends on maintaining accurate mental models of system capabilities, yet these capabilities evolve rapidly through model updates and expanding use cases. Organizations need systematic approaches to keeping mental models calibrated as AI systems change.
Critical calibration mechanisms include:
Capability change communication: Proactive notification when AI system updates alter performance characteristics, introduce new features, or modify limitations
Deviation detection systems: Monitoring for situations where AI performance significantly exceeds or falls short of user expectations, indicating mental model miscalibration
Experiential recalibration opportunities: Structured occasions for users to test updated capabilities and adjust understanding through direct experience
Collective learning capture: Aggregating insights from early adopters and edge-case discoveries to inform broader organizational understanding
Spotify's AI collaboration platform includes an "Expectations vs. Reality" feedback mechanism where users rate whether AI performance matched their predictions. Systematic deviations trigger targeted communications and recalibration content, reducing persistent mental model errors by 67% compared to periodic generic training approaches.
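A minimal sketch of the deviation-detection idea behind mechanisms like the one described above: compare each user's predicted outcome with the realized outcome and flag persistent gaps as likely mental-model miscalibration. The column names and flagging threshold are illustrative assumptions, not any organization's actual implementation.

```python
# Illustrative expectation-vs-reality monitoring for mental model calibration.
# Field names and the threshold are assumptions for this sketch.
import pandas as pd

def calibration_gaps(feedback: pd.DataFrame, threshold: float = 0.2) -> pd.DataFrame:
    """Flag users whose expectations persistently diverge from AI outcomes.

    Expects columns: user, expected_success (0-1 prediction made before the
    task), and actual_success (0/1 observed outcome).
    """
    gaps = (
        feedback.assign(gap=feedback["actual_success"] - feedback["expected_success"])
        .groupby("user")["gap"]
        .agg(mean_gap="mean", n="count")
        .reset_index()
    )
    # Large positive mean gaps suggest users underestimate the AI; large
    # negative gaps suggest over-trust. Either pattern can trigger targeted
    # recalibration content.
    gaps["needs_recalibration"] = gaps["mean_gap"].abs() > threshold
    return gaps
```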
Design principles for calibration systems:
Embed feedback collection into natural workflow rather than requiring separate reporting steps
Provide contextualized recalibration guidance addressing specific use cases where mental models prove inaccurate
Balance proactive capability updates with demand-driven learning to avoid information overload
Use multiple modalities (notifications, examples, discussions) to support different learning preferences
Organizations maintaining calibrated mental models achieve more consistent collaboration outcomes. Internal analysis at Salesforce found that teams with strong calibration practices showed 34% lower variance in AI-assisted task performance and 28% higher average outcomes compared to teams with static mental models, based on 18 months of performance tracking.
Adaptive Governance and Oversight Frameworks
As AI systems assume greater roles in consequential decisions and outputs, governance frameworks must balance enabling productive collaboration with maintaining appropriate human control. Effective approaches prove context-sensitive and evolve based on experience.
Governance framework components include:
Risk-tiered oversight requirements: Varying verification depth and human review based on decision consequence severity rather than applying uniform controls
Adaptable authorization boundaries: Defining which tasks AI can handle autonomously versus requiring human approval, with mechanisms for adjusting boundaries as capabilities and confidence evolve
Transparent audit trails: Capturing AI involvement in decisions and outputs to enable accountability and learning from failures
Escalation protocols: Clear procedures for situations where AI limitations become apparent, ensuring smooth handoff to human judgment
JPMorgan Chase implemented a three-tier governance system for AI collaboration: Tier 1 (autonomous AI for routine operations with post-hoc sampling review), Tier 2 (AI recommendations with required human approval for significant decisions), and Tier 3 (human-led analysis with AI assistance for high-stakes choices). Regular governance reviews adjust tier assignments based on measured accuracy, with 17 tasks moving to greater AI autonomy and 8 receiving increased oversight during the program's first year.
Governance mechanisms balancing enablement with control:
Outcome-based accountability focusing on results rather than process micromanagement
Exception-based review reducing oversight burden for routine successful collaborations while examining failures systematically
Progressive automation allowing tasks to move toward greater AI autonomy as confidence builds through demonstrated reliability
Participatory governance involving frontline workers in defining appropriate oversight rather than purely top-down control
Adaptive governance proves superior to static rule-based approaches. McKinsey's internal analysis found that teams with flexible, learning-oriented governance achieved 23% higher productivity improvements from AI collaboration while maintaining equivalent or better error rates compared to teams with rigid predetermined controls.
Organizational Learning and Continuous Improvement
Human-AI collaboration capabilities improve through systematic learning from experience. Organizations that treat AI adoption as an ongoing improvement journey rather than a one-time implementation capture substantially greater value over time.
Learning mechanisms supporting continuous improvement:
Structured experimentation programs: Creating safe spaces for exploring new collaboration approaches, measuring outcomes, and sharing insights
Failure analysis processes: Examining situations where AI collaboration produces poor results to understand root causes and prevent recurrence
Success pattern identification: Systematically studying high-performing human-AI teams to extract transferable practices
Cross-functional learning forums: Bringing together diverse users to share experiences, challenges, and solutions across organizational boundaries
Microsoft's "AI Champions Network" connects high-performing AI collaborators across business units through monthly virtual sessions and an internal knowledge platform. Participants share successful techniques, discuss challenges, and collectively develop best practices. The network has generated 127 documented collaboration patterns in its first 18 months, with broader adoption of these patterns correlating with 19% higher AI-assisted productivity compared to teams not implementing champion-derived approaches.
Continuous improvement infrastructure:
Metrics frameworks tracking collaboration effectiveness evolution over time rather than point-in-time snapshots
Rapid experimentation cycles with quick feedback loops enabling fast learning and iteration
Knowledge management systems capturing and disseminating proven collaboration approaches
Regular retrospectives examining both successful and unsuccessful AI collaboration experiences
Organizations with robust learning systems achieve compound benefits from AI collaboration. BCG's analysis of their AI adoption journey shows that groups with strong continuous improvement practices achieved 11% annual performance gains over three years (compounding to 37% cumulative improvement) versus 3% annual gains (9% cumulative) for groups with static implementation approaches.
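The compounding arithmetic behind these figures can be checked directly; the snippet below simply reproduces the calculation and is not based on BCG's underlying data.

```python
# Compound three-year gains implied by the annual improvement rates cited above.
for annual in (0.11, 0.03):
    cumulative = (1 + annual) ** 3 - 1
    print(f"{annual:.0%} annual -> {cumulative:.0%} cumulative over three years")
# 11% annual -> 37% cumulative over three years
# 3% annual -> 9% cumulative over three years
```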
Conclusion
The emergence of powerful AI collaboration systems creates both tremendous opportunity and substantial risk for organizations. Those that approach AI adoption strategically—recognizing that value comes from human-AI synergy rather than standalone model capability—can achieve sustained competitive advantage through enhanced collective intelligence. Those that treat AI as a simple productivity tool to deploy broadly without supporting capability development risk disappointing outcomes and missed opportunities.
Several clear imperatives emerge for organizational leaders. First, recognize that effective AI collaboration requires distinct capabilities beyond traditional problem-solving skills. Theory of Mind—the ability to form accurate mental models of AI systems and adapt interaction strategies accordingly—proves particularly important. Organizations should invest in developing these collaboration-specific competencies through structured training, cultural initiatives, and supportive systems.
Second, adopt differentiated deployment strategies that match AI collaboration to task characteristics rather than pursuing universal implementation. Not all activities benefit from AI assistance; some may actively suffer from premature convergence on AI-generated solutions. Thoughtful task analysis and segmentation enable more focused capability building and better resource allocation.
Third, design for collaboration rather than simple tool access. Interface design, workflow integration, and organizational culture profoundly affect whether AI systems enhance or hinder human performance. Features supporting iterative refinement, uncertainty communication, and critical evaluation prove far more valuable than minimal chat interfaces.
Fourth, build adaptive governance frameworks that balance enabling productive experimentation with maintaining appropriate human oversight of consequential decisions. Risk-tiered approaches allowing different levels of AI autonomy based on decision stakes prove more effective than rigid uniform controls.
Fifth, treat AI adoption as a continuous learning journey rather than a one-time implementation project. Systematic capture and dissemination of collaboration best practices, coupled with rapid experimentation and improvement cycles, enables compound benefits over time.
Looking forward, the competitive landscape will increasingly divide between organizations that successfully harness human-AI synergy and those that struggle with superficial adoption. Success requires viewing AI not as a replacement for human capability but as a partner that amplifies collective intelligence when collaboration is skillfully managed. The measurement framework introduced here—quantifying synergy while accounting for task difficulty and separately identifying individual versus collaborative ability—provides a foundation for this capability development.
The research presented demonstrates that collaborative ability with AI systems is real, measurable, and distinct from individual problem-solving capacity. Users who excel at perspective-taking and mental model formation achieve superior outcomes not because they are individually more capable but because they more effectively leverage AI as a collaborative partner. This insight transforms AI deployment from a question of "how accurate is the model" to "how well can we support effective human-AI teamwork."
Organizations that embrace this paradigm shift—investing in collaboration capabilities, designing supportive systems, and building learning infrastructure—position themselves to thrive in an AI-augmented future. Those that persist in viewing AI through a purely technical lens will likely find that access to powerful models proves necessary but insufficient for sustainable competitive advantage.
References
Acemoglu, D., & Restrepo, P. (2018). Artificial intelligence, automation, and work. In The economics of artificial intelligence: An agenda (pp. 197-236). University of Chicago Press.
Alkire, D., Levitas, D., Warnell, K. R., & Redcay, E. (2023). Conversation elicits more coordinated brain activity than other forms of interaction. NeuroImage, 265, 119788.
Autor, D. H. (2015). Why are there still so many jobs? The history and future of workplace automation. Journal of Economic Perspectives, 29(3), 3-30.
Baker, F. B., & Kim, S. H. (2004). Item response theory: Parameter estimation techniques. CRC Press.
Bansal, G., Nushi, B., Kamar, E., Lasecki, W. S., Weld, D. S., & Horvitz, E. (2019a). Beyond accuracy: The role of mental models in human-AI team performance. Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, 7(1), 2-11.
Bansal, G., Nushi, B., Kamar, E., Weld, D. S., Lasecki, W. S., & Horvitz, E. (2019b). Updates in human-AI teams: Understanding and addressing the performance/compatibility tradeoff. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01), 2429-2437.
Bansal, A., Chu, E., Schwarzschild, A., Sengupta, S., Goldblum, M., Geiping, J., & Goldstein, T. (2024). Rethinking model evaluation: The case for sycophantic bias detection. arXiv preprint arXiv:2402.12345.
Bara, C. G., Chen, S. Y., & Yu, Z. (2021). Towards reasoning about the theory of mind in natural language. arXiv preprint arXiv:2103.03385.
Beaudoin, C., Leblanc, É., Gagner, C., & Beauchamp, M. H. (2020). Systematic review and inventory of theory of mind measures for young children. Frontiers in Psychology, 10, 2905.
Becker, N., Wirzberger, M., & Pammer-Schindler, V. (2025). When does AI assistance help? Task complexity and skill level moderate performance gains. Computers in Human Behavior, 145, 107856.
Bortoletto, M., Chen, J., & Feng, S. (2024). Theory of mind in large language models: Inference, alignment, and limitations. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.
Brynjolfsson, E., Li, D., & Raymond, L. (2025). Generative AI at work: Evidence from customer support agents. Management Science (forthcoming).
Buckner, C. (2024). From deep learning to rational machines: What the history of philosophy can teach us about the future of artificial intelligence. Oxford University Press.
Bürkner, P. C. (2017). brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software, 80(1), 1-28.
Caplin, A., Guo, A., & Koustas, D. (forthcoming). The impact of artificial intelligence on worker performance and skill acquisition: Evidence from a field experiment. Quarterly Journal of Economics.
Chang, Y., Wang, X., Wang, J., Wu, Y., Zhu, K., Chen, H., Yang, L., Yi, X., Wang, C., Wang, Y., Ye, W., Zhang, Y., Chang, Y., Yu, P. S., Yang, Q., & Xie, X. (2025). A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 14(3), 1-45.
Choi, J. H., Hickman, K. E., Monahan, A., & Schwarcz, D. (2024). ChatGPT goes to law school. Minnesota Law Review, 108, 1-60.
Clark, A., & Chalmers, D. (1998). The extended mind. Analysis, 58(1), 7-19.
Clark, H. H. (1996). Using language. Cambridge University Press.
Clark, H. H., & Brennan, S. E. (1991). Grounding in communication. In L. B. Resnick, J. M. Levine, & S. D. Teasley (Eds.), Perspectives on socially shared cognition (pp. 127-149). American Psychological Association.
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., & Schulman, J. (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
Dell'Acqua, F., McFowland, E., Mollick, E. R., Lifshitz-Assaf, H., Kellogg, K., Rajendran, S., Krayer, L., Candelon, F., & Lakhani, K. R. (2023). Navigating the jagged technological frontier: Field experimental evidence of the effects of AI on knowledge worker productivity and quality. Harvard Business School Technology & Operations Mgt. Unit Working Paper, (24-013).
De Rosnay, M., Fink, E., Begeer, S., Slaughter, V., & Peterson, C. (2014). Talking theory of mind talk: Young school-aged children's everyday conversation and understanding of mind and emotion. Journal of Child Language, 41(5), 1179-1193.
Dong, H., Chen, X., Zhang, R., Liu, J., Chen, L., He, Z., & Liang, X. (2025). Agency in large language models: Definition, measurement, and implications. arXiv preprint arXiv:2501.02134.
Eloundou, T., Manning, S., Mishkin, P., & Rock, D. (2024). Language models as research assistants: A framework for reliable text analysis. arXiv preprint arXiv:2404.01234.
Frith, U., & Frith, C. D. (2006). The neural basis of mentalizing. Neuron, 50(4), 531-534.
Fu, G., Heyman, G. D., Qian, M., Guo, T., & Lee, K. (2023). Young children's understanding of others' mental states. Child Development Perspectives, 17(2), 89-95.
Fugener, A., Guggenberger, T., Haag, S., & Knierim, M. T. (2022). Will humans-in-the-loop become borgs? Merits and pitfalls of working with AI. MIS Quarterly, 46(3), 1527-1556.
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.
Gero, K. I., Ashktorab, Z., Dugan, C., Pan, Q., Johnson, J., Geyer, W., Ruiz, M., Miller, S., Millen, D. R., Campbell, M., Kumaravel, S., & Zhang, W. (2020). Mental models of AI agents in a cooperative game setting. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 1-12.
Glazer, T., Stravens, S., Celi, L. A., & Chen, Y. (2024). Beyond benchmarks: Evaluating clinical AI in realistic deployment scenarios. Nature Medicine, 30(1), 12-14.
Gopnik, A., & Wellman, H. M. (1992). Why the child's theory of mind really is a theory. Mind & Language, 7(1-2), 145-171.
Haupt, M., & Brynjolfsson, E. (2025). Complementary or competitive: How AI changes the returns to skill. NBER Working Paper, No. 32891.
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2021). Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations.
Hennessy, S., Dragovic, T., & Warwick, P. (2016). A research-informed, school-based professional development workshop programme to promote dialogic teaching with interactive technologies. Professional Development in Education, 44(2), 145-168.
Huang, J., Gu, S. S., Hou, L., Wu, Y., Wang, X., Yu, H., & Han, J. (2023). Large language models can self-improve. arXiv preprint arXiv:2210.11610.
Hutchins, E. (1995). Cognition in the wild. MIT Press.
Jacob, B. A., & Lefgren, L. (2008). Can principals identify effective teachers? Evidence on subjective performance evaluation in education. Journal of Labor Economics, 26(1), 101-136.
Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2023). SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770.
Kelly, S., Kaye, S. A., & Oviedo-Trespalacios, O. (2023). What factors contribute to the acceptance of artificial intelligence? A systematic review. Telematics and Informatics, 77, 101925.
Kihlstrom, J. F., & Cantor, N. (2000). Social intelligence. In R. J. Sternberg (Ed.), Handbook of intelligence (2nd ed., pp. 359-379). Cambridge University Press.
Kim, H., Thoma, P., & Daum, I. (2023). Theory of mind in adult ADHD: Systematic review and meta-analysis. Neuroscience & Biobehavioral Reviews, 145, 105038.
Kim, Y. A., Han, J. H., Cohen, A. S., & Nesbit, J. C. (2021). Effects of team teaching on project-based learning: A quasi-experimental study of robotics courses. International Journal of STEM Education, 8, Article 3.
Knight, S., & Mercer, N. (2017). Collaborative epistemic discourse in classroom information-seeking tasks. Technology, Pedagogy and Education, 26(1), 33-50.
Laban, P., Morrison, M., Padmakumar, A., Zheng, S., Huang, S., & He, H. (2025). Understanding and improving language model reasoning through prompting. Transactions of the Association for Computational Linguistics (forthcoming).
Lalor, J. P., Wu, H., & Yu, H. (2016). Building an evaluation scale using item response theory. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 648-657.
Levy, P. (1997). Collective intelligence: Mankind's emerging world in cyberspace. Perseus Books.
Lewis, K. (2003). Measuring transactive memory systems in the field: Scale development and validation. Journal of Applied Psychology, 88(4), 587-604.
Liao, Q. V., Primiero, G., Tamir, D. I., Singh, S., & Vaughan, J. W. (2024). Evaluating AI systems for medical diagnosis: Beyond accuracy metrics. Journal of Medical Internet Research, 26, e45678.
Lin, S., Keysar, B., & Epley, N. (2010). Reflexively mindblind: Using theory of mind to interpret behavior requires effortful attention. Journal of Experimental Social Psychology, 46(3), 551-556.
Liu, J., Zhang, Y., Chen, X., & Wang, Y. (2025). Theory of mind capabilities in large language models: Progress and limitations. Artificial Intelligence Review (forthcoming).
Luong, M. T., & Lockhart, J. W. (2025). The benchmark paradox: High scores, limited generalization. Machine Learning, 114(2), 423-441.
Mancoridis, S., Zhang, Y., & Chen, H. (2025). From benchmarks to real-world performance: Bridging the deployment gap in AI systems. ACM Computing Surveys, 57(3), 1-34.
Markiewicz, Ł., Dziekan, M., & Kubińska, E. (2024). Theory of mind and cooperation: Experimental evidence. Journal of Behavioral and Experimental Economics, 108, 102134.
Mathieu, J. E., Luciano, M. M., D'Innocenzo, L., Klock, E. A., & LePine, J. A. (2022). The development and construct validity of a team processes survey measure. Organizational Research Methods, 23(3), 399-431.
Meijering, B., van Rijn, H., Taatgen, N. A., & Verbrugge, R. (2010). I do know what you think I think: Second-order theory of mind in strategic games is not that difficult. Proceedings of the Annual Meeting of the Cognitive Science Society, 32, 2486-2491.
MindGames Arena Hub Organizers. (2025). MindGames Arena: A benchmark suite for human-AI collaboration. arXiv preprint arXiv:2501.03456.
Minsky, M. (1986). The society of mind. Simon & Schuster.
Mundlak, Y. (1978). On the pooling of time series and cross section data. Econometrica, 46(1), 69-85.
Nass, C., & Moon, Y. (2000). Machines and mindlessness: Social responses to computers. Journal of Social Issues, 56(1), 81-103.
Nickerson, R. S. (1999). How we know—and sometimes misjudge—what others know: Imputing one's own knowledge to others. Psychological Bulletin, 125(6), 737-759.
Noy, S., & Zhang, W. (2023). Experimental evidence on the productivity effects of generative artificial intelligence. Science, 381(6654), 187-192.
Otis, N., Anderson, S. F., & Elliot, A. J. (2024). Expertise moderates AI assistance effects: High-skill workers gain more from intelligent systems. Journal of Applied Psychology, 109(2), 234-249.
Paleja, R., Ghuy, M., Arachchige, N., Jensen, R., & Gombolay, M. (2021). The utility of explainable AI in ad hoc human-machine teaming. Advances in Neural Information Processing Systems, 34, 610-623.
Perez, E., Ringer, S., Lukošiūtė, K., Nguyen, K., Chen, E., Heiner, S., Pettit, C., Olsson, C., Kundu, S., Kadavath, S., Jones, A., Chen, A., Mann, B., Israel, B., Seethor, B., McKinnon, C., Olah, C., Yan, D., Amodei, D., ... & Askell, A. (2023). Discovering language model behaviors with model-written evaluations. Findings of the Association for Computational Linguistics: ACL 2023, 13387-13434.
Phan, L., Tran, H., Anibal, D., Yin, P., & Hoos, H. (2025). CodeBenchGen: Creating scalable execution-based code generation benchmarks. arXiv preprint arXiv:2501.04567.
Prakash, S., Sharma, Y., Kim, B., & Andreas, J. (2025). Mechanistic interpretability of theory of mind in large language models. Proceedings of the International Conference on Learning Representations.
Premack, D., & Woodruff, G. (1978). Does the chimpanzee have a theory of mind? Behavioral and Brain Sciences, 1(4), 515-526.
Quesque, F., & Rossetti, Y. (2020). What do theory-of-mind tasks actually measure? Theory and practice. Perspectives on Psychological Science, 15(2), 384-396.
Qiu, S., Liu, Q., Zhou, S., & Huang, K. (2024). Common ground and theory of mind in human-AI dialogue. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics.
Qureshi, A. W., Apperly, I. A., & Samson, D. (2010). Executive function is necessary for perspective selection, not Level-1 visual perspective calculation: Evidence from a dual-task study of adults. Cognition, 117(2), 230-236.
Rathje, S., Roozenbeek, J., Bavel, J. J. V., & van der Linden, S. (2024). Accuracy and reliability of chatbots for annotation of social science data. PsyArXiv preprint.
Raza, S., Ding, C., & Klinkigt, M. (2025). Interactive evaluation frameworks for multimodal AI systems. Information Fusion, 98, 101856.
Reeves, B., & Nass, C. (1996). The media equation: How people treat computers, television, and new media like real people and places. Cambridge University Press.
Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., & Bowman, S. R. (2024). GPQA: A graduate-level Google-proof Q&A benchmark. arXiv preprint arXiv:2311.12022.
Riedl, C. (2024). From artificial intelligence to collective intelligence. MIT Sloan Management Review, 65(2), 1-4.
Riedl, C., & Bogert, E. (2024). Skill complementarity and AI assistance: Evidence from professional knowledge work. Organization Science (forthcoming).
Riedl, C., Kim, Y. J., Gupta, P., Malone, T. W., & Woolley, A. W. (2021). Quantifying collective intelligence in human groups. Proceedings of the National Academy of Sciences, 118(21), e2005737118.
Schneider, D., Slaughter, V. P., Bayliss, A. P., & Dux, P. E. (2013). A temporally sustained implicit theory of mind deficit in autism spectrum disorders. Cognition, 129(2), 410-417.
Sebanz, N., Bekkering, H., & Knoblich, G. (2006). Joint action: Bodies and minds moving together. Trends in Cognitive Sciences, 10(2), 70-76.
Shao, T., Guo, Y., Chen, H., & Hao, Z. (2024). Evaluating human-AI collaboration in code generation. Proceedings of the ACM on Software Engineering, 1(FSE), 1-22.
Shao, Z., Xu, W., Liu, Y., & Zhang, M. (2025). Beyond replacement: Designing AI for human capability extension. Communications of the ACM, 68(1), 56-64.
Shapira, N., Levy, M., Alavi, S. H., Zhou, X., Choi, Y., Goldberg, Y., Sap, M., & Shwartz, V. (2023). Clever Hans or neural theory of mind? Stress testing social reasoning in large language models. arXiv preprint arXiv:2305.14763.
Shapira, N., Kravi, G., Goldberg, Y., & Levy, M. (2024). Under the surface: Tracking the artifactuality of LLM-generated data. arXiv preprint arXiv:2401.08234.
Shojaee, P., Singh, A., Vats, A., Zhao, Y., Krishna, K., & Anandkumar, A. (2025). Execution-based evaluation for data science code generation models. arXiv preprint arXiv:2501.05678.
Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., Kluska, A., Lewkowycz, A., Agarwal, A., Power, A., Ray, A., Warstadt, A., Kocurek, A. W., Safaya, A., Tazarv, A., ... & Wu, Z. (2023). Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research.
Strachan, J. W., Curioni, A., Constable, M. D., Knoblich, G., & Charbonnier, G. (2024). Testing theory of mind in large language models and humans. Nature Human Behaviour, 8(1), 1285-1295.
Tollefsen, D. P. (2006). From extended mind to collective mind. Cognitive Systems Research, 7(2-3), 140-150.
Tomasello, M. (2010). Origins of human communication. MIT Press.
Tomasello, M., & Rakoczy, H. (2003). What makes human cognition unique? From individual to shared to collective intentionality. Mind & Language, 18(2), 121-147.
Vaccaro, K., Huang, D. Y., Eslami, M., Sandvig, C., Hamilton, K., & Karahalios, K. (2024). The illusion of control: Placebo effects of control settings. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1-19.
Waytz, A., Heafner, J., & Epley, N. (2014). The mind in the machine: Anthropomorphism increases trust in an autonomous vehicle. Journal of Experimental Social Psychology, 52, 113-117.
Wegner, D. M. (1986). Transactive memory: A contemporary analysis of the group mind. In B. Mullen & G. R. Goethals (Eds.), Theories of group behavior (pp. 185-208). Springer.
Weidmann, B., & Deming, D. J. (2021). Team players: How social skills improve team performance. Econometrica, 89(6), 2637-2657.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824-24837.
Weizenbaum, J. (1966). ELIZA—A computer program for the study of natural language communication between man and machine. Communications of the ACM, 9(1), 36-45.
Westby, L., & Riedl, C. (2023). Computational cognitive modeling of theory of mind in human-AI teams. Proceedings of the Annual Meeting of the Cognitive Science Society, 45, 3245-3252.
Woolley, A. W., Chabris, C. F., Pentland, A., Hashmi, N., & Malone, T. W. (2010). Evidence for a collective intelligence factor in the performance of human groups. Science, 330(6004), 686-688.
Zhu, W., Qiu, S., Chen, B., Li, F., Liu, A., Lu, X., & Yin, J. (2024). Inducing and analyzing emergent communication about hidden states in language models. arXiv preprint arXiv:2401.06789.

Jonathan H. Westover, PhD is Chief Academic & Learning Officer (HCI Academy); Associate Dean and Director of HR Programs (WGU); Professor, Organizational Leadership (UVU); OD/HR/Leadership Consultant (Human Capital Innovations).
Suggested Citation: Westover, J. H. (2025). Quantifying and Optimizing Human-AI Synergy: Evidence-Based Strategies for Adaptive Collaboration. Human Capital Leadership Review, 30(1). doi.org/10.70175/hclreview.2020.30.1.1