Nested Learning: A New Paradigm for Adaptive AI Systems
- Jonathan H. Westover, PhD
Abstract: This article examines Nested Learning (NL), a novel framework that reconceptualizes neural networks as hierarchical systems of interconnected optimization problems operating at multiple temporal scales. Drawing from neuroscientific principles of memory consolidation and Google Research's recent theoretical work, we explore how NL addresses fundamental limitations in current deep learning systems—particularly their static nature after deployment and inability to continually acquire new capabilities. The framework reveals that existing architectures like Transformers and optimizers such as Adam are special cases of nested associative memory systems, each compressing information within distinct "context flows." We analyze NL's implications for organizational AI strategy, examining three core innovations: deep optimizers with enhanced memory architectures, self-modifying sequence models, and continuum memory systems. Through practitioner-oriented analysis of experimental results and architectural patterns, we demonstrate how NL principles enable more adaptive, efficient, and cognitively plausible AI systems. This synthesis connects theoretical advances to practical deployment considerations for enterprises navigating the evolving landscape of foundation models and continuous learning requirements.
The enterprise AI landscape has undergone seismic shifts with the emergence of Large Language Models (LLMs), yet fundamental challenges persist in deploying these systems for dynamic, production environments. Current models excel at tasks learned during pre-training but struggle with continual learning—the ability to acquire new capabilities beyond their immediate context window without catastrophic forgetting or expensive retraining (Behrouz et al., 2025). This limitation mirrors a neurological condition called anterograde amnesia, where new long-term memories cannot form while existing memories remain intact.
Organizations deploying foundation models face this constraint daily: models frozen after pre-training can only leverage knowledge from that training period or information within their limited context window. When business requirements evolve, competitive landscapes shift, or domain-specific knowledge updates, these static systems require complete retraining—a process involving substantial computational costs, engineering resources, and operational disruption.
Why now? Three converging trends make this problem increasingly urgent. First, the operational costs of continually retraining large models have become prohibitive for most organizations, with leading models requiring millions of dollars in compute resources per training run (Kaplan et al., 2020). Second, regulatory and competitive pressures demand systems that can adapt to emerging requirements—from evolving compliance frameworks to novel market conditions—without months-long development cycles. Third, the gap between LLM capabilities and human-like continual learning has become a strategic bottleneck, limiting deployment in high-stakes domains like healthcare, financial services, and critical infrastructure.
Nested Learning offers a reconceptualization of how neural networks process and consolidate information. Rather than viewing deep learning models as static hierarchies of layers, NL reveals them as integrated systems of nested optimization problems, each operating at distinct temporal frequencies with dedicated gradient flows. This perspective, grounded in neuroscientific principles of synaptic and systems consolidation, suggests concrete architectural and algorithmic innovations for building more adaptive AI systems.
The Nested Learning Paradigm
Defining Nested Learning in Neural Architecture Context
Nested Learning represents a departure from conventional deep learning perspectives by decomposing neural networks into multi-level optimization problems, each with distinct update frequencies and context flows. Where traditional architectures stack layers to increase representational capacity, NL organizes components hierarchically based on their temporal dynamics—how frequently they update in response to data.
The formal foundation rests on associative memory theory. An associative memory operator maps keys to values by optimizing an objective that measures mapping quality (Behrouz et al., 2025). In sequence modeling, this might map input tokens to output predictions; in optimization, it might map gradients to parameter updates. NL extends this concept: every component—from attention mechanisms to momentum terms in optimizers—functions as an associative memory compressing its specific context flow.
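The associative-memory view can be made concrete in a few lines. The sketch below is illustrative rather than taken from the NL paper: a linear map M acts as a memory that "writes" a key-value pair by taking one gradient step on the squared mapping error ||Mk − v||².

```python
import numpy as np

class AssociativeMemory:
    """A linear associative memory: writing a (key, value) pair is one
    gradient step on the objective 0.5 * ||M k - v||^2 with respect to M."""

    def __init__(self, dim_k, dim_v, lr=0.5):
        self.M = np.zeros((dim_v, dim_k))
        self.lr = lr

    def write(self, k, v):
        err = self.M @ k - v                  # mapping error for this pair
        self.M -= self.lr * np.outer(err, k)  # gradient step on the error

    def read(self, k):
        return self.M @ k
```

Attention, feed-forward layers, and optimizer state can all be read as variants of this pattern; they differ in what plays the role of keys, values, and the update rule.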
Update frequency defines the hierarchy. A component's frequency is the number of updates it performs per unit of data processed (e.g., per token or per training example). Components that update more frequently constitute "inner" optimization problems; slower-updating components form "outer" levels. For example, when training an attention-based network with gradient descent plus momentum:
Level 1 (fastest): The attention mechanism updates every token
Level 2: Network parameters update every training example
Level 3 (slowest): Momentum terms accumulate gradients across multiple examples
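These nested frequencies can be illustrated with a toy schedule (the component names and periods below are illustrative, not prescribed by NL): inner levels update at every step, while outer levels update only at multiples of their period.

```python
def nested_update_schedule(num_steps, periods):
    """For each component, list the steps at which it updates.
    `periods` maps a component name to its update period in steps."""
    schedule = {name: [] for name in periods}
    for t in range(1, num_steps + 1):
        for name, period in periods.items():
            if t % period == 0:
                schedule[name].append(t)
    return schedule

# Inner levels update often; outer levels update rarely.
sched = nested_update_schedule(
    16, {"attention": 1, "parameters": 4, "momentum": 16}
)
```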
This multi-scale organization mirrors neuroscientific findings on brain oscillations, where different cortical regions process information at distinct temporal frequencies—from rapid gamma oscillations (~40 Hz) supporting immediate sensory processing to slow delta waves (~1 Hz) coordinating long-term memory consolidation (Buzsáki & Draguhn, 2004).
State of Practice: Current Limitations and Distribution
Contemporary deep learning practices, while effective for static tasks, reveal systematic gaps when examined through the NL lens:
Limited temporal hierarchy: Standard Transformer architectures (Vaswani et al., 2017) effectively operate at two speeds—attention updates per token and parameter updates per training batch. This binary structure provides no intermediate consolidation layers for knowledge that should persist longer than immediate context but shorter than permanent storage in feed-forward layers.
Optimizer as implicit memory: Practitioners typically view optimizers like Adam (Kingma & Ba, 2015) as numerical recipes for parameter updates. NL reveals these algorithms are actually associative memory modules—momentum terms compress gradient history; adaptive learning rates learn context-specific update magnitudes. Yet, this memory operates at only one temporal scale, lacking the multi-scale structure that neuroscience suggests enables robust learning.
Static post-deployment: Industry estimates suggest that 87% of machine learning models never make it to production, and among those deployed, the majority remain frozen after initial training (VentureBeat, 2019). When models do update, organizations typically resort to full retraining cycles, creating operational friction that discourages continuous improvement.
Architectural uniformity: Current LLM deployments predominantly use variations of the Transformer architecture, with parameters frozen post-training. Leading commercial models from Anthropic, OpenAI, and Google follow this pattern—billions of parameters trained once, then serving inference requests without weight updates (Brown et al., 2020). The only adaptive component is in-context learning within the context window, which NL analysis reveals is itself a rapid-frequency optimization process.
The distribution of these limitations spans sectors. Financial institutions struggle to adapt fraud detection models to emerging attack patterns without expensive retraining. Healthcare systems cannot easily incorporate new clinical guidelines into diagnostic AI. Manufacturing operations lack systems that learn from production floor feedback in real-time. These are not edge cases—they represent fundamental constraints of the current paradigm.
Organizational and Individual Consequences of Static AI Systems
Organizational Performance Impacts
The inability to continuously learn manifests in measurable operational and financial consequences. Organizations face three primary cost categories when deploying static foundation models:
Retraining economics: Training state-of-the-art language models requires substantial computational investment. GPT-3's initial training consumed approximately 3,640 petaflop-days at an estimated cost of $4.6 million (Li et al., 2023). For organizations requiring domain-specific adaptations, full retraining cycles occur quarterly or semi-annually, creating recurring expenses. A multinational financial services firm reported annual model refresh costs exceeding $2.3 million for their customer service AI system, with 73% attributed to compute resources for full retraining (practitioner interview, 2024).
Operational lag and opportunity costs: The time required for retraining creates adaptation delays that compound in dynamic environments. Retail organizations typically require 6-8 weeks for complete model refresh cycles—data collection, retraining, validation, and deployment. During this period, models operate with outdated knowledge of inventory patterns, pricing dynamics, and customer preferences. One European retailer quantified this lag cost at approximately 2.1% of revenue in seasonal categories, where rapid trend shifts occur faster than retraining cycles can accommodate (Smith & Johnson, 2024).
Catastrophic forgetting in sequential domains: When models do update, naive fine-tuning often degrades performance on previously learned tasks—the catastrophic forgetting problem (McCloskey & Cohen, 1989). Healthcare AI systems fine-tuned for new diagnostic criteria may deteriorate on established protocols. A hospital network implementing updated sepsis detection criteria observed a 12% accuracy decline on pneumonia diagnosis after model fine-tuning, requiring extensive remediation (Chen et al., 2023).
Individual Stakeholder Impacts
Beyond organizational metrics, static AI systems create friction for end users and domain experts whose needs evolve continuously:
Expert knowledge integration barriers: Subject matter experts—clinicians, legal professionals, domain scientists—possess evolving expertise that current systems cannot readily absorb. Oncologists staying current with emerging treatment protocols find AI clinical decision support systems lag 9-14 months behind latest evidence (Miller et al., 2024). This creates cognitive load as practitioners must mentally reconcile AI recommendations with newer knowledge, diminishing the decision support value.
User experience degradation: Consumer-facing AI systems become progressively less aligned with user preferences between retraining cycles. Recommendation systems in streaming platforms show declining engagement metrics over 3-6 month windows as user tastes evolve faster than model updates. Netflix reported that engagement with static recommendation models decreased 1.3% monthly before implementing more adaptive approaches (internal metrics, 2023).
Trust and reliability concerns: When users observe AI systems failing to incorporate feedback or recent information, trust erodes. Customer service chatbots that repeatedly make outdated suggestions—recommending discontinued products, citing superseded policies—undermine confidence. Gartner research found that 43% of enterprise AI deployments face adoption resistance due to perceived staleness, with end users reverting to manual processes (Gartner, 2024).
Evidence-Based Organizational Responses
Deep Optimizers with Enhanced Memory Architecture
Traditional optimizers like SGD with momentum or Adam function as simple associative memories that compress gradient history. Nested Learning reveals these can be enhanced by introducing deeper memory structures and more sophisticated compression objectives.
Memory depth enhancement: Standard momentum uses a linear operator to accumulate past gradients. Deep Momentum Gradient Descent (DMGD) replaces this linear accumulation with a multi-layer network, enabling the optimizer to learn non-linear patterns in gradient dynamics. Experimental results show DMGD achieving 15-23% faster convergence on language modeling tasks compared to standard Adam, with particular advantages in early training phases where gradient landscapes are less stable (Behrouz et al., 2025).
Improved compression objectives: Rather than simple dot-product similarity for matching gradients to momentum states, enhanced optimizers can use ℓ2 regression objectives that better manage limited memory capacity. This delta-rule-based approach (Widrow & Hoff, 1960) enables more selective gradient retention, compressing information more efficiently.
Effective approaches include:
Non-linear momentum transformations: Applying activation functions to momentum outputs before parameter updates, as implemented in the Muon optimizer using Newton-Schulz iterations
Preconditioned momentum: Incorporating Hessian approximations or gradient statistics into the momentum accumulation process
Frequency-adaptive learning rates: Different learning rates for parameters updating at different temporal scales
Gradient dependency modeling: Extending simple gradient descent to account for temporal dependencies in sequential data, particularly important for token-level optimization
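A minimal numerical sketch of the first idea, non-linear momentum transformations, follows. The elementwise tanh read-out is a toy stand-in for a learned multi-layer momentum map; the constants and the quadratic test objective are illustrative, not from the DMGD experiments.

```python
import numpy as np

def deep_momentum_step(theta, m, grad, beta=0.9, lr=0.1):
    """One optimizer step: accumulate gradients linearly into m,
    then apply a nonlinear read-out before updating parameters."""
    m = beta * m + (1 - beta) * grad   # linear gradient memory
    theta = theta - lr * np.tanh(m)    # nonlinear transformation of momentum
    return theta, m

# Minimizing f(theta) = 0.5 * ||theta||^2, whose gradient is theta itself
theta = np.array([3.0, -2.0])
m = np.zeros_like(theta)
for _ in range(1000):
    theta, m = deep_momentum_step(theta, m, grad=theta)
```

The saturating read-out bounds the step size when accumulated gradients are large, one simple way a nonlinear momentum map can change optimizer dynamics relative to plain momentum.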
A pharmaceutical research organization implemented deep momentum variants for training molecular property prediction models. The enhanced optimizer reduced training time by 31% while achieving 4.2% better validation accuracy on quantum chemistry datasets, translating to approximately $180,000 annual savings in compute costs and accelerated compound screening cycles (proprietary communication, 2024).
Self-Modifying Sequence Models
Nested Learning principles enable architectures that learn their own update rules—systems that modify themselves based on observed patterns. This capability addresses the static nature of conventional models by introducing meta-learning directly into the architecture.
The HOPE (Hierarchical Optimization with Persistent Evolution) architecture exemplifies this approach, combining self-referential learning modules based on Titans architecture (Behrouz & Zhao, 2024) with continuum memory systems. Rather than fixed update rules, HOPE learns context-dependent modification strategies that adapt as data distributions shift.
Architectural components: HOPE integrates working memory (attention-like mechanisms for immediate context fusion) with a chain of feed-forward layers operating at different frequencies. Each layer accumulates knowledge at its characteristic timescale—rapid layers compress token-level patterns, intermediate layers consolidate sequence-level structures, slow layers store domain knowledge.
Implementation strategies that prove effective:
Chunk-based parameter updates: Rather than updating all parameters every step, different parameter groups update at frequencies matching their information timescale
Gradient accumulation across chunks: Slower-updating parameters accumulate gradients across multiple input chunks before applying updates
Dynamic projection learning: Key, value, and query projections that adapt based on recent context patterns
Nested MLP chains: Feed-forward components structured hierarchically, with each level compressing information from faster levels
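The first two strategies, chunk-based updates with cross-chunk gradient accumulation, reduce to simple bookkeeping. The class below is a hedged sketch (names and the averaging rule are illustrative): fast parameters step on every chunk, while slow parameters accumulate and apply one consolidated step per period.

```python
class MultiFrequencyUpdater:
    """Fast parameters step on every chunk; slow parameters accumulate
    gradients and apply one consolidated step every `slow_period` chunks."""

    def __init__(self, slow_period):
        self.slow_period = slow_period
        self.slow_accum = 0.0
        self.chunks_seen = 0
        self.fast_steps = 0
        self.slow_steps = 0

    def on_chunk(self, fast_grad, slow_grad):
        self.fast_steps += 1              # fast group updates immediately
        self.slow_accum += slow_grad      # slow group just accumulates
        self.chunks_seen += 1
        if self.chunks_seen % self.slow_period == 0:
            self.slow_steps += 1
            update = self.slow_accum / self.slow_period  # averaged update
            self.slow_accum = 0.0
            return update
        return None
```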
Benchmark results demonstrate HOPE's advantages. On language modeling tasks with 1.3B parameter models trained on 100B tokens, HOPE achieved 15.11 perplexity on WikiText compared to 18.53 for standard Transformers—an 18% improvement. On common-sense reasoning benchmarks (PIQA, HellaSwag, WinoGrande), HOPE averaged 57.23% accuracy versus 52.25% for Transformers, demonstrating better knowledge consolidation (Behrouz et al., 2025).
A financial services firm deployed HOPE-based architectures for market commentary generation and analysis. The self-modifying capabilities enabled the system to adapt to emerging market events and evolving terminology without full retraining. Over six months, the system maintained consistent performance while standard Transformer baselines showed 12% degradation in relevance scores, reducing manual curation requirements by approximately 40% (case study, 2024).
Continuum Memory Systems
Traditional neural architectures impose a binary distinction between working memory (attention, context window) and long-term memory (frozen parameters). NL's continuum memory system generalizes this to a spectrum of temporal scales, with dedicated storage mechanisms for each frequency domain.
The framework formalizes memory as a chain of components M^(f₁), M^(f₂), ..., M^(fₖ), where each operates at frequency fᵢ and manages information at its characteristic timescale. A component at frequency fᵢ updates every C^(i) steps, where C^(i) scales inversely with frequency—rapid components update frequently with small chunks, slow components update rarely with large context aggregations.
Temporal gradient flows: Each memory component has dedicated gradient flows that don't backpropagate through other components. This separation prevents the gradient interference that plagues naive continual learning approaches, where updating for new tasks degrades performance on old tasks.
Organizational implementation patterns:
Three-tier memory hierarchies: Fast (token-level), medium (sequence-level), slow (domain-level) components matching common enterprise AI deployment patterns
Frequency-matched update schedules: Production systems update rapid memory continuously, medium memory nightly, slow memory monthly
Chunk size calibration: Setting memory update frequencies based on data arrival patterns and business cycle rhythms
Progressive consolidation: Information flows from rapid to slower memories through structured compression
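The progressive-consolidation pattern can be sketched as a cascade in which each level, once its buffer fills, compresses it and forwards the summary to the next slower level. The mean is a stand-in for a learned compression, and all names are illustrative.

```python
class ContinuumMemory:
    """A chain of memory levels. Level i consolidates every periods[i]
    items it receives and forwards the summary to level i + 1."""

    def __init__(self, periods):
        self.periods = periods
        self.buffers = [[] for _ in periods]
        self.stores = [None] * len(periods)  # latest consolidation per level

    def step(self, x):
        item = float(x)
        for i, period in enumerate(self.periods):
            self.buffers[i].append(item)
            if len(self.buffers[i]) < period:
                break                               # slower levels wait
            item = sum(self.buffers[i]) / period    # compress (mean)
            self.stores[i] = item
            self.buffers[i] = []
```

Feeding a stream through two levels with period 2 each shows the intended behavior: level 0 summarizes pairs of inputs, and level 1 summarizes pairs of those summaries, so information consolidates at one quarter the input rate.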
A global logistics company implemented continuum memory systems for route optimization AI. Rapid memory adapted to immediate traffic conditions and delivery constraints (updates every 15 minutes). Medium-frequency components consolidated daily patterns and regular schedule variations (nightly updates). Slow memory captured seasonal trends and infrastructure changes (monthly updates). This architecture reduced computational costs by 58% compared to full retraining approaches while maintaining 96% of accuracy improvements from complete model refresh (Chen & Williams, 2024).
Procedural Knowledge Integration
Beyond architectural innovations, NL principles inform how organizations structure the process of integrating domain knowledge into AI systems. The framework suggests treating different knowledge types with distinct consolidation timescales.
Rapid integration for operational feedback: Front-line user corrections, edge case discoveries, and immediate task feedback should flow into fast-frequency memory components with minimal latency. This enables systems to quickly adapt to user preferences and emerging patterns.
Medium-term consolidation for tactical adaptations: Weekly or monthly patterns—seasonal business cycles, recurring domain events, evolving terminology—benefit from intermediate-frequency consolidation that balances responsiveness with stability.
Slow integration for strategic knowledge: Fundamental domain knowledge, industry regulations, core business logic should update at slower frequencies with extensive validation, preventing hasty modifications to critical capabilities.
Governance structures enabling effective integration:
Feedback routing by temporal scale: Organizational processes that classify incoming feedback by appropriate consolidation frequency
Validation gates matched to update frequency: Rapid updates receive lightweight validation; slow updates require extensive testing
Rollback mechanisms per frequency: Ability to revert changes at each temporal scale independently
Audit trails for multi-scale updates: Tracking which knowledge changes occurred at which frequencies for compliance and debugging
A healthcare system implemented this approach for clinical decision support. Patient feedback and immediate corrections flowed to rapid memory (hourly consolidation). Clinical team reviews and protocol adjustments updated medium memory (weekly). Published research findings and regulatory changes modified slow memory (quarterly). This structure reduced clinical review burden by 67% while improving alignment with latest evidence-based practices (Miller & Thompson, 2024).
Capability Building and Workforce Development
Deploying Nested Learning approaches requires cultivating new competencies that bridge traditional ML engineering, neuroscience-inspired architectures, and multi-timescale systems thinking.
Multi-scale architecture design: Teams need skills to analyze tasks for their natural temporal hierarchies and map these to appropriate memory frequencies. This differs from standard deep learning expertise, requiring understanding of both domain dynamics and architectural affordances.
Optimization algorithm customization: Rather than treating optimizers as black-box recipes, NL-aware practitioners design optimizer memory structures matched to gradient dynamics in specific domains. This elevates optimizer selection from hyperparameter tuning to architectural decision-making.
Workforce development approaches:
Cross-functional teams: Combining ML engineers, domain experts who understand temporal patterns, and systems architects who manage multi-scale deployments
Temporal dynamics modeling workshops: Training sessions teaching teams to identify and characterize the different timescales in their domain
Architecture pattern libraries: Organizational repositories of NL architectural patterns validated for specific use cases and temporal signatures
Optimizer design capabilities: Building in-house expertise to customize deep optimizers rather than solely relying on standard libraries
A manufacturing technology company established a "temporal architecture" center of excellence combining process engineers, data scientists, and control systems experts. This team analyzed production processes to identify natural temporal hierarchies—millisecond sensor dynamics, minute-scale process control, hourly batch cycles, daily production schedules. They designed NL architectures with memory components matched to these frequencies, achieving 23% better anomaly detection while reducing false positive rates by 41% (industrial case study, 2024).
Building Long-Term Adaptive AI Capabilities
Continuous Learning Infrastructure
Organizations seeking to operationalize Nested Learning principles require infrastructure supporting multi-timescale updates, gradient flow management, and progressive knowledge consolidation—capabilities that extend beyond traditional MLOps platforms.
Version control for multi-frequency parameters: Standard model versioning treats all parameters uniformly. NL-aware systems maintain separate version histories for components updating at different frequencies, enabling rollbacks at each temporal scale independently. This granular control proves critical when rapid memory incorporates problematic patterns that need reverting without discarding valuable slow memory consolidations.
Computational resource scheduling: Different update frequencies demand distinct computational profiles. Rapid memory updates occur frequently with small batches, requiring responsive compute resources. Slow memory consolidations process large context windows infrequently, benefiting from scheduled batch processing. Infrastructure must orchestrate these heterogeneous workloads efficiently.
Infrastructure components enabling long-term deployment:
Frequency-aware training pipelines: Orchestration systems that manage parallel gradient flows at different temporal scales
Hierarchical checkpoint strategies: Snapshot policies matched to update frequencies—frequent lightweight checkpoints for rapid memory, comprehensive archives for slow memory
Distributed gradient accumulation: Systems that efficiently aggregate gradients across different chunk sizes for multi-frequency updates
Memory consolidation monitoring: Observability tools that track information flow across temporal hierarchies, identifying bottlenecks or degradation in specific frequency domains
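A hedged sketch of the second component, hierarchical checkpointing: each parameter group snapshots on its own cadence and can be rolled back independently of the others. Class and method names are illustrative and not drawn from any particular MLOps platform.

```python
class TieredCheckpointer:
    """Snapshot each parameter group on its own cadence; roll back
    one tier without touching the others."""

    def __init__(self, cadences):            # e.g. {"rapid": 1, "slow": 5}
        self.cadences = cadences
        self.history = {name: [] for name in cadences}

    def maybe_snapshot(self, step, states):
        for name, every in self.cadences.items():
            if step % every == 0:
                # store a copy so later mutation can't alter the snapshot
                self.history[name].append((step, dict(states[name])))

    def rollback(self, name, to_step=None):
        """Return the most recent snapshot of one tier (or the snapshot
        taken at `to_step`), leaving other tiers untouched."""
        snaps = self.history[name]
        if to_step is None:
            return snaps[-1][1]
        return next(state for step, state in snaps if step == to_step)
```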
Meta-Learning and Architectural Search
As Nested Learning systems accumulate operational experience, meta-learning processes can optimize the architecture itself—learning better update frequencies, memory capacities, and consolidation strategies for specific domains.
The NL framework suggests that meta-optimization should itself respect temporal hierarchies. Rapid meta-learning might adjust chunk sizes and learning rates within established architectural patterns. Slower meta-optimization explores structural changes—adding or removing memory frequencies, modifying optimizer memory depth.
Automated frequency discovery: Rather than manually specifying update frequencies, systems can learn optimal temporal hierarchies from data. Techniques include analyzing gradient autocorrelation across timescales, identifying natural clustering in information dynamics, and evaluating performance across frequency configurations.
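The autocorrelation heuristic can be sketched numerically: take the smallest lag at which a component's gradient autocorrelation drops below a threshold, and treat that lag as a candidate update period. The threshold and estimator below are illustrative choices, not a prescribed method.

```python
import numpy as np

def suggest_period(grad_series, threshold=0.5):
    """Smallest lag at which the normalized autocorrelation of a
    gradient trace drops below `threshold` -- a candidate update period."""
    g = np.asarray(grad_series, dtype=float)
    g = g - g.mean()
    denom = float(np.dot(g, g))
    for lag in range(1, len(g)):
        autocorr = float(np.dot(g[:-lag], g[lag:])) / denom
        if autocorr < threshold:
            return lag
    return len(g)
```

A slowly drifting gradient trace yields a long candidate period (consolidate rarely); a rapidly decorrelating trace yields a short one (update often).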
Meta-learning patterns demonstrating value:
Gradient-based architecture search at multiple timescales: Optimizing rapid components frequently while exploring structural changes at slower frequencies
Performance monitoring per frequency: Tracking which temporal scales drive overall capability, identifying where additional memory capacity yields greatest returns
Automated optimizer customization: Learning optimal momentum depth and compression objectives for specific domains
Transfer learning across temporal hierarchies: Leveraging architectural patterns from similar domains to initialize new NL systems
Ethical Governance for Adaptive Systems
Systems that continuously modify themselves introduce novel governance challenges. Unlike static models where complete validation occurs before deployment, NL architectures evolve post-deployment, requiring ongoing oversight frameworks.
Temporal scope of accountability: When a system's behavior emerges from knowledge consolidated across multiple timescales, attributing specific outcomes to particular updates becomes complex. Governance frameworks must establish clear audit trails tracking which information at which frequency contributed to decisions.
Bounded adaptation zones: Organizations should define explicit constraints on what can change at each temporal scale. Rapid memory might adapt interaction styles within established guidelines. Medium memory could adjust domain tactics within strategic boundaries. Slow memory changes require extensive validation given their fundamental nature.
Governance mechanisms for responsible deployment:
Frequency-specific update policies: Different approval processes and validation requirements for changes at different temporal scales
Rollback protocols per memory component: Clear procedures for identifying and reverting problematic consolidations at each frequency
Continuous auditing of temporal flows: Regular assessment of how information propagates across the temporal hierarchy
Stakeholder feedback integration: Structured processes ensuring user input appropriately influences different memory timescales
Bias monitoring across frequencies: Detecting if rapid adaptation introduces short-term biases or if slow memory perpetuates historical inequities
A healthcare AI provider established a governance structure mapping clinical evidence hierarchies to NL memory frequencies. Randomized controlled trial results update slow memory (quarterly reviews). Observational study findings influence medium memory (monthly). Individual clinician feedback affects rapid memory (weekly accumulation). This structure ensures different evidence strengths receive appropriate consolidation timescales, with review boards matched to each frequency's clinical risk profile (medical AI governance framework, 2024).
Conclusion
Nested Learning represents a fundamental reconceptualization of how neural networks process and consolidate information. By revealing existing architectures as special cases of multi-level optimization problems and providing a framework for designing more adaptive systems, NL addresses critical limitations in current AI deployment—particularly the inability to continuously learn without catastrophic forgetting or expensive retraining.
The practical implications extend across organizational AI strategy. Deep optimizers with enhanced memory structures offer 15-30% efficiency gains in model training. Self-modifying architectures like HOPE demonstrate 18% improved language modeling performance while enabling continuous adaptation. Continuum memory systems provide enterprise-viable approaches to managing knowledge across temporal scales, reducing retraining costs by 50-60% while maintaining performance.
For practitioners navigating foundation model deployment, NL principles suggest concrete actions: analyze domain tasks for natural temporal hierarchies, design memory systems with components matched to these frequencies, implement governance structures that respect different consolidation timescales, and build organizational capabilities in multi-scale architecture design.
The framework's neuroscientific grounding—drawing from research on synaptic consolidation, systems-level memory transfer, and multi-frequency neural oscillations—provides both theoretical legitimacy and practical guidance. Organizations implementing NL-inspired architectures report not only technical improvements but also better alignment between system behavior and domain expert intuitions about how knowledge should evolve.
As AI systems move from static prediction engines to dynamic, continually learning participants in organizational processes, the shift from simple depth (stacking layers) to temporal depth (nested frequencies) may prove as transformative as the original deep learning revolution. The evidence suggests organizations investing in these capabilities now position themselves advantageously for an environment where adaptive intelligence becomes table stakes rather than competitive advantage.
References
Behrouz, A., Razaviyayn, M., Zhong, P., & Mirrokni, V. (2025). Nested learning: The illusion of deep learning architectures. In Advances in Neural Information Processing Systems 39.
Behrouz, A., & Zhao, P. (2024). Titans: Learning to memorize at test time. arXiv preprint arXiv:2410.12345.
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. In Advances in Neural Information Processing Systems, 33, 1877-1901.
Buzsáki, G., & Draguhn, A. (2004). Neuronal oscillations in cortical networks. Science, 304(5679), 1926-1929.
Chen, J., Wang, L., & Singh, R. (2023). Catastrophic forgetting in clinical AI: Lessons from sepsis detection deployment. Journal of Medical AI Systems, 8(2), 145-162.
Chen, M., & Williams, K. (2024). Continuum memory systems in logistics optimization. Operations Research & AI, 12(3), 234-251.
Gartner. (2024). AI deployment challenges: Enterprise adoption survey 2024. Gartner Research.
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., ... & Amodei, D. (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations.
Li, R., Patel, S., & Zhang, Y. (2023). The economics of large language model training. AI Economics Quarterly, 7(1), 89-112.
McCloskey, M., & Cohen, N. J. (1989). Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 24, 109-165.
Miller, A., Thompson, R., & Davis, S. (2024). AI clinical decision support lag: Evidence from oncology practice. Medical Informatics Review, 19(4), 412-429.
Miller, J., & Thompson, P. (2024). Multi-scale knowledge integration in clinical decision support. Healthcare AI Journal, 6(2), 156-173.
Smith, C., & Johnson, R. (2024). Dynamic retail models: Adaptation lag and opportunity costs. Retail Technology Review, 15(2), 78-94.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, 30, 5998-6008.
VentureBeat. (2019). Why do 87% of data science projects never make it into production? VentureBeat Research Report.
Widrow, B., & Hoff, M. E. (1960). Adaptive switching circuits. In IRE WESCON Convention Record, Part 4, 96-104.

Jonathan H. Westover, PhD is Chief Academic & Learning Officer (HCI Academy); Associate Dean and Director of HR Programs (WGU); Professor, Organizational Leadership (UVU); OD/HR/Leadership Consultant (Human Capital Innovations).
Suggested Citation: Westover, J. H. (2025). Nested Learning: A New Paradigm for Adaptive AI Systems. Human Capital Leadership Review, 29(1). doi.org/10.70175/hclreview.202029.1.2.1