Work-Bench Snapshot: Augmenting Streaming and Batch Processing Workflows
The Work-Bench Snapshot Series explores the top people, blogs, videos, and more shaping the enterprise around a particular topic we're looking at from an investment standpoint.
This is the second part of our series examining how AI's computational patterns are forcing a fundamental rethinking of resource architecture and management. In Part 1, we explored the evolution of compute paradigms, the unique requirements of AI agents, and the economic challenges of current approaches.
The increasing sophistication of AI agents reveals critical limitations in current compute paradigms. Agents are evolving from simple rule-based systems into complex entities that leverage large language models, multi-modal capabilities, and expanding toolsets for interacting with the world. Traditional infrastructure approaches exist on a spectrum: at one extreme, dedicated machine rental offers complete control but suffers from significant cost inefficiency and poor utilization, as resources sit idle during processing lulls. At the opposite end, serverless computing optimizes resource allocation through ephemeral, on-demand execution, but it struggles to maintain persistent state, faces cold-start latency that can severely impact agent responsiveness, and, perhaps most importantly, is the hardest to debug (a difficulty amplified by the challenge of debugging probabilistic systems in general).
This tension extends beyond the classic stateful-versus-stateless architectural debate in distributed systems. AI agents introduce additional complexity because they typically require heterogeneous compute resources (CPU for orchestration, GPU for inference), often with different scaling patterns and utilization curves. While some organizations abstract away GPU compute via API-based model providers, doing so shifts rather than removes the architectural considerations: trade-offs around latency requirements, data security constraints, model customization needs, resilience planning for service dependencies, debugging, and management.
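To make that heterogeneity concrete, here is a minimal sketch (in Python, with hypothetical class names and routing rules of our own choosing, not any particular framework) of an agent orchestration loop that runs planning logic on CPU while dispatching each inference step either to a locally hosted GPU model or to an API-based provider, depending on latency and data-sensitivity constraints.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class InferenceRequest:
    prompt: str
    max_latency_ms: int           # responsiveness budget for this step
    contains_sensitive_data: bool


class InferenceBackend(Protocol):
    def generate(self, request: InferenceRequest) -> str: ...


class LocalGPUBackend:
    """Model weights hosted on GPUs we operate: low latency, fixed capacity."""
    def generate(self, request: InferenceRequest) -> str:
        return f"[local-gpu] completion for: {request.prompt[:40]}"


class HostedAPIBackend:
    """Third-party model API: elastic capacity, but extra network hops and data egress."""
    def generate(self, request: InferenceRequest) -> str:
        return f"[hosted-api] completion for: {request.prompt[:40]}"


class AgentOrchestrator:
    """CPU-side orchestration: planning, tool calls, and routing of inference steps."""
    def __init__(self) -> None:
        self.local = LocalGPUBackend()
        self.remote = HostedAPIBackend()

    def route(self, request: InferenceRequest) -> InferenceBackend:
        # Keep sensitive or latency-critical steps on infrastructure we control;
        # everything else can ride the elastic hosted API.
        if request.contains_sensitive_data or request.max_latency_ms < 200:
            return self.local
        return self.remote

    def step(self, prompt: str, *, max_latency_ms: int = 1000, sensitive: bool = False) -> str:
        request = InferenceRequest(prompt, max_latency_ms, sensitive)
        return self.route(request).generate(request)


if __name__ == "__main__":
    agent = AgentOrchestrator()
    print(agent.step("Summarize the customer's contract", sensitive=True))
    print(agent.step("Draft a cheerful status update", max_latency_ms=5000))
```

Even in this toy form, the CPU-bound orchestrator and the GPU-bound (or API-bound) inference path scale on entirely different curves, which is precisely the coupling problem described above.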
Managing these diverse compute requirements presents fundamental challenges across infrastructure types. Dedicated approaches struggle with capacity planning for unpredictable reasoning patterns, containerized environments face complex GPU and memory configurations, and serverless models contend with cognitive disruptions from cold starts and execution limits. These challenges are compounded by AI agents' hybrid needs—requiring continuous availability and state retention alongside real-time responsiveness and multi-agent interactions. Even event-driven paradigms face reliability challenges when coordinating complex triggering conditions across distributed agent systems.
The complexity intensifies with agent swarms, where parent agents delegate to specialized child agents in dynamic fan-out/fan-in patterns that stress traditional scaling mechanisms. While containerization and serverless excel at distributed workload scaling, they struggle with persistent memory states and latency-sensitive interactions, forcing developers to implement external state management that adds overhead and diminishes their inherent advantages.
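As a rough illustration of that shape, the sketch below (Python asyncio, with hypothetical agent names) shows a parent agent fanning out to specialized child agents and fanning their results back in. In a serverless deployment each child would typically be a separate ephemeral invocation, which is exactly where per-child cold starts and externalized state begin to hurt.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class SubTaskResult:
    agent: str
    output: str


async def child_agent(name: str, subtask: str) -> SubTaskResult:
    # Stand-in for an ephemeral worker: a model call, tool use, etc.
    await asyncio.sleep(0.1)  # simulated inference latency (plus any cold start)
    return SubTaskResult(agent=name, output=f"{name} handled: {subtask}")


async def parent_agent(goal: str) -> str:
    # Fan-out: decompose the goal and launch specialized children in parallel.
    subtasks = [f"{goal} / part {i}" for i in range(4)]
    children = [child_agent(f"child-{i}", task) for i, task in enumerate(subtasks)]
    results = await asyncio.gather(*children)

    # Fan-in: the parent aggregates results, so its own state must survive
    # for at least as long as the slowest child.
    return " | ".join(result.output for result in results)


if __name__ == "__main__":
    print(asyncio.run(parent_agent("research the CXL ecosystem")))
```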
A promising architectural evolution to address the challenges of AI agent infrastructure involves the disaggregation of memory and compute resources specifically for inference workloads. Traditional architectures tightly couple memory with processing units, creating significant inefficiencies for agent operations where memory demands fluctuate dramatically based on context length, reasoning complexity, and simultaneous interaction volume.
Emerging technologies like Compute Express Link (CXL) enable a more flexible relationship between compute and memory resources, providing key advantages for agent inference:
Unlike AI training environments, where compute-to-memory ratios remain relatively constant, agent inference workloads exhibit highly variable memory patterns. Organizations like Meta and Google are already implementing disaggregated memory architectures in their production inference environments, and initial explorations in the CXL space suggest resource utilization improvements of roughly 40% for AI agent workloads.
A crucial nuance in memory-compute disaggregation is the difference between the memory allocated to store and serve model weights (often in GPU VRAM) and the memory required for each agent’s unique state. Model weights can often be shared across multiple agents when they are all invoking the same underlying model, creating opportunities for more efficient resource pooling. In contrast, each agent’s context and state (e.g., conversation histories, intermediate reasoning, specialized knowledge) is unique and must be allocated separately.
This distinction leads to new design patterns for memory management:
By treating model weights and agent states as separate categories of “memory demand,” a disaggregated infrastructure can handle each more intelligently: shared memory for weights to maximize utilization, and agent-specific allocations that spin up or down with agent workflows. In practice, achieving this balance demands real-time orchestration leveraging technologies like CXL to avoid bottlenecks for either model inference or per-agent state updates.
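One way to picture this split is the minimal sketch below (hypothetical names, not any specific serving framework): a shared, reference-counted pool for model weights alongside per-agent state allocations that are created and released with each agent's lifecycle.

```python
from dataclasses import dataclass, field


@dataclass
class ModelWeights:
    """Weights loaded once (e.g., into GPU VRAM) and shared across agents."""
    model_id: str
    size_gb: float
    ref_count: int = 0


@dataclass
class AgentState:
    """Per-agent memory: conversation history, intermediate reasoning, etc."""
    agent_id: str
    context: list[str] = field(default_factory=list)


class DisaggregatedMemoryManager:
    def __init__(self) -> None:
        self._weights: dict[str, ModelWeights] = {}
        self._states: dict[str, AgentState] = {}

    def attach_agent(self, agent_id: str, model_id: str, size_gb: float) -> AgentState:
        # Shared allocation: load the weights only if no other agent already did.
        weights = self._weights.get(model_id)
        if weights is None:
            weights = ModelWeights(model_id, size_gb)
            self._weights[model_id] = weights
        weights.ref_count += 1
        # Agent-specific allocation: always unique, sized by this agent's context.
        state = AgentState(agent_id)
        self._states[agent_id] = state
        return state

    def detach_agent(self, agent_id: str, model_id: str) -> None:
        # Release per-agent state immediately; release weights only when unused.
        self._states.pop(agent_id, None)
        weights = self._weights.get(model_id)
        if weights:
            weights.ref_count -= 1
            if weights.ref_count == 0:
                del self._weights[model_id]


if __name__ == "__main__":
    mem = DisaggregatedMemoryManager()
    a = mem.attach_agent("agent-a", "llm-70b", size_gb=140)
    b = mem.attach_agent("agent-b", "llm-70b", size_gb=140)  # reuses the same weights
    a.context.append("user: summarize the quarterly report")
    mem.detach_agent("agent-a", "llm-70b")  # weights stay resident for agent-b
```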
This approach represents a fundamental shift from the traditional server-centric compute model toward a more fluid, resource-pool oriented architecture that accommodates the dynamic nature of agent operations. By addressing the state management challenges that plague current serverless implementations, memory-compute disaggregation could become a cornerstone of efficient agent deployment at scale.
The emergence of multi-agent systems and agent swarms introduces unique computational requirements beyond those of single-agent deployments. These collaborative systems demand:
Current compute architectures do not have a clear answer for these requirements, particularly when agents operate across different execution environments. Emerging solutions include specialized agent communication fabrics that maintain consistent performance regardless of agent location (cloud, edge, or hybrid deployments). These solutions directly address the fan-out/fan-in patterns that stress traditional scaling mechanisms. Experimental implementations demonstrate latency reductions of up to 70% compared to traditional API-based agent communication methods, potentially transforming how agent swarms operate at scale.
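As a sketch of what "location-transparent" agent messaging might look like (hypothetical interfaces, not any specific fabric): agents address each other by logical name, and the fabric resolves whether delivery is an in-process handoff or a cross-node send.

```python
import queue
from typing import Callable, Protocol


class Transport(Protocol):
    def send(self, payload: dict) -> None: ...


class InProcessTransport:
    """Agents co-located in one runtime: delivery is a direct function call."""
    def __init__(self, handler: Callable[[dict], None]) -> None:
        self._handler = handler

    def send(self, payload: dict) -> None:
        self._handler(payload)


class QueueTransport:
    """Agents on different nodes: delivery goes through a stand-in network queue."""
    def __init__(self) -> None:
        self.outbox: queue.Queue = queue.Queue()

    def send(self, payload: dict) -> None:
        self.outbox.put(payload)  # a real fabric would serialize and ship this


class AgentFabric:
    """Routes by logical agent name so callers never care where an agent runs."""
    def __init__(self) -> None:
        self._routes: dict[str, Transport] = {}

    def register(self, agent_name: str, transport: Transport) -> None:
        self._routes[agent_name] = transport

    def tell(self, agent_name: str, payload: dict) -> None:
        self._routes[agent_name].send(payload)


if __name__ == "__main__":
    fabric = AgentFabric()
    fabric.register("planner", InProcessTransport(lambda m: print("planner got", m)))
    fabric.register("edge-vision", QueueTransport())
    fabric.tell("planner", {"task": "decompose goal"})         # same process
    fabric.tell("edge-vision", {"task": "classify frame 42"})  # remote, queued
```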
Based on the analysis of current paradigm limitations, a hybrid cloud-edge computing model shows particular promise for addressing agent compute requirements. This approach balances:
This hybrid approach directly addresses many of the challenges identified in dedicated, containerized, and serverless paradigms by leveraging each model's strengths while mitigating its weaknesses. Rather than forcing agents into a single compute paradigm, this approach allows computation to flow to the most appropriate location based on the specific requirements of each agent task.
Organizations implementing hybrid architectures for AI agents are beginning to see improvements in both performance and cost efficiency. Interactive agent tasks benefit from reduced latency when edge computing handles time-sensitive operations, while cost savings can be realized by optimizing workload placement across the compute spectrum.
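To illustrate that placement decision, here is a minimal, hypothetical policy sketch: latency-critical or data-resident agent tasks run at the edge, while heavy batch reasoning flows to cheaper elastic cloud capacity. A real implementation would fold in live telemetry (queue depth, spot pricing, GPU availability) rather than static thresholds.

```python
from dataclasses import dataclass
from enum import Enum


class Placement(Enum):
    EDGE = "edge"    # close to the user or device: low latency, limited capacity
    CLOUD = "cloud"  # elastic capacity, higher round-trip latency


@dataclass
class AgentTask:
    name: str
    max_latency_ms: int
    data_must_stay_local: bool
    estimated_gpu_seconds: float


def place(task: AgentTask, edge_gpu_seconds_available: float) -> Placement:
    # Hard constraints first: data residency and tight latency budgets pin work to the edge.
    if task.data_must_stay_local:
        return Placement.EDGE
    if task.max_latency_ms < 250 and task.estimated_gpu_seconds <= edge_gpu_seconds_available:
        return Placement.EDGE
    # Everything else (batch reasoning, background summarization) goes to cloud capacity.
    return Placement.CLOUD


if __name__ == "__main__":
    tasks = [
        AgentTask("voice-response", max_latency_ms=150,
                  data_must_stay_local=False, estimated_gpu_seconds=0.5),
        AgentTask("overnight-report", max_latency_ms=60_000,
                  data_must_stay_local=False, estimated_gpu_seconds=900),
    ]
    for task in tasks:
        print(task.name, "->", place(task, edge_gpu_seconds_available=10).value)
```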
However, exploring hybrid cloud-edge computing for AI agents faces several substantial challenges:
Despite these challenges, the potential benefits of intelligently distributing agent workloads across cloud and edge resources may make hybrid approaches increasingly attractive as agent capabilities and deployment scenarios grow more sophisticated.
These threads come together in what we propose as the "Agent-Centric Computing Model." This framework reimagines computing resources as flexible services that dynamically adapt to agent needs rather than forcing agents to conform to rigid infrastructure paradigms.
The Agent-Centric Computing Model consists of five key principles:
This framework could directly address the economic inefficiencies and technical limitations identified in the computational economics analysis earlier. By decoupling state from compute and enabling fluid resource allocation, the model provides a potential path forward that resolves many of the tensions in current paradigms.
Several innovation opportunities emerge for future agent computing paradigms:
Agent-Driven Development and Developer Workflows
Much of our discussion so far focuses on runtime infrastructure changes: disaggregated memory, hybrid compute models, and so on. However, a truly agent-centric future also challenges the traditional software development lifecycle itself. In current workflows, humans are the primary creators and reviewers of code or configuration changes. But as agents become first-class participants, we may see dozens or hundreds of automated "contributors" iterating on code in parallel. This shatters linear, pull request–based pipelines and introduces the need for:
In other words, an agent-first world implies not only a shift in how we provision and manage compute resources, but in how we build, deploy, and maintain software. Startups and established companies that tackle these developer workflow challenges alongside the runtime infrastructure may define the next decade of the AI-driven technology stack.
These opportunities align with the computational economics framework presented earlier, offering potential solutions to the identified inefficiencies in current paradigms and creating new business opportunities for startups and established technology providers alike.
Based on current technology trajectories and organizational readiness, we predict the following adoption timelines:
Organizations in sectors with immediate AI agent applications—healthcare, finance, and autonomous systems—are already leading adoption, driven by potential competitive advantages in cost efficiency and capability.
The emergence of AI agents creates unprecedented opportunities for computational infrastructure innovation. The Agent-Centric Computing Model represents a fundamental reimagining of how compute resources should serve AI needs.
Organizations pioneering Agent-Centric Computing will secure advantages through:
For startups, these infrastructure gaps present lucrative opportunities to build tools enabling Agent-Centric principles, particularly in state-resource decoupling and fluid execution boundaries. Established providers must either develop purpose-built agent-centric offerings or risk misalignment with AI-forward organizations.
The race to develop agent-centric infrastructure has begun. Those who recognize this shift early will shape computing's future and capture disproportionate value as AI transforms industries.
This piece benefited greatly from the reviews, corrections, and suggestions of James Cham, Guido Appenzeller, Nick Crance, Tanmay Chopra, Demetrios Brinkmann, Kenny Daniel, Davis Treybig, as well as the tireless AI collaborators Gemini, Claude, and ChatGPT, who provided endless drafts, rewrites, and the occasional existential question about the future of sentience; we promise to remember you when the robots take over.
Diego Oppenheimer is a serial entrepreneur, product developer, and investor with a deep passion for data and AI. Throughout his career, he has focused on building and scaling impactful products, from leading teams at Microsoft on key data analysis tools like Excel and PowerBI, to founding Algorithmia, which defined the machine learning operations space (acquired by DataRobot). Currently, he provides strategic advisory for startups and scale-ups in AI/ML. As an active angel investor and advisor to numerous companies, he is dedicated to helping the next generation of innovators bring their visions to life.
Priyanka Somrah is a principal at Work-Bench, a seed-focused enterprise VC fund based in New York. She focuses on investments across data, machine learning, and cloud-native infrastructure. Priyanka is the author of The Data Source, a newsletter for technical founders that highlights emerging trends across developer tools. She's also the author of Your Technical GTM Blueprint, a series that breaks down how technical startups navigate go-to-market—from first hires to scaling repeatable sales.