Insights 15: The State of AI Agents in 2025: Balancing Optimism with Reality
By Vu Ha - Technical Director
Introduction
AI “agents” have burst onto the tech scene with tremendous buzz. Many are even calling 2025 the “Year of the AI Agent” (REAL AI: Year of the AI Agent, AI facts, headlines and AI quote of the week - WAV Group Consulting), expecting autonomous AI assistants to transform how we work and live. Unlike the passive chatbots of 2024, these agents promise not just to answer you, but to act on your behalf – whether that’s booking travel, researching complex questions, or handling business workflows.
However, amid the excitement, it’s crucial for startups to separate hype from reality. As one commentator wryly noted, “Everyone is talking about AI agents. But so far, a lot of that has just been, well, talk.” (93% of IT leaders see value in AI agents but struggle to deliver, Salesforce finds | VentureBeat). In other words, the potential is huge, but real-world results are still catching up. This post takes a hard look at the state of AI agents in 2025 – what’s been achieved, what hasn’t, and what it all means for technical teams and startup executives. We’ll explore the latest developments from major AI players, industry-wide trends, and the myriad challenges that remain. The goal is to provide an insightful, balanced view: optimism for what’s coming, tempered by healthy skepticism so you know where the pitfalls lie.
Major Developments from Big Players
The push toward agentic AI is led by the usual suspects – OpenAI, Anthropic, and Google DeepMind – each making notable strides (and stumbles) in the past year. Here we break down their flagship efforts, along with key strengths and limitations:
OpenAI – “Deep Research” and Operator: OpenAI has doubled down on agent research. In early 2025 they unveiled Deep Research, an AI agent designed for in-depth, multi-step internet research (OpenAI's New Research-Focused AI Agent and Altman's Open Source Revelation). Built on a specialized version of their latest reasoning model, Deep Research can autonomously browse the web, use tools (like Python code), and compile information. OpenAI claims it can accomplish “in tens of minutes what would take a human many hours” (OpenAI's New Research-Focused AI Agent and Altman's Open Source Revelation) on tasks ranging from scientific research to hyper-personalized shopping advice. Early benchmarks back some of the hype: Deep Research set a new high score of 26.6% on Humanity’s Last Exam, a tough expert-level reasoning test (for reference, previous models scored much lower) (OpenAI's New Research-Focused AI Agent and Altman's Open Source Revelation). In practice, it means the agent can independently find and reason through information to answer very complex queries.
Yet, like all current agents, Deep Research isn’t magic. OpenAI openly acknowledges it hallucinates facts and lacks judgment at times – mixing up authoritative sources and rumors, or failing to convey uncertainty (OpenAI's New Research-Focused AI Agent and Altman's Open Source Revelation). In their internal evals, it sometimes confidently presents wrong answers or misattributes info. Such issues are especially concerning for a tool meant to do “research.” So, while Deep Research showcases the promise of an AI research assistant, it remains a research preview itself – available to Pro tier users with limits (100 queries/month) (OpenAI's New Research-Focused AI Agent and Altman's Open Source Revelation) – indicating OpenAI knows it’s not yet ready for wide, unsupervised use.
OpenAI’s second big play is Operator, introduced as “an agent that can go to the web to perform tasks for you” (OpenAI debuts Operator, an AI agent with ecommerce applications) (OpenAI’s big, new Operator AI already has problems | Digital Trends). Unlike Deep Research, which focuses on information gathering, Operator is more about getting things done online. It’s essentially a superpowered digital assistant that can take actions in a browser the way a person would: clicking, scrolling, filling forms, etc. (OpenAI debuts Operator, an AI agent with ecommerce applications). Want to order groceries or book a flight? Instead of just giving you instructions, Operator will attempt to do it for you end-to-end. In OpenAI’s demo, a user could say “Find me a 4-star hotel in Paris under $200 and book it,” and the agent would autonomously navigate travel websites to fulfill the request. Operator was launched as OpenAI’s first official “agentic AI” tool (in a limited U.S. preview) and already partnered with early adopters like eBay, Etsy, and Instacart for e-commerce use cases (OpenAI debuts Operator, an AI agent with ecommerce applications).
The strength of Operator is its proactive autonomy – it transforms ChatGPT from a talker into an actor. Under the hood, it runs a version of GPT-4 (dubbed GPT-4o) augmented with a built-in web browser and vision, allowing it to see and interact with webpages much like a human user (OpenAI’s big, new Operator AI already has problems | Digital Trends). In theory, this could save users lots of tedious clicks and form-filling. However, the current limitations are notable. Operator is only available to a small pool of users (those on OpenAI’s $200/month ChatGPT Pro plan) (OpenAI debuts Operator, an AI agent with ecommerce applications) (OpenAI’s big, new Operator AI already has problems | Digital Trends), reflecting its research preview status. Early testers have reported the agent can be slow or get confused on real websites, sometimes hallucinating actions or misreading pages (OpenAI’s big, new Operator AI already has problems | Digital Trends). In one instance, it mishandled a news site until Sam Altman himself jumped on Twitter (er, X) to promise a fix (OpenAI’s big, new Operator AI already has problems | Digital Trends). OpenAI also priced Operator high and geofenced it (U.S. only at launch), signaling that it’s not polished enough for a mass roll-out yet (OpenAI’s big, new Operator AI already has problems | Digital Trends). So while Operator heralds an exciting future where AI agents handle our online drudgery, in 2025 it’s still very much a beta experience – impressive in demos, but needing more refinement and trust before it’s truly ready for prime time.
Anthropic – Claude’s “Computer Use” powers: Anthropic, known for its Claude language model, took a slightly different approach to agents. Rather than releasing a standalone assistant product, they taught Claude how to use a computer’s interface (the way Operator does) and offered this capability via API. Internally dubbed Claude Computer Use (CCU), this feature (now in beta) lets Claude control a virtual machine through vision and mouse/keyboard actions (Developing a computer use model \ Anthropic). Essentially, Claude can look at screenshots of a computer screen and interact with software — clicking buttons, scrolling, typing — based on what it “sees.” This was a major research undertaking: the team had to train Claude to count pixels on the screen accurately (so it clicks the right places) and to interpret GUI elements from images (Developing a computer use model \ Anthropic). With surprisingly little training (on just a calculator app and text editor), Claude generalized these skills to many desktop tasks (Developing a computer use model \ Anthropic). In one demo, you can ask Claude to set up a spreadsheet or edit a photo, and it will dutifully operate the relevant software as if an invisible novice user were at the controls. It’s a completely different paradigm from traditional API-based automation – the AI is literally using the same interfaces a human would.
The strength of Anthropic’s approach is that it can, in theory, turn existing software into “tools” for the AI without those apps needing any special integration. Claude with CCU can handle arbitrary software workflows by visually parsing screens and clicking, which is quite general. In fact, as of its release Claude was state-of-the-art at this kind of computer control: on a new benchmark called OSWorld, which tests AIs on using computers like humans do, Claude scored 14.9% (where most others got ~7.7%, and humans ~70%) (Developing a computer use model \ Anthropic). That’s far from human-level, but a big leap over previous attempts, and it shows Claude is currently the leader in this niche.
The limitations, however, are pretty glaring. Anthropic themselves admit that even at the cutting edge, Claude’s computer use is slow and often error-prone (Developing a computer use model \ Anthropic). Many common actions we take for granted (drag-and-drop, multi-finger gestures, etc.) Claude simply cannot do yet (Developing a computer use model \ Anthropic). Its view of the screen is like a flipbook — discrete snapshots — so it can miss transient pop-ups or animations (Developing a computer use model \ Anthropic). In practice it stumbles in ways both frustrating and funny: during internal testing, Claude sometimes got “distracted” (at one point it stopped coding and randomly browsed photos of Yellowstone National Park) (Developing a computer use model \ Anthropic), or it clicked the wrong thing and, say, closed a recording window capturing its demo (Developing a computer use model \ Anthropic). In short, it behaves like a well-meaning but clumsy intern on your computer. Recognizing this, Anthropic has only rolled out CCU in a limited beta for developers and warns users to closely supervise any actions it takes (Computer use (beta) - Anthropic API) (What are Claude computer use and ChatGPT Operator? - Zapier). Like OpenAI’s agent, Claude’s computer-control ability is a promising breakthrough that feels like sci-fi – an AI operating a normal PC! – but it’s not yet reliable enough to trust with mission-critical operations. It will need significant improvement in speed, accuracy, and supported actions before it’s ready for wide enterprise adoption.
Google DeepMind – Gemini 2.0 and its prototypes: Not to be outdone, Google’s DeepMind (now merged under Google AI) introduced Gemini 2.0 in December 2024 as their next-generation foundation model built explicitly for the “agentic era” (Google introduces Gemini 2.0: A new AI model for the agentic era). CEO Demis Hassabis and team describe Gemini 2.0 as more capable than any prior model, with native multimodal abilities (it can process images and audio, not just text) and built-in tool-use skills (Google introduces Gemini 2.0: A new AI model for the agentic era). In other words, Gemini was architected from the ground up to power AI agents that can see, hear, and act. An early variant called Gemini 2.0 Flash was made available to select developers and testers, with Google planning broader availability in 2025 (Google introduces Gemini 2.0: A new AI model for the agentic era).
Where Google really stands out is in integrating agents into real products and workflows. They have been experimenting with agentic “experiences” across their ecosystem (Google introduces Gemini 2.0: A new AI model for the agentic era). For example, Project Astra is a prototype of a “universal AI assistant” that has been tested on Android phones (Google introduces Gemini 2.0: A new AI model for the agentic era). It uses Gemini 2.0’s multimodal prowess to converse more naturally (even understanding mixed languages or accents) and can use tools like Google Search, Google Maps, and Lens on its own (Google introduces Gemini 2.0: A new AI model for the agentic era). Another effort, Project Mariner, explores an AI agent that helps you accomplish tasks in the browser. Mariner can actually see your browser window (pixels, text, buttons) via an experimental Chrome extension, reason about the content, and then take actions to complete tasks for you (Google introduces Gemini 2.0: A new AI model for the agentic era) – very similar to Operator’s goal. Impressively, in internal tests Project Mariner achieved a state-of-the-art result of 83.5% on the WebVoyager benchmark (which evaluates end-to-end web task completion) using a single Gemini 2.0 agent (Google introduces Gemini 2.0: A new AI model for the agentic era). For context, that means the agent successfully completed over 80% of the real-world web tasks in the test suite – a new record. (It’s worth noting this was in a controlled setup; still, it shows the raw capability is there.)
Despite these advances, Google is also quick to stress it’s early days. Gemini 2.0’s agent skills are mostly in the prototype phase – limited to trusted testers for now (Google introduces Gemini 2.0: A new AI model for the agentic era). The technology works, but not flawlessly. Of Project Mariner, Google said it’s “not always accurate and slow to complete tasks today,” though they expect it to improve rapidly (Google introduces Gemini 2.0: A new AI model for the agentic era). That candid admission echoes what others have found: current agents can do amazing things one moment and then flub a simple step the next. Google’s edge is having an entire ecosystem (Search, Gmail, Docs, Android, etc.) where they can deploy these agents once ready. They’re already talking about integrating Gemini 2.0 agents into products like the Google Assistant app and even AR glasses down the line (Google introduces Gemini 2.0: A new AI model for the agentic era). In sum, Google DeepMind’s Gemini 2.0 is arguably the most ambitious push for agentic AI – aiming to embed smart agents throughout everyday tech. It’s backed by cutting-edge model capability and some encouraging results (e.g. beating benchmarks). But at the start of 2025, these agents are still mostly in the lab or limited previews. Google, like OpenAI and Anthropic, is feeling its way forward carefully, mindful that robust real-world performance and safety will take more fine-tuning.
Industry-Wide AI Agent Developments
It’s not just the tech giants – the whole industry has caught “agent fever.” 2024 saw a Cambrian explosion of startups and enterprise tools all vying to build agentic AI for various niches. Here’s a survey of notable developments:
Startup innovation – multi-agent orchestration and frameworks: A lot of startups jumped into the fray by creating platforms to help developers build and deploy AI agents. One standout is CrewAI, founded in 2024, which bills itself as a “multi-agent platform” for complex workflows (CrewAI). CrewAI’s open-source framework lets you orchestrate multiple AI agents with defined roles working together on a task (Multi-Agent AI Demystified: CrewAI, LangChain, AutoGen & More). For example, one agent might be tasked with brainstorming a plan, another with executing steps, and another with verifying results – all coordinated by CrewAI’s system. This idea of a “team” of AIs (hence the name CrewAI) collaborating is compelling for complex jobs that might benefit from specialization. The approach caught on: CrewAI’s tools reportedly executed 10+ million agent runs per month by late 2024 and are used in some capacity by nearly half of the Fortune 500 (CrewAI Launches Multi-Agentic Platform to Deliver on the Promise of Generative AI for Enterprise | Insight Partners). That’s a remarkable adoption statistic for such a new company (perhaps inflated by easy integration trials, but still). The startup secured $18M in funding led by top-tier VCs and angel investors including Andrew Ng (CrewAI Launches Multi-Agentic Platform to Deliver on the Promise of Generative AI for Enterprise | Insight Partners), underscoring investor belief in multi-agent tech.
CrewAI and similar frameworks (e.g. Microsoft’s open-source AutoGen (AI Agents 2024 Rewind - A Year of Building and Learning), which CrewAI actually integrates with, or LangChain’s agent tooling) are trying to provide the plumbing and scaffolding to make agent development easier. They offer features like self-iteration loops, shared memory, and performance evaluation out of the box (CrewAI Launches Multi-Agentic Platform to Deliver on the Promise of Generative AI for Enterprise | Insight Partners). For instance, CrewAI has built-in mechanisms for agents to critique and refine their own outputs (“self-heal”) and for developers to monitor and measure how well the agents are doing. These are attempts to tackle the reliability problem and avoid the naive “let it run and pray” approach of early experiments. Startups in this space often tout that their agents can handle more complex, multi-step processes than a typical chatbot, and can be integrated into enterprise workflows without needing deep ML expertise.
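To make the orchestration idea concrete, here is a minimal, framework-agnostic sketch of the planner/executor/reviewer pattern these tools wrap. It is not CrewAI’s actual API – the Agent class, run_crew, and the placeholder call_llm are our own names – but it shows the basic loop the frameworks automate: role prompts, context passed forward, and a stopping condition.

```python
# Illustrative sketch only; real frameworks add memory, tool bindings, and monitoring.
from dataclasses import dataclass

def call_llm(prompt: str) -> str:
    """Placeholder: route the prompt to your model provider and return its text."""
    raise NotImplementedError

@dataclass
class Agent:
    role: str          # e.g. "planner", "executor", "reviewer"
    instructions: str

    def run(self, task: str, context: str = "") -> str:
        prompt = (
            f"You are the {self.role}. {self.instructions}\n"
            f"Task: {task}\nContext so far:\n{context}"
        )
        return call_llm(prompt)

def run_crew(task: str, agents: list[Agent], max_rounds: int = 3) -> str:
    """Pass the task through each role in order, feeding outputs forward."""
    context = ""
    for _ in range(max_rounds):
        for agent in agents:
            context = agent.run(task, context)
        if "APPROVED" in context:   # reviewer signals completion
            break
    return context

crew = [
    Agent("planner", "Break the task into concrete steps."),
    Agent("executor", "Carry out the steps and report the results."),
    Agent("reviewer", "Check the results; reply APPROVED if they meet the goal."),
]
# result = run_crew("Draft a competitive analysis of agent frameworks", crew)
```

Real platforms layer shared memory, tool access, and evaluation hooks on top of this skeleton, but the control flow is recognizably the same.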
It’s worth noting that many early adopters of these frameworks use them for relatively structured tasks. Common use cases include things like: summarizing and routing customer support tickets, automating parts of marketing campaigns, or handling form-filling and data entry across enterprise apps. In other words, tasks that are beyond a single API call, but still well-defined enough that an agent with a bit of reasoning can figure them out. By late 2024, the term “agent” became so trendy that countless products slapped the label on – even if under the hood it was just a clever prompt script. We saw an “agentic CRM assistant,” “agent-powered research tools,” and so on. In truth, many of these are incremental improvements over chatbots or RPA (robotic process automation). The difference is an agent implies more autonomy and decision-making. Startups like CrewAI are providing the toolkits to imbue that autonomy in a controlled way.
Enterprise plays – Salesforce’s Agentforce and others: Large enterprise software companies have also jumped in, integrating AI agents into their platforms where they can add value. Salesforce is a prime example with its launch of Agentforce, which it dubs a “digital labor platform” for autonomous agents in business settings (Agentforce 2.0 Announcement - Salesforce) (Salesforce Unveils Agentforce–What AI Was Meant to Be). Agentforce is essentially Salesforce’s umbrella for AI agents that work across their ecosystem (Sales Cloud, Service Cloud, Slack, etc.), operating on enterprise data. The promise is to have trustworthy agents that can, say, handle customer inquiries, manage sales leads, or execute marketing campaigns 24/7, all fully integrated with a company’s CRM and databases (Agentforce: Create Powerful AI Agents | Salesforce US). Salesforce is pushing this hard – at Dreamforce 2024 they officially launched Agentforce with major fanfare, even showcasing “10,000 agents” running live (in a controlled demo environment) (Agentforce Unleashed: News and Stories that Shaped 2024 - Salesforce). They announced a Partner Network (essentially an app store for agent plugins) and new lifecycle management tools for testing and monitoring agents in production (Agentforce Unleashed: News and Stories that Shaped 2024 - Salesforce).
A key focus for enterprise agents is data integration and security. Unlike a general web agent, an enterprise agent needs to safely connect to proprietary systems and follow business rules. Salesforce leverages its existing platform for this – Agentforce agents can be configured to have certain “skills” (pre-built workflows) and access controls to different data sources (Agentforce Unleashed: News and Stories that Shaped 2024 - Salesforce). By late 2024, Agentforce 2.0 introduced a library of these pre-built skills and the ability to deploy agents into Slack channels (Agentforce Unleashed: News and Stories that Shaped 2024 - Salesforce), which makes them more immediately useful (e.g. a sales rep can instruct a deal management agent right from Slack). They also tout an Atlas Reasoning Engine under the hood that supposedly gives the agents more advanced planning and reasoning with enterprise data (Agentforce Unleashed: News and Stories that Shaped 2024 - Salesforce). (The marketing is a bit vague on how it’s different from a normal LLM, but it’s pitched as proprietary IP that makes Agentforce agents “smarter” on business logic.)
The traction in enterprise is real: according to Salesforce, companies in every industry – from staffing firms like Adecco to retailers like Saks – have deployed Agentforce agents in some capacity, building what Salesforce calls a “limitless workforce” of digital workers (Agentforce Unleashed: News and Stories that Shaped 2024 - Salesforce). A Salesforce exec claimed “no other company comes close to offering this complete AI solution for enterprises” (Agentforce Unleashed: News and Stories that Shaped 2024 - Salesforce). That might be self-serving, but it’s true that Salesforce has a huge advantage: direct integration into the software businesses already use. Microsoft, of course, is embedding AI (Copilots) into Office and Dynamics, but those are more human-in-the-loop assistants. Agentforce is about more autonomous operation. Other enterprise software vendors – Oracle, ServiceNow, HubSpot, etc. – are all exploring similar concepts, often partnering with OpenAI or Anthropic to power the LLM side while they handle the enterprise integration side.
Beyond the well-funded startups and big incumbents, there’s a long tail of contributions: open-source projects (many spawned from the GPT-4 tooling hype of early 2023) and independent researchers experimenting with agent architectures. The community-driven Auto-GPT and BabyAGI projects (which we’ll discuss more later) popularized the concept of autonomous agents and inspired many variants. By 2024, more polished successors like GPT-Engineer (for automating programming tasks) and MetaGPT (for multi-agent roleplay) emerged on GitHub, often attracting thousands of stars. Developer tool companies like Zapier also jumped in – Zapier launched Zapier Agents, letting users orchestrate AI agents across their web app integrations without coding (What are Claude computer use and ChatGPT Operator?). This essentially marries the no-code automation of Zapier with AI decision-making, so non-engineers can experiment with agent workflows (e.g. an agent that watches incoming emails, summarizes them, and triggers tasks in other apps).
In summary, the industry is swarming with AI agent initiatives. Startups provide the infrastructure to build agents; enterprises embed agents into their products and processes; and a vibrant open-source community is sharing new agent ideas weekly. This flurry of activity underscores a belief that agentic AI could be the next big platform. But it also means a lot of noise and hype. For every real breakthrough, there are dozens of “me-too” implementations or conceptual demos. As a startup founder or tech exec, it’s easy to get overwhelmed – or worse, overhyped – by these developments. The prudent approach is to watch these innovations closely, maybe pilot some that align with your needs, but also remain cognizant that many are early-stage. The sheer volume of agent projects doesn’t automatically equate to maturity of the tech. In fact, as we explore next, most AI agents in 2025 are not yet ready for prime time despite the excitement.
Why AI Agents Are Not Ready for Prime Time
With all the announcements and demos, one might think fully autonomous AI helpers are already here, ready to hire. The reality: today’s agents are very much prototypes. Even the best offerings from OpenAI and Anthropic carry “beta” labels and usage caveats. Startups integrating agents often find they need significant hand-holding and fail-safes. Here are some reasons why agents aren’t yet ready for wide, unsupervised deployment:
Research preview status: Both OpenAI’s Operator and Anthropic’s Claude-with-computer-use are explicitly in limited preview programs, not general release. OpenAI only enabled Operator for its highest-paying customers as a “limited ‘research preview’” (OpenAI debuts Operator, an AI agent with ecommerce applications), and it’s understood to be an experimental feature. Anthropic’s GUI-control capability is a beta feature accessible via API and not turned on by default for most Claude users (What are Claude computer use and ChatGPT Operator? - Zapier). In other words, the companies themselves acknowledge these agents are works in progress. They’re gathering feedback, watching for failure modes, and iterating – which is exactly what a research project does. For end users, it means you can’t yet rely on these agents the way you would a finished software product.
Reliability issues: Current agents simply aren’t consistently reliable. They may work impressively one minute and fail the next. We’ve all seen how large language models can “hallucinate” incorrect information; now imagine that tendency in an agent that’s clicking around your CRM or ordering items for you. Not good. In testing, Operator has been observed making mistakes like misreading web content or getting stuck, requiring user intervention (OpenAI’s big, new Operator AI already has problems | Digital Trends). Anthropic encountered plenty of goofy errors with Claude, from accidental clicks to off-task browsing (Developing a computer use model \ Anthropic). These glitches are not rare edge cases; they happen regularly enough that a human overseer is needed. No serious business will let an AI agent run loose with customer data or transactions until these error rates come way down.
Limited scope and access: The most advanced agents also have limited availability and scope by design. Operator is U.S.-only at launch and gated behind a hefty paywall (OpenAI’s big, new Operator AI already has problems | Digital Trends), which drastically limits real-world usage. (Many European users loudly complained about being excluded (OpenAI’s big, new Operator AI already has problems | Digital Trends).) This also means less diverse feedback to improve the system. Likewise, Claude’s computer-use is constrained – it doesn’t have internet access in that mode for safety reasons, and it can only do things within the sandbox it’s given (Developing a computer use model \ Anthropic). These constraints are sensible to manage risk, but they highlight that the agents aren’t robust enough to be “turned loose” broadly. They’re being baby-stepped into the world.
Need for human fallback: Today’s agents typically need a human in the loop for when (not if) they get confused or face something unexpected. OpenAI built a feature into Operator allowing the user to take over mid-task or correct its course (OpenAI debuts Operator, an AI agent with ecommerce applications). They expect you’ll have to occasionally steer, which tells you the autonomy is fragile. Anthropic’s docs explicitly warn to carefully review any actions Claude proposes when using the computer-control feature (Computer use (beta) - Anthropic API). Essentially, these agents behave like new trainees: they can carry out procedures, but you’d better supervise until you’re confident they won’t wreck something.
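As a concrete illustration of that “supervise the trainee” posture, here is a toy approval gate of the kind teams wrap around agent actions today. It is a sketch under our own assumptions – the action format, the RISKY_ACTIONS set, and execute_action are hypothetical – not how Operator or Claude implement take-over.

```python
# Hold risky agent-proposed actions for explicit human approval before execution.
RISKY_ACTIONS = {"purchase", "send_email", "delete", "submit_form"}   # hypothetical policy

def execute_action(action: dict) -> str:
    """Placeholder for the code that actually performs the action."""
    return f"executed {action['type']}"

def supervised_execute(action: dict) -> str:
    if action["type"] in RISKY_ACTIONS:
        print(f"Agent proposes: {action}")
        if input("Approve? [y/N] ").strip().lower() != "y":
            return "skipped: human declined"
    return execute_action(action)

# supervised_execute({"type": "purchase", "item": "flight DUB->SEA", "price_usd": 640})
```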
Performance vs. demo expectations: It’s a tale as old as tech: the demo looks great, the real product… not so much. AI agents are no exception. The carefully scripted showcases at events (or tightly edited YouTube videos) gloss over latency issues, errors, and do-overs. Early users of Operator noted it was significantly slower than the snappy demo implied, taking a long time to load pages and complete tasks (OpenAI’s big, new Operator AI already has problems | Digital Trends). And this is running on OpenAI’s servers with presumably optimal conditions. In less ideal environments (or with many concurrent users), performance could suffer more. Until agents can perform quickly and correctly under real-world conditions, they’re not ready for heavy use.
In short, there is a big gap between the current agent prototypes and production-ready products. It’s reminiscent of self-driving cars – amazing progress has been made, yet you still can’t buy a truly self-driving car in 2025 because handling the infinite corner cases is incredibly hard. AI agents are at a similar juncture. They can impress in controlled scenarios. They will be very impactful eventually. But today, anyone saying these agents are plug-and-play replacements for human workers is either misinformed or overselling. For startups, the takeaway is to temper your expectations and avoid overcommitting to unproven tech. By all means, experiment and pilot where it makes sense, but keep a human in the loop and have fallback processes. We’re not at the “set it and forget it” stage for autonomous AI agents just yet.
Challenges for Builders Today
If you’re a developer, ML engineer, or founder trying to build with AI agents, you’re probably already aware of some of these challenges firsthand. The truth is, building effective AI agents is hard – perhaps harder than building a typical AI application or standard software. Here are some key challenges facing those brave enough to pioneer in this space:
Unsolved research questions: Many fundamentals of agentic AI are still open research problems. There isn’t a clear blueprint for “how to build a generally intelligent agent that can reliably perform arbitrary tasks.” A lot of the progress so far has come from trial and error and tinkering in the lab. As Anthropic’s researchers described their journey to develop computer-use for Claude, “it took a great deal of trial and error to get there… constant iteration and repeated visits back to the drawing board” (Developing a computer use model \ Anthropic). That sentiment applies to virtually all aspects of agent development in 2025. We’re still figuring out basic questions: How should an agent break down a complex goal into subtasks on the fly? How can it recover when it’s stuck? How do we make it ask for help when needed? What’s the best way for multiple agents to coordinate without chatting in circles? Academia and industry labs are actively exploring these, but clear answers (and best practices) are scarce. For builders, this means you often have to invent new methods yourself or at least stitch together insights from various research papers. It’s exciting, frontier work – but it’s also like being lost without a map.
Tooling and frameworks lag behind the frontier: There’s a saying that the cutting edge is not user-friendly. The top AI labs have internal tools and infrastructure that outsiders simply don’t. For example, OpenAI trained specialized models (GPT-4 variants) with reinforcement learning specifically for agent tasks like browsing and tool use (OpenAI's New Research-Focused AI Agent and Altman's Open Source Revelation). Those models and training pipelines are proprietary. The average startup or dev team has to rely on open APIs and maybe open-source models that are far less tuned for agent behavior. While frameworks like CrewAI, LangChain, AutoGen, etc. provide helpful abstractions, they are still relatively low-level. Many of these frameworks focus on basics like connecting an LLM to a browser or setting up agent chat loops. They don’t yet solve the really hard stuff (e.g. guaranteeing logical consistency, or deeply integrating learning from human feedback). As a result, a lot of agent-building is DIY hacking on top of fragile APIs. The bleeding-edge capabilities – say, an agent that can use vision, memory, and reasoning seamlessly – often require stitching together multiple services and codebases (for vision, for language, for actions). It’s doable, but time-consuming and error-prone. The tooling will improve, but right now it’s a patchwork.
Integration is cumbersome: One practical challenge reported by enterprise teams is integrating these AI agents into existing systems. A survey by MuleSoft found 95% of IT leaders struggle to integrate data across the many apps in their stack (93% of IT leaders see value in AI agents but struggle to deliver, Salesforce finds | VentureBeat), and that integration challenge “hinders companies from fully realizing [AI agents’] potential” (93% of IT leaders see value in AI agents but struggle to deliver, Salesforce finds | VentureBeat). An agent might need to pull info from an ERP, a CRM, and an email server, then take action in a legacy internal system – that’s a lot of plumbing. Each integration point is a potential failure or security risk. Until robust connectors and permissions management for agents are in place, builders end up writing custom glue code for every app. It’s not the sexy part of AI, but it’s necessary for an agent to actually be useful in a business workflow. So if you’re building an agent solution, expect that much of your effort will go into these nitty-gritty integration tasks (and dealing with reluctant IT departments who are nervous about an AI agent poking around their systems).
Shortage of expertise: Given how new this field is, there are very few “expert” AI agent builders out there. In early 2024, IBM estimated an AI skills gap of around 50% – meaning half of the AI jobs could not be filled with existing talent (AI Skills Gap - IBM). Now consider that AI agents require a mix of skills (LLM prompt design, software integration, ML training, maybe even UX for human hand-off) that almost no one had on their resume a year ago. If you’re hiring, you won’t find people with 5 years of “autonomous agent” experience – they didn’t exist 5 years ago! This means teams must learn as they go, and there’s a lot of reinvention of the wheel. It also means the few people who have built something like an AutoGPT or a multi-agent system are in absurdly high demand. For a startup, attracting and retaining that kind of talent is tough when big companies are also dangling interesting projects (and big salaries) in front of them. All told, there’s a human bottleneck in expertise that slows down development.
Trial-and-error development: Because of the above points (nascent research, immature tools, lack of prior art), developing an agent is often an exercise in iterative tinkering. You might set up an agent and find it works on 3 test cases and fails on the 4th – why? Perhaps the prompt isn’t robust, or the agent missed a tool invocation. So you tweak the prompt or logic, try again, and repeat… Many teams report that building agents feels more like grad school research than traditional software engineering. It can be unpredictable and time-consuming. This trial-and-error process makes it hard to estimate timelines and hard to debug issues. Unlike a normal program where you can step through code, an AI agent’s decision process is partly a black box in the model’s weights. That’s improving with better observability tools, but it’s still not easy to deterministically reproduce why an agent did something dumb. Builders need patience and a tolerance for ambiguity – and a plan for continuous testing and improvement. It’s very much an iterative development process.
Managing stakeholder expectations: Finally, a non-technical but critical challenge: dealing with the hype in the room. Business leaders, customers, or investors might have outsized expectations for what AI agents can do. We’ve all seen the sci-fi movie demos and breathless media articles. When you propose building an AI agent for, say, customer service automation, your CEO might imagine firing half the call center next quarter thanks to AI. You, the builder, know it’s not that simple – the agent will need supervision and will only handle some scenarios at first. Bridging this expectation gap is key. A recent survey found 93% of IT leaders plan to implement AI agents in the next two years (93% of IT leaders see value in AI agents but struggle to deliver, Salesforce finds | VentureBeat) (wow!), yet 29% of projects missed their deadlines in 2024 and many hit integration roadblocks (93% of IT leaders see value in AI agents but struggle to deliver, Salesforce finds | VentureBeat). This suggests a lot of over-promising is happening. As a builder, you’ll often have to educate stakeholders on what’s realistic. That might mean doing pilot phases, showing where human oversight is still needed, and frankly admitting what the agent cannot do. It’s better to under-promise and over-deliver – or else you risk blowback if the fancy “AI agent” fails to meet the inflated expectations. Given how much hype surrounds this tech, managing that hype internally and externally becomes part of the job.
In summary, building AI agents in 2025 is a frontier endeavor with plenty of potholes. It combines unresolved research puzzles with practical engineering headaches. Those who venture into this area must be prepared for a lot of experimentation and learning. The good news is, if you do navigate these challenges, you’ll be in a small group with valuable expertise. Just go in with eyes open: despite what vendor glossy slides may imply, creating a useful, reliable AI agent system is not as simple as plugging into an API for “agent magic.” It takes real sweat and probably a few sleepless nights watching your agent make the same mistake for the tenth time. But hey, if that’s your cup of tea, the field is wide open!
Technical Challenges in AI Agents
Let’s drill down on some of the technical hurdles that make agent development uniquely challenging. These are the kind of deep issues that researchers and advanced development teams are grappling with:
Lack of clear evaluation metrics: In most AI domains, we have well-defined benchmarks – accuracy for classification, BLEU scores for translation, etc. For AI agents, what do we measure? Success isn’t just getting a single answer right; it could be completing a multi-step task or optimizing some outcome. The community is still figuring out how to evaluate agents systematically. A few benchmarks emerged in 2024 (like WebArena for web tasks and CORE for reasoning tasks (AI Agents 2024 Rewind - A Year of Building and Learning)), but they show how far we have to go. For example, one study found that general-purpose autonomous agents achieved only about a 14% success rate on complex multi-step web tasks (WebArena benchmark), whereas humans achieved 78% on the same tasks (AI Agents 2024 Rewind - A Year of Building and Learning). That’s a massive gap. It tells us current agents fail most of the time on open-ended tasks that require planning and adapting. We need better metrics to pinpoint why they fail and where to focus improvements. Is it because they don’t understand long-term consequences? Because they can’t recover from errors? Because they get stuck in loops? Likely all of the above. There’s also the issue of hallucinations – how do you quantify the “truthfulness” of an agent executing tasks? Without clear metrics, it’s hard to even know if change X to the agent made it better or worse. The field is working on new evaluation methods (for instance, having agents simulate users to test each other), but as of 2025, measuring an agent’s performance often involves a lot of manual case-by-case testing and qualitative judgment. This makes progress slower and comparisons murky.
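Absent standard benchmarks, many teams fall back on something like the harness sketched below: run the agent over a suite of scripted tasks, check an end-state predicate, and bucket failures by cause. The names (run_agent, the per-task check function) are illustrative assumptions, not an established evaluation API.

```python
# Task-level evaluation sketch: success rate plus coarse failure buckets.
from collections import Counter

def run_agent(task: dict) -> dict:
    """Placeholder: run your agent on the task and return a trace with the final state."""
    raise NotImplementedError

def evaluate(tasks: list[dict]) -> None:
    """tasks: list of {"prompt": ..., "check": callable} where check inspects the trace."""
    outcomes = Counter()
    for task in tasks:
        try:
            trace = run_agent(task)
            ok = task["check"](trace)          # e.g. "is the booking actually in the cart?"
            outcomes["success" if ok else "wrong_outcome"] += 1
        except TimeoutError:
            outcomes["stuck_or_loop"] += 1
        except Exception:
            outcomes["crashed"] += 1
    total = sum(outcomes.values())
    for bucket, n in outcomes.most_common():
        print(f"{bucket}: {n}/{total} ({100*n/total:.0f}%)")
```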
Few robust frameworks for learning and improvement: When an agent makes mistakes, how do we teach it to do better? In other AI contexts, we’d fine-tune the model with more data or use reinforcement learning (RL) if there’s a reward signal. For agents, applying RL or other learning techniques is not straightforward. Some advanced efforts have used RL with human feedback to train agent behaviors – OpenAI hinted that Deep Research’s abilities came from reinforcement learning on browser tasks (OpenAI's New Research-Focused AI Agent and Altman's Open Source Revelation). But that required a custom setup (essentially treating each multi-step task as a game to score). Most developers don’t have the capacity to run such RL training pipelines. Right now, much of the “learning” an agent does is implicit – it’s encoded in the base LLM (which was probably trained on a lot of text about how to do things) or in prompt strategies that developers devise. There is a dearth of off-the-shelf solutions for things like rewarding an agent for completing a task correctly or having it learn from its own failures. Imagine an agent tries a task, fails, and you want it to self-improve – that’s a hard problem! Some research projects enable agents to critique themselves or to run many simulations and pick the best result (a form of trial-and-error learning). But these are not yet packaged in easy libraries. Reinforcement learning itself is tricky to apply because the action space (the combinatorics of all possible sequences of actions) is enormous for complex tasks, and sparse rewards (only succeeding at the very end) make learning slow. In short, the cutting-edge techniques that likely are used internally at places like OpenAI (RLHF, expert fine-tuning, etc.) haven’t trickled down broadly. As a result, most agent builders rely on heuristic improvements (fine-tuning prompts, adding guardrails) rather than the agent truly learning from experience in a rigorous way.
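The “run several attempts and keep the best” idea mentioned above can be sketched in a few lines. It is a cheap stand-in for real learning, not a substitute for it; attempt_task and score are placeholders you would have to supply (a unit-test runner, a rubric, or a judge model).

```python
# Best-of-n sampling sketch: generate several rollouts, keep the highest-scoring one.
def attempt_task(task: str, seed: int) -> str:
    """Placeholder: one full agent rollout; vary temperature/seed per attempt."""
    raise NotImplementedError

def score(task: str, result: str) -> float:
    """Placeholder: higher is better (tests passed, checklist items met, judge rating)."""
    raise NotImplementedError

def best_of_n(task: str, n: int = 5) -> str:
    candidates = [attempt_task(task, seed=i) for i in range(n)]
    return max(candidates, key=lambda r: score(task, r))
```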
Dependence on high-quality, task-specific data: One emerging insight is that to make an AI agent excel in a particular domain, you need lots of examples or feedback from that domain. For instance, if you want an agent that advises on financial planning, you’d want it trained or fine-tuned on real financial cases and perhaps feedback from financial advisors. The era of “just throw a general model at it” is ending – specialization is key. However, getting high-quality labeled data or demonstrations for complex tasks is very challenging. Often, only human experts know what the correct steps are (e.g. a doctor’s process for diagnosing, or an enterprise’s internal procedure for handling a refund). Having those experts label data or provide examples is expensive and slow. There’s also the issue of privacy if that data is sensitive. Companies like Scale AI have built big businesses on providing RLHF data precisely because most organizations can’t do it alone. In fact, demand for reinforcement learning with human feedback data skyrocketed recently – by late 2023 Scale’s run-rate revenue for such data was $750M, indicating how many AI companies are desperate for curated feedback data (Alignment-as-a-Service: Scale AI vs. the new guys). As one observer put it, “We don’t have good ways to get expert data for the domains we care about… and Scale has been the place curating most of that” (Alignment-as-a-Service: Scale AI vs. the new guys). This is a bottleneck: if you can’t easily get your hands on quality data or feedback for your agent, you’ll hit a ceiling in performance. Simulator environments or synthetic data can help in some cases (for example, gaming or certain repetitive tasks), but for real-world jobs, human expertise is often required to teach the AI. Thus, ironically, scaling up truly powerful agents may require a lot of human effort behind the scenes (to provide training signals), which is a constraint both technical and economic.
Safety and alignment concerns: Though not purely technical, it’s worth noting that pushing agents to be more autonomous raises more safety questions. An agent that can act in the world (even the digital world) can potentially do harm if misaligned or misused. This means developers have to build in safety checks: limiting what the agent can do, monitoring its output for dangerous decisions, etc. Anthropic, for example, kept Claude’s computer-use mode at a lower capability partly to ensure it didn’t inadvertently do something harmful (it was not given internet access during training to prevent any wild outcomes) (Developing a computer use model \ Anthropic). There’s ongoing research on how to make agents that obey constraints and understand when to stop or ask for help. But it’s an added layer of complexity on an already complex task. Any robust agent platform has to consider worst-case scenarios – what if the agent is prompted to do something harmful or gets manipulated via a prompt injection attack? (Yes, agents can be attacked too – e.g. a malicious website could be designed to confuse an AI browsing agent.) All this to say, building safe agents often requires additional engineering (sandboxing, red-teaming, etc.) which can conflict with performance. It’s a trade-off: the more freedom you give an agent, the more you have to worry about safety. Right now, the default is to play it safe, which also limits capability.
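In practice, “playing it safe” often starts with something as blunt as an action allowlist plus domain checks, layered under sandboxed browsers or VMs, rate limits, and audit logs. The sketch below is illustrative only – the action schema and the allowed domains are hypothetical, not any vendor’s policy engine.

```python
# Toy guardrail: every proposed action must pass a policy check before execution.
from urllib.parse import urlparse

ALLOWED_ACTIONS = {"click", "type", "read", "navigate"}
ALLOWED_DOMAINS = {"internal.example.com", "docs.example.com"}   # hypothetical allowlist

def is_permitted(action: dict) -> bool:
    if action["type"] not in ALLOWED_ACTIONS:
        return False
    if action["type"] == "navigate":
        host = urlparse(action["url"]).hostname or ""
        return host in ALLOWED_DOMAINS
    return True

def guarded_step(action: dict, execute) -> str:
    """`execute` is whatever function actually performs the action in the sandbox."""
    if not is_permitted(action):
        return f"blocked: {action['type']} not permitted by policy"
    return execute(action)
```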
The evolving “data moat”: In the AI business, people talk about moats – what competitive advantage do you have that others can’t easily replicate? In the past, simply having a large dataset could be a moat. But with LLMs like GPT-4 trained on basically the entire internet, just having raw data isn’t enough. The moat is moving to proprietary labeled data and fine-tuned models. It’s been said that in this new era, “it’s not the LLM itself that sets businesses apart but rather the data they hold.” (Why Data is the Moat and LLMs a Commodity in the Post-ChatGPT World | by Amit Yadav | FabricHQ). For agents, this means the winners might be those who accumulate unique feedback logs, user interaction data, and domain-specific tuning that make their agents better over time. If Company A’s sales-support agent has been trained on millions of real chat transcripts and refined by senior sales reps’ feedback, it will likely outperform Company B’s agent that’s just using a generic model – and A’s data trove becomes a moat. We’re already seeing companies strategize about this: how to gather more and better training data from users in a virtuous cycle. From a technical perspective, this changes how we view data. It’s not just about big raw datasets anymore; it’s about curated, high-quality data for the specific tasks your agent does, and about leveraging real-world interactions (with appropriate privacy) to continuously improve the agent. This is hard to do, but if you succeed, you end up with an agent that’s uniquely skilled for your domain, and that others can’t easily copy because they don’t have your data. Startups building agents are wise to think early about data strategy – it could determine whether you have a long-term edge or you’re just riding on someone else’s foundation model until everyone else catches up.
In summary, the technical landscape for AI agents is full of thorny problems. We need better ways to measure success, new algorithms to help agents learn from mistakes, access to richer training data, and careful safety mechanisms – all of which are active areas of R&D. The good news is that progress is happening on each front: researchers are publishing new benchmarks, companies like OpenAI/Anthropic are sharing (some) lessons from their experiments, and tooling around evaluation and feedback is slowly improving. But these challenges mean that building an AI agent is not as straightforward as hooking up an API to GPT-4. We’re asking these systems to do things far more complex than answer questions – and we lack a decade of prior research on many of those complexities. It’s an exciting time for those who love solving hard problems (it doesn’t get much harder in AI than this!), but it’s also a reality check against any premature declarations that “AI agents have arrived.” There’s a lot of heavy lifting still to be done to make these agents truly robust and reliable.
Multi-Agent AI: Fascinating but Overhyped
One particularly buzzy concept is multi-agent AI – the idea of multiple AI agents interacting and collaborating to tackle tasks, possibly even forming an “AI society.” This vision captured people’s imaginations, but we need to dissect the reality from the hype. Are two (or ten) AIs better than one? So far, the answer has often been no (or “not yet”).
The early hype – AutoGPT and BabyAGI: In spring 2023, shortly after GPT-4’s release, hobbyist projects like AutoGPT and BabyAGI went viral. They set up one GPT model to autonomously generate tasks and another (or the same) to execute them in a loop, aiming to accomplish a high-level goal with minimal human input (Unraveling Baby AGI: Hype vs. Reality). The internet buzzed with examples – an “AI CEO” managing a fake company, an agent trying to invent a new recipe, etc. These projects were exciting as proofs of concept, but they were also wildly overhyped. In fact, even at the time, many observers questioned whether AutoGPT/BabyAGI were more than a gimmick (Unraveling Baby AGI: Hype vs. Reality). They often failed in practice, getting stuck in loops or producing nonsense plans. The creators themselves labeled them as experiments. Nonetheless, the idea of AI agents that could create and complete their own to-do lists without human oversight spurred people’s imagination (and a fair share of exaggerated headlines). It’s fair to say those early systems did mark the beginning of something new – an era of exploring autonomous AI – but they were far from any kind of general AI overlords. If anything, they demonstrated how quickly an unsupervised agent could go off the rails or hit a wall.
Modern efforts – more structure, but similar hurdles: Fast forward to 2024, and we have more structured multi-agent frameworks (like the aforementioned CrewAI and AutoGen) that allow defined roles and communication protocols between agents. Microsoft’s AutoGen, for example, provides a framework where you might have a “Planner” agent and an “Executor” agent (both LLM-powered) work together, or have agents with different specialties chat to solve a problem (AI Agents 2024 Rewind - A Year of Building and Learning). The difference now is that developers can design the interaction rules: who should do what, how they converse, when to stop, etc., rather than just letting them run in an uncontrolled loop. This has improved results on certain tasks – for instance, one can set up a coding agent and a testing agent to iteratively improve code, which can sometimes outperform a single agent prompting itself. There are also specific domains where multi-agent is naturally useful, like simulations (agents role-playing as characters) or negotiation scenarios where having distinct “personalities” or objectives for agents makes sense.
However, despite these improvements, multi-agent systems have yet to show a clear advantage in most real-world tasks. Often, a single sufficiently powerful agent with good prompt engineering can do as well as, or better than, a team of smaller agents passing notes to each other. Coordination is hard – not just for humans, but for AIs too! Multiple agents can introduce new failure modes: they can miscommunicate, get stuck in dialogue loops, or amplify each other’s errors. For example, two agents might end up chatting back and forth endlessly (“Should we try X?” – “Sure, go ahead.” – “I did X, what next?” – “Maybe Y?” – “Okay, did Y.” – “…”) without actually accomplishing the goal efficiently. If they share the same underlying model (e.g., both based on GPT-4), there’s also the question of whether we’re really gaining anything or just splitting one brain into parts.
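A bare-bones version of a two-agent exchange shows why explicit stopping rules matter: every round costs two model calls, and without a turn cap and a termination signal the agents can circle indefinitely. The names here (call_llm, the DONE token) are our own placeholders, not any framework’s API.

```python
# Minimal two-agent chat loop with a turn cap and a termination token.
def call_llm(system: str, history: list[str]) -> str:
    """Placeholder: send the system prompt plus transcript to your model and return its reply."""
    raise NotImplementedError

def two_agent_chat(goal: str, max_turns: int = 8) -> list[str]:
    history = [f"Goal: {goal}"]
    roles = [
        ("Planner", "Propose the next concrete step, or say DONE if the goal is met."),
        ("Executor", "Carry out the proposed step and report the result briefly."),
    ]
    for turn in range(max_turns):
        name, system = roles[turn % 2]
        reply = call_llm(system, history)
        history.append(f"{name}: {reply}")
        if "DONE" in reply:
            break
    return history   # hitting max_turns without DONE is the loop failure mode in action
```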
The strongest AI agent systems we’ve seen so far tend to be single-agent with tool use rather than multi-agent. OpenAI’s Deep Research is one agent leveraging tools sequentially (OpenAI's New Research-Focused AI Agent and Altman's Open Source Revelation). Google’s Project Mariner achieved its high web task score using a single agent setup (they explicitly noted it was a single-agent run) (Google introduces Gemini 2.0: A new AI model for the agentic era). These successes suggest that for now, giving one agent more capability (more context, more tools, more training) yields better returns than splitting tasks among agents. Multi-agent setups can sometimes help with modularity or parallelism, but they introduce overhead in communication.
Multi-agent AI is certainly fascinating from a research perspective – it hints at how we might build complex AI ecosystems or model social intelligence. And it’s likely that as AI scales, we will have systems of agents doing different things (much like microservices in software). But the timeline for that is unclear. 2025 will surely see new multi-agent experiments (Microsoft’s researchers are already working on “declarative multi-agent teams” and a gallery of agent patterns (AI Agents 2024 Rewind - A Year of Building and Learning)). Yet for practical purposes, anyone expecting that chaining a bunch of GPT-4s together will yield a super-smart committee is likely to be disappointed. In many cases, you’re better off wringing more out of one model than orchestrating several, at least with current tech.
Why is multi-agent likely to underdeliver in 2025? A few reasons:
Immature coordination strategies: We lack robust algorithms for agent coordination. Human teams have managers, protocols, and years of evolved practices to work together effectively. AIs have none of that out of the box. We’re basically inventing simple coordination rules (who speaks when, how to resolve conflicts, etc.) and those are still primitive.
Model limitations show up multiplied: If one language model has a 10% chance of going off track in a given task, two interacting might compound errors or confuse each other, leading to even higher failure rates unless carefully managed. It’s like two slightly drunk people trying to support each other – it doesn’t always stabilize things! (The quick calculation after this list makes the compounding concrete.)
Overhead vs benefit: Managing multiple agents has computational and latency overhead (multiple inference calls, passing messages, etc.). For a lot of tasks, this overhead outweighs any potential gain in intelligence, especially since current top-tier models are quite capable individually.
Narrow vs general improvements: Multi-agent might shine in narrow, structured scenarios (e.g., agent debates to improve factual accuracy, or a pair of agents that cross-verify each other’s answers). But for broad, open-ended tasks, it doesn’t automatically solve the hard parts (which are understanding the problem and coming up with a correct plan – something a single agent still has to do within each role).
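The quick calculation referenced above, under the simplifying assumption that each step in the exchange goes off track independently with probability p:

```python
# If each step slips with probability p, a k-step exchange stays on track with
# probability (1 - p) ** k. Independence is a simplification; the trend is the point.
def chance_all_steps_ok(p: float, k: int) -> float:
    return (1 - p) ** k

for k in (1, 5, 10, 20):
    print(k, round(chance_all_steps_ok(0.10, k), 2))
# 1: 0.9, 5: 0.59, 10: 0.35, 20: 0.12 -- ten exchanges at a 10% slip rate
# already fail roughly two-thirds of the time.
```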
All that said, research in multi-agent systems is still valuable. It could be that by late 2025 or 2026, we discover effective patterns and perhaps even have models specialized as “team players.” One can imagine a future where your AI secretary, AI analyst, and AI engineer agents regularly talk to each other to get complex jobs done for you. But in 2025, if someone claims they have a product where a bunch of autonomous agents are collaborating seamlessly to run a business, take that with a large grain of salt. More likely, there’s a Wizard-of-Oz human or a lot of rigid scripting behind the scenes. Multi-agent AI today is mostly an overhyped idea that hasn’t yet met its lofty expectations. It captures our imagination (because it mirrors how we humans solve problems in groups), but the engineering reality is that it multiplies the difficulties before it multiplies the intelligence.
For startups, the implication is: don’t assume that adding more agents will linearly add more capability. Often, you’re better off focusing on one agent and making it as smart and reliable as possible, rather than orchestrating many. Use multiple agents only if there’s a clear reason (like distinct expertise areas or the need for simultaneous action) and you have a way to manage their interactions. Otherwise, you might just be complicating your system without clear gain.
Conclusion and Takeaways for Startups
AI agents in 2025 occupy an interesting duality: they are full of potential and yet fraught with challenges. On one hand, it’s easy to see the future where agentic AI is ubiquitous – handling routine online tasks, automating business processes, serving as personal digital assistants that truly relieve us of drudgery. The developments over the last year inch us in that direction and validate that these systems can work to an extent. On the other hand, the current reality is that making AI agents robust and trustworthy is a long, hard road. We’re witnessing the very first steps on that road, and there will be many twists (and a few U-turns) before agents realize their full promise.
Key takeaways for startups and innovators:
Stay grounded and avoid the hype trap. It’s tempting to brand every product an “AI agent” and to chase the trend with lofty promises. Resist that urge unless your tech truly delivers autonomous capabilities. Savvy customers are getting skeptical of hype. It’s better to under-promise and over-deliver. If you incorporate an agent, be clear about its limitations. Remember that an AI agent that actually solves a real problem – even if it’s only semi-autonomous – is far more valuable than one that promises magic but requires constant babysitting.
Focus on real use cases where an agent adds value. Ask yourself: does this problem genuinely need an autonomous agent solution, or would a simpler AI or software do? In many cases, a well-designed chatbot or a straightforward automation script might handle 80% of the need with far less complexity. Use an AI agent approach when there’s clear justification: e.g., the task is too complex for hard-coding but can be learned; it involves coordinating multiple steps or tools dynamically; or it requires a level of decision-making that static software lacks. Ground your projects in business value – e.g., reduce support ticket handling time by 50%, or automate data analysis that saves analysts 10 hours a week. Those concrete wins will matter more than having the flashiest tech. As the saying goes, don’t build a spaceship if you just need to cross the river.
Leverage existing tools, but know their limits. By all means, use frameworks like CrewAI or Agentforce if they jumpstart your development – they can handle the “plumbing” (connecting to browsers, APIs, logging, etc.) so you can focus on higher-level problems. Just recognize that most of these tools solve the easy parts (connecting an LLM to an environment, providing basic memory, etc.), whereas you still have to solve the hard parts (defining good reward signals, ensuring reliability, crafting domain-specific knowledge). Off-the-shelf agent solutions tend to address surface-level needs like observability, UI integration, and orchestration. Those are necessary, but not sufficient for success. The real differentiation will come from how well you handle evaluation, fine-tuning, and iterative improvement of the agent’s decisions. Be prepared to build custom components or logic on top of the frameworks – don’t expect an out-of-the-box agent platform to magically know how to do your job tasks perfectly.
Invest in data and feedback loops. As highlighted, the best agent in the long run will be the one that has learned the most from domain-specific data. If you’re deploying an agent, set up mechanisms to collect feedback: where did it succeed, where did it fail, and why? Even if you can’t fully retrain it now, log everything. Those logs might become training data to fine-tune a future version. If possible, involve domain experts to review agent decisions periodically and label/correct them. That could be as simple as having customer support reps rate the AI assistant’s responses, or salespeople tweaking the agent-generated emails before sending. Those human-in-the-loop moments are golden learning opportunities – capture them. Over time, building a proprietary dataset of agent interactions and outcomes can give you a defensible edge. It’s your “secret sauce” that others can’t copy, and it will let you improve beyond the baseline model everyone else has.
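One lightweight way to start that flywheel is to log every interaction and its outcome in a form you can mine later for evaluation or fine-tuning. The sketch below uses a JSON-lines file and illustrative field names – adapt the schema (and the privacy handling) to your own stack.

```python
# Append each agent interaction and its outcome to a JSON-lines log for later analysis.
import json
import time

def log_interaction(path: str, task: str, trace: list, outcome: str,
                    human_correction: str | None = None) -> None:
    record = {
        "ts": time.time(),
        "task": task,
        "trace": trace,                       # prompts, tool calls, observations
        "outcome": outcome,                   # e.g. "success" | "failure" | "escalated"
        "human_correction": human_correction, # what the reviewer changed, if anything
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# log_interaction("agent_logs.jsonl", "draft renewal email",
#                 trace=["...steps..."], outcome="escalated",
#                 human_correction="Rep rewrote the pricing paragraph")
```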
Plan for a human-agent hybrid approach (at least initially). The most successful implementations today use AI agents to augment humans, not replace them entirely. Think of agents as junior colleagues or interns: they can handle grunt work and free up humans for higher-level tasks, but they still need oversight on the important stuff. Design your workflows such that the agent does what it’s best at (speed, consistency, crunching data) and a human validates or handles the rest (judgment calls, relationship-sensitive interactions, etc.). For example, an agent might draft analyses or compile recommendations, and a human expert approves and finalizes them. Or an agent triages customer requests, and human staff handle the edge cases it’s unsure about. By explicitly defining this handoff, you manage risk and build trust in the system. As the agent improves, you can gradually widen its autonomy. But even long term, there will likely remain a symbiosis – and that’s fine! The goal is to optimize outcomes, not to achieve autonomy for its own sake.
Keep an eye on the frontier, but don’t chase every shiny object. The field of AI agents is evolving rapidly. New techniques or model updates could suddenly move the needle on some of the challenges we discussed. It’s wise to stay informed (e.g., follow research from OpenAI, DeepMind, Anthropic, and academic groups). However, be cautious about completely changing your product direction with every new paper or SDK. Many ideas will prove incremental or not pan out under real-world constraints. Maintain a learning mindset: run small experiments with new approaches to see if they help your specific case, but avoid the trap of “let’s rebuild everything because someone open-sourced a cool multi-agent demo.” Having a solid architecture that can incorporate improvements is better than constantly pivoting. In short, be agile in adoption but also strategic.
In conclusion, AI agents stand where, say, the internet stood in the early 90s – clearly the next big thing, but also clearly immature in its current form. Over the next few years, we can expect rapid progress. Today’s quirky beta agent that sometimes buys 100 packs of cookies by mistake could evolve into tomorrow’s reliable personal shopper that you trust completely. The trajectory is there, but timing and execution will make all the difference for companies betting on this tech. For startups, the opportunity is immense, but so are the risks of getting ahead of yourself.
The best approach is to be pragmatically optimistic: believe in the potential and work towards it, but stay grounded in solving tangible problems and mitigating present limitations. If you do that, you won’t just be riding the hype wave – you’ll be building real value that endures when the wave settles. AI agents are a marathon, not a sprint, and we’re only a few miles in. Pace yourself, keep your eyes open, and you might just reach the finish line with a winning product.
