---
title: "Two Bridges to the Same Place: Why the AI Model Race Proves the Model Doesn't Matter"
author: "Lida Liberopoulou"
date: "2026-04-14"
license: "CC BY-SA 4.0"
canonical: "https://threadbaire.com/blog/posts/two-bridges-to-the-same-place.html"
---

The AI model race is supposed to produce a winner. One company pulls ahead, locks in the market, and becomes the next platform monopoly. That's the story the funding rounds and the product launches are telling.

But two companies arriving at the same capability in the same month just proves that the capability is reproducible. The competition between them is itself the evidence that the model layer is commoditising in the same way the browser war between Netscape and Microsoft proved browsers weren't worth defending, years before Chrome absorbed the category.

The platform owners already know this, and every one of them is building infrastructure that treats the model as a swappable component. Meanwhile, the open-weight ecosystem is closing the remaining gap from below. And even the quality gap is less stable than it appears, because the model you actually receive through the chat window may not be the same one that was benchmarked in the first place.

## The Browser Wars

In 1995, Netscape Navigator was the internet. It had over 80% market share, a successful IPO, and what looked like an unassailable first-mover advantage. The browser was the product and people paid for it.

Microsoft shipped Internet Explorer and gave it away, then bundled it with Windows and matched Netscape's features one by one: the JavaScript support, the CSS rendering, the dynamic HTML, each release closing the gap within months. The two companies spent years competing on capability, each matching the other's latest release faster than the last. By the early 2000s, Explorer had over 95% market share, Netscape was dead, and Microsoft had won the browser war.

But none of this really mattered, because the competition between Netscape and Microsoft hadn't produced a durable advantage. It had just proved that a web browser is commodity infrastructure: a thing any well-resourced organisation can build. The years of feature-matching established that the capability was a function of engineering investment. When Google shipped Chrome in 2008, it didn't need to win a browser war. The war had already been fought, and its lasting achievement was proving the browser wasn't worth defending.

Chrome succeeded because Google was the infrastructure company. The browser was the distribution layer for everything Google already sold: search, advertising, Android, Gmail, YouTube, and dozens of other services. Google gave Chrome away because the browser wasn't the product. It never had been. Apple made the same move from the hardware side, with Safari shipping as a feature of the device: good enough that nobody needed to think about it, and never the thing you were paying for.

Microsoft's browser dominance, which cost billions and an antitrust trial to achieve, dissolved in under a decade. The winner of the browser war didn't get to keep the winnings. The platform companies, the ones that owned the layers above and below the browser, just absorbed the category whole.

When two well-resourced competitors arrive at the same capability in the same window, they haven't built a moat. They've demonstrated that the capability is a function of scale and investment. If that pattern holds, then the competition between them is itself the evidence that the layer is commoditising. And the companies that own the surrounding infrastructure, the platform, the distribution, the billing relationship, are the ones who absorb the commodity layer once the race has proved it isn't worth owning separately.

The AI model layer is replaying this pattern at compressed speed. The frontier labs are competing for Netscape's position while the platform owners are already playing the Chrome move. And the open-weight ecosystem is proving that you don't need a proprietary vendor at all.

But the parallel is not exact. Browsers were standardised around open protocols with clear downstream business models. AI models are stranger: part infrastructure, part leased capability, part probabilistic service. Still, the structural pattern is the same: competition proves the capability is reproducible, the "winner" of the capability race doesn't hold the position, and the platform owners absorb the commodity layer. The same pattern has replayed in cloud compute, in databases, in server operating systems.

This matters whether you're signing a three-year enterprise contract or deciding which $20 subscription to keep. The scale is different, but the question is the same: is the thing you're building on still going to be here in two years, or are you just renting a position that moves? If you chose Claude over ChatGPT last year because it was better at writing or coding, and you'd switch tomorrow if something else pulled ahead, then you already know the answer. 

## Three convergences, one destination

The evidence comes from three directions at once. The frontier labs are proving each other's capabilities aren't unique. The companies that own the infrastructure underneath them are building model-agnostic platforms. The open-weight ecosystem is closing the remaining gap from below. And all three forces arrived visibly in the same period.

### The labs match each other

The timeline is documented in API changelogs, release notes, and product announcements. And it tells a specific story about how fast a competitive advantage disappears.

When OpenAI shipped function calling in June 2023, it took Google six months to match it and Anthropic about ten. When OpenAI introduced reasoning models in September 2024, Google matched it in three months and Anthropic in five. When all three shipped official agent tooling (Anthropic's Claude Code, OpenAI's Agents SDK, Google's ADK) the entire sequence landed within six weeks of each other.

The parity window keeps compressing. A capability that bought a year of differentiation in 2023 buys just a few weeks in 2025. By the time a feature reaches the market, the other two are already shipping their version.

And this compression isn't limited to features but also shows up in the pace of flagship model releases. In late 2025, four major labs shipped their most powerful models within weeks of each other (xAI's Grok 4.1, Google's Gemini 3, Anthropic's Claude Opus 4.5, and OpenAI's GPT-5.2). One CEO reportedly even issued an internal "code red" memo on this. The competitive pressure was visible to everyone watching, but the structural implication was missed: four companies arriving at frontier capability in the same month is four companies proving the capability isn't proprietary.

The convergence also extends to pricing. In the same week in April 2026, both Anthropic and OpenAI adjusted their pricing structures in the same direction. Both chose to move away from flat-rate subscriptions and toward metered, usage-based billing for agent workloads. They discovered independently that the $20/month subscription model cannot survive contact with the actual compute costs of agentic use. And they responded by restricting access at the lower tiers and creating premium options.

They flinched simultaneously because the underlying cost structure is identical. They're running on the same hardware, from the same suppliers, facing the same economics. If you're paying $20 a month right now, both companies just admitted that price doesn't cover what the tool actually costs to run. And that is because the $20 subscription was never the business model. OpenAI has raised over $120 billion at a valuation that requires multiples of its current revenue to justify. Anthropic tripled its annual run rate, which relies heavily on enterprise contracts, to $30 billion. In the same quarter they locked in 3.5 gigawatts of compute capacity through 2027 from infrastructure partners who filed the commitment as a risk disclosure with the SEC. Neither company has demonstrated that the consumer subscription, at its current price, is self-sustaining. The grocery store analogy applies: the free sample gets you in the door but it was never what pays for the store.

As of this writing, both frontier labs are making the same pitch: that they have a model too powerful to release broadly, available only to a small set of institutional partners, and both are making it in the same window. Whether those claims hold up is a separate question. One company making a restricted-access pitch looks like a moat. Two companies making the same pitch in the same month looks like a market category.

### The platform owners place their bets

While the frontier labs compete on capability, the companies that own the infrastructure underneath them have already made their structural decision. Every major platform owner is building model-agnostic agent infrastructure and not one of them has locked to a single model provider at the platform level.

Microsoft's Foundry Agent Service is positioned to work with "any framework and many models." Its model catalog unifies open-source and OpenAI models in the same deployment flow. Even within Microsoft 365, the agent mode in Excel lets users choose between Anthropic and OpenAI reasoning models. The company that has invested more in OpenAI than any other entity on Earth is, at the same time, building the infrastructure to replace it.

Amazon's Bedrock AgentCore works with "any open-source framework and model." Google's Agent Development Kit was released as open-source with explicit multi-provider support. Apple's Xcode 26.3 integrates coding agents from multiple providers with Claude and OpenAI Codex side by side in the same IDE. Even Meta, which just shipped a closed proprietary model, simultaneously maintains Llama Stack as an OpenAI-compatible, model-agnostic server.

These are shipped products and public documentation. The platform owners have looked at the frontier model race and decided the model is a swappable component. You don't build a model-agnostic control plane if you believe one model will dominate. 

The OpenAI API format itself became the clearest evidence of this. Google's Gemini API added OpenAI-compatible endpoint support in November 2024. Meta's Llama Stack is described as a "drop-in replacement." Databricks positions its platform as model-agnostic with OpenAI-compatible interfaces. The companies that compete with OpenAI adopted its API format as a compatibility baseline. And that is because the alternative is forcing developers to choose, and they'd rather let developers swap.
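Here is a minimal sketch of what that compatibility baseline looks like from the developer's side, using the standard OpenAI Python client. The endpoint URLs and model names are illustrative assumptions, not recommendations; check each provider's current documentation, because these details move.

```python
# A minimal sketch of what "OpenAI-compatible" means in practice: the same
# client library and the same call shape, with only the base URL and model
# name swapped. URLs and model names below are illustrative.
import os
from openai import OpenAI

PROVIDERS = {
    "openai": ("https://api.openai.com/v1", "gpt-4o-mini", "OPENAI_API_KEY"),
    # Google's OpenAI-compatible endpoint for Gemini
    "gemini": ("https://generativelanguage.googleapis.com/v1beta/openai/",
               "gemini-2.0-flash", "GEMINI_API_KEY"),
    # A local Ollama or Llama Stack server exposing the same format
    "local": ("http://localhost:11434/v1", "llama3.1", None),
}

def ask(provider: str, prompt: str) -> str:
    base_url, model, key_env = PROVIDERS[provider]
    api_key = os.environ[key_env] if key_env else "not-needed-locally"
    client = OpenAI(base_url=base_url, api_key=api_key)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Swapping vendors is a dictionary entry, not a migration.
```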

Anthropic's Model Context Protocol and OpenAI's AGENTS.md specification were donated as founding projects to the Agentic AI Foundation under the Linux Foundation in December 2025, alongside Block's goose agent framework. Google donated its Agent-to-Agent protocol to the Linux Foundation separately. Google, Microsoft, Amazon, and all three frontier labs are Platinum members of the same foundation. The companies competing most intensely at the model layer are cooperating at the protocol layer because they all need interoperability more than they need lock-in. They're standardising the connective tissue between agents while competing on the agents themselves. That only makes sense if you believe the model is commodity and the integration layer is where value lives.

### The open floor rises

The gap between open-weight and proprietary models has been closing on a measurable trajectory. On Chatbot Arena's Elo-based scoring, the gap between open and closed models narrowed from 8% to under 2% between January 2024 and February 2025 (Stanford HAI AI Index Report 2025). An MIT-affiliated working paper (Nagle and Yue, SSRN, November 2025) found that open models now close the performance gap within thirteen weeks of a closed model's release, which is down from twenty-seven weeks the year prior.

Open-weight models are also catching up on the agentic capabilities that define the current competitive frontier. GLM-5 scores competitively on tool-use benchmarks. Gemma 4, released under Apache 2.0 in April 2026, ships with native function calling, 256K context windows, and multimodal input, all capabilities that were frontier-only features months ago. And these models run on consumer hardware.

Ollama now runs Gemma, DeepSeek, Qwen, GLM, and MiniMax through the same command-line interface. Developers are already pointing Anthropic's own Claude Code and OpenClaw at local Gemma 4 instances instead of cloud APIs. They are essentially using the frontier lab's tooling with someone else's model, at zero cost, with no data leaving the machine. The workflow that's emerging is simple: run the free local model for 60–70% of routine work, route the hard tasks to whichever cloud provider is cheapest that week.
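A sketch of that routing workflow, assuming an Ollama server on its default port and one metered cloud provider as the escalation path. The "hard task" flag and the model names are placeholders; real setups route on task type, context length, or a cheap classifier.

```python
# Local-first routing sketch: routine work goes to the free local model,
# hard tasks escalate to a metered cloud provider. Model names are examples.
from openai import OpenAI

local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
cloud = OpenAI()  # reads OPENAI_API_KEY from the environment

def route(prompt: str, hard: bool = False) -> str:
    """Send routine prompts to the local model, escalate the rest."""
    if hard:
        client, model = cloud, "gpt-4o"      # metered, per-token
    else:
        client, model = local, "gemma3"      # free, data stays on-machine
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(route("Summarise this changelog in three bullets."))          # local
print(route("Refactor this 2,000-line module safely.", hard=True))  # cloud
```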

The compression is accelerating specifically in the capability class that costs the most to deliver: the agent workloads that both frontier labs repriced around in April 2026. The frontier providers raised prices on the thing they can't afford to subsidise. The open-weight alternatives offer the same capability class at a fraction of the cost, or free, on the user's own hardware. That's the commodity trap: raise the price, and the free alternative becomes more attractive; subsidise the price, and the business model collapses. Neither option produces a moat.

Enterprise behaviour already reflects this. Perplexity's enterprise data shows that at the beginning of 2025, the two main providers, Claude and GPT, accounted for more than 90% of queries on its platform. By late 2025, the market had fragmented into a four-way split, with no single provider holding even a quarter of queries. New models spike above 50% of enterprise usage for a few days after release, then settle back as users experiment and return to routing across providers. An entire product category, the AI gateways (Bifrost, Kong, LiteLLM, and OpenRouter), exists specifically to make model-switching a configuration change rather than a migration project.
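The gateway idea itself fits in a few lines. The sketch below uses LiteLLM's unified `completion` call purely as an illustration; the provider-prefixed model identifiers are examples, and the point is only that the provider becomes a string in configuration rather than an architecture decision.

```python
# Gateway-style routing sketch: one call signature, provider chosen by a
# config string. Model identifiers are illustrative examples only.
from litellm import completion

ACTIVE_MODEL = "anthropic/claude-sonnet-4-20250514"  # swapped via config, not code
# ACTIVE_MODEL = "openai/gpt-4o"
# ACTIVE_MODEL = "ollama/gemma3"                      # local, free

def ask(prompt: str) -> str:
    response = completion(
        model=ACTIVE_MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```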

## The model you benchmark is not the model you receive

If you’ve ever felt like your AI tool was sharper last week than it is today, you’re probably not imagining it. Maybe the answers feel flatter, the reasoning less careful, or something just seems slightly off. There is a dimension of model competition that benchmarks don’t capture and that almost nobody acknowledges. Benchmarks measure a model at its maximum capability, under controlled conditions, at a specific moment in time. What you actually receive through the chat window on a Tuesday afternoon may be something else.

That matters here because every published comparison between proprietary and open-weight models is really a comparison against the proprietary model at its ceiling. But if the version customers actually receive is something less than that ceiling, and if it shifts without disclosure, then the real gap is narrower than the benchmarks suggest. And the evidence that this happens is not just anecdotal.

A foundational study from Stanford and Berkeley measured the same named version of GPT-4 across a three-month period in 2023 and found accuracy on a prime-versus-composite classification task dropped from 84% to 51% (Chen, Zaharia, and Zou, *Harvard Data Science Review*, 2024). A 2026 audit from Princeton (Kirgis, Tufekci, et al.) tested the same models through OpenAI's chat interface versus its API and found the behaviour "differs dramatically" between the two. The same API endpoint tested just two months apart yielded what the authors describe as a "complete reversal" in output patterns.

And these are not edge cases. They reflect the structural reality of how commercial AI models are served. Every provider reserves the right to change model behaviour without notice. OpenAI's terms state they "may update our Services from time to time." Anthropic's API versioning preserves interface parameters but does not promise stable model outputs. According to Google, its model aliases "always point to the latest stable model" and update automatically.

OpenAI describes GPT-5 as a routed system where a fast model, a reasoning model, and a real-time router select which version handles each query. Once usage limits are hit, "a mini version of each model handles remaining queries." Anthropic's Claude Code defaults paid Pro and Max subscribers to *medium* reasoning effort while other users default to *high*. Paid users end up receiving less reasoning depth by default, not more. After hitting an Opus usage threshold, Claude Code "may automatically fall back to Sonnet."

OpenAI acknowledged rolling back a GPT-4o update in April 2025 after it became "overly flattering". It was a behaviour change that shipped, was noticed by users, and was corrected only after public complaints. Anthropic published a detailed postmortem in September 2025 documenting three infrastructure bugs that intermittently degraded Claude's response quality, with misrouting affecting up to 16% of Sonnet 4 requests at the worst-impacted hour. In both cases, the disclosure came after users noticed the problem, not before.

Gao, Liang, and Guestrin (ICLR 2025) tested 31 commercial API endpoints claiming to serve open-weight Llama models and found that 11 of them, which was more than a third, served output distributions that diverged from Meta's published reference weights under the same decoding settings. 

The gap between what was measured and what is delivered is invisible, variable, and undisclosed. For any organisation making a procurement decision based on published benchmarks, this is the gap that matters most.

Open-weight models sidestep this problem entirely. The weights you download are the weights you run. The inference stack is yours to freeze. There are no silent routing changes, no fallback to a smaller model at peak hours, no undisclosed modifications between Tuesday and Thursday. Consistency is its own form of superiority, even when peak capability is lower, because at the end of the day you know what you're getting.

So the real-world convergence between proprietary and open-weight models is tighter than any benchmark suggests, because the end user is not getting the ceiling but a constantly moving floor.

## What this does not mean

None of this means models don't matter or that every provider is interchangeable today. Frontier models still lead on the hardest tasks. The people and companies building on them have real reasons for the choices they've made: trust, habit, the way a tool fits into how they think. Switching costs are real even when they're not contractual.

What the pattern means is that the technical advantage is shrinking faster than the story around it admits, and the infrastructure is already adjusting. The durable value is migrating to the layers around the model: the context, the integration, the billing relationship, the workflow you've built, not the model itself. The people who understand this hold their commitments loosely. They keep their work portable, and they don't let a vendor become the only place their context lives. The ones who don't will find out when the tool they built everything around changes its pricing, changes its behaviour, or gets matched by something that costs nothing.

## What should exist but doesn't

If the model layer is becoming a commodity, then the most important question is not what a model can do at its peak. It is what customers actually receive in practice, day after day, through the interface they pay for. And that is the part of the market that is barely measured.

Complaints about commercial model quality are widespread, persistent, and still mostly anecdotal. Users regularly report that performance fluctuates, degrades without warning, and that the model they tested last week does not seem to be the one they are getting today. We do have peer-reviewed evidence that models change over time. What remains unproven is whether specific providers are throttling specific models at specific times, because no one has yet run the kind of test that would demonstrate it.

And the shape of that test is simple. A fixed prompt suite, run against the same named model at the same settings, across fixed time windows for two or three weeks, with outputs scored blind against a rubric. Compare weekdays to weekends, peak US hours to off-hours, the chat interface to the API and subscription tiers where possible.
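As a sketch of what that harness could look like (provider, model name, file layout, and schedule are all assumptions, not a reference implementation):

```python
# A minimal sketch of the drift test described above: one fixed prompt suite,
# one fixed named model, fixed settings, every response stamped and logged so
# it can be scored blind later. Model name and file names are illustrative.
import json
import time
from datetime import datetime, timezone
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-2024-08-06"            # pin a dated snapshot where the API offers one

with open("prompt_suite.json") as f:   # the same suite, every run, unchanged
    PROMPTS = json.load(f)

def run_suite(window_label: str) -> None:
    """Run the whole suite once and append results to a log for blind scoring."""
    with open("runs.jsonl", "a") as log:
        for item in PROMPTS:
            response = client.chat.completions.create(
                model=MODEL,
                temperature=0,         # hold settings constant across windows
                messages=[{"role": "user", "content": item["prompt"]}],
            )
            log.write(json.dumps({
                "prompt_id": item["id"],
                "window": window_label,                      # e.g. "weekday-peak-us"
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "output": response.choices[0].message.content,
            }) + "\n")
            time.sleep(1)              # stay under rate limits

# Schedule run_suite() across the windows you care about (weekday vs weekend,
# peak vs off-peak), then have raters score outputs without seeing timestamps.
```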

And it would answer an important question about the current state of AI. Because if delivered quality shifts silently over time, then benchmark leadership is an even weaker moat than it appears. At the end of the day, the competitive landscape is not defined by what the model achieves under ideal conditions in a system card or leaderboard but by what the customer actually gets on a Tuesday afternoon.

Today, nobody publishes ongoing comparisons of that delivered quality against benchmarked quality: something that would measure not what the model can do at maximum effort under controlled conditions, but what normal users receive through the chat interface at default settings during ordinary use. That gap is where the real competitive landscape lives, and it is still largely unmapped.

There is also a consistency comparison that should be straightforward: the same prompt suite run against a commercial API endpoint and against an equivalent open-weight model running locally, at the same settings, across the same period. The commercial endpoint has no obligation to stay fixed, but the local model at least lets you pin the thing you are running.
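That comparison is the same harness with a second, pinned endpoint. A sketch, assuming an Ollama server and an explicit model tag so the local side genuinely stays fixed (endpoints and model names are illustrative):

```python
# Paired consistency check: the same prompt against a commercial endpoint
# (free to change behind the name) and a local open-weight model pinned to a
# specific tag. Endpoints and model names are illustrative assumptions.
from openai import OpenAI

ENDPOINTS = {
    "commercial": (OpenAI(), "gpt-4o"),
    "pinned-local": (
        OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"),
        "gemma3:12b",                  # one tag, same weights on every run
    ),
}

def compare(prompt: str) -> dict[str, str]:
    """Return each endpoint's output for the same prompt at the same settings."""
    results = {}
    for label, (client, model) in ENDPOINTS.items():
        response = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[{"role": "user", "content": prompt}],
        )
        results[label] = response.choices[0].message.content
    return results
```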

These are tests that any research group, security firm, or well-resourced developer with API access could run. The results would matter to every enterprise buyer evaluating a model contract and every procurement officer comparing vendor claims with delivered performance.

The organisations best placed to run them are the companies already building products on top of commercial model APIs and absorbing the instability firsthand. They have the access, the engineering capability, and the strongest incentive: their own products depend on the answer.

The tooling for this kind of monitoring already exists in academic form. Statistical methods for detecting undisclosed model changes behind APIs have already been published. What does not exist is anyone running them continuously and publishing the results.
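The core of those methods is ordinary statistics. A toy version of the idea, not a reimplementation of any particular paper: reduce each output on a fixed, deterministic suite to a discrete label and test whether the label distribution shifts between two collection periods.

```python
# Toy illustration of change detection behind an API: compare the answer-label
# distributions from two collection periods on the same fixed prompt suite.
from collections import Counter
from scipy.stats import chi2_contingency

def distribution_shift(labels_period_a: list[str], labels_period_b: list[str]) -> float:
    """Return the p-value of a chi-square test on the two label distributions."""
    categories = sorted(set(labels_period_a) | set(labels_period_b))
    counts_a = Counter(labels_period_a)
    counts_b = Counter(labels_period_b)
    table = [
        [counts_a.get(c, 0) for c in categories],
        [counts_b.get(c, 0) for c in categories],
    ]
    _, p_value, _, _ = chi2_contingency(table)
    return p_value

# A small p-value on a fixed, deterministic suite is evidence that the thing
# behind the API changed, even if the model name did not.
```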

In the meantime, the practical advice remains unglamorous but load-bearing: keep your work portable. Do not let your context (what is in your prompts, your workflows, the way you have taught a tool to work with you) live only inside one provider’s system. The model you are using today may not be the one you are using in eighteen months. The only thing you will want to bring with you is the work itself. Make sure you can.
