techtimes.com

OpenAI inference costs cut by half with software: GPU count drops to hundreds for guest traffic

Tech News • Jul 4, 2026 • 4 min read

Published by AINave Editorial • Reviewed by Ramit

TL;DR: OpenAI has reportedly halved inference costs for ChatGPT guest traffic through software optimization, reducing GPU needs to roughly a couple hundred.

OpenAI has reportedly developed a software-only optimization that cuts inference costs by more than half for the ChatGPT guest tier, reducing the number of Nvidia GPUs required to a couple hundred. The gain comes from better utilization of existing infrastructure rather than new hardware, though the exact technique remains undisclosed. This matters to AI builders because inference cost is the primary barrier to AI profitability and a major driver of API pricing. If the gains generalize beyond guest traffic, they could lead to lower API prices and higher usage limits.

What happened

In June 2026, OpenAI engineers demonstrated a software optimization that, when applied to logged-out ChatGPT traffic, cut the number of Nvidia GPUs needed from an estimated tens of thousands to roughly a couple hundred, according to The Information citing an internal source. The gain comes entirely from improved utilization of existing server infrastructure, with no new hardware deployed. OpenAI has not publicly disclosed the technique.

Separately, on June 24, 2026, OpenAI and Broadcom unveiled Jalapeño, a custom inference accelerator manufactured by TSMC that targets roughly 50% lower cost per token in early testing. Full production is expected by 2027-2028. The software optimization and hardware roadmap form two phases of OpenAI’s inference strategy.

Why AI builders should care

Inference cost is the ongoing operational expense of serving every query. OpenAI spent $5.02 billion on Azure inference alone in the first half of 2025, which makes inference spending the central obstacle to its profitability. A software-only reduction of this magnitude shifts the competitive landscape from who has the most GPUs to who can make the same chips produce more output per dollar.

If the optimization holds, it creates room for lower API prices, higher usage limits, or improved margins. The techniques analysts suspect - KV cache reuse, quantization, in-flight batching, and query routing - are established methods that compound when tuned to a specific traffic profile. Any AI builder hosting models at scale should watch whether these gains reach the API tier.
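To make one of these candidate methods concrete, here is a minimal sketch of prefix (KV cache) reuse. It is a toy model, not OpenAI's undisclosed technique: the `PrefixCache` class, the token lists, and the shared system prompt are all hypothetical, and the cache only counts tokens instead of storing real attention key/value tensors.

```python
from dataclasses import dataclass, field


@dataclass
class PrefixCache:
    """Toy stand-in for KV cache reuse: tracks which token prefixes were already processed."""
    seen: set = field(default_factory=set)
    computed: int = 0  # tokens that needed fresh prefill work
    reused: int = 0    # tokens whose work was served from the cache

    def prefill(self, tokens: list[str]) -> None:
        # Find the longest prefix of this request that is already cached.
        hit = 0
        for n in range(len(tokens), 0, -1):
            if tuple(tokens[:n]) in self.seen:
                hit = n
                break
        self.reused += hit
        self.computed += len(tokens) - hit
        # Remember every prefix of this request for later requests.
        for n in range(1, len(tokens) + 1):
            self.seen.add(tuple(tokens[:n]))


# Hypothetical guest traffic: one shared system prompt plus a short question.
SYSTEM = ["you", "are", "a", "helpful", "assistant"]
requests = [
    SYSTEM + ["what", "is", "rust"],
    SYSTEM + ["what", "is", "go"],
    SYSTEM + ["hello"],
]

cache = PrefixCache()
for request in requests:
    cache.prefill(request)

print(f"computed={cache.computed} reused={cache.reused}")  # computed=10 reused=12
```

In this toy run, more than half of the 22 prompt tokens are served from the cache because every request starts with the same prompt. That is the intuition behind why homogeneous traffic is a favorable target for this kind of tuning.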

Practical implications

The software optimization currently applies only to the guest tier, which produces simpler, more predictable traffic than free-tier, paid, or API workloads. If the same technique generalizes, API pricing could face downward pressure, and usage caps could rise for multiple tiers. OpenAI’s custom Jalapeño hardware, if it delivers the claimed 50% cost reduction, could compound software gains and further reduce per-token costs.
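How two such reductions would compound can be sketched with simple arithmetic. The example below assumes the software and hardware savings are independent and multiply, which neither OpenAI nor Broadcom has stated; the `combined_cost_fraction` helper and the 50% inputs are illustrative only.

```python
def combined_cost_fraction(*reductions: float) -> float:
    """Remaining cost fraction after applying independent fractional reductions."""
    remaining = 1.0
    for r in reductions:
        if not 0.0 <= r <= 1.0:
            raise ValueError("each reduction must be between 0 and 1")
        remaining *= 1.0 - r
    return remaining


# Assumed inputs: a 50% software saving and the claimed 50% hardware saving.
software_cut = 0.5
hardware_cut = 0.5

remaining = combined_cost_fraction(software_cut, hardware_cut)
print(f"remaining cost: {remaining:.0%}, total reduction: {1 - remaining:.0%}")
# remaining cost: 25%, total reduction: 75%
```

Under that assumption, two 50% cuts leave a quarter of the original per-token cost rather than eliminating it, since the second cut applies only to what the first left behind.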

For builders relying on OpenAI’s API, the immediate impact is uncertain. No pricing changes have been announced. But the direction is clear: serving-stack engineering is becoming as strategically important as model architecture, and those who can optimize inference will hold a cost advantage.

Caveats

Several important unknowns limit the practical takeaway. OpenAI has not disclosed the exact technique - the four candidate methods (KV cache reuse, quantization, in-flight batching, query routing) are analyst speculation, not confirmed by the company. The reported results apply only to anonymous guest traffic; whether the gains transfer to free-tier, paid, or API users is unconfirmed. OpenAI has made no commitments about extending the optimization, and the durability of the gains at production scale remains to be seen.

Additionally, Jalapeño’s 50% cost-per-token claim comes from early testing and lacks independent benchmarks. Full production is not expected until 2027-2028. Until the software optimization is proven beyond the guest tier and the hardware claims are validated, builders should treat this as a promising signal rather than a concrete change to their cost structure.

FAQs

What caused OpenAI's inference costs to halve?

The Information reported that OpenAI engineers developed a software-only optimization in June 2026 that reduced GPU requirements for ChatGPT guest traffic from tens of thousands to roughly a couple hundred. The exact technique has not been publicly disclosed. The gain comes from better utilization of existing server infrastructure rather than new hardware, according to the report.

Will the software optimizations apply to paid ChatGPT tiers or only guest traffic?

The reported optimization applies only to the guest tier (anonymous visitor traffic), and OpenAI has not publicly commented on it. No commitment has been made regarding extension to free-tier, paid, or API workloads. The guest tier produces more homogeneous and lower-complexity traffic, making it an easier test bed for the optimization. Whether the same gains apply to more complex paid or API workloads remains the central open question.

What is the Jalapeño chip and when will it be production-ready?

Jalapeño is OpenAI’s first custom-designed inference accelerator, announced on June 24, 2026, developed with Broadcom and manufactured by TSMC. Broadcom CEO Hock Tan stated early testing shows roughly 50% lower inference cost per token compared with current GPUs, though no independent benchmarks have been published. Initial prototype deployments are expected by the end of 2026, with full production scale in 2027-2028.

How many GPUs were required before and after the optimization?

Before the optimization, estimates placed the number of Nvidia GPUs needed to serve ChatGPT guest traffic in the tens of thousands. After applying the software optimization, the number dropped to roughly a couple hundred, according to the internal source cited by The Information. This represents a reduction of roughly 99% or more in the GPU count for that specific segment, depending on the exact starting figure.
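The size of that reduction depends on the starting figure, which the report gives only as "tens of thousands". The short sketch below works through a few hypothetical starting counts against an assumed 200 GPUs afterward; none of these specific numbers come from the report.

```python
def reduction_pct(before: int, after: int) -> float:
    """Percentage drop in GPU count from `before` to `after`."""
    return 100.0 * (before - after) / before


AFTER = 200  # assumed reading of "a couple hundred"

# Hypothetical starting counts spanning "tens of thousands".
for before in (10_000, 20_000, 50_000):
    print(f"{before:>6} -> {AFTER}: {reduction_pct(before, AFTER):.1f}% fewer GPUs")
# 10000 -> 200: 98.0% fewer GPUs
# 20000 -> 200: 99.0% fewer GPUs
# 50000 -> 200: 99.6% fewer GPUs
```

With these assumed inputs, the reduction exceeds 99% only when the starting count is above 20,000 GPUs.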
