Compute Manifesto

We have entered the Inference Era

For half a decade, AI progress was driven by pre-training scaling: breakthroughs were gated by order-of-magnitude growth in the PFLOP-days required to train models on ever-larger datasets. AI has now entered a new phase, as inference-time compute scaling and RL training unlock capabilities that pre-training alone would have taken years to reach.

This means the same compute used to train the next model is also needed to run the inference that funds the clusters which train and serve the models that follow. These two vectors of progress are compounding whilst competing for the same scarce resource, making compute the most important resource in the world. The first 1GW clusters arrive in 2026, and the race to 10GW is already underway.

Inference scaling demands interactivity and throughput simultaneously

Inference-time scaling has enabled remarkable model capabilities and user experiences, driving the fastest revenue growth in the history of technology in 2025. The global deployment of AI over the next decade will be shaped by inference user experience and unit economics.

According to METR data, task-length capability is now doubling roughly every five months, and on current trends, by late 2027, models will be capable of tasks that take humans two working weeks. An underappreciated finding from that research is the empirical reality that every doubling of task length is achieved by more than doubling the token budget. Continuing on this path, further exponential increases in model capability come at the cost of super-exponential increases in compute.
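To make that compounding concrete, here is a minimal sketch. The five-month doubling period follows the trend above; the 2.5x token-budget multiplier per doubling is an illustrative assumption (the text only requires it to exceed 2x), not a METR-published figure.

```python
# Illustrative back-of-envelope sketch (assumed figures, not METR's published
# numbers): if task length doubles every ~5 months but each doubling needs
# MORE than 2x the token budget, compute grows faster than the capability it buys.
DOUBLING_MONTHS = 5          # capability doubling period assumed from the trend above
TOKENS_PER_DOUBLING = 2.5    # assumed token-budget multiplier per doubling (anything > 2)

def relative_growth(months: float) -> tuple[float, float]:
    """Return (task-length multiple, token-budget multiple) after `months`."""
    doublings = months / DOUBLING_MONTHS
    return 2 ** doublings, TOKENS_PER_DOUBLING ** doublings

for months in (5, 12, 24, 36):
    capability, tokens = relative_growth(months)
    print(f"{months:>2} months: {capability:7.1f}x task length, {tokens:9.1f}x token budget")
```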

Making the most powerful capabilities available to everyone will require models to be served with both high throughput and high interactivity. Without hardware that delivers both simultaneously, incredible raw capabilities will remain undeployable at scale, accessible only to the few who can wait and pay.

The existing GPU architecture is reaching its physical limit 

Current hardware is fundamentally unable to serve inference that is fast for every user at once, which is why we see outcomes such as Opus 4.6 rate limits and GPT-5.2 Pro running at ~13 tokens per second per user.

This tradeoff is inherent to the memory architecture used by every major accelerator since TPUv2 and V100: a large logic die placed on an interposer alongside stacks of HBM. High throughput per XPU and per megawatt can only be achieved by batching large numbers of users together, fully utilising compute and amortising the energy cost of moving model weights through HBM across a large batch of output tokens. Large batch sizes inevitably increase per-user latency, reducing interactivity and forcing a hard tradeoff.
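A minimal roofline-style sketch of a memory-bandwidth-bound decode step makes the tradeoff visible. The bandwidth, weight, and KV-cache figures below are illustrative assumptions, not measurements of any particular accelerator.

```python
# Minimal sketch of the memory-bound batching tradeoff (illustrative numbers,
# not measurements of any specific accelerator). Each decode step must stream
# the model weights plus every batched user's KV cache through HBM.
HBM_BW      = 8e12      # assumed HBM bandwidth, bytes/s
WEIGHTS     = 1.4e11    # assumed weight bytes resident in HBM
KV_PER_USER = 4e9       # assumed KV-cache bytes read per user per step

for batch in (1, 8, 64, 256):
    step_s   = (WEIGHTS + batch * KV_PER_USER) / HBM_BW  # time for one decode step
    per_user = 1 / step_s                                # interactivity, tok/s per user
    total    = batch / step_s                            # throughput, tok/s per XPU
    print(f"batch {batch:>3}: {per_user:6.1f} tok/s/user, {total:8.0f} tok/s total")
```

Under these assumptions, total throughput climbs with batch size while per-user speed falls, which is exactly the tradeoff described above.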

At its core, inference performance is constrained by data movement. As a result, continued improvements in logic efficiency (FLOPS/W) and throughput (FLOPS per package) deliver diminishing returns. Reductions in data-transfer time are capped by the memory wall, as well as by shoreline and package-size constraints. While the transition from HBM2 to HBM4 delivered meaningful gains in both energy efficiency and throughput density, repeating improvements of that magnitude will take close to a decade and require significantly more complex and expensive fabrication techniques. The limited scope for further energy-efficiency gains from HBM creates an unavoidable floor on the pJ/bit required to transfer the KV cache for each token, and thus a floor on total token energy consumption in the current architecture.
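A back-of-envelope sketch of that energy floor, using assumed figures for HBM access energy and KV-cache size rather than vendor-quoted numbers:

```python
# Illustrative energy-floor sketch (assumed figures, not vendor specs): even if
# logic power went to zero, each output token would still pay pJ/bit to stream
# that user's KV cache out of HBM.
PJ_PER_BIT = 4.0   # assumed HBM access energy, picojoules per bit
KV_BYTES   = 4e9   # assumed KV-cache bytes read per user per token

joules_per_token = KV_BYTES * 8 * PJ_PER_BIT * 1e-12
print(f"~{joules_per_token:.2f} J per token for KV-cache movement alone")
# At 1,000 tokens/s per user that is ~128 W of data movement per user,
# before a single FLOP is counted.
```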

Over the past decade, scaling this architecture has improved overall system performance, but further scaling will not achieve both high throughput and high interactivity. From Hopper to Rubin Ultra, package size will have grown by roughly four times; a further four-times increase would approach full wafer-scale packaging limits. Larger packages can reduce data-transfer time and improve interactivity, but they do not reduce fixed data-movement latency. As such, Amdahl's law limits the interactivity gains available from further package-size increases.
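A simple Amdahl's-law sketch, with an assumed 40% of per-token latency being fixed data-movement latency that package scaling cannot touch (the 40% figure is illustrative):

```python
# Amdahl's-law sketch: if an assumed 40% of per-token latency is fixed
# data-movement latency that a larger package cannot touch, interactivity
# gains from package scaling saturate quickly.
def token_latency_speedup(package_speedup: float, fixed_fraction: float = 0.4) -> float:
    """Overall speedup when only (1 - fixed_fraction) of latency scales with the package."""
    return 1 / (fixed_fraction + (1 - fixed_fraction) / package_speedup)

for s in (2, 4, 16, 1e9):
    print(f"package {s:>10.0f}x faster -> per-token latency only {token_latency_speedup(s):.2f}x better")
```

With these assumptions, even an arbitrarily large package improves per-token latency by no more than 2.5x.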

The physical path that data takes from HBM, through the interposer, into compute has not fundamentally changed, and it is becoming more complex with the introduction of cross-reticle high-bandwidth interfaces. As a result, data-movement latency, measured as time per cache hit or miss, is at or near its limits and contributes an increasing share of per-token latency. Data-transfer time per layer can be pushed down further through tensor parallelism of larger layers, but this comes at the cost of additional power and interconnect latency. High-throughput encoding schemes also introduce encoding and decoding latency, further raising the latency floor per token and limiting achievable interactivity.
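A sketch of the tensor-parallelism tradeoff under assumed figures: splitting each layer across more chips shrinks the streaming term, but a fixed interconnect latency per layer remains, so per-token latency flattens out rather than continuing to fall.

```python
# Tensor-parallelism sketch (assumed figures): more chips per layer shrink the
# per-chip streaming term, but a fixed all-reduce latency per layer remains,
# so per-token latency approaches a floor instead of vanishing.
LAYER_BYTES = 1.5e9   # assumed bytes streamed per layer per token
CHIP_BW     = 8e12    # assumed memory bandwidth per chip, bytes/s
ALLREDUCE_S = 4e-6    # assumed interconnect latency per layer, seconds
N_LAYERS    = 100     # assumed layer count

for tp in (1, 4, 16, 64):
    per_layer = LAYER_BYTES / (tp * CHIP_BW) + ALLREDUCE_S
    print(f"TP={tp:>2}: {per_layer * N_LAYERS * 1e3:5.2f} ms per token")
```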

Even the Dominant Player cannot overcome this 

If this tradeoff could be resolved through scale, integration, or execution, the incumbent at the centre of today’s compute ecosystem would be the one to do it.

The incumbent has deep moats across software, system integration and supply chain, the last of these secured by billions in pre-payments for leading-edge logic nodes, HBM and advanced-packaging capacity.

With each generation, it doubles down on this approach. Systems grow larger, more tightly integrated and more ambitious. Absolute performance continues to rise, but the underlying constraints do not move, and it remains impossible to deliver high interactivity and high throughput together.

What it would take to succeed

Hardware that can deliver both high throughput and high interactivity must simultaneously address data-movement efficiency and latency at scale. Any approach that improves only one dimension merely changes the nature of the tradeoff.

From a supply-chain and fabrication perspective, a new architecture must forgo HBM, advanced packaging, or any other technology whose supply is constrained by the current incumbents. When even the largest hyperscalers struggle to secure capacity, a startup simply cannot compete.

From a compatibility perspective, the hardware must support models as they exist today. It should not force QAT/PTQ of existing models, nor require a new thermodynamic or neuromorphic architecture offering only the promise of theoretical improvements.

From a design perspective, achieving this requires system-scale thinking, moving from reticle-scale and wafer-scale design to rack-scale co-design of compute and data movement as a single, unified system. 

Challengers have tried and failed

There is no shortage of well-funded challengers in this space, but they fall into one of two failure modes.

Some remain within the logic die, interposer and HBM architectural paradigm. They are subject to the same interactivity-throughput tradeoff, whilst competing against next-generation GPUs and TPUs with older-generation, lower-bin HBM and logic.

Others fail to go far enough. They recognise the need for a new paradigm and try to reshape the interactivity tradeoff, but cannot break out of it, remaining constrained by the limits of silicon-only approaches.

OLIX is building a new paradigm

We are building a new class of accelerator designed to achieve high throughput and high interactivity on the most demanding inference workloads, free from the architectural and supply chain constraints of the current regime.

It is our belief that scaling an SRAM-based architecture integrated with photonics can surpass HBM-based architectures on throughput/MW and TCO, and significantly outperform silicon-only SRAM architectures on interactivity and latency.

The OLIX Optical Tensor Processing Unit (OTPU) is an optical digital processor with a novel memory and interconnect architecture. This approach enables bit-perfect logic with a step change in performance.

We are creating the next paradigm for frontier AI.

OLIX.

© 2026 – OLIX Computing
