The Scaling Era: What We've Learned
The past several years of AI development have established two fundamental truths about building powerful AI systems. First, scaling model parameters, training data and pre-training compute consistently improves model performance. Second, post-training techniques enhance task-specific capabilities, while inference-time scaling and reinforcement learning deliver further gains in domains with verifiable rewards.
From these truths we can see what a super-intelligent system might look like: large models generating extensive token sequences through long inference chains. Deploying such systems economically requires substantial inference compute capacity. And as user bases expand, query volumes increase and tokens per query grow, inference demand is accelerating far faster than supply can scale.
The Shortage Era: Supply-Demand Imbalance
Compute as a Critical Commodity
This means we are fast approaching, or indeed are already in, the shortage era. Computing resources are becoming a critical factor of production. Access to, and control of, compute will become inextricably linked to national economic security, and compute capability will in turn become critical national infrastructure.
We can see this already, both in nation states negotiating deals to secure compute infrastructure (in some cases, 'traditional' critical national infrastructure such as nuclear power has been co-opted exclusively to serve compute demand) and in the pursuit of 'sovereign compute' capability.
Given the critical importance of compute, it is surprising how few analyses recognise it as the most important commodity of the next decade. If we analysed compute the way we analyse any other commodity (and we can, because, like other commodities, compute is economically essential, has global demand but restricted supply, and is broadly fungible), we might plot an industry supply curve: in this case, a Compute Cost Curve.
When we look at the compute cost curve, we see that beyond a certain point the incremental cost of generating one additional token becomes uneconomical at any price: there is a hard physical limit on token supply that sits far below demand. Projected advances in hardware capacity and performance merely move this point along the axis. Latest-generation hardware certainly maintains significant cost advantages over its predecessors, but fabrication constraints will inevitably restrict total capacity increases.
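The shape of such a cost curve can be sketched with a toy model: marginal cost stays roughly flat while capacity is slack, then diverges as utilisation approaches a hard physical supply limit. All numbers below are illustrative assumptions, not data from the analysis.

```python
# Toy compute cost curve: marginal cost of one additional token as cumulative
# supply approaches a hard physical limit. Numbers are illustrative only.

def marginal_cost_per_mtoken(capacity_used, hard_limit=1.0, base_cost=0.1):
    """Cost per million tokens; diverges as utilisation nears the limit."""
    if capacity_used >= hard_limit:
        return float("inf")  # no price clears the market beyond the limit
    return base_cost / (1.0 - capacity_used / hard_limit)

# Sweep utilisation from slack supply toward the hard limit.
for used in (0.1, 0.5, 0.9, 0.99):
    print(f"utilisation {used:.0%}: ${marginal_cost_per_mtoken(used):.2f} per Mtoken")
```

The hyperbolic form encodes the key claim: near the limit, no price point makes an additional token economical, so hardware improvements only shift where the divergence occurs.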
Projecting the 2025-2030 Curve
Current trends suggest total token production capacity will increase approximately 40x by 2030, which is insufficient to meet projected demand growth. This analysis holds interactivity (tokens/second/user) constant, yet advanced AI applications will require dramatically higher token generation rates to deliver value.
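For scale, a 40x capacity increase over the five years from 2025 to 2030 implies roughly a doubling every year. A quick check of the implied compound growth rate:

```python
# Implied annual growth rate if total token capacity grows ~40x over 2025-2030.
# Simple arithmetic on the figure quoted in the text; no other data assumed.
growth_total = 40
years = 5
annual = growth_total ** (1 / years)
print(f"implied growth: {annual:.2f}x per year ({annual - 1:.0%} annually)")
```

That is, supply must more than double every year just to hit the 40x figure that the text argues still falls short of demand.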
Increasing interactivity requirements shift the supply curve left and upward, severely constraining available throughput. Our analysis shows that moving from 50 tokens/s/user to 1,000 tokens/s/user leads to a >100x reduction in available token supply. This effect stems from GPU memory architecture: the memory wall prevents any solution to the limited supply of high-interactivity tokens, because of the fundamental constraint that memory bandwidth scales with package perimeter while compute scales with package area.
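The mechanism can be sketched with a simplified decode model: each generation step must stream the full model weights once (shared across the batch) plus each user's KV cache within the step deadline set by the target per-user rate. The bandwidth, weight-size and KV-size figures below are illustrative assumptions, and the model ignores tensor-parallel communication overhead, so it understates the real penalty; it shows only the direction of the effect.

```python
# Simplified memory-bandwidth-bound decode model. Each step streams all
# weights once (amortised over the batch) plus per-user KV cache.
# All hardware numbers are illustrative assumptions.

def max_throughput(bw_bytes_s, weights_bytes, kv_bytes_per_user, tok_s_per_user):
    """Max aggregate tokens/s at a target per-user rate, or 0 if infeasible."""
    step_time = 1.0 / tok_s_per_user            # seconds allowed per decode step
    bytes_per_step = bw_bytes_s * step_time     # bandwidth budget per step
    spare = bytes_per_step - weights_bytes      # left over for KV traffic
    if spare <= 0:
        return 0.0                              # weights alone exceed the budget
    batch = int(spare // kv_bytes_per_user)     # concurrent users supported
    return batch * tok_s_per_user

BW, W, KV = 8e12, 1e11, 2e8   # 8 TB/s, 100 GB of weights, 200 MB KV per user
for rate in (50, 1000):
    print(f"{rate} tok/s/user -> {max_throughput(BW, W, KV, rate):,.0f} tok/s total")
```

Under these assumed numbers, 1,000 tokens/s/user is infeasible on a single accelerator at all: the weights cannot even be streamed once within the step deadline, which is exactly what forces tensor parallelism and its communication overhead.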
[image]
The Memory Wall
Higher token generation rates require dedicating more bandwidth to model weights, reducing the capacity available for activations and user data and creating the throughput-interactivity tradeoff. When memory bandwidth demands exceed single-accelerator capacity, tensor parallelism becomes necessary to increase interactivity further, but this reduces throughput still more due to collective communication overhead.
Physical constraints limit achievable memory bandwidth in GPU architectures. Package size increases (as seen in the Hopper → Blackwell → Rubin progression) provide incremental improvements but cannot overcome fundamental scaling limits. The packaging limits of wafer-scale GPUs (e.g. TSMC System-on-Wafer packaging with 64 HBM stacks) create a ceiling on memory bandwidth, and thus on high-interactivity token supply.
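The perimeter-versus-area argument can be made concrete with a scaling sketch: if off-package memory bandwidth grows with the package edge length while compute grows with its area, then every increase in package size makes the chip more compute-rich but relatively more bandwidth-starved.

```python
# Why bigger packages cannot fix the memory wall: off-package memory bandwidth
# scales roughly with the package perimeter (a linear dimension), while compute
# scales with area. Doubling the edge doubles bandwidth but quadruples compute,
# so bandwidth per unit of compute halves. Illustrative scaling only.
for edge in (1, 2, 4):
    compute = edge ** 2                 # grows with area
    bandwidth = edge                    # grows with perimeter
    ratio = bandwidth / compute         # bytes available per unit of compute
    print(f"edge x{edge}: compute x{compute}, bandwidth x{bandwidth}, "
          f"bytes/FLOP x{ratio}")
```

This is why the Hopper → Blackwell → Rubin package growth helps absolute bandwidth but steadily worsens the bandwidth-to-compute ratio that high-interactivity decoding depends on.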
[image]
Quantisation: The Hardware Imperative
Facing immutable hardware constraints, research has progressively pushed quantisation boundaries. Halving bit precision approximately doubles performance in digital systems by doubling effective memory bandwidth. FP4 training and inference are now well-demonstrated, with NVFP4/MXFP4 implementations using block sizes of 16-32.
This trend, driven by performance demand, will continue toward 2-3-bit precision with larger block sizes as compute shortages intensify. Hardware architecture dictates this trajectory: GPU floorplans increasingly optimise for FP4, making it the competitive baseline. Workloads not optimised for FP4 will face performance disadvantages.
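The block-scaled structure mentioned above can be sketched in a few lines: each block of 16 values shares one scale, chosen so the block maximum maps onto the largest FP4 (E2M1) magnitude, and every value snaps to the nearest representable point. This is a simplified illustration of the idea, not the NVFP4/MXFP4 specifications, which additionally quantise the scales themselves.

```python
# Sketch of block-wise 4-bit quantisation in the spirit of NVFP4/MXFP4.
# Simplified: real formats also constrain the per-block scale encoding.

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]   # E2M1 magnitudes

def quantise_block(block):
    """Quantise one block of 16 values to FP4 points under a shared scale."""
    scale = max(abs(x) for x in block) / 6.0 or 1.0    # block max maps to 6.0
    q = [min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g)) for x in block]
    return [(v if x >= 0 else -v) * scale for v, x in zip(q, block)]

block = [0.11, -0.34, 0.02, 0.9, -0.05, 0.5, 0.27, -0.71,
         0.66, -0.13, 0.08, 0.44, -0.58, 0.19, 0.31, -0.25]
deq = quantise_block(block)
err = max(abs(a - b) for a, b in zip(block, deq))
print(f"max abs error: {err:.4f}")
```

The small shared block is what keeps the error tolerable at 4 bits: one outlier only inflates the scale of its own 16 values, not the whole tensor, which is why the trend toward lower precision comes paired with block scaling.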
[image]
The Salvation Era: Optical Computing
Hardware limits have made a new approach necessary but also, through driving quantisation, possible.
Quantisation provides linear performance improvements in digital systems but quadratic improvements in optical computing, due to its analog nature. FP8 represents the inflection point at which optical approaches can execute AI workloads accurately while significantly exceeding digital compute performance. FP4 cements the route for Optical Compute to become the dominant paradigm for AI compute going forwards. Further quantisation will extend this advantage.
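The linear-versus-quadratic contrast is easy to tabulate. Note that the quadratic exponent for optical systems is the document's claim about analog compute, not a result derived here; the digital line follows from the bandwidth argument made earlier (halving precision roughly doubles effective bandwidth).

```python
# Toy comparison of the claimed scaling of quantisation gains: linear in
# 1/bits for bandwidth-bound digital systems, quadratic for analog optical
# systems (the quadratic exponent is the text's claim). Illustrative only.
BASELINE_BITS = 16
for bits in (16, 8, 4, 2):
    reduction = BASELINE_BITS / bits
    digital = reduction           # linear gain from halving precision
    optical = reduction ** 2      # claimed quadratic gain in analog optics
    print(f"{bits:>2}-bit: digital x{digital:.0f}, optical x{optical:.0f}")
```

On this framing, the gap between the two columns widens at every precision step, which is why the text treats FP8 as the crossover and each further quantisation step as compounding the optical advantage.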
The Optical Tensor Processing Unit (OTPU)
Building competitive AI accelerators requires meeting a rigorous performance threshold. To create a strong enough incentive to drive adoption, novel accelerators must achieve a performance advantage great enough to overcome the cost, complexity and inconvenience of integrating a new hardware approach. Dylan Patel of SemiAnalysis has suggested a new accelerator would have to demonstrate at least a 5x performance advantage over incumbent solutions: a nearly impossible target for a startup in the digital domain, given restricted access to HBM, frontier fab nodes and advanced packaging, as well as integration challenges, bill-of-materials disadvantages and software optimisation gaps.
Optical computing offers a fundamentally different approach. The OTPU addresses the memory wall through a novel memory architecture uniquely enabled by Flux's free-space optical design. This architecture delivers order-of-magnitude improvements in both memory bandwidth and capacity, improvements that cannot be replicated in traditional digital compute architectures.
Without optical solutions, our compute cost curve analysis shows that the current trajectory will lead to severe compute shortages, constraining AI deployment, limiting economic viability and delaying the broad distribution of advanced AI capabilities pending years of additional infrastructure buildout.
The Future: What’s Next
The gap between AI inference demand and digital compute supply will widen substantially through the rest of this decade. Memory bandwidth constraints impose fundamental limits on token generation capacity regardless of compute improvements. Without addressing these proximal constraints, superintelligent systems cannot be widely deployed, generating a huge opportunity cost: low-quality uses of AI proliferate, while applications that would have a positive, transformational impact on society remain economically unviable.
However, quantisation trends driven by these constraints create the conditions where optical computing architectures will deliver performance improvements beyond what digital optimisation can achieve.
The compute shortage is not a temporary bottleneck but a structural challenge requiring architectural innovation. Optical computing, enabled by recent quantisation advances, offers a path to abundant inference compute capacity—the prerequisite for economically viable deployment of advanced AI systems.