Achieving Microsecond AI Inference for Trading Decisions
From Latency Metrics to Decision Outcomes
In audited STAC-ML benchmarks, a system featuring VOLLO achieved inference latencies as low as 2 microseconds at the 99th percentile, less than half that of competing solutions.
On paper, that’s a performance metric.
In practice, it determines whether a trading decision can still be profitable.
In modern electronic markets, the challenge is no longer just building better models, but ensuring their outputs arrive in time to act on live market conditions. Many firms already operate highly sophisticated machine learning pipelines, but their effectiveness is constrained by how quickly those models can generate actionable outputs.
Historically, trading systems operated at two extremes: simple, highly optimised strategies executing in nanoseconds, or more complex, human-driven decisions unfolding over seconds, minutes, or longer. AI is now closing that gap, enabling more intelligent, model-driven decisions to be made automatically in real time.
As a result, intelligent decision-making has become speed-critical. The value of a prediction is now directly tied to how quickly it can be acted on, and even microseconds can determine whether an opportunity is captured or missed.
The STAC-ML (Markets) Inference benchmark, designed by quantitative engineers and technologists from leading financial institutions, exists precisely to measure this challenge under realistic conditions.
It provides a rare, standardised view into how different architectures perform when exposed to real-time market data and production-grade models.
In the STAC-ML benchmarks, latency is the primary point of comparison between competing systems. While throughput and efficiency are also measured, the focus is on achieving the lowest possible inference latency under realistic trading conditions.
This reflects how machine learning is now being applied in trading: as models are used to drive real-time decisions, their value depends on whether their outputs arrive in time to act on an opportunity and generate alpha.
Why Inference Latency Determines Trading Decisions
In high-frequency and systematic trading environments, every decision exists within a shrinking time window.
A model may identify:
- a pricing inefficiency
- a shift in supply and demand
But if inference completes too slowly, that signal becomes irrelevant.
The trade is no longer available. The edge disappears.
This creates a fundamental constraint:
A prediction that arrives too late is indistinguishable from no prediction at all.
This is why trading firms are increasingly focused on decision latency, not just execution latency.
Inference sits directly within the decision loop:
Market Data → Feature Extraction → Model Inference → Decision → Execution
Even microsecond delays at the inference stage propagate through this loop, affecting:
- fill quality
- slippage
- hit rate of signals
- overall strategy profitability
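The loop above can be sketched in code. The stage implementations below are hypothetical stand-ins (the function names and the toy feature set are not from the benchmark); the point is that total decision latency is the sum of the stage latencies, so delay at the inference stage propagates to the whole loop.

```python
import time

# Hypothetical stage implementations -- illustrative stand-ins only.
def extract_features(tick):
    # A toy feature vector: bid, ask, and spread.
    return [tick["bid"], tick["ask"], tick["ask"] - tick["bid"]]

def run_inference(features):
    # Placeholder for a real model call (e.g. a request to an accelerator).
    return sum(features) > 0

def measure_decision_latency(tick):
    """Time each stage of the decision loop, in microseconds."""
    t0 = time.perf_counter_ns()
    features = extract_features(tick)
    t1 = time.perf_counter_ns()
    signal = run_inference(features)
    t2 = time.perf_counter_ns()
    return {
        "feature_us": (t1 - t0) / 1_000,
        "inference_us": (t2 - t1) / 1_000,
        "total_us": (t2 - t0) / 1_000,
        "signal": signal,
    }

stats = measure_decision_latency({"bid": 100.00, "ask": 100.02})
```

In a production system each stage would be instrumented with hardware timestamps rather than `time.perf_counter_ns`, but the budget arithmetic is the same: the inference stage consumes part of a fixed end-to-end allowance.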
For proprietary trading firms, lower latency translates directly into higher profit.
Reducing inference latency is therefore not an optimisation exercise; it is a direct lever on trading performance.
The Role of STAC-ML in Measuring Real-World Performance
Benchmarking AI inference in financial markets is not trivial.
Unlike synthetic benchmarks, trading workloads require:
- real-time streaming data
- strict determinism
- consistent tail latency (not just averages)
- accurate model outputs under pressure
The STAC-ML Markets (Inference) benchmark was created to reflect these conditions.
It evaluates:
- latency (including 99th percentile performance)
- throughput
- resource efficiency
- model accuracy
Crucially, it does so using standardised models and datasets, enabling meaningful comparisons across different technology stacks.
This makes STAC one of the most credible indicators of real-world performance in financial machine learning.
Within this framework, VOLLO achieved less than half the latency of any other audited system in the STAC-ML benchmark. This is not an isolated result: it reflects an inference pipeline engineered for deterministic, real-time decision-making under production trading conditions.
Why Traditional Architectures Struggle with Low Latency AI Inference
Most machine learning inference pipelines were not designed for microsecond-level constraints.
Traditional approaches, particularly GPU-based systems, optimise for:
- throughput
- batch processing
- large-scale parallel workloads
These characteristics are valuable in many domains. But in trading, they introduce structural limitations.
Data Movement Overhead
Inference often requires moving data between:
- network interfaces
- CPU memory
- accelerator memory
Each transfer introduces latency.
Non-Deterministic Execution
Many architectures exhibit variability in processing time:
- queueing delays
- scheduling overhead
- contention across workloads
For trading systems, tail latency matters more than average latency.
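This distinction is easy to see numerically. The sketch below (with entirely synthetic, illustrative latency numbers) simulates a fast common path plus occasional scheduling and queueing spikes: the mean looks healthy while the 99th percentile exposes the jitter.

```python
import random
import statistics

random.seed(0)

# Synthetic inference latencies in microseconds (illustrative only):
# a tight common path around 2 us, plus 2% queueing/scheduling spikes.
latencies = [random.gauss(2.0, 0.1) for _ in range(980)]
latencies += [random.uniform(20.0, 50.0) for _ in range(20)]

def percentile(samples, p):
    """Nearest-rank percentile."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

mean_us = statistics.mean(latencies)
p99_us = percentile(latencies, 99)
# The mean stays close to the common path; the p99 lands on the spikes.
```

A system sized to the mean would badly mis-estimate how often a signal arrives too late, which is why benchmarks such as STAC-ML report 99th-percentile latency rather than averages.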
Architecting for Microsecond Decisioning
Achieving microsecond inference requires rethinking where and how models run.
In the STAC benchmark system, VOLLO operates on FPGA-based infrastructure, enabling a fundamentally different approach.
Deterministic Execution
FPGA-based systems provide predictable processing paths, consistent latency profiles, and minimal jitter.
This is critical for maintaining performance at the 99th percentile, not just on average.
Hardware-Level Optimisation
The benchmark system included:
- AMD Versal™ Premium adaptive SoC
- FPGA-based accelerator hardware
- high-performance server infrastructure
These components allow inference pipelines to be deeply optimised, tightly integrated, and purpose-built for latency-sensitive workloads.
Model Flexibility Without Latency Trade-Offs
Importantly, this approach does not restrict model choice. VOLLO supports:
- decision trees
- neural networks
- advanced architectures such as state space models
In practice, firms still balance model sophistication against latency constraints, often using a sandbox or virtual environments to tune models for a given latency target. Approaches like this expand that boundary, allowing more complex models to run within latency budgets that would previously have required simplification.
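That tuning exercise can be expressed as a simple filter over sandbox measurements. The model names and the per-model p99 numbers below are purely hypothetical; the shape of the check is what matters.

```python
# Hypothetical per-model p99 latencies (microseconds) measured in a
# sandbox evaluation -- names and numbers are illustrative only.
candidate_models = {
    "logistic_baseline": 0.8,
    "gbdt_128_trees": 1.6,
    "lstm_2_layer": 2.0,
    "transformer_small": 9.5,
}

def models_within_budget(measured_p99_us, budget_us):
    """Keep only models whose measured tail latency fits the budget,
    ordered fastest first."""
    return sorted(
        (name for name, p99 in measured_p99_us.items() if p99 <= budget_us),
        key=lambda name: measured_p99_us[name],
    )

# With a hypothetical 3 microsecond decision budget, the richest model
# is excluded unless its inference latency can be brought down.
deployable = models_within_budget(candidate_models, budget_us=3.0)
```

Lowering inference latency effectively moves the budget line, letting models migrate from the excluded set into the deployable one without simplification.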
Revisiting the Results: What 2 Microseconds Enables
Returning to the STAC-ML (Inference) benchmark results (see STAC Report SUT ID MRTL260323):
- 2µs latency (99th percentile; LSTM_A model; NMI=1)
- Less than half the latency of alternative audited systems
- Strong throughput and energy efficiency performance
These metrics matter because they unlock new capabilities in trading decision-making.
1. More Signals Become Actionable
Previously marginal signals (discarded due to latency) can now be used in live trading.
2. More Complex Models Can Run in Real Time
Firms can deploy richer models without sacrificing speed.
3. Decision Confidence Increases
Deterministic latency enables:
- predictable system behaviour
- tighter integration with execution strategies
4. Competitive Edge Compounds
Small latency improvements accumulate across:
- thousands of decisions
- millions of trades
Over time, this translates into measurable performance gains.
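A toy calculation makes the compounding concrete. Every figure below is illustrative, not drawn from the benchmark or from any firm's data:

```python
# Illustrative only: a tiny per-trade improvement applied at scale.
extra_edge_per_trade_usd = 0.002   # hypothetical gain from better fills
trades_per_day = 200_000           # hypothetical trading volume
trading_days_per_year = 250

annual_gain = (extra_edge_per_trade_usd
               * trades_per_day
               * trading_days_per_year)
# A fraction of a cent per trade becomes a six-figure annual difference.
```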
From Inference Acceleration to Decision Infrastructure
What these results ultimately point to is a broader shift: AI inference is becoming part of core trading infrastructure, not an external component.
In this model:
- inference runs alongside market data
- decisions are generated in real time
- execution systems operate on continuously updated signals
This creates a closed-loop decision system, where data is processed instantly, models respond immediately, and actions are taken without delay.
As models continue to evolve, becoming more complex and more central to trading strategies, the importance of this architecture will only increase.
The Future of Low Latency AI in Trading
Looking forward, for firms seeking to generate alpha with AI-driven automated trading, several trends are becoming clear:
1. Microsecond Inference Will Become Table Stakes
Firms that cannot operate at this speed will be structurally disadvantaged.
2. Infrastructure Will Converge Around Data Proximity
Inference, networking, and execution will increasingly converge into unified systems.
3. Model Complexity Will Increase
As latency constraints are reduced, firms will deploy:
- deeper models
- more adaptive strategies
- real-time learning systems
4. Benchmarking Will Drive Adoption
Independent validation (such as STAC) will continue to play a key role in:
- reducing perceived risk
- accelerating adoption
- standardising performance expectations
Evaluating Your Own Models in a Low Latency Environment
For many trading teams, the key question is no longer “can we improve our models?” but “can we run our models fast enough to act on them?”
The most effective way to answer this is through direct evaluation.
VOLLO enables teams to test their own models in a low latency environment, without requiring FPGA expertise or changes to existing ML workflows.
This allows firms to:
- benchmark current performance
- quantify latency improvements
- understand the impact on decisioning
When Latency Defines the Decision
The STAC-ML benchmark results make one thing clear:
Low latency AI inference is no longer an optimisation layer; it is a determinant of trading outcomes.
At microsecond timescales:
- signals either exist or disappear
- decisions are either valid or obsolete
- opportunities are either captured or missed
And increasingly, that boundary is defined by inference latency.
The full benchmark results are available in the STAC Report (SUT ID MRTL260323) at http://www.STACresearch.com/MRTL260323.