Achieving Microsecond AI Inference for Trading Decisions
From Latency Metrics to Decision Outcomes
In audited STAC-ML benchmarks, a system featuring VOLLO achieved inference latencies as low as 2 microseconds at the 99th percentile, less than half that of competing solutions.
On paper, that’s a performance metric.
In practice, it determines whether a trading decision can still be profitable.
In modern electronic markets, the challenge is no longer just building better models, but ensuring their outputs arrive in time to act on live market conditions. Many firms already operate highly sophisticated machine learning pipelines, but their effectiveness is constrained by how quickly those models can generate actionable outputs.
Historically, trading systems operated at two extremes: simple, highly optimised strategies executing in nanoseconds, or more complex, human-driven decisions unfolding over seconds, minutes, or longer. AI is now closing that gap, enabling more intelligent, model-driven decisions to be made automatically in real time.
As a result, intelligent decision-making has become speed-critical. The value of a prediction is now directly tied to how quickly it can be acted on, and even microseconds can determine whether an opportunity is captured or missed.
The STAC-ML (Markets) Inference benchmark, designed by quantitative engineers and technologists from leading financial institutions, exists precisely to measure this challenge under realistic conditions.
It provides a rare, standardised view into how different architectures perform when exposed to real-time market data and production-grade models.
In the STAC-ML benchmarks, latency is the primary point of comparison between competing systems. While throughput and efficiency are also measured, the focus is on achieving the lowest possible inference latency under realistic trading conditions.
This reflects how machine learning is now being applied in trading: as models are used to drive real-time decisions, their value depends on whether their outputs arrive in time to act on an opportunity and generate alpha.
Why Inference Latency Determines Trading Decisions
In high-frequency and systematic trading environments, every decision exists within a shrinking time window.
A model may identify:
- a pricing inefficiency
- a shift in supply and demand
But if inference completes too slowly, that signal becomes irrelevant.
The trade is no longer available. The edge disappears.
This creates a fundamental constraint:
A prediction that arrives too late is indistinguishable from no prediction at all.
This is why trading firms are increasingly focused on decision latency, not just execution latency.
Inference sits directly within the decision loop:
Market Data → Feature Extraction → Model Inference → Decision → Execution
Even microsecond delays at the inference stage propagate through this loop, affecting:
- fill quality
- slippage
- hit rate of signals
- overall strategy profitability
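The loop above can be sketched in code. The stage implementations below are hypothetical stand-ins (the function names and the toy feature set are not from the benchmark); the point is that total decision latency is the sum of the stage latencies, so delay at the inference stage propagates to the whole loop.

```python
import time

# Hypothetical stage implementations -- illustrative stand-ins only.
def extract_features(tick):
    # A toy feature vector: bid, ask, and spread.
    return [tick["bid"], tick["ask"], tick["ask"] - tick["bid"]]

def run_inference(features):
    # Placeholder for a real model call (e.g. a request to an accelerator).
    return sum(features) > 0

def measure_decision_latency(tick):
    """Time each stage of the decision loop, in microseconds."""
    t0 = time.perf_counter_ns()
    features = extract_features(tick)
    t1 = time.perf_counter_ns()
    signal = run_inference(features)
    t2 = time.perf_counter_ns()
    return {
        "feature_us": (t1 - t0) / 1_000,
        "inference_us": (t2 - t1) / 1_000,
        "total_us": (t2 - t0) / 1_000,
        "signal": signal,
    }

stats = measure_decision_latency({"bid": 100.00, "ask": 100.02})
```

In a production system each stage would be instrumented with hardware timestamps rather than `time.perf_counter_ns`, but the budget arithmetic is the same: the inference stage consumes part of a fixed end-to-end allowance.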
For proprietary trading firms, lower latency translates directly into higher profit.
Reducing inference latency is therefore not an optimisation exercise; it is a direct lever on trading performance.
The Role of STAC-ML in Measuring Real-World Performance
Benchmarking AI inference in financial markets is not trivial.
Unlike synthetic benchmarks, trading workloads require:
- real-time streaming data
- strict determinism
- consistent tail latency (not just averages)
- accurate model outputs under pressure
The STAC-ML Markets (Inference) benchmark was created to reflect these conditions.
It evaluates:
- latency (including 99th percentile performance)
- throughput
- resource efficiency
- model accuracy
Crucially, it does so using standardised models and datasets, enabling meaningful comparisons across different technology stacks.
This makes STAC one of the most credible indicators of real-world performance in financial machine learning.
Within this framework, VOLLO achieved less than half the latency of any other audited system in the STAC-ML benchmark. This is not an isolated result: it reflects an inference pipeline engineered for deterministic, real-time decision-making under production trading conditions.
Why Traditional Architectures Struggle with Low Latency AI Inference
Most machine learning inference pipelines were not designed for microsecond-level constraints.
Traditional approaches, particularly GPU-based systems, optimise for:
- throughput
- batch processing
- large-scale parallel workloads
These characteristics are valuable in many domains. But in trading, they introduce structural limitations.
Data Movement Overhead
Inference often requires moving data between:
- network interfaces
- CPU memory
- accelerator memory
Each transfer introduces latency.
Non-Deterministic Execution
Many architectures exhibit variability in processing time:
- queueing delays
- scheduling overhead
- contention across workloads
For trading systems, tail latency matters more than average latency.
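This distinction is easy to see numerically. The sketch below (with entirely synthetic, illustrative latency numbers) simulates a fast common path plus occasional scheduling and queueing spikes: the mean looks healthy while the 99th percentile exposes the jitter.

```python
import random
import statistics

random.seed(0)

# Synthetic inference latencies in microseconds (illustrative only):
# a tight common path around 2 us, plus 2% queueing/scheduling spikes.
latencies = [random.gauss(2.0, 0.1) for _ in range(980)]
latencies += [random.uniform(20.0, 50.0) for _ in range(20)]

def percentile(samples, p):
    """Nearest-rank percentile."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

mean_us = statistics.mean(latencies)
p99_us = percentile(latencies, 99)
# The mean stays close to the common path; the p99 lands on the spikes.
```

A system sized to the mean would badly mis-estimate how often a signal arrives too late, which is why benchmarks such as STAC-ML report 99th-percentile latency rather than averages.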
Architecting for Microsecond Decisioning
Achieving microsecond inference requires rethinking where and how models run.
In the STAC benchmark system, VOLLO operates on FPGA-based infrastructure, enabling a fundamentally different approach.
Deterministic Execution
FPGA-based systems provide predictable processing paths, consistent latency profiles, and minimal jitter.
This is critical for maintaining performance at the 99th percentile, not just on average.
Hardware-Level Optimisation
The benchmark system included:
- AMD Versal™ Premium adaptive SoC
- FPGA-based accelerator hardware
- high-performance server infrastructure
These components allow inference pipelines to be deeply optimised, tightly integrated, and purpose-built for latency-sensitive workloads.
Model Flexibility Without Latency Trade-Offs
Importantly, this approach does not restrict model choice. VOLLO supports:
- decision trees
- neural networks
- advanced architectures such as state space models
In practice, firms still balance model sophistication against latency constraints, often using a sandbox or virtual environments to tune models for a given latency target. Approaches like this expand that boundary, allowing more complex models to run within latency budgets that would previously have required simplification.
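That tuning exercise can be expressed as a simple filter over sandbox measurements. The model names and the per-model p99 numbers below are purely hypothetical; the shape of the check is what matters.

```python
# Hypothetical per-model p99 latencies (microseconds) measured in a
# sandbox evaluation -- names and numbers are illustrative only.
candidate_models = {
    "logistic_baseline": 0.8,
    "gbdt_128_trees": 1.6,
    "lstm_2_layer": 2.0,
    "transformer_small": 9.5,
}

def models_within_budget(measured_p99_us, budget_us):
    """Keep only models whose measured tail latency fits the budget,
    ordered fastest first."""
    return sorted(
        (name for name, p99 in measured_p99_us.items() if p99 <= budget_us),
        key=lambda name: measured_p99_us[name],
    )

# With a hypothetical 3 microsecond decision budget, the richest model
# is excluded unless its inference latency can be brought down.
deployable = models_within_budget(candidate_models, budget_us=3.0)
```

Lowering inference latency effectively moves the budget line, letting models migrate from the excluded set into the deployable one without simplification.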
Revisiting the Results: What 2 Microseconds Enables
Returning to the STAC-ML (Inference) benchmark results (see STAC Report SUT ID MRTL260323):
- 2µs latency (99th percentile; LSTM_A model; NMI=1)
- Less than half the latency of alternative audited systems
- Strong throughput and energy efficiency performance
These metrics matter because they unlock new capabilities in trading decision-making.
1. More Signals Become Actionable
Previously marginal signals (discarded due to latency) can now be used in live trading.
2. More Complex Models Can Run in Real Time
Firms can deploy richer models without sacrificing speed.
3. Decision Confidence Increases
Deterministic latency enables:
- predictable system behaviour
- tighter integration with execution strategies
4. Competitive Edge Compounds
Small latency improvements accumulate across:
- thousands of decisions
- millions of trades
Over time, this translates into measurable performance gains.
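A toy calculation makes the compounding concrete. Every figure below is illustrative, not drawn from the benchmark or from any firm's data:

```python
# Illustrative only: a tiny per-trade improvement applied at scale.
extra_edge_per_trade_usd = 0.002   # hypothetical gain from better fills
trades_per_day = 200_000           # hypothetical trading volume
trading_days_per_year = 250

annual_gain = (extra_edge_per_trade_usd
               * trades_per_day
               * trading_days_per_year)
# A fraction of a cent per trade becomes a six-figure annual difference.
```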
From Inference Acceleration to Decision Infrastructure
What these results ultimately point to is a broader shift: AI inference is becoming part of core trading infrastructure, not an external component.
In this model:
- inference runs alongside market data
- decisions are generated in real time
- execution systems operate on continuously updated signals
This creates a closed-loop decision system, where data is processed instantly, models respond immediately, and actions are taken without delay.
As models continue to evolve, becoming more complex and more central to trading strategies, the importance of this architecture will only increase.
The Future of Low Latency AI in Trading
Looking forward, for firms seeking to generate alpha with AI-driven automated trading, several trends are becoming clear:
1. Microsecond Inference Will Become Table Stakes
Firms that cannot operate at this speed will be structurally disadvantaged.
2. Infrastructure Will Converge Around Data Proximity
Inference, networking, and execution will increasingly converge into unified systems.
3. Model Complexity Will Increase
As latency constraints are reduced, firms will deploy:
- deeper models
- more adaptive strategies
- real-time learning systems
4. Benchmarking Will Drive Adoption
Independent validation (such as STAC) will continue to play a key role in:
- reducing perceived risk
- accelerating adoption
- standardising performance expectations
Evaluating Your Own Models in a Low Latency Environment
For many trading teams, the key question is no longer “can we improve our models?” but “can we run our models fast enough to act on them?”
The most effective way to answer this is through direct evaluation.
VOLLO enables teams to test their own models in a low latency environment, without requiring FPGA expertise or changes to existing ML workflows.
This allows firms to:
- benchmark current performance
- quantify latency improvements
- understand the impact on decisioning
When Latency Defines the Decision
The STAC-ML benchmark results make one thing clear:
Low latency AI inference is no longer an optimisation layer; it is a determinant of trading outcomes.
At microsecond timescales:
- signals either exist or disappear
- decisions are either valid or obsolete
- opportunities are either captured or missed
And increasingly, that boundary is defined by inference latency.
The full benchmark results are available in the STAC Report (SUT ID MRTL260323) at http://www.STACresearch.com/MRTL260323.