Julian Mack, Jeevan Singh Bhoot, Iria Pantazi, Conor Williams, Theo Ehrenborg & Michal Borsky

Natural Conversational AI requires human-like response times as well as world-class accuracy. In streaming Automatic Speech Recognition (ASR) systems, this creates a fundamental tension between speed and precision. Systems that wait for additional future context achieve better recognition results but introduce more latency, while those that process speech immediately reduce delays but typically suffer a 20-25% accuracy penalty compared to their offline counterparts [1].

CAIMAN-ASR has renegotiated that trade-off with its latest milestone. On the Hugging Face Open ASR Leaderboard we deliver better accuracy than an offline Whisper model 3x our size, whilst providing transcription with 4x lower latency than our closest API competitor.

So how did we achieve this?

Accuracy Improvements

Figure 1: CAIMAN models have a Pareto frontier comparable to the Whisper models. The CAIMAN large model running with a highly pruned beam search has a lower Word Error Rate (WER) than whisper-medium.en, a model 3x its size. See the CAIMAN-ASR performance figures documentation for more details. Whisper model WERs are taken from the public Open ASR Leaderboard and use greedy decoding.

As shown in Figure 1, CAIMAN models are competitive with the Whisper models. This is despite the fact that:

  • CAIMAN performs streaming rather than offline ASR, meaning the models see at most 60ms of future context. Whisper, by contrast, sees all available future context (up to 30 seconds)
  • CAIMAN uses LSTMs rather than self-attention. LSTMs tend to be less accurate, but their recurrent operations unlock the ultra-low latencies shown in Table 1 when run on Achronix FPGAs

Our recent accuracy improvements are due to:

  1. Improved training data: we have refined our audio filtering, auto-labelling and synthesis pipelines, improving the quality and diversity of our training data. Our production English models are now trained on 44k hrs of high-quality, mostly open-source data described here
  2. Adaptive beam search: our highly optimized beam search explores many options when the model is unsure but aggressively prunes when confidence is high, balancing thoroughness with efficiency (a sketch of this pruning idea follows this list)
  3. Random state passing: to improve model accuracy for long utterances, we use random state passing [2]. This technique simulates the concatenation of random utterances during training, enabling the model to maintain context over longer durations and reducing WER in long-form speech recognition (see the sketch after this list). The code to support this (along with the rest of our training code) is released open-source: [docs] [code]
  4. Checkpoint averaging: like others, we find that checkpoint averaging improves accuracy. We average the N best checkpoints from the same training run, selected by validation WER [3] (sketched after this list)
  5. Training efficiency: we have significantly increased the throughput of our training pipeline, in part by adding a custom CUDA transducer loss function. This loss function (which has PyTorch bindings and is released open-source) has optimizations over and above those in Nvidia Apex’s upstream (see the illustrative loss call after this list). Throughput improvements don’t improve WER directly, but they have allowed us to run many more experiments over the last few months
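
The decoder internals are not reproduced here, but a minimal sketch of the adaptive pruning idea in item 2 follows; the function name, log-probability margin and beam cap are illustrative, not CAIMAN-ASR’s actual settings:

```python
def adaptive_prune(hypotheses, max_beam=4, logp_margin=2.5):
    """Keep only the hypotheses worth expanding at this decoding step.

    `hypotheses` is a list of (tokens, log_prob) pairs. When the best
    hypothesis is far more likely than the rest (high confidence), the
    margin test discards almost everything; when scores are close
    (low confidence), up to `max_beam` hypotheses survive.
    """
    hypotheses = sorted(hypotheses, key=lambda h: h[1], reverse=True)
    best_logp = hypotheses[0][1]
    # Drop hypotheses whose log-probability trails the best by more than the
    # margin, then cap the beam at `max_beam` survivors.
    survivors = [h for h in hypotheses if best_logp - h[1] <= logp_margin]
    return survivors[:max_beam]


# A confident step keeps a single hypothesis; an uncertain step keeps several.
confident = [(["hello"], -0.1), (["hollow"], -4.0), (["hallow"], -5.2)]
uncertain = [(["there"], -1.2), (["their"], -1.4), (["they're"], -1.5)]
print(len(adaptive_prune(confident)))  # -> 1
print(len(adaptive_prune(uncertain)))  # -> 3
```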
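
Random state passing (item 3) follows [2]. A minimal sketch of the idea, with illustrative names and a made-up 50% passing probability, is:

```python
import random

import torch


def initial_lstm_states(num_layers, batch_size, hidden_size, state_cache,
                        pass_prob=0.5):
    """Return (h0, c0) for an LSTM at the start of a training utterance.

    With probability `pass_prob`, reuse states cached from the end of a
    previously seen, unrelated utterance (shapes are assumed to match),
    which simulates training on the concatenation of random utterances.
    Otherwise, start from zeros as usual.
    """
    if state_cache and random.random() < pass_prob:
        h0, c0 = random.choice(state_cache)
        return h0.detach(), c0.detach()
    zeros = torch.zeros(num_layers, batch_size, hidden_size)
    return zeros, zeros.clone()


# During training, the final (h, c) of each utterance would be appended to
# `state_cache` (detached from the autograd graph) for later batches to reuse.
```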
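
Checkpoint averaging (item 4) is cheap to implement. A minimal sketch, assuming each checkpoint is a floating-point state_dict saved with torch.save (the file names are illustrative), is:

```python
import torch


def average_checkpoints(paths):
    """Average the parameters of the N best checkpoints from one training run.

    `paths` would point at the N checkpoints with the lowest validation WER.
    """
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}


# model.load_state_dict(average_checkpoints(["ep48.pt", "ep49.pt", "ep50.pt"]))
```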
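
The custom CUDA transducer loss and its PyTorch bindings live in the open-source release mentioned in item 5. The snippet below is not that API; it uses torchaudio’s stock RNN-T loss to show the inputs any transducer loss consumes: joint-network logits over a (time x target) lattice plus the corresponding lengths.

```python
import torch
import torchaudio

# Illustrative shapes: batch of 2, 50 encoder frames, 10 target tokens,
# a vocabulary of 128 tokens plus blank (index 0).
B, T, U, V = 2, 50, 10, 129
logits = torch.randn(B, T, U + 1, V, requires_grad=True)
targets = torch.randint(1, V, (B, U), dtype=torch.int32)
logit_lengths = torch.full((B,), T, dtype=torch.int32)
target_lengths = torch.full((B,), U, dtype=torch.int32)

# torchaudio's RNN-T loss stands in for the custom CUDA kernel mentioned above,
# which exposes its own PyTorch bindings rather than this exact interface.
loss = torchaudio.functional.rnnt_loss(
    logits, targets, logit_lengths, target_lengths, blank=0
)
loss.backward()
```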

We have more WER improvements in the pipeline, so watch this space!

Latency Advantage

To put these WERs in context: as far as we are aware, CAIMAN-ASR provides the lowest-latency ASR solution on the market. We measure the total ‘User-Perceived Latency’ (UPL): the time between the speaker finishing a word and the transcription for that word being received, as shown in Figure 2.

Figure 2: User-perceived latency for the word ‘Hello’ is marked above. UPL includes everything that affects how long a user waits, including compute latency, queuing latency and network latency.

A selection of streaming ASR API provider UPLs is shown in Table 1. CAIMAN-ASR latencies were measured by an independent third party over a 30-day stress test, while the ASR API provider results are spot measurements made by Myrtle.ai. The Python code to measure CAIMAN latency is open-source: [docs] [code].

| Streaming ASR solution | Realtime streams per accelerator | Median latency (sec) | p90 latency (sec) | p99 latency (sec) | WER % (LibriSpeech dev-clean) |
|---|---|---|---|---|---|
| CAIMAN base, greedy decoding (highest throughput) | 2000 | 0.15 | 0.31 | 0.46 | 3.75% |
| CAIMAN large, beam=4 (most accurate) | 500 | 0.16 | 0.43 | 0.97 | 2.96% |
| Hyperscaler #1 | n/a | 0.64 | 0.94 | 3.17 | 2.89% |
| Hyperscaler #2 | n/a | 0.82 | 1.04 | 1.36 | 4.99% |
| ASR provider #3 | n/a | 0.68 | 1.14 | 2.90 | 3.20% |
| ASR provider #4 | n/a | 1.11 | 1.42 | 3.28 | 3.82% |
Table 1: User-perceived latency measurements for CAIMAN and four leading streaming ASR API providers. CAIMAN’s median latency is 4x lower than the nearest competitor’s. All latencies include network latency and are averages per word across the dataset. For each hyperscaler and ASR provider, all streaming models were benchmarked and, in every case, the most accurate model turned out to also be the lowest-latency model. Benchmarks were run in March 2025.
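
The released measurement code is linked above; as a rough illustration of the UPL definition in Figure 2, a per-word calculation might look like the sketch below. The inputs (word end offsets from a forced alignment, wall-clock receipt times from the streaming API) and all names are assumptions for this example, not the published tooling.

```python
def user_perceived_latencies(word_end_offsets, stream_start, received_at):
    """Compute per-word UPL (seconds) for one utterance streamed in real time.

    `word_end_offsets` maps each word to the time (sec) at which it ends
    within the audio, e.g. from a forced alignment. `stream_start` is the
    wall-clock time at which real-time streaming of the audio began.
    `received_at` maps each word to the wall-clock time its transcription
    came back from the ASR service.
    """
    return {
        word: received_at[word] - (stream_start + audio_end)
        for word, audio_end in word_end_offsets.items()
    }


# Example: the word ends 1.20s into the audio and its transcript arrives
# 1.35s after streaming began, giving a UPL of roughly 0.15s.
print(user_perceived_latencies({"hello": 1.20}, 0.0, {"hello": 1.35}))
```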

Conversation feels real-time when a response is received within 0.3s; anything longer than this is experienced as a significant ‘lag’. As shown in Table 1, CAIMAN-ASR is the only system that provides streaming latencies within this window. By combining our solution with downstream technologies, CAIMAN-ASR latencies unlock Conversational AI applications like realtime captioning, live translation and fully automated call center agents. In the coming months, Myrtle.ai will add seamless integration with CAIMAN-LLM, a fast-response large language model that can facilitate these downstream workloads.

Cost-Efficient

Beyond its speed and accuracy, CAIMAN-ASR offers a significant cost advantage over a GPU for streaming ASR workloads due to the efficiency of Achronix FPGAs in low-latency scenarios. Please get in touch to see a Total Cost of Ownership comparison.

Summary

CAIMAN-ASR is available today. The solution redefines the latency-accuracy trade-off by significantly improving WER while maintaining industry-leading streaming latencies. To find out more about how CAIMAN-ASR enables real-time conversational AI at lower cost than GPUs, contact us at speech@myrtle.ai.

References

[1] R. C. van Dalen, “Globally normalizing the Transducer for streaming speech recognition,” in Proc. ICASSP, Apr. 2025. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/10890301.

[2] A. Narayanan, R. Prabhavalkar, C.-C. Chiu, D. Rybach, T. N. Sainath, and T. Strohman, “Recognizing long-form speech using streaming end-to-end models,” arXiv preprint arXiv:1910.11455, Oct. 2019. [Online]. Available: https://arxiv.org/abs/1910.11455.

[3] P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. G. Wilson, “Averaging weights leads to wider optima and better generalization,” arXiv preprint arXiv:1803.05407, Feb. 2019. [Online]. Available: https://arxiv.org/abs/1803.05407.