

For most of the past decade, AI translation strategy has been a relatively static decision. Teams selected a provider, often Google Translate or DeepL, configured it once, and assumed quality would steadily improve over time.
That assumption no longer holds.
AI translation has entered a phase of rapid, uneven evolution. New models appear all the time, performance shifts by language and content type, and quality gains in one area can coincide with regressions in another.
In this environment, choosing a single engine and hoping for the best becomes a true, measurable risk.
To better understand how today’s leading AI translation systems actually perform under real-world conditions, Localize conducted two independent, proprietary blind studies in 2025.
Rather than testing artificial benchmarks or vendor-curated examples, the studies were designed to mirror the kinds of content enterprises translate every day, and to evaluate quality without brand bias.
The findings point to a fundamental shift in how localization teams should think about AI translation in 2026 and beyond.
Most public comparisons of translation engines rely on:
While useful for basic validation, these approaches don’t reflect how translation quality behaves in production, where content varies widely in structure, tone, and intent and a one-size-fits-all approach simply doesn’t work.
At the same time, enterprise teams rarely have the resources to run large-scale, unbiased evaluations themselves. As a result, engine choices are often driven by:
These are reasonable criteria on paper. But in a landscape where models change monthly, or even weekly, such shortcuts quickly become liabilities.
There’s one truth to keep in mind: translation quality is inherently subjective. Even experienced linguists can be influenced, consciously or not, by expectations about a provider’s reputation or typical strengths.
A blind study removes that influence.
Blind evaluations ask native speakers to rate translations on fluency, precision, and meaning preservation without knowing which engine produced each output. This makes it possible to assess quality as it’s actually perceived, free of brand expectations or reputation.
The goal of Localize’s blind study was not to declare a universal “winner,” but to generate reliable, data-backed insight into:
As this was a blind study, we anonymized all translation outputs before evaluation. Translators didn’t know:
Then, each translation was evaluated independently to minimize cross-contamination or anchoring effects.
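To make the setup concrete, here is a minimal sketch of how outputs might be anonymized and shuffled before being handed to evaluators. This illustrates the general approach, not Localize’s internal tooling, and the data shape is an assumption:

```python
import random
import uuid

def anonymize_outputs(translations):
    """Strip engine identity from translation outputs and shuffle the
    presentation order so evaluators can't infer the source.

    `translations` is a list of dicts like
    {"engine": "engine_a", "source": "...", "target": "..."}.
    Returns (evaluation_items, answer_key), where the key maps each
    random evaluation ID back to its engine for later un-blinding.
    """
    items, answer_key = [], {}
    for t in translations:
        eval_id = uuid.uuid4().hex[:8]      # opaque ID shown to evaluators
        answer_key[eval_id] = t["engine"]   # kept separately, never shown
        items.append({"id": eval_id, "source": t["source"], "target": t["target"]})
    random.shuffle(items)                   # break any ordering cues
    return items, answer_key
```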
Across the June and September 2025 studies, Localize evaluated:
Traditional MT engines:
LLM-based translation systems:
The selection isn’t an exhaustive catalog of every available model, but it does reflect the engines most commonly encountered in enterprise localization workflows today.
We included 6 strategic languages in our study:
For each language, we selected roughly 40 phrases to span:
This mix was intentional. Testing only “easy” sentences hides the failure modes that matter most in production localization.
Native linguists scored each translation on:
The September study used a 1–5 scale to reduce variance and improve interpretability, while the June study used a 1–10 scale. Directional comparisons between the two were normalized to identify improvement, regression, and volatility trends over time.
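The exact normalization method isn’t described here; one simple possibility is a linear rescaling of the June 1–10 scores onto the September 1–5 scale, roughly like this (an illustrative assumption, not the study’s actual formula):

```python
def rescale(score, old_min=1.0, old_max=10.0, new_min=1.0, new_max=5.0):
    """Linearly map a score from one rating scale onto another, e.g. a
    June 1-10 rating onto the September 1-5 scale, so directional
    comparisons (improved / regressed) can be made across studies."""
    return new_min + (score - old_min) * (new_max - new_min) / (old_max - old_min)

# Example: a 7.5 on the June 1-10 scale maps to ~3.89 on the 1-5 scale.
print(round(rescale(7.5), 2))
```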
What we didn't test (and why): We focused on text translation, not document formatting, multimedia localization, or real-time speech translation. We also didn't test domain-specific terminology at scale, as legal, medical, or highly technical content would require separate evaluation. Our goal was to understand general-purpose translation quality across the content types most organizations translate daily.
Across both studies, several patterns emerged.
When aggregated across all six languages in the September study, the highest-performing engines were LLM-based:
Traditional MT engines clustered just below:
Google’s engines, both NMT and LLM variants, ranked lowest in aggregate scores.
The most surprising finding wasn't which engine performed best, but how close the top performers came to human-level quality across routine content.
For straightforward sentences, clear instructions, and well-structured paragraphs, the gap between professional human translation and top-performing AI has narrowed considerably.
Linguists consistently noted that LLM outputs showed natural phrasing, appropriate register, and semantic accuracy that would have been impossible just two years ago. In many cases, AI translations required only minor adjustments to match human quality, and some outputs needed no changes at all.
So, where did linguists note differences? In a few areas, including:
In long-form or ambiguous content, these differences became more pronounced. LLMs handled context across sentences more reliably, while traditional MT systems tended toward literal, sentence-by-sentence translations.
For many customer-facing content types, such as product UI, onboarding flows, support documentation, and marketing copy, linguists frequently described LLM outputs as “publishable with minimal review,” particularly in:
In these scenarios, translation quality itself is no longer the primary constraint. Workflow design is.
That doesn’t mean human oversight is no longer needed. On the contrary, it still plays a critical role in areas such as:
But the role of humans is shifting from “fixing broken translations” to “governing and validating already-strong outputs.”
The aggregated scores tell one story, but the language-specific results tell another: every engine showed meaningful variance across languages. An engine that excelled in Spanish might struggle in Japanese. A model optimized for European languages might produce awkward constructions in Chinese.
It’s tempting to dismiss these as minor differences that won’t matter in the long run, but they reflect fundamental architectural choices about how models prioritize different linguistic structures.
The conclusion? A single-engine setup:
The data strongly supports a portfolio approach, where different engines are routed based on observed performance rather than habit or vendor loyalty.
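In practice, a portfolio approach can start as a routing table keyed by language and content type, populated from whatever quality scores a team has actually measured. A minimal sketch; the engine names, scores, and categories below are placeholders, not results from the study:

```python
# Measured quality scores per (language, content_type) -> {engine: score}.
# Values here are placeholders; in practice they come from ongoing blind
# or automated evaluations.
scores = {
    ("ja", "marketing"): {"engine_llm_a": 4.6, "engine_mt_b": 4.1},
    ("es", "ui_string"): {"engine_llm_a": 4.4, "engine_mt_b": 4.5},
}

def pick_engine(language, content_type, default="engine_mt_b"):
    """Route a segment to the best-scoring engine observed for its
    language and content type, falling back to a default engine when
    no measurements exist yet."""
    candidates = scores.get((language, content_type))
    if not candidates:
        return default
    return max(candidates, key=candidates.get)

print(pick_engine("ja", "marketing"))    # engine_llm_a
print(pick_engine("de", "support_doc"))  # falls back to engine_mt_b
```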
Cost-quality trade-offs are no longer obvious either, and the September cost matrix adds another important dimension. Some of the highest-quality engines were also among the most cost-effective on a per-character basis:
Meanwhile, some lower-performing engines carried higher costs. This undermines the assumption that better quality costs more, and further weakens the case for defaulting to legacy providers.
When comparing the June and September 2025 results, Localize was able to observe directional quality changes over time, not just static rankings.
Across the board, many engines showed modest improvements:
However, not all changes were positive. Google’s engines, for instance (both NMT and LLM variants), showed the highest volatility:
For organizations that value predictability, this volatility represents an operational risk that rarely surfaces without systematic monitoring.
Model updates can quietly change how specific language pairs, sentence structures, or content types are handled, introducing inconsistencies that don’t trigger obvious errors but still affect tone, clarity, and meaning.
Without continuous measurement, these shifts accumulate unnoticed until they surface indirectly through user feedback, brand inconsistency, or downstream rework.
The broader lesson? AI translation performance is fluid, not fixed.
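One way to make that fluidity visible is a simple drift check between benchmark rounds: compare each engine’s average score per language across two evaluation runs and flag drops beyond a tolerance. A hedged sketch, assuming both rounds are reported on the same scale and shaped like the dictionaries below:

```python
def detect_regressions(previous, current, tolerance=0.15):
    """Compare two benchmark rounds and return the (engine, language)
    pairs whose average score dropped by more than `tolerance`.

    `previous` and `current` map (engine, language) -> average score on
    the same scale, e.g. {("engine_a", "ja"): 4.5, ...}."""
    regressions = []
    for key, old_score in previous.items():
        new_score = current.get(key)
        if new_score is not None and old_score - new_score > tolerance:
            regressions.append((key, round(old_score - new_score, 2)))
    return regressions

# Placeholder rounds, not actual study numbers.
june = {("engine_a", "ja"): 4.5, ("engine_a", "zh"): 4.2}
september = {("engine_a", "ja"): 4.6, ("engine_a", "zh"): 3.9}
print(detect_regressions(june, september))  # [(('engine_a', 'zh'), 0.3)]
```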
The difference between a 4.73 and a 4.75 average score is interesting for benchmarking, but it's not what determines translation quality in production.
What determines quality is whether your workflow can consistently route content to the right engine, measure quality automatically, flag problems before they reach customers, and adapt constantly.
The blind study makes one thing clear: AI translation performance is no longer a binary question of “good” or “bad.” Instead, quality varies across languages, content types, and time. That variability affects different teams in different ways and requires different operational responses.
Consistency matters more than peak quality. Even when individual translations are technically correct, subtle shifts in tone, register, or phrasing across languages can undermine the perceived coherence of the product experience.
The study also shows that these shifts are rarely uniform. For instance, an engine may perform well on short UI strings while struggling with longer, context-rich messages. In a single-engine setup, these inconsistencies accumulate quietly, resulting in uneven UX across markets.
Multi-engine routing and continuous quality scoring address this problem at its root. Rather than optimizing for the highest possible score in isolation, they reduce variance across languages and content types.
Marketing content amplifies translation weaknesses more than almost any other content type. Nuance, tone, idiomatic phrasing, and emotional intent all play a direct role in engagement and conversion, and even small deviations can materially affect performance.
The study’s results show that LLM-based translation systems consistently outperform traditional MT engines on exactly these dimensions, particularly in longer-form and customer-facing content. Linguists repeatedly noted stronger semantic understanding, more natural phrasing, and better handling of ambiguity in LLM outputs.
For organizations operating in regulated or high-risk environments, the primary concern is not speed or cost, but control.
The study highlights that the risks don’t come from using AI translation itself, but from using it without visibility. Model updates, language-specific regressions, and subtle shifts in meaning are difficult to detect through manual review alone, especially at scale.
Blind evaluation and automated quality gates allow AI translation to be used responsibly, with evidence rather than assumptions.
The blind study highlights differences between engines and exposes structural weaknesses in how translation workflows are usually designed. The data points us towards 4 practical steps you can use to reduce risk, improve quality, and future-proof your localization strategy.
Not all content carries the same risk, and treating it as if it did is one of the most common and costly mistakes in localization.
Rather than classifying content by volume or format alone, teams should assess content based on impact:
The study shows that translation quality varies most in nuanced, context-heavy content. Instead of applying equal effort to all text, start by identifying the content with the biggest impact and concentrate model selection and review effort there.
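A lightweight way to operationalize this is an impact-tier map that drives how much review each content type receives. The tiers and content types below are hypothetical examples; real tiers should come from your own risk assessment:

```python
# Hypothetical impact tiers; adjust to your own content inventory.
IMPACT_TIERS = {
    "legal_notice":   "high",    # always human-reviewed
    "marketing_copy": "high",
    "product_ui":     "medium",  # spot-checked or gated by score
    "support_doc":    "medium",
    "internal_note":  "low",     # machine translation only
}

REVIEW_POLICY = {
    "high":   "full_human_review",
    "medium": "automated_gate_with_sampling",
    "low":    "no_review",
}

def review_policy(content_type):
    """Return the review policy for a content type based on its impact tier."""
    tier = IMPACT_TIERS.get(content_type, "medium")  # unknown types default to medium
    return REVIEW_POLICY[tier]

print(review_policy("marketing_copy"))  # full_human_review
```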
The data consistently shows that LLM-based translation systems outperform traditional MT engines in scenarios that require:
You can especially see this advantage in languages such as Japanese and Chinese, and in longer-form or marketing-oriented content.
At the same time, the study reinforces that human review is most effective when applied selectively, not universally. You don’t have to review all content equally. You’ll get the best value when human effort focuses on high-impact content, edge cases, and quality validation.
Traditional MT engines still play an important role, particularly for high-volume, low-risk content such as structured UI strings or internal documentation. The key is intentionality: aligning both model choice and level of human oversight with content characteristics, rather than applying a single engine across everything.
One of the clearest lessons from the study is that manual routing does not scale.
Engine performance changes over time, and those changes are often uneven across languages and content types. Manual decision-making, whether through static rules or periodic spot checks, can’t reliably keep up with this variability.
Automation addresses this gap by:
In this model, automation doesn’t replace human judgment. Instead, it ensures that human effort adds value rather than compensating for outdated or brittle workflows.
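Putting these pieces together, an automated quality gate can score each machine-translated segment and flag only below-threshold output for human review, so reviewers see the exceptions rather than everything. A minimal sketch; `score_translation` stands in for whatever automatic quality-estimation method a team uses and is not a real API:

```python
def quality_gate(segments, score_translation, threshold=4.0):
    """Split translated segments into those that pass the automated gate
    and those flagged for human review.

    `segments` is a list of dicts with "source" and "target" text;
    `score_translation` is any callable returning a 1-5 quality estimate."""
    passed, flagged = [], []
    for seg in segments:
        score = score_translation(seg["source"], seg["target"])
        (passed if score >= threshold else flagged).append({**seg, "score": score})
    return passed, flagged

# Usage: only `flagged` segments are routed to human reviewers.
# passed, flagged = quality_gate(batch, score_translation=my_qe_model)
```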
Perhaps the most important shift is conceptual: translation strategy can no longer be “set and forget.”
The June and September study results make it clear that AI translation quality is fluid, and decisions made even a few months ago may no longer reflect current performance.
Teams should treat translation as a living system, with:
Localize’s platform is designed around the reality revealed by the blind study: AI translation quality is dynamic, uneven, and highly dependent on context. The goal isn’t to promote a single “best” engine, but to give teams the infrastructure they need to adapt as models, languages, and content types evolve.
In practice, this means treating translation as a system rather than a vendor choice. Localize supports workflows that are built to handle variability by design, including:
Just as importantly, Localize supports strategic human involvement, not blanket review. It combines automated evaluation with configurable quality gates so your team can focus human effort where it adds the most value: high-impact content, edge cases, and validation.
Taken together, these capabilities reflect the core lesson of the study: improving translation quality at scale isn’t about chasing the latest model. It’s about building workflows that can measure performance, respond to change, and maintain consistency as the AI landscape continues to shift.

David is a Product Owner at Localize, where he drives product strategy and execution and works closely with engineering and design to launch impactful features. His work helps ensure Localize delivers seamless, customer-focused translation solutions.