Machine translation quality used to be a rather simple equation. You checked a couple of metrics once and moved on from there. As comfortable as that approach may feel, it’s no longer enough.
Right now, we have teams relying on a mix of neural machine translation (NMT), large language models (LLMs), custom prompts, and frequent model updates. Output quality can change from one week to the next, even when nothing else in your workflow does. A single score, captured at one point in time, can’t reflect reality.
That’s why you need to stop asking, “Is this translation good?” and start asking, “How do we continuously evaluate quality across content, languages, and time?”
This article will break down exactly how to evaluate machine translation (MT) quality using both automated metrics and human evaluation, and how to combine them into a practical, repeatable system.
What “translation quality” actually means
MT systems are far better than before, and that’s as much an advantage as it is a disadvantage. LLMs produce more fluent output and need less manual intervention. But that fluency doesn’t guarantee accuracy, consistency, or brand alignment.
At the same time, models update frequently, and translation engines can behave differently across languages and content types. A blind study we conducted in 2025 showed just that: an LLM that performed well in German did poorly in Japanese, and so on. No single model works perfectly across all use cases.
This creates three specific challenges:
- Static evaluation breaks down. A score captured last quarter may no longer reflect today’s output.
- Quality varies by use case. UI strings, marketing copy, and support content behave very differently under the same model.
- Risk isn’t evenly distributed. A minor error in a blog post is very different from a mistake in a checkout flow or legal disclaimer.
In this context, the most important thing is to clearly define what you’re measuring.
- Accuracy. Does the translation preserve the meaning of the source text without omissions or distortions?
- Fluency. Does the output read naturally in the target language?
- Adequacy. Does the translation fully convey the intent, tone, and nuance of the original?
- Consistency. Are terminology, phrasing, and style applied consistently across content?
- Brand voice. Does the translation align with how your product or company communicates?
Do keep in mind that not all dimensions matter equally in every context. For instance, a UI label prioritizes clarity and consistency. A marketing headline prioritizes tone and persuasion. A support article may need both.
That’s why quality evaluation always depends on content type and language pair.
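To make this concrete, here’s a minimal sketch of how a team might encode those priorities as per-dimension weights for each content type. The content types, dimension names, and numbers are illustrative assumptions on our part, not values any tool prescribes.

```python
# Illustrative weights per quality dimension for each content type.
# These names and numbers are assumptions for this sketch, not a standard.
DIMENSION_WEIGHTS = {
    "ui_string": {"accuracy": 0.30, "consistency": 0.30, "fluency": 0.20, "adequacy": 0.10, "brand_voice": 0.10},
    "marketing": {"accuracy": 0.15, "consistency": 0.10, "fluency": 0.25, "adequacy": 0.20, "brand_voice": 0.30},
    "support":   {"accuracy": 0.30, "consistency": 0.20, "fluency": 0.20, "adequacy": 0.20, "brand_voice": 0.10},
}

def weighted_quality_score(dimension_scores, content_type):
    """Combine per-dimension scores (0 to 1) into one weighted score for a content type."""
    weights = DIMENSION_WEIGHTS[content_type]
    return sum(weight * dimension_scores.get(dim, 0.0) for dim, weight in weights.items())

# The same raw scores land differently depending on what the content type prioritizes.
scores = {"accuracy": 0.9, "fluency": 0.95, "adequacy": 0.85, "consistency": 0.7, "brand_voice": 0.6}
print(round(weighted_quality_score(scores, "ui_string"), 3))   # consistency weighs heavily here
print(round(weighted_quality_score(scores, "marketing"), 3))   # brand voice weighs heavily here
```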
Two main ways to evaluate machine translation quality
There are two primary approaches to MT quality evaluation: automated metrics and human evaluation. Most modern workflows rely on both.
Automated metrics (fast, scalable)
Automated metrics compare machine output against a reference translation or use internal scoring models to estimate quality.
They’re valuable because they are:
- Fast and inexpensive.
- Scalable across large volumes.
- Easy to automate in CI/CD pipelines.
In practice, they’re most useful as early signals: they help teams detect regressions, compare engines, and monitor quality trends over time.
Two are especially common. BLEU (Bilingual Evaluation Understudy) measures how closely a machine translation matches a human reference by comparing overlapping words and phrases.
The second one is COMET, a newer neural-based metric that evaluates semantic similarity between source, translation, and reference. It usually correlates better with human judgment than BLEU, especially for adequacy, but it still depends on reference quality and training data distributions.
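If you want to see what these scores look like in practice, here’s a rough sketch using the open-source sacrebleu and unbabel-comet packages. The package names, checkpoint name, and example sentences are assumptions on our part; check each library’s documentation for the versions you actually run.

```python
# pip install sacrebleu unbabel-comet   (assumed package names; verify against current docs)
import sacrebleu
from comet import download_model, load_from_checkpoint

hypotheses = ["The cat sat on the mat."]          # machine output
references = ["The cat was sitting on the mat."]  # human reference

# BLEU: word/phrase (n-gram) overlap with the reference, reported on a 0-100 scale.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")

# COMET: a neural metric that also considers the source sentence.
# The checkpoint name below is an assumption; pick one your comet version supports.
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{
    "src": "Le chat était assis sur le tapis.",
    "mt": hypotheses[0],
    "ref": references[0],
}]
result = model.predict(data, batch_size=8, gpus=0)  # gpus=0 runs on CPU
print(f"COMET: {result.system_score:.3f}")
```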
Automated metrics are indeed fast and scalable, but they’re not perfect. They struggle with things like LLM variability, nuance and intent, as well as domain and context sensitivity. As an example, tone, persuasion, humor, and brand voice are difficult to capture numerically, even when meaning is preserved.
Human evaluation (accurate, expensive)
However capable MT engines and LLMs have become, human review remains one of the most reliable ways to assess translation quality, especially for nuance, intent, and contextual accuracy.
You won’t necessarily need to have a person reading every single line of text. But you should use human evaluation for things like:
- Customer-facing marketing content.
- Legal, medical, or compliance-sensitive text.
- UX copy where clarity affects conversion.
- New language pairs or domains.
The real problem here is that human review doesn’t scale. When you have thousands upon thousands of words to translate and verify daily, that becomes a pressing issue. The solution for many teams is a sampling strategy instead of full post-editing.
That way, you can review only a percentage of the output, focus more on high-risk content, and escalate low-confidence translations.
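As a rough illustration of what that sampling could look like, here’s a minimal sketch that assumes each segment is tagged with a content type and a confidence score. The sampling rates and the 0.7 confidence cutoff are illustrative assumptions, not recommendations.

```python
import random

# Illustrative review rates per content type: the riskier the content, the more you sample.
SAMPLE_RATES = {
    "legal": 1.00,      # always reviewed
    "checkout": 0.50,
    "marketing": 0.25,
    "support": 0.10,
    "blog": 0.02,
}

def select_for_review(segments):
    """Return the subset of translated segments that should go to a human reviewer."""
    selected = []
    for seg in segments:
        rate = SAMPLE_RATES.get(seg["content_type"], 0.05)  # assumed 5% default for unknown types
        low_confidence = seg.get("confidence", 1.0) < 0.70  # low-confidence output is always escalated
        if low_confidence or random.random() < rate:
            selected.append(seg)
    return selected

batch = [
    {"id": 1, "content_type": "blog", "confidence": 0.92},
    {"id": 2, "content_type": "legal", "confidence": 0.95},
    {"id": 3, "content_type": "support", "confidence": 0.55},
]
print([seg["id"] for seg in select_for_review(batch)])  # segments 2 and 3 always appear
```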
The most used MT quality metrics (and when to use each)
As much as we’d like one metric to apply to everything, none is universally the “best”. Your goal shouldn’t be to pick one, but to understand what each metric tells you and where it can mislead you.
The table below summarizes the approaches discussed above and how you can apply them.

| Metric | What it measures | When to use it | Where it can mislead you |
| --- | --- | --- | --- |
| BLEU | Overlap of words and phrases with a human reference | Quick engine comparisons and regression checks at scale | Misses meaning, tone, and fluency nuances |
| COMET | Neural estimate of semantic similarity between source, translation, and reference | Adequacy-focused comparisons; correlates better with human judgment | Depends on reference quality and the data the model was trained on |
| Human evaluation | Accuracy, fluency, tone, and brand alignment as judged by reviewers | High-risk, customer-facing, or nuanced content | Slow and expensive; needs sampling to scale |
Why MT quality must be evaluated as a system
Evaluating machine translation quality in isolation ignores how translation actually happens in production. A few key factors explain why.
First, content types differ widely, and each behaves differently under the same model.
- UI strings are short, repetitive, and sensitive to terminology.
- Long-form content allows for more variation and creative phrasing.
- Support content must balance clarity, accuracy, and empathy.
Then, we can’t forget language-specific behavior. Languages differ in things like morphology and inflection, word order and segmentation, and idiomatic expressions. That’s why a score that looks “acceptable” in English–Spanish may hide serious issues in English–Japanese or English–Arabic.
Last but not least, metrics drift over time. Change the model, the prompts, or the training data, and metric behavior changes with them. Without continuous monitoring, you won’t notice quality drift until users do.
A practical, repeatable evaluation framework
The biggest shift in translation quality evaluation is moving away from one-off scoring toward ongoing, operational quality control. In practice, that means treating translation quality as a system with clear checkpoints, not a single metric reviewed in isolation. For that, you’ll need a simple, repeatable framework with three layers.
Continuous evaluation instead of one-time scoring
It’s not enough for automated quality checks to run during the initial engine selection. They need to do so continuously, in the background. That way, you can achieve several goals:
- Detect quality regressions after model or prompt changes.
- Track trends over time instead of relying on snapshots.
- Compare output quality across content types and languages.
The idea isn’t to chase a perfect score, but to notice when quality changes.
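Here’s one way such a background check could track scores and spot regressions over time. It’s a minimal sketch: the JSON file, the rolling window of eight runs, and the three-point drop threshold are all assumptions you’d tune to your own setup.

```python
import json
import statistics
from datetime import date

HISTORY_FILE = "quality_history.json"  # hypothetical local store of past scores
REGRESSION_DROP = 3.0                  # assumed: flag a drop of 3+ points vs. the recent average

def record_and_check(language_pair, content_type, new_score):
    """Store today's score and report whether it regressed against recent runs."""
    try:
        with open(HISTORY_FILE) as f:
            history = json.load(f)
    except FileNotFoundError:
        history = {}

    key = f"{language_pair}/{content_type}"
    past_scores = [entry["score"] for entry in history.get(key, [])][-8:]  # rolling window
    regressed = bool(past_scores) and statistics.mean(past_scores) - new_score >= REGRESSION_DROP

    history.setdefault(key, []).append({"date": date.today().isoformat(), "score": new_score})
    with open(HISTORY_FILE, "w") as f:
        json.dump(history, f, indent=2)

    return regressed

if record_and_check("en-ja", "ui_string", new_score=41.2):
    print("Quality regression detected: hold this batch and investigate.")
```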
Automated gating for low-quality output
The good news is that not all translations deserve the same level of scrutiny. A modern workflow uses automated scores and confidence signals to decide what happens next.
For instance, high-confidence translations move forward automatically. Low-confidence output will be held back and routed for review. Risky content (legal, UX, customer-facing) can have stricter thresholds and require a more serious, perhaps manual quality check.
This kind of automated gating keeps quality high without slowing your entire process.
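A gating step along these lines can be surprisingly small. The sketch below assumes each segment arrives with a content type and a confidence score; the thresholds and the always-review rule for legal content are illustrative assumptions.

```python
# Illustrative confidence thresholds per content type; riskier content gets a stricter bar.
THRESHOLDS = {
    "legal": 0.95,
    "ux_copy": 0.90,
    "marketing": 0.85,
    "support": 0.75,
    "blog": 0.60,
}

def route_segment(segment):
    """Decide whether a translated segment is published, held, or sent straight to a human."""
    if segment["content_type"] == "legal":
        return "manual_review"                              # assumed rule: legal is always reviewed
    threshold = THRESHOLDS.get(segment["content_type"], 0.80)
    if segment["confidence"] >= threshold:
        return "auto_publish"                               # high confidence moves forward automatically
    return "hold_for_review"                                # low confidence is held back and routed for review

print(route_segment({"content_type": "ux_copy", "confidence": 0.93}))  # auto_publish
print(route_segment({"content_type": "support", "confidence": 0.60}))  # hold_for_review
```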
Targeted human review where it adds value
As we saw earlier, certain types of content still need human review. But in most cases, you can safely skip full post-editing. Focus instead on spot checks of high-impact content and sampling strategies that catch problems without reviewing everything.
A solid framework will trigger human review each time it detects risky content, which ensures experts spend their attention where nuance, intent, or domain knowledge actually matter.
Translation platforms that support built-in quality scoring, automated thresholds, and auditability make this approach easier to maintain at scale. For example, Localize is designed around this kind of system-level evaluation, combining automated quality signals with workflow controls so quality management stays continuous, not manual.
How to evaluate translation software (not just engines)
Evaluating translation software isn’t just a matter of which engine or model performs best. That’s one part of the equation, but not the only one. The right platform should support you across several areas.
- Workflow automation. Quality checks should happen automatically instead of being triggered manually.
- Auditability and traceability. Your team needs to understand how quality was evaluated, which content was reviewed, and why certain decisions were made.
- QA triggers and thresholds. Good translation software will flag low-confidence translations and route them for review.
Without these capabilities, even the best-performing MT engine becomes hard to trust at scale. Evaluation turns into a series of ad hoc checks instead of a repeatable process.
Modern translation platforms can solve this problem with a combination of automated quality signals and workflow controls. The Localize platform, for instance, treats quality evaluation as part of the translation cycle itself. That makes it easier to maintain consistent standards even with different languages, content, and models.
Common pitfalls teams run into
Even the most experienced teams can fall into predictable traps when evaluating machine translation quality.
Over-relying on a single metric
No single score reflects real-world translation quality. Metrics like BLEU or COMET capture specific aspects of output, but they can miss issues related to tone, intent, or domain accuracy.
When you focus only on one metric, you risk optimizing your content for the score instead of the actual user experience.
Ignoring drift over time
You already understand that MT quality isn’t static. Model updates, prompt changes, and new content types can shift output quietly, even if scores initially look stable.
The hardest part is that such changes can happen from one day to the next. Our 2025 blind study found significant quality differences for certain models and language pairs between June and September alone. Without continuous monitoring, you risk discovering problems long after users do.
Treating evaluation as a one-time setup
The changes that models and prompts go through over time bring in another issue. Quality evaluation simply can’t be something you configure once and forget about. Instead, you’ll need to constantly revisit your approach, to avoid having your entire strategy rely on outdated thresholds.