Machine translation quality used to be a rather simple equation. You checked a couple of metrics once and moved on from there. As comfortable as that approach may feel, it’s no longer enough.
Right now, we have teams relying on a mix of neural machine translation (NMT), large language models (LLMs), custom prompts, and frequent model updates. Output quality can change from one week to the next, even when nothing else in your workflow does. A single score, captured at one point in time, can’t reflect reality.
That’s why you need to stop asking, “Is this translation good?” and start asking, “How do we continuously evaluate quality across content, languages, and time?”
This article will break down exactly how to evaluate machine translation (MT) quality using both automated metrics and human evaluation, and how to combine them into a practical, repeatable system.
What “translation quality” actually means
MT systems are far better than before, and that’s as much an advantage as it is a disadvantage. LLMs produce more fluent output and need less manual intervention. But that fluency doesn’t guarantee accuracy, consistency, or brand alignment.
At the same time, models update frequently, and translation engines can behave differently across languages and content types. A blind study we conducted in 2025 showed just that: an LLM that performed well in German did poorly in Japanese, and so on. No single model works perfectly across all use cases.
This creates three specific challenges:
- Static evaluation breaks down. A score captured last quarter may no longer reflect today’s output.
- Quality varies by use case. UI strings, marketing copy, and support content behave very differently under the same model.
- Risk isn’t evenly distributed. A minor error in a blog post is very different from a mistake in a checkout flow or legal disclaimer.
In this context, the most important thing is to clearly define what you’re measuring.
- Accuracy. Does the translation preserve the meaning of the source text without omissions or distortions?
- Fluency. Does the output read naturally in the target language?
- Adequacy. Does the translation fully convey the intent, tone, and nuance of the original?
- Consistency. Are terminology, phrasing, and style applied consistently across content?
- Brand voice. Does the translation align with how your product or company communicates?
Do keep in mind that not all dimensions matter equally in every context. For instance, a UI label prioritizes clarity and consistency. A marketing headline prioritizes tone and persuasion. A support article may need both.
That’s why quality evaluation always depends on content type and language pair.
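To make this concrete, here’s a minimal sketch of how a team might encode those priorities as per-dimension weights for each content type. The content types, dimension names, and numbers are illustrative assumptions on our part, not values any tool prescribes.

```python
# Illustrative weights per quality dimension for each content type.
# These names and numbers are assumptions for this sketch, not a standard.
DIMENSION_WEIGHTS = {
    "ui_string": {"accuracy": 0.30, "consistency": 0.30, "fluency": 0.20, "adequacy": 0.10, "brand_voice": 0.10},
    "marketing": {"accuracy": 0.15, "consistency": 0.10, "fluency": 0.25, "adequacy": 0.20, "brand_voice": 0.30},
    "support":   {"accuracy": 0.30, "consistency": 0.20, "fluency": 0.20, "adequacy": 0.20, "brand_voice": 0.10},
}

def weighted_quality_score(dimension_scores, content_type):
    """Combine per-dimension scores (0 to 1) into one weighted score for a content type."""
    weights = DIMENSION_WEIGHTS[content_type]
    return sum(weight * dimension_scores.get(dim, 0.0) for dim, weight in weights.items())

# The same raw scores land differently depending on what the content type prioritizes.
scores = {"accuracy": 0.9, "fluency": 0.95, "adequacy": 0.85, "consistency": 0.7, "brand_voice": 0.6}
print(round(weighted_quality_score(scores, "ui_string"), 3))   # consistency weighs heavily here
print(round(weighted_quality_score(scores, "marketing"), 3))   # brand voice weighs heavily here
```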
Two main ways to evaluate machine translation quality
There are two primary approaches to MT quality evaluation: automated metrics and human evaluation. Most modern workflows rely on both.
Automated metrics (fast, scalable)
Automated metrics compare machine output against a reference translation or use internal scoring models to estimate quality.
They’re valuable because they are:
- Fast and inexpensive.
- Scalable across large volumes.
- Easy to automate in CI/CD pipelines.
In practice, they’re most useful as early signals: they help teams detect regressions, compare engines, and monitor quality trends over time.
Two are especially common. BLEU (Bilingual Evaluation Understudy) measures how closely a machine translation matches a human reference by comparing overlapping words and phrases.
The second one is COMET, a newer neural-based metric that evaluates semantic similarity between source, translation, and reference. It usually correlates better with human judgment than BLEU, especially for adequacy, but it still depends on reference quality and training data distributions.
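If you want to see what these scores look like in practice, here’s a rough sketch using the open-source sacrebleu and unbabel-comet packages. The package names, checkpoint name, and example sentences are assumptions on our part; check each library’s documentation for the versions you actually run.

```python
# pip install sacrebleu unbabel-comet   (assumed package names; verify against current docs)
import sacrebleu
from comet import download_model, load_from_checkpoint

hypotheses = ["The cat sat on the mat."]          # machine output
references = ["The cat was sitting on the mat."]  # human reference

# BLEU: word/phrase (n-gram) overlap with the reference, reported on a 0-100 scale.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")

# COMET: a neural metric that also considers the source sentence.
# The checkpoint name below is an assumption; pick one your comet version supports.
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{
    "src": "Le chat était assis sur le tapis.",
    "mt": hypotheses[0],
    "ref": references[0],
}]
result = model.predict(data, batch_size=8, gpus=0)  # gpus=0 runs on CPU
print(f"COMET: {result.system_score:.3f}")
```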
Automated metrics are indeed fast and scalable, but they’re not perfect. They struggle with things like LLM variability, nuance and intent, as well as domain and context sensitivity. As an example, tone, persuasion, humor, and brand voice are difficult to capture numerically, even when meaning is preserved.
Human evaluation (accurate, expensive)
However capable MT engines and LLMs have become, human review remains one of the most reliable ways to assess translation quality, especially for nuance, intent, and contextual accuracy.
You won’t necessarily need to have a person reading every single line of text. But you should use human evaluation for things like:
- Customer-facing marketing content.
- Legal, medical, or compliance-sensitive text.
- UX copy where clarity affects conversion.
- New language pairs or domains.
The real problem here is that human review doesn’t scale. When you have thousands upon thousands of words to translate and verify daily, that becomes a pressing issue. The solution for many teams is a sampling strategy instead of full post-editing.
That way, you can review only a percentage of the output, focus more on high-risk content, and escalate low-confidence translations.
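As a rough illustration of what that sampling could look like, here’s a minimal sketch that assumes each segment is tagged with a content type and a confidence score. The sampling rates and the 0.7 confidence cutoff are illustrative assumptions, not recommendations.

```python
import random

# Illustrative review rates per content type: the riskier the content, the more you sample.
SAMPLE_RATES = {
    "legal": 1.00,      # always reviewed
    "checkout": 0.50,
    "marketing": 0.25,
    "support": 0.10,
    "blog": 0.02,
}

def select_for_review(segments):
    """Return the subset of translated segments that should go to a human reviewer."""
    selected = []
    for seg in segments:
        rate = SAMPLE_RATES.get(seg["content_type"], 0.05)  # assumed 5% default for unknown types
        low_confidence = seg.get("confidence", 1.0) < 0.70  # low-confidence output is always escalated
        if low_confidence or random.random() < rate:
            selected.append(seg)
    return selected

batch = [
    {"id": 1, "content_type": "blog", "confidence": 0.92},
    {"id": 2, "content_type": "legal", "confidence": 0.95},
    {"id": 3, "content_type": "support", "confidence": 0.55},
]
print([seg["id"] for seg in select_for_review(batch)])  # segments 2 and 3 always appear
```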
The most used MT quality metrics (and when to use each)
As much as we’d like one metric to apply to everything, none is universally the “best”. Your goal shouldn’t be to pick one, but to understand what each metric tells you and where it can mislead you.
The table below summarizes the approaches discussed above and how you can apply them.

| Metric | What it measures | When to use it | Where it can mislead you |
| --- | --- | --- | --- |
| BLEU | Overlap of words and phrases with a human reference | Quick engine comparisons and regression checks at scale | Misses meaning, tone, and fluency nuances |
| COMET | Neural estimate of semantic similarity between source, translation, and reference | Adequacy-focused comparisons; correlates better with human judgment | Depends on reference quality and the data the model was trained on |
| Human evaluation | Accuracy, fluency, tone, and brand alignment as judged by reviewers | High-risk, customer-facing, or nuanced content | Slow and expensive; needs sampling to scale |
Why MT quality must be evaluated as a system
Evaluating machine translation quality in isolation ignores how translation actually happens in production. A few key factors explain why.
First, content types differ widely, and each behaves differently under the same model.
- UI strings are short, repetitive, and sensitive to terminology.
- Long-form content allows for more variation and creative phrasing.
- Support content must balance clarity, accuracy, and empathy.
Then, we can’t forget language-specific behavior. Languages differ in things like morphology and inflection, word order and segmentation, and idiomatic expressions. That’s why a score that looks “acceptable” in English–Spanish may hide serious issues in English–Japanese or English–Arabic.
Last but not least, metrics drift over time. Change the model, the prompts, or the training data, and metric behavior changes with them. Without continuous monitoring, you won’t notice quality drift until users do.
A practical, repeatable evaluation framework
The biggest shift in translation quality evaluation is moving away from one-off scoring toward ongoing, operational quality control. In practice, that means treating translation quality as a system with clear checkpoints, not a single metric reviewed in isolation. For that, you’ll need a simple, repeatable framework with three layers.
Continuous evaluation instead of one-time scoring
It’s not enough for automated quality checks to run during the initial engine selection. They need to do so continuously, in the background. That way, you can achieve several goals:
- Detect quality regressions after model or prompt changes.
- Track trends over time instead of relying on snapshots.
- Compare output quality across content types and languages.
The idea isn’t to chase a perfect score, but to notice when quality changes.
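Here’s one way such a background check could track scores and spot regressions over time. It’s a minimal sketch: the JSON file, the rolling window of eight runs, and the three-point drop threshold are all assumptions you’d tune to your own setup.

```python
import json
import statistics
from datetime import date

HISTORY_FILE = "quality_history.json"  # hypothetical local store of past scores
REGRESSION_DROP = 3.0                  # assumed: flag a drop of 3+ points vs. the recent average

def record_and_check(language_pair, content_type, new_score):
    """Store today's score and report whether it regressed against recent runs."""
    try:
        with open(HISTORY_FILE) as f:
            history = json.load(f)
    except FileNotFoundError:
        history = {}

    key = f"{language_pair}/{content_type}"
    past_scores = [entry["score"] for entry in history.get(key, [])][-8:]  # rolling window
    regressed = bool(past_scores) and statistics.mean(past_scores) - new_score >= REGRESSION_DROP

    history.setdefault(key, []).append({"date": date.today().isoformat(), "score": new_score})
    with open(HISTORY_FILE, "w") as f:
        json.dump(history, f, indent=2)

    return regressed

if record_and_check("en-ja", "ui_string", new_score=41.2):
    print("Quality regression detected: hold this batch and investigate.")
```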
Automated gating for low-quality output
The good news is that not all translations deserve the same level of scrutiny. A modern workflow uses automated scores and confidence signals to decide what happens next.
For instance, high-confidence translations move forward automatically. Low-confidence output will be held back and routed for review. Risky content (legal, UX, customer-facing) can have stricter thresholds and require a more serious, perhaps manual quality check.
This kind of automated gating keeps quality high without slowing your entire process.
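A gating step along these lines can be surprisingly small. The sketch below assumes each segment arrives with a content type and a confidence score; the thresholds and the always-review rule for legal content are illustrative assumptions.

```python
# Illustrative confidence thresholds per content type; riskier content gets a stricter bar.
THRESHOLDS = {
    "legal": 0.95,
    "ux_copy": 0.90,
    "marketing": 0.85,
    "support": 0.75,
    "blog": 0.60,
}

def route_segment(segment):
    """Decide whether a translated segment is published, held, or sent straight to a human."""
    if segment["content_type"] == "legal":
        return "manual_review"                              # assumed rule: legal is always reviewed
    threshold = THRESHOLDS.get(segment["content_type"], 0.80)
    if segment["confidence"] >= threshold:
        return "auto_publish"                               # high confidence moves forward automatically
    return "hold_for_review"                                # low confidence is held back and routed for review

print(route_segment({"content_type": "ux_copy", "confidence": 0.93}))  # auto_publish
print(route_segment({"content_type": "support", "confidence": 0.60}))  # hold_for_review
```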
Targeted human review where it adds value
As we saw earlier, certain types of content still need human review. But in most cases, you can safely skip full post-editing. Focus instead on spot checks of high-impact content and sampling strategies that catch problems without reviewing everything.
A solid framework will trigger human review each time it detects risky content, which ensures experts spend their attention where nuance, intent, or domain knowledge actually matter.
Translation platforms that support built-in quality scoring, automated thresholds, and auditability make this approach easier to maintain at scale. For example, Localize is designed around this kind of system-level evaluation, combining automated quality signals with workflow controls so quality management stays continuous, not manual.
How to evaluate translation software (not just engines)
Evaluating translation software isn’t just a matter of which engine or model performs best. That’s one part of the equation, but not the only one. The right platform should support you across several areas.
- Workflow automation. Quality checks should happen automatically instead of being triggered manually.
- Auditability and traceability. Your team needs to understand how quality was evaluated, which content was reviewed, and why certain decisions were made.
- QA triggers and thresholds. Good translation software will flag low-confidence translations and route them for review.
Without these capabilities, even the best-performing MT engine becomes hard to trust at scale. Evaluation turns into a series of ad hoc checks instead of a repeatable process.
Modern translation platforms can solve this problem with a combination of automated quality signals and workflow controls. The Localize platform, for instance, treats quality evaluation as part of the translation cycle itself. That makes it easier to maintain consistent standards even with different languages, content, and models.
Common pitfalls teams run into
Even the most experienced teams can fall into predictable traps when evaluating machine translation quality.
Over-relying on a single metric
No single score reflects real-world translation quality. Metrics like BLEU or COMET capture specific aspects of output, but they can miss issues related to tone, intent, or domain accuracy.
When you focus only on one metric, you risk optimizing your content for the score instead of the actual user experience.
Ignoring drift over time
You already understand that MT quality isn’t static. Model updates, prompt changes, and new content types can shift output quietly, even if scores initially look stable.
The hardest part is that such changes can happen from one day to the next. Our 2025 blind study found significant quality differences for certain models and language pairs between June and September alone. Without continuous monitoring, you risk discovering problems long after users do.
Treating evaluation as a one-time setup
The changes that models and prompts go through over time bring in another issue. Quality evaluation simply can’t be something you configure once and forget about. Instead, you’ll need to constantly revisit your approach, to avoid having your entire strategy rely on outdated thresholds.