The translation industry is seeing a boom in Machine Translation (MT), both in adoption and in the underlying technology. As MT uses and applications evolve, a standardized MT scoring method is urgently needed but still lacking.
In order to successfully leverage MT, Language Service Providers (LSPs) and others in the industry need to develop an objective, methodical system for evaluating the effectiveness of their MT. There have been attempts at a standardized test system, but nothing introduced so far has produced the objective results necessary.
To create a more effective set of MT testing criteria, translation providers need to take into account industry-specific metrics when evaluating MT quality. Here we’ll look at some testing systems currently used in the industry, as well as what’s needed to create a more standardized scoring metric.
SAE J2450
Some LSPs make use of quality metrics as standardized by the industry. The Society of Automotive Engineers created a standardized metric for measuring the quality of human translations known as SAE J2450. The scoring criteria are well defined and impartial, which is why LSPs have adapted and applied this model to other industries.
According to The Translation Automation User Society (TAUS), the SAE metric focuses on “incorrect term, syntactic error, omission, word structure or agreement, misspelling, punctuation, miscellaneous error.”
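To make the scoring model concrete, here is a minimal sketch of how a J2450-style score could be computed: reviewers tag each error with a category and a severity, the weighted error points are summed, and the total is normalized by the length of the evaluated text. The category names mirror the TAUS list above, but the severity weights and the normalization shown here are illustrative assumptions, not the official values from the SAE standard.

```python
# Illustrative sketch of SAE J2450-style scoring. The severity weights below
# are assumptions for demonstration; the standard defines the official values.
# Lower scores are better, and 0.0 means no errors were found.

# Hypothetical (serious, minor) weights per error category.
WEIGHTS = {
    "wrong_term":      (5, 2),
    "syntactic_error": (4, 2),
    "omission":        (4, 2),
    "word_structure":  (4, 2),
    "misspelling":     (3, 1),
    "punctuation":     (2, 1),
    "miscellaneous":   (3, 1),
}

def j2450_style_score(errors, word_count):
    """errors: list of (category, severity) tuples, severity is 'serious' or 'minor'."""
    total = 0
    for category, severity in errors:
        serious_w, minor_w = WEIGHTS[category]
        total += serious_w if severity == "serious" else minor_w
    return total / word_count  # weighted error points per word

# Example: three errors found in a 250-word translated sample.
sample_errors = [
    ("wrong_term", "serious"),
    ("punctuation", "minor"),
    ("omission", "minor"),
]
print(round(j2450_style_score(sample_errors, word_count=250), 4))  # 0.032
```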
BLEU
When it comes to MT, the Bilingual Evaluation Understudy (BLEU) score is the most widely accepted system for testing statistical and neural MT. However, BLEU isn't perfect.
BLEU’s evaluation scores fail to provide an objective result for translation quality, and scores can be skewed by the domain of the data used to train the engine. In addition, tests can be rigged to show favorable results or simply administered poorly, producing false negatives or false positives.
As Slator reported last year, BLEU has trouble when it comes to evaluating a sentence’s grammatical structure and also doesn’t pick up on some small changes that could alter a text segment’s meaning.
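The weakness is easy to demonstrate. The short sketch below uses NLTK's BLEU implementation (the example sentence and the choice of library are ours, purely for illustration): inserting a single word that reverses the meaning of the sentence still leaves most n-grams intact, so the score stays relatively high.

```python
# Minimal illustration of the BLEU limitation described above, using NLTK
# (pip install nltk). A one-word insertion that reverses the meaning of the
# sentence only moderately lowers the score, because n-gram overlap stays high.
from nltk.translate.bleu_score import sentence_bleu

reference = [["the", "contract", "must", "be", "signed", "by", "the", "supplier"]]

exact_match = ["the", "contract", "must", "be", "signed", "by", "the", "supplier"]
negated     = ["the", "contract", "must", "not", "be", "signed", "by", "the", "supplier"]

print(sentence_bleu(reference, exact_match))  # 1.0 -- perfect n-gram overlap
print(sentence_bleu(reference, negated))      # roughly 0.6 -- meaning is reversed,
                                              # but most n-grams still match
```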
DQF
TAUS has been working to develop and standardize the Dynamic Quality Framework (DQF). The DQF is harmonized with the Multidimensional Quality Metrics (MQM) framework and measures seven fields: accuracy, design, fluency, locale convention, style, terminology, and verity (whether a statement holds true for the target audience). The DQF can be integrated into Computer Assisted Translation (CAT) tools to simplify reporting and give translators a familiar medium for scoring.
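As a rough sketch of how such a dimension-based evaluation might aggregate reviewer annotations, the example below groups errors by dimension and derives an overall quality score. The dimensions mirror the fields listed above, but the severity penalties and the scoring formula are illustrative assumptions, not taken from the DQF specification or any particular CAT tool.

```python
# Simplified sketch of aggregating DQF/MQM-style annotations by dimension.
# Penalty values and the 100-point scoring formula are illustrative only.
from collections import defaultdict

PENALTIES = {"minor": 1, "major": 5, "critical": 10}  # hypothetical severity weights

def dqf_style_report(annotations, word_count):
    """annotations: list of dicts like {"dimension": "accuracy", "severity": "major"}."""
    per_dimension = defaultdict(int)
    for ann in annotations:
        per_dimension[ann["dimension"]] += PENALTIES[ann["severity"]]
    total_penalty = sum(per_dimension.values())
    # Normalized quality score: 100 = no errors found, lower = more/heavier errors.
    quality_score = max(0.0, 100.0 - 100.0 * total_penalty / word_count)
    return dict(per_dimension), round(quality_score, 1)

annotations = [
    {"dimension": "accuracy",    "severity": "major"},
    {"dimension": "terminology", "severity": "minor"},
    {"dimension": "fluency",     "severity": "minor"},
]
print(dqf_style_report(annotations, word_count=300))
# ({'accuracy': 5, 'terminology': 1, 'fluency': 1}, 97.7)
```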
The American Society for Testing and Materials (ASTM) Committee F43 is looking to standardize the DQF, which may pressure more LSPs to adopt the standard. If the majority of LSPs adopt the same Language Quality Assessment, consumers will be able to more accurately compare language quality for human, machine, or post-edited translations.
A Variable Option
MT providers need to adopt a standardized testing method for their general MT offerings that shows translation quality by industry.
To do this, a test needs to be given with industry-specific text that the MT provider would not already have in their database. This sample text then should be scored according to standardized metrics, so a consumer would be able to shop around for comparable quality.
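In practice, that kind of comparison could look something like the harness sketched below: each provider translates the same held-out, domain-specific test set, and a common automatic metric is applied so the scores are directly comparable. The provider names, the translate() callables, and the test data are placeholders, and corpus BLEU via sacrebleu is used here only as a stand-in for whatever standardized metric the industry settles on.

```python
# Hypothetical harness for by-industry MT comparison. Providers, translate()
# functions, and test sets are placeholders; the metric is corpus BLEU via
# sacrebleu (pip install sacrebleu), standing in for a standardized score.
import sacrebleu

def compare_providers(providers, test_sets):
    """providers: {name: translate_fn}; test_sets: {domain: (source_lines, reference_lines)}."""
    results = {}
    for domain, (sources, references) in test_sets.items():
        for name, translate in providers.items():
            hypotheses = [translate(s) for s in sources]
            bleu = sacrebleu.corpus_bleu(hypotheses, [references])
            results[(domain, name)] = round(bleu.score, 1)
    return results

# Usage sketch (placeholder providers and data):
# scores = compare_providers(
#     providers={"provider_a": provider_a.translate, "provider_b": provider_b.translate},
#     test_sets={"legal": (legal_sources, legal_references),
#                "manufacturing": (mfg_sources, mfg_references)},
# )
# for (domain, provider), score in sorted(scores.items()):
#     print(f"{domain:15s} {provider:12s} BLEU={score}")
```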
Currently, consumers rely on methods such as reading comprehension testing, which is time-consuming and subjective.
There are organizations that are looking to standardize MT testing by making use of CAT tools, which have been used for years to aid in Human Translation Quality Evaluation.
Ultimately, the best evaluation system will depend on the content being tested. Translations for the legal field will need to be tested differently than, for example, those done for manufacturing companies. For this reason, an evaluation system that relies on industry-specific knowledge would be ideal.
If MT can be better evaluated based on need, it’s likely it will be more effective for those who use it.