How do machine translation tools work? What makes it possible to translate pages and pages of text in a blink of an eye? The answer is translation memory and segmentation.
Segmentation is one of the key processes that make computer-assisted translation (CAT) tools possible. In this article, we will explain how segmentation works and how to implement segmentation rules to maximize the benefits of automation for your project.
Pro tip: Use a professional localization management suite like Centus to streamline segmentation in translation through automated and customizable rules, real-time previews, segment-level editing, and integration with translation memories.
What is segmentation in translation?
Segmentation is a process of splitting text into segments for easier translation. A segment is an elementary syntactic unit. Depending on the content and the segmentation rules for a particular project, it can vary in length and be a complete sentence, a phrase, or a combination of words.
The point of segmentation is to let CAT and AI tools parse the text and break it into manageable units. This way, the machine can search the translation memory for equivalent units in the target language to suggest possible translations.
For example, the sentence “I see a white house on the hill” can be broken down into three segments:
- I see
- A white house
- On the hill
Each segment represents an elementary unit that possesses both meaning and grammatical structure and has a high probability of being already translated within some other text and stored in the database of translation memory.
Let’s say the text needs to be translated into French and Polish. The CAT tool finds equivalent segments in the corresponding databases for each language pair:
English | French | Polish |
---|---|---|
I see | Je vois | widzę |
a white house | une maison blanche | biały dom |
on the hill | sur la colline | na wzgórzu |
I see a white house on the hill. | Je vois une maison blanche sur la colline. | Widzę biały dom na wzgórzu. |
In the context of a CAT tool, the translation for each segment is then suggested for the human translator to confirm, reject, or modify. In the context of full AI translation, the segments are then reassembled into the full text according to the grammatical and syntactic rules of the target language.
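To make the lookup step concrete, here is a minimal Python sketch. It assumes the translation memory is a plain dictionary keyed by language pair, which is a simplification for illustration: a real TM is a database, and the names `TRANSLATION_MEMORY` and `suggest` are hypothetical.

```python
# Toy translation memory: (source, target) language pair -> stored segments.
# The entries mirror the example table; a real TM is a database, not a dict.
TRANSLATION_MEMORY = {
    ("en", "fr"): {
        "i see": "je vois",
        "a white house": "une maison blanche",
        "on the hill": "sur la colline",
    },
    ("en", "pl"): {
        "i see": "widzę",
        "a white house": "biały dom",
        "on the hill": "na wzgórzu",
    },
}


def suggest(segments, source_lang, target_lang):
    """Return (segment, suggestion) pairs; suggestion is None if nothing is stored."""
    memory = TRANSLATION_MEMORY.get((source_lang, target_lang), {})
    return [(seg, memory.get(seg.strip().lower())) for seg in segments]


segments = ["I see", "a white house", "on the hill", "at dawn"]
for segment, suggestion in suggest(segments, "en", "fr"):
    print(f"{segment!r} -> {suggestion!r}")
```

Segments without a stored equivalent come back as `None`, which corresponds to the cases a human translator has to handle from scratch.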
Benefits of segmentation in translation
Without segmentation, neither machine translation nor computer-assisted human translation would be possible. Segmentation lies at the foundation of all modern translation technology. It is critical for the high quality, affordability, and fast delivery of the translation.
High-speed translation
Thanks to segmentation, machine translation (MT) and translation memories (TM) can be leveraged to deliver accurate translations faster. Segmentation builds the bridge between the datasets of the source and the target languages.
Human oversight
Segmentation allows translators to confirm or decline suggested translations, making their work a lot easier, but leaving them in control. Even manual translation becomes much easier when one can focus on key phrases and meaningful concepts, highlighted as separate segments.
Context
Segmentation breaks the text down without removing segments from their context. Translators always have access to all the relevant information from the text necessary for understanding the semantic nuances and choosing the right equivalent for each segment.
Lower translation costs
Thanks to translation memories, all segments that were already approved for previous translations are automatically translated when a new text goes through the CAT tool. This way, one doesn’t have to translate the same phrases twice. Moreover, when a new segment is translated by a human expert, this translation is automatically added everywhere in the text where this segment occurs.
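The reuse of confirmed translations can be pictured with a short Python sketch. The function names `repetitions` and `propagate` are made up for this illustration, and the logic is far simpler than what a CAT tool does internally.

```python
from collections import Counter


def repetitions(segments):
    """Map each segment that occurs more than once to its number of occurrences."""
    counts = Counter(segments)
    return {segment: n for segment, n in counts.items() if n > 1}


def propagate(segments, confirmed):
    """Fill in every occurrence of a confirmed segment; leave the rest as None."""
    return [confirmed.get(segment) for segment in segments]


document = ["Click Save.", "Enter your name.", "Click Save.", "Click Save."]
print(repetitions(document))
print(propagate(document, {"Click Save.": "Cliquez sur Enregistrer."}))
```

One confirmed translation fills three of the four segments here, which is where the cost savings come from.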
How the segmentation process works
Segmentation is usually performed automatically during the import process of the document into the CAT tool. Segmentation rules vary from one CAT tool to another and can be modified depending on the translation needs of each particular project.
As a rule, segmentation divides the text into paragraphs and sentences according to formatting signs that usually end a segment, such as line breaks, paragraph marks, page breaks, tab characters, and end-of-cell markers. Within these larger pieces of text, punctuation marks guide further splitting: full stops, commas, exclamation and question marks, colons, and semicolons.
The resulting segments can be titles, paragraphs, bullet points, or phrases. Individual words can also be recognized as segments if they are added as technical terms and managed by the terminology glossaries. Once segments are singled out, they are added to the translation memory as building blocks for further use.
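As a rough illustration of such default rules, the Python sketch below ends a segment at every line break and after a full stop, exclamation mark, or question mark followed by whitespace. The rule set is an assumption chosen for brevity. Actual CAT tools ship more elaborate, language-specific rules.

```python
import re

# Assumed rule: a segment ends after ., ! or ? when whitespace follows.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def segment_text(text):
    """Split text into segments at line breaks and sentence-final punctuation."""
    segments = []
    for block in text.splitlines():  # every line break ends a segment
        block = block.strip()
        if not block:
            continue
        segments.extend(part for part in SENTENCE_END.split(block) if part)
    return segments


sample = "Installation guide\nOpen the box. Is the cable included? Plug it in!"
for number, segment in enumerate(segment_text(sample), start=1):
    print(number, segment)
```

The title has no final punctuation but still becomes its own segment, because the line break closes it.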
Then it’s time for the matching process. The CAT tool compares the segments of the text being translated with the segments stored in the translation memory and finds the corresponding ones. The results are usually color-coded and highlighted in the CAT interface as follows:
- 100% matches: if a segment stored in the TM is identical to the segment in the text that’s being translated, it is considered a 100% match.
- Fuzzy matches: if a segment stored in the TM is similar, but not an exact match, it is displayed as a fuzzy match. Fuzzy matches usually present an overlap percentage measured from 0% to 99%.
- Repetitions: if a segment appears multiple times in the text, it is marked as a repetition. When the translator confirms the translation for this segment, it is applied automatically throughout the text.
Full matches are usually confirmed as is or slightly modified depending on the style and context required for this translation. Fuzzy matches, on the other hand, need modification from the translator. A 99% match can be a sentence that only differs in a punctuation mark or an article, while an 80% match has several different words that need to be translated anew. As a rule, matches below 70% are of very limited use.
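The sketch below shows one way to score and classify matches in Python. It uses `difflib.SequenceMatcher` from the standard library as a stand-in: CAT tools rely on their own, usually word-based, scoring algorithms, so the percentages here will not equal the ones your tool reports. The 70% cut-off follows the rule of thumb above, and the `classify` function is hypothetical.

```python
from difflib import SequenceMatcher


def classify(new_segment, tm_segments, threshold=70):
    """Return (best stored segment, score in percent, match type)."""
    best, best_score = None, 0
    for stored in tm_segments:
        if stored == new_segment:
            return stored, 100, "100% match"
        # Character-level similarity, capped at 99 for non-identical segments.
        ratio = SequenceMatcher(None, new_segment, stored).ratio()
        score = min(99, int(ratio * 100))
        if score > best_score:
            best, best_score = stored, score
    if best_score >= threshold:
        return best, best_score, "fuzzy match"
    return None, best_score, "no match"


tm = ["Click Save to continue!", "Enter your password."]
print(classify("Enter your password.", tm))
print(classify("Click Save to continue.", tm))
print(classify("Goodbye", tm))
```

A segment that differs only in its final punctuation mark lands in the high fuzzy range, while unrelated text falls below the threshold and is not suggested.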
How to perform segmentation in translation
Segmentation mimics similar cognitive processes in the human mind. However, it cannot be as flexible as the brain that evolved to process language. To align binary machine logic with the complexities of natural language, segmentation rules should be creatively applied.
Breaking text into segments correctly is necessary for the CAT tool to determine whether they have already been translated and find matches for them. Correct segmentation depends heavily on the quality and formatting of the source text. To get the best segmentation results, prepare the text before importing it into the CAT tool.
- Double-check punctuation
Correct use of punctuation marks allows the CAT tool to segment the text correctly and find more readily available matches in the TM. Make sure all sentences end with a full stop, an exclamation mark, or a question mark, and that commas are placed where the grammar requires them.
- Avoid abbreviations
Unless you plan to add a particular abbreviation as a term into your glossary, it’s better to avoid them, since abbreviations using full stops will end the segment incorrectly.
- Don’t format numbered lists manually
If you type list numbers by hand (1., 2., 3.), the full stop after each number signals the end of a sentence, and you will get segments consisting of digits only. To avoid this, always use your word processor’s formatting tool to create lists.
- Avoid repeated indentations or tabulators
When formatting text for visual structure, it is important to use the page layout settings of your word processor instead of manually applying indentations and tabulators for each line of the text, since they usually signal the end of segment to CAT tools and can break a syntactic unit incorrectly.
- Do not use paragraph breaks for formatting
Avoid using paragraph breaks to visually divide text that forms a single syntactic unit: for example, to split a title across several lines. The CAT tool will identify each of the lines as an individual segment instead of treating them as one sentence.
- Make sure tables are formatted correctly
Depending on the text format, the end of the cell might signal the end of the segment instead of the full stop. Make sure the formatting for columns and rows is applied correctly to help with accurate segmentation.
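Some of the checks above can be automated before import. The following Python sketch flags a few of the listed problems in plain text. The patterns and the `lint_source` helper are illustrative assumptions, not a feature of any particular CAT tool, and they cover only a small part of what can go wrong.

```python
import re

# Hypothetical pre-import checks based on the preparation tips.
CHECKS = {
    "manually numbered list item": re.compile(r"^\s*\d+\.\s"),
    "abbreviation with a full stop mid-sentence": re.compile(
        r"\b(?:e\.g|i\.e|etc|approx)\.\s+[a-z]"
    ),
    "repeated tabs or spaces used for layout": re.compile(r"\t{2,}| {3,}"),
}


def lint_source(text):
    """Return (line number, problem) pairs for lines that may segment badly."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in CHECKS.items():
            if pattern.search(line):
                findings.append((number, label))
    return findings


sample = (
    "1. Unpack the device.\n"
    "Use a cable, e.g. a USB-C one.\n"
    "Name:\t\tValue\n"
    "All good here."
)
for number, problem in lint_source(sample):
    print(f"line {number}: {problem}")
```

Each finding is a prompt for a human to review the line. Whether it needs changing depends on the exceptions already defined in your rule set.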
3 Examples of segmentation in translation
Segmentation works differently depending on the type of text, the file format, and how the text itself is structured. Tables are segmented differently than plain text, and instructions are segmented differently than poems. Usually, CAT tools have preset rules that help handle various translation cases. For example:
Software string segmentation
When working with software strings (for example, to localize UI text), each string is treated as a separate segment. Moreover, placeholders and variables should be left as they are, but placed correctly within the source and the translated text.
Text:
“Good to see you again, {user_name}! Click {button_label} to continue.”
Segmented:
- Segment 1: “Good to see you again, {user_name}!”
- Segment 2: “Click {button_label} to continue.”
Placeholders {user_name} and {button_label} should not be changed, but the rest of the text can be translated. For that, rules should be created to recognize curly braces as placeholder delimiters.
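A placeholder rule of this kind can be sketched with a regular expression. The Python example below assumes placeholders are identifiers wrapped in curly braces, as in the string above. The `placeholders_preserved` check is a hypothetical helper showing how a tool might verify that a translation kept every placeholder intact.

```python
import re
from collections import Counter

# Assumed placeholder syntax: an identifier wrapped in curly braces.
PLACEHOLDER = re.compile(r"\{[A-Za-z_][A-Za-z0-9_]*\}")


def placeholders(text):
    """Count each placeholder that appears in the text."""
    return Counter(PLACEHOLDER.findall(text))


def placeholders_preserved(source, target):
    """True if the translation contains exactly the placeholders of the source."""
    return placeholders(source) == placeholders(target)


source = "Click {button_label} to continue."
print(placeholders_preserved(source, "Cliquez sur {button_label} pour continuer."))
print(placeholders_preserved(source, "Cliquez sur {étiquette} pour continuer."))
```

The second call reports a problem because the placeholder name itself was translated, which would break the string at runtime.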
Spreadsheet segmentation
In a spreadsheet file (for example, XLSX or CSV), each cell is treated independently and is segmented individually. This is usually done automatically.
Text in Cells:
Product Name | Price |
---|---|
Laptop Pro/X | $999 |
Smartphone S30 | $599 |
Headphones M770 | $199 |
Segmented:
- Segment 1: “Product Name”
- Segment 2: “Price”
- Segment 3: “Laptop Pro/X”
- Segment 4: “$999”
- Segment 5: “Smartphone S30”
- Segment 6: “$599”
- Segment 7: “Headphones M770”
- Segment 8: “$199”
In this case, it’s vital to create exceptions for signs like “/” used as part of the product name (usually, “/” ends a segment).
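Cell-based segmentation is easy to reproduce for a CSV file. This Python sketch uses the standard `csv` module and treats every non-empty cell as one segment. The `segment_csv` function and the inline sample data are illustrative only.

```python
import csv
import io


def segment_csv(csv_text):
    """Treat every non-empty cell as one segment, reading row by row."""
    reader = csv.reader(io.StringIO(csv_text))
    return [cell.strip() for row in reader for cell in row if cell.strip()]


sample = (
    "Product Name,Price\n"
    "Laptop Pro/X,$999\n"
    "Smartphone S30,$599\n"
    "Headphones M770,$199\n"
)
for number, segment in enumerate(segment_csv(sample), start=1):
    print(f"Segment {number}: {segment}")
```

Because the only boundary in this sketch is the cell itself, “Laptop Pro/X” stays in one piece without a dedicated exception.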
HTML/XML file segmentation
In HTML or XML files, segmentation is based on the content but must respect the structure of the code. The tags (<p>) are not included in the translation segments but help maintain the structure of the output.
HTML Text:
<p> How vainly men themselves amaze</p>
<p> To win the palm, the oak, or bays,</p>
<p> And their uncessant labours see</p>
<p> Crown’d from some single herb or tree</p>
Segmented:
- Segment 1: “How vainly men themselves amaze”
- Segment 2: “To win the palm, the oak, or bays,”
- Segment 3: “And their uncessant labours see”
- Segment 4: “Crown’d from some single herb or tree”
In most CAT tools, tags are handled automatically, so you won’t need to add exceptions.
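For the curious, here is roughly what that automatic handling amounts to. The Python sketch below uses the standard library’s `html.parser` to collect the text of each `<p>` element as a segment and leave the tags out. It is a simplified assumption of how a filter might work: it ignores inline tags, nesting, and attributes that real HTML filters must deal with.

```python
from html.parser import HTMLParser


class ParagraphSegmenter(HTMLParser):
    """Collect the text of each <p> element as one segment."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._buffer = None  # holds text while inside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:
                self.segments.append(text)
            self._buffer = None


def segment_html(html):
    parser = ParagraphSegmenter()
    parser.feed(html)
    parser.close()
    return parser.segments


sample = (
    "<p> How vainly men themselves amaze</p>\n"
    "<p> To win the palm, the oak, or bays,</p>"
)
print(segment_html(sample))
```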
Common segmentation rules in translation
As a rule, segments coincide more or less with sentences or the content of each cell in the table. If you need to create different types of segments for your project, you can edit segmentation rules in your CAT tool. Most CAT and MT tools support the creation of customized segmentation.
To make sure your rules are functional and create the segments you need, follow these simple steps:
- Specify the marks ending the segments
Add all the symbols that will end segments in your rule set. Depending on your formatting and document type, these symbols include, but are not limited to, punctuation. For example, when working with tables, some CAT tools allow setting segmentation to parts of cells or multiple adjacent cells. This is particularly handy when working with subtitles, where sentences can be split into several cells to fit into one line on the screen.
- Add exceptions to the rules
Most often, exceptions are various kinds of abbreviations. Common ones are already added as exceptions to most CAT tools by default. For example, full stops after Mrs., e.g., Inc., or N.Y. should not mark the end of the segment. However, if your text contains unique cases, such as brand or domain names with punctuation (Wham!, Chips Ahoy!, Toys “R” Us, 3D.studio, etc.), add them manually.
- Work with regular expressions to refine rules/exceptions
A regular expression (RegEx) is a character sequence that defines a search pattern used to locate (and replace if needed) specific instances of words or phrases in the source or target text. You can use regular expressions to find different forms of the same term to improve term consistency, search for multiple terms at the same time, find all segments where capitalization or punctuation in the target text doesn’t match those in the source text, etc.
This is a more sophisticated part of segmentation rules and it takes some time to master. To learn the basics, you can consult this regular expressions tutorial.
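To see how end marks, exceptions, and regular expressions fit together, consider this minimal Python sketch. The break pattern, the exception list, and the `segment_with_exceptions` function are assumptions made for illustration. CAT tools store such rules in their own formats, and their matching logic is more refined.

```python
import re

# Break rule: ., ! or ? followed by whitespace ends a segment...
BREAK = re.compile(r"[.!?]+\s+")
# ...unless the text before the break ends with one of these exceptions.
EXCEPTIONS = ("Mrs.", "e.g.", "Inc.", "N.Y.", "Wham!", "Chips Ahoy!")


def segment_with_exceptions(text, exceptions=EXCEPTIONS):
    """Split at break marks, skipping breaks that follow a listed exception."""
    segments, start = [], 0
    for match in BREAK.finditer(text):
        candidate = text[start:match.end()].rstrip()
        if candidate.endswith(tuple(exceptions)):
            continue  # exception: not a real segment boundary
        segments.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


sample = "Mrs. Smith joined Acme Inc. in 2019. She loves Wham! and jazz."
for segment in segment_with_exceptions(sample):
    print(segment)
```

Note the trade-off: if a sentence genuinely ends with “Inc.”, this sketch merges it with the next one. That is why exception lists need testing against real project text.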
Key considerations for segmentation rules
For most use cases, preset rules of the CAT tools work nicely, but for specific content, depending on the domain and particular translation needs, for example, legal translation or marketing localization, the segmentation should be fine-tuned to get the best results.
When you customize rules for segmentation or create an entirely new segmentation rule set, here is what to keep in mind:
- Custom segmentation is project-specific
You can only add a segmentation rule to an existing project within your CAT tool. To edit rules, choose an existing project or create a new one, then proceed to settings, segmentation rules, and editing.
- Segmentation rules belong to a language
Segmentation is applied to a source text. This means you should create and edit segmentation rules for your source language, keeping the grammar and syntax of the source language in mind.
- You cannot edit default segmentation rules
Usually, CAT tools have a default segmentation rule set for every language. While you can customize segmentation settings for source languages on a project-by-project basis, you cannot change the default settings. You can modify them manually for a particular project or, in some cases, migrate your settings from another CAT tool.
How to test segmentation rules
After creating and saving your customized set of segmentation rules, you can import documents to test how this version of segmentation works on your source texts.
Some CAT tools have a preview box allowing you to see the effects of the segmentation rule set and fine-tune it on the go. If a preview option is not available, it’s best to try your segmentation rules on a sample document, before you start working on the translation project.
- Import and open a sample file
Take a smaller file from your upcoming project or create a sample in the source language, since segmentation rules are applied to the source text. Make sure it contains some of the words and expressions that your rules and exceptions should address. Import it into your CAT tool under the current project.
- Use the analysis feature of your CAT tool
Look at the metrics provided by your CAT tool. See how your customized rule set affects word count, match rate, quality score, and other stats of your translation.
- Check for errors and inconsistencies
Look for awkward segment breaks, poor sentence patterns, broken headings and subheadings, etc. Trace back the rule/exception that needs amending to fix the issue.
- Adjust the rules/exceptions
Change the rules or complete the exceptions list. For quicker searches and more thorough results, use regular expressions. They allow creating highly specific search filters in your CAT to hunt down the minutest problems. Also, consider checking if the formatting of your source file is done properly.
- Rerun the test until results are satisfactory
Reiterate until your segmentation rule set breaks the text exactly as you need. This might seem like busy work, but it saves you a lot of time and effort in the long run, especially for larger projects.
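If you can reproduce your rules outside the CAT tool, this test loop can be scripted. The Python sketch below compares the output of a segmenter with the segments you expect and lists every mismatch. Both `naive_segmenter` and `check_rules` are hypothetical stand-ins: in practice, the segments would come from your CAT tool’s preview or export.

```python
import re


def naive_segmenter(text):
    """Stand-in rule set: break after ., ! or ? followed by whitespace."""
    return [part for part in re.split(r"(?<=[.!?])\s+", text.strip()) if part]


def check_rules(segmenter, cases):
    """Return (source, expected, actual) for every case the rules get wrong."""
    failures = []
    for source, expected in cases:
        actual = segmenter(source)
        if actual != expected:
            failures.append((source, expected, actual))
    return failures


CASES = [
    ("Open the box. Plug it in.", ["Open the box.", "Plug it in."]),
    ("Contact Mrs. Smith today.", ["Contact Mrs. Smith today."]),
]

failures = check_rules(naive_segmenter, CASES)
for source, expected, actual in failures:
    print(f"FAILED: {source!r}\n  expected {expected}\n  got      {actual}")
```

The second case fails because the stand-in rules have no exception for “Mrs.”, which points directly at the exception that needs adding before the next run.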