Calculating the Number of Words in a Translation Project

  • 05 February 2018
  • by Michał Tosza

Michał Tosza shows us how to use the wordcount and statistics feature in memoQ to present your client with transparent information when quoting a project.

Last time, in "Localizing games in memoQ - 5 useful features", I briefly introduced 5 memoQ functions that are especially useful when localizing games. As I told you back then, I would elaborate on each of them in future articles. So today, it is time to look at the memoQ features that can help us calculate the amount of work and time required to localize a medium-sized game.

Let us imagine the following scenario: we have just received a query from a small independent game developer. Let’s call them GreatGames. They are ready to publish their next game, and they know that localization is vital to reach as many potential customers as possible. Before launching the game, they approach you and ask for a quote (the price of the translation). They also ask you to estimate how long it will take to localize their game. They are honest and state that you are not the only translator they are contacting.

Now let us try to analyze this developer:
  • They have no localization experience. This is the first game they will localize,
  • They are likely to be price-sensitive. After all, they are an independent studio,
  • They would need a concrete time-span of the translation process. They are close to their launch deadline,
  • They value transparency. They were transparent themselves when they told you that you are not the only translator they have contacted.
All in all, they expect a straightforward calculation, a concrete deadline, a moderate price, and transparent communication. So, what do you need to do to get this job and win this client? You need to deliver exactly what they need. Or even more, as “underpromise and overdeliver” is a powerful business strategy.

How can memoQ help with this task?

First, you need to acquaint this developer with basic translation industry standards, such as the fact that financial and deadline calculations are rooted in the number of source words in the translation project.

But, what are source words?

Source: the amount of text calculated in the original file.

Why not in the translation? We simply do not know the length of the translation (target text) and for that reason, it is impossible to agree on a price for the translation while negotiating the job.

Word: it is not as straightforward as it seems.

A word is a string of letters, special characters, or numbers ending with a space, an end-of-line character, or end-of-sentence punctuation. Why do we use words as the basis for our calculations? Because all CAT tools (memoQ included) treat the word as a standard unit of measure when analyzing, calculating, parsing, and performing all other operations on texts. This is simply an industry standard.

A piece of homework for you:

John has got a cat.
John's got a cat.
John's got 345 cats.
John has got 23 000 cats.
John has got 23.000 cats.
John has got #### cats.

How many words are there in each of these segments? This is your homework for next time. You can download this simple file here, import it into memoQ, and find out.
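As a baseline for the homework, here is a minimal Python sketch of the simplest possible rule: a word is whatever sits between whitespace. The function name `naive_word_count` is made up for this illustration, and this is not memoQ's counting algorithm. memoQ may well treat numbers or symbol-only strings differently, which is exactly what the exercise is meant to reveal.

```python
def naive_word_count(segment: str) -> int:
    """Count whitespace-separated tokens.

    Assumption: a "word" is any run of non-whitespace characters.
    This is a naive baseline, NOT memoQ's actual counting rule.
    """
    return len(segment.split())


segments = [
    "John has got a cat.",
    "John's got a cat.",
    "John's got 345 cats.",
    "John has got 23 000 cats.",
    "John has got 23.000 cats.",
    "John has got #### cats.",
]

for segment in segments:
    print(f"{naive_word_count(segment):>2}  {segment}")
```

Compare the numbers printed here with the ones memoQ shows after import. Any difference tells you where the tool's definition of a word departs from plain whitespace splitting.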

You can save this definition of source words to be passed on to the developer when sending your bid. Now we need to move on and analyze the file they delivered.

1. Create a project for GreatGames in memoQ and import the file for translation

The imported file is displayed in Project home > Translation > Documents section. You can find the total number of words next to the file name (in the # column). 
Number of words memoQ

This gives a very general overview of the amount of work that awaits you. When localizing games, you can assume that your daily output will be about 1500 words. That is not much, you might think, especially when translation agencies expect a daily output of about 2500-3500 words. Indeed, 1500 words a day is rather low. However, bear in mind that when localizing games you will:
  • transcreate[1] a lot,
  • spend a lot of time making up expressions that sound natural in your target language,
  • come up with neologisms,
  • look up (pop) cultural references and come up with their equivalents,
  • deal with out-of-this-world technologies, mechanics, physics, etc. created by game developers that need to make sense when localized,
  • face unusual characters that speak in all sorts of styles, registers, vocabularies,
  • struggle with character limits (string size) in mobile games.
Making a translation that encompasses all of these aspects is way more time consuming than translating a printer user manual, an MSDS sheet or a CAD software UI update. This is why your productivity will drop to almost half of a standard daily output when localizing games.

Another productivity killer is that strings in game localization files are often non-repetitive and bear little similarity to one another. Therefore, you will not be able to make much use of Auto-Propagation or fuzzy matches.

Let’s come back now to the file we imported - it contains 4385 words in total.

But this really does not tell us anything, as there can be:
  • 4385 words in non-repetitive non-similar segments and all of them would require manual translation,
  • OR there could be dozens of repetitions and similar segments, hence a lot of the job would be performed by memoQ's Auto-Propagation and fuzzy matches functions.
But we do not know that yet and need to find out.

2. More information about the real number of words

The total number of source words means very little. To receive more detailed information about the amount of work, you can right-click the file name and click Weighted Counts. Now the number of words has dropped to 4358.

Not significantly, but it is lower. Why? Weighted words are source words that take into account repetitions (that would translate way faster than new words) and fuzzy matches (that in principle would also translate faster than new words). As mentioned before, repetitions and fuzzy matches tend to be scarce in game localization, hence the tiny difference between the total and the weighted number of words.
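To make the idea of weighting concrete, here is a small Python sketch. It uses the breakdown that the Statistics report gives for this file later in the article (4347 new words and 38 words in repetitions). The weights are assumptions: new words count in full, and the repetition weight of 0.3 was picked only because it roughly reproduces the 4358 figure above. Check the weights configured in your own memoQ installation before relying on any of this.

```python
# Hypothetical weights, chosen for illustration only.
# They are NOT taken from memoQ's documentation.
ASSUMED_WEIGHTS = {"new": 1.0, "repetition": 0.3}


def weighted_count(words_by_category, weights=ASSUMED_WEIGHTS):
    """Sum each category's word count multiplied by its weight."""
    total = sum(words * weights[category]
                for category, words in words_by_category.items())
    return round(total)


# Breakdown reported for the GreatGames file in this article.
report = {"new": 4347, "repetition": 38}

print("Total source words:", sum(report.values()))
print("Weighted words (assumed weights):", weighted_count(report))
```

With so few repetitions and no fuzzy matches, almost every word carries full weight, which is why the two numbers end up so close.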

3. Detailed analysis

We know something about the number of words in the file, but this is still not good enough. We need to thoroughly analyze the file and get a detailed overview. We need a report that shows the concrete information required to estimate the time for the job, so we can provide our potential client with transparent information. 

To get this report, click the file name and hit Documents tab > Statistics button.

In memoQ, you can generate a clean and customized word analysis report that will come in handy later on. In the Statistics dialog that opens, you will see many options for preparing this report.
  1. We will only analyze the selected documents.
  2. We want the report to Show counts, but without a Status report, because segment status plays no role in our analysis.
  3. Trados 2007 is long gone, so the word count we want is memoQ.
  4. In this Analysis, we will not use Project TMs and corpora, simply because they are empty.
If you want to know what the other options are used for, click Help and have a nice read. I strongly suggest getting familiar with Homogeneity, even if we will not use this feature on the current project.

When all options are set, click the Calculate button.
Statistics memoQ

GreatGames is our potential client and this is the first game they are localizing. The translation memory for this job is empty, and that is why there were no fuzzy hits. The game strings are non-repetitive, so the number of Repetitions is very low (just 38 words in 6 segments). This analysis shows that we will need to manually translate 4347 words.

Statistics report memoQ

Now we use the Export function. This allows saving the report in HTML or CSV format for easy browsing outside of memoQ. Let's save in HTML.

HTML report memoQ

4. Summary.

At the beginning of this article, when we analyzed the expectations of our potential client, we agreed that they would like to have:
  • a straightforward calculation,
  • a concrete deadline,
  • moderate price,
  • transparent communication.
All the wordcount operations in memoQ helped us to deliver what the client expected. Let’s sum it up:
  1. We know the exact word count is: 4347 new words + 38 words in repeated segments.
  2. The time required for this job would be 4-5 days: 3 days for translation (3 x 1500), 1 day for reading through the text, 1 “backup” day to get replies for the questions you asked the developers.
  3. The HTML analysis report is a great way to be transparent. You can and should deliver this report to the developers to show how much work is required.
  4. Once you have passed all the above information to the developers, you will deliver what they expect – clear communication.
In your bid, you should deliver a detailed calculation of the number of words, memoQ’s word count report, and a 5-day delivery deadline, and adapt your price per word accordingly.
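The deadline arithmetic from the summary can be sketched in a few lines of Python. The function name `estimate_days` and its parameters are invented for this illustration. The defaults simply restate the article's figures: 1500 words per day, 1 day for reading through the text, and 1 backup day.

```python
import math


def estimate_days(new_words, daily_output=1500, review_days=1, buffer_days=1):
    """Rough turnaround estimate for a quote (illustrative only).

    Translation days are rounded up, since a partly used day
    still takes a day on the calendar.
    """
    translation_days = math.ceil(new_words / daily_output)
    return {
        "translation": translation_days,
        "review": review_days,
        "buffer": buffer_days,
        "total": translation_days + review_days + buffer_days,
    }


estimate = estimate_days(4347)
for stage, days in estimate.items():
    print(f"{stage:<12}{days} day(s)")
```

Keeping the buffer as a separate parameter makes it easy to see the gap between the deadline you quote and the date you actually aim for.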

As you noticed, the developers did not ask for reports, and you will most likely ask for clarifications while you translate rather than afterwards, so you should be able to deliver the game in 4 days. Nonetheless, you should quote a 5-day deadline and deliver within 4 days. Why? Simply because this is the best way to “underpromise and overdeliver”.

PS: I did not forget about your homework from my previous article!

You had to come up with an Excel formula to count the number of words in each cell. The answer is not easy, as the definition of “word” is tricky, as you have probably noticed. Moreover, Excel does not have a command to count words. It can count characters, but not words. The best formula I could come up with is:

(POLISH)  =DŁ(USUŃ.ZBĘDNE.ODSTĘPY(nazwa_komórki))-DŁ(PODSTAW(nazwa_komórki;" ";""))+1
(ENGLISH) =LEN(TRIM(cell_name))-LEN(SUBSTITUTE(cell_name," ",""))+1

What does it do? It counts the number of spaces, as words are separated by spaces, and adds 1 to the result. The sentence “I have a dog” contains 3 spaces, so it has 4 words. TRIM (or USUŃ.ZBĘDNE.ODSTĘPY in Polish) removes repeated spaces between words as well as spaces at the beginning and end of the text. Thanks to this function, a sloppily written source text will not confuse our formula.
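For readers who prefer to test the logic outside Excel, the same calculation can be mirrored in Python. This is only a sketch: `excel_trim` and `count_words_like_formula` are names invented here, and `excel_trim` imitates TRIM for ordinary space characters only.

```python
def excel_trim(text: str) -> str:
    """Imitate Excel's TRIM: strip leading and trailing spaces and
    collapse runs of spaces between words into a single space."""
    return " ".join(part for part in text.split(" ") if part)


def count_words_like_formula(text: str) -> int:
    """Mirror LEN(TRIM(text)) - LEN(SUBSTITUTE(text, " ", "")) + 1."""
    length_trimmed = len(excel_trim(text))          # letters + single spaces
    length_no_spaces = len(text.replace(" ", ""))   # letters only
    return length_trimmed - length_no_spaces + 1    # spaces between words + 1


print(count_words_like_formula("I have a dog"))
print(count_words_like_formula("  I  have   a dog "))
```

One quirk carries over from the formula: an empty cell is reported as 1 word, because the final +1 is added even when there is nothing to count.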