Using Corpora in Your Translation Work

I trust everyone out there in translation world is doing well after a restful Easter break! Today I wanted to share a few practical tips that I’ve developed from a specific area of translation studies, in the hope that they will prove useful to you too. The area in question is corpus-based translation studies, a fascinating sub-category of the discipline that I’m not hugely familiar with, but one that I have explored enough to glean some tips to apply to my day-to-day work.

In this particular context, the word ‘corpus’ – coming from the Latin for ‘body’ (and with its lovely plural of ‘corpora’) – simply refers to a large and structured set of texts (nowadays usually electronically stored and processed) used to carry out statistical analysis and check occurrences or validate linguistic rules within a specific language territory. These days, corpora are often found at the root of machine translation technology, including Google Translate – something I’ve previously explored on my blog.

In general, a corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus). Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora. A few free online corpora out there include the BNC (British National Corpus), the Spanish corpus CREA and the corpus of written Italian, CORIS.

For translation studies, meanwhile, the modern translator’s capacity to process information more quickly and easily than ever before has led to vast developments in this area. Indeed, the problem now is that, although we can find information easily, we need to ensure that it is reliable and correct within a specific context, and this is where corpora and concordancing software play an important role.

I don’t want to delve too far into the technical side of things, as the practical tips below don’t require any in-depth knowledge of the area. For anyone looking to get better acquainted with what research in this area involves, though, I’d highly recommend giving Maeve Olohan’s ‘Introducing Corpora in Translation Studies’ a read. The book works from a basis of Descriptive Translation Studies and analyses the worth of (as you’d imagine) corpora in translation studies. While asserting that contrastive translation studies alone doesn’t take the translation act or its sociocultural contexts (ideology etc.) into account, the author advocates combining the quantitative data provided by corpora with qualitative findings from studying the texts more closely.

Using corpora as a style guide or terminology checker

Based on my own (fairly limited) research into the area, the first application of corpora that I brought into my work was constructing a ‘DIY corpus’ to carry out statistical analysis in fields in which I was not 100% comfortable.

One of the major aims of this method was to develop my knowledge of medical translation when I first started exploring it as a specialism. In this particular case, I constructed an English monolingual corpus of medical journal entries available in open-access online journals in order to gain a firm understanding of the kind of style expected of medical writing and to provide statistical backing for the selection of particular terms or phrases in translation. Subsequently, I analysed the constructed corpus by using the concordancing software AntConc, a freeware program that allows the user to search a body of text for collocations, the frequency of words and word clusters.

In practice, when producing a translation, I would cross-check my English renderings with the corpus (i.e. searching for a particular word or phrase within the texts) to highlight how common/uncommon a particular phrase or term was and what contexts it could be used in. Take, for instance, this simple example: while a term such as ‘significativement’ in a French ST could be adequately translated in various contexts as ‘significantly’, ‘frequently’ or ‘strongly’, by using the corpus and concordancing software you can pinpoint the most fitting synonym for this exact text type and context. As such, compared to dictionaries or glossaries, which often just provide word lists and are unable to take context into account, such ‘DIY corpus’ results provide unparalleled contextual and statistical backing for a given choice.
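For the curious, the kind of check described above can be approximated in a few lines of Python. This is only a minimal sketch of what a concordancer does, not how AntConc itself works: the function names (`tokenize`, `candidate_frequencies`, `concordance`) and the three-sentence sample text are invented for illustration, and a real corpus would be loaded from your collected journal articles instead.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def candidate_frequencies(tokens, candidates):
    """Count how often each candidate rendering occurs in the corpus."""
    counts = Counter(tokens)
    return {word: counts[word] for word in candidates}


def concordance(tokens, keyword, width=4):
    """Return keyword-in-context lines: `width` tokens either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


# Tiny invented sample standing in for a real corpus of journal articles.
sample = (
    "Mortality was significantly higher in the treatment group. "
    "Pain scores were significantly reduced after surgery. "
    "Age was strongly associated with outcome."
)

tokens = tokenize(sample)
freqs = candidate_frequencies(tokens, ["significantly", "frequently", "strongly"])
kwic = concordance(tokens, "significantly", width=3)

print(freqs)
for line in kwic:
    print(line)
```

The frequency table tells you which candidate the corpus favours, while the keyword-in-context lines show the surrounding words, which is where the contextual backing comes from.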

In general, it is thought that reliable statistical patterns will emerge from a corpus containing 100,000 or more words. While this figure initially sounds daunting, if texts in the area are readily available – as is the case with medical texts, for example – the process is not overly time-consuming, and the positive impact it has on a translation project means that its construction is completely justified. Not all topics or languages are so well served, however, and the relevance and reliability of documents do need to be carefully assessed before adding them to the corpus. Why not give it a go?
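If you do give it a go, it helps to keep an eye on how close your collection is to that 100,000-word mark. The sketch below is one simple way to do so, assuming (purely for illustration) that your texts are saved as UTF-8 `.txt` files in a single folder; the folder name `medical_corpus` and the function names are hypothetical, and words are counted crudely by splitting on whitespace.

```python
from pathlib import Path

TARGET_WORDS = 100_000  # the rule-of-thumb corpus size mentioned above


def count_words(text):
    """Rough word count: split on any run of whitespace."""
    return len(text.split())


def corpus_size(folder):
    """Total word count across all .txt files directly inside `folder`."""
    return sum(
        count_words(path.read_text(encoding="utf-8"))
        for path in Path(folder).glob("*.txt")
    )


def words_still_needed(total, target=TARGET_WORDS):
    """How many more words are needed to reach the target (never negative)."""
    return max(0, target - total)


if __name__ == "__main__":
    # "medical_corpus" is a hypothetical folder name; point this at your own files.
    total = corpus_size("medical_corpus")
    print(f"{total} words collected, {words_still_needed(total)} to go")
```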

The internet as a corpus

Frankly, however, although a DIY corpus is an excellent way to find your footing in a new specialist area, it is simply not practical to construct a 100,000-word corpus every single time you have to tackle an unfamiliar genre. Fortunately, there is another, more user-friendly way of incorporating corpora into your daily work.

Rather than going through the extended process outlined above, it is possible to achieve a similar effect using the internet as a ready-made corpus for a whole range of topics. The most effective method that I have found involves using Google as a makeshift style guide when working for a particular site or within a particular site’s stylistic parameters. For example, if you are looking to produce an English text in line with the general style adopted on the BBC’s website, you can simply type a specific phrase into Google accompanied by ‘’ (or whatever site it happens to be) and the results will let you determine whether the word/phrase is commonly used, and in what specific context.

Say you wanted to see whether the US English spelling of the word ‘specialise’ is ever used on the BBC’s site, for instance: by typing “specialize” (the quotation marks are important to find only that exact spelling) into Google, it takes just a few clicks to conclude that the British English spelling is much more common with almost 30,000 hits compared to just under 4,000 for the US spelling. Handy, right?

When I am writing for a particular site or translating documents along similar lines to texts previously published online, there are numerous times when I feel unconvinced by a certain phrase or sentence structure, and I like to use this method to ensure that I’m fulfilling the intended style. It’s a kind of guiding hand when proofing your own work.

