Tag: Corpora

‘Fire’: Some observations

At the end of class on Tuesday, November 21, we were asked to further investigate the word ‘fire’ as a noun and a verb. I made the following observations during my short investigation:

The fact that ‘fire’ is a noun and a verb can be confirmed by looking at ‘fire’ in the Oxford English Dictionary. Although polysemy is often studied by using Princeton’s WordNet or the various wordnets in other languages (please see my post WordNet and wordnets for more information), it is clear that the dictionary entry in this case gives plenty of detail about the polysemy of the word, as both the entry for ‘fire’ as a verb and the entry for ‘fire’ as a noun contain dozens of different meanings.

I wanted to compare the usage of ‘fire’ as a noun and as a verb to see which one is used more frequently. I decided to refer to the British National Corpus on Sketch Engine. By looking at the Wordlist feature, I discovered that ‘fire’ appears 17,348 times as a lemma. It appears 14,172 times as a noun and 3,176 times as a verb, showing that it is used far more frequently as a noun.
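To put the two counts in proportion, here is a minimal Python sketch. It only uses the two figures quoted above; the variable and function names are my own and are not part of Sketch Engine.

```python
# Counts quoted above for the lemma 'fire' in the BNC (Sketch Engine Wordlist).
counts = {"noun": 14172, "verb": 3176}


def pos_shares(counts):
    """Return each part of speech's share of the total lemma frequency."""
    total = sum(counts.values())
    return {pos: n / total for pos, n in counts.items()}


total = sum(counts.values())
shares = pos_shares(counts)

# Print one line per part of speech with its raw count and percentage share.
for pos, share in shares.items():
    print(f"{pos}: {counts[pos]} of {total} ({share:.1%})")
```

On these figures, the noun accounts for roughly 82% of the occurrences of the lemma and the verb for roughly 18%.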

WL4102 Shorter Blog Essay: Machine Translation and Gender Bias

For my shorter blog essay, I decided to look at bias in machine translation with a particular focus on Google Translate and gender. How does Google Translate work? Where can we see gender bias and why does this occur? How does this link to linguistic corpora? These are some questions upon which I hope to reflect in this blog post.

Google Translate was launched in 2006 with the goal of “[breaking] language barriers and [making] the world more accessible”. With over 500 million users and over 100 languages available, the service was seeing more than 100 billion words being translated per day by 2016. These extremely high numbers show what a fundamental part Google Translate plays in translation for many people around the world. The service is an example of Machine Translation (MT) or translation from one natural language to another using a computer. (Source: Ten years of Google Translate)

Google Translate was originally based on statistical machine translation. Phrases were first translated into English and then cross-referenced with databases of documents from the United Nations and the European Parliament. These databases are corpora, e.g. the ‘European Parliament Proceedings Parallel Corpus 1996-2011’. Although the translations produced were not flawless, the broad meaning could be reached. In November 2016, it was announced that the service would change to neural machine translation, which would mean comparing whole sentences at a time and using more linguistic resources. The aim was to ensure greater accuracy, as more context would be available. The translations produced through this service are compared repeatedly, which allows Google Translate to decipher patterns between words in different languages. (Source: Google Translate: How does the search giant’s multilingual interpreter actually work?)

Although accuracy has improved, one issue that remains is that of gender. Machine Translation: Analyzing Gender, a case study published in 2013 as part of Gendered Innovations at Stanford University, shows that translations between a language with fewer gender inflections (such as English) and a language with more gender inflections (such as German) tend to display a male default, meaning the nouns are shown in the male form and male pronouns are used, even when the text specifically refers to a female. The case study also shows, however, that this male default is overridden when the noun refers to something considered stereotypically female, such as ‘nurse’. I tested gender bias myself on Google Translate by translating the grammatically gender-neutral term ‘the secretary’ from English to German. As can be seen in the photo below, ‘the secretary’ by itself translates automatically to ‘die Sekretärin’ (the secretary, female), but when it is combined with ‘of state’ it translates automatically to ‘der Staatssekretär’ (secretary of state, male). This shows a very obvious bias in terms of gender roles. (Sources: Machine Translation | Gendered Innovations; Google Translate’s Gender Problem (And Bing Translate’s, And Systran’s))

The above example shows a simple word with the definite article and gives no other context. Therefore, the machine had to rely on frequency and chose the gender most used in translations of this word in the corpora upon which it is based. In the corpora, ‘the secretary’ is more frequently shown to be a female and ‘the secretary of state’ is more frequently shown to be a male. Although Google Translate has moved away from solely statistical translation to neural machine translation, it still displays issues stemming from statistical methods. This 2017 article by Mashable features further examples. In it, a Google spokesperson is quoted as saying over email, “Translate works by learning patterns from many millions of examples of translations seen out on the web. Unfortunately, some of those patterns can lead to translations we’re not happy with. We’re actively researching how to mitigate these effects; these are unsolved problems in computer science, and ones we’re working hard to address.” (Sources: Machine Translation | Gendered Innovations; Google Translate might have a gender problem)
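The frequency argument above can be illustrated with a toy Python sketch. This is not how Google Translate actually works: the counts are invented for illustration, and the names `toy_alignments` and `most_frequent_translation` are my own. It only shows how always picking the most frequent translation in the data reproduces whatever gender imbalance the data contains.

```python
from collections import Counter

# Invented toy counts standing in for how often each German form is aligned
# with an English phrase in a parallel corpus. These are NOT real corpus figures.
toy_alignments = {
    "the secretary": Counter({"die Sekretärin": 7, "der Sekretär": 3}),
    "the secretary of state": Counter(
        {"der Staatssekretär": 9, "die Staatssekretärin": 1}
    ),
}


def most_frequent_translation(phrase, alignments):
    """Pick the translation seen most often for the phrase, ignoring context."""
    translation, _count = alignments[phrase].most_common(1)[0]
    return translation


for phrase in toy_alignments:
    print(phrase, "->", most_frequent_translation(phrase, toy_alignments))
```

With no context to go on, the less frequent form is never chosen, however small the gap between the two counts.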

The current algorithms behind Google Translate, which use available corpora to translate phrases, present this issue. As corpora show language as it is used, one would hope that the gender balance within them would improve as the world develops more awareness of gender-related issues and gets closer to achieving gender balance. This could then, in turn, have an effect on the translations. The Machine Translation: Analyzing Gender case study suggests a solution that would entail reforming the computer science and engineering curriculum to include “sophisticated methods of sex and gender analysis”, and concludes with a reflection on the complexity of this issue and on the need for new algorithms, understandings and tools. Until then, this issue, as shown by my quick search above, is ongoing. (Source: Machine Translation | Gendered Innovations)

For more content relating to corpora and gender, please see my main blog essay.

Update – December 8, 2018: I have written another blog post relating to this short essay in light of an article published two days ago on Google Blog. You can read this post here.


‘Strong’, ‘Grasp’, ‘Consequence’: Some observations

At the end of class on Tuesday, November 13, we were asked to further investigate the words ‘strong’, ‘grasp’ and ‘consequence’ in terms of how we would go about describing them grammatically using the tools we had focused on in class. The following are some observations I made during my investigation:

Strong:

In the Oxford English Dictionary, ‘strong’ is shown to be a noun in the case of a group of strong people. When considering how this noun functions grammatically, it is important to note that it is a collective noun and cannot be made plural. It is also shown to be an adjective, meaning grammatically we have to consider the comparative and superlative forms, in this case ‘stronger’ and ‘strongest’ respectively. As it is a monosyllabic adjective and therefore must take the -er and -est suffixes, I did not need to consult a corpus. In the case of a disyllabic adjective, it can be useful to consult a corpus to see what the most common comparative and superlative forms are, as grammatically speaking, the adjective could take the -er and -est suffixes or the ‘more’ and ‘most’ forms.

Grasp:

‘Grasp’ is shown to be a noun and a verb in the Oxford English Dictionary. I reflected on the plural form and thought that, in terms of my own usage of the word, I would never say ‘grasps’ as a plural of ‘grasp’. I thought about ‘within their grasp’ as an expression and the fact that one can only say ‘I have a good grasp of maths’ in the singular. I decided to use the British National Corpus (BNC) to see if it can in fact be used as a plural. On Sketch Engine, I created a concordance based on the lemma ‘grasp’ as a noun, which can be viewed here. This showed that in the BNC, ‘grasp’ as a noun is not used in the plural form. As a verb, it is important to note that it is a regular verb, with the past tense being formed with ‘I/you/he/etc. grasped’ and the past participle being ‘grasped’.

Consequence:

The Oxford English Dictionary shows ‘consequence’ to be a noun and a verb. The verb ‘consequence’ is described as rare and obsolete. In the case of ‘consequence’ as a noun, it can easily be made into the plural form by adding an ‘s’. I created a concordance on Sketch Engine within the BNC based on the lemma ‘consequence’ as a noun, and one can see that it is commonly used. The concordance can be seen here.
