
Category: Tools and Information Technologies

Update: Google Translate and Gender Bias

For my shorter blog essay, I wrote a post called Machine Translation and Gender Bias. In it, I discussed the evidence of gender bias on Google Translate. On December 6, 2018, an article that is highly relevant to this blog post was published on Google Blog. I wanted to write another blog post to discuss its content and implications. 

The article, titled ‘Reducing gender bias in Google Translate’, was written by James Kuczmarski, Product Manager at Google Translate. It details a new development in the service that aims to “[address] gender bias by providing feminine and masculine translations for some gender-neutral words”. When translating a single word such as ‘surgeon’ from English into French, Italian, Portuguese or Spanish, or when translating phrases from Turkish into English, the article claims that both the feminine and masculine translations will be displayed. While the German example in my original blog post is, thus, still not covered by this new feature, Kuczmarski states that Google Translate plans to cover more languages in the future. (Source: Reducing gender bias in Google Translate)

I decided to test the new feature out in another language that I study, Spanish. Firstly, I entered the term ‘secretary’: 

Google Translate did indeed provide the masculine and feminine form of the noun in Spanish. However, when I added ‘the’ before the noun, it was clear that the service has not fully solved the gender bias issue:

When the definite article is placed before the noun, the translation still automatically becomes ‘la secretaria’, the feminine form. 

It is great to see that efforts are being made to reduce the amount of gender bias on Google Translate. However, given the limited number of languages covered by the new system and the fact that a quick test revealed an obvious flaw in it (that it does not function with the definite article), it is clear that there is still much progress to be made.


‘Fire’: Some observations

At the end of class on Tuesday, November 21, we were asked to further investigate the word ‘fire’ as a noun and a verb. I made the following observations during my short investigation:

The fact that ‘fire’ is a noun and a verb can be confirmed by looking at ‘fire’ in the Oxford English Dictionary. Although polysemy is often studied by using Princeton’s WordNet or the various wordnets in other languages (please see my post WordNet and wordnets for more information), it is clear that the dictionary entry in this case gives plenty of detail about the polysemy of the word, as both the entry for ‘fire’ as a verb and the entry for ‘fire’ as a noun contain dozens of different meanings.

I wanted to compare the usage of ‘fire’ as a noun and as a verb to see which one is used more frequently. I decided to refer to the British National Corpus on Sketch Engine. By looking at the Wordlist feature, I discovered that ‘fire’ appears 17,348 times as a lemma. It appears 14,172 times as a noun and 3,176 times as a verb, showing that it is used far more frequently as a noun.
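The proportions behind this comparison can be checked with a few lines of Python. This is only a sketch based on the three counts reported above; the variable names are chosen for illustration.

```python
# Frequencies of 'fire' in the British National Corpus, as reported above
noun_count = 14_172
verb_count = 3_176

# The two part-of-speech counts should add up to the lemma total (17,348)
total = noun_count + verb_count

noun_share = noun_count / total
verb_share = verb_count / total

print(f"total occurrences: {total}")
print(f"noun: {noun_share:.1%}, verb: {verb_share:.1%}")
print(f"noun-to-verb ratio: {noun_count / verb_count:.1f}")
```

In other words, roughly four out of every five occurrences of the lemma are nouns.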


WL4102 Shorter Blog Essay: Machine Translation and Gender Bias

For my shorter blog essay, I decided to look at bias in machine translation with a particular focus on Google Translate and gender. How does Google Translate work? Where can we see gender bias and why does this occur? How does this link to linguistic corpora? These are some questions upon which I hope to reflect in this blog post.

Google Translate was launched in 2006 with the goal of “[breaking] language barriers and [making] the world more accessible”. With over 500 million users and over 100 languages available, the service was seeing more than 100 billion words being translated per day by 2016. These extremely high numbers show what a fundamental part Google Translate plays in translation for many people around the world. The service is an example of Machine Translation (MT) or translation from one natural language to another using a computer. (Source: Ten years of Google Translate)

Google Translate was originally based on statistical machine translation. Phrases were first translated into English and then cross-referenced with databases of documents from the United Nations and the European Parliament. These databases are corpora, e.g. the ‘European Parliament Proceedings Parallel Corpus 1996-2011’. Although the translations produced were not flawless, the broad meaning could be reached. In November 2016, it was announced that the service would change to neural machine translation, which would mean comparing whole sentences at a time and the use of more linguistic resources. The aim was to ensure greater accuracy as more context would be available. The translations produced through this service are compared repeatedly, which allows Google Translate to decipher patterns between words in different languages. (Source: Google Translate: How does the search giant’s multilingual interpreter actually work?)

Although accuracy has improved, one issue that remains is that of gender. Machine Translation: Analyzing Gender, a case study published in 2013 as part of Gendered Innovations at Stanford University, shows that translations between a language with fewer gender inflections (such as English) and a language with more gender inflections (such as German) tend to display a male default, meaning the nouns are shown in the male form and male pronouns are used, even when the text specifically refers to a female. The case study also shows, however, that this male default is overridden when the noun refers to something considered stereotypically female, such as ‘nurse’. I tested gender bias myself on Google Translate by translating the grammatically gender-neutral term ‘the secretary’ from English to German. As can be seen in the photo below, ‘the secretary’ by itself translates automatically to ‘die Sekretärin’ (the secretary, female) but, when it is combined with ‘of state’, translates automatically to ‘der Staatssekretär’ (secretary of state, male). This shows a very obvious bias in terms of gender roles. (Sources: Machine Translation | Gendered Innovations, Google Translate’s Gender Problem (And Bing Translate’s, And Systran’s))

The above example shows a simple word with the definite article and gives no other context. Therefore, the machine had to rely on frequency and chose the gender most used in translations of this word in the corpora upon which it is based. In the corpora, ‘the secretary’ is more frequently shown to be a female and ‘the secretary of state’ is more frequently shown to be a male. Although Google Translate has moved away from solely statistical translation to neural machine translation, it still displays issues stemming from statistical methods. This 2017 article by Mashable features further examples. In it, a Google spokesperson is quoted as saying over email, “Translate works by learning patterns from many millions of examples of translations seen out on the web. Unfortunately, some of those patterns can lead to translations we’re not happy with. We’re actively researching how to mitigate these effects; these are unsolved problems in computer science, and ones we’re working hard to address.” (Sources: Machine Translation | Gendered Innovations, Google Translate might have a gender problem)
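The frequency effect described above can be illustrated with a toy example. The sketch below is not how Google Translate is implemented; it only shows how always picking the most frequent translation reproduces whatever imbalance the data contains. The counts and the function name are invented for illustration.

```python
from collections import Counter

# Hypothetical counts of how often each English phrase was paired with a
# given German translation in some parallel corpus. The numbers are invented.
observed_translations = {
    "the secretary": Counter({"die Sekretärin": 70, "der Sekretär": 30}),
    "the secretary of state": Counter(
        {"der Staatssekretär": 90, "die Staatssekretärin": 10}
    ),
}


def most_frequent_translation(phrase: str) -> str:
    """Return the translation seen most often for the phrase."""
    counts = observed_translations[phrase]
    translation, _ = counts.most_common(1)[0]
    return translation


for phrase in observed_translations:
    print(phrase, "->", most_frequent_translation(phrase))
```

With no other context to draw on, the minority form is never chosen, however often it appears in the data.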

The current algorithms involved in Google Translate, which involve using available corpora to translate phrases, present this issue. As corpora show language as it is used, one would hope that the gender balance within them would improve as the world develops more awareness of gender-related issues and gets closer to achieving gender balance. This could then, in turn, have an effect on the translations. The Machine Translation: Analyzing Gender case study suggests a solution that would entail reforming the computer science and engineering curriculum to include “sophisticated methods of sex and gender analysis”, and concludes with a reflection on the complexity of this issue and on the need for new algorithms, understandings and tools. Until then, this issue, as shown by my above quick search, is ongoing. (Source: Machine Translation | Gendered Innovations)

For more content relating to corpora and gender, please see my main blog essay.

Update – December 8, 2018: I have written another blog post relating to this short essay in light of an article published two days ago on Google Blog. You can read this post here.


WL4102 Main Blog Essay: Critical Discourse Analysis and Corpora – A Corpus-Based Project

In my post on Concordances and Voyant Tools, I touched on the use of corpora in discourse research. In this extended blog post, I will explore this idea further, by looking at the connectedness of critical discourse analysis (CDA) and linguistic corpora, using Sketch Engine to study two corpora, one in Spanish and the other in German.

Introduction

CDA is an approach to studying written and spoken communication and the relationship between this communication and society. Van Dijk states that a central focus of CDA is “(group) relations of power, dominance and inequality and the way these are reproduced or resisted by social group members through text and talk” (van Dijk 2). It emerged from different areas of linguistics, including text linguistics and sociolinguistics, and it relates to several modules that we have studied as part of the BA World Languages programme, particularly WL2102: Introduction to Semiotics. The approach is multidisciplinary and draws on methodological approaches that are effective in examining forms of social inequality, such as inequality based on class, sexuality and religion. (Sources: Critical Discourse Analysis: Theory and Interdisciplinarity: pages 11-15, Aims of Critical Discourse Analysis)

As language forms such an important part of the approach, I wanted to examine more closely how we can see power structures relevant to CDA in linguistic corpora and thus observe how corpus linguistics and CDA connect. I decided to focus on power structures relating to skin colour and gender, using a German-language corpus to examine skin colour and a Spanish-language corpus to examine gender.

Methodology

For my investigation, I used Sketch Engine’s Word Sketch Difference feature, which compares a set of collocates for one lemma in a certain corpus to a set of collocates for another lemma within the corpus. Each lemma is given a colour (red or green) and the collocates that tend to combine with each one are given the same colour. Collocates in white tend to combine with both. If a collocate is shown in dark green or dark red, the collocation is stronger, meaning that the collocate combines far more often with the lemma of that colour and far less often with the other lemma. (Source: Word Sketch Difference lesson | Sketch Engine)
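The idea behind this comparison can be sketched in Python. Sketch Engine's own scoring is not reproduced here; raw co-occurrence counts and an arbitrary threshold stand in for it, and the counts, labels and function name are all invented for illustration.

```python
def sketch_difference(collocates_a, collocates_b, strong_ratio=3.0):
    """Label each collocate by the lemma it leans towards.

    collocates_a and collocates_b map a collocate to its co-occurrence
    count with lemma A and lemma B. 'both' plays the role of the white
    collocates, and 'strongly A/B' the role of the darker colours.
    """
    labels = {}
    for word in sorted(set(collocates_a) | set(collocates_b)):
        a = collocates_a.get(word, 0)
        b = collocates_b.get(word, 0)
        if a == b:
            labels[word] = "both"
        elif a > b:
            # Adding 1 to the smaller count avoids division by zero
            labels[word] = "strongly A" if a / (b + 1) >= strong_ratio else "A"
        else:
            labels[word] = "strongly B" if b / (a + 1) >= strong_ratio else "B"
    return labels


# Invented counts, purely for illustration
lemma_a = {"tall": 12, "old": 5, "young": 4}
lemma_b = {"old": 6, "young": 4, "small": 9}
print(sketch_difference(lemma_a, lemma_b))
```

A collocate that only ever appears with one lemma ends up in a 'strongly' category, which corresponds to the dark green and dark red entries described above.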

To look at group inequalities in a simple way, I decided to use lemmas that represent an opposing power relationship. It is important to note that these terms are not binary oppositions but I see them as opposing in terms of societal power structures. In the German-language corpus, I searched the lemmas ‘schwarzhäutig’ (black-skinned) and ‘weißhäutig’ (white-skinned), drawing on the amount of discrimination historically and presently faced by people of colour. In the Spanish-language corpus, I searched the lemmas ‘mujer’ (woman) and ‘hombre’ (man) based on the gender discrimination frequently experienced by women living in a patriarchal society.

The German-language corpus used was the German Web 2013 corpus (deTenTen13), which contains over 16 billion words. The Spanish-language corpus used was the Spanish Web 2018 corpus (esTenTen18), which contains over 17 billion words and two subcorpora for European Spanish and American Spanish. Both of these corpora are made up of collected web-based texts. (Sources: deTenTen – German corpus from the web | Sketch Engine, esTenTen – Spanish corpus from the web | Sketch Engine)

Results

Skin colour in the German Web 2013 corpus:

The result of my search can be seen here

I decided to look at oppositions relating to skin colour in the German-language corpus for a specific reason: when using the Word Sketch Difference feature on Sketch Engine, the user can only enter a lemma. This means that one cannot enter a term such as ‘black person’ or ‘white person’. As the colours ‘white’ and ‘black’ often refer to any object with that colour, this was also not a helpful search. In German, the adjectives ‘schwarzhäutig’ (black-skinned) and ‘weißhäutig’ (white-skinned) are used, which means they can function as a lemma automatically connected to skin colour.

Although the search only results in one column due to the infrequency of the words in the corpus, this column alone gives us plenty of information. The three nouns that are shown to be modified by the adjective ‘schwarzhäutig’ six times or more and that are never modified by ‘weißhäutig’ in the corpus are ‘Hüne’ (giant/hulk), ‘Bastard’ (bastard) and ‘Afrikaner’ (African male). These collocates give us an impression of the language used in texts online around the word ‘schwarzhäutig’. The appearance of the word ‘Bastard’ shows how negative some of this language can be. The presence of the word ‘Afrikaner’ also shows an association between black skin and African males, which is of course commonly problematic for people of colour from countries outside of Africa, as exemplified by CNN’s social media campaign ‘No, where are you really from?’.

These results contrast with ‘Amerikaner’ (American male), ‘Europäer’ (European male), ‘Fremde’ (stranger, foreign person) and ‘Blondine’ (blonde woman), which are words shown to collocate only with ‘weißhäutig’. Although these terms also show positioning of skin colour in terms of country of origin or nationality, no words such as ‘Bastard’ appear, which reflects the power structure.

Gender in the Spanish Web 2018 corpus:

The result of my search can be seen here

The words ‘mujer’ and ‘hombre’ appear more frequently in the Spanish-language corpus than ‘weißhäutig’ and ‘schwarzhäutig’ do in the German corpus.

The two verbs that mainly collocate with ‘mujer’ rather than ‘hombre’ as an object are ‘embarazar’ (to impregnate) and ‘violar’ (to violate/rape). At the bottom of the column, we can see that a verb that commonly collocates with ‘hombre’ rather than ‘mujer’ as an object is ‘armar’ (to arm). These collocates show ‘hombre’ as associated with weapons and ‘mujer’ as a receiver of violence, which shows the power structure. This is furthered when we consider the verb that mainly collocates with ‘mujer’ as a subject: ‘sufrir’ (to suffer).

By examining the column that shows collocates involving the preposition ‘sin’ (without), we can see that ‘mujer’ is associated with collocates such as ‘sin pareja’ (without a partner) and ‘sin hijo’ (without a child), which also shows certain expectations surrounding the role of women.

Conclusion

As a central focus of critical discourse analysis is group relations of power as shown through language, a corpus analysis can help greatly. Although my above examination of two corpora on Sketch Engine only consisted of a short investigation, it provided me with results that showed that power structures can be seen in corpora by examining collocates. In the German-language corpus, the word ‘Bastard’ showed the negativity surrounding the term ‘schwarzhäutig’ and, in the Spanish-language corpus, ‘mujer’ was shown to receive violence and involve societal expectations surrounding partnership and children. In this sense, a corpus-based study relates to the central focus of CDA in studying the reproduction of power relations through communication. I feel that this information will be helpful for me going forward, especially as I will be taking a World Languages module next semester called WL4101: Language and Power. This task has shown me how relevant corpora are in my studies as a language student.

For more content relating to corpora and gender, please see my shorter blog essay.


‘Strong’, ‘Grasp’, ‘Consequence’: Some observations

At the end of class on Tuesday, November 13, we were asked to further investigate the words ‘strong’, ‘grasp’ and ‘consequence’ in terms of how we would go about describing them grammatically using the tools we had focused on in class. The following are some observations I made during my investigation:

 

Strong:

In the Oxford English Dictionary, ‘strong’ is shown to be a noun in the case of a group of strong people. When considering how this noun functions grammatically, it is important to note that it is a collective noun and cannot be made plural. It is also shown to be an adjective, meaning grammatically we have to consider the comparative and superlative forms, in this case ‘stronger’ and ‘strongest’ respectively. As it is a monosyllabic adjective and therefore must take the -er and -est suffixes, I did not need to consult a corpus. In the case of a disyllabic adjective, it can be useful to consult a corpus to see what the most common comparative and superlative forms are, as grammatically speaking, the adjective could take the -er and -est suffixes or the ‘more’ and ‘most’ forms.

 

Grasp:

‘Grasp’ is shown to be a noun and a verb in the Oxford English Dictionary. I reflected on the plural form and thought that, in terms of my own usage of the word, I would never say ‘grasps’ as a plural of ‘grasp’. I thought about ‘within their grasp’ as an expression and the fact that one can only say ‘I have a good grasp of maths’ in the singular. I decided to use the British National Corpus (BNC) to see if it can in fact be used as a plural. On Sketch Engine, I created a concordance based on the lemma ‘grasp’ as a noun, which can be viewed here. This showed that in the BNC, grasp as a noun is not used in the plural form. As a verb, it is important to note that it is a regular verb, with the past tense being formed with ‘I/you/he/etc. grasped’ and the past participle being ‘grasped’.

 

Consequence:

The Oxford English Dictionary shows ‘consequence’ to be a noun and a verb. The verb ‘consequence’ is described as rare and obsolete.  In the case of ‘consequence’ as a noun, it can easily be made into the plural form by adding an ‘s’. I created a concordance on Sketch Engine within the BNC based on the lemma ‘consequence’ as a noun, and one can see that it is commonly used. The concordance can be seen here.  


Poem in XML

As part of WL4102, we were asked to encode a poem of our choice on Brackets in XML using the TEI guidelines. I chose to encode the poem ‘Explico algunas cosas’ by Pablo Neruda, a powerful poem dealing with the themes of the Spanish Civil War and the poet’s own writing. As I could not upload the XML document directly to this blog post due to WordPress’ security settings, I have posted screenshots of the encoded poem:
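For readers who cannot see the screenshots, the general shape of such an encoding can be sketched in Python. This is a minimal illustration rather than the encoding shown in the screenshots: it wraps stanzas in the TEI elements for line groups (lg) and verse lines (l), the helper name is hypothetical, and the text is placeholder text, not the poem itself.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Serialise TEI elements without a prefix (as the default namespace)
ET.register_namespace("", TEI_NS)


def encode_stanzas(stanzas):
    """Wrap stanzas (lists of verse lines) in TEI <lg> and <l> elements."""
    body = ET.Element(f"{{{TEI_NS}}}body")
    for stanza in stanzas:
        line_group = ET.SubElement(body, f"{{{TEI_NS}}}lg", {"type": "stanza"})
        for verse in stanza:
            line = ET.SubElement(line_group, f"{{{TEI_NS}}}l")
            line.text = verse
    return body


# Placeholder text, not the actual poem
stanzas = [
    ["first line of stanza one", "second line of stanza one"],
    ["first line of stanza two"],
]
body = encode_stanzas(stanzas)
print(ET.tostring(body, encoding="unicode"))
```

A complete TEI document would also need a header with metadata about the poem, which is left out here.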


WL4102 Presentation: Standards

As part of WL4102, students were asked to prepare presentations with a focus on concepts, technologies or tools of their choice. For my presentation, I chose the topic ‘standards’. Following research into the topic, I began to write my script, timing myself as I did so to ensure that I would stick to the allotted five minutes on the day of the presentation. This made my script add up to approximately 850 words. The script is based on the following information, which I have arranged in the form of a blog post. I have also included the sources that informed my presentation and compiled a list that can be found at the bottom of this post. While speaking, I displayed an outline of my presentation on my blog to aid those in attendance in following the presentation. I have attached a screenshot below.

Standards in computing are sets of specifications, or guidelines, for developing a certain computer technology, be that hardware or software. 

The first electronic digital computer was completed over 70 years ago and for about 10 years after that, each new computer developed was a wholly unique design. This was very costly. In the late 1950s, computers called mainframes emerged. These were cheaper to produce and were produced in greater quantity. Each manufacturer developed its own standards to build a complete system, for example, control and programming methods. When computers gained commercial popularity in the latter part of the 20th century, they became smaller and were produced in much greater numbers. This is largely due to standards. A huge advantage of standards stems from the fact that they help in creating programs that work on different systems, or that are compatible with different systems. The increasing number of computers produced was possible largely due to manufacturers’ ability to draw on standards to manufacture a range of products relying on the same technologies. (Sources: History and Impact of Computer Standards [Anniversary Feature] – Computer: page 3, BBC Bitesize – GCSE Computer Science – Standards – Revision 1)

Some characteristics of standards, as detailed by Marvin Waschke, are that standards are established by custom, by general consent or by an authority. They are widely followed and are usually documented with great precision. The scope and applicability, or what the standard actually provides for, are usually widely understood. Examples include Unicode, the QWERTY keyboard layout and the file format MP3. Some of the major standards organisations include the International Organization for Standardization (ISO), which maintains standards of every kind, and the World Wide Web Consortium, which focuses on standards for web-based technologies. (Sources: Cloud Standards: Agreements That Hold Together Clouds – Marvin Waschke – Google Books: pages 26 and 27, BBC Bitesize – GCSE Computer Science – Standards – Revision 1)

There are two types of standards, which relate to the source of the standard itself: de facto and de jure. Both come from Latin, with de facto meaning ‘in practice’ and de jure meaning ‘in law’. In the case of a de facto standard, a person or a company builds a system in a certain way and this system could be a success and be used more, either by the original creator or by others. It then evolves so that the majority of systems are developed in this way and a de facto standard exists. There are no standards bodies involved in this standard’s development. An example of this is IBM’s PC design. (Source: Cloud Standards: Agreements That Hold Together Clouds – Marvin Waschke – Google Books: pages 27-32)

De jure standards, on the other hand, are sanctioned by a standards body, like the World Wide Web Consortium, which maintains the XML Schema Definition (XSD) language. This standard defines the “legal building blocks of an XML document”. Checking a document against such a schema ensures that the XML is valid, in addition to being well-formed. (Source: XML Schema Tutorial)
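The difference between the two checks can be hinted at in Python. The sketch below only tests well-formedness (correctly nested and closed tags) using the standard library; validating against an XSD requires a schema-aware library and is not shown. The function name and the sample strings are invented for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The second string never closes its <line> element
print(is_well_formed("<poem><line>text</line></poem>"))
print(is_well_formed("<poem><line>text</poem>"))
```

A document can pass this check and still be invalid, for example if it uses an element that the schema does not allow.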

De jure standards are sometimes also backed by government bodies, as in the case of pharmaceutical safety standards. De jure standards may start as de facto standards. One example of a de facto standard becoming a de jure standard is the C programming language. The language was originally detailed in a book by Brian Kernighan and Dennis Ritchie in 1978. It gained popularity and the American National Standards Institute (ANSI) published a standard for the C language in 1989. (Source: Cloud Standards: Agreements That Hold Together Clouds – Marvin Waschke – Google Books: pages 32-36)

Another distinction can be drawn between open and proprietary standards. Open standards are made openly available to the public, either through a Creative Commons License or by being unlicensed, along with any supporting material needed to fully understand the scope and applicability of the standard. An example of this is HTML. Proprietary standards are privately owned by an organisation or individual. The owner can control the use of the standard through the licensing terms. An example of this is a DOC file, or the Microsoft Word Document file format. Stacy Baird states that proprietary standards allow for more efficiency in the development of a new product, as the procedural issues involved in open standard-setting organisations are avoided, such as the issue of reaching a consensus among everybody involved. (Sources: BBC Bitesize – GCSE Computer Science – Standards – Revision 3, Opening Standards: The Global Politics of Interoperability – Google Books: page 19)

An example of a technical standard is ISO 639, which is the international standard for language codes. The purpose of ISO 639 is to maintain internationally recognised codes to represent languages or language families. This standard is a de jure standard, as it is maintained by a standards organisation: the International Organization for Standardization. It is an open standard, as the codes can be used without having to deal with any licensing terms. This example shows that standards are diverse in how they are used as these language codes can be used in coding and can be used in a library setting. (Source: ISO 639 Language codes)
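As a small illustration of how these codes might be used in code, the sketch below stores the two-letter (ISO 639-1) and three-letter (ISO 639-3) codes for a few languages and looks a language up by its code. The dictionary layout and the function name are assumptions made for this example, not part of the standard.

```python
from typing import Optional

# Two-letter (ISO 639-1) and three-letter (ISO 639-3) codes
LANGUAGE_CODES = {
    "Irish": {"639-1": "ga", "639-3": "gle"},
    "Spanish": {"639-1": "es", "639-3": "spa"},
    "German": {"639-1": "de", "639-3": "deu"},
    "English": {"639-1": "en", "639-3": "eng"},
}


def language_for_code(code: str) -> Optional[str]:
    """Return the language name for a code from either part, or None."""
    code = code.lower()
    for language, codes in LANGUAGE_CODES.items():
        if code in codes.values():
            return language
    return None


print(language_for_code("ga"))
print(language_for_code("deu"))
```

The same codes appear in other settings, such as the lang attribute of an HTML page.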

In summary, standards are sets of specifications or guidelines for developing a certain aspect of computer technologies. They provide a means of creating programs and products compatible with different systems. They can be de jure or de facto, open or proprietary, and can be used in a variety of settings. 


Concordances and Voyant Tools

Concordance programs allow the user to search for instances of a given word or phrase within a text or a corpus. In the most common concordance format, each concordance line displays an occurrence of the word as it appears in the text or database, along with the words occurring on either side of it. This shows the context in which the word appears. This format is called a KWIC concordance, or a key-word-in-context concordance.
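A KWIC concordance can be sketched in a few lines of Python. This is a simplified illustration rather than how any particular concordance program works: it lowercases the text, splits it into word tokens, and returns each occurrence of the keyword with a fixed number of words on either side. The function name, the window size and the sample sentence are invented for the example.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


sample = "The fire spread quickly. They had to fire the cannon before the fire died."
for left, key, right in kwic(sample, "fire"):
    # Right-align the left context so the keyword lines up in one column
    print(f"{left:>25} | {key} | {right}")
```

Lining the keyword up in a single column is what makes patterns in the surrounding words easy to spot.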

Concordances provide a convenient format to analyse words in terms of their context and to examine the patterns in which they occur. This is useful for lexicographers as it is important to know how words occur in a certain language and their frequency when writing dictionaries. By studying concordances, one can also obtain data on collocations, which is very useful in discourse research. (Sources: Using a Concordance for Discourse Research, https://ota.ox.ac.uk/documents/searching/handbook.html)

I used Voyant Tools to experiment with concordances using a plain text file version of Luther’s Bible. I downloaded the text from the Oxford Text Archive and uploaded it to Voyant Tools. This created the concordance that can be viewed here. In the bottom right-hand corner of the screen, one can see the concordance lines created.

From the most frequent word in the text, it is clear that I encountered the issue of German character representation. Instead of the German word ‘dass’, which means ‘that’ and which would have been written with a scharfes S (ß) at the time, ‘daãÿ’ is displayed. The appearance of this word also brings up the issue of stop words. Stop words are words that should be excluded from the results of a concordance and are typically function words. In Voyant Tools, you can choose to use pre-existing stopword lists or create your own. In this case, as some function words, such as ‘daß’, were displayed differently to how they should appear, it would be difficult to use this function.
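One possible explanation for this display, which is an assumption rather than something confirmed from the file itself, is that the text was saved as UTF-8 but read as Windows-1252 and then lowercased. The sketch below reproduces that chain in Python.

```python
word = "daß"

# 'ß' is stored as two bytes in UTF-8
utf8_bytes = word.encode("utf-8")

# Reading those bytes as Windows-1252 turns each byte into its own character
misread = utf8_bytes.decode("cp1252")
print(misread)

# Lowercasing the misread text gives the form that was displayed
print(misread.lower())

# Reversing the wrong decoding step recovers the original word
repaired = misread.encode("cp1252").decode("utf-8")
print(repaired)
```

If this is what happened, specifying the correct encoding before the text is processed would avoid the problem.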

One can see in the cirrus, or the word cloud, that some predictable content words were frequent in the text, such as ‘gott’ (God) and ‘sohn’ (son). It is interesting to note that the word cloud does not show these with a capital G and S. As all German nouns start with capital letters, this is an important feature of the language that is left out of the word cloud.
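Both effects, stop words and the loss of capital letters, can be illustrated with a short frequency count in Python. This is only a sketch of the general technique and makes no claim about how Voyant Tools is implemented; the sample sentence, the stop-word list and the function name are invented.

```python
import re
from collections import Counter


def word_frequencies(text, stop_words=frozenset(), lowercase=True):
    """Count word tokens, skipping stop words and optionally lowercasing."""
    tokens = re.findall(r"\w+", text)
    if lowercase:
        tokens = [token.lower() for token in tokens]
    # Stop words are compared in lowercase either way
    return Counter(token for token in tokens if token.lower() not in stop_words)


sample = "Gott sprach und der Sohn sprach und Gott sah"
stop_words = {"und", "der"}

folded = word_frequencies(sample, stop_words)
preserved = word_frequencies(sample, stop_words, lowercase=False)
print(folded.most_common(3))
print(preserved.most_common(3))
```

Lowercasing makes counting simpler, because ‘Gott’ and ‘gott’ are treated as one word, but it also removes the capital letter that marks a German noun.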


WordNet and wordnets

Princeton’s WordNet is a lexical database showing semantic relationships between words in the English language. It focuses on nouns, verbs, adjectives and adverbs, as words within these word classes are all content words, meaning that they have meaning by themselves (as opposed to function words). Princeton’s WordNet takes these content words and groups them into ‘synsets’, which are groups of cognitive synonyms, or words with the same meaning or sense (Sources: PARTS OF SPEECH, WordNet | A Lexical Database for English).

Wordnets have emerged in other languages based on this concept, including in my languages of study – Irish, Spanish and German. Wordnets for each of these languages can be found by following these links:

  • EuroWordNet database: a multilingual database providing wordnets for several European languages, including Spanish and German. Free samples from each language can be downloaded here.
  • Líonra Séimeantach na Gaeilge (LSG), or the Irish Language Semantic Network: an Irish-language wordnet, providing a comprehensive database of Irish words and the semantic links between them. The PDF version can be downloaded here.

The PDF version of the LSG displays the wordnet in alphabetical order. As in the Princeton WordNet, content words are presented in synsets, showing relationships between words. The word ‘comhchiall’ denotes synonymous words, ‘aicmí’ denotes the class to which the word belongs and ‘fo-aicmí’ the subclasses stemming from the word. ‘Gaolta’ shows a related word that is not synonymous. In this screenshot below from the PDF, for example, one can see that the word ‘teangeolaíocht’ (linguistics) is shown to be in the class of ‘eolaíocht’ (science) with one subset being ‘pragmataic’ (pragmatics). It is shown to be related to, but not synonymous with, ‘gramadach’ (grammar).

This shows that synsets on the LSG, like those in Princeton’s WordNet, have a hierarchical element. Using the relations expressed through hypernyms and hyponyms, the LSG shows where each word lies within the hierarchy of similar words in the synset. In the example above, ‘eolaíocht’ is a hypernym for ‘teangeolaíocht’, and ‘pragmataic’ a hyponym for ‘teangeolaíocht’.  Antonyms are not shown within the LSG PDF file, unlike in Princeton’s WordNet. 
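The entry discussed above can be modelled as a small data structure in Python. This is only an illustrative sketch: the class and field names are invented, and they do not reflect how the LSG itself is stored.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entry:
    """One wordnet entry, using the relation labels from the LSG PDF."""
    word: str
    synonyms: List[str] = field(default_factory=list)    # 'comhchiall'
    classes: List[str] = field(default_factory=list)     # 'aicmí' (hypernyms)
    subclasses: List[str] = field(default_factory=list)  # 'fo-aicmí' (hyponyms)
    related: List[str] = field(default_factory=list)     # 'gaolta'


# The example entry described above
linguistics = Entry(
    word="teangeolaíocht",
    classes=["eolaíocht"],
    subclasses=["pragmataic"],
    related=["gramadach"],
)

print("hypernyms:", linguistics.classes)
print("hyponyms:", linguistics.subclasses)
print("related:", linguistics.related)
```

Following the classes upwards and the subclasses downwards from entry to entry is what gives the wordnet its hierarchy.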

The entries are linked to the synsets available in the Princeton WordNet, which its creator, Kevin Scannell, states is helpful for his work on English-Irish machine translation. The entries are not mapped directly, however, partly due to the distinctions within Irish that do not exist within English (such as the difference between ‘rua’ and ‘dearg’) (Sources: LSG: Home, LSG: Details).


Text databases relating to my studied languages

In this post, I have compiled a list of some of the text databases available online that relate to the three languages that I study as part of the BA World Languages (Irish, Spanish and German). The list includes links to databases dedicated to various topics, from general literature of the language to specific authors, and could be of help to students studying these languages and/or studying the cultures of the countries in which these languages are spoken.

Databases relating to Irish as a subject:

Databases relating to Hispanic philology:

Databases relating to German studies or German as a foreign language:

 
