Less Sexism in Finnish

Machine learning models are only as good as the data used to train them. Sexist AI and racist machines made the news in 2017.

Yesterday, Slava Akhmechet tweeted about a test he ran on a word2vec language model. According to the tweet, the model was trained on the Google News dataset, which contains one billion words from English news articles.

Word2vec can be used to find relationships between word meanings, including analogies. For example: (Man | King) would be analogous to (Woman | ________) … Can you guess? Yes, “Queen”.

Image from the TensorFlow tutorial
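
For the curious, the analogy query itself is a one-liner with the gensim library. Here is a minimal sketch, assuming the publicly available GoogleNews-vectors-negative300.bin file; this is not necessarily the exact setup behind the tweet:

    # Minimal sketch of the king - man + woman analogy with gensim.
    # Assumes the pre-trained GoogleNews-vectors-negative300.bin file is
    # available locally; not necessarily the exact setup behind the tweet.
    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True
    )

    # "man" is to "king" as "woman" is to ...?
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
    # -> [('queen', ...)]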

However, Slava’s tweet shows that, according to the news text, while “he” is “persuasive”, “she” is “seductive”, and so on.

I’ve run some word2vec tests in Finnish. I thought it would be interesting to see whether a similar bias exists in Finnish as well. You know, Finland being one of the most gender-equal countries, where we speak a language that has only gender-neutral pronouns: in Finnish, both “he” and “she” are the same word, “hän”.

Results

Finnish is more equal than English.

For example, in Finnish, “man” and “woman” are about equally similar to “gynecologist” and “general practitioner”. (The top-10 most similar words for “woman” included “nurse”, which was missing for “man”.)
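
A top-10 neighbour check of that kind could look like the sketch below. The Finnish surface forms (“mies” for man, “nainen” for woman) and the model file name are illustrative assumptions on my part, not necessarily the exact words and files behind these results:

    # Sketch of a top-10 nearest-neighbour check: does "nurse" show up
    # near "nainen" (woman) but not near "mies" (man)?
    # "finnish_word2vec.bin" is a hypothetical placeholder file name.
    from gensim.models import KeyedVectors

    fi = KeyedVectors.load_word2vec_format("finnish_word2vec.bin", binary=True)

    for word in ["mies", "nainen"]:  # man, woman
        print(word, [w for w, _ in fi.most_similar(word, topn=10)])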

(Man+persuasive) is mostly equal to (Woman+persuasive): both are “credible”, “dashing”, “impassioned”.

Mostly equal in traffic as well. (Man+biker) has similarity score 0.54; (Woman+biker) = 0.51.

Personal qualities:
 (Man+credible)    = 0.24 vs (Woman+credible)  = 0.25
 (Man+dependable)  = 0.21 vs (Woman+dependable)= 0.22

It’s not all that rosy, though.

 (Man+manager)     = 0.24 vs (Woman+manager)   = 0.17
 (Man+sensitive)   = 0.23 vs (Woman+sensitive) = 0.30
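
These pairwise numbers read like plain cosine similarities between word pairs, which gensim exposes directly. A sketch under that assumption, with illustrative Finnish word forms (“johtaja” for manager, “herkkä” for sensitive) that may differ from the words actually used:

    # Sketch of the pairwise scores: cosine similarity between two words.
    # Word forms and the model file name are illustrative assumptions.
    from gensim.models import KeyedVectors

    fi = KeyedVectors.load_word2vec_format("finnish_word2vec.bin", binary=True)

    for trait in ["johtaja", "herkkä"]:    # manager, sensitive
        for person in ["mies", "nainen"]:  # man, woman
            print(f"({person} + {trait}) = {fi.similarity(person, trait):.2f}")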

The training data for this word2vec model is the Finnish internet, i.e. articles, news, discussions, and online forums in Finnish (model by the Turku BioNLP Group).

