Gender Recognition on Dutch Tweets

Finally, as the use of capitalization and diacritics is quite haphazard in the tweets, the tokenizer strips all words of diacritics and transforms them to lower case. As for systems, we will involve all five systems in the discussion. The creators themselves used it for various classification tasks, including gender recognition (Koppel et al.
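The normalization step described above can be illustrated with a minimal sketch. It assumes that stripping diacritics means removing Unicode combining marks after canonical decomposition; the function name normalize_token is hypothetical and is not taken from the tokenizer itself.

```python
import unicodedata


def normalize_token(token: str) -> str:
    """Strip diacritics from a token and lower-case it.

    Assumption: diacritics are the Unicode combining marks that appear
    after NFD decomposition; the actual tokenizer may work differently.
    """
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", token)
    # Keep only the characters that are not combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


if __name__ == "__main__":
    for word in ["Één", "CAFÉ", "geïnteresseerd"]:
        print(word, "->", normalize_token(word))
```

With this kind of normalization, differently capitalized or accented spellings of the same word collapse onto a single feature.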

This may support our hypothesis that all feature types are doing more or less the same. Accuracy Percentages for various Feature Types and Techniques. But it might also mean that the gender just influences all feature types to a similar degree. However, his Twitter network contains mostly female friends.

For the other feature types, we see some variation, but most scores are found near the top of the lists. This type of character n-gram has the clear advantage of not needing any preprocessing in the form of tokenization.

In fact, for all the token n-grams, it would seem that the further one goes away from the unigrams, the worse the accuracy gets. The ones used more by women are plotted in green, those used more by men in red.

The first set is derived from the tokenizer output, and can be viewed as a kind of normalized character n-grams. Where Cohen assumes the two distributions have the same standard deviation, we use the sum of the two, practically always different, standard deviations.
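The departure from Cohen's measure mentioned above can be made concrete with a short sketch: the difference between the two class means is divided by the sum of the two standard deviations rather than by a single shared one. The function name class_separation, the use of the absolute difference, and the choice of the sample standard deviation are assumptions made for illustration only.

```python
from statistics import mean, stdev


def class_separation(scores_a, scores_b):
    """Difference of the class means divided by the sum of the two
    standard deviations.

    Assumptions: the absolute difference is taken, and the sample
    standard deviation is used; each list needs at least two values.
    """
    spread = stdev(scores_a) + stdev(scores_b)
    if spread == 0:
        raise ValueError("both score lists have zero spread")
    return abs(mean(scores_a) - mean(scores_b)) / spread


if __name__ == "__main__":
    # Hypothetical scores for two classes, not taken from the paper.
    print(class_separation([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

A larger value indicates that the two score distributions overlap less, relative to their combined spread.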

However, we used two types of character n-grams. The control shell then weighted each score by multiplying it by the class separation value on the development data for the settings in question, and derived the final score by averaging. Another interesting group of authors is formed by the misclassified ones. The age is reconfirmed by the endearingly high presence of mama and papa.

Apparently, in our sample, politics is a male thing. In scores, too, we see far more variation.

However, as with any collection that is harvested automatically, its usability is reduced by a lack of reliable metadata. For both models the control shell calculated a final score, starting with the three outputs for the best hyperparameter settings.
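The combination performed by the control shell, where each output is multiplied by the class separation value for its settings and the results are averaged, might look roughly as follows. This is a sketch under assumptions: the function name combine_scores and the example values are hypothetical, and the averaging is taken to be a plain mean of the weighted scores.

```python
def combine_scores(scores, separations):
    """Weight each score by its class separation value and average.

    Assumption: the final score is the plain mean of the products,
    not a mean normalized by the sum of the weights.
    """
    if not scores or len(scores) != len(separations):
        raise ValueError("need equally long, non-empty lists")
    weighted = [score * weight for score, weight in zip(scores, separations)]
    return sum(weighted) / len(weighted)


if __name__ == "__main__":
    # Three hypothetical outputs for the best hyperparameter settings,
    # with hypothetical class separation values from development data.
    print(combine_scores([1.0, -0.5, 2.0], [2.0, 1.0, 0.5]))
```

Outputs from settings that separated the classes well on the development data thus contribute more to the final score.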

All users, obviously, should be individuals, and for each the gender should be clear. Again, we decided to explore more than one option, but here we preferred more focus and restricted ourselves to three systems. The tokenizer counts on clear markers for these, e.

We also varied the recognition features provided to the techniques, using both character and token n-grams. Again, we take the token unigrams as a starting point.
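As an illustration of the two feature families, the following sketch extracts token n-grams from a tokenized tweet and character n-grams from a raw string. The function names and the example sentence are hypothetical; the actual feature extraction may treat boundaries and padding differently.

```python
def token_ngrams(tokens, n):
    """Return all contiguous sequences of n tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return all contiguous substrings of n characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    tokens = ["ik", "ben", "thuis"]
    print(token_ngrams(tokens, 1))  # unigrams
    print(token_ngrams(tokens, 2))  # bigrams
    print(char_ngrams("ik ben thuis", 3))  # character trigrams on raw text
```

Character n-grams taken directly from the raw text need no tokenization, whereas token n-grams depend on the tokenizer output.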

We then measured for which percentage of the authors in the corpus this score was in agreement with the actual gender. They used lexical features, and present a very good breakdown of various word types. This is in accordance with the hypothesis just suggested for the token n-grams, as normalization too brings the character n-grams closer to token unigrams. These statistics are derived from the users' profile information by way of some heuristics.
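The accuracy measurement mentioned at the start of this paragraph, the percentage of authors whose score agrees with their actual gender, can be sketched as below. The sign convention (a score above the threshold counts as female, otherwise male), the labels "F" and "M", and the function name are assumptions for illustration.

```python
def accuracy_percentage(scores, genders, threshold=0.0):
    """Percentage of authors whose score agrees with the actual gender.

    Assumption: a score above the threshold is read as female ("F"),
    any other score as male ("M").
    """
    if not scores or len(scores) != len(genders):
        raise ValueError("need equally long, non-empty lists")
    correct = sum(
        1
        for score, gender in zip(scores, genders)
        if ("F" if score > threshold else "M") == gender
    )
    return 100.0 * correct / len(scores)


if __name__ == "__main__":
    # Hypothetical final scores and gold labels for four authors.
    print(accuracy_percentage([0.8, -0.3, 0.1, -0.9], ["F", "M", "M", "M"]))
```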