How Twitter language reveals your gender — or your friends’
Social media is giving linguists new insight into how speech varies.
When a group of young sociolinguists started an annual conference called New Ways of Analyzing Variation four decades ago, they focused on variation of the spoken kind, looking at how speech patterns relate to group identity. But at the 41st gathering of NWAV a week ago, at Indiana University Bloomington, papers on traditional ways of speaking shared the limelight with something the founders couldn’t have predicted: the 21st-century terrain of computer-mediated language.
Twitter, in particular, merited a whole panel, with papers on the medium’s changing slang (are your Twitter followers “tweeps,” “tweeple,” or “tweeties”?) and on the way that the Spanish verb “gustar” (meaning “to like”) gets used in different parts of the Spanish-speaking Twittersphere. A third paper crunched through millions of tweets to detect gender differences in language use, not just in dictionary words but in such electronic shorthand as “xoxo” (for hugs and kisses) and emoticons.
Twitter is a new world for linguists. Like text messages, tweets capture a casual, speech-like discourse in writing. Creating a massive corpus of millions of messages is relatively effortless, simply by taking advantage of the “firehose” of tweets that Twitter’s streaming service makes available, and who influences whom is much more apparent than in daily life. As a result, the new medium is illustrating phenomena that language researchers have never had such easy access to before now.
Twitter’s mountains of language data aren’t just of scholarly interest. Being able to mine tweets for their writers’ gender, it turns out, has commercial appeal for advertisers. While the sociolinguists were gathering in Bloomington, Twitter announced that it was beginning to use “contextual signals” to determine just this, so that an advertiser could promote “a new line of cosmetics without having its message delivered to men not likely to be interested in that content.” The scholars doing research on Twitter and gender, meanwhile, want to analyze those same “contextual signals” for a deeper understanding of gender differences in language and our expectations of how men and women speak and write.
Tyler Schnoebelen, who recently completed his PhD in linguistics at Stanford University, told the NWAV crowd how he and his colleagues plumbed Twitter to create a corpus of more than 9 million tweets from English speakers in the United States. There’s no gender checkbox on Twitter, but by looking at the distribution of given names in census data, they were able to assign genders with a high degree of accuracy. (You’re not likely to find many men going by “Annette” or women named “Eugene.”)
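For readers curious how a name-based assignment like this might work in practice, here is a minimal Python sketch. It is not the researchers’ code: the name table, its counts, the 95 percent cutoff, and the function name infer_gender are all assumptions made up for illustration. The idea is simply to label an account only when its given name is overwhelmingly associated with one gender, and to leave ambiguous or unknown names unlabeled.

```python
from typing import Optional

# Hypothetical (female, male) counts per given name.
# Illustrative numbers only, not real census figures.
NAME_COUNTS = {
    "annette": (9900, 100),
    "eugene": (150, 9850),
    "jordan": (4000, 6000),
}

def infer_gender(first_name: str, threshold: float = 0.95) -> Optional[str]:
    """Return "F" or "M" when the name skews strongly one way, else None."""
    counts = NAME_COUNTS.get(first_name.strip().lower())
    if counts is None:
        return None  # name not in the table
    female, male = counts
    total = female + male
    if total == 0:
        return None
    share_female = female / total
    if share_female >= threshold:
        return "F"
    if 1 - share_female >= threshold:
        return "M"
    return None  # too ambiguous to label
```

Under these made-up counts, “Annette” and “Eugene” get a label while “Jordan” is left out, which trades corpus size for cleaner labels.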
They then looked at which bits of tweeted language skewed male and female. In line with previous research on gender and discourse, women were found to use more pronouns, emotion terms (like “sad,” “love,” and “glad”), and abbreviations associated with online discourse (like “lol” and “omg”). Women also rate highly on the use of emoticons and “backchannel sounds” (like “ah,” “hmmm,” “ugh,” and “grr”).
Men, on the other hand, have higher frequencies of standard dictionary words, numbers, proper nouns (especially the names of sports teams), and taboo words. Simply by looking at these different rates of word usage, Schnoebelen and his colleagues, David Bamman of Carnegie Mellon University and Jacob Eisenstein of Georgia Tech, can predict the gender of an author on Twitter with 88 percent accuracy.
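The article does not say which statistical model the team used, so the following Python sketch swaps in a simple stand-in: a naive Bayes-style score with add-one smoothing and equal weight given to each gender. The toy tweets, the labels, and the function names train, log_odds_female, and predict are hypothetical. It shows only the general idea of predicting an author’s gender from differing rates of word usage.

```python
import math
from collections import Counter

# Toy labeled tweets, invented for illustration.
TOY = [
    ("F", "omg love this lol"),
    ("F", "so glad hmmm love"),
    ("M", "lakers won 3 games"),
    ("M", "yankees lost 2 games"),
]

def train(labeled_tweets):
    """Count how often each word appears in tweets from each gender."""
    counts = {"F": Counter(), "M": Counter()}
    for label, text in labeled_tweets:
        counts[label].update(text.lower().split())
    vocab = set(counts["F"]) | set(counts["M"])
    return counts, vocab

def log_odds_female(text, counts, vocab):
    """Sum per-word log ratios of smoothed female vs. male usage rates."""
    v = len(vocab)
    total_f = sum(counts["F"].values())
    total_m = sum(counts["M"].values())
    score = 0.0
    for word in text.lower().split():
        if word not in vocab:
            continue  # unseen words carry no evidence either way
        p_f = (counts["F"][word] + 1) / (total_f + v)
        p_m = (counts["M"][word] + 1) / (total_m + v)
        score += math.log(p_f / p_m)
    return score

def predict(text, counts, vocab):
    """Label text "F" if the summed evidence leans female, else "M"."""
    return "F" if log_odds_female(text, counts, vocab) > 0 else "M"
```

A classifier trained on four tweets proves nothing, of course; the 88 percent figure reported above comes from the researchers’ corpus of more than 9 million tweets.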
But Schnoebelen, Bamman, and Eisenstein didn’t stop there, even if such a high level of accuracy in pinpointing gender would be good enough for, say, L’Oréal. They wanted to go beyond the standard binary stereotypes of “Men Are from Mars, Women Are from Venus” to understand how “male” and “female” linguistic markers actually work in the world, at least online.
They found that even though you can categorize certain words as having a higher male or female probability, it’s easy to find large swaths of Twitter users who go against these trends. By grouping people by their style of usage, they could find, for example, a cluster of authors that is 72 percent male but nonetheless favors the nonstandard spellings that are supposedly a hallmark of “female” language.
Digging deeper, the researchers looked at the social networks that people create on Twitter, making connections by “following” and replying to other users. When you take these networks into account, the gender picture gets even more complex. It turns out that the statistical outliers (men who use language that’s associated with women, and vice versa) are more likely to have networks skewing to the other gender. A man who favors emoticons is more likely to have a high proportion of women in his network. And a woman who frequently mentions the names of sports teams likely has a lot of male friends. The takeaway from Schnoebelen’s presentation is that a simple binary model of gender isn’t sufficient for understanding the welter of language styles in the Twittersphere or, by implication, in everyday life.
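The network measure described above, the proportion of women among a user’s connections, can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the study’s method: the edge list, the gender labels, and the function name female_share are hypothetical, and any follow or reply in either direction is treated as a connection.

```python
def female_share(user, edges, gender):
    """Fraction of a user's labeled connections who are labeled "F".

    edges: iterable of (account_a, account_b) pairs, direction ignored.
    gender: dict mapping account name to "F" or "M".
    Returns None when no connection has a known label.
    """
    neighbors = set()
    for a, b in edges:
        if a == user and b != user:
            neighbors.add(b)
        elif b == user and a != user:
            neighbors.add(a)
    known = [gender[n] for n in neighbors if n in gender]
    if not known:
        return None
    return known.count("F") / len(known)
```

Comparing this share with a user’s own language style is one way to check whether the outliers really do have networks that lean toward the other gender.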
While this research has broad implications for thinking about the language styles used by men and women, it also tells us something about the peculiar mode of discourse that is tweeting. Unlike more settled genres of interaction, Twitter has yet to establish well-defined norms of usage. It’s the Wild West of language, which makes it both exciting and daunting for linguistic scholars. Lying somewhere in the gray zone between speech and writing, Twitter-ese can shine a light on how we make up the rules of language use as we go along.