Measuring Linguistic and Cultural Evolution Using Books and Tweets
2019
- 180Usage
Metric Options: CountsSelecting the 1-year or 3-year option will change the metrics count to percentiles, illustrating how an article or review compares to other articles or reviews within the selected time period in the same journal. Selecting the 1-year option compares the metrics against other articles/reviews that were also published in the same calendar year. Selecting the 3-year option compares the metrics against other articles/reviews that were also published in the same calendar year plus the two years prior.
Example: if you select the 1-year option for an article published in 2019 and a metric category shows 90%, that means that the article or review is performing better than 90% of the other articles/reviews published in that journal in 2019. If you select the 3-year option for the same article published in 2019 and the metric category shows 90%, that means that the article or review is performing better than 90% of the other articles/reviews published in that journal in 2019, 2018 and 2017.
Citation Benchmarking is provided by Scopus and SciVal and is different from the metrics context provided by PlumX Metrics.
Example: if you select the 1-year option for an article published in 2019 and a metric category shows 90%, that means that the article or review is performing better than 90% of the other articles/reviews published in that journal in 2019. If you select the 3-year option for the same article published in 2019 and the metric category shows 90%, that means that the article or review is performing better than 90% of the other articles/reviews published in that journal in 2019, 2018 and 2017.
Citation Benchmarking is provided by Scopus and SciVal and is different from the metrics context provided by PlumX Metrics.
Metrics Details
- Usage180
- Downloads108
- Abstract Views72
Thesis / Dissertation Description
Written language provides a snapshot of linguistic, cultural, and current events information for a given time period. Aggregating these snapshots by studying many texts over time reveals trends in the evolution of language, culture, and society. The ever-increasing amount of electronic text, both from the digitization of books and other paper documents to the increasing frequency with which electronic text is used as a means of communication, has given us an unprecedented opportunity to study these trends. In this dissertation, we use hundreds of thousands of books spanning two centuries scanned by Google, and over 100 billion messages, or ‘tweets’, posted to the social media platform, Twitter, over the course of a decade to study the English language, as well as study the evolution of culture and society as inferred from the changes in language.We begin by studying the current state of verb regularization and how this compares between the more formal writing of books and the more colloquial writing of tweets on Twitter. We find that the extent of verb regularization is greater on Twitter, taken as a whole, than in English Fiction books, and also for tweets geotagged in the United States relative to American English books, but the opposite is true for tweets geotagged in the United Kingdom relative to British English books. We also find interesting regional variations in regularization across counties in the United States. However, once differences in population are accounted for, we do not identify strong correlations with socio-demographic variables.Next, we study stretchable words, a fundamental aspect of spoken language that, until the advent of social media, was rarely observed within written language. We examine the frequency distributions of stretchable words and introduce two central parameters that capture their main characteristics of balance and stretch. We explore their dynamics by creating visual tools we call ‘balance plots’ and ‘spelling trees’. We also discuss how the tools and methods we develop could be used to study mistypings and misspellings, and may have further applications both within and beyond language.Finally, we take a closer look at the English Fiction n-gram dataset created by Google. We begin by explaining why using token counts as a proxy of word, or more generally, ‘n-gram’, importance is fundamentally flawed. We then devise a method to rebuild the Google Books corpus so that meaningful linguistic and cultural trends may be reliably discerned. We use book counts as the primary ranking for an n-gram and use subsampling to normalize across time to mitigate the extraneous results created by the underlying exponential increase in data volume over time. We also combine the subsampled data over a number of years as a method of smoothing. We then use these improved methods to study linguistic and cultural evolution across the last two centuries. We examine the dynamics of Zipf distributions for n-grams by measuring the churn of language reflected in the flux of n-grams across rank boundaries. Finally, we examine linguistic change using wordshift plots and a rank divergence measure with a tunable parameter to compare the language of two different time periods. Our results address several methodological shortcomings associated with the raw Google Books data, strengthening the potential for cultural inference by word changes.
Bibliographic Details
Provide Feedback
Have ideas for a new metric? Would you like to see something else here?Let us know