Domain of One’s Own: A Corpus Study, Part 4 – Topic Model
This is the fourth of four posts taking a deep data dive into a collection of over 300 articles written about Domain of One’s Own. Part 1 and Part 2 explored who has been writing about DoOO, the most common words and phrases used by those authors, and how the use of some terms changes over time. Part 3 briefly explored a sentiment analysis of words and authors in the corpus. In this post, we take a look at emergent topics in the DoOO corpus. All data, visualizations, and code for this project can be found in its repository on GitHub.
Topic modeling is a common text-mining tool for uncovering topics in more nuanced ways than individual words or n-grams allow. A topic model looks for words that frequently occur together in the same documents but are not typical of the entire corpus, and so provides a weighted list of words that tend to express a particular topic. It’s a helpful early-stage analytical tool for uncovering pieces that explore the same topic, even if they use slightly different vocabularies.
Now this corpus is already united by a single topic: Domain of One’s Own. But while a topic model may not uncover radically different emphases in these pieces, it does uncover some subtleties that we might otherwise miss, and which are worth taking a closer look at.
Refining the model
One difficulty with topic modeling is choosing the number of topics to look for, as this is an input, not an output, of the analytical algorithm. Too small a number and the model doesn’t tease out the different topics well. Too large, and the distinctions are too fine to be useful. For example, here’s a 6-topic model of the DoOO corpus. (Each chart includes the 10 words most strongly associated with a topic, showing their weights, or “beta” values. For more details on the math behind topic modeling, see Mark Steyvers’s “Probabilistic Topic Models” or David Blei’s “Introduction to Probabilistic Topic Models.”)
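As a rough illustration of how a "top 10 words per topic" chart is assembled, here is a minimal Python sketch. It is not the code from the project's repository: the `(topic, term, beta)` rows, the `top_terms` helper, and every number in it are invented placeholders, and it assumes the model's output has already been flattened into one row per topic-term pair.

```python
from collections import defaultdict

# Hypothetical (topic, term, beta) rows. The values are invented for
# illustration and do not come from the DoOO corpus.
rows = [
    (1, "domain", 0.040), (1, "students", 0.031), (1, "web", 0.025),
    (2, "wordpress", 0.052), (2, "domain", 0.030), (2, "plugin", 0.021),
]

def top_terms(rows, n=10):
    """Return the n highest-beta (term, beta) pairs for each topic."""
    by_topic = defaultdict(list)
    for topic, term, beta in rows:
        by_topic[topic].append((term, beta))
    return {
        topic: sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]
        for topic, pairs in by_topic.items()
    }

# Keep only the two strongest terms per topic in this tiny example.
top = top_terms(rows, n=2)
for topic, pairs in sorted(top.items()):
    print(topic, pairs)
```

Each topic's list is what one bar chart would draw, with beta as the bar length.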
These charts show the ten most distinctive words of each topic found by the algorithmic analysis, with each word’s weight for that topic shown by the length of its bar. Many of the same words appear in several topics, which suggests that the model is not distinguishing topics very well at this level. I tried a few other topic numbers and found, at least for the purposes of this study, that 16 topics seem to provide helpful distinctions.
One obvious thing that jumps out here is Topic 3, which is clearly the result of the web scraper failing to strip the HTML tags from the post(s). There are topics tied to specific course experiences (like Topics 10 and 16). There are topics tied specifically to sysadmin work (like Topics 12 and 14). There are WordPress-/blogging-specific topics (like Topics 4 and 15). And there are general ed-tech-themed topics (like Topics 6 and 7). Here is the number of posts per topic.
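One way to put a number on "many of the same words in each topic" is to measure how much the topics' top-word lists overlap. The sketch below uses average pairwise Jaccard similarity. This is an assumption on my part about a reasonable check, not the procedure used to settle on 16 topics, and the word lists are made up.

```python
from itertools import combinations

def jaccard(a, b):
    """Share of words two lists have in common (0 = disjoint, 1 = identical)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mean_pairwise_overlap(top_words):
    """Average Jaccard similarity over every pair of topics."""
    pairs = list(combinations(top_words.values(), 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical top-word lists for a three-topic model.
top_words = {
    1: ["domain", "students", "web"],
    2: ["domain", "students", "wordpress"],
    3: ["server", "install", "cpanel"],
}

overlap = mean_pairwise_overlap(top_words)
print(round(overlap, 3))
```

Running this for several candidate topic counts and comparing the scores gives a rough sense of where the topics start to separate, though reading the word lists yourself is still the real test.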
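Counting posts per topic requires deciding which topic each post belongs to. A simple convention, assumed here, is to bin each post under the topic for which it scores highest. In this sketch the `doc_topics` weights and post names are hypothetical placeholders.

```python
from collections import Counter

# Hypothetical document-topic weights: for each post, how strongly the model
# associates it with each topic. Each post's weights sum to 1.
doc_topics = {
    "post-a": {1: 0.70, 2: 0.20, 3: 0.10},
    "post-b": {1: 0.10, 2: 0.60, 3: 0.30},
    "post-c": {1: 0.55, 2: 0.05, 3: 0.40},
}

def dominant_topic(weights):
    """Return the topic with the largest weight for one post."""
    return max(weights, key=weights.get)

# Tally how many posts land in each topic.
posts_per_topic = Counter(dominant_topic(w) for w in doc_topics.values())
print(posts_per_topic.most_common())
```

Binning by the single highest-scoring topic throws away the rest of each post's weights, so a post split almost evenly between two topics still counts for only one of them.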
The issue with topic modeling in this corpus is that because the corpus is already centered around a single general topic, when we increase the number of topics to parse different (sub-)topics finely enough, we get a lot of overlap. That’s why many of the actual topics listed above correspond to multiple topics in the model. That said, we can still use these results to tease out articles related to these different topics, and to each other. (And to find the posts where the scraper failed to strip the HTML code!) And we can use the topic model to find which authors tend to write about which topics, and which authors each topic tends to attract.
For instance, this visualization helps us see pretty quickly which topics are most popular, which authors are most prolific, and which topics and authors pair together. It doesn’t answer any major questions, but it can help guide us towards interesting patterns worth chasing down. For example, one of Jim Groom’s most prominent topics is Topic 5, which is associated with APIs and student data more than, say, WordPress, DS106, or digital pedagogy. This also happens to be a pretty rare topic overall, though, with posts by Marguerite McNeal and Chad Jardine being the only others to score highest for this topic. (Those posts are both centered on giving students control over their own data.) Why is this such a popular topic for Jim but not overall? And why are Marguerite’s and Chad’s posts the only other ones to get binned in this topic?
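The author-by-topic pairing behind a visualization like this can be sketched as a simple cross-tabulation. The version below is only an illustration under stated assumptions: the author labels and weights are invented, each post is binned by its highest-scoring topic, and the `author_topic_counts` and `authors_by_topic` helpers are hypothetical names.

```python
from collections import Counter, defaultdict

# Hypothetical posts as (author, document-topic weights). Names and numbers
# are placeholders, not data from the DoOO corpus.
posts = [
    ("Author A", {4: 0.20, 5: 0.70, 10: 0.10}),
    ("Author A", {4: 0.10, 5: 0.30, 10: 0.60}),
    ("Author B", {4: 0.10, 5: 0.80, 10: 0.10}),
    ("Author C", {4: 0.90, 5: 0.05, 10: 0.05}),
]

def author_topic_counts(posts):
    """Count posts per (author, dominant topic) pair."""
    return Counter(
        (author, max(weights, key=weights.get)) for author, weights in posts
    )

def authors_by_topic(counts):
    """For each topic, collect the authors with at least one post binned there."""
    result = defaultdict(set)
    for author, topic in counts:
        result[topic].add(author)
    return dict(result)

counts = author_topic_counts(posts)
by_topic = authors_by_topic(counts)
print(sorted(by_topic[5]))
```

A topic whose author set is very small, like Topic 10 in this toy data, is exactly the kind of pattern worth opening the underlying posts to investigate.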
My attention is also drawn to Topic 10, a very small topic with articles only by Jim Groom and Jeremy Dean. Since this looked initially like a DS106-heavy topic (and thus one I would assume to be limited to UMW authors), Jeremy’s inclusion here stands out. I’m not sure exactly why, but it’s grabbed my attention, and I want to take a closer look. (That’s exactly what a topic model is good for: directing attention to things worth a closer look.)
It turns out that while Topic 10’s top words are very DS106-centric, when we look further down the list, we can see that it’s really more of a pedagogical topic, one that focuses on the cultivation of community, often in a class. And that’s largely what Jeremy’s post is about: how to use WordPress and hypothes.is on DoOO to help cultivate a community of collaboration among students. (It’s also worth noting that because Jeremy’s post has a fair bit of how-to about it, the topic to which this model assigns it the second greatest weight is Topic 4, another rare one, and the one dominated by Tim Owens’s more technical explanations of DoOO.)
These are just some of the cool things we can do with the data from this corpus of articles about Domain of One’s Own. Thanks to Lee, who assembled the list! And, of course, to everyone who wrote these insightful pieces! Feel free to visit the GitHub repository and play around with it yourself. Maybe you’ll find something cool that I didn’t. :) And look out for Lee’s upcoming pieces in which she follows up on my quantitative “distant reading” with a few qualitative, “close” readings of her own.
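Looking past a post's top topic to its runner-up, as in the parenthetical above, just means ranking that post's topic weights. A minimal sketch, with invented weights standing in for a single post:

```python
def ranked_topics(weights):
    """Sort one post's (topic, weight) pairs from strongest to weakest."""
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)

# Hypothetical weights for one post; the numbers are placeholders.
post_weights = {4: 0.30, 10: 0.45, 12: 0.15, 15: 0.10}

ranking = ranked_topics(post_weights)
primary, runner_up = ranking[0][0], ranking[1][0]
print(primary, runner_up)
```

The same ranking applied to a topic's word weights is how you "look further down the list" beyond its top ten words.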
Featured image by Bernard Spragg.