Domain of One’s Own: A Corpus Study, Part 2 – Bigrams and Changing Trends

This is the second of four posts taking a deep data dive into a collection of over 300 articles written about Domain of One’s Own. Part 1 explored who has been writing about DoOO and the most common words used by those authors. In this post, we look at two-word phrases, institutional references, and how the use of some terms changes over time. Parts 3 and 4 will be published next week and will explore sentiment and emergent topics in the DoOO corpus. All data, visualizations, and code for this project can be found in its repository on GitHub.

Single words can tell us a lot, but n-grams (phrases of n words) can help us tease out more details. Just what are we “syndicating”? What kind of “bottlenecks” or “standards” are people writing about? And is every “Jim” and “Tim” in the corpus really Jim Groom and Tim Owens?
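For readers who have not worked with n-grams before, here is a minimal Python sketch of what counting bigrams involves. It is an illustration only, not the code from the project's repository: the two sample sentences are invented, and the tokenizer (lowercase everything, keep letters, digits, and apostrophes) is an assumption.

```python
import re
from collections import Counter

def bigrams(text):
    """Return the adjacent word pairs in a text, lowercased."""
    # Assumed tokenizer: runs of letters, digits, and straight or curly apostrophes.
    words = re.findall(r"[a-z0-9'’]+", text.lower())
    return list(zip(words, words[1:]))

# Hypothetical mini-corpus, invented for illustration only.
docs = [
    "Jim Groom wrote about Reclaim Hosting.",
    "Tim Owens and Jim Groom founded Reclaim Hosting.",
]

# Count every bigram across all documents.
counts = Counter(pair for doc in docs for pair in bigrams(doc))

for pair, n in counts.most_common(2):
    print(" ".join(pair), n)
```

In this toy corpus, "jim groom" and "reclaim hosting" each appear twice and every other pair appears once. A real analysis would usually also decide how to handle very common words such as "of" and "the".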

Here are the top bigrams in the corpus.

DoOO stats visualization: most common bigrams

There’s understandably a heavy UMW focus at the top of the list, given that UMW authors occupy three of the top four spots on the most prolific authors list. So let’s split UMW authors (again, this will be mostly Jim Groom’s posts) and non-UMW authors. Here are UMW authors:

DoOO stats visualization: most common bigrams from UMW authors

And here are non-UMW authors:

DoOO stats visualization: most common bigrams from non-UMW authors

As before, we can see differences more clearly and starkly using a two-dimensional visualization. (Click on the image to explore the high-resolution visualization.)

DoOO stats visualization: comparing bigram frequency from UMW and non-UMW authors

In this plot, we can see that both UMW and non-UMW authors refer to key things like Reclaim Hosting, digital identity, and online presence with roughly equal frequency. One key difference involves the names of people found in each corpus. UMW authors are much more likely to refer to important figures in the history of DoOO and DS106 like Martha Burtis, Tim Owens, Ryan Brazell, and Alan Levine. Non-UMW authors are more likely to write about Jim Groom (who wrote most of the UMW corpus), and to refer to institutions like UMW, Emory, OU, and Davidson.
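A two-dimensional comparison like this one rests on each phrase's relative frequency in each group: its count divided by the total for that group. The sketch below shows that calculation in Python. The counts are made up for illustration and do not come from the DoOO corpus, and the project's own code may normalize differently.

```python
from collections import Counter

def relative_freq(counts):
    """Convert raw counts into each term's share of the group total."""
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

# Hypothetical bigram counts, invented for illustration only.
umw = Counter({"reclaim hosting": 4, "martha burtis": 3, "digital identity": 1})
other = Counter({"reclaim hosting": 8, "jim groom": 6, "digital identity": 2})

umw_rf = relative_freq(umw)
other_rf = relative_freq(other)

# Each term becomes a point (x, y): x is its share in one group, y in the other.
for term in sorted(set(umw) | set(other)):
    x = umw_rf.get(term, 0.0)
    y = other_rf.get(term, 0.0)
    print(f"{term:18s} UMW={x:.3f} non-UMW={y:.3f}")
```

Terms with similar shares in both groups land near the diagonal of such a plot, while terms that one group favors land near an axis.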

It’s also interesting to note, again, how much more the non-UMW corpus focuses on bureaucratic elements like learning analytics, and general topics like learning environment, learning management, digital learning, digital humanities, and educational technology (though the short form ed tech is more prominent in the UMW corpus). I think this is most likely because so many of the UMW corpus posts describe the specifics of building DoOO: both the technical infrastructure and the people doing the work. But teasing that out would require a deeper look at specific articles.

Another key difference is in the relative frequency of personal and institutional names. UMW Blogs is more prominent in the UMW corpus, while Davidson Domains, Emory University, and OU Create are more prominent in writings from those outside UMW. However, I suspect that this has less to do with who’s writing and more to do with when they are writing. Most of the UMW corpus was written by Jim Groom during, and even before, the pilot of DoOO at UMW, before the DoOO incubator session in Atlanta, before OU received funding for a major rollout of DoOO on their campus, etc.

DoOO stats visualization: dates when UMW and non-UMW content was written

As shown in this figure (based on word counts over time), the non-UMW corpus begins in late 2012, right about at the first quartile for the UMW corpus (meaning 25% of the UMW content had already been published). The first quartile and median of the non-UMW corpus match up with the median and third quartile (75th percentile) of the UMW corpus respectively. So it makes sense that things more prominent in the UMW corpus may simply be published earlier, and vice versa. Is that the case for these terms?
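Because these quartiles are based on word counts rather than on the number of posts, a long post moves the quartile dates more than a short one. Here is one way that could be computed, as a rough Python sketch. The dates and word counts are hypothetical, and the function name `weighted_quantile` is my own, not something from the project's code.

```python
from datetime import date

def weighted_quantile(posts, q):
    """Return the earliest date by which at least a fraction q of all
    words had been published. posts is a list of (date, word_count)."""
    if not posts:
        raise ValueError("posts must not be empty")
    ordered = sorted(posts)
    total = sum(n for _, n in ordered)
    running = 0
    for day, n in ordered:
        running += n
        if running >= q * total:
            return day
    return ordered[-1][0]

# Hypothetical posts, invented for illustration only.
posts = [
    (date(2012, 1, 15), 500),
    (date(2012, 9, 1), 1500),
    (date(2013, 6, 1), 1000),
    (date(2014, 3, 1), 1000),
]

q1 = weighted_quantile(posts, 0.25)      # 2012-09-01
median = weighted_quantile(posts, 0.50)  # 2012-09-01
q3 = weighted_quantile(posts, 0.75)      # 2013-06-01
print(q1, median, q3)
```

In this toy example the long second post pulls both the first quartile and the median onto the same date, which a simple count of posts would not show.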

Let’s look at institution names. Here is the month-by-month relative frequency of occurrence for various institutional names in the whole corpus.

DoOO stats visualization: mentions of institutions over time

And here is the same just for UMW authors.

DoOO stats visualization: occurrence of institution names over time, UMW authors

And for non-UMW authors.

DoOO stats visualization: occurrence of institution names, non-UMW authors

While UMW is discussed more by UMW authors, and Oklahoma in particular is discussed less by UMW authors, the peaks for most institutional words tend to line up. And the highest peak for mentions of Emory in this corpus comes from the UMW corpus before the earliest non-UMW article. So it seems the difference in frequency of different institutional names between UMW and non-UMW authors can largely be attributed to when the piece was written.
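For anyone curious how a month-by-month relative frequency like the ones in these plots might be calculated, here is a minimal Python sketch. It assumes that "relative frequency" means occurrences of a word divided by all words published that month, which may not match the project's exact definition. The sample posts are invented.

```python
import re
from collections import defaultdict
from datetime import date

def monthly_relative_freq(posts, term):
    """Map (year, month) to the share of that month's words equal to term.
    posts is a list of (date, text); term is a single lowercase word."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for day, text in posts:
        words = re.findall(r"[a-z0-9'’]+", text.lower())
        month = (day.year, day.month)
        totals[month] += len(words)
        hits[month] += words.count(term)
    # Months with no words at all are skipped to avoid dividing by zero.
    return {m: hits[m] / totals[m] for m in sorted(totals) if totals[m]}

# Hypothetical posts, invented for illustration only.
posts = [
    (date(2013, 7, 2), "Reclaim your domain and reclaim your data"),
    (date(2013, 7, 20), "Create something new"),
    (date(2015, 10, 5), "OU Create lets students create"),
]

reclaim = monthly_relative_freq(posts, "reclaim")
create = monthly_relative_freq(posts, "create")
print(reclaim)
print(create)
```

Restricting the `posts` list to one group of authors before calling the function gives the per-group version of the same plot.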

Another interesting change over time has to do with the names of the various platforms institutions use. A number of institutions (like UMW, Emory, and Davidson) tend to call their program Domain of One’s Own or [Institution] Domains. It seems that more recently, there has been a rise in programs (like OU and Middlebury) using create rather than domains in their title. And then there’s the use of the word reclaim, which like create is both the name of a platform (Reclaim Hosting) used by many institutions and a verb describing what students and faculty do with DoOO (reclaim your domain/space/digital identity/etc.). How does this pan out over time?

DoOO stats visualization: usage of create and reclaim over time

Once reclaim enters the corpus in 2013 (the year Reclaim Hosting was founded, and also the year that DoOO rolled out to the whole campus at UMW), it maintains a pretty steady presence. Interestingly, though, create drops at right about the same time as reclaim enters. Is there a conscious reworking of the narrative going on here, with the notion of doing creative work on the web shifting towards reclaiming ownership of our own space on the web? Or is this apparent shift in the whole corpus really just a shift in the writing of its two most prolific authors as they begin work on their startup?

Well, here’s the same plot, limited to posts by Reclaim Hosting founders Jim Groom and Tim Owens.

DoOO stats visualization: reclaim vs create, Jim Groom and Tim Owens

The familiar dip in create as reclaim enters the corpus is clear. Now here is the rest of the corpus.

DoOO stats visualization: create vs reclaim, no Jim Groom or Tim Owens

There is no dip in create when reclaim enters the corpus! So it was just Jim and Tim. (Investigating why that dip happens would be an interesting “close reading” follow-up to this “distant reading” project.) There are some rather obvious spikes in this non-Jim-and-Tim corpus worth noting: a spike in reclaim roughly when the word (and the hosting startup) enters the scene, and a spike in create in late 2015, right about the time that OU Create went campus-wide.

Again, these are just a few of the insights and patterns that we can find from this data. If you find this interesting, be sure to stay tuned for Parts 3 and 4 next week, which will unpack a sentiment analysis and a topic model of these articles. And, of course, you can always download the data and code from GitHub if you want to take a closer look yourself and chase down other patterns in the data.

Featured image by Runs With Scissors (CC BY-NC-ND).