If you’ve gotten to this page, it’s because you’ve been reading about my data analysis of the NYT’s Op-Ed, and are interested in exactly how I came to my conclusions.
First off, I didn’t mention in my BuzzFeed post that I added texts by John Kerry and Tom Vilsack. I wanted to add some non-Trump-oriented noise to the model, both to ensure it was sufficiently robust and to observe the behavior of these obviously incorrect data points.
As I mentioned in the BuzzFeed post, selecting data was a bit tricky. Speech writers likely had a heavy hand in many of these published speeches and essays, and transcripts of spoken-word interviews possibly have different statistical properties than those of written documents. I first attempted to find essays that might be comparable to an Op-Ed; however, with a few exceptions (including Marsha Coats’s essay!) these weren’t abundantly available. I therefore compromised and took a sampling of recent interview transcripts and published speeches, imagining that the variation would hopefully let each official’s underlying linguistic fingerprint shine through.
And more importantly, how do I scan these linguistic fingerprints? In English speech and writing, certain words consistently occur much more than others. For instance, in the Op-Ed, the top five most-often-used words are “the”, “to”, “of”, “and”, and “is.” These are also usually the most frequent words in the 16 suspects’ texts. However, even though most English speakers share the same most-frequent words, different English speakers will subconsciously rely on some more heavily than others. For instance, in the dataset I’m using, Mike Pence uses “the” more frequently than anyone else. “And” is the fourth-most-frequent word used by the Op-Ed writer, yet it is Elaine Chao’s fifth-most-frequent word. These variations represent a person’s linguistic fingerprint.
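For readers who want to see the idea in code, here is a minimal Python sketch of turning a text into word frequencies. It is an illustration under assumptions, not the code behind the analysis: the tokenizing rule (lowercase, letters and apostrophes only), the function names `word_frequencies` and `top_words`, and the sample sentence are all hypothetical.

```python
import re
from collections import Counter

# Assumed tokenizing rule: lowercase the text and keep runs of letters
# and apostrophes (straight or curly). The original analysis may differ.
WORD_PATTERN = re.compile(r"[a-z'’]+")


def word_frequencies(text):
    """Return each word's share of all the words in `text`."""
    counts = Counter(WORD_PATTERN.findall(text.lower()))
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def top_words(text, n):
    """Return the n most frequent words in `text`, most frequent first."""
    counts = Counter(WORD_PATTERN.findall(text.lower()))
    return [word for word, _ in counts.most_common(n)]


# A made-up sample sentence, used only to show the mechanics.
sample = "the cat sat on the mat and the dog sat on the rug"
freqs = word_frequencies(sample)
print(top_words(sample, 1))
print(round(freqs["the"], 3))
```

Each text's dictionary of shares is its "fingerprint"; comparing two texts means comparing these shares word by word.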
I use this fingerprinting in a variety of ways to match the Op-Ed to a potential writer. Most basically, I treat the word frequencies as coordinates, and then measure the distance between these coordinates. As an example, let’s imagine three writers who use the words “the” and “of” with different frequencies. The first uses “the” a lot but doesn’t use the word “of” very much. The second does the opposite, often relying on the word “of” while avoiding the word “the.” The third writer uses both words, but slightly favors “of.” The chart below plots each of these hypothetical writers by their usage of these two words, creating word-frequency “coordinates.” Let’s also imagine that we stumble upon some mystery text that we suspect was written by one of these three authors. We could plot the mystery text’s use of “the” and “of,” and then use those coordinates to measure which of our known authors’ coordinates are closest. As I show in my hypothetical chart, if the mystery text uses a lot of “the” and only a little “of,” then its “of/the” coordinates place it closest to Writer 1. My model uses exactly this process, but incorporates not only “the” and “of” but each of the 10 most frequent words that occur in the Op-Ed. I then rerun the test, increasing the number of words considered in increments of ten. (So, I run 10 versions of the test using the top 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100 words.) In each version, I see which officials’ texts are “closest” to the Op-Ed writer.
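The three-writer example can be sketched in a few lines of Python. This is only an illustration: the frequencies are made up, the helper names are hypothetical, and Euclidean (straight-line) distance is an assumption, since the text above does not name the exact distance measure used.

```python
import math


def distance(freqs_a, freqs_b, words):
    """Straight-line (Euclidean) distance over the chosen words.

    Euclidean distance is an assumption made for this sketch.
    """
    return math.sqrt(
        sum((freqs_a.get(w, 0.0) - freqs_b.get(w, 0.0)) ** 2 for w in words)
    )


def closest_author(mystery, authors, words):
    """Return the name of the author whose coordinates are nearest."""
    return min(authors, key=lambda name: distance(mystery, authors[name], words))


# Made-up frequencies for the three hypothetical writers.
authors = {
    "Writer 1": {"the": 0.08, "of": 0.01},  # lots of "the", little "of"
    "Writer 2": {"the": 0.01, "of": 0.08},  # the opposite
    "Writer 3": {"the": 0.04, "of": 0.05},  # both, slightly favoring "of"
}
mystery = {"the": 0.07, "of": 0.02}

print(closest_author(mystery, authors, ["the", "of"]))  # prints "Writer 1"
```

Passing a longer word list (the top 10, 20, and so on) extends the same calculation from two dimensions to as many dimensions as there are words.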
I then implement two other similar tests. I use a more complex version of the first test, called a cluster analysis: this method measures not only the distance between the Op-Ed and each other text, but also the distances between every pair of texts in the dataset. This slew of distances then groups texts into neighborhoods of similarity, and I observe which texts live in the same neighborhood as the Op-Ed. The three-writer example above shows how these hypothetical writers might group into two clusters. The similarities between Writers 2 and 3 group them into a cluster, with Writer 1 and the Mystery Text clustering together. This approach would therefore also find Writer 1 to be the likely author of the mystery text. (For data nerds: I do both agglomerative and divisive clustering; both output similar results on this data, so I do not distinguish them in this essay.)
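As a rough illustration of the agglomerative ("bottom-up") flavor of clustering, here is a small pure-Python sketch. Several things in it are assumptions rather than details from the analysis: the Euclidean distance, the average-linkage rule for comparing clusters, the made-up coordinates, and the choice to stop at two clusters.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Straight-line distance between two coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def average_linkage(cluster_a, cluster_b, points):
    """Mean distance over every cross-cluster pair of texts (assumed rule)."""
    pairs = [(a, b) for a in cluster_a for b in cluster_b]
    return sum(euclidean(points[a], points[b]) for a, b in pairs) / len(pairs)


def agglomerate(points, n_clusters):
    """Start with one cluster per text, then repeatedly merge the two
    closest clusters until only n_clusters remain."""
    clusters = [frozenset([name]) for name in points]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], points),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


# Made-up ("the", "of") coordinates for the hypothetical texts.
points = {
    "Writer 1": (0.08, 0.01),
    "Writer 2": (0.01, 0.08),
    "Writer 3": (0.04, 0.05),
    "Mystery Text": (0.07, 0.02),
}

for cluster in agglomerate(points, 2):
    print(sorted(cluster))
```

With these made-up numbers, Writer 1 and the Mystery Text merge first, then Writers 2 and 3, which mirrors the two neighborhoods described above. Divisive clustering works in the opposite direction, starting from one big group and splitting it.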
Finally, I treat the percentages of the top words as proxies for surprise. For instance, the Op-Ed uses the word “is” about as often as Nikki Haley tends to, and so the method won’t be particularly surprised at how often Haley uses this word. However, Haley uses “to” more often than the author of the Op-Ed, and so the method will be surprised by that discrepancy. In the end, I look to see whose texts “surprise” the Op-Ed’s text the least. (For data nerds: I use cross entropy for this analysis.)
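For the curious, cross entropy between two word distributions p and q is H(p, q) = -Σ p(w) · log q(w): it is smallest when q matches p, and it grows as the two diverge. The sketch below is one plausible way to set this up, not a record of the actual analysis. Treating the Op-Ed as p and each candidate as q, rescaling the chosen words to sum to 1, flooring zero frequencies at a tiny value, and the made-up numbers are all assumptions.

```python
import math


def normalize(freqs, words, floor=1e-6):
    """Keep only `words`, floor zero or missing values so log() is defined,
    and rescale so the values sum to 1. The floor is an assumed smoothing."""
    raw = {w: max(freqs.get(w, 0.0), floor) for w in words}
    total = sum(raw.values())
    return {w: value / total for w, value in raw.items()}


def cross_entropy(p, q, words):
    """H(p, q) = -sum over words of p(w) * log2 q(w), measured in bits."""
    p_n, q_n = normalize(p, words), normalize(q, words)
    return -sum(p_n[w] * math.log2(q_n[w]) for w in words)


# Made-up frequencies: Candidate A matches the Op-Ed, Candidate B does not.
op_ed = {"is": 0.02, "to": 0.03}
candidates = {
    "Candidate A": {"is": 0.02, "to": 0.03},
    "Candidate B": {"is": 0.01, "to": 0.04},
}
words = ["is", "to"]

scores = {name: cross_entropy(op_ed, f, words) for name, f in candidates.items()}
least_surprising = min(scores, key=scores.get)
print(least_surprising)  # prints "Candidate A"
```

The candidate with the lowest score is the one whose word habits "surprise" the Op-Ed the least.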
The two charts below show the resulting distance and surprise. In both, there’s some variation when the models use smaller numbers of words. The fluctuation that occurs around the 10, 20, and 30 marks shows Kellyanne Conway, Elaine Chao, and Ivanka Trump inconsistently battling for most-likely author. But one suspect consistently pops out in both these models. In both, Marsha Coats (the dark blue line) is a strong contender early on, and pulls into the lead as more words are added to the model.
Now, on the one hand, it would be entirely reasonable to expect Marsha Coats’s text to begin winning this race when more words are added into the model. After all, the one document of hers I’m using discusses the same subject matter as the Op-Ed: the president. Therefore, words like “Trump” and “Presidency” are potentially more likely to occur in both Mrs. Coats’s essay and in the Op-Ed.
On the other hand, the way Mrs. Coats’s text behaves in the cluster analyses somewhat mitigates these concerns. Recall that this process doesn’t just judge the distance between the Op-Ed and the texts of each official, but rather measures all distances between all texts in order to arrange the authors into different neighborhoods or clusters. Below, I show a clustering of the texts’ top 20 words. Now, even with the model relying on a small number of words, the analysis finds that Marsha Coats’s essay and the Op-Ed are sufficiently alike to deserve their own cul-de-sac. And this happens not only when I use 20 words, but also 30, 40, and 50 words. In fact, the same finding appears in all versions of this test up to 100 words! (There are plenty of interesting tidbits to glean from this chart, not the least of which is Nikki Haley’s linguistic iconoclasm compared to other members of this group of politicians, but these observations will have to wait for another day.)