luke hollis


Using Web Mining to Track Word Sense Variation, Inspired by Roland Barthes

relata: a Semiological Dictionary inspired by Roland Barthes (“related things”--or more literally, “things that refer back [to an origin]”) seeks to visualize the semiological potential contained in a word. It tracks word sense variations and acts as a descriptive definition of word usage in corpora mined from three primary sources--webpages indexed by Google, tweets, and Reddit comments--and shows how the meaning of the target word fluctuates between these sources. In his Elements of Semiology, Barthes describes the basic properties of the signifier as follows:

"The nature of the signifier suggests roughly the same remarks as that of the signified: it is purely a relatum, whose definition cannot be separated from that of the signified. . . . The clarification of the signifiers is nothing but the structuralisation proper of the system. What has to be done is to cut up the 'endless' message constituted by the whole of the messages emitted at the level of the studied corpus, into minimal significant units . . . then to group these units into paradigmatic classes, and finally to classify the syntagmatic relations which link these units."

relata attempts to demonstrate just such a network of complex interrelations between signified word classes and to depict the ‘avenues’ or ‘clusters’ of closely associated meanings that are most relevant to the target word. In graphing these relationships, I hoped that relata would provide some limited means of displaying and navigating what Barthes inherits from Lacan as an “articulated system of significations . . . that becomes a description of the collective field of imagination.”

I intended the graph to visualize Lacan’s “insistence of the signifying chain” that binds and orients the “imaginary incidence” underlying symbolic experience (in language and other systems). To identify patterns of meaning in the textual data mined from each source, I used latent Dirichlet allocation (LDA) topic modeling to generate topics from the corpus assembled for each word, initially modeling 500 topics with 100 learning iterations and then 200 topics with 50 learning iterations.
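To make the topic-modeling step concrete, here is a minimal sketch of latent Dirichlet allocation fitted by collapsed Gibbs sampling in plain Python. It is an illustration under stated assumptions, not the code behind relata: the function name `lda_gibbs`, the priors `alpha` and `beta`, the choice of Gibbs sampling as the inference method, and the tiny toy corpus are all hypothetical, and the topic and iteration counts are far smaller than the 500/100 and 200/50 settings mentioned above.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics, n_iter, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling (illustrative, not optimized).

    docs is a list of token lists. Returns one dict per topic that maps
    every vocabulary word to its probability under that topic.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    n_vocab = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]          # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(n_topics)]   # tokens per (topic, word)
    topic_total = [0] * n_topics                        # tokens per topic
    assign = []                                         # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed word distribution for each topic.
    return [
        {w: (topic_word[t][w] + beta) / (topic_total[t] + n_vocab * beta)
         for w in vocab}
        for t in range(n_topics)
    ]


# Hypothetical toy corpus; a real run would use the mined text for one word.
docs = [
    "climate emission warming carbon gas".split(),
    "kyoto treaty convention party climate".split(),
    "coup revolt spring government".split(),
    "commit version software revision".split(),
]
topics = lda_gibbs(docs, n_topics=3, n_iter=50)
for topic in topics:
    print(sorted(topic, key=topic.get, reverse=True)[:4])
```

Each returned dictionary is one topic, and the per-word probabilities are the kind of relevancy values that could then weight the links in the graph.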

In the force-directed graph, each topic is a child of the target word, and each word in a topic may belong to multiple parent topics--in this way, I hoped to picture the associations between topics and thus the avenues of meaning extending throughout the network of topic words. Clicking on any word in the graph highlights that word’s topic in red, its nearest neighbors in green, and its tertiary neighbors in navy blue. The relevancy of a link, whether between word and topic or between topic and target word, is shown in its width (wider links are more relevant) and its length (shorter links are more relevant--though this is partially obscured by word collision, friction, and other factors).
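The link encoding described above can be sketched as a small data transformation. This is a hedged illustration rather than relata's actual rendering code: the helper names `scale`, `make_link`, and `build_links`, the pixel ranges for width and length, the linear scaling, and the relevancy numbers are all assumptions chosen for the example.

```python
def scale(value, top, low, high):
    """Map a value in [0, top] linearly onto the range [low, high]."""
    frac = value / top if top else 0.0
    return low + frac * (high - low)


def make_link(source, dest, relevance, top):
    """Describe one link; more relevant links are wider and shorter."""
    return {
        "source": source,
        "target": dest,
        "width": scale(relevance, top, 1.0, 8.0),       # assumed pixel range
        "length": scale(relevance, top, 200.0, 30.0),   # assumed pixel range
    }


def build_links(target_word, topics):
    """topics maps topic id -> (topic relevance, {word: word relevance})."""
    top_topic = max(rel for rel, _ in topics.values())
    top_word = max(r for _, words in topics.values() for r in words.values())
    links = []
    for topic_id, (rel, words) in topics.items():
        # Each topic is a child of the target word.
        links.append(make_link(target_word, topic_id, rel, top_topic))
        # Each word is a child of every topic it appears in.
        for word, word_rel in words.items():
            links.append(make_link(topic_id, word, word_rel, top_word))
    return links


# Hypothetical relevancy values for two topics of the word "change".
topics = {
    "t0": (0.9, {"climate": 0.08, "emission": 0.04}),
    "t1": (0.3, {"climate": 0.02, "treaty": 0.05}),
}
links = build_links("change", topics)
parents = {link["source"] for link in links if link["target"] == "climate"}
print(parents)  # the word "climate" has both topics as parents
```

In a force-directed layout the `length` value would serve only as a preferred distance, which is why collisions and friction can obscure it in the rendered graph.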

The word “change,” for instance, when scaled to the 0.001 relevancy threshold, has its most relevant topic containing words such as “climate,” “emission,” “warming,” “temperature,” “report,” “gas,” “greenhouse,” “carbon,” etc. This topic shares terms directly with four other topics: home energy usage (“lighting,” “bulb,” “appliance,” “improvement,” etc.); legislation (“regulation,” “authority,” “act,” “administration,” etc.); climate patterns (“pressure,” “atmosphere,” “ozone,” “polar,” etc.); and finally the Kyoto convention (“kyoto,” “convention,” “party,” “treaty,” etc.). Through these, the topic is connected to eight additional tertiary topics which deviate further in meaning from the original topic concerning climate change, ranging from topics concerning technology to political protests. Together, all highlighted words map a single “avenue” of meaning for the target word, “change.” There are many such avenues of meaning from the word “change” that can be studied by exploring the graph--such as topics for governmental changes (“coup,” “revolt,” “spring,” etc.), version control software changes, and even gender changes.
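An avenue like the one just described can be traced as a two-step walk over topics that share terms. The sketch below is an assumption-laden illustration, not relata's implementation: the function name `trace_avenue`, the rule that two topics are neighbors when they share at least one word above the threshold, and every relevancy number in the toy data are hypothetical. Only the words themselves and the 0.001 threshold come from the example above.

```python
def trace_avenue(topics, start, threshold=0.001):
    """Color the topics reachable from a starting topic.

    topics maps topic id -> {word: relevance}. The starting topic is red,
    topics sharing a kept word with it are green, and topics sharing a
    kept word with a green topic are navy.
    """
    # Drop words whose relevance falls below the threshold.
    kept = {
        topic: {w for w, rel in words.items() if rel >= threshold}
        for topic, words in topics.items()
    }

    def sharing(group, seen):
        """Topics outside `seen` that share a word with any topic in `group`."""
        terms = set().union(*(kept[t] for t in group))
        return {t for t in kept if t not in seen and kept[t] & terms}

    red = {start}
    green = sharing(red, red)
    navy = sharing(green, red | green)

    colors = {t: "navy" for t in navy}
    colors.update({t: "green" for t in green})
    colors[start] = "red"
    return colors


# Hypothetical relevancy values; topic ids are labels chosen for readability.
topics = {
    "climate": {"climate": 0.09, "emission": 0.05, "warming": 0.04, "treaty": 0.002},
    "kyoto": {"kyoto": 0.06, "treaty": 0.05, "party": 0.03},
    "legislation": {"regulation": 0.05, "act": 0.04, "emission": 0.003},
    "protest": {"party": 0.04, "revolt": 0.03, "coup": 0.02},
    "software": {"version": 0.07, "commit": 0.05, "climate": 0.0004},
}
colors = trace_avenue(topics, "climate")
print(colors)
```

With these toy numbers the software topic stays uncolored, because its only shared word falls below the threshold. Lowering the threshold to 0.0001 would pull it into the green tier, which shows how the threshold controls how far an avenue spreads.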

So why is graphing this useful? In the most basic sense, tracing the avenues of meaning in a given word yields a much different definition than the denoted meaning ascribed to that word. For instance, “change” has a long list of meanings ascribed to it in its dictionary definition, but the first mention of “climate” is 10,000 pixels down a 13,000-pixel webpage. By tracking the varying avenues of meaning clustered around a given topic, we can understand more about the usage of a given word and the way it is transformed in different contexts, and we can also gain a statistical understanding that may reflect the ways in which the mind perceives that word and its associated cluster of terms.

Barthes discusses the two axes of language as those of syntagms and associations. relata concerns itself with the associations, which Saussure describes as follows: “the units which have something in common are associated in memory and thus form groups within which various relationships can be found.” relata hopes to explore these associations in a public sense and to study them as a means of gaining semiological understanding of a target word. It is the goal of relata to provide this understanding of each target word by tracking the clustering of topics from word usage.

In conclusion, relata is an experiment in exploring semiological principles through web mining and data visualization, which I intend to point toward more research rather than to offer significant results per se. Perhaps we already know a fair amount about the target words' usage in English (or about other words in currently spoken languages), but this sort of study could be especially beneficial for dead languages for which substantial corpora survive. Projects such as the Classical Language Toolkit or Arboreal with the Archimedes Project are great examples of progress being made in statistical analysis of classical languages. I hope that relata can serve as a model to foster more attention toward Barthes’s work and toward connections between semiological analysis and natural language processing.

But wouldn’t relata be much better with such-and-such feature? Yes. I’d love to hear from you! Let me know your thoughts in the comments below or at @lukehollis.