Three families of automated text analysis

Austin van Loon

doi:10.1016/j.ssresearch.2022.102798

Soc Sci Res. 2022 Nov:108:102798. doi: 10.1016/j.ssresearch.2022.102798. Epub 2022 Oct 1.

Author

Austin van Loon¹

Affiliation

¹ Stanford University, USA. Electronic address: avanloon@stanford.edu.

PMID: 36334926
DOI: 10.1016/j.ssresearch.2022.102798

Abstract

Since the beginning of this millennium, machine-readable, human-generated text has become increasingly available to social scientists, offering a unique window into social life. However, harnessing vast quantities of this highly unstructured data in a systematic way poses a distinctive combination of analytical and methodological challenges. Fortunately, our understanding of how to overcome these challenges has also developed greatly over the same period. In this article, I present a novel typology of the methods social scientists have used to analyze text data at scale in the interest of testing and developing social theory. I describe three "families" of methods: analyses of (1) term frequency, (2) document structure, and (3) semantic similarity. For each family, I discuss its logical and statistical foundations, its analytical strengths and weaknesses, and its prominent variants and applications.
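
As a rough illustration of two of the families named above, the following Python sketch computes a dictionary-style term frequency and a cosine similarity between vectors. It is not taken from the article: the function names, the toy sentence, the term set, and the toy vectors are all hypothetical, and the use of cosine similarity between vectors is only one common way to operationalize semantic similarity. The second family (document structure) is not illustrated here.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def term_frequency(text, terms):
    """Return the share of tokens in `text` that appear in the set `terms`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in terms)
    return hits / len(tokens)


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # similarity is undefined for a zero vector; 0.0 by convention here
    return dot / (norm_u * norm_v)


# Hypothetical toy inputs, chosen only for illustration.
sentence = "We trust our neighbors and we trust our friends"
trust_terms = {"trust"}
share = term_frequency(sentence, trust_terms)

# Hypothetical 3-dimensional "word vectors"; real applications would
# typically use vectors learned from a corpus.
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]
similarity = cosine_similarity(vec_a, vec_b)
```

In this sketch, `share` is the fraction of tokens in the sentence that match the term set, and `similarity` compares only the direction of the two vectors, so vectors that differ only in length are treated as maximally similar.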

Keywords: Automated text analysis; Social science; Text analysis; Text as data.

MeSH terms

Humans
Semantics*