12 Nov 2025

Textual Factors: A Scalable, Interpretable, and Data-Driven Approach to Analyzing Unstructured Information

Vice President Xiao Zhang and his co-authors published a peer-reviewed article in Management Science that introduces a general and interpretable framework for analyzing large-scale text data. The study combines modern neural language models with generative statistical techniques, allowing researchers to extract a structured set of latent “textual factors” from vast collections of unstructured text.

This factor structure not only captures the richness of natural language but also enables seamless integration with downstream regression analyses, making it easier to use insights from text data to answer important questions in economics, finance, and other social science fields.

The views expressed in this text are the sole responsibility of the authors and cannot be attributed to Compass Lexecon or any other parties.

Abstract

We introduce a general approach for analyzing large-scale text-based data, combining the strengths of neural network language processing and generative statistical modeling to create a factor structure of unstructured data for downstream regressions typically used in social sciences. We generate textual factors by (i) representing texts using vector word embedding, (ii) clustering the vectors using locality-sensitive hashing to generate supports of topics, and (iii) identifying relatively interpretable spanning clusters (i.e., textual factors) through topic modeling. Our data-driven approach captures complex linguistic structures while ensuring computational scalability and economic interpretability, plausibly attaining certain advantages over and complementing other unstructured data analytics used by researchers, including emergent large language models. We conduct initial validation tests of the framework and discuss three types of its applications: (i) enhancing prediction and inference with texts, (ii) interpreting (non–text-based) models, and (iii) constructing new text-based metrics and explanatory variables. We illustrate each of these applications using examples in finance and economics such as macroeconomic forecasting from news articles, interpreting multifactor asset pricing models from corporate filings, and measuring theme-based technology breakthroughs from patents. Finally, we provide a flexible statistical package of textual factors for online distribution to facilitate future research and applications.
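The three steps listed in the abstract can be illustrated with a toy sketch. This is not the authors' statistical package: the function names `lsh_clusters` and `top_factor`, the two-dimensional embeddings, and the document counts are all hypothetical, and a simple power iteration on document-term counts stands in for the topic-modeling step, whose exact form the abstract does not specify.

```python
import random
from collections import defaultdict

# Step (i), assumed given: made-up 2-D word embeddings for illustration only.
EMBEDDINGS = {
    "inflation": [1.0, 0.1],
    "rates": [0.9, 0.2],
    "patent": [0.1, 1.0],
    "invention": [0.2, 0.9],
}
VOCAB = list(EMBEDDINGS)
# Made-up document-term counts: rows are documents, columns follow VOCAB.
COUNTS = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 3, 1]]


def lsh_clusters(embeddings, n_planes=2, seed=0, planes=None):
    """Step (ii): bucket words by the sign pattern of their projections
    onto hyperplanes (random unless `planes` is supplied)."""
    if planes is None:
        rng = random.Random(seed)
        dim = len(next(iter(embeddings.values())))
        planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
    buckets = defaultdict(list)
    for word, vec in embeddings.items():
        key = tuple(sum(p * v for p, v in zip(plane, vec)) >= 0 for plane in planes)
        buckets[key].append(word)
    return list(buckets.values())


def top_factor(counts, vocab, cluster, n_iter=50):
    """Step (iii), stand-in: leading right singular vector of the
    document-term matrix restricted to one cluster, via power iteration.
    Returns (word weights, per-document loadings)."""
    cols = [vocab.index(w) for w in cluster]
    sub = [[row[c] for c in cols] for row in counts]
    v = [1.0] * len(cols)
    for _ in range(n_iter):
        u = [sum(a * b for a, b in zip(row, v)) for row in sub]      # A v
        v = [sum(sub[i][j] * u[i] for i in range(len(sub)))          # A^T u
             for j in range(len(cols))]
        norm = sum(x * x for x in v) ** 0.5
        if norm == 0:  # cluster words never occur in the corpus
            break
        v = [x / norm for x in v]
    loadings = [sum(a * b for a, b in zip(row, v)) for row in sub]
    return dict(zip(cluster, v)), loadings


for cluster in lsh_clusters(EMBEDDINGS):
    weights, loadings = top_factor(COUNTS, VOCAB, cluster)
    print(cluster, weights, loadings)
```

In this sketch, the per-document loadings returned by `top_factor` play the role of the textual factors: one numeric column per cluster that could be used as a regressor in a downstream model.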
