ANDREA SIMONETTI

MEASURING TOPIC COHERENCE THROUGH STATISTICALLY VALIDATED NETWORKS

  • Authors: Alessandro Albano; Andrea Simonetti
  • Year of publication: 2020
  • Type: Abstract in conference proceedings published in a volume
  • OA Link: http://hdl.handle.net/10447/455292

Abstract

Topic models arise from the need to understand and explore large collections of text documents and to predict their underlying structure. Latent Dirichlet Allocation (LDA) (Blei et al., 2003) has quickly become one of the most popular text modelling techniques. The idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. Unfortunately, topic models give no guarantee on the interpretability of their outputs: the topics learned from texts may be characterized by a set of irrelevant or unrelated words. Topic models therefore require validation of the coherence of the estimated topics. However, the automatic evaluation of the latent space of a topic model is a difficult task. Formerly, the most widely used metric for evaluating the quality of a topic model was the held-out likelihood, but the literature has shown that this method emphasizes complexity rather than interpretability. Although many procedures have recently been proposed (Röder et al., 2015), the automatic evaluation of topic coherence remains an open research area. Our work aims to provide a new technique based on Statistically Validated Networks (Tumminello et al., 2011). Our approach consists of representing each topic as a network of its most probable words. The presence of a link between each pair of words is assessed by statistically validating their co-occurrences in sentences against the null hypothesis of random co-occurrence. We then propose a new coherence measure based on the structure of the statistically validated network. The new measure also provides a ranking of topics and distinguishes high-quality from low-quality topics. The intuition is that the pairwise associations of words are closely related to the semantic coherence and interpretability of a topic.
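
The link-validation step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the statistical test, the multiple-testing correction, or the form of the coherence measure, so everything below is an assumption made for illustration: a one-sided hypergeometric test on sentence-level co-occurrence counts, a Bonferroni-corrected threshold, and the density of validated links as a stand-in score. The function names (`hypergeom_sf`, `validated_links`, `link_density`) and the `alpha` parameter are hypothetical and are not taken from the authors' work.

```python
from itertools import combinations
from math import comb


def hypergeom_sf(k, n_sentences, n_a, n_b):
    """P(X >= k) under random co-occurrence (assumed hypergeometric null).

    n_a sentences contain word a, n_b contain word b, out of n_sentences.
    """
    upper = min(n_a, n_b)
    total = comb(n_sentences, n_b)
    tail = sum(
        comb(n_a, x) * comb(n_sentences - n_a, n_b - x)
        for x in range(k, upper + 1)
    )
    return tail / total


def validated_links(sentences, topic_words, alpha=0.01):
    """Return word pairs whose co-occurrence count rejects the null.

    `sentences` is a list of token lists; `topic_words` holds a topic's
    most probable words. The Bonferroni correction is an assumption.
    """
    sets = [set(s) for s in sentences]
    n = len(sets)
    occ = {w: sum(w in s for s in sets) for w in topic_words}
    pairs = list(combinations(topic_words, 2))
    threshold = alpha / len(pairs)  # Bonferroni over all tested pairs
    links = []
    for a, b in pairs:
        k = sum(a in s and b in s for s in sets)
        if k > 0 and hypergeom_sf(k, n, occ[a], occ[b]) < threshold:
            links.append((a, b))
    return links


def link_density(links, topic_words):
    """Hypothetical placeholder score: share of word pairs that are linked."""
    m = len(topic_words)
    return len(links) / (m * (m - 1) / 2)
```

Under these assumptions, a topic whose top words systematically share sentences yields a dense validated network, while a topic made of unrelated words yields few or no links. Any score derived from the network (density here) could then be used to rank topics; the measure actually proposed by the authors may differ.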