ANTONELLA PLAIA

Statistically Validated Networks for evaluating coherence in topic models

Authors: Andrea Simonetti; Alessandro Albano; Antonella Plaia; Michele Tumminello
Year of publication: 2022
Type: Conference proceedings contribution published in a volume
OA Link: http://hdl.handle.net/10447/531495

Abstract

Probabilistic topic models have become one of the most widespread machine learning techniques for textual analysis. Within this framework, Latent Dirichlet Allocation (LDA) has gained increasing popularity as a text modelling technique. The idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. Unfortunately, topic models do not guarantee the interpretability of their outputs. The topics learned by the model may be characterized by a set of irrelevant or unrelated words, which makes them useless for interpretation. In topic quality evaluation, the pairwise semantic cohesion among the top-N most probable words of a given topic is calculated by measures based on word co-occurrences. Many topic-quality metrics have been proposed, each defining a different score: Pointwise Mutual Information (PMI), also called UCI; an asymmetrical measure called UMass; Normalized Pointwise Mutual Information (NPMI); a measure based on tf-idf scores; and a measure called CV, proposed by Röder et al. Although these measures already treat co-occurrence between words as a measure of association, none has taken a statistical approach based on hypothesis testing to assess whether the co-occurrence observed between two words can be attributed to chance or whether these links carry relevant information about the structure of topics. Thus, we propose a new coherence measure based on Statistically Validated Networks to evaluate the interpretability of the top words of a topic.
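
To make the contrast between a co-occurrence score and a co-occurrence test concrete, the following is a minimal Python sketch, not the authors' implementation. It computes NPMI for pairs of top words from document-level co-occurrence counts, and then tests each pair against a null hypothesis of random co-occurrence. The abstract does not specify the test or the correction used, so the hypergeometric null and the Bonferroni threshold are assumptions here, as are the toy corpus and all function names (`doc_frequencies`, `npmi`, `cooccurrence_p_value`, `validated_links`).

```python
import math
from itertools import combinations


def doc_frequencies(docs, words):
    """Count the documents containing each word and each unordered word pair."""
    doc_sets = [set(d) for d in docs]
    single = {w: sum(w in s for s in doc_sets) for w in words}
    pair = {
        (a, b): sum(a in s and b in s for s in doc_sets)
        for a, b in combinations(words, 2)
    }
    return single, pair


def npmi(n_a, n_b, n_ab, n_docs):
    """Normalized PMI in [-1, 1], from document counts."""
    if n_ab == 0:
        return -1.0  # convention: the words never co-occur
    if n_ab == n_docs:
        return 1.0  # limit case: both words appear in every document
    p_a, p_b, p_ab = n_a / n_docs, n_b / n_docs, n_ab / n_docs
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


def cooccurrence_p_value(n_a, n_b, n_ab, n_docs):
    """P(X >= n_ab) under a hypergeometric null (assumed here): the n_b
    documents containing word b are drawn at random from n_docs documents,
    n_a of which contain word a."""
    numerator = sum(
        math.comb(n_a, k) * math.comb(n_docs - n_a, n_b - k)
        for k in range(n_ab, min(n_a, n_b) + 1)
    )
    return numerator / math.comb(n_docs, n_b)


def validated_links(docs, words, alpha=0.05):
    """Pairs whose co-occurrence is significant after a Bonferroni
    correction over all tested pairs (the correction is an assumption)."""
    single, pair = doc_frequencies(docs, words)
    if not pair:
        return []
    threshold = alpha / len(pair)
    return [
        (a, b)
        for (a, b), n_ab in pair.items()
        if cooccurrence_p_value(single[a], single[b], n_ab, len(docs)) < threshold
    ]


# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["cat", "dog", "pet"],
    ["cat", "dog"],
    ["cat", "dog", "vet"],
    ["stock", "market"],
    ["stock", "market", "trade"],
    ["market", "trade"],
]
words = ["cat", "dog", "market"]  # stand-in for the top-N words of one topic

single, pair = doc_frequencies(docs, words)
for (a, b), n_ab in pair.items():
    score = npmi(single[a], single[b], n_ab, len(docs))
    p = cooccurrence_p_value(single[a], single[b], n_ab, len(docs))
    print(a, b, round(score, 3), round(p, 3))
```

The score and the test answer different questions: NPMI ranks pairs by strength of association, while the p-value states how surprising the observed co-occurrence count is under the null. On a corpus this small even a perfectly associated pair may fail a strict corrected threshold, which is the kind of distinction a hypothesis-testing approach is meant to expose.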