MARIANGELA SCIANDRA

Exploring topics in LDA models through Statistically Validated Networks: directed and undirected approaches

Authors: Alessandro Albano; Mariangela Sciandra; Antonella Plaia
Publication year: 2022
Type: Conference proceedings contribution published in a volume
OA Link: http://hdl.handle.net/10447/576096

Abstract

Probabilistic topic models are machine learning tools for processing and understanding large text document collections. Among the different models in the literature, Latent Dirichlet Allocation (LDA) has turned out to be the benchmark of the topic modelling community. The key idea is to represent text documents as random mixtures over latent semantic structures called topics. Each topic follows a multinomial distribution over the vocabulary words. To understand the result of a topic model, researchers usually select the top-n words with the highest probability given a topic (the essential words) and look for meaningful and interpretable semantic themes. This work proposes a new method for exploring topics in LDA models, using Statistically Validated Networks (SVNs). The main idea of the proposed method is to consider co-occurrence between essential words as a measure of association. Two different approaches, called undirected and directed, are proposed. The undirected approach takes into account the symmetrical association between two words, i.e. how many times two words are found in the same sentence. In the directed approach, the order in which the words appear in the sentence is also considered. We use hypothesis testing to assess whether the co-occurrence between two words can be attributed to chance or whether these links carry relevant information about the structure of topics. Specifically, textual data is represented as a bipartite network in which one set of nodes consists of sentences and the other set consists of the essential words associated with a given topic. A link between a word and a sentence is set if the word belongs to that sentence. The projection of the bipartite network on the set of words therefore results in a word co-occurrence network. Note that the directed approach produces a directed network, while the undirected one produces an undirected network. Indeed, a directed link from one word to another may be validated, but not the other way around. The two methods are applied to a real dataset, highlighting the differences.
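
The pipeline described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say which test is used, so the sketch assumes the hypergeometric null model with a Bonferroni correction that is common in the SVN literature. The function names (`hypergeom_sf`, `validated_undirected_links`, `directed_counts`) and the toy sentences are hypothetical. The directed part only counts ordered co-occurrences (first occurrence of one word before the first occurrence of the other) and does not attempt to validate them.

```python
from collections import Counter
from itertools import combinations, permutations
from math import comb


def hypergeom_sf(x, N, K, n):
    """P(X >= x) for X ~ Hypergeometric(N sentences, K with word a, n with word b)."""
    total = comb(N, n)
    upper = min(K, n)
    return sum(comb(K, k) * comb(N - K, n - k) for k in range(x, upper + 1)) / total


def validated_undirected_links(sentences, essential_words, alpha=0.05):
    """Project the sentence-word bipartite network onto words and keep only
    links whose co-occurrence count is unlikely under the assumed null."""
    words = sorted(essential_words)
    N = len(sentences)
    # Bipartite network: each word is linked to the sentences that contain it.
    occ = {w: {i for i, s in enumerate(sentences) if w in s} for w in words}
    pairs = list(combinations(words, 2))
    # Bonferroni threshold over all tested pairs (an assumption of this sketch).
    threshold = alpha / len(pairs) if pairs else alpha
    links = {}
    for a, b in pairs:
        n_ab = len(occ[a] & occ[b])  # sentences containing both words
        if n_ab == 0:
            continue
        p = hypergeom_sf(n_ab, N, len(occ[a]), len(occ[b]))
        if p < threshold:
            links[(a, b)] = p
    return links


def directed_counts(sentences, essential_words):
    """Count sentences in which word a first appears before word b.
    Raw counts only; no statistical validation is attempted here."""
    counts = Counter()
    for s in sentences:
        present = [w for w in essential_words if w in s]
        for a, b in permutations(present, 2):
            if s.index(a) < s.index(b):
                counts[(a, b)] += 1
    return counts


# Toy data (made up): each sentence is a list of tokens.
sentences = [["topic", "model"]] * 8 + [["network"]] * 4 + [["text"]] * 8
essential = {"topic", "model", "network"}

links = validated_undirected_links(sentences, essential)
counts = directed_counts(sentences, essential)
```

In this toy corpus only the pair ("model", "topic") survives validation, while the directed counts are asymmetric: "topic" precedes "model" in every sentence where both occur, never the reverse. This mirrors the remark that a directed link may hold in one direction only.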