ANDREA SIMONETTI

Statistically Validated Network approach for document clustering and topic modeling

  • Authors: Andrea Simonetti; Alessandro Albano
  • Publication year: 2023
  • Type: Conference proceedings contribution published in a volume
  • OA Link: http://hdl.handle.net/10447/631817

Abstract

In machine learning, document clustering and topic modeling are scientific challenges concerning the extraction of useful information from a collection of texts. Traditional approaches, such as Latent Dirichlet Allocation (LDA), rely on maximising likelihood functions. In this paper, we explore a paradigm shift towards a network representation of textual data and the associated challenges of community detection [3]. We propose a new method for document clustering and topic modeling that represents a collection of documents as a bipartite network. We then apply Statistically Validated Networks (SVN) to filter out irrelevant connections within the projected networks of words and documents. The SVN method is promising in the framework of topic modeling. For instance, Simonetti et al. (2022) recently proposed a new application of SVN to measure the coherence of topics. Here, instead, we aim to identify the topics themselves. By doing so, we can naturally find topics with high coherence according to the measure proposed by those authors. Moreover, the modularity contribution of each community (topic) can be interpreted as a measure of coherence, since it is an intensive quantity that assesses the tendency of words within a given topic to occur jointly in the same sentences.
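
The filtering step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the details of the test, so the following assumes the formulation commonly used for SVN: a hypergeometric null hypothesis for the co-occurrence of two words, with a Bonferroni correction over all word pairs. The function names `hypergeom_sf` and `validated_word_links`, the `alpha` parameter, and the toy corpus are hypothetical and serve only as illustration; this is not the authors' implementation.

```python
from itertools import combinations
from math import comb


def hypergeom_sf(k, n_units, n_a, n_b):
    """P(X >= k) for X ~ Hypergeometric: chance that two words occurring in
    n_a and n_b of n_units text units co-occur in at least k of them."""
    upper = min(n_a, n_b)
    total = comb(n_units, n_b)
    tail = sum(comb(n_a, x) * comb(n_units - n_a, n_b - x)
               for x in range(k, upper + 1))
    return tail / total


def validated_word_links(units, alpha=0.01):
    """Project the bipartite unit-word network onto words and keep only the
    links whose co-occurrence is significant after Bonferroni correction.
    (Assumed test and correction; the abstract does not specify them.)"""
    n_units = len(units)
    occurrences = {}
    for i, words in enumerate(units):
        for w in set(words):
            occurrences.setdefault(w, set()).add(i)
    vocab = sorted(occurrences)
    n_tests = len(vocab) * (len(vocab) - 1) // 2
    if n_tests == 0:
        return {}
    threshold = alpha / n_tests  # Bonferroni-corrected significance level
    links = {}
    for a, b in combinations(vocab, 2):
        k = len(occurrences[a] & occurrences[b])
        if k == 0:
            continue  # no co-occurrence, nothing to validate
        p = hypergeom_sf(k, n_units, len(occurrences[a]), len(occurrences[b]))
        if p < threshold:
            links[(a, b)] = p
    return links


# Toy corpus (hypothetical): two themes plus a word present everywhere.
units = ([["the", "goal", "match"]] * 10) + ([["the", "vote", "election"]] * 10)
links = validated_word_links(units)
print(sorted(links))  # [('election', 'vote'), ('goal', 'match')]
```

In this toy example the word "the" co-occurs with every other word, yet none of its links is validated, because that co-occurrence is exactly what the null hypothesis predicts for a word present in every unit. Community detection would then be run on the validated network; here the units could equally be sentences, in line with the sentence-level co-occurrence mentioned in the abstract.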