TextNetTopics Pro, a topic model-based text classification for short text by integration of semantic and document-topic distribution information
Abstract
With the exponential growth in the daily publication of scientific articles, automatic
classification and categorization can assist in assigning articles to a predefined
category. Article titles are concise descriptions of the articles’ content with
valuable information that can be useful in document classification and
categorization. However, shortness, data sparseness, limited word occurrences,
and the inadequate contextual information of scientific document titles hinder the
direct application of conventional text mining and machine learning algorithms on
these short texts, making their classification a challenging task. This study first
explores the performance of our earlier approach, TextNetTopics, on short text.
Second, we propose an advanced version called TextNetTopics Pro, which
is a novel short-text classification framework that utilizes a promising combination
of lexical features organized in topics of words and topic distribution extracted by
a topic model to alleviate the data-sparseness problem when classifying short
texts. We evaluate our proposed approach using nine state-of-the-art short-text
topic models on two publicly available datasets of scientific article titles as
short-text documents. The first dataset is related to the Biomedical field, and the other
one is related to Computer Science publications. Additionally, we comparatively
evaluate the predictive performance of the models generated with and without
using the abstracts. Finally, we demonstrate the robustness and effectiveness of
the proposed approach in handling imbalanced data, particularly in the
classification of Drug-Induced Liver Injury articles as part of the CAMDA
challenge. Taking advantage of the semantic information detected by topic
models proved to be a reliable way to improve the overall performance of ML
classifiers.
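
As a rough illustration of the feature combination described above, the sketch below concatenates lexical counts for words grouped by topic with a document-topic distribution for a single title. It is a minimal sketch under stated assumptions, not the TextNetTopics Pro implementation: the function name `build_features`, the two example topics, and the topic distribution values are all hypothetical, and the topic model that would produce them is not shown.

```python
from collections import Counter


def build_features(tokens, topic_words, doc_topic_dist):
    """Concatenate per-topic lexical counts with a document-topic distribution.

    tokens         -- list of words in one short text (e.g. an article title)
    topic_words    -- list of topics, each a list of words (assumed topic model output)
    doc_topic_dist -- list of topic probabilities for this text (assumed topic model output)
    """
    counts = Counter(tokens)
    # Lexical part: one count per word, with words kept in their topic groups.
    lexical = [counts[word] for words in topic_words for word in words]
    # Semantic part: the document-topic distribution appended as extra features.
    return lexical + list(doc_topic_dist)


# Hypothetical topics and topic distribution, for illustration only.
topic_words = [["liver", "injury", "drug"], ["neural", "network"]]
title = "drug induced liver injury prediction".split()
doc_topic_dist = [0.9, 0.1]

features = build_features(title, topic_words, doc_topic_dist)
print(features)  # [1, 1, 1, 0, 0, 0.9, 0.1]
```

A vector of this kind could then be passed to any standard classifier; the sparse lexical counts of a short title are complemented by the dense topic probabilities.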