Search and classify topics in a corpus of text using the latent dirichlet allocation model

Iparraguirre-Villanueva, Orlando; Sierra-Liñan, Fernando Alex; Herrera Salazar, José Luis; Beltozar-Clemente, Saul; Pucuhuayla-Revatta, Félix; Zapata-Paulini, Joselyn E.; Cabanillas-Carbonell, Michael A.

dc.contributor.author	Iparraguirre-Villanueva, Orlando
dc.contributor.author	Sierra-Liñan, Fernando Alex
dc.contributor.author	Herrera Salazar, José Luis
dc.contributor.author	Beltozar-Clemente, Saul
dc.contributor.author	Pucuhuayla-Revatta, Félix
dc.contributor.author	Zapata-Paulini, Joselyn E.
dc.contributor.author	Cabanillas-Carbonell, Michael A.
dc.date.accessioned	2025-09-05T16:33:56Z
dc.description.abstract	This work aims to discover topics in a text corpus and to classify the most relevant terms for each discovered topic. The process was performed in four steps: first, document extraction and data processing; second, labeling and training on the data; third, labeling of the unseen data; and fourth, evaluation of model performance. For processing, a total of 10,322 "curriculum" documents related to data science were collected from the web during 2018-2022. The latent Dirichlet allocation (LDA) model was used to analyze and structure the topics. After processing, 12 topics were generated, which made it possible to rank the most relevant terms and identify the skills of each candidate. This work concludes that candidates interested in data science must first of all have technical skills: mastery of structured query language (SQL), of programming languages such as R, Python, and Java, and of data management, among other tools associated with the technology. © 2023 Elsevier B.V. All rights reserved.
dc.identifier.doi	10.11591/ijeecs.v30.i1.pp246-256
dc.identifier.scopus	2-s2.0-85147159751
dc.identifier.uri	https://cris.uwiener.edu.pe/handle/001/401
dc.identifier.uuid	62d42f48-bb07-49ec-bacc-793a637d2f2a
dc.language.iso	en
dc.publisher	Institute of Advanced Engineering and Science
dc.relation.citationissue	1
dc.relation.citationvolume	30
dc.relation.ispartofseries	Indonesian Journal of Electrical Engineering and Computer Science
dc.relation.issn	2502-4760
dc.rights	http://purl.org/coar/access_right/c_abf2
dc.title	Search and classify topics in a corpus of text using the latent dirichlet allocation model
dc.type	http://purl.org/coar/resource_type/c_2df8fbb1
dspace.entity.type	Publication
oaire.citation.endPage	256
oaire.citation.startPage	246
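
The abstract describes fitting an LDA model and then ranking the most relevant terms per topic. As a rough illustration only, the sketch below fits LDA on a tiny invented corpus using collapsed Gibbs sampling in plain Python. The record does not say which inference method or library the authors used, so the sampler, the function names `fit_lda` and `top_terms`, the hyperparameters `alpha` and `beta`, and the toy documents are all assumptions made for this example; the study itself reports 12 topics over 10,322 documents, not the 2 topics used here.

```python
import random


def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling (an assumed method, not the paper's).

    docs is a list of token lists. Returns (topic_word_counts, vocab).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]               # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                              # tokens per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][n]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Conditional probability of each topic for this token,
                # up to a normalizing constant.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][n] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    return topic_word, vocab


def top_terms(topic_word, vocab, n=3):
    """Rank the n most frequent terms assigned to each topic."""
    ranked = []
    for row in topic_word:
        order = sorted(range(len(vocab)), key=lambda i: -row[i])
        ranked.append([vocab[i] for i in order[:n]])
    return ranked


# Invented, pre-tokenized snippets standing in for the collected documents.
docs = [
    ["python", "sql", "pandas", "sql", "python"],
    ["sql", "database", "python", "etl"],
    ["statistics", "regression", "r", "modeling"],
    ["r", "statistics", "modeling", "regression", "statistics"],
]

topic_word, vocab = fit_lda(docs, num_topics=2)
topics = top_terms(topic_word, vocab, n=3)
for k, terms in enumerate(topics):
    print(f"topic {k}: {', '.join(terms)}")
```

The printed terms depend on the random seed and on the toy data, so they should not be read as results from the study. On a corpus of realistic size, an established library implementation would normally be preferable to a hand-written sampler.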