Improved Topic Modeling in Biomedical Texts Using UMLS: A MedMentions Approach
Abstract
Effectively extracting and classifying topics from large volumes of medical text is crucial for knowledge discovery and information retrieval in biomedical research. Traditional topic modeling techniques, while useful, often struggle to capture the intricate semantics of medical terminology. This study investigates how integrating Unified Medical Language System (UMLS) concepts can enhance topic modeling on the MedMentions dataset. We evaluate four approaches: BERTopic, Latent Dirichlet Allocation (LDA), LDA with a Recurrent Network (LDA-RNet), and a novel BERTopic with a Recurrent Network (BERTopic-RNet), incorporating UMLS concepts into each to improve topic coherence and relevance. Experimental results show that the UMLS-enhanced models significantly outperform their conventional counterparts in both topic coherence and clinical relevance. These findings offer practical insight into applying advanced topic modeling to medical text analysis and pave the way for more effective, interpretable medical data mining.
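To make the UMLS-integration idea concrete, the following Python sketch (not the paper's implementation) appends UMLS preferred names to MedMentions-style documents before fitting LDA with gensim and scoring topic coherence. The sample records, the cui_to_name lookup, and the enrich helper are illustrative assumptions; in practice the concept names would come from the UMLS Metathesaurus (MRCONSO) and the CUIs from the MedMentions annotations.

# Minimal, illustrative sketch: UMLS-concept enrichment before LDA.
# All data below are invented placeholders for demonstration only.
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

# Hypothetical MedMentions-style records: raw text plus the UMLS CUIs
# annotated for its mention spans.
records = [
    {"text": "checkpoint blockade in metastatic melanoma patients",
     "cuis": ["C0025202"]},
    {"text": "statin therapy lowers ldl cholesterol levels",
     "cuis": ["C0360714", "C0023824"]},
]

# Placeholder CUI-to-preferred-name lookup (illustrative values only).
cui_to_name = {
    "C0025202": "melanoma",
    "C0360714": "statin",
    "C0023824": "ldl cholesterol",
}

def enrich(record):
    # Append UMLS preferred names as extra tokens so concept-level
    # vocabulary reinforces the surface text during topic inference.
    tokens = record["text"].split()
    for cui in record["cuis"]:
        tokens.extend(cui_to_name.get(cui, "").split())
    return [t for t in tokens if t]

docs = [enrich(r) for r in records]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=2, random_state=0, passes=10)

# UMass coherence over the same corpus; higher (less negative) is better.
score = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary,
                       coherence="u_mass").get_coherence()
print(f"u_mass coherence: {score:.3f}")

The same enriched documents, joined back into strings, could similarly be passed to BERTopic's fit_transform for the transformer-based variants.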
References
Srivastava, A., & Sutton, C. (2017). Autoencoding Variational Inference for Topic Models. Proceedings of the International Conference on Learning Representations (ICLR). Available: https://arxiv.org/abs/1703.01488
Qiang, J., Qian, Z., Li, Y., Yuan, Y., & Wu, X. (2020). Short Text Topic Modeling Techniques, Applications, and Performance: A Survey. IEEE Transactions on Knowledge and Data Engineering. DOI: 10.1109/TKDE.2020.2981333
Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(Suppl 1), 5228-5235. DOI: 10.1073/pnas.0307752101
Grootendorst, M. (2022). BERTopic: Neural Topic Modeling with a Class-based TF-IDF Procedure. arXiv preprint arXiv:2203.05794. Available: https://arxiv.org/abs/2203.05794
Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735-1780. DOI: 10.1162/neco.1997.9.8.1735
Bodenreider, O. (2004). The Unified Medical Language System (UMLS): Integrating Biomedical Terminology. Nucleic Acids Research, 32(Database issue), D267-D270. DOI: 10.1093/nar/gkh061
Cohen, T., Schvaneveldt, R. W., & Widdows, D. (2010). Reflective Random Indexing and indirect inference: A scalable method for discovery of implicit connections. Journal of Biomedical Informatics, 43(2), 240-256.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993-1022.
Roberts, K., Demner-Fushman, D., & Tonning, J. M. (2018). Overview of the TAC 2018 Drug-Drug Interaction Extraction from Drug Labels Track. In Proceedings of the Text Analysis Conference (TAC).
McCray, A. T., Burgun, A., & Bodenreider, O. (2001). Aggregating UMLS Semantic Types for Reducing Conceptual Complexity. Studies in Health Technology and Informatics, 84, 216-220.
Dieng, A. B., Ruiz, F. J. R., & Blei, D. M. (2020). Topic Modeling in Embedding Spaces. Transactions of the Association for Computational Linguistics, 8, 439-453.
Liu, F., Yu, H., & Zhou, Y. (2016). Enhanced Medical Named Entity Recognition with UMLS Concept Mapping. Journal of Biomedical Informatics, 60, 334-341.
Mohan, S., & Li, D. (2019). MedMentions: A Large Biomedical Corpus Annotated with UMLS Concepts. In Proceedings of the Conference on Automated Knowledge Base Construction (AKBC). Available: https://arxiv.org/abs/1902.09476
Liu, S., Ma, W., Moore, R., Ganesan, V., & Nelson, S. (2005). RxNorm: prescription for electronic drug information exchange. IT Professional, 7(5), 17-23. DOI: 10.1109/MITP.2005.128
Viégas, F. B., Wattenberg, M., van Ham, F., Kriss, J., & McKeon, M. (2007). ManyEyes: a Site for Visualization at Internet Scale. IEEE Transactions on Visualization and Computer Graphics, 13(6), 1121-1128. DOI: 10.1109/TVCG.2007.70577
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171-4186. DOI: 10.18653/v1/N19-1423
Hoffman, M., Bach, F., & Blei, D. (2010). Online Learning for Latent Dirichlet Allocation. Advances in Neural Information Processing Systems (NIPS), 23, 856-864. Available: https://proceedings.neurips.cc/paper/2010/file/390236f13b9c28d3c8f616378dd3b07b-Paper.pdf
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877-1901. DOI: 10.5555/3455716.3455749
Mikolov, T., Karafiát, M., Burget, L., Černocký, J., & Khudanpur, S. (2010). Recurrent neural network based language model. INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association. Available: https://www.isca-speech.org/archive/interspeech_2010/i10_1045.html
Graves, A., Mohamed, A.-r., & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). DOI: 10.1109/ICASSP.2013.6638947
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
You are free to:
- Share — copy and redistribute the material in any medium or format
- Adapt — remix, transform, and build upon the material for any purpose, even commercially.
Terms:
- Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.