Improved Topic Modeling in Biomedical Texts Using UMLS: A MedMentions Approach

Authors

  • S. Jayabharathi
  • M. Logambal

Abstract

Effectively extracting and classifying topics from large volumes of medical text is crucial for knowledge discovery and information retrieval in biomedical research. Traditional topic modeling techniques, while useful, often struggle to capture the intricate semantics of medical terminology. This research investigates how integrating Unified Medical Language System (UMLS) concepts can enhance topic modeling on the MedMentions dataset. We evaluate four approaches: BERTopic, Latent Dirichlet Allocation (LDA), LDA with a Recurrent Network (LDA-RNet), and a novel BERTopic with a Recurrent Network (BERTopic-RNet). Our goal is to improve topic coherence and relevance by incorporating UMLS concepts into these models. Experimental results demonstrate that the UMLS-enhanced models significantly outperform conventional methods in both topic coherence and clinical relevance. This study provides valuable insights into the application of advanced topic modeling techniques to medical text analysis, paving the way for more effective and interpretable medical data mining.
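
As a rough illustration of the kind of pipeline the abstract describes, the Python sketch below augments a few toy documents with UMLS preferred concept names, fits a plain LDA baseline with gensim, and scores it with c_v topic coherence. This is a minimal sketch under stated assumptions: the toy documents, the concept lists, the concept-augmentation strategy, and all parameter values are illustrative inventions, since the abstract does not disclose how the authors inject UMLS concepts or configure their models.

# Minimal sketch, not the authors' pipeline: the documents, UMLS concept
# lists, augmentation step, and hyperparameters below are illustrative
# assumptions only.
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

# Toy stand-ins for MedMentions abstracts and their UMLS concept annotations
# (real MedMentions links text spans to UMLS concepts via CUIs).
docs = [
    "metformin therapy in patients with type 2 diabetes mellitus",
    "magnetic resonance imaging of hepatic lesions in neonates",
    "insulin resistance and glycemic control in diabetic patients",
    "ultrasound imaging of congenital liver malformations in infants",
]
umls_concepts = [
    ["metformin", "diabetes mellitus"],
    ["magnetic resonance imaging", "liver neoplasms", "infant newborn"],
    ["insulin resistance", "diabetes mellitus"],
    ["ultrasonography", "liver diseases", "infant newborn"],
]

# One plausible UMLS enhancement: append each document's preferred concept
# names so controlled-vocabulary terms influence the learned topics.
augmented = [f"{d} {' '.join(c)}" for d, c in zip(docs, umls_concepts)]
tokenized = [doc.split() for doc in augmented]

# Classical LDA baseline on the concept-augmented text.
dictionary = Dictionary(tokenized)
corpus = [dictionary.doc2bow(toks) for toks in tokenized]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=20, random_state=42)

# Topic coherence (c_v) is one standard proxy for topic quality.
coherence = CoherenceModel(model=lda, texts=tokenized, dictionary=dictionary,
                           coherence="c_v").get_coherence()
print(f"LDA c_v coherence on toy corpus: {coherence:.3f}")

# On a realistically sized corpus, the same augmented texts could be passed
# to BERTopic instead of LDA (it needs enough documents for UMAP/HDBSCAN):
#   from bertopic import BERTopic
#   topics, probs = BERTopic().fit_transform(augmented)

Appending concept names to the raw text is only one of several plausible ways to inject UMLS knowledge; concept-level tokenization or semantic-type filtering would be equally consistent readings of the abstract.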

Published

2025-05-28

How to Cite

1. Jayabharathi S, Logambal M. Improved Topic Modeling in Biomedical Texts Using UMLS: A MedMentions Approach. J Neonatal Surg [Internet]. 2025 May 28 [cited 2025 Sep. 20];14(30S):887-900. Available from: https://www.jneonatalsurg.com/index.php/jns/article/view/6656