Improved Topic Modeling in Biomedical Texts Using UMLS: A MedMentions Approach
Abstract
Effectively extracting and classifying topics from large volumes of medical text is crucial for knowledge discovery and information retrieval in biomedical research. Traditional topic modeling techniques, while useful, often struggle to capture the intricate semantics of medical terminology. This study investigates how integrating Unified Medical Language System (UMLS) concepts can enhance topic modeling on the MedMentions dataset. We evaluate four approaches: BERTopic, Latent Dirichlet Allocation (LDA), LDA with a Recurrent Network (LDA-RNet), and a novel BERTopic with a Recurrent Network (BERTopic-RNet), incorporating UMLS concepts into each to improve topic coherence and relevance. Experimental results show that the UMLS-enhanced models significantly outperform their conventional counterparts in both topic coherence and clinical relevance. These findings offer practical insight into applying advanced topic modeling to medical text analysis and pave the way for more effective, interpretable medical data mining.
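To make the UMLS-integration idea concrete, the following Python sketch (not the paper's implementation) appends UMLS preferred names to MedMentions-style documents before fitting LDA with gensim and scoring topic coherence. The sample records, the cui_to_name lookup, and the enrich helper are illustrative assumptions; in practice the concept names would come from the UMLS Metathesaurus (MRCONSO) and the CUIs from the MedMentions annotations.

# Minimal, illustrative sketch: UMLS-concept enrichment before LDA.
# All data below are invented placeholders for demonstration only.
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

# Hypothetical MedMentions-style records: raw text plus the UMLS CUIs
# annotated for its mention spans.
records = [
    {"text": "checkpoint blockade in metastatic melanoma patients",
     "cuis": ["C0025202"]},
    {"text": "statin therapy lowers ldl cholesterol levels",
     "cuis": ["C0360714", "C0023824"]},
]

# Placeholder CUI-to-preferred-name lookup (illustrative values only).
cui_to_name = {
    "C0025202": "melanoma",
    "C0360714": "statin",
    "C0023824": "ldl cholesterol",
}

def enrich(record):
    # Append UMLS preferred names as extra tokens so concept-level
    # vocabulary reinforces the surface text during topic inference.
    tokens = record["text"].split()
    for cui in record["cuis"]:
        tokens.extend(cui_to_name.get(cui, "").split())
    return [t for t in tokens if t]

docs = [enrich(r) for r in records]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=2, random_state=0, passes=10)

# UMass coherence over the same corpus; higher (less negative) is better.
score = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary,
                       coherence="u_mass").get_coherence()
print(f"u_mass coherence: {score:.3f}")

The same enriched documents, joined back into strings, could similarly be passed to BERTopic's fit_transform for the transformer-based variants.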
References
Srivastava, A., & Sutton, C. (2017). Autoencoding Variational Inference for Topic Models. Proceedings of the International Conference on Learning Representations (ICLR). Available: https://arxiv.org/abs/1703.01488
Qiang, J., Qian, Z., Li, Y., Yuan, Y., & Wu, X. (2020). Short Text Topic Modeling Techniques, Applications, and Performance: A Survey. IEEE Transactions on Knowledge and Data Engineering. DOI: 10.1109/TKDE.2020.2981333
Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(Suppl 1), 5228-5235. DOI: 10.1073/pnas.0307752101
Grootendorst, M. (2022). BERTopic: Neural Topic Modeling with a Class-based TF-IDF Procedure. arXiv preprint arXiv:2203.05794. Available: https://arxiv.org/abs/2203.05794
Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735-1780. DOI: 10.1162/neco.1997.9.8.1735
Bodenreider, O. (2004). The Unified Medical Language System (UMLS): Integrating Biomedical Terminology. Nucleic Acids Research, 32(Database issue), D267-D270. DOI: 10.1093/nar/gkh061
Cohen, T., Schvaneveldt, R. W., & Widdows, D. (2010). Reflective Random Indexing and indirect inference: A scalable method for discovery of implicit connections. Journal of Biomedical Informatics, 43(2), 240-256.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993-1022.
Roberts, K., Demner-Fushman, D., & Tonning, J. M. (2018). Overview of the TAC 2018 Drug-Drug Interaction Extraction from Drug Labels Track. In Proceedings of the Text Analysis Conference (TAC).
McCray, A. T., Burgun, A., & Bodenreider, O. (2001). Aggregating UMLS Semantic Types for Reducing Conceptual Complexity. Studies in Health Technology and Informatics, 84, 216-220.
Dieng, A. B., Ruiz, F. J. R., & Blei, D. M. (2020). Topic Modeling in Embedding Spaces. Transactions of the Association for Computational Linguistics, 8, 439-453.
Liu, F., Yu, H., & Zhou, Y. (2016). Enhanced Medical Named Entity Recognition with UMLS Concept Mapping. Journal of Biomedical Informatics, 60, 334-341.
Mohan, S., & Li, D. (2019). MedMentions: A Large Biomedical Corpus Annotated with UMLS Concepts. In Proceedings of the Conference on Automated Knowledge Base Construction (AKBC). Available: https://arxiv.org/abs/1902.09476
Liu, S., Ma, W., Moore, R., Ganesan, V., & Nelson, S. (2005). RxNorm: prescription for electronic drug information exchange. IT Professional, 7(5), 17-23. DOI: 10.1109/MITP.2005.128
Viégas, F. B., Wattenberg, M., van Ham, F., Kriss, J., & McKeon, M. (2007). ManyEyes: a Site for Visualization at Internet Scale. IEEE Transactions on Visualization and Computer Graphics, 13(6), 1121-1128. DOI: 10.1109/TVCG.2007.70577
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171-4186. DOI: 10.18653/v1/N19-1423
Hoffman, M., Bach, F., & Blei, D. (2010). Online Learning for Latent Dirichlet Allocation. Advances in Neural Information Processing Systems (NIPS), 23, 856-864. Available: https://proceedings.neurips.cc/paper/2010/file/390236f13b9c28d3c8f616378dd3b07b-Paper.pdf
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877-1901. DOI: 10.5555/3455716.3455749
Mikolov, T., Karafiát, M., Burget, L., Černocký, J., & Khudanpur, S. (2010). Recurrent neural network based language model. INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association. Available: https://www.isca-speech.org/archive/interspeech_2010/i10_1045.html
Graves, A., Mohamed, A.-r., & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). DOI: 10.1109/ICASSP.2013.6638947
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
You are free to:
- Share — copy and redistribute the material in any medium or format
- Adapt — remix, transform, and build upon the material for any purpose, even commercially.
Terms:
- Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.