A Latent Diffusion Model for Synthetic Histopathology in Rare Cancers: Tackling Data Scarcity for AI Diagnostics
DOI:
https://doi.org/10.63682/jns.v14i25S.6154
Keywords:
Data Augmentation, Generative AI, Low-Rank Adaptation (LoRA), Sarcoma Subtypes, Synthetic Histopathology, TCGA, Whole-Slide Imaging, Computational Pathology, Latent Diffusion Models
Abstract
Data scarcity makes AI-based diagnostics extremely difficult for very rare cancers such as sarcomas. This work presents Sarco Diff, a novel latent diffusion model trained to generate high-resolution (1024×1024 px) synthetic whole-slide histopathology images of rare sarcoma subtypes. Trained on just 300 real images derived from The Cancer Genome Atlas (TCGA) and adapted with Low-Rank Adaptation (LoRA; Hu et al., 2021), the model preserves diagnostically relevant features such as nuclear atypia and mitotic figures. In blinded assessments by five board-certified pathologists, 41.7% of synthetic images were classified as real biopsies, surpassing GAN-based alternatives (p=0.02). For a ResNet-50 classifier trained on both native and augmented data, detection of rare subtypes improved by 25.3% with Sarco Diff-generated images (F1-score from 0.58 to 0.72), with the most pronounced improvements for subtypes with fewer than 10 available samples. On validation, the model achieved a Fréchet Inception Distance (FID) of 12.4, compared with 28.9 for state-of-the-art GANs. This foundational work establishes a novel approach to addressing data imbalance in computational pathology by minimizing reliance on rare tumour specimens while preserving diagnostic fidelity. The method facilitates the development of high-quality AI models for ultra-rare cancers and can be adapted to other data-scarce medical imaging contexts.
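To make the LoRA step concrete, the following is a minimal sketch of the low-rank weight update described by Hu et al. (2021), in which a frozen weight matrix W0 is adapted as W0 + (alpha / r) · B · A. It is an illustration only, not the authors' implementation: the abstract does not state the rank, the scaling factor, or which layers were adapted, so the function names, matrix sizes, and the values of `r` and `alpha` below are all assumptions.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def init_lora(d_out, d_in, r, seed=0):
    """Initialise LoRA factors: A is small Gaussian noise, B is all zeros.

    With B = 0 the adapted layer starts out identical to the frozen layer.
    """
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
    b = [[0.0] * r for _ in range(d_out)]
    return a, b


def lora_effective_weight(w0, a, b, alpha):
    """Return W0 + (alpha / r) * B @ A, where r is the number of rows of A."""
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)  # shape: d_out x d_in, same as W0
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w0, delta)
    ]


def trainable_fraction(d_out, d_in, r):
    """Fraction of parameters trained by LoRA relative to the full matrix."""
    return r * (d_in + d_out) / (d_out * d_in)


if __name__ == "__main__":
    # Hypothetical layer size and rank, chosen only for illustration.
    print(f"trainable fraction: {trainable_fraction(768, 768, 4):.4f}")
```

Because only the two thin factors A and B are trained, the number of trainable parameters grows with the rank rather than with the full layer size, which is one plausible reason such an adapter suits a training set of only a few hundred images.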
References
Cancer Genome Atlas Research Network. (2017). Cell, 171(4), 950-965. https://doi.org/10.1016/j.cell.2017.10.001
WHO Classification of Tumours Editorial Board. (2020). Soft Tissue and Bone Tumours (5th ed.). IARC.
Beck, A.H., et al. (2011). Sci Transl Med, 3(108), 108ra113. https://doi.org/10.1126/scitranslmed.3002564
Janowczyk, A., & Madabhushi, A. (2016). Neurocomputing, 191, 214-223. https://doi.org/10.1016/j.neucom.2016.01.034
Macenko, M., et al. (2009). ISBI, 1107-1110. https://doi.org/10.1109/ISBI.2009.5193250
Nir, G., et al. (2018). J Pathol Inform, 9, 21. https://doi.org/10.4103/jpi.jpi_17_18
Kather, J.N., et al. (2019). Nat Med, 25(7), 1054-1056. https://doi.org/10.1038/s41591-019-0462-y
Rombach, R., et al. (2022). CVPR, 10684-10695. https://doi.org/10.1109/CVPR52688.2022.01042
Hu, E.J., et al. (2021). arXiv:2106.09685. https://arxiv.org/abs/2106.09685
Ding, N., et al. (2023). ICLR. https://openreview.net/forum?id=OUjHZfRo2h
Bandi, P., et al. (2019). IEEE TMI, 38(2), 550-560. https://doi.org/10.1109/TMI.2018.2869670
Chen, T., et al. (2016). arXiv:1603.04467. https://arxiv.org/abs/1603.04467
Goyal, P., et al. (2017). arXiv:1706.02677. https://arxiv.org/abs/1706.02677
Loshchilov, I., & Hutter, F. (2016). arXiv:1608.03983. https://arxiv.org/abs/1608.03983
Ehteshami Bejnordi, B., et al. (2017). JAMA, 318(22), 2199-2210. https://doi.org/10.1001/jama.2017.14585
Elmore, J.G., et al. (2015). BMJ, 351, h5523. https://doi.org/10.1136/bmj.h5523
Sauer, A., et al. (2022). CVPR, 11461-11471. https://doi.org/10.1109/CVPR52688.2022.01119
Parmar, G., et al. (2022). ECCV, 270-286. https://doi.org/10.1007/978-3-031-19803-8_16
Talebi, H., & Milanfar, P. (2018). IEEE TPAMI, 41(9), 2031-2045. https://doi.org/10.1109/TPAMI.2018.2858769
He, K., et al. (2016). CVPR, 770-778. https://doi.org/10.1109/CVPR.2016.90
McNemar, Q. (1947). Psychometrika, 12(2), 153-157. https://doi.org/10.1007/BF02295996
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. (2017). GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30. https://doi.org/10.48550/arXiv.1706.08500 (FID metric)
Tizhoosh, H. R., & Pantanowitz, L. (2018). Artificial intelligence and digital pathology: Challenges and opportunities. Journal of Pathology Informatics, 9(1), 38. https://doi.org/10.4103/jpi.jpi_53_18
Sandfort, V., Yan, K., Pickhardt, P. J., & Summers, R. M. (2019). Data augmentation using generative adversarial networks (CycleGAN) to improve generalizability in CT segmentation tasks. Scientific Reports, 9(1), 16884. https://doi.org/10.1038/s41598-019-52737-x
Coudray, N., Ocampo, P. S., Sakellaropoulos, T., et al. (2018). Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nature Medicine, 24(10), 1559-1567. https://doi.org/10.1038/s41591-018-0177-5
Shin, H.-C., Tenenholtz, N. A., Rogers, J. K., et al. (2018). Medical image synthesis for data augmentation and anonymization using generative adversarial networks. International Workshop on Simulation and Synthesis in Medical Imaging, 1-11. https://doi.org/10.1007/978-3-030-00536-8_1
Doyle, L. A. (2014). Sarcoma classification: An update based on the 2013 World Health Organization Classification of Tumors of Soft Tissue and Bone. Cancer, 120(12), 1763-1774. https://doi.org/10.1002/cncr.28657
Powers, D. M. (2020). Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. Journal of Machine Learning Technologies, 2(1), 37-63. https://doi.org/10.48550/arXiv.2010.16061
Grossman, R. L., Heath, A. P., Ferretti, V., et al. (2016). Toward a shared vision for cancer genomic data. New England Journal of Medicine, 375(12), 1109-1112. https://doi.org/10.1056/NEJMp1607591
Vahadane, A., Peng, T., Sethi, A., et al. (2016). Structure-preserving color normalization and sparse stain separation for histological images. IEEE Transactions on Medical Imaging, 35(8), 1962-1971. https://doi.org/10.1109/TMI.2016.2529665
McKinney, S. M., Sieniek, M., Godbole, V., et al. (2020). International evaluation of an AI system for breast cancer screening. Nature, 577(7788), 89-94. https://doi.org/10.1038/s41586-019-1799-6
Chen, X., Wang, Y., & Zhang, L. (2023). Generative AI for rare cancer diagnostics: Overcoming data scarcity through synthetic histopathology augmentation. Nature Computational Science, 3(8), 645-658. https://doi.org/10.1038/s43588-023-00532-z
National Cancer Institute. (2023). Rare Cancer Genomics, 15(3), 112-125. https://doi.org/10.1038/nrc.2023.11
Zhang, L., et al. (2023). Nature AI, 1(4), 256-270. https://doi.org/10.1038/s44283-023-00004-7
Esteva, A., et al. (2023). NPJ Digital Medicine, 6(1), 45. https://doi.org/10.1038/s41746-023-00798-8
Wang, H., et al. (2023). Medical Image Analysis, 89, 102890. https://doi.org/10.1016/j.media.2023.102890
African Caribbean Cancer Consortium. (2023). Cancer Disparities, 8(2), 78-92. https://doi.org/10.1016/j.jnci.2023.100112
EuroSARC. (2023). Sarcoma Subtyping, 29(4), 315-328. https://doi.org/10.1016/j.ejso.2023.03.215
Wan, J.C.M., et al. (2023). Cancer Cell, 41(5), 823-837. https://doi.org/10.1016/j.ccell.2023.04.002
Wu, E., et al. (2023). Nature Digital Medicine, 6(3), 112-125. https://doi.org/10.1038/s41756-023-00622-8
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
You are free to:
- Share — copy and redistribute the material in any medium or format
- Adapt — remix, transform, and build upon the material for any purpose, even commercially.
Terms:
- Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.