Image Caption Detector Using LSTM and CNN

Authors

Atyam Mehermani Meghana
B. Sekharbabu

DOI:

https://doi.org/10.52783/jns.v14.3192

Keywords:

LSTM, CNN, VGG16, Flickr8K, BLEU

Abstract

In computer vision, overall quality depends on how well the content of an image can be understood. Image captioning is a
task for which many models have been proposed to improve that understanding. Recently, this technology has been applied in
many fields, such as recommendation systems, news channels, and accident detection. Existing techniques such as the Bag of
Words model, spatial pyramid matching, and Markov models have been used for automatic caption generation but have
difficulty capturing image content. The proposed methodology therefore combines a CNN and an LSTM for effective image
representation and end-to-end learning. VGG16 is used for feature extraction because it provides a hierarchical feature
representation. This paper uses the Flickr8K dataset, which contains images paired with five different captions each. After
training, the model's predicted captions are evaluated using the BLEU-1 and BLEU-2 metrics, which measure the word overlap
between generated captions and reference captions, and the model achieves better results than existing models.
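
The word-overlap evaluation described above can be illustrated with a minimal sketch of BLEU-1 and BLEU-2. The abstract does not say which BLEU implementation was used, so everything here is an assumption: whitespace tokenization, uniform n-gram weights, no smoothing, and the standard brevity penalty. The function names and example captions are hypothetical and are not taken from the paper or from Flickr8K.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(candidate, references):
    """Penalize candidates shorter than the closest reference length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Closest reference length; ties are broken toward the shorter reference.
    r = min((len(ref) for ref in references),
            key=lambda ref_len: (abs(ref_len - c), ref_len))
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(candidate, references, max_n):
    """BLEU with uniform weights over 1..max_n grams (no smoothing, so any
    zero precision yields a score of 0)."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)


# Hypothetical generated caption and two hypothetical reference captions.
candidate = "a dog runs on the grass".split()
references = [
    "a dog is running on the grass".split(),
    "a brown dog runs across the grass".split(),
]

bleu_1 = bleu(candidate, references, max_n=1)
bleu_2 = bleu(candidate, references, max_n=2)
print(f"BLEU-1: {bleu_1:.4f}  BLEU-2: {bleu_2:.4f}")
```

In this sketch BLEU-1 rewards matching individual words, while BLEU-2 also requires adjacent word pairs to match, so it is more sensitive to word order. A corpus-level evaluation over all test images would aggregate n-gram counts across captions rather than average per-caption scores.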

Published

2025-04-08

How to Cite

Meghana AM, B. Sekharbabu BS. Image Caption Detector Using LSTM and CNN. J Neonatal Surg [Internet]. 2025 Apr 8 [cited 2025 Nov 1];14(13S):102-1. Available from: https://www.jneonatalsurg.com/index.php/jns/article/view/3192

Issue

Vol. 14 No. 13S (2025): Journal of Neonatal Surgery

Section

Original Article

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

You are free to:

Share — copy and redistribute the material in any medium or format
Adapt — remix, transform, and build upon the material for any purpose, even commercially.

Terms:

Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.