Image Caption Detector Using LSTM and CNN
DOI:
https://doi.org/10.52783/jns.v14.3192
Keywords:
LSTM, CNN, VGG16, Flickr8K, BLEU
Abstract
In computer vision, overall quality depends on how well an image can be understood. Image captioning is a task for
which many models have been proposed to improve image understanding. Recently, the technology has been applied in many
fields, such as recommendation systems, news channels, and accident detection. Existing techniques, such as the Bag of
Words model, spatial pyramid matching, and Markov models, are used for automatic caption generation but have difficulty
capturing image content. The proposed methodology therefore combines LSTM and CNN for effective image representation and
end-to-end learning. VGG16 is used for feature extraction because it provides a hierarchical feature representation.
This paper uses the Flickr8K dataset, which contains images with five different captions each. After training, the
model's predictions are evaluated using the BLEU-1 and BLEU-2 metrics, which measure the word overlap between generated
captions and reference captions, and the model achieves better results than existing models.
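To make the evaluation metric concrete, the following is a minimal sketch of sentence-level BLEU-1 and BLEU-2 as they are commonly defined: clipped n-gram precision against multiple references, a brevity penalty, and a uniform geometric mean. The function names, the whitespace tokenization, and the example captions are illustrative assumptions; the abstract does not state which BLEU implementation, tokenizer, or smoothing the authors used.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n):
    """Modified n-gram precision: each candidate n-gram is credited at most
    as many times as it appears in any single reference."""
    cand_counts = ngrams(candidate, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / total


def brevity_penalty(candidate, references):
    """Penalize candidates shorter than the closest reference length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Closest reference length; ties go to the shorter reference.
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(candidate, references, max_n):
    """Unsmoothed BLEU with uniform weights over 1..max_n grams."""
    precisions = [clipped_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)


# Hypothetical captions, not taken from Flickr8K.
candidate = "a dog runs on the grass".split()
references = [
    "a dog runs across the grass".split(),
    "the dog is running on grass".split(),
]

bleu1 = bleu(candidate, references, max_n=1)
bleu2 = bleu(candidate, references, max_n=2)
print(f"BLEU-1 = {bleu1:.4f}, BLEU-2 = {bleu2:.4f}")
```

In this example every candidate word appears in some reference, so BLEU-1 is 1.0, while only three of the five candidate bigrams match, which lowers BLEU-2. Flickr8K supplies five reference captions per image, so `references` would hold five token lists in practice.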
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
You are free to:
- Share — copy and redistribute the material in any medium or format
- Adapt — remix, transform, and build upon the material for any purpose, even commercially.
Terms:
- Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.