Design of an Integrated Method Using Transformer-Based Sequence Models and RankNet for Video Transcript Processing

Authors

  • Minu Choudhary
  • Sourabh Rungta
  • Shikha Pandey
  • Vikas Pandey

Keywords

Semantic Similarity, Multimodal Fusion, Transformer Models, Video Classification, RankNet

Abstract

The increasing consumption of educational videos creates a demand for fast and accurate multimedia processing of video transcripts, especially in educational domains. Traditional methods usually fail to keep up with the volume of data produced by these sources, which degrades transcription accuracy, semantic understanding, content classification, and relevance ranking. In particular, such methods rely on models in isolation, each of which captures only a subset of the complicated relationships between textual and visual data, often leading to suboptimal performance across these tasks. This work addresses those limitations with an integrated, holistic framework that incorporates multiple state-of-the-art methodologies. The proposed work leverages a T5 (Text-to-Text Transfer Transformer)/BART (Bidirectional and Auto-Regressive Transformers) Transformer sequence-to-sequence model for transcript pre-processing and segmentation, reducing word error rate (WER) by 15-20% and improving context segmentation accuracy by about 25% when applied to Massive Open Online Courses (MOOCs) dataset samples. It then employs Sentence-BERT (SBERT) for enhanced semantic understanding, producing semantically meaningful sentence embeddings that improve the average cosine similarity score by about 20% over baseline models. Next, a multimodal fusion model concatenates video features from a pre-trained Convolutional Neural Network (CNN) with text features from SBERT, increasing classification accuracy by around 10-15%. Finally, a pairwise ranking algorithm, RankNet, integrates the feature improvements from the previous modules to produce an accurate ranking of the top-10 most relevant videos, achieving an 18% improvement in Mean Reciprocal Rank (MRR).
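The fusion step described above — concatenating video features from a pre-trained CNN with SBERT text features, and scoring semantic similarity by cosine distance — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature dimensions (512-d visual, 384-d text), the L2 normalization, and the helper names are assumptions made for the example.

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length so each modality contributes comparably."""
    return v / np.linalg.norm(v)

def fuse_features(video_feat: np.ndarray, text_feat: np.ndarray) -> np.ndarray:
    """Concatenate normalized visual and textual features into one vector."""
    return np.concatenate([l2_normalize(video_feat), l2_normalize(text_feat)])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, the score SBERT embeddings are compared with."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative dimensions: 512-d CNN features, 384-d SBERT embeddings.
rng = np.random.default_rng(0)
video_feat = rng.standard_normal(512)
text_feat = rng.standard_normal(384)

fused = fuse_features(video_feat, text_feat)
assert fused.shape == (896,)

# Cosine similarity of a vector with itself is 1.0 (sanity check).
assert abs(cosine_similarity(fused, fused) - 1.0) < 1e-9
```

In practice the fused vector would feed a classification head; normalizing each modality before concatenation is one common way to keep either feature space from dominating the other.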
The main novelty of this research is the Unified Transformer-Based Multi-Task Learning Framework. In a single pass, it performs transcription, semantic similarity, classification, and ranking, reducing computational costs by 25%, improving overall accuracy by 15%, and decreasing inference time by 20%. Our model sets a new standard for efficient, accurate processing of video transcripts, with broad applications across a wide array of fields.
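The ranking stage rests on RankNet's pairwise formulation: for two candidate videos with scores s_i and s_j, the model predicts the probability P_ij = sigmoid(s_i - s_j) that video i should rank above video j, and is trained with a cross-entropy loss against the known preference. A minimal NumPy sketch of that loss, with made-up scores (the concrete values are illustrative, not results from the paper):

```python
import numpy as np

def ranknet_loss(s_i: float, s_j: float, p_target: float) -> float:
    """Cross-entropy between the target pair preference p_target and the
    predicted probability P_ij = sigmoid(s_i - s_j) that item i should
    rank above item j — the core RankNet objective."""
    p_ij = 1.0 / (1.0 + np.exp(-(s_i - s_j)))
    return float(-p_target * np.log(p_ij) - (1.0 - p_target) * np.log(1.0 - p_ij))

# If the model already scores the preferred item higher, the loss is small...
low = ranknet_loss(s_i=2.0, s_j=0.0, p_target=1.0)
# ...and if the order is inverted, the loss is large.
high = ranknet_loss(s_i=0.0, s_j=2.0, p_target=1.0)
assert low < high
```

Summing this loss over sampled pairs and backpropagating through the scoring network yields a ranking over candidates, from which the top-10 list and MRR are computed.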


References

W. Jo et al., "Simultaneous Video Retrieval and Alignment", in IEEE Access, vol. 11, pp. 28466-28478, 2023, doi: 10.1109/ACCESS.2023.3259733.

L. Vadicamo et al., "Evaluating Performance and Trends in Interactive Video Retrieval: Insights From the 12th VBS Competition", in IEEE Access, vol. 12, pp. 79342-79366, 2024, doi: 10.1109/ACCESS.2024.3405638.

P. Xu et al., "Fine-Grained Instance-Level Sketch-Based Video Retrieval", in IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 5, pp. 1995-2007, May 2021, doi: 10.1109/TCSVT.2020.3014491.

H. Yoon and J.-H. Han, "Content-Based Video Retrieval With Prototypes of Deep Features", in IEEE Access, vol. 10, pp. 30730-30742, 2022, doi: 10.1109/ACCESS.2022.3160214.

G. Ren, X. Lu and Y. Li, "Joint Face Retrieval System Based On a New Quadruplet Network in Videos of Multiple Camera", in IEEE Access, vol. 9, pp. 56709-56725, 2021, doi: 10.1109/ACCESS.2021.3072055.

S. R. Dubey, "A Decade Survey of Content Based Image Retrieval Using Deep Learning", in IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 5, pp. 2687-2704, May 2022, doi: 10.1109/TCSVT.2021.3080920.

H. Kou, Y. Yang and Y. Hua, "KnowER: Knowledge enhancement for efficient text-video retrieval", in Intelligent and Converged Networks, vol. 4, no. 2, pp. 93-105, June 2023, doi: 10.23919/ICN.2023.0009.

L. Rossetto et al., "Interactive Video Retrieval in the Age of Deep Learning – Detailed Evaluation of VBS 2019", in IEEE Transactions on Multimedia, vol. 23, pp. 243-256, 2021, doi: 10.1109/TMM.2020.2980944.

R. Zuo et al., "Fine-Grained Video Retrieval With Scene Sketches", in IEEE Transactions on Image Processing, vol. 32, pp. 3136-3149, 2023, doi: 10.1109/TIP.2023.3278474.

P. Maniotis and N. Thomos, "Tile-Based Edge Caching for 360° Live Video Streaming", in IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 12, pp. 4938-4950, Dec. 2021, doi: 10.1109/TCSVT.2021.3055985.

W. Jo, G. Lim, J. Kim, J. Yun and Y. Choi, "Exploring the Temporal Cues to Enhance Video Retrieval on Standardized CDVA", in IEEE Access, vol. 10, pp. 38973-38981, 2022, doi: 10.1109/ACCESS.2022.3165177.

D. Han, X. Cheng, N. Guo, X. Ye, B. Rainer and P. Priller, "Momentum Cross-Modal Contrastive Learning for Video Moment Retrieval", in IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 7, pp. 5977-5994, July 2024, doi: 10.1109/TCSVT.2023.3344097.

H. Tang, J. Zhu, M. Liu, Z. Gao and Z. Cheng, "Frame-Wise Cross-Modal Matching for Video Moment Retrieval", in IEEE Transactions on Multimedia, vol. 24, pp. 1338-1349, 2022, doi: 10.1109/TMM.2021.3063631.

H. Sun, J. Xu, J. Wang, Q. Qi, C. Ge and J. Liao, "DLI-Net: Dual Local Interaction Network for Fine-Grained Sketch-Based Image Retrieval", in IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 10, pp. 7177-7189, Oct. 2022, doi: 10.1109/TCSVT.2022.3171972.

J. P. Ebenezer, Z. Shang, Y. Wu, H. Wei, S. Sethuraman and A. C. Bovik, "ChipQA: No-Reference Video Quality Prediction via Space-Time Chips", in IEEE Transactions on Image Processing, vol. 30, pp. 8059-8074, 2021, doi: 10.1109/TIP.2021.3112055.

F. Liu et al., "Infrared and Visible Cross-Modal Image Retrieval Through Shared Features", in IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 11, pp. 4485-4496, Nov. 2021, doi: 10.1109/TCSVT.2020.3048945.

H. Fang, P. Xiong, L. Xu and W. Luo, "Transferring Image-CLIP to Video-Text Retrieval via Temporal Relations", in IEEE Transactions on Multimedia, vol. 25, pp. 7772-7785, 2023, doi: 10.1109/TMM.2022.3227416.

J. Dong, X. Wang, L. Zhang, C. Xu, G. Yang and X. Li, "Feature Re-Learning with Data Augmentation for Video Relevance Prediction", in IEEE Transactions on Knowledge and Data Engineering, vol. 33, no. 5, pp. 1946-1959, 1 May 2021, doi: 10.1109/TKDE.2019.2947442.

Z. Zhang et al., "Chinese Title Generation for Short Videos: Dataset, Metric and Algorithm", in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 7, pp. 5192-5208, July 2024, doi: 10.1109/TPAMI.2024.3365739.

N. A. Nasir and S.-H. Jeong, "Fast Content Delivery Using a Testbed-Based Information-Centric Network", in IEEE Access, vol. 9, pp. 101600-101613, 2021, doi: 10.1109/ACCESS.2021.3096042.

F. Zhang, M. Xu and C. Xu, "Geometry Sensitive Cross-Modal Reasoning for Composed Query Based Image Retrieval", in IEEE Transactions on Image Processing, vol. 31, pp. 1000-1011, 2022, doi: 10.1109/TIP.2021.3138302.

Y. Zhang, Q. Qian, H. Wang, C. Liu, W. Chen and F. Wang, "Graph Convolution Based Efficient Re-Ranking for Visual Retrieval", in IEEE Transactions on Multimedia, vol. 26, pp. 1089-1101, 2024, doi: 10.1109/TMM.2023.3276167.

B. Yang, M. Cao and Y. Zou, "Concept-Aware Video Captioning: Describing Videos With Effective Prior Information", in IEEE Transactions on Image Processing, vol. 32, pp. 5366-5378, 2023, doi: 10.1109/TIP.2023.3307969.

O. Tursun, S. Denman, S. Sivapalan, S. Sridharan, C. Fookes and S. Mau, "Component-Based Attention for Large-Scale Trademark Retrieval", in IEEE Transactions on Information Forensics and Security, vol. 17, pp. 2350-2363, 2022, doi: 10.1109/TIFS.2019.2959921.

H. Zhang, A. Sun, W. Jing, L. Zhen, J. T. Zhou and R. S. M. Goh, "Natural Language Video Localization: A Revisit in Span-Based Question Answering Framework", in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 8, pp. 4252-4266, 1 Aug. 2022, doi: 10.1109/TPAMI.2021.3060449.


Published

2025-07-18

How to Cite

1.
Choudhary M, Rungta S, Pandey S, Pandey V. Design of an Integrated Method Using Transformer-Based Sequence Models and RankNet for Video Transcript Processing. J Neonatal Surg [Internet]. 2025 Jul. 18 [cited 2025 Sep. 25];14(4S):1430-9. Available from: https://www.jneonatalsurg.com/index.php/jns/article/view/8381