Generating Textual Summary from Videos Using AI (NLP)

Authors

  • Syed Muhammad Hassan, Sindh Madressatul Islam University, Karachi, Pakistan
  • Usman Khan, Karachi Institute of Economics and Technology
  • Adnan Ansari, Karachi Institute of Economics and Technology
  • Imtiaz Hussain, Karachi Institute of Economics and Technology

DOI:

https://doi.org/10.51153/kjcis.v8i1.242

Keywords:

Video, Audio, Summarization, Python, Natural Language Processing (NLP), Video Summary

Abstract

Text summarization is the process of deriving a summary from a given sequence of sentences. Summaries are of two kinds: extractive and abstractive. An extractive summary is assembled from words taken directly from the original text. An abstractive summary, in addition to reproducing words from the input, generates new wording based on its comprehension of the text. This report explores the development and implementation of a system that generates textual summaries of videos from their audio content alone. The system draws on current approaches in Natural Language Processing (NLP), Machine Learning (ML), and Language Models (LM). It uses the Whisper model to transcribe spoken audio and the BART model to extract meaningful information and summarize the content into a concise video summary. By combining these models and techniques, the system can handle English, Hindi/Urdu, and bilingual conversational videos, and it produces correct results with an average accuracy of 70% (ROUGE score) and an F1-score of 100% (ROUGE).
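
To make the reported metric concrete, the following is a minimal sketch of how a ROUGE-style score can be computed between a reference summary and a generated one. The abstract does not state which ROUGE variant or tokenization was used, so this example assumes ROUGE-1 (unigram overlap) with lowercased whitespace tokens; the function name `rouge_1` and the two example sentences are hypothetical and are not taken from the paper.

```python
from collections import Counter


def rouge_1(reference: str, candidate: str) -> dict:
    """Return ROUGE-1 precision, recall, and F1 for two texts.

    Assumption: tokens are lowercased and split on whitespace; the paper
    does not specify its tokenization or ROUGE variant.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Clipped overlap: a token counts only as often as it occurs in both texts.
    overlap = sum((ref_counts & cand_counts).values())

    # max(..., 1) avoids division by zero when either text is empty.
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical reference and system summaries, for illustration only.
reference = "the model transcribes the audio and summarizes the video"
candidate = "the model summarizes the video"
scores = rouge_1(reference, candidate)
print(scores)
```

Here every candidate token also appears in the reference, so precision is 1.0, while recall is 5/9 because the candidate covers only five of the nine reference tokens. The F1 value is the harmonic mean of the two.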

Published

2025-07-01

How to Cite

Generating Textual Summary from Videos Using AI (NLP). (2025). KIET Journal of Computing and Information Sciences, 8(1). https://doi.org/10.51153/kjcis.v8i1.242
