Performance Analysis of BERT and CLIP Models in Multimodal Sentiment Classification of Short Video Content

Very Setiawan; Endang Anggiratih; Najwa Eka Putriningsih; Jonathan Eldo Kusuma

doi:10.30871/jaic.v10i3.12822

Authors

Very Setiawan, Universitas Pignatelli Triputra
Endang Anggiratih, Universitas Pignatelli Triputra
Najwa Eka Putriningsih, Universitas Pignatelli Triputra
Jonathan Eldo Kusuma, Universitas Pignatelli Triputra

Keywords:

CLIP, IndoBERT, Multimodal, Sentiment Analysis, Short Video, Transformer Model

Abstract

The rapid growth of short video platforms such as YouTube Shorts has increased the need for sentiment analysis methods that can capture public opinion in multimodal content. This study compares unimodal and multimodal approaches to sentiment classification of Indonesian short videos, using IndoBERT for text-based modeling and CLIP for multimodal integration. The main objective is to determine whether adding visual information to textual data improves sentiment classification performance over a text-only approach. The dataset consists of 1,128 Indonesian short videos collected from YouTube Shorts. Audio is transcribed into text using Automatic Speech Recognition (ASR), and visual information is represented by video thumbnails. Sentiment labels in three classes (positive, neutral, and negative) are assigned automatically using a pre-trained IndoBERT model. In the training phase, the unimodal approach relies solely on textual features extracted by IndoBERT, whereas the multimodal approach integrates textual and visual features using CLIP through feature-level fusion. Model performance is evaluated using accuracy, precision, recall, F1-score, and computation time. The experimental results show that the unimodal text-based model outperforms the multimodal model, achieving higher accuracy (86% vs 82%) and better overall evaluation metrics. IndoBERT also converges better than English BERT, with training accuracy increasing from 0.76 to 0.86 and validation accuracy from 0.77 to 0.88, along with lower loss values. By comparison, English BERT's training accuracy rises from 0.72 to 0.79 and its validation accuracy from 0.73 to 0.80. Furthermore, the unimodal approach requires roughly half the computation time (18 minutes versus 35 minutes). These findings indicate that textual information plays a dominant role in sentiment expression in Indonesian short video content, while visual features increase computational cost without a corresponding performance gain.
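The feature-level fusion mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state which fusion operator, embedding sizes, or classifier head were used, so everything below is an assumption made for illustration: fusion is done by concatenating L2-normalized vectors, the embeddings are short placeholder lists rather than real CLIP outputs, and `fuse_features` and `LinearHead` are hypothetical names with an untrained, randomly initialized head.

```python
import math
import random

LABELS = ["negative", "neutral", "positive"]  # the three classes named in the abstract


def l2_normalize(vec):
    """Scale a vector to unit length so neither modality dominates by magnitude."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_features(text_vec, image_vec):
    """Feature-level fusion (assumed here to be concatenation of normalized vectors)."""
    return l2_normalize(text_vec) + l2_normalize(image_vec)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class LinearHead:
    """Hypothetical linear classifier over the fused vector (random, untrained weights)."""

    def __init__(self, in_dim, n_classes, seed=0):
        rng = random.Random(seed)
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(n_classes)
        ]
        self.bias = [0.0] * n_classes

    def predict_proba(self, features):
        logits = [
            sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(self.weights, self.bias)
        ]
        return softmax(logits)


# Placeholder embeddings standing in for a transcript vector and a thumbnail vector.
text_vec = [0.2, -0.5, 0.1, 0.7]
image_vec = [0.9, 0.0, -0.3]

fused = fuse_features(text_vec, image_vec)
head = LinearHead(in_dim=len(fused), n_classes=len(LABELS))
probs = head.predict_proba(fused)
predicted = LABELS[probs.index(max(probs))]
```

Because the head is untrained, `predicted` carries no meaning here; the sketch only shows how a text vector and an image vector become one input to a three-class classifier. In a text-only setup the same head would receive the text vector alone, which is one reason the unimodal pipeline is cheaper to run.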
