Performance Analysis of BERT and CLIP Models in Multimodal Sentiment Classification of Short Video Content
DOI:
https://doi.org/10.30871/jaic.v10i3.12822
Keywords:
CLIP, IndoBERT, Multimodal, Sentiment Analysis, Short Video, Transformer Model
Abstract
The rapid growth of short video platforms such as YouTube Shorts has increased the need for sentiment analysis methods that can capture public opinion in multimodal content. This study compares unimodal and multimodal approaches to sentiment classification of Indonesian short videos, using IndoBERT for text-based modeling and CLIP for multimodal integration. The main objective is to determine whether adding visual information to textual data improves sentiment classification performance over a text-only approach. The dataset consists of 1,128 Indonesian short videos collected from YouTube Shorts. Audio is transcribed into text using Automatic Speech Recognition (ASR), and visual information is represented by video thumbnails. Sentiment labels are assigned automatically in three classes (positive, neutral, and negative) using a pre-trained IndoBERT model. In the training phase, the unimodal approach relies solely on textual features extracted by IndoBERT, whereas the multimodal approach combines textual and visual features using CLIP through feature-level fusion. Model performance is evaluated using accuracy, precision, recall, F1-score, and computation time. The experimental results show that the unimodal text-based model outperforms the multimodal model, achieving higher accuracy (86% vs. 82%) and better overall evaluation metrics. IndoBERT also converges better than English BERT: its training accuracy increases from 0.76 to 0.86 and its validation accuracy from 0.77 to 0.88, with lower loss values, whereas English BERT's training accuracy rises from 0.72 to 0.79 and its validation accuracy from 0.73 to 0.80. Furthermore, the unimodal approach requires substantially less computation time (18 minutes compared to 35 minutes). These findings indicate that textual information plays a dominant role in sentiment expression in Indonesian short video content, while visual features increase computational cost without significant performance gains.
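The abstract does not say which operator is used for feature-level fusion. As an illustration only, the sketch below assumes the simplest common choice: each modality's embedding is L2-normalized and the two vectors are concatenated before being passed to a classifier. The function names and the toy vectors are hypothetical and are not taken from the study.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return [float(x) for x in vec]
    return [x / norm for x in vec]


def fuse_features(text_vec: Sequence[float], image_vec: Sequence[float]) -> List[float]:
    """Feature-level fusion (assumed here to be concatenation).

    Each modality is normalized first so that neither one dominates the
    fused vector purely because of its scale.
    """
    return l2_normalize(text_vec) + l2_normalize(image_vec)


# Toy stand-ins for encoder outputs; real embeddings have hundreds of dimensions.
text_embedding = [3.0, 4.0]          # hypothetical transcript embedding
image_embedding = [0.0, 5.0, 12.0]   # hypothetical thumbnail embedding

fused = fuse_features(text_embedding, image_embedding)
print(len(fused))  # fused dimensionality = text dims + image dims
```

Under this assumption the classifier input grows with every added modality, which is one plausible contributor to the longer training time reported for the multimodal model.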
License
Copyright (c) 2026 Very Setiawan, Endang Anggiratih, Najwa Eka Putriningsih, Jonathan Eldo Kusuma

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License (Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)) that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).








