Implementation of SSL-Vision Transformer (ViT) for Multi-Lung Disease Classification on X-Ray Images
DOI:
https://doi.org/10.30871/jaic.v10i1.11844
Keywords:
Self-Supervised Learning, Vision Transformer, Multi-label Classification, CheXpert, X-ray
Abstract
Chest X-ray imaging is one of the most widely used modalities for lung disease screening; however, manual interpretation remains challenging due to overlapping pathological patterns and the frequent presence of multiple coexisting abnormalities. In recent years, Vision Transformer (ViT) models have demonstrated strong potential for medical image analysis by capturing global contextual relationships. Nevertheless, their performance is highly dependent on large-scale labeled datasets, which are costly and difficult to obtain in clinical settings. To address this limitation, this study proposes a Self-Supervised Learning Vision Transformer (SSL-ViT) framework for multi-label lung disease classification using the CheXpert-v1.0-small dataset. The proposed approach leverages self-supervised pretraining to learn robust and transferable visual representations from unlabeled chest X-ray images prior to supervised fine-tuning. A total of twelve clinically relevant thoracic disease labels are retained, while non-disease labels are excluded to enhance interpretability and reduce confounding effects. Experimental results demonstrate that SSL-ViT achieves a high recall of 0.73 and a peak AUC of 0.75 on the test set, indicating strong sensitivity in detecting pathological cases. Compared to the baseline ViT model, SSL-ViT exhibits a recall-oriented performance profile that is particularly suitable for screening applications, where minimizing false negatives is critical. Furthermore, Grad-CAM visualizations confirm that the model focuses on anatomically meaningful lung regions, supporting its clinical relevance. These findings suggest that SSL-enhanced Vision Transformers provide a robust and effective solution for multi-label chest X-ray screening tasks.
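As a rough illustration of the two-stage pipeline the abstract describes, the sketch below pretrains a ViT encoder with a contrastive (SimCLR-style) objective on unlabeled images and then fine-tunes it for multi-label classification over twelve disease labels with binary cross-entropy. The timm backbone name, the NT-Xent objective, the projection head, and all hyperparameters are illustrative assumptions rather than the authors' configuration; only the self-supervised-pretraining-then-fine-tuning structure and the twelve-label sigmoid output follow the abstract.

# Minimal sketch of an SSL-ViT pipeline: (1) contrastive self-supervised
# pretraining of a ViT encoder on unlabeled chest X-rays, (2) supervised
# multi-label fine-tuning on 12 disease labels. Objective, backbone, and
# hyperparameters are illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F
import timm  # assumed dependency providing the ViT backbone

NUM_LABELS = 12  # twelve thoracic disease labels retained from CheXpert


class SSLViT(nn.Module):
    def __init__(self, backbone: str = "vit_base_patch16_224", proj_dim: int = 128):
        super().__init__()
        # num_classes=0 makes timm return pooled features instead of logits
        self.encoder = timm.create_model(backbone, pretrained=False, num_classes=0)
        feat_dim = self.encoder.num_features
        # projection head used only during self-supervised pretraining
        self.projector = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(inplace=True),
            nn.Linear(feat_dim, proj_dim),
        )
        # classification head used during supervised fine-tuning
        self.classifier = nn.Linear(feat_dim, NUM_LABELS)

    def forward_ssl(self, x):
        return F.normalize(self.projector(self.encoder(x)), dim=-1)

    def forward(self, x):
        return self.classifier(self.encoder(x))  # raw logits, one per label


def nt_xent(z1, z2, temperature: float = 0.5):
    """SimCLR-style contrastive loss over two augmented views of one batch."""
    z = torch.cat([z1, z2], dim=0)                       # (2B, D), L2-normalized
    sim = (z @ z.t()) / temperature                      # pairwise cosine similarities
    mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))           # exclude self-similarity
    b = z1.size(0)
    # positive pair of sample i is its other augmented view (i + B, or i - B)
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)]).to(z.device)
    return F.cross_entropy(sim, targets)


model = SSLViT()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Stage 1: self-supervised pretraining on unlabeled chest X-rays.
# view1/view2 stand in for two random augmentations of the same images.
view1 = torch.randn(8, 3, 224, 224)   # placeholder batch
view2 = torch.randn(8, 3, 224, 224)
loss = nt_xent(model.forward_ssl(view1), model.forward_ssl(view2))
loss.backward()
opt.step()
opt.zero_grad()

# Stage 2: supervised multi-label fine-tuning with binary cross-entropy;
# each label gets an independent sigmoid because findings can co-occur.
criterion = nn.BCEWithLogitsLoss()
images = torch.randn(8, 3, 224, 224)                     # placeholder CXR batch
labels = torch.randint(0, 2, (8, NUM_LABELS)).float()    # placeholder multi-hot targets
loss = criterion(model(images), labels)
loss.backward()
opt.step()
opt.zero_grad()

Because every label has an independent sigmoid output, the per-label decision threshold can be lowered to trade precision for recall, which matches the screening-oriented, false-negative-averse profile the abstract emphasizes; Grad-CAM can then be applied to the fine-tuned encoder to produce the region-level visualizations the abstract reports.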
License
Copyright (c) 2026 Rafi Haqul Baasith, Theopilus Bayu, Arifiyanto Hadinegoro, Uyock Saputro

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.