Implementation of SSL-Vision Transformer (ViT) for Multi-Lung Disease Classification on X-Ray Images

Authors

  • Rafi Haqul Baasith, Universitas Amikom Yogyakarta
  • Theopilus Bayu, Universitas Amikom Yogyakarta
  • Arifiyanto Hadinegoro, Universitas Amikom Yogyakarta
  • Uyock Saputro, Universitas Amikom Yogyakarta

DOI:

https://doi.org/10.30871/jaic.v10i1.11844

Keywords:

Self-Supervised Learning, Vision Transformer, Multi-label Classification, CheXpert, X-ray

Abstract

Chest X-ray imaging is one of the most widely used modalities for lung disease screening; however, manual interpretation remains challenging due to overlapping pathological patterns and the frequent presence of multiple coexisting abnormalities. In recent years, Vision Transformer (ViT) models have demonstrated strong potential for medical image analysis by capturing global contextual relationships. Nevertheless, their performance is highly dependent on large-scale labeled datasets, which are costly and difficult to obtain in clinical settings. To address this limitation, this study proposes a Self-Supervised Learning Vision Transformer (SSL-ViT) framework for multi-label lung disease classification using the CheXpert-v1.0-small dataset. The proposed approach leverages self-supervised pretraining to learn robust and transferable visual representations from unlabeled chest X-ray images prior to supervised fine-tuning. A total of twelve clinically relevant thoracic disease labels are retained, while non-disease labels are excluded to enhance interpretability and reduce confounding effects. Experimental results demonstrate that SSL-ViT achieves a high recall of 0.73 and a peak AUC of 0.75 on the test set, indicating strong sensitivity in detecting pathological cases. Compared to the baseline ViT model, SSL-ViT exhibits a recall-oriented performance profile that is particularly suitable for screening applications, where minimizing false negatives is critical. Furthermore, Grad-CAM visualizations confirm that the model focuses on anatomically meaningful lung regions, supporting its clinical relevance. These findings suggest that SSL-enhanced Vision Transformers provide a robust and effective solution for multi-label chest X-ray screening tasks.
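
The abstract describes a two-stage pipeline: self-supervised pretraining on unlabeled chest X-rays, followed by supervised multi-label fine-tuning on twelve thoracic disease labels. The sketch below shows one plausible instantiation in PyTorch. The SimCLR-style NT-Xent objective, the torchvision ViT-B/16 backbone, the augmentation parameters, and helper names such as nt_xent_loss and finetune_step are illustrative assumptions, not the authors' confirmed configuration.

```python
# Minimal sketch of the two-stage SSL-ViT pipeline outlined in the abstract.
# Assumptions (not confirmed by the paper): SimCLR-style contrastive
# pretraining, a torchvision ViT-B/16 backbone, and these augmentations.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models, transforms

NUM_LABELS = 12  # twelve retained CheXpert thoracic disease labels

# Stage 1: self-supervised pretraining on unlabeled X-rays. Two augmented
# "views" of the same image are pulled together in embedding space while
# views of different images are pushed apart (NT-Xent loss).
ssl_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.6, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4),
    transforms.ToTensor(),  # X-rays assumed loaded as 3-channel PIL images
])

def nt_xent_loss(z1, z2, temperature=0.5):
    """Normalized temperature-scaled cross-entropy over a batch of view pairs."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D)
    sim = (z @ z.t()) / temperature                      # scaled cosine similarity
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))           # exclude self-pairs
    # The positive for view i is the other augmented view of the same image.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)

backbone = models.vit_b_16(weights=None)  # ViT-B/16 trained from scratch
backbone.heads = nn.Identity()            # expose the 768-d [CLS] embedding
projector = nn.Sequential(                # SimCLR-style projection head
    nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 128),
)

def pretrain_step(view1, view2):
    return nt_xent_loss(projector(backbone(view1)), projector(backbone(view2)))

# Stage 2: supervised multi-label fine-tuning on the pretrained backbone.
classifier = nn.Linear(768, NUM_LABELS)
criterion = nn.BCEWithLogitsLoss()

def finetune_step(images, labels):  # labels: (N, 12) float tensor in {0, 1}
    return criterion(classifier(backbone(images)), labels)
```

Note the design choice in stage 2: one sigmoid output per label with binary cross-entropy, rather than a softmax over classes, lets the model flag several coexisting abnormalities in a single radiograph, which matches the multi-label framing of the task.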




Published

2026-02-04

How to Cite

[1] R. H. Baasith, T. Bayu, A. Hadinegoro, and U. Saputro, “Implementation of SSL-Vision Transformer (ViT) for Multi-Lung Disease Classification on X-Ray Images”, JAIC, vol. 10, no. 1, pp. 298–308, Feb. 2026.
