An End-to-End NLP Pipeline Combining Web Scraping, CamemBERT Fine-Tuning and Zero-Shot Biomedical Named-Entity Recognition for Early Epidemic Signal Detection from French-Language Online News

Authors

  • Franklin Mwamba, Institut de Recherche en Sciences de la Santé
  • Fiston Oshasha, General Commissariat for Atomic Energy, Regional Center for Nuclear Studies of Kinshasa
  • Saint Jean Djungu, University of Kinshasa
  • John Poma, University of Kinshasa

DOI:

https://doi.org/10.30871/jaic.v10i3.12773

Keywords:

Epidemiological surveillance, Early warning system, Natural Language Processing, Text classification, Named Entity Recognition, Public health, DRC

Abstract

Epidemiological surveillance in the Democratic Republic of the Congo (DRC) suffers from reporting delays and limited digital infrastructure. Online French-language news offers a complementary real-time signal that current systems exploit poorly. We design, deploy, and rigorously evaluate an end-to-end natural-language processing (NLP) pipeline with four stages: targeted web scraping of Congolese online media; sentence-level binary classification of epidemic content with a fine-tuned CamemBERT transformer; zero-shot biomedical named-entity recognition (CamemBERT-bio-GLiNER) restricted to disease, location, and date entities; and an alerting dashboard built on a Django/Celery stack. The classifier was fine-tuned on a hybrid corpus of 11,433 sentences, combining 1,433 manually annotated real news sentences with 10,000 template-generated synthetic sentences. It is benchmarked against two classical baselines (TF-IDF features with Logistic Regression and with a Linear SVM) on an independent, manually annotated test set of 997 sentences (341 epidemic, 656 non-epidemic) built from a second scraping campaign performed three months later. We report precision, recall, F1, PR-AUC, and ROC-AUC with 1,000-iteration bootstrap 95% confidence intervals. CamemBERT reaches F1 = 0.754 [0.717–0.787] and PR-AUC = 0.699 [0.644–0.756] for the epidemic class. The Linear SVM baseline reaches F1 = 0.858 ± 0.037 and PR-AUC = 0.926 ± 0.024 in 5-fold stratified cross-validation and thus outperforms the transformer, a result we attribute to the dominance of synthetic data in the training corpus. A single-batch operational run of the full pipeline on MediaCongo processed 30 articles and 501 sentences in 37.5 s on a single GPU, producing 43 alerts that correctly captured the May 2026 Ebola Bundibugyo outbreak in Ituri. The system, the external benchmark, and all evaluation scripts are released as open source.
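
The bootstrap confidence intervals mentioned in the abstract can be illustrated with a minimal sketch. It assumes a percentile bootstrap over test sentences resampled with replacement; the abstract does not state which bootstrap variant the authors used, and the function names and toy labels below are hypothetical, not taken from the released evaluation scripts.

```python
import random


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive (epidemic) class from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(y_true, y_pred, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for F1 (assumed variant, see lead-in)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_iter):
        # Resample sentence indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx],
                               [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_iter)]
    upper = scores[min(n_iter - 1, int((1 - alpha / 2) * n_iter))]
    return f1_score(y_true, y_pred), lower, upper


# Toy labels, invented for illustration only (not the paper's test set).
y_true = [1] * 40 + [0] * 60
y_pred = [1] * 30 + [0] * 10 + [1] * 5 + [0] * 55

point, lower, upper = bootstrap_f1_ci(y_true, y_pred)
print(f"F1 = {point:.3f} [{lower:.3f}-{upper:.3f}]")
```

Resampling whole sentences keeps each prediction paired with its gold label, so the interval reflects sampling variability in the test set rather than in the model.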

Published

2026-06-12

How to Cite

[1] F. Mwamba, F. Oshasha, S. J. Djungu, and J. Poma, “An End-to-End NLP Pipeline Combining Web Scraping, CamemBERT Fine-Tuning and Zero-Shot Biomedical Named-Entity Recognition for Early Epidemic Signal Detection from French-Language Online News”, JAIC, vol. 10, no. 3, pp. 2546–2555, Jun. 2026.
