An End-to-End NLP Pipeline Combining Web Scraping, CamemBERT Fine-Tuning and Zero-Shot Biomedical Named-Entity Recognition for Early Epidemic Signal Detection from French-Language Online News
DOI:
https://doi.org/10.30871/jaic.v10i3.12773
Keywords:
Epidemiological surveillance, Early warning system, Natural Language Processing, Text classification, Named Entity Recognition, Public health, DRC
Abstract
Epidemiological surveillance in the Democratic Republic of the Congo (DRC) suffers from reporting delays and limited digital infrastructure, while online French-language news provides a complementary real-time signal that current systems exploit poorly. We design, deploy, and rigorously evaluate an end-to-end natural-language processing (NLP) pipeline that integrates targeted web scraping of Congolese online media, sentence-level binary classification of epidemic content with a fine-tuned CamemBERT transformer, zero-shot biomedical named-entity recognition (CamemBERT-bio-GLiNER) restricted to disease, location and date, and an alerting dashboard built on a Django/Celery stack. The classifier was fine-tuned on a hybrid corpus of 11,433 sentences combining 1,433 manually annotated real news sentences and 10,000 template-generated synthetic sentences, and is benchmarked against two classical baselines (TF-IDF combined with Logistic Regression and Linear SVM) on an independent, manually annotated test set of 997 sentences (341 epidemic, 656 non-epidemic) constructed from a second scraping campaign performed three months later. We report precision, recall, F1, PR-AUC and ROC-AUC with 1,000-iteration bootstrap 95% confidence intervals. CamemBERT reaches F1 = 0.754 [0.717-0.787] and PR-AUC = 0.699 [0.644-0.756] for the epidemic class, while the Linear SVM baseline reaches F1 = 0.858 ± 0.037 and PR-AUC = 0.926 ± 0.024 in 5-fold stratified cross-validation, outperforming the transformer, a result we attribute to the dominance of synthetic data in the training corpus. A single-batch operational run of the full pipeline on MediaCongo processed 30 articles and 501 sentences in 37.5 s on a single GPU, producing 43 alerts that correctly captured the May 2026 Ebola Bundibugyo outbreak in Ituri. The system, the external benchmark, and all evaluation scripts are released as open source.
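The abstract reports each metric with a 1,000-iteration bootstrap 95% confidence interval. The sketch below illustrates one common way to obtain such an interval, a percentile bootstrap over resampled test sentences, using only the Python standard library. It is not the authors' evaluation script: the function names (`f1_score`, `bootstrap_ci`), the random seed, and the percentile method itself are assumptions. The toy labels reuse only the class sizes stated in the abstract (341 epidemic, 656 non-epidemic); the predictions are invented placeholders and do not reproduce any reported result.

```python
import random


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class, computed in a single pass."""
    tp = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(y_true, y_pred, metric=f1_score, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample sentences with replacement, recompute the metric."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    n = len(y_true)
    scores = []
    for _ in range(n_iter):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_iter)]
    upper = scores[min(n_iter - 1, int((1 - alpha / 2) * n_iter))]
    return metric(y_true, y_pred), lower, upper


# Toy data: class sizes from the abstract, predictions are made-up placeholders.
y_true = [1] * 341 + [0] * 656
y_pred = [1] * 260 + [0] * 81 + [1] * 90 + [0] * 566

point, ci_low, ci_high = bootstrap_ci(y_true, y_pred)
print(f"F1 = {point:.3f} [{ci_low:.3f}-{ci_high:.3f}]")
```

The same helper could be reused for precision or recall by passing a different `metric` callable. Threshold-free scores such as PR-AUC would need predicted probabilities rather than hard labels.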
References
[1] J. S. Brownstein, C. C. Freifeld, and L. C. Madoff, “Digital disease detection — Harnessing the Web for public health surveillance,” New England Journal of Medicine, vol. 360, no. 21, pp. 2153–2155, 2009, doi: 10.1056/NEJMp0900702.
[2] Celery Project, “Celery: Distributed task queue,” 2023. [Online]. Available: https://docs.celeryq.dev/
[3] C. M. Wolfe et al., “Systematic review of Integrated Disease Surveillance and Response (IDSR) implementation in the African region,” PLoS ONE, vol. 16, no. 2, e0245457, 2021, doi: 10.1371/journal.pone.0245457.
[4] Django Software Foundation, “Django documentation,” 2023. [Online]. Available: https://docs.djangoproject.com/
[5] J. Ginsberg et al., “Detecting influenza epidemics using search engine query data,” Nature, vol. 457, no. 7232, pp. 1012–1014, 2009, doi: 10.1038/nature07634.
[6] J. Hong et al., “Relation extraction from news articles (RENA): A tool for epidemic surveillance,” arXiv preprint arXiv:2311.01472, 2023. doi: 10.48550/arXiv.2311.01472.
[7] Hugging Face, “CamemBERT model documentation,” 2023. [Online]. Available: https://huggingface.co/camembert-base
[8] J. Miano et al., “Using event-based web-scraping methods and bidirectional transformers to characterize COVID-19 outbreaks in food production and retail settings,” in Artificial Intelligence in Medicine (AIME 2021), Lecture Notes in Artificial Intelligence, vol. 12721, pp. 187–198, Springer, 2021, doi: 10.1007/978-3-030-77211-6_21.
[9] G. Lejeune, R. Brixtel, A. Doucet, and N. Lucas, “Multilingual event extraction for epidemic detection,” Artificial Intelligence in Medicine, vol. 65, no. 2, pp. 131–143, 2015, doi: 10.1016/j.artmed.2015.06.005.
[10] L. Martin et al., “CamemBERT: A tasty French language model,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7203–7219, 2020, doi: 10.18653/v1/2020.acl-main.645.
[11] World Health Organization, “Rapport sur les épidémies et la couverture sanitaire en Afrique centrale,” Bureau régional de l’OMS pour l’Afrique, 2021. [Online]. Available: https://www.afro.who.int
[12] D. Phutane et al., “Predicting future trends in disease outbreaks using web scraping and machine learning,” Veermata Jijabai Technological Institute, Mumbai, India, 2025. [Online]. Available: https://www.researchgate.net/publication/392892557_Predicting_Future_Trends_in_Disease_Outbreaks_Using_Web_Scraping_and_Machine_Learning_5_th_Druhi_Phutane
[13] L. Richardson, “Beautiful Soup documentation,” 2023. [Online]. Available: https://www.crummy.com/software/BeautifulSoup/
[14] T. Wolf et al., “Transformers: State-of-the-art natural language processing,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45, 2020, doi: 10.18653/v1/2020.emnlp-demos.6.
[15] Almanach, “CamemBERT-bio-gliner-v0.1: Zero-shot French biomedical NER model based on GLiNER with CamemBERT-bio backbone,” Hugging Face model card, 2025. [Online]. Available: https://huggingface.co/almanach/camembert-bio-gliner-v0.1
[16] M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why should I trust you?’: Explaining the predictions of any classifier,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16), pp. 1135–1144, 2016, doi: 10.1145/2939672.2939778.
License
Copyright (c) 2026 Franklin Mwamba, Fiston Oshasha, Saint Jean Djungu, John Poma

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal the right of first publication, with the work simultaneously licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0) that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).








