CNN-Based AI for Laryngeal Cancer Diagnosis: 2026 Review

CNN-based AI for laryngeal cancer diagnosis is now backed by the first large pooled evidence base. The 2025 Current Oncology meta-analysis combined fifteen studies and 17,559 patients to ask one question: how good is AI at spotting laryngeal cancer on endoscopy? Sensitivity came in at 78%, specificity at 86%, with a pooled diagnostic odds ratio of 53.77 [Alabdalhussein, Artificial Intelligence in Laryngeal Cancer Detection: A Systematic Review and Meta-Analysis, 2025]. That number is the benchmark against which every individual CNN paper now gets read.

This article walks through what convolutional neural networks (CNNs) actually do with a laryngoscopy image, what the best published systems achieve, and where the technology meaningfully changes — or does not change — clinical work.

What CNNs Do with a Laryngoscopy Image

A CNN is a neural network that learns features directly from image data — edges, textures, color patterns, then higher-order shapes — without a human engineer hand-coding what a tumor looks like. Laryngoscopy is unusually well suited to this. The view is standardized, the mucosa is visually rich, and the pathology that matters (irregular vasculature, mucosal whitening, exophytic growth) lives in patterns CNNs are designed to detect.

The two clinical inputs that matter are white-light endoscopy (WLE) and narrow-band imaging (NBI). NBI enhances submucosal vascular detail and has long been the visual cue ENT surgeons use to triage suspicious lesions. CNNs handle both. The 2025 systematic review found no statistically significant difference in AI accuracy between WLE and NBI inputs [Alabdalhussein, Artificial Intelligence in Laryngeal Cancer Detection: A Systematic Review and Meta-Analysis, 2025] — a useful finding for clinics without NBI scopes.

White-light versus narrow-band laryngoscopy image with CNN heatmap highlighting suspicious vocal cord lesion

What the Best Systems Achieve

A 2024 meta-analysis restricted to deep-learning laryngoscopy pooled nine studies covering 106,175 endoscopic images and reported pooled sensitivity of 0.95 (95% CI 0.85–0.98), specificity of 0.96 (95% CI 0.91–0.98), and an AUC of 0.99 [Du, Diagnostic accuracy of deep learning-based algorithms in laryngoscopy: a systematic review and meta-analysis, 2024]. The gap between this and the 78/86% number above is informative: stricter inclusion (deep-learning only, image-level rather than patient-level outcomes) produces much higher numbers. Real-world performance probably sits between the two.

Individual studies — beginning with the foundational work that set the field’s direction — anchor that range.

Xiong 2019: the proof-of-concept external validation. A Chinese multicenter team trained a deep CNN on 13,721 laryngoscopic images from two tertiary hospitals across four classes — laryngeal cancer (LCA), precancerous lesions (PRELCA), benign tumors (BLT), and normal tissue (NORM) — then tested on 1,176 images from three different tertiary hospitals [Xiong, Computer-aided diagnosis of laryngeal cancer via deep learning based on laryngoscopic images, 2019]. Because the testing institutions were entirely separate from training, this is true external validation, not internal.

For the clinically meaningful binary task — urgent (cancer or precancer) versus non-urgent — the model achieved a sensitivity of 0.731, a specificity of 0.922, and an AUC of 0.922 on the main DS3 external test set [Xiong, 2019]. In a separate human-versus-machine protocol (DS1 and DS2 combined as training, DS3 retested), the same architecture posted sensitivity 0.720, specificity 0.948, and AUC 0.953 — and at this threshold matched an endoscopist with 10–20 years of experience while outperforming the two juniors (roughly 3 and 3–10 years).

The 73% sensitivity is the honest part of the result. Missing nearly three in ten cancers is unacceptable as a standalone screen, so the model’s role is to assist the endoscopist, not to gate the biopsy — a second pair of eyes, not a substitute for them. Most subsequent CNN laryngoscopy papers cite this study as the foundational benchmark, and later work pushed performance higher with deeper architectures and richer multimodal inputs.

Densenet201 on laryngoscopic images. A two-center Chinese study using the Densenet201 architecture reported 98.5% accuracy in training, 92.0% on internal validation, and 86.3% on external validation, with AUC consistently above 92%. Performance matched one experienced clinician and beat another on external data [Xu, Computer-Aided Diagnosis of Laryngeal Cancer Based on Deep Learning with Laryngoscopic Images, 2023]. The drop from internal to external validation is the part to remember — it is the most honest single data point in the literature about how these models generalize.

LPAIDS for real-time WLI and NBI. A multicenter Chinese group built a CNN that runs in real time on both white-light and narrow-band video. Across six independent test sets, accuracy ranged 0.949–0.984, sensitivity 0.901–0.986, specificity 0.946–0.987, and AUC 0.965–0.987. In a human–machine video test, the system matched expert laryngologists and outperformed less-experienced ones [Li, Real-time detection of laryngopharyngeal cancer using an artificial intelligence-assisted system with multimodal data, 2023].

Multimodal voice + image ensemble (Korea). A Pusan National University group trained CNN classifiers on both laryngoscopic images and voice samples for early glottic cancer, then combined them with decision-tree ensemble learning. The integration of two modalities raised accuracy on a relatively small dataset — a route under-explored elsewhere [Kwon, Diagnosis of Early Glottic Cancer Using Laryngeal Image and Voice Based on Ensemble Learning of Convolutional Neural Network Classifiers, 2022].

Foundation models, no training. Out-of-the-box Google Gemini 1.5 Pro recognized laryngoscopy in 87/88 frames (98.9%) and correctly diagnosed pathology on 55/88 frames (62.5%) — but only 3/15 videos (20%) — without any fine-tuning [Setzen, AI-Powered Laryngoscopy: Exploring the Future With Google Gemini, 2025]. Useful as a glimpse of where general-purpose AI sits today: surprisingly good at recognition, still weak at pathology.

Architecture summary

Model family	Typical input	Reported strength
Pretrained DCNN (Xiong 2019)	Laryngoscopy still image	First external validation benchmark
Densenet201	Laryngoscopy still image	Generalization, comparable to experienced clinician
FCN-ResNet101	Laryngoscopy still image	Lesion localization (segmentation)
Custom CNN + ensemble	Image + voice	Small-dataset accuracy
Foundation models (Gemini)	Frames + video	Multimodal recognition, weak pathology

A 2026 Korean platform from Chonnam National University used FCN-ResNet101 in a two-stage design: first select valid vocal-cord frames, then localize cancer within them [Jang, Development of AI-Based Laryngeal Cancer Diagnostic Platform Using Laryngoscope Images, 2026]. This filter-then-detect pattern is becoming standard because most raw laryngoscopy frames are noisy or off-target.

Comparison of CNN architectures for laryngeal cancer detection — Xiong 2019, DenseNet201, LPAIDS, Gemini

What CNNs Do Not Solve

Out-of-distribution lesions are the first problem. Models trained on Chinese or Korean tertiary-center data tend to drop in accuracy on external validation, as Xu et al. directly showed (98.5% training → 86.3% external) [Xu, Computer-Aided Diagnosis of Laryngeal Cancer Based on Deep Learning with Laryngoscopic Images, 2023].

Biopsy decisions are the second. Pathological biopsy under laryngoscopy remains the diagnostic gold standard for laryngeal cancer [Li, Real-time detection of laryngopharyngeal cancer using an artificial intelligence-assisted system with multimodal data, 2023]. No current CNN replaces tissue diagnosis.

Regulatory readiness is the third. A separate 2025 meta-analysis covering AI across endoscopy, voice, and histopathology for laryngeal lesions concluded that most published tools remain research-grade and need larger prospective trials before practical clinical adoption [Marrero-Gonzalez, Application of artificial intelligence in laryngeal lesions: a systematic review and meta-analysis, 2025].

Image quality variability is the fourth. Camera, lighting, scope generation, and breath-hold all differ across clinics. The two-stage frame-filtering design used by the Chonnam group [Jang, Development of AI-Based Laryngeal Cancer Diagnostic Platform Using Laryngoscope Images, 2026] is partly an admission of how much real laryngoscopy data is unusable.

Clinical Perspective

The reasonable clinical role for CNN-based laryngoscopy today is as a second-reader on flexible laryngoscopy in primary-care ENT, and as a training tool for residents reading their first hundred glottic lesions. The sensitivity numbers suggest a CNN catches things a junior clinician would miss; the specificity numbers suggest it would not generate unmanageable referral noise.

The roles it should not fill: substitute for biopsy on a suspicious lesion, or primary read on a post-radiation larynx, where mucosal changes confound even experienced eyes. The Korean voice-plus-image ensemble approach is clinically interesting because hoarseness is usually the first symptom that brings a patient to ENT. A system that triages by voice in primary care, then refers to imaging, fits real-world referral patterns better than image-only models.

The honest gap is prospective validation. Most reported accuracy numbers come from retrospective image sets, often single-institution. Until multicenter prospective trials exist — and the 2025 reviews say they do not yet — the right framing is “AI as decision support,” not “AI as diagnosis.”

Key Takeaways

CNN-based AI for laryngeal cancer diagnosis reaches pooled sensitivity 78% and specificity 86% across 17,559 patients (2025 meta-analysis).
The foundational Xiong 2019 study established external validation feasibility (sensitivity 0.731, specificity 0.922, AUC 0.922) — most later CNN laryngoscopy papers cite it as the benchmark.
Restricted to deep-learning studies, pooled sensitivity rises to 95% and specificity to 96% (2024 meta-analysis, 106,175 images).
CNN models outperform non-CNN AI for this task; Densenet201 and ResNet-based architectures dominate.
Best-performing real-time systems (LPAIDS) match expert laryngologists in head-to-head video tests.
Pathological biopsy remains the diagnostic gold standard; CNNs serve as triage and decision support, not replacement.
Multimodal models combining image and voice are an underrated direction with Korean clinical roots.

FAQ

How accurate is AI at detecting laryngeal cancer?

Pooled sensitivity is 78% and specificity 86% across the 2025 systematic review of 15 studies and 17,559 patients. Deep-learning-only subsets reach 95% sensitivity and 96% specificity on retrospective image datasets — clinical prospective performance likely sits between these two ranges.

Does AI replace biopsy?

No. Pathological biopsy under laryngoscopy remains the gold standard. CNN systems help triage which lesions look suspicious enough to warrant biopsy, not which to skip.

Is CNN-based laryngeal cancer AI FDA-approved or KFDA-approved?

As of early 2026, most laryngeal-cancer CNN systems remain research-stage. Recent meta-analyses specifically call for larger prospective multicenter trials before practical clinical adoption.

Is white-light or narrow-band imaging better for AI?

The 2025 meta-analysis found no statistically significant difference in AI accuracy between WLE and NBI inputs. CNN systems handle both, which matters for clinics without NBI scopes.

Can AI detect laryngeal cancer from voice alone?

Voice-only CNN models exist but underperform image-based systems. Multimodal ensembles combining image and voice — pioneered by a Korean group at Pusan National University — outperform either modality alone on small datasets.

References

Alabdalhussein A, Al-Khafaji MH, Al-Busairi R, Al-Dabbagh S, Khan W, Anwar F, Raheem TS, Elkrim M, Sahota RB, Mair M. Artificial Intelligence in Laryngeal Cancer Detection: A Systematic Review and Meta-Analysis. Curr Oncol. 2025;32(6):338.
Du S, Guo J, Huang D, Liu Y, Zhang X, Lu S. Diagnostic accuracy of deep learning-based algorithms in laryngoscopy: a systematic review and meta-analysis. Eur Arch Otorhinolaryngol. 2025;282(1):351-360.
Xiong H, Lin P, Yu JG, Ye J, Xiao L, Tao Y, Jiang Z, Lin W, Liu M, Xu J, Hu W, Lu Y, Liu H, Li Y, Zheng Y, Yang H. Computer-aided diagnosis of laryngeal cancer via deep learning based on laryngoscopic images. EBioMedicine. 2019;48:92-99.
Xu ZH, Fan DG, Huang JQ, Wang JW, Wang Y, Li YZ. Computer-Aided Diagnosis of Laryngeal Cancer Based on Deep Learning with Laryngoscopic Images. Diagnostics (Basel). 2023;13(24):3669.
Li Y, Gu W, Yue H, Lei G, Guo W, Wen Y, Tang H, Luo X, Tu W, Ye J, Hong R, Cai Q, Gu Q, Liu T, Miao B, Wang R, Ren J, Lei W. Real-time detection of laryngopharyngeal cancer using an artificial intelligence-assisted system with multimodal data. J Transl Med. 2023;21(1):698.
Kwon I, Wang SG, Shin SC, Cheon YI, Lee BJ, Lee JC, Lim DW, Jo C, Cho Y, Shin BJ. Diagnosis of Early Glottic Cancer Using Laryngeal Image and Voice Based on Ensemble Learning of Convolutional Neural Network Classifiers. J Voice. 2022. Epub ahead of print.
Setzen SA, Andreadis K, Elemento O, Rameau A. AI-Powered Laryngoscopy: Exploring the Future With Google Gemini. Laryngoscope. 2025;135(6):1851-1853.
Jang HB, Park SB, Lee SJ, Yang GS, Hong AR, Lee DH. Development of AI-Based Laryngeal Cancer Diagnostic Platform Using Laryngoscope Images. Diagnostics (Basel). 2026;16(2):227.
Marrero-Gonzalez AR, Diemer TJ, Nguyen SA, Camilon TJM, Meenan K, O’Rourke A. Application of artificial intelligence in laryngeal lesions: a systematic review and meta-analysis. Eur Arch Otorhinolaryngol. 2025;282(3):1543-1555.

Joonpyo Hong, MD is a board-certified otolaryngologist practicing in Korea. This article reflects his clinical interpretation of published research and does not constitute individual medical advice.