Deep Learning based classification of vocal folds’ vibration dynamics

Mona Kirstin Fehling; Maximilian Linxweiler; Bernhard Schick; Jörg Lohscheller

Chapter 1: Manufacturing

Published 19.10.2022

Mona Kirstin Fehling

University of Applied Sciences Trier; Saarland University Hospital; University Hospital Mannheim / Medical Faculty Mannheim of Heidelberg University

Maximilian Linxweiler

Saarland University Hospital

Bernhard Schick

Saarland University Hospital

Jörg Lohscheller

University of Applied Sciences Trier

Keywords

vocal fold vibration
voice disorders
high-speed video
Phonovibrogram
classification
deep neural network

Abstract

Vocal fold (VF) dynamics can be captured in real time using high-speed videolaryngoscopy, laying the basis for quantitative assessment of the VFs’ vibration properties. A compact representation of the vibrational behavior captured in these high-speed videos (HSV) is provided by the so-called Phonovibrogram (PVG). The PVG encodes the VFs’ vibrational behavior as characteristic spatial and temporal patterns in a three-dimensional representation. Based on these characteristic PVG patterns, this work realizes a fully automatic classification of different voice disorders. For this purpose, a Convolutional Neural Network (CNN) was trained and evaluated using a stratified 10-fold cross-validation strategy on PVGs from 220 subjects to solve two classification tasks: (a) classification of the vibrational behavior as physiologic or pathologic and (b) classification of the PVGs according to the subject’s actual clinical diagnosis as healthy, muscle tension dysphonia (MTD), paresis, or polyp. The trained CNN distinguished between physiologic and pathologic VF vibration with an average classification accuracy of 0.82 ± 0.07 (sensitivity: 0.81 ± 0.12, specificity: 0.82 ± 0.12) and achieved an average classification accuracy of 0.85 ± 0.07 across all classes (sensitivity: 0.71 ± 0.19, specificity: 0.91 ± 0.07) for classification according to the clinical diagnoses. Based on the PVG representation, the presented approach reliably differentiates between physiologic and pathologic VF vibration and can even distinguish types of voice disorders without user interaction. However, further increasing the method’s performance requires a larger amount of training data.
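
The evaluation protocol named in the abstract (stratified 10-fold cross-validation, with accuracy, sensitivity, and specificity reported as mean ± standard deviation) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `stratified_folds`, `binary_metrics`, and `mean_std` are hypothetical, the class split in the usage lines is invented for illustration, and the use of the sample standard deviation is an assumption, since the abstract does not say which estimator was used.

```python
import random
import statistics
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Split sample indices into folds with roughly equal class proportions.

    Hypothetical helper: indices of each class are shuffled and dealt
    round-robin across the folds.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    offset = 0  # continue dealing where the previous class stopped
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for i, idx in enumerate(indices):
            folds[(offset + i) % n_folds].append(idx)
        offset += len(indices)
    return folds


def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, sensitivity, specificity) for one test fold."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return accuracy, sensitivity, specificity


def mean_std(values):
    """Mean and sample standard deviation across folds (assumed estimator)."""
    return statistics.mean(values), statistics.stdev(values)


# Illustrative labels only: 220 subjects, with an invented class split
# (0 = physiologic, 1 = pathologic). The real split is not given here.
labels = [0] * 100 + [1] * 120
folds = stratified_folds(labels, n_folds=10)
```

In a full experiment, each fold would serve once as the test set while the classifier is trained on the remaining nine. The per-fold outputs of `binary_metrics` would then be summarized with `mean_std`.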

This work is licensed under a Creative Commons Attribution 4.0 International License.