Improving speaker diarization for naturalistic child-adult conversational interactions using contextual information.
Academic Article
Overview
abstract
While deep learning has driven recent improvements in audio speaker diarization, it often faces performance issues in challenging interaction scenarios and varied acoustic settings such as between a child and adult (caregiver/examiner). In this work, the role of contextual factors that affect diarization performance in such interactions is analyzed. Factors that affect each type of diarization error are identified. Furthermore, a DNN is trained on diarization outputs in conjunction with the factors to improve diarization performance. The results demonstrate the usefulness of incorporating context in improving diarization performance of child-adult interactions in clinical settings.