Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition. (arXiv:2206.07327v3 [eess.AS] UPDATED)
By: <a href="http://arxiv.org/find/eess/1/au:+Hu_S/0/1/0/all/0/1">Shujie Hu</a>, <a href="http://arxiv.org/find/eess/1/au:+Xie_X/0/1/0/all/0/1">Xurong Xie</a>, <a href="http://arxiv.org/find/eess/1/au:+Geng_M/0/1/0/all/0/1">Mengzhe Geng</a>, <a href="http://arxiv.org/find/eess/1/au:+Cui_M/0/1/0/all/0/1">Mingyu Cui</a>, <a href="http://arxiv.org/find/eess/1/au:+Deng_J/0/1/0/all/0/1">Jiajun Deng</a>, <a href="http://arxiv.org/find/eess/1/au:+Li_G/0/1/0/all/0/1">Guinan Li</a>, <a href="http://arxiv.org/find/eess/1/au:+Wang_T/0/1/0/all/0/1">Tianzi Wang</a>, <a href="http://arxiv.org/find/eess/1/au:+Liu_X/0/1/0/all/0/1">Xunying Liu</a>, <a href="http://arxiv.org/find/eess/1/au:+Meng_H/0/1/0/all/0/1">Helen Meng</a> Posted: June 23, 2023
Articulatory features are inherently invariant to acoustic signal distortion
and have been successfully incorporated into automatic speech recognition (ASR)
systems designed for normal speech. Their practical application to atypical
task domains such as elderly and disordered speech across languages is often
limited by the difficulty in collecting such specialist data from target
speakers. This paper presents a cross-domain and cross-lingual
acoustic-to-articulatory (A2A) inversion approach that uses the parallel audio
and ultrasound tongue imaging (UTI) data of the 24-hour TaL corpus for A2A
model pre-training. The pre-trained model is then adapted across domains and
languages to three datasets in two languages, the English DementiaBank Pitt
and Cantonese JCCOCC MoCA elderly speech corpora and the English TORGO
dysarthric speech data, to produce UTI-based articulatory
features. Experiments on the three tasks suggest that systems incorporating
the generated articulatory features consistently outperformed the baseline
TDNN and Conformer ASR systems built using acoustic features only, with
statistically significant word or character error rate reductions of up to
4.75%, 2.59% and 2.07% absolute (14.69%, 10.64% and 22.72% relative) after
data augmentation, speaker adaptation and cross-system multi-pass decoding
were applied.
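
The absolute and relative figures quoted above are linked by the baseline error rate: a relative reduction is the absolute reduction divided by the baseline. The following is a minimal sketch of that arithmetic, not code from the paper. The function names are hypothetical, and the implied baselines are only back-of-envelope estimates that assume each absolute/relative pair describes the same system comparison, which the abstract does not state.

```python
def relative_reduction(baseline_err: float, improved_err: float) -> float:
    """Relative error rate reduction in percent, given two error rates in percent."""
    return (baseline_err - improved_err) / baseline_err * 100.0


def implied_baseline(absolute_red: float, relative_red: float) -> float:
    """Baseline error rate (percent) implied by an absolute and a relative reduction.

    Assumes both reductions refer to the same baseline/improved system pair.
    """
    return absolute_red / relative_red * 100.0


if __name__ == "__main__":
    # (absolute %, relative %) pairs as quoted in the abstract, in the order given.
    quoted = [(4.75, 14.69), (2.59, 10.64), (2.07, 22.72)]
    for absolute, relative in quoted:
        base = implied_baseline(absolute, relative)
        print(f"abs {absolute:.2f}% / rel {relative:.2f}% -> implied baseline ~{base:.2f}%")
```

Under that assumption, a small absolute gain on an already low error rate can correspond to a large relative gain, which is consistent with the third pair having the smallest absolute but the largest relative reduction.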