Institute for Telecommunication Sciences
the research laboratory of the National Telecommunications and Information Administration

Stephen D. Voran

Abstract: Objective speech quality and intelligibility estimators do not correctly assess speech generated by deep neural networks (DNNs). We use 256 speech files and subjective scores that cover 14 DNN speech conditions and 18 nonDNN speech conditions to show that eight different full-reference (FR) estimators consistently underestimate subjective scores for the DNN conditions. Conversely, we find that five no-reference (NR) estimators consistently overestimate subjective scores for the DNN conditions. We show that a rudimentary but effective solution to these shortcomings is to simply average an FR result with an NR result. We also explore root causes and propose more fundamental solutions. It has been previously suggested that FR estimators over-penalize inaudible timing variations, or jitter. We conduct several experiments that measure and remove jitter from spectral representations of DNN speech inside FR estimators. Jitter removal compensates for some of the underestimation, confirming that jitter is part of the cause. In additional experiments we show that power mismatches on a syllabic time scale also contribute to the underestimation in FR estimators. Regarding NR estimators, we suggest that they can be trained to accurately rate DNN speech when sufficient speech signals and corresponding subjective scores are available.
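The averaging remedy mentioned in the abstract can be illustrated with a minimal sketch. It assumes that the FR and NR estimates have already been mapped to the same scale (for example, a 1 to 5 opinion scale) and that an unweighted mean is used. The function name and the numeric values are hypothetical and are not taken from the report.

```python
def average_fr_nr(fr_score: float, nr_score: float) -> float:
    """Return the unweighted mean of one FR estimate and one NR estimate.

    Assumption: both scores are on the same scale, so averaging is meaningful.
    """
    return (fr_score + nr_score) / 2.0


# Hypothetical estimates for a single DNN speech file (illustrative only).
fr_estimate = 2.0  # FR estimators tend to underestimate DNN speech
nr_estimate = 4.0  # NR estimators tend to overestimate DNN speech
combined = average_fr_nr(fr_estimate, nr_estimate)
```

Because the two estimator families err in opposite directions on DNN speech, their biases partly cancel in the mean. The abstract itself calls this a rudimentary fix.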

Keywords: speech quality; speech intelligibility; objective estimator; DNN speech; neural speech


To request a reprint of this report, contact:

Lilli Segre, Publications Officer
Institute for Telecommunication Sciences
(303) 497-3572
LSegre@ntia.gov

For technical information concerning this report, contact:

Stephen D. Voran
Institute for Telecommunication Sciences
(303) 497-3839
svoran@ntia.doc.gov


Disclaimer: Certain commercial equipment, components, and software may be identified in this report to specify adequately the technical aspects of the reported results. In no case does such identification imply recommendation or endorsement by the National Telecommunications and Information Administration, nor does it imply that the equipment or software identified is necessarily the best available for the particular application or uses.
