- Computer Science; Mathematics; Social Sciences; Arts and Humanities
This work explores the utility of the time-domain signal components, or Intrinsic Mode Functions (IMFs), of speech signals, as generated by the data-adaptive filterbank nature of Empirical Mode Decomposition (EMD), in characterizing speakers for the task of text-independent Speaker Verification (SV). A modified version of EMD, denoted as MEMD, which extracts IMFs with less mode-mixing and provides a better representation of the higher-frequency spectrum of speech, is also utilized for the SV task. Three different features are extracted over 20 ms frames from the IMFs of EMD and MEMD. They are then tested individually, and in conjunction with the Mel Frequency Cepstral Coefficients (MFCCs), for SV. Two corpora, the NIST SRE 2003 corpus and the CHAINS corpus, are used for the experiments. The results evaluated on the NIST SRE 2003 database, using the i-vector framework, reveal that the features extracted from the IMFs, in conjunction with the MFCCs, enhance the performance of the SV system. Further, it is observed that only a small set of lower-order IMFs is useful and necessary for characterizing speaker-specific information. The combination of the features with the MFCCs is also found to be useful when short speech utterances of ≤10 s are used for testing. Similarly, the results evaluated on the CHAINS corpus, using the conventional Gaussian Mixture Model (GMM) framework, reveal that the features, in combination with the MFCCs, enhance the performance of the SV system, not only for normal speech, but also for fast and whispered speech. Again, it is observed that only the first few IMFs are needed and useful for achieving such enhanced performance.
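The frame-level processing described above can be illustrated with a minimal sketch. The abstract does not name the three features, so per-frame log energy is used here purely as a stand-in. The function names, the non-overlapping framing, and the synthetic sinusoids standing in for IMFs are all assumptions made for illustration. In practice the components would come from an EMD or MEMD implementation, which is not shown.

```python
import math


def frame_log_energy(components, sample_rate, frame_ms=20.0):
    """Per-frame log energy of each component (e.g. each IMF).

    components: list of equal-length sample lists, one per component.
    Returns one row per frame, with one value per component.
    Assumption: non-overlapping frames; the abstract only states 20 ms frames.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    n_samples = min(len(c) for c in components)
    n_frames = n_samples // frame_len  # trailing partial frame is dropped
    features = []
    for f in range(n_frames):
        start = f * frame_len
        row = []
        for comp in components:
            energy = sum(x * x for x in comp[start:start + frame_len])
            row.append(math.log(energy + 1e-12))  # small floor avoids log(0)
        features.append(row)
    return features


def synthetic_components(sample_rate=8000, duration_s=0.1):
    """Two sinusoids standing in for IMFs (hypothetical data, not real speech)."""
    n = int(sample_rate * duration_s)
    fast = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(n)]
    slow = [0.5 * math.sin(2 * math.pi * 100 * t / sample_rate) for t in range(n)]
    return [fast, slow]


feats = frame_log_energy(synthetic_components(), sample_rate=8000)
print(len(feats), len(feats[0]))  # number of frames, number of components
```

Each row of `feats` is a feature vector for one 20 ms frame. In a setup like the one the abstract describes, such vectors could be concatenated with the MFCCs of the same frame before speaker modelling, and restricting `components` to the first few entries would mirror the finding that only the lower-order IMFs are needed.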