This paper proposes a microphone array based statistical speech activity detection (SAD) method for analyses of poster presentations recorded in the presence of noise. Such poster presentations are a kind of multi-party conversation, where the number of speakers and speaker location are unrestricted, and directional noise sources affect the direction of arrival of the target speech signals. To detect speech activity in such cases without a priori knowledge about the speakers and noise environments, we applied a likelihood ratio test based SAD method to spatial power distributions. The proposed method can exploit the enhanced signals obtained from time-frequency masking, and work even in the presence of environmental noise by utilizing the a priori signal-to-noise ratios of the spatial power distributions. Experiments with recorded poster presentations confirmed that the proposed method significantly improves the SAD accuracies compared with those obtained with a frequency spectrum based statistical SAD method.
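(The spatial power distributions mentioned above can be thought of as beamformer output power evaluated over a grid of candidate directions. The following Python sketch is our own simplified delay-and-sum illustration with hypothetical array geometry and variable names; it is not the authors' exact formulation and omits the time-frequency masking step.)

```python
import numpy as np

def spatial_power_map(stft, mic_xy, angles_deg, freqs, c=343.0):
    """Delay-and-sum output power over candidate look directions.

    stft       : (n_mics, n_freqs) complex STFT of one frame
    mic_xy     : (n_mics, 2) microphone positions in metres
    angles_deg : candidate azimuths to scan (degrees)
    freqs      : (n_freqs,) frequency of each bin in Hz
    Returns the broadband output power for each candidate angle.
    """
    powers = []
    for ang in np.deg2rad(angles_deg):
        direction = np.array([np.cos(ang), np.sin(ang)])
        delays = mic_xy @ direction / c                   # per-mic delay (s)
        steering = np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])
        beam = np.sum(stft * np.conj(steering), axis=0)   # align and sum mics
        powers.append(np.sum(np.abs(beam) ** 2))          # broadband power
    return np.array(powers)
```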
In this paper, we apply discriminative weight training to statistical model-based voice activity detection (VAD). In our approach, the VAD decision rule is expressed as the geometric mean of optimally weighted likelihood ratios (LRs), with the weights obtained by a minimum classification error (MCE) method. This approach differs from previous work in that a different weight is assigned to each frequency bin, which is considered more realistic. According to the experimental results, the proposed approach is found to be effective for statistical model-based VAD using the LR test.
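(As a rough illustration of the kind of decision rule described above, in our own notation rather than the paper's: a per-bin log-likelihood ratio from a Gaussian statistical VAD model can be combined through a weighted geometric mean, with the weights trained by MCE.)

```latex
% Per-bin log-likelihood ratio of a Gaussian statistical VAD model
% (Sohn-style), with a posteriori SNR \gamma_k and a priori SNR \xi_k:
\log \Lambda_k = \frac{\gamma_k \xi_k}{1 + \xi_k} - \log\!\left(1 + \xi_k\right)

% Weighted geometric mean over K frequency bins; the weights w_k are
% trained discriminatively (e.g. by MCE) and speech is declared when the
% statistic exceeds a threshold \eta:
\frac{1}{K} \sum_{k=1}^{K} w_k \log \Lambda_k
  \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \log \eta
```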
This paper presents a comparative evaluation of different methods for voice activity detection (VAD). A novel feature set is proposed in order to improve VAD performance in diverse noisy environments. Furthermore, three classifiers for VAD are evaluated. The three classifiers are Gaussian Mixture Model (GMM), Support Vector Machine (SVM) and Decision Tree (DT). Experimental results show that the proposed feature set achieves better performance than spectral entropy. In the comparison of the classifiers, DT shows the best performance in terms of frame-based VAD accuracy as well as computational cost.
Every speech recognition system contains a speech/non-speech detection stage, and only the detected speech segments are subsequently passed to the recognition stage. In a noisy environment, non-speech segments can be an important source of error. In this work, we introduce a new speech/non-speech detection system based on fractal dimension and prosodic features, in addition to the commonly used MFCC features. We evaluated our system using neural network and SVM classifiers on the TIMIT speech database with an HMM-based speech recognizer. Experimental results show very good speech/non-speech detection performance.
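(The abstract above does not specify which fractal dimension estimator is used; one common choice for speech frames is Higuchi's method, sketched below as an illustration rather than the paper's actual feature extractor.)

```python
import numpy as np

def higuchi_fd(x, kmax=8):
    """Estimate the fractal dimension of a 1-D signal with Higuchi's method."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    ks = range(1, kmax + 1)
    lks = []
    for k in ks:
        lengths = []
        for m in range(k):
            idx = np.arange(m, n, k)          # subsampled curve starting at m
            if len(idx) < 2:
                continue
            dist = np.abs(np.diff(x[idx])).sum()
            norm = (n - 1) / ((len(idx) - 1) * k)
            lengths.append(dist * norm / k)
        lks.append(np.mean(lengths))
    # Slope of log L(k) versus log(1/k) gives the fractal dimension estimate
    slope, _ = np.polyfit(np.log(1.0 / np.array(list(ks))), np.log(lks), 1)
    return slope
```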
This work proposes a system for acoustic event classification using signals acquired by a Distributed Microphone Network (DMN). The system is based on the combination of Gaussian Mixture Models (GMM) and Support Vector Machines (SVM). The acoustic event list includes both speech and non-speech events typical of seminars and meetings. The robustness of the system was investigated by considering two scenarios characterized by different types of trained models and testing conditions. Experimental results were obtained using real-world data collected at two sites. The results, in terms of classification error rate, show that in each scenario the proposed system outperforms any single-classifier system.
We introduce a new class of speech processing, called Intentional Voice Command Detection (IVCD). To achieve a completely hands-free speech interface, it is necessary to reject not only noise but also unintended voices. The conventional VAD framework is not sufficient for this purpose, and we discuss how IVCD should be defined and how it can be realized. We investigate the implementation of IVCD from the viewpoints of feature extraction and classification, and show that the combination of various features with an SVM achieves an IVCD accuracy of 93.2% on a large-scale audio database recorded in real home environments.
Detection of acoustic events (AED) that take place in a meeting-room environment becomes a difficult task when signals show a large proportion of temporal overlap of sounds, as in seminar-type data, where the acoustic events often occur simultaneously with speech. Whenever the event that produces the sound is related to a given position or movement, video signals may be a useful additional source of information for AED. In this work, we aim to improve AED accuracy by using two complementary audio-based AED systems, built with SVM and HMM classifiers, together with a video-based AED system, which employs the output of a 3D video tracking algorithm to improve the detection of steps. A fuzzy integral is used to fuse the outputs of the three classification systems in two stages. Experimental results using the CLEAR'07 evaluation data show that the detection rate increases when the two audio information sources are fused, and is further improved by including video information.
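(The fuzzy integral fusion mentioned above can, for example, take the form of a Choquet integral over per-classifier confidence scores. The sketch below is our own illustration with a hypothetical fuzzy measure; the paper's trained measures and two-stage fusion scheme are not reproduced.)

```python
def choquet_integral(scores, fuzzy_measure):
    """Fuse per-classifier confidence scores for one event class.

    scores        : dict mapping classifier name -> score in [0, 1]
    fuzzy_measure : dict mapping frozenset of classifier names -> g(A) in
                    [0, 1]; must be defined for the nested subsets that
                    arise after sorting the scores in descending order.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)  # h(1) >= h(2) >= ...
    total = 0.0
    for i, name in enumerate(ordered):
        subset = frozenset(ordered[: i + 1])
        h_i = scores[name]
        h_next = scores[ordered[i + 1]] if i + 1 < len(ordered) else 0.0
        total += (h_i - h_next) * fuzzy_measure[subset]
    return total

# Hypothetical example: SVM, HMM and video classifiers scoring "steps"
scores = {"svm": 0.7, "hmm": 0.5, "video": 0.9}
g = {frozenset(["video"]): 0.3,
     frozenset(["video", "svm"]): 0.7,
     frozenset(["video", "svm", "hmm"]): 1.0}
print(choquet_integral(scores, g))
```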
We describe a method of simultaneously tracking noise and speech levels for signal-to-noise-ratio-adaptive speech endpoint detection. The method is based on the Kalman filter framework with switching observations and uses a dynamic distribution that 1) limits the rate of change of these levels, 2) enforces a range on the values of the two levels, and 3) enforces a ratio between the noise and signal levels. We call this a Lombard dynamic distribution, since it encodes the expectation that a speaker will increase his or her vocal intensity in noise. The method also employs a state transition matrix which encodes a prior on the states and provides a continuity constraint. The new method provides a 46.1% relative improvement in WER over a baseline GMM-based endpointer at 20 dB SNR.
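(A much-simplified sketch of level tracking with switching observations is given below; it is our own toy illustration with assumed parameter names and values, and does not reproduce the paper's Lombard dynamic distribution or state-transition prior.)

```python
import numpy as np

def track_levels(frame_db, q=0.1, r=4.0, max_step=2.0, min_gap=3.0):
    """Toy tracker of speech and noise levels (in dB) from frame energies.

    Each frame energy is treated as an observation of either the noise
    level or the speech level (a hard 'switching' decision based on which
    predicted level it is closer to), followed by a scalar Kalman update.
    q, r     : process / observation variances (dB^2)
    max_step : limit on the per-frame change of each level (dB)
    min_gap  : enforced minimum speech-over-noise ratio (dB)
    """
    speech, noise = frame_db[0] + 10.0, frame_db[0]      # crude initialisation
    p_s = p_n = 10.0                                     # state variances
    out = []
    for y in frame_db:
        p_s += q
        p_n += q
        if abs(y - noise) < abs(y - speech):             # switch: noise observation
            k = p_n / (p_n + r)
            noise += np.clip(k * (y - noise), -max_step, max_step)
            p_n *= (1.0 - k)
        else:                                            # switch: speech observation
            k = p_s / (p_s + r)
            speech += np.clip(k * (y - speech), -max_step, max_step)
            p_s *= (1.0 - k)
        speech = max(speech, noise + min_gap)            # enforce level ratio
        out.append((speech, noise))
    return out
```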
This paper considers the major influences of room acoustic effects on the fundamental frequency F0 of speech and its determination. A detailed description of room acoustic measures and effects is given. Based on these, the dependency on the speaker-to-microphone distance (SMD) is studied in combination with the reverberation time T60. Evaluation experiments aimed at identifying dependencies on room acoustic effects are carried out. In contrast to most studies dealing with reverberation, this paper shows that T60 cannot be the only measure used to describe the behavior of systems in reverberant environments. The experiments studying the dependency on the SMD are a new contribution of this paper. They extend previous studies on the dependency on artificial reverberation, in which twelve F0 estimation methods were compared; here, two further methods are added and real, measured room impulse responses (RIRs) are used systematically. Apart from the SMD dependency, another main contribution is the study of different disturbing effects on high and low F0. The results show that F0 estimation for male voices is significantly more sensitive to reverberation than for female voices.
Pitch marking is a major task in speech processing, and accurate detection of pitch marks (PM) is therefore required. In this paper, we propose a hybrid method for pitch marking that combines the outputs of two different speech-signal-based pitch marking algorithms (PMA). We use a finite state machine (FSM) to represent and combine the pitch marks. The hybrid PMA is implemented in four stages: preprocessing, alignment, selection and postprocessing. In the alignment stage, the preprocessed pitch marks are shifted to a local minimum of the speech signal and a confidence score for every pitch mark is calculated. The confidence scores are used as transition weights for the FSM. The PMA outputs are combined into a single sequence of pitch marks, and in the selection stage the more accurate pitch marks with the highest confidence scores are chosen. A PM reference database containing 10 minutes of speech with manually adjusted PMs is used for evaluation. The evaluation results indicate that the proposed hybrid method outperforms not only the single PMAs but also other current state-of-the-art algorithms, which were evaluated on a second reference database containing 44 speakers.
In this paper, we propose a novel representation of F0 contours that provides a computationally efficient algorithm for automatically estimating the parameters of an F0 control model for singing voices. Although the best-known F0 control model, based on a second-order system with a piece-wise constant function as its input, can generate F0 contours of natural singing voices, it has no means of learning its parameters automatically from observed F0 contours. Therefore, by modeling the piece-wise constant function with hidden Markov models (HMM) and approximating the second-order differential equation by a difference equation, we estimate the model parameters optimally by iterating Viterbi training and an LPC-like solver. Our representation is a generative model and can identify both the target musical note sequence and the dynamics of singing behaviors included in the F0 contours. Our experimental results show that the proposed method can separate the dynamics from the target musical note sequence and generate F0 contours using the estimated model parameters.
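(In rough notation of our own, not necessarily the paper's, the second-order F0 control model with a piece-wise constant input and a generic discrete-time approximation might be written as follows.)

```latex
% Second-order F0 control model driven by a piece-wise constant input u(t)
% (natural frequency \omega, damping ratio \zeta):
\ddot{y}(t) + 2\zeta\omega\,\dot{y}(t) + \omega^{2} y(t) = \omega^{2} u(t)

% Generic difference-equation approximation used for parameter estimation,
% where u[n] is the HMM-modelled piece-wise constant note sequence:
y[n] = -a_{1}\,y[n-1] - a_{2}\,y[n-2] + b_{0}\,u[n]
```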
Most multi-pitch algorithms are tested for performance only in voiced regions of speech and are prone to yield pitch estimates even when the participating speakers are unvoiced. This paper presents a multi-pitch algorithm that detects the voiced and unvoiced regions in a mixture of two speakers, identifies the number of speakers in voiced regions, and yields the pitch estimates of each speaker in those regions. The algorithm relies on the 2-dimensional AMDF for estimating the periodicity of the signal and uses the temporal evolution of the 2-D AMDF to estimate the number of speakers present in periodic regions. Evaluation of this algorithm on a frame-wise basis demonstrates accurate voiced/unvoiced decisions and gives pitch estimation results comparable to the state of the art. The pitch estimation errors are quantitatively analyzed and shown to result partly from speaker domination and pitch matching between speakers.
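(For reference, the standard one-dimensional AMDF on which such periodicity measures build is sketched below; the paper's two-dimensional extension over a pair of candidate lags is not reproduced here.)

```python
import numpy as np

def amdf(frame, min_lag, max_lag):
    """Average magnitude difference function of one analysis frame.

    Periodic signals produce deep valleys at lags equal to multiples of
    the pitch period; the lag of the deepest valley is a pitch candidate.
    Requires max_lag < len(frame).
    """
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    values = []
    for lag in range(min_lag, max_lag + 1):
        diff = np.abs(frame[: n - lag] - frame[lag:])
        values.append(diff.mean())
    return np.array(values)
```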
In this paper, we present an approach for tracking the pitch of two simultaneous speakers. Using a well-known feature extraction method based on the correlogram, we track the resulting data using a factorial hidden Markov model (FHMM). In contrast to the recently developed multipitch determination algorithm [1], which is based on an HMM, we can accurately associate estimated pitch points with their corresponding source speakers. We evaluate our approach on the "Mocha-TIMIT" database [2] of speech utterances mixed at 0 dB, and compare the results to the multipitch determination algorithm [1] used as a baseline. Experiments show that our FHMM tracker yields good performance for both pitch estimation and correct speaker assignment.
In this paper, a new cochannel speech separation algorithm using multi-pitch extraction and speaker-model-based sequential grouping is proposed. After auditory segmentation based on onset and offset analysis, a robust multi-pitch estimation algorithm is applied to each segment and the corresponding voiced portions are segregated. A speaker-pair model based on support vector machines (SVM) is then employed to determine the optimal sequential grouping alignments and to group the speaker-homogeneous segments into pure speaker streams. Systematic evaluation on the speech separation challenge database shows significant improvement over the baseline performance.
All fundamental frequency estimators based on spectral analysis rely heavily on a proper selection of the harmonics of the voice being analyzed. Since, in practice, spectral peaks pertaining to sources external to the speech under consideration may be present in the signal, various schemes have been designed to ensure a satisfactory elimination of non-pertinent harmonics.
A new project on multi-modal analysis of poster sessions is introduced. We have designed an environment dedicated to recording poster conversations using multiple sensors and have collected a number of sessions, which are annotated with a variety of multi-modal information, including utterance units for individual speakers, backchannels, nodding, gazing, and pointing. Automatic speaker diarization, that is, a combination of speech activity detection and speaker identification, is conducted using a set of distant microphones, and reasonable performance is obtained. We then investigate automatic classification of conversation segments into two modes: presentation mode and question-answer mode. Preliminary experiments show that multi-modal features based on nonverbal behaviors play a significant role in indexing this kind of conversation.
This paper deals with an HMM-based automatic phonetic segmentation (APS) system and proposes to increase its performance by employing a pitch-synchronous (PS) coding scheme. Such a coding scheme uses different frames of speech throughout voiced and unvoiced speech regions and thus enables better modelling of each individual phone. The PS coding scheme is shown to outperform the traditionally utilised pitch-asynchronous (PA) coding scheme on two corpora of Czech speech (one female and one male), both for a base (non-refined) APS and for a CART-refined APS. Better results were observed for each of the voicing-dependent boundary types (unvoiced-unvoiced, unvoiced-voiced, voiced-unvoiced and voiced-voiced).
This paper describes two experimental protocols for the direct comparison of human and machine phonetic discrimination performance in continuous speech. These protocols attempt to isolate phonetic discrimination while eliminating language and segmentation biases. Results of two human experiments are described, including comparisons with automatic phonetic recognition baselines. Our experiments suggest that, in conversational telephone speech, human performance on these tasks exceeds that of machines by 15%. Furthermore, in a related controlled language model experiment, human subjects were 45% better at correctly predicting words in conversational speech.
In order to accelerate the adoption of speech recognition systems by the public, understanding the characteristics of speech in real environments is one of the most important issues. This paper reports on variations of speech characteristics in a car environment. To analyze speech characteristics in this specific environment, a corpus recorded carefully in terms of equality of utterances and conditions across the whole set of speakers is necessary. We created a new corpus named "Drivers' Japanese Speech Corpus in a Car Environment (DJS-C)", composed of utterances of words useful for the operation of in-vehicle information appliances. Analysis of the DJS-C corpus shows that differences in speech characteristics are diverse among drivers and change with driving conditions. Quantitative analysis and speech recognition experiments show that performance degrades due to Distance between Phonemes, Uniqueness of Speaker's Voice, and SNNR.
This paper presents the steps needed to make a corpus of Dutch spontaneous dialogues accessible for automatic phonetic research aimed at increasing our understanding of reduction phenomena and the role of fine phonetic detail. Since the corpus was not created with automatic processing in mind, it needed to be reshaped. The first part of this paper describes the actions needed for this reshaping in some detail. The second part reports the results of a preliminary analysis of the reduction phenomena in the corpus. For this purpose, a phonemic transcription of the corpus was created by means of forced alignment, first with a lexicon of canonical pronunciations and then with multiple pronunciation variants per word. In this study, pronunciation variants were generated by applying, to the canonical pronunciations of the words, a large set of phonetic processes that have been implicated in reduction. This relatively straightforward procedure allows us to produce plausible pronunciation variants and to verify and extend the results of previous reduction studies reported in the literature.
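(A minimal sketch of this kind of rule-based variant generation is given below; the phone symbols and reduction processes are purely hypothetical examples, not the study's actual rule set.)

```python
def generate_variants(canonical, processes):
    """Generate pronunciation variants by applying reduction processes.

    canonical : list of phone symbols (canonical pronunciation)
    processes : list of functions, each mapping a phone sequence to a
                reduced phone sequence, or None if the rule does not apply
    """
    variants = {tuple(canonical)}
    for proc in processes:
        for v in list(variants):
            reduced = proc(list(v))
            if reduced is not None:
                variants.add(tuple(reduced))
    return [list(v) for v in variants]

# Hypothetical reduction rules: schwa deletion and word-final /t/ deletion
drop_schwa = lambda p: [x for x in p if x != "@"] if "@" in p else None
drop_final_t = lambda p: p[:-1] if p and p[-1] == "t" else None

# Hypothetical canonical transcription ("m o m @ n t")
print(generate_variants(["m", "o", "m", "@", "n", "t"],
                        [drop_schwa, drop_final_t]))
```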
This paper describes the ECESS evaluation campaign on voice activity and voicing detection. Standard VAD classifies a signal into speech and non-speech; we extend it to VAD+, which classifies a signal as a sequence of non-speech, voiced and unvoiced segments. The evaluation is performed on a portion of the Spanish SPEECON database with manually labeled segmentation. To avoid errors caused by the limited precision of manual labeling, we introduce "dead zones", i.e. tolerance intervals of ±5 ms around label changes in the data set. The signal is not evaluated within these tolerance intervals.
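(A minimal sketch of frame-wise scoring with such tolerance intervals is given below, assuming a hypothetical frame rate and label layout; this is not the campaign's official scoring tool.)

```python
def score_with_dead_zones(ref, hyp, frame_ms=1.0, dead_ms=5.0):
    """Frame-wise accuracy that ignores frames near reference label changes.

    ref, hyp : per-frame label sequences ('non-speech', 'voiced', 'unvoiced')
    Frames within +-dead_ms of a reference label change are excluded.
    """
    half = int(round(dead_ms / frame_ms))
    changes = [i for i in range(1, len(ref)) if ref[i] != ref[i - 1]]
    dead = set()
    for c in changes:
        dead.update(range(max(0, c - half), min(len(ref), c + half)))
    scored = [i for i in range(len(ref)) if i not in dead]
    correct = sum(1 for i in scored if ref[i] == hyp[i])
    return correct / len(scored) if scored else float("nan")
```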
In this paper we describe WikiSpeech, a content management system for the web-based creation of speech databases for the development of spoken language technology and basic research. Its main features are full support for the typical recording, annotation and project administration workflow, easy editing of the speech content, plus a fully localizable user interface.
This paper presents the results of a set of experiments assessing the perceived quality of the Polish version of the BOSS unit selection synthesis system. The experiments aimed to evaluate the potential improvement of synthesis quality by three factors pertaining to corpus structure and coverage as well as levels of corpus annotation. The three factors affecting synthesis quality were (i) manual vs. automatic corpus annotation, (ii) coverage of CVC triphones in rich intonational patterns, and (iii) coverage of complex consonant clusters. Results indicate that a manual correction of automatic annotations enhances synthesis quality. Increased coverage of CVC sequences and consonant clusters also improved the perceived synthesis quality, but the effect was smaller than anticipated.
Live interoperation of several speech- and text-processing engines is key to tasks such as real-time cross-language story segmentation, topic clustering, and captioning of video. One requirement for interoperation is a common data format shared across engines, so that the output of one can be understood as the input of another. The GALE Type System (GTS) has been created to serve this purpose for interoperating language-identification, speaker-recognition, speech-recognition, named-entity-detection, translation, story-segmentation, topic-clustering, summarization, and headline-generation engines in the context of the Unstructured Information Management Architecture (UIMA). GTS includes types designed to bridge across the domains of these engines, for example, linking the text-only domain of translation to the time-domain types needed for speech processing, and the monolingual domain of information-extraction engines to the cross-language types needed for translation.
Efficiently selecting a minimum amount of text from a large-scale text corpus so as to achieve maximum coverage of certain units is an important problem in the spoken language processing area. In this paper, this text selection problem is first formulated as a maximum coverage problem with a knapsack constraint (MCK). An efficient rank-predicted pseudo-greedy approach is then proposed to solve this problem. Experiments on a Chinese text selection task are conducted to verify the efficiency of the proposed approach. Experimental results show that our approach significantly improves the text selection speed without sacrificing the coverage score, compared with the traditional greedy approach.
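(For context, the standard greedy baseline for this kind of coverage-under-budget selection, against which a rank-predicted pseudo-greedy method would be compared, can be sketched as follows; the unit and cost functions here are placeholders.)

```python
def greedy_select(sentences, budget, units_of, cost_of):
    """Greedy baseline for maximum coverage under a knapsack constraint.

    sentences : list of candidate sentences
    budget    : total allowed cost (e.g. number of characters)
    units_of  : function returning the set of units (e.g. triphones) covered
    cost_of   : function returning the cost of a sentence
    Picks, at each step, the sentence with the best ratio of newly covered
    units to cost, until no affordable sentence adds coverage.
    """
    covered, selected, spent = set(), [], 0
    remaining = list(sentences)
    while remaining:
        best, best_ratio = None, 0.0
        for s in remaining:
            c = cost_of(s)
            if c <= 0 or spent + c > budget:
                continue
            gain = len(units_of(s) - covered)   # marginal coverage gain
            ratio = gain / c
            if ratio > best_ratio:
                best, best_ratio = s, ratio
        if best is None or best_ratio == 0.0:
            break
        selected.append(best)
        covered |= units_of(best)
        spent += cost_of(best)
        remaining.remove(best)
    return selected
```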