IETE Technical Review
ARTICLE
Year: 2009 | Volume: 26 | Issue: 2 | Pages: 108-114

Automatic Evaluation of Fluency in Spoken Language


 Signal Analysis and Interpretation Laboratory (SAIL), Electrical Engineering Department, University of Southern California, Los Angeles, CA 90089-2564, USA

Correspondence Address:
Kartik Audhkhasi
Signal Analysis and Interpretation Laboratory (SAIL), Electrical Engineering Department, University of Southern California, Los Angeles, CA 90089-2564
USA

Source of Support: None, Conflict of Interest: None


doi:10.4103/0256-4602.49093



   Abstract 

This paper introduces the problem of automatic evaluation of spoken language fluency, and reviews some techniques proposed in the literature to solve it. In the first section, the need for automatically evaluating human spoken language skills is motivated through real life scenarios ranging from call centers to kindergarten classes. This is followed by a definition of the problem. Next, some important works in the domain are presented. Definitions of various relevant terms from speech and language processing subjects are intertwined in the paper to aid the uninitiated reader. This paper concludes by discussing the research challenges in the area.

Keywords: Fluency evaluation, Language learning, Prosodic features, Lexical features, Speech processing, Automatic speech recognition.


How to cite this article:
Audhkhasi K. Automatic Evaluation of Fluency in Spoken Language. IETE Tech Rev 2009;26:108-14

How to cite this URL:
Audhkhasi K. Automatic Evaluation of Fluency in Spoken Language. IETE Tech Rev [serial online] 2009 [cited 2010 Feb 9];26:108-14. Available from: http://tr.ietejournals.org/text.asp?2009/26/2/108/49093


   1. Introduction


The need to subjectively evaluate various spoken language parameters like grammar, pronunciation, speaking rate and fluency arises in many real-life problems. One typical scenario is a call center, where customer satisfaction is the top priority. The agents are continuously evaluated and trained in their spoken language fluency skills. Human assessors subject them to various tests, which include both reading and extempore tasks, and rate them on their performance. The assessors judge the agents on a variety of criteria, such as speaking rate, the number of unfilled pauses (silence) and filled pauses (meaningless words like "UMM" and "AHH"), and the repetitiveness of sentences and thoughts (in an extempore task).

Two conclusions follow about this evaluation process. First, the process is inherently subjective, and most of the time, the assessors use their intuitive notion of what constitutes ideal speech. Second, this subjective evaluation is both time-consuming and expensive. The assessors must themselves be trained in the art of evaluating spoken language skills, and the evaluation process requires their physical presence. This motivates the need for automatic algorithms that assess spoken language parameters without intervention by human assessors. Automatic evaluation of spoken language parameters is thus an extremely important and active research area. A prominent example of such research is the Technology Based Assessment of Language and Literacy (TBALL) [1] project at the University of Southern California, where attempts have been made to devise automatic algorithms for evaluating the English literacy skills of Mexican-American children in grades K-2. The evaluation tests designed include phonemic awareness, letter naming and sounding, isolated word reading, story reading and listening comprehension.


   2. Problem Definition


Before giving the problem definition, it is important to understand the structure of a typical spoken language disfluency. Levelt [2] proposed what is now called the Repair Interval Model (RIM) of a disfluency, as shown in [Figure 1]. According to this model, disfluent speech consists of three main segments: the reparandum, the edit phase and the alteration. The speaker formulates thoughts in his/her mind about what he/she will speak. But, in the middle of the utterance, he/she may feel the need for a correction, since the utterance may be grammatically incorrect, conceptually incoherent or even not what he/she wanted to convey in the first place. This correction manifests itself as a discontinuity in the normal flow of words and ideas in the sentence, and causes a disfluency. According to the RIM, the reparandum is the region before the actual interruption which the speaker intends to replace. Next, the edit phase follows, where the speaker is thinking of a replacement for the reparandum. This phase is characterized by unfilled pauses (e.g. silence, breath), filled pauses (meaningless words like "UMM" and "AHH") or discourse markers (e.g. "like", "you know", "actually"). It must be noted that the discourse markers are also perfectly valid words in fluent speech. The alteration is the region where the correction has been made by the speaker; thus, fluent speech has resumed. The short time segment between the reparandum and the edit phase is referred to as the interruption point, since it signifies the interruption or departure from fluency. [Table 1] gives some examples of transcriptions of disfluent speech segments, and classifies the various regions into the reparandum, interruption point, edit phase and alteration.
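The segmentation described by the RIM can be written down as a small data structure. The following is a minimal sketch, not taken from [2]; the class name `RepairInterval`, its methods and the example utterance are all invented for illustration and do not come from [Table 1].

```python
from dataclasses import dataclass


@dataclass
class RepairInterval:
    """One disfluency, split into the three RIM segments (lists of tokens)."""
    reparandum: list   # words the speaker intends to replace
    edit_phase: list   # filled pauses, discourse markers, silence tokens
    alteration: list   # the correction; fluent speech resumes here

    def interruption_index(self) -> int:
        """Token offset of the interruption point (end of the reparandum)."""
        return len(self.reparandum)

    def as_spoken(self) -> list:
        """The disfluent token sequence as it was uttered."""
        return self.reparandum + self.edit_phase + self.alteration

    def cleaned(self) -> list:
        """Drop the reparandum and the edit phase, keeping the alteration."""
        return list(self.alteration)


# Invented example utterance (hypothetical, for illustration only).
prefix = ["show", "me", "flights"]
repair = RepairInterval(
    reparandum=["to", "boston"],
    edit_phase=["UMM", "i", "mean"],
    alteration=["to", "denver"],
)
spoken = prefix + repair.as_spoken()
fluent = prefix + repair.cleaned()
```

Here `spoken` holds the disfluent utterance and `fluent` the version a cleaned transcript would contain.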

There are two broad categories of algorithms which deal with the problem of automatically evaluating disfluent speech. A majority of the techniques belong to the first class, and aim at detecting the precise time locations where the speaker has become disfluent [3],[4],[5],[6],[7]. The motive behind such algorithms is primarily to enable rich transcription of the Automatic Speech Recognition (ASR) system output. ASR systems have two major components: the Acoustic Model (AM) and the Language Model (LM) [8]. The AM models the statistics of the spectral characteristics of the speech signal, and the LM models the structure of the language by capturing the probabilities of various word sequences. The text used for training the LM is normally clean and devoid of any disfluency markers and partial words. Hence, the performance of the ASR system degrades during a disfluent segment, and the output of the system is often meaningless text. Rich transcription of the ASR output denotes the process of providing additional layers of information apart from the simple recognition hypothesis. In the case of disfluent speech, this layer would include tags indicating the regions where the recognition hypothesis is likely to be incorrect. A related area is finding confidence measures for ASR [9],[10],[11], which provide estimates of the correctness of the various words in the recognition hypothesis.

The second class of algorithms estimates objective scores indicative of the fluency of a speaker [12],[13],[14]. Most of these algorithms build upon the knowledge of the disfluency locations. The performance measure of such algorithms is the similarity of the objective scores to the subjective ones assigned by the human assessors. This is in contrast to algorithms of the first class, where the accuracy of the location of disfluent segments is the main performance criterion. Although the focus of this paper is the second class, techniques from the first class will also be presented. The next section starts by discussing the acoustic-phonetic consequences of disfluent speech. An understanding of these physiologically-driven phenomena is critical for deriving features for detecting disfluent segments.


   3. Techniques


3.1 Acoustic-Phonetic Phenomena During Disfluencies

Shriberg [15] discusses the various acoustic-phonetic phenomena which can be observed in different segments of the RIM. An understanding of these phenomena helps in choosing prosodic (speech signal-based) features for evaluating spoken language fluency. Some of the important phenomena during the reparandum and the edit phase are mentioned below:

1) During the reparandum, lengthening of syllables takes place. A syllable is a unit of speech, and consists of a nucleus (normally a vowel), often preceded and/or followed by consonants. It is pointed out in [15] that the intonation patterns are preserved even in the presence of this syllable lengthening. Intonation is an acoustic-phonetic term which refers to the dynamics of the pitch or the fundamental frequency of the speech signal. It is the pitch which is responsible for the shrillness of voice.

2) Another important phenomenon in the reparandum is the word cutoff. Since the reparandum is the segment of speech which the speaker intends to replace, he/she normally cuts off the sentence in the middle of some word. Detection of these partial words is a challenging problem not only in automatic fluency evaluation, but also in ASR. Since these words are not part of the lexicon, the ASR system substitutes them with other valid words from the dictionary. This invariably changes the meaning of the transcribed sentence. Automatic detection of such partial words enables rich annotation of the transcripts, which adds a layer of information about the confidence in the ASR hypothesis.

3) The edit phase, which is often marked by the presence of filled pauses, is also characterized by lengthening of the duration of words. For example, "and" is a common word in English, but it is also often used as a filled pause ("and-AAH"). The duration of this version of "and" is typically much longer than that of the fluent version.

4) In addition to syllable lengthening, the pitch trajectory also becomes nearly flat in the edit phase [15], with a slightly negative slope towards the end of the phase. As will be discussed later, the pitch trajectory can be used to extract prosodic features for the detection of filled pauses and elongated words. In [6], it has been demonstrated that the trajectories of the formants (which are the resonance frequencies of the human vocal tract) are an even more robust indicator of the occurrence of filled pauses.

Although the above acoustic-phonetic phenomena can aid in the development of several prosodic features for automatic fluency evaluation, the features conventionally reported in the literature are based on the text output from an ASR system [4],[5]. This enables the use of advanced text processing techniques like Part Of Speech (POS) tagging [16], but also makes the features susceptible to noise in the ASR output. As will be discussed later, a combination of both prosodic and lexical features not only leads to an improvement in performance of automatic fluency evaluation systems, but also makes them more robust. The following two subsections discuss both these types of features, followed by a discussion of different algorithms utilizing either one or a combination of both for evaluation of fluency.

3.2 Prosodic Features

In the RIM, the edit phase is perhaps the section where the prosodic effects of the disfluency are most apparent. As explained earlier, the edit phase is characterized by the presence of filled pauses and elongated words, among other phenomena such as silence, breathiness or even laughter. The first step of any technique to extract prosodic features for automatic fluency evaluation is detection of the location and duration of these filled pauses and silence intervals. Of these, the detection of silence is a well-studied problem in speech processing, namely the design of Voice Activity Detectors (VADs), and is not in the scope of this paper; [17] provides a good overview of the various VADs proposed in the literature. A simple VAD can be designed using features such as the zero crossing rate, the short-time energy and the characteristics of the autocorrelation function of the speech frame; [8] explains the rationale behind the use of most of these features, and techniques for extracting them. We will focus on the detection of filled pauses, since they are the most common indicators of the edit phase. The authors of [5] note that in the conversational speech Switchboard database, around 40% of all disfluencies contain a filled pause. Thus, the detection of filled pauses is crucial for deriving fluency-sensitive prosodic features.
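As an illustration of the simple VAD mentioned above, the following sketch thresholds the short-time energy and the zero crossing rate of each frame. It is a toy under stated assumptions and not one of the detectors surveyed in [17]: the frame length, hop size and both thresholds are arbitrary illustrative values, the autocorrelation feature is omitted, and keeping only low zero-crossing frames favours voiced speech, so a practical VAD would need to treat unvoiced sounds more carefully.

```python
import numpy as np


def frame_signal(x, frame_len, hop):
    """Slice a 1-D signal (len(x) >= frame_len) into overlapping frames."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])


def short_time_energy(frames):
    """Mean squared amplitude of each frame."""
    return np.mean(frames ** 2, axis=1)


def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs in each frame that change sign."""
    signs = np.sign(frames)
    signs[signs == 0] = 1  # treat exact zeros as positive
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)


def simple_vad(x, frame_len=400, hop=160, energy_ratio=0.1, zcr_max=0.5):
    """Mark a frame as speech if it is energetic and not noise-like.

    energy_ratio and zcr_max are illustrative thresholds (assumptions).
    """
    frames = frame_signal(np.asarray(x, dtype=float), frame_len, hop)
    energy = short_time_energy(frames)
    zcr = zero_crossing_rate(frames)
    return (energy > energy_ratio * energy.max()) & (zcr < zcr_max)


# Synthetic test signal: half a second of faint noise, then a 200 Hz tone.
rate = 16000
rng = np.random.default_rng(0)
silence = 0.001 * rng.standard_normal(rate // 2)
tone = 0.5 * np.sin(2 * np.pi * 200 * np.arange(rate // 2) / rate)
is_speech = simple_vad(np.concatenate([silence, tone]))
```

The boolean array `is_speech` has one entry per frame; runs of `False` frames give the location and duration of the silence intervals.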

The acoustic-phonetic phenomena in the previous subsection provide important cues for detecting filled pauses. In [7] and [15], it has been noted that the pitch trajectory remains almost flat during filled pauses. Pitch is defined as the fundamental frequency of vibration of the vocal cords during voiced speech [8]. There are several pitch tracking algorithms, implementations of which are available online (e.g. in Wavesurfer [18] and Praat [19]). The flatness of the pitch trajectory is captured by computing its standard deviation over a window of some fixed number of frames. Regions of fluent speech contain intonation, which is defined as the modulation of the pitch trajectory. This results in a greater standard deviation of the pitch over the analysis window for fluent speech, and helps in identifying filled pauses and elongated words. Another technique for detecting filled pauses is based on the cepstrum [8] of the speech signal [20]. The authors of [20] define a cepstral instability measure $C_i$, computed from the Mel Frequency Cepstral Coefficients (MFCCs) [8] of frames within a distance of $N/2$ frames from the $i$-th frame. It is computed as given in equations (1) and (2). $M_j$ denotes the MFCC vector for the $j$-th frame, and $\|\cdot\|$ denotes the Euclidean norm. Since the spectral characteristics of speech remain nearly constant during a filled pause, the frames for filled pauses will have nearly identical MFCCs, and thus a lower cepstral instability measure.
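The idea behind a cepstral instability measure can be sketched as follows. This is only one plausible form consistent with the description, assuming the measure is the mean Euclidean distance between the MFCC vectors in a window of ±N/2 frames and their mean; the exact definition used in [20] may differ. The function name and the synthetic "MFCC" data are hypothetical.

```python
import numpy as np


def cepstral_instability(mfcc, n_window):
    """Per-frame spread of MFCC vectors inside a window of +/- n_window/2 frames.

    mfcc: array of shape (n_frames, n_coeffs).
    Assumed form (not necessarily that of [20]): mean Euclidean distance
    of the window's vectors from the window mean.
    """
    mfcc = np.asarray(mfcc, dtype=float)
    half = n_window // 2
    n_frames = len(mfcc)
    out = np.empty(n_frames)
    for i in range(n_frames):
        lo, hi = max(0, i - half), min(n_frames, i + half + 1)
        window = mfcc[lo:hi]
        centre = window.mean(axis=0)
        out[i] = np.linalg.norm(window - centre, axis=1).mean()
    return out


# Synthetic stand-in for MFCCs: 20 changing frames, then 20 identical frames.
rng = np.random.default_rng(0)
varying = rng.standard_normal((20, 13))
steady = np.tile(rng.standard_normal(13), (20, 1))
c = cepstral_instability(np.vstack([varying, steady]), n_window=6)
```

Frames in the steady stretch receive a value near zero, mimicking the low instability expected during a filled pause. The same windowed-spread idea applies to the pitch trajectory, with the standard deviation of a scalar track in place of the vector distance.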





Recently, it has been proposed that formant frequencies, which are defined as the resonant frequencies of the human vocal tract during voiced speech, are better indicators of filled pauses than the pitch or the cepstral coefficients [6]. Since the edit phase represents a region where the speaker is thinking of the next set of words or thoughts to use, the speech articulators (like the tongue and lips) do not change their position. Hence, the shape of the vocal tract is nearly constant during a filled pause, and the formant trajectories are nearly flat. In [6], it has been shown that the standard deviation of the pitch trajectory has a much lower dynamic range than that of the formants. This implies poorer discrimination between normal words and filled pauses using a pitch-based approach. The block diagram of the algorithm proposed in [6] is given in [Figure 2].

The speech signal is first passed through a VAD, and its formant frequencies are computed using Wavesurfer [18]. The standard deviations of the first and second formants, $\sigma(F_1)$ and $\sigma(F_2)$, are computed for every frame over a window of $\pm N/2$ frames. The training is done on a database of real-life recordings of call center agents, in which the locations of the filled pauses have been hand marked. For both $\sigma(F_1)$ and $\sigma(F_2)$, two probability density functions (pdfs) are trained: $p_{\mathrm{norm}}$ for normal frames and $p_{\mathrm{FP}}$ for filled pauses. In the test stage, given a frame to classify, its $\sigma(F_1)$ and $\sigma(F_2)$ are computed, and the Log Likelihood Ratio (LLR) for both these standard deviations is found. The LLR computation for $\sigma(F_1) = x$ is given in (3), and a similar equation can be written for $\sigma(F_2)$.

$$\mathrm{LLR}(x) = \log \frac{p_{\mathrm{FP}}(x)}{p_{\mathrm{norm}}(x)} \qquad (3)$$

The frame is classified as belonging to the filled pause class if both the LLRs are greater than some pre-trained thresholds. A final stage performs decision smoothing by preventing very short time segments from being classified as filled pauses. The proposed method is shown to outperform both pitch-based and cepstral stability-based algorithms.
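The test stage described above can be sketched in a few lines. This is a hedged illustration rather than the implementation of [6]: the pdfs are assumed to be Gaussian, their parameters, the thresholds, the window length and the minimum segment length are invented values, and the formant tracks are synthetic.

```python
import numpy as np


def sliding_std(track, n_window):
    """Standard deviation of a formant track over +/- n_window/2 frames."""
    track = np.asarray(track, dtype=float)
    half = n_window // 2
    return np.array([track[max(0, i - half): i + half + 1].std()
                     for i in range(len(track))])


def gaussian_logpdf(x, mean, std):
    """Log density of a Gaussian; the Gaussian form is an assumption."""
    return -0.5 * np.log(2 * np.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def llr(x, fp_params, normal_params):
    """log p_FP(x) - log p_norm(x); positive values favour a filled pause."""
    return gaussian_logpdf(x, *fp_params) - gaussian_logpdf(x, *normal_params)


def remove_short_runs(mask, min_frames):
    """Decision smoothing: clear runs of True shorter than min_frames."""
    out = mask.copy()
    start = None
    for i, flag in enumerate(list(mask) + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start < min_frames:
                out[start:i] = False
            start = None
    return out


def detect_filled_pauses(f1, f2, models, n_window=10,
                         thresholds=(0.0, 0.0), min_frames=5):
    """Flag a frame only if the LLRs of both formant spreads pass."""
    s1, s2 = sliding_std(f1, n_window), sliding_std(f2, n_window)
    llr1 = llr(s1, models["f1_fp"], models["f1_normal"])
    llr2 = llr(s2, models["f2_fp"], models["f2_normal"])
    raw = (llr1 > thresholds[0]) & (llr2 > thresholds[1])
    return remove_short_runs(raw, min_frames)


# Invented (mean, std) parameters in Hz for the spread of each formant.
models = {"f1_fp": (10.0, 10.0), "f1_normal": (100.0, 40.0),
          "f2_fp": (10.0, 10.0), "f2_normal": (100.0, 40.0)}
# Synthetic tracks: strongly varying formants with a flat stretch in frames 30-59.
frames = np.arange(90)
wobble = 200.0 * (-1.0) ** frames
f1 = 500.0 + wobble
f2 = 1500.0 + wobble
f1[30:60] = 600.0
f2[30:60] = 1400.0
mask = detect_filled_pauses(f1, f2, models)
smoothed = remove_short_runs(np.array([True, True, False, True, True, True]), 3)
```

Requiring both LLRs to exceed their thresholds makes the detector conservative, which fits the intent of the smoothing stage: brief flat stretches inside ordinary words are not reported as filled pauses.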



Once the location and duration of filled pauses and silence have been accurately determined, one can derive several features which capture their statistics. [Table 2] lists some prosodic features used in [12]. It has been reported in [12] that a Minimum Mean Squared Error (MMSE) linear combination of these features gives a correlation coefficient of 0.546 with the subjective fluency score. The next subsection discusses some commonly used lexical (or text-based) features in the domain of automatic fluency evaluation.
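The evaluation just described, an MMSE linear combination of features followed by a correlation with the subjective score, can be sketched with ordinary least squares, which is the sample version of the MMSE linear estimator. The data below are random numbers generated purely for illustration; they are not the features of [Table 2] and do not reproduce the 0.546 figure.

```python
import numpy as np


def fit_mmse_weights(features, scores):
    """Least-squares weights (plus a bias term) mapping features to scores."""
    X = np.column_stack([features, np.ones(len(features))])
    w, *_ = np.linalg.lstsq(X, scores, rcond=None)
    return w


def predict(features, w):
    """Objective score as the learned linear combination of the features."""
    X = np.column_stack([features, np.ones(len(features))])
    return X @ w


def pearson(a, b):
    """Pearson correlation coefficient between two score vectors."""
    return float(np.corrcoef(a, b)[0, 1])


# Synthetic data: 50 speakers, 3 made-up prosodic features each.
rng = np.random.default_rng(0)
features = rng.standard_normal((50, 3))
subjective = features @ np.array([1.0, -0.5, 0.2]) + 0.5 * rng.standard_normal(50)

weights = fit_mmse_weights(features, subjective)
r = pearson(predict(features, weights), subjective)
```

Note that `r` here is measured on the same data used for fitting; a held-out set of speakers would give a more honest estimate.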

3.3 Lexical Features

The Part Of Speech (POS) tag is one of the most common lexical features used for automatically detecting disfluent speech [3],[4]. POS tagging [16] is the process of annotating the part of speech for each word in a sentence. Examples of parts of speech are nouns, verbs and adjectives. Both the definition of the word and its context are taken into consideration while assigning its POS tag. This is essential, since in natural languages, the same word may have a different POS depending on the context. For example, consider the following two sentences containing the same word "rally" in two different contexts:

SENTENCE 1: "The first rally of his election campaign will be held tomorrow."

SENTENCE 2: "The General needs to rally his troops if he were to defeat the strong enemy."

In the first sentence, "rally" is a noun, while in the second, it is a verb. This simple example clearly shows the importance of context in POS tagging. Modern day POS taggers can achieve very high accuracies on clean text. Another commonly used lexical feature is the word identity itself. For example, "you know" is a common phrase in fluent speech, but also a very commonly used discourse marker. So if, for a given speaker, "you know" occurs more frequently than in normal English, it is very likely to be a discourse marker.

An interesting set of lexical features not utilizing the POS tag is presented in [12], and is given in [Table 3] for reference. As opposed to the output of an ASR system, manual transcriptions have been used to compute these features. Before the computation of these lexical features, the transcripts are pre-processed to remove common stop words, which occur very frequently in the English language (e.g. "a", "and", "I", "is", "of"). Next, the transcripts are passed through the Porter stemmer [21], which maps the various versions of a word to their base form. It has been reported in [12] that the lexical features are better correlates of the subjective fluency score than prosodic features. It is, however, expected that this correlation will drop upon using the ASR system output instead of human transcriptions.
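The pre-processing pipeline can be illustrated with a toy example. Everything in this sketch is an assumption made for illustration: the stop word list is a tiny sample, `toy_stem` is a crude suffix stripper standing in for (and not implementing) the Porter stemmer [21], and `repetition_ratio` is a hypothetical feature, not one of the features of [Table 3].

```python
# Tiny illustrative stop word list (an assumption, not the list used in [12]).
STOP_WORDS = {"a", "and", "i", "is", "of", "the", "to"}


def toy_stem(word):
    """Crude suffix stripper; a stand-in for the Porter stemmer, not a port of it."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(transcript):
    """Lowercase, strip punctuation, drop stop words, then stem."""
    tokens = [t.strip('.,!?"').lower() for t in transcript.split()]
    return [toy_stem(t) for t in tokens if t and t not in STOP_WORDS]


def repetition_ratio(tokens):
    """Hypothetical feature: share of tokens that repeat an earlier token."""
    return 1 - len(set(tokens)) / len(tokens) if tokens else 0.0


tokens = preprocess("I booked the tickets and I booked the hotel.")
ratio = repetition_ratio(tokens)
```

Stemming before counting ensures that inflected forms such as "booked" and "booking" are treated as repetitions of the same base word.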

3.4 Algorithms

As mentioned previously, the algorithms in the area of automatic fluency evaluation can be split into two categories based on the objective they are trying to achieve (rich transcription of ASR output versus estimating an objective score of the speaker's fluency). We start with an overview of algorithms from the first category. Snover et al. present a disfluency detection algorithm based entirely on lexical features [4]. They use a technique called Transformation Based Learning (TBL), which is an algorithm to induce rules from training data. The algorithm starts with a fixed set of rule templates. When the labeled data is presented, it selects the rule which leads to the least classification error rate, and applies that rule to the data. In the next iteration, the next best rule is selected, and this process is repeated several times. The final output is an ordered list of N rules which are best able to classify the data. In [4], these rules are conditioned on the word identity, the POS tag, whether the word is followed by silence and whether the word is a high frequency word. Time-aligned reference speech transcriptions form the training data for the algorithm. These transcriptions are annotated with speaker identity, sentence boundaries, edit phase, filled pauses, discourse markers and interruption points. The top 3 rules learnt by the TBL algorithm over a database were as follows [4]:

  • Label all fluent filled pauses as fillers.
  • Label the left side of a simple repeat as an edit.
  • Label "you know" as a filler.
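The greedy rule selection of TBL can be sketched on a toy sequence. This is a simplified illustration and not the system of [4]: only two hypothetical rule templates are used (word identity, and "word is immediately repeated"), every token starts with the label "fluent", and the training sequence is invented.

```python
def candidate_rules(words, gold):
    """Instantiate the two toy templates for every word and every label."""
    vocab = sorted(set(words))
    tags = sorted(set(gold))
    rules = [("word_is", w, t) for w in vocab for t in tags]
    rules += [("repeats_next", None, t) for t in tags]
    return rules


def apply_rule(rule, words, current):
    """Return a relabelled copy of `current` after applying one rule."""
    kind, arg, tag = rule
    out = list(current)
    for i, w in enumerate(words):
        if kind == "word_is" and w == arg:
            out[i] = tag
        elif kind == "repeats_next" and i + 1 < len(words) and w == words[i + 1]:
            out[i] = tag
    return out


def errors(pred, gold):
    """Number of tokens whose label disagrees with the reference."""
    return sum(p != g for p, g in zip(pred, gold))


def learn_tbl(words, gold, initial_tag="fluent", max_rules=5):
    """Greedily pick the rule that leaves the fewest errors, apply it, repeat."""
    current = [initial_tag] * len(words)
    learned = []
    for _ in range(max_rules):
        best = min(candidate_rules(words, gold),
                   key=lambda r: errors(apply_rule(r, words, current), gold))
        new = apply_rule(best, words, current)
        if errors(new, gold) >= errors(current, gold):
            break  # no rule improves the labelling any further
        learned.append(best)
        current = new
    return learned, current


# Invented training sequence with reference labels.
words = "i i want UMM to to go you_know home UMM".split()
gold = ["edit", "fluent", "fluent", "filler", "edit",
        "fluent", "fluent", "filler", "fluent", "filler"]
learned, final = learn_tbl(words, gold)
```

On this toy data the learned list contains three ordered rules that echo the kind of rules reported above: a filled pause rule, a simple-repeat rule and a "you know" rule.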


The error rate is computed by first aligning the output of the algorithm with the true transcriptions, and computing the Minimum Edit Distance (MED) between the two strings. The MED between two strings is defined as the minimum number of insertions, deletions and substitutions needed to convert one to the other. The error rate is then defined as the ratio of the total number of insertions and deletions to the number of disfluency tokens in the true transcription. It must be noted that with this definition, the maximum error rate is not 100%, since the number of insertions can exceed the number of reference tokens. The authors report a 12% error rate for detection of fillers using human transcriptions and 53% using the ASR system output. Clearly, there is a huge drop in performance when the ASR transcriptions are used. The main reason behind this is the noisy nature of modern day ASR output, which adversely affects natural language processing techniques such as POS tagging.
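The MED and the resulting error rate can be computed with a standard dynamic programme. The sketch below is a generic implementation, assuming unit cost for every operation and treating both inputs as plain token sequences; it is not the scoring tool used in [4], and the token sequences are invented.

```python
def align_counts(ref, hyp):
    """Minimum edit distance between ref and hyp with operation counts.

    Returns (cost, substitutions, insertions, deletions) of one
    minimum-cost alignment, with unit cost per operation.
    """
    rows, cols = len(ref) + 1, len(hyp) + 1
    d = [[None] * cols for _ in range(rows)]
    d[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        d[i][0] = (i, 0, 0, i)          # delete every reference token
    for j in range(1, cols):
        d[0][j] = (j, 0, j, 0)          # insert every hypothesis token
    for i in range(1, rows):
        for j in range(1, cols):
            c, s, n, dl = d[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (c, s, n, dl)            # match, no cost
            else:
                diag = (c + 1, s + 1, n, dl)    # substitution
            c, s, n, dl = d[i][j - 1]
            ins = (c + 1, s, n + 1, dl)         # insertion
            c, s, n, dl = d[i - 1][j]
            dele = (c + 1, s, n, dl + 1)        # deletion
            d[i][j] = min(diag, ins, dele)      # tuples compare by cost first
    return d[-1][-1]


def error_rate(ref, hyp):
    """(insertions + deletions) / number of reference tokens."""
    _, _, insertions, deletions = align_counts(ref, hyp)
    return (insertions + deletions) / len(ref)


# Invented example: two reference filler tokens, five hypothesised ones.
rate = error_rate(["FP", "FP"], ["FP", "FP", "FP", "FP", "FP"])
```

In the example, three spurious insertions against two reference tokens give an error rate of 1.5, which illustrates why the measure is not bounded by 100%.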

Liu et al. [3] deal with the problem of meta-data extraction from speech. The various meta-data considered in the work include sentence boundaries, words in the reparandum, filler words (like filled pauses and discourse markers) and interruption points. The main aim of this work is to improve the usability of the ASR generated transcripts by detecting these meta-data and cleaning the transcripts. Both lexical and prosodic knowledge sources are used to derive features. The lexical features include the word identities, neighbouring events of a word, the POS tag and knowledge of repetitions of a word or phrase. These lexical features are similar to the ones used in the TBL based algorithm by Snover et al. [4]. In addition to these lexical features, some prosodic features are also used. These include the normalized duration of all words and silence intervals, pitch-based features like the average pitch during a word, and short-time energy-based features. Some additional knowledge in the form of the speaker's gender is also used for classification. For both training and test, two sets of transcriptions are available: human generated (reference) and ASR output. In the training set, these transcriptions have been annotated with the meta-data events. The problem is posed in the framework of Hidden Markov Models (HMMs) [8]. Let us introduce the following notation:

  • W: The word sequence.
  • F: The features (prosodic and lexical).
  • E: The meta-data events.


The meta-data events E are modeled as the hidden states of the HMM. Hence, given W and F, the task is to determine the most likely state sequence $\hat{E}$. This is given by the following equations:

$$\hat{E} = \arg\max_{E} P(E \mid W, F) = \arg\max_{E} P(W, E, F) = \arg\max_{E} P(W, E)\, P(F \mid W, E)$$

Assuming conditional independence of F and W given E, we can further simplify the above equations as follows:

$$\hat{E} = \arg\max_{E} P(W, E)\, P(F \mid E)$$

The first term $P(W, E)$ is similar to a Language Model (LM) in a conventional ASR system. Instead of simply modeling the sequence of words W, the joint probability of the word sequence W and the sequence of meta-data events E is modeled. It must be noted that this is facilitated by the availability of transcripts annotated with these events. The second term $P(F \mid E)$ is like an Acoustic Model (AM), except that in place of acoustic features (like MFCCs), the conditional probability of fluency-specific lexical and prosodic features is modeled.
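The search for the most likely event sequence can be illustrated with the Viterbi algorithm. The sketch below rests on simplifying assumptions that are not claimed to match [3]: P(W, E) is factorized into first-order event transitions and a per-word term P(w_t | e_t), P(F | E) is factorized per word, there are only two events (fluent and filler), and all probabilities are invented toy values.

```python
import numpy as np


def viterbi(log_init, log_trans, log_emit):
    """Most likely state path for per-step log scores log_emit of shape (T, S).

    log_trans[i, j] is the log probability of moving from state i to state j.
    """
    T, S = log_emit.shape
    score = log_init + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans      # cand[i, j]: from i into j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]


# States: 0 = fluent word, 1 = filler. All numbers are toy values.
words = ["i", "UMM", "want", "you_know", "coffee"]
p_word = np.array([[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.2, 0.8], [0.9, 0.1]])
p_feat = np.array([[0.8, 0.2], [0.3, 0.7], [0.8, 0.2], [0.3, 0.7], [0.8, 0.2]])
init = np.array([0.9, 0.1])
trans = np.array([[0.8, 0.2], [0.7, 0.3]])

# Word term (part of P(W, E)) and feature term P(F | E) combine per step.
log_emit = np.log(p_word) + np.log(p_feat)
events = viterbi(np.log(init), np.log(trans), log_emit)
```

The decoded `events` list assigns one event label to each word, so the positions labelled 1 mark the tokens a cleaned transcript would drop.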

The algorithms belonging to the second class focus their attention on estimating a fluency score for the speaker, which is highly correlated with the subjective score given by the assessors. There have been very few algorithms which deal with this problem. One of these is presented in [12]. A set of prosodic and lexical features (listed in [Table 2] and [Table 3] respectively) is extracted from the speech signal and its human transcription. A database of speech recordings from 112 call center agents is used, where each agent has been assigned a subjective fluency score by human assessors. Classification experiments are performed using the Weka [22] implementation of the Support Vector Machine (SVM). The four classes correspond to fluency scores of 1, 2, 3 and 4 (where 4 represents the highest fluency). Results are also reported for 3-class and 2-class classification. In the former, the classes 2 and 3 are combined into a single class, while the latter considers the two extreme classes (1 and 4) only. Classification accuracies of 84.2%, 71.4% and 53.6% are reported for the 2, 3 and 4 class cases respectively. Clearly, these results are much better than chance accuracy rates, and strengthen the idea that a combination of both prosodic and lexical features is essential for automatically evaluating fluency.
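The three classification setups and their chance levels can be made concrete with a short sketch. The score list below is invented and is not the data of [12]; the class distribution of the actual database is not given here, so the majority-class baseline is shown only as one way to define chance accuracy.

```python
from collections import Counter


def to_three_class(scores):
    """Merge fluency scores 2 and 3 into a single middle class."""
    return ["mid" if s in (2, 3) else ("low" if s == 1 else "high")
            for s in scores]


def to_two_class(scores):
    """Keep only the extreme scores 1 and 4; returns (kept indices, labels)."""
    kept = [i for i, s in enumerate(scores) if s in (1, 4)]
    return kept, ["low" if scores[i] == 1 else "high" for i in kept]


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


# Invented subjective scores for eight speakers.
scores = [1, 2, 2, 3, 4, 4, 3, 1]
four_class_chance = majority_baseline(scores)
three_class_chance = majority_baseline(to_three_class(scores))
kept, two_class_labels = to_two_class(scores)
two_class_chance = majority_baseline(two_class_labels)
```

With balanced classes, chance accuracy would be 50%, about 33% and 25% for the 2, 3 and 4 class cases; merging classes, as in the 3-class setup, generally raises the majority baseline, which is worth keeping in mind when comparing the reported accuracies.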


   4. Summary and Research Challenges


This paper reviewed some works in the domain of automatic evaluation of spoken language fluency. Various types of features (both prosodic and lexical) proposed in the literature were discussed. Two classes of algorithms were presented. The first one included those whose main task was the detection of the exact location of the disfluent events. Algorithms from the second class were focussed on objectively rating the fluency of the speaker. There are several research challenges in this upcoming field. First, the lexical features are based on the ASR generated transcriptions, which normally contain many word errors. This affects the accuracy of the derived features. One interesting area of research would be to robustly derive lexical features from the ASR transcriptions. Another direction of further study could be to incorporate the psychological processes which guide the decisions made by the assessors into the automatic algorithms. For example, human assessors do not pay equal attention to all parts of a sentence. Some sections of the utterance are more salient than others [23]. Modeling of such attention-driven phenomena is expected to yield interesting insights. Work can also be done on deriving new prosodic features, which are not dependent on the output of the ASR system [24].

Author


Kartik Audhkhasi completed his B.Tech. in Electrical Engineering and M.Tech. in Information and Communication Technology from Indian Institute of Technology, Delhi (IITD) in 2008. At present, he is a Ph.D. student at the Signal Analysis and Interpretation Laboratory (SAIL) within the Electrical Engineering Department at the University of Southern California (USC). He is broadly interested in signal processing and machine learning, with an emphasis on speech processing, recognition and human language technologies.

 
   References

1. M. Black, J. Tepperman, S. Lee, P. Price, and S. Narayanan, "Automatic detection and classification of disfluent reading miscues in young children's speech for the purpose of assessment", In Proc. Interspeech ICSLP, Antwerp, Belgium, 2007.
2. W. J. M. Levelt, "Monitoring and self-repair in speech", Cognition, pp. 41-104, 1983.
3. Y. Liu, E. Shriberg, A. Stolcke, D. Hillard, M. Ostendorf, and M. Harper, "Enriching speech recognition with automatic detection of sentence boundaries and disfluencies", IEEE Transactions on Speech and Audio Processing, vol. 14, no. 5, pp. 1526-40, September 2006.
4. M. Snover, B. Dorr, and R. Schwartz, "A lexically-driven algorithm for disfluency detection", In Proc. North American Chapter of the Association of Computational Linguistics, Boston, 2004, pp. 157-160.
5. A. Stolcke, E. Shriberg, R. Bates, M. Ostendorf, D. Hakkani, M. Plauche, G. Tur, and Y. Lu, "Automatic detection of sentence boundaries and disfluencies based on recognized words", In Proc. ICSLP, Sydney, Australia, 1998, pp. 2247-2250.
6. K. Audhkhasi, K. Kandhway, O. D. Deshmukh, and A. Verma, "Formant-based technique for automatic filled-pause detection in spontaneous spoken English", Accepted in ICASSP, Taiwan, 2009.
7. M. Goto, "A real-time filled pause detection system for spontaneous speech recognition", In Proc. Eurospeech, Budapest, Hungary, Sept. 1999, pp. 227-230.
8. L. Rabiner, and B. H. Juang, Fundamentals of speech recognition, Prentice Hall, 1993.
9. H. Jiang, "Confidence measure for speech recognition: A survey", Speech Communication, vol. 45, no. 5, pp. 455-470, 2005.
10. T. J. Hazen, S. Seneff, and J. Polifroni, "Recognition confidence scoring and its use in speech understanding systems", Computer Speech and Language, vol. 16, pp. 49-67, 2002.
11. J. Feng, and A. Sears, "Using confidence scores to improve handsfree speech based navigation in continuous dictation systems", ACM Transactions on Computer-Human Interaction, vol. 11, no. 4, pp. 329-356, 2004.
12. O. D. Deshmukh, K. Kandhway, A. Verma, and K. Audhkhasi, "Automatic evaluation of spoken English fluency", Accepted in ICASSP, Taiwan, 2009.
13. K. Zechner, and I. Bejar, "Towards automatic scoring of non-native spontaneous speech", In Proc. Human Language Technologies Conference, New York, 2006, pp. 216-223.
14. C. Cucchiarini, H. Strik, and L. Boves, "Quantitative assessment of second language learners' fluency by means of automatic speech recognition technology", Journal of Acoustic Society of America, vol. 107, no. 2, pp. 989-999, Feb. 2000.
15. E. Shriberg, "Phonetic consequences of speech disfluency", In International Conference on Phonetics Sciences, San Francisco, 1999, pp. 619-622.
16. E. Charniak, "Statistical techniques for natural language processing", AI Magazine, vol. 18, pp. 33-44, 1997.
17. J. Ramirez, J. M. Gorriz, and J. C. Segura, "Voice Activity Detection: Fundamentals and Speech Recognition System Robustness", in M. Grimm, and K. Kroschel, Robust Speech Recognition and Understanding, pp. 1-22, I-Tech, Vienna, Austria, June 2007.
18. K. Sjolander, and J. Beskow, "Wavesurfer - An open source speech tool", Proc. ICSLP, Beijing, China, 2000, pp. 464-467.
19. P. Boersma, and D. Weenink, "Praat: A system for doing phonetics by computer", http://www.praat.org.
20. F. Stouten, and J-P. Martens, "A feature-based filled pause detection technique for Dutch", In IEEE Intl. Workshop on ASRU, St. Thomas, US Virgin Islands, 2003, pp. 309-314.
21. M. F. Porter, "An algorithm for suffix stripping", Program, pp. 130-137, 1980.
22. S. Garner, "Weka: The waikato environment for knowledge analysis", In Proc. New Zealand Computer Science Research Students Conference, 1995, pp. 57-64.
23. O. Kalinli, and S. Narayanan, "A saliency-based auditory attention model with applications to unsupervised prominent syllable detection in speech", In Proc. Interspeech, Antwerp, Belgium, 2007, pp. 1941-44.
24. P. A. Heeman, and J. F. Allen, "Speech repairs, intonational phrases and discourse markers: Modeling speakers' utterances in spoken dialogue", Computational Linguistics, vol. 25, no. 4, pp. 527-571, 1999.





 

 