<?xml version="1.0" standalone="yes"?> <Paper uid="C90-3032"> <Title>SYNTACTIC NORMALIZATION OF SPONTANEOUS SPEECH*</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> MORPHO-SYNTACTIC DEVIATIONS </SectionTitle> <Paragraph position="0"> Morpho-syntactic deviations make up a considerable proportion of errors in both spoken and written German (German has a much more complex inflectional morphology than English).</Paragraph> <Paragraph position="1"> The basic principle of this approach to normalization is as follows: Try to find out which properties of a given input string make a parse fail, and use the given grammatical knowledge to alter the input string minimally, so that it is as similar as possible to its initial state but without the properties that caused the failure.</Paragraph> <Paragraph position="2"> What is meant by that can easily be seen if we consider an example where the property that makes a parse fail is evident, e.g. the string 'John sleep', which lacks the NP-VP agreement in person and number that is required by the following rule:</Paragraph> <Paragraph position="3"> S -> NP VP, where person(NP) = person(VP), num(NP) = num(VP), and case(NP) = nom</Paragraph> <Paragraph position="4"> This rule is not applicable to 'John sleep', since there are no lexical entries for 'John' and 'sleep', respectively, that have unifiable specifications for person and number, and this makes the whole parse fail.</Paragraph> <Paragraph position="5"> The strategy to account for strings like 'John sleep' consists of three steps: Step 1: Collect all lexical entries that match the words of the input string and generalize them by substituting variables for their morpho-syntactic specifications (case, number, gender, etc.).</Paragraph> <Paragraph position="6"> Step 2: Parse the string using the generalized lexical entries instead of the completely specified entries.</Paragraph> <Paragraph position="7"> Step 3: If the parse with generalized specifications is successful, the problem with the input string is morpho-syntactic (agreement error or
case-assignment violation). Collect all preterminal categories (most of them still contain variable morpho-syntactic specifications) and try to unify them with fully specified lexical entries. At least one matching entry will belong to some item different from the corresponding word in the input string; in that case replace the original word by the matching item. If there are several different sets of matching entries, choose the one that requires the smallest number of substitutions and output it as the default normalization (if several sets of matching entries require the same smallest number of substitutions, the normalization is ambiguous; in that case output all of them).</Paragraph> <Paragraph position="8"> Returning to our example string 'John sleep', let us assume that the grammar consists of just the rule stated above and the following lexical entries:</Paragraph> <Paragraph position="9"> John: person = 3, num = sg, cat = np, case = nom; sleep: person = 3, num = pl, cat = vp; sleeps: person = 3, num = sg, cat = vp</Paragraph> <Paragraph position="10"> Generalizing the lexical entries for the input string 'John sleep' will produce two new entries: John: person = VAR 1, num = VAR 2, cat = np, case = VAR 3; sleep: person = VAR 4, num = VAR 5, cat = vp</Paragraph> <Paragraph position="11"> A parse using these entries will be successful. The application of the rule unifies the variable specifications for number and person and instantiates case nominative in the NP. The preterminal categories resulting from the parse are:</Paragraph> <Paragraph position="12"> John: person = VAR 1, num = VAR 2, cat = np, case = nom; sleep: person = VAR 1, num = VAR 2, cat = vp</Paragraph> <Paragraph position="13"> Though the crucial specifications (person and num) are still variable, the difference is now that the same variables occur in both categories.
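The three steps can be sketched in Python. The toy lexicon, the single agreement rule, and all function names below are illustrative assumptions, not the paper's implementation; steps 1 and 2 are collapsed into keeping only the categories, and step 3 re-instantiates the preterminals with fully specified entries while minimizing word substitutions.

```python
# Toy lexicon from the 'John sleep' example (illustrative, not the paper's).
LEXICON = {
    "John":   {"cat": "np", "person": 3, "num": "sg", "case": "nom"},
    "sleep":  {"cat": "vp", "person": 3, "num": "pl"},
    "sleeps": {"cat": "vp", "person": 3, "num": "sg"},
}

def normalize(words):
    """Return the variant of a two-word NP-VP string with the fewest
    word substitutions that satisfies person/number agreement."""
    # Steps 1-2: generalizing the morpho-syntactic features and parsing
    # with them leaves only the categories fixed; we keep those.
    cats = [LEXICON[w]["cat"] for w in words]
    best = None
    # Step 3: unify the preterminals with fully specified entries.
    for np_w, np_e in LEXICON.items():
        for vp_w, vp_e in LEXICON.items():
            if (np_e["cat"], vp_e["cat"]) != (cats[0], cats[1]):
                continue
            # The rule's agreement constraint: person and number unify.
            if np_e["person"] != vp_e["person"] or np_e["num"] != vp_e["num"]:
                continue
            subs = (np_w != words[0]) + (vp_w != words[1])
            if best is None or subs < best[0]:
                best = (subs, [np_w, vp_w])
    return best[1] if best else None
```

With this lexicon, `normalize(["John", "sleep"])` performs the single substitution of 'sleep' by 'sleeps', while an already grammatical input is returned unchanged.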
The (only) set of lexical entries that matches these preterminal categories requires the replacement of 'sleep' by 'sleeps', and thus 'John sleeps' is the normalization of 'John sleep'.</Paragraph> <Paragraph position="14"> Note that this strategy is not, in principle, limited to morpho-syntactic features. It might be useful for phonological and semantic normalization as well.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> EXPLICIT REPAIR </SectionTitle> <Paragraph position="0"> When people detect an error during an utterance they often try to correct it immediately. This, in general, makes the utterance as a whole ungrammatical. The structure of an utterance containing a self-repair is often: left context - reparandum - repair indicator - reparans - right context.</Paragraph> <Paragraph position="1"> The reparandum is the part of the utterance that is to be corrected by the reparans. Typical repair indicators are interjections like 'uh no', 'nonsense', 'sorry', etc. The following example from our corpus shows that structure (note that the left context is empty in the original German version): Den linken oh Quatsch den roten stellst du links hin (reparandum - indicator - reparans - right context) You put the left one eh nonsense the red one to the left (left context - reparandum - indicator - reparans - right context) A plausible normalization of this utterance would be 'Den roten stellst du links hin' ('You put the red one to the left'). This normalization differs from the original utterance in that the reparandum and the repair indicators have been deleted. The strategy to cover this type of repair is to scan the input string w1w2...wn until a repair indicator sequence wiwi+1...wj is found (1 < i ≤ j ≤ n). If there is such an explicit signal, then there probably is something wrong immediately before the repair sequence. But it is not clear what the reparandum is.
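The scan for an indicator, followed by trying deletions of increasing length that end at the indicator, can be sketched as follows. The indicator list and the `accepts` callback are invented stand-ins; a real system would query its parser instead.

```python
# Illustrative indicator sequences (assumptions, not the paper's inventory).
INDICATORS = [("eh", "nonsense"), ("uh", "no"), ("sorry",)]

def find_indicator(words):
    """Return (i, j) such that words[i:j] is a repair indicator
    sequence, or None if no indicator occurs."""
    for ind in INDICATORS:
        for i in range(len(words) - len(ind) + 1):
            if tuple(words[i:i + len(ind)]) == ind:
                return i, i + len(ind)
    return None

def normalize_repair(words, accepts):
    """Delete ever longer substrings ending at the indicator until the
    grammar (here: the `accepts` predicate) accepts the result."""
    span = find_indicator(words)
    if span is None:
        return words
    i, j = span
    # Try deleting words[k:j] for k = i-1, i-2, ..., 0: the candidate
    # reparandum grows leftwards one word at a time.
    for k in range(i - 1, -1, -1):
        candidate = words[:k] + words[j:]
        if accepts(candidate):
            return candidate
    return words
```

Run on the corpus example with an acceptance test standing in for the parser, the first two deletions fail and the third succeeds, mirroring the three attempts described in the text.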
Possibly the reparandum is just the word immediately before the repair indicator sequence, or a longer substring, or even the whole substring w1w2...wi-1. Which deletion of a substring wkwk+1...wj gives a grammatical sentence can only be decided by the grammar. Thus it is necessary to parse the results of the alternative deletions, beginning with w1...wi-2 wj+1...wn and incrementing the length of the deleted substring until the parse succeeds. If the deletion of a substring wkwk+1...wj makes a parse successful, and there is no other deletion of a substring wlwl+1...wj with k < l that does so, then w1w2...wk-1wj+1wj+2...wn is the normalization of the input string.</Paragraph> <Paragraph position="2"> If applied to the utterance 'You put the left one eh nonsense the red one to the left', the first deletion gives 'You put the left the red one to the left', which is not accepted by the parser. The second alternative tried ('You put the the red one to the left') fails, too. But the third attempt ('You put the red one to the left') is accepted by the parser and is thus considered the normalization of the original utterance.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> UNGRAMMATICAL REPETITIONS </SectionTitle> <Paragraph position="0"> Ungrammatical repetitions of single words or longer stretches occur quite frequently in spontaneous speech.</Paragraph> <Paragraph position="1"> As long as a sequence is repeated completely and without any alteration, it is easy to detect the redundant duplication and remove it from the input string to get a normalized version.
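Detecting such an exact, unaltered repetition needs no grammar at all: it suffices to look for a substring that is immediately followed by a verbatim copy of itself and drop the redundant copy. A minimal sketch (not the paper's implementation):

```python
def remove_exact_repetition(words):
    """Drop one copy of the longest substring that is immediately
    repeated verbatim; return the input unchanged if none exists."""
    n = len(words)
    for length in range(n // 2, 0, -1):          # prefer the longest match
        for i in range(n - 2 * length + 1):
            if words[i:i + length] == words[i + length:i + 2 * length]:
                return words[:i + length] + words[i + 2 * length:]
    return words
```

For example, 'take the the block' loses one 'the', and a repeated two-word stretch such as 'put it put it there' loses one copy of 'put it'.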
The problem is with incomplete repetitions and repetitions that introduce new lexical items: Some blocks some red blocks are small. (part 1 = 'Some blocks', part 2 = 'some red blocks') Some red some blue blocks are small. (part 1 = 'Some red', part 2 = 'some blue')</Paragraph> <Paragraph position="2"> The deletion of the substring indicated as 'part 1' in each of the utterances above would yield a suitable normalization. Utterances of this kind are in many respects like the explicit repairs discussed above, but they lack indicators. Typically, part 2 is similar to part 1 in that at least some words occur in both substrings. Moreover, part 1 and part 2 often belong to the same category (e.g. NP in the utterances above). This similarity motivates the following heuristic: The input string w1w2...wn is scanned for two different occurrences, say wi and wj (1 ≤ i < j ≤ n), of the same lexical item. wi and wj are permitted to differ in their inflectional properties, since an unsuitable inflection of wi might have been the reason to repeat it with proper inflection as wj (e.g. 'He takes took a block'). If such a repetition is found, the substring beginning with the first occurrence up to the word immediately before the second occurrence (i.e. wiwi+1...wj-1) is parsed. If the parse is successful and yields some category C for the substring, the next step is to find a prefix of wjwj+1...wn that belongs to the same category C. If such a prefix exists and w1w2...wi-1wjwj+1...wn is accepted as a grammatical sentence, it is considered to be the suitable normalization.</Paragraph> <Paragraph position="3"> Let us apply this strategy to the utterance 'Some blocks some red blocks are small'. Scanning this input string from left to right will immediately find the repeated lexical item 'some'. The parse of the substring 'Some blocks' results in an NP, and thus a prefix of 'some red blocks are small' is searched for which is also an NP.
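The heuristic can be sketched as follows. The `category_of` function below is a toy stand-in for the parser, hard-wired to recognize only the NPs of the running example, and `accepts` stands in for the grammaticality test; both are assumptions for illustration.

```python
def category_of(words):
    """Toy parser: recognizes only the NPs of the running example."""
    nps = {("some", "blocks"), ("some", "red", "blocks")}
    return "np" if tuple(w.lower() for w in words) in nps else None

def normalize_repetition(words, category_of, accepts):
    """Find a repeated lexical item wi = wj, parse the span wi..wj-1 to
    some category C, look for a prefix of wj..wn with the same category,
    and if dropping the first span yields an accepted sentence, return it."""
    lower = [w.lower() for w in words]
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            if lower[i] != lower[j]:
                continue                          # need a repeated item
            cat = category_of(words[i:j])
            if cat is None:
                continue                          # span before repeat must parse
            for k in range(j + 1, len(words) + 1):
                if category_of(words[j:k]) == cat:    # same-category prefix
                    candidate = words[:i] + words[j:]
                    if accepts(candidate):
                        return candidate
    return words
```

On 'Some blocks some red blocks are small' the repeated item 'some' is found, 'Some blocks' parses as an NP, the prefix 'some red blocks' matches that category, and the deletion of part 1 is returned.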
Such a prefix is found, namely 'some red blocks'.</Paragraph> <Paragraph position="4"> Therefore 'some red blocks are small' is tested for grammaticality and, indeed, it is a grammatical sentence.</Paragraph> </Section> </Paper>