File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/ackno/90/j90-1003_ackno.xml
Size: 2,328 bytes
Last Modified: 2025-10-06 13:51:46
<?xml version="1.0" standalone="yes"?> <Paper uid="J90-1003"> <Title>X and Y Separation Relation Word x Word y Mean Variance</Title> <Section position="11" start_page="0" end_page="0" type="ackno"> <SectionTitle> NOTES </SectionTitle> <Paragraph position="0"> 1. This statistic has also been used by the IBM speech group (Jelinek 1982) for constructing language models for applications in speech recognition.</Paragraph> <Paragraph position="1"> 2. Smadja (in press) discusses the separation between collocates in a very similar way.</Paragraph> <Paragraph position="2"> 3. This definition of fw(x,y) uses a rectangular window. It might be interesting to consider alternatives (e.g. a triangular window or a decaying exponential) that would weight words less and less as they are separated by more and more words. Other windows are also possible. For example, Hindle (Church et al. 1989) has used a syntactic parser to select words in certain constructions of interest. 4. Although the Good-Turing Method (Good 1953) is more than 35 years old, it is still heavily cited. For example, Katz (1987) uses the method in order to estimate trigram probabilities in the IBM speech recognizer. The Good-Turing Method is helpful for trigrams that have not been seen very often in the training corpus.</Paragraph> <Paragraph position="3"> 5. The last unclassified line (... save shoppers anywhere from $50 ...) raises interesting problems. Syntactic &quot;chunking&quot; shows that, in spite of the co-occurrence of from with save, this line does not belong here. An intriguing exercise, given the lookup table we are trying to construct, is how to guard against false inferences, such as concluding that, since shoppers is tagged [PERSON], $50 to $500 must count here as either BAD or a LOCATION. Accidental coincidences of this kind do not have a significant effect on the measure, however, although they do serve as a reminder of the probabilistic nature of the findings.</Paragraph> <Paragraph position="4"> 6. 
The word time itself also occurs significantly in the table, but on closer examination it is clear that this use of time (e.g. to save time) counts as something like a commodity or resource, not as part of a time adjunct. Such are the pitfalls of lexicography (obvious when they are pointed out).</Paragraph> </Section> </Paper>