<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1426">
  <Title>THE PRACTICAL VALUE OF N-GRAMS IN GENERATION</Title>
  <Section position="5" start_page="253" end_page="253" type="concl">
    <SectionTitle>
3 Discussion
</SectionTitle>
    <Paragraph position="0"> The strength of Nitrogen's generation is in its simplicity and robustness. Basic unigram and bigram frequen* cies capture much of the linguistic information relevant to generation. Yet there are also inherent limitations.</Paragraph>
    <Paragraph position="1"> One is that dependencies between non-contiguous words cannot be captured, nor call dependencies between more than two items (in a bigram-based system).</Paragraph>
    <Paragraph position="3"> As mentioned above, trigrams are one way of capturing part of these dependencies. However, they will only capture the contiguous ones, and moreover, practical experience suggests that switching to a trigram-based system might pose more problems than it solves. Trigrams exponentially increase the amount of data that must be stored, and require more extensive smoothing to prevent zero scores. Even. in a bigram-based System the sparseness problem is very prominent. Many reasonable bigrams have zero raft counts, and many individual words may not appear in the corpus at all.</Paragraph>
    <Paragraph position="4"> A more subtle disadvantage of trigrams is that many linguistic dependencies are only binary or unary relationships. A trigram-based system would not represent them efficiently; and due to Sparseness, often not rank them correctly relative to each other. For example, the two phrases &amp;quot;An American admire(s) Mount Fuji&amp;quot; would not be correctly ranked by a trigram system since the relevant trigrams have raw zero counts, and would be smoothed relative to the unigram counts, which in this case favor &amp;quot;admire&amp;quot;, as shown above. However, the raw bigram counts are not all zero, and favor &amp;quot;admires&amp;quot;. So, although ternary relationships do need to be represented to improve the system's quality, it seems better to selectively apply them instead.</Paragraph>
    <Paragraph position="5"> In conjunction with this, we are developing a more detailed lattice representation to include part-of-speech and syntactic bracketing information. This will give us a handle for weighing certain ngrams more heavily (such as the one between a noun and its conjugated verb) and for representing non-contiguous relationships.</Paragraph>
    <Paragraph position="6"> Syntactic tags will alleviate the blurring problem that occurs when a lexical item has more than one part of speech, (e.g., in example 1 where the frequency count for &amp;quot;trusts&amp;quot; includes both its use as a noun and as a verb). We plan to use the Penn Treebank corpus (Marcus et al., 1994) in collecting this data.</Paragraph>
    <Paragraph position="7"> Thus for now the symbolic half of Nitrogen's generation mainly provides word order information and mappings from semantic relationships to general syntactic ones (like noun or verb phrases). Detailed syntactic information and agreement rules are omitted. It is still unclear how much more detail the symbolic rules need to encode. Long distance agreement is one area that our current system is Weak in, but we expect to solve this problem statistically with more structured corpus data.</Paragraph>
    <Paragraph position="8"> One important job for symbolic processing that we are working to improve is our system's paraphrasing ability--its ability to express the same meaning with different syntactic constructions..This wilt facilitate the generation of longer, more complex meanings. For example an agent-pa'cient input might need to be expressed as a noun phrase, rather than a sentence, to be more fluently expressed as the object of some external matrix verb.</Paragraph>
    <Paragraph position="9"> Finally, a few rules of thumb we have learned: in collecting statistical data, it is necessary to distinguish proper nouns from each other, rather than lumping them all together. Otherwise it is impossible to prefer &amp;quot;in Japan&amp;quot; over &amp;quot;to Japan&amp;quot; because the only information available is &amp;quot;in PROPER&amp;quot; versus &amp;quot;to PROPER.&amp;quot; This also holds for numbers.</Paragraph>
    <Paragraph position="10"> We also notice that some kind of length heuristic is necessary; otherwise th e straightforward bigrams prefer sequences of simple words like &amp;quot;was&amp;quot;, &amp;quot;the&amp;quot;, &amp;quot;of&amp;quot;, to more concise renditions of a meaning. Yet our current heuristic sometime goes a little too far, dropping articles even when they are grammatically necessary. We expect the improvements we are working on will allow us to manage this problem better as well.</Paragraph>
  </Section>
class="xml-element"></Paper>