<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1010">
  <Title>A Memory-Based Approach to Learning Shallow Natural Language Patterns</Title>
  <Section position="3" start_page="67" end_page="67" type="metho">
    <SectionTitle>
DT ADJ ADJ NN NNP
</SectionTitle>
    <Paragraph position="0"> is a noun phrase (NP) by comparing it to the training corpus. A good match would be if the entire sequence appears as-is several times in the corpus.</Paragraph>
    <Paragraph position="1"> However, due to data sparseness, an exact match cannot always be expected.</Paragraph>
    <Paragraph position="2"> A somewhat weaker match may be obtained if we consider sub-parts of the candidate sequence (called tiles). For example, suppose the corpus contains noun phrase instances with the following structures:  (i) DT ADJ ADJ NN NN (2) DT ADJ NN NNP  The first structure provides positive evidence that the sequence &amp;quot;DT ADJ ADJ NN&amp;quot; is a possible NP prefix while the second structure provides evidence for &amp;quot;ADJ NN NNP&amp;quot; being an NP suffix. Together, these two training instances provide positive evidence that covers the entire candidate. Considering evidence for sub-parts of the pattern enables us to generalize over the exact structures that are present in the corpus. Similarly, we also consider the negative evidence for such sub-parts by noting where they occur in the corpus without being a corresponding part of a target instance.</Paragraph>
    <Paragraph position="3"> The proposed method, as described in detail in the next section, formalizes this type of reasoning. It searches specialized data structures for both positive and negative evidence for sub-parts of the candidate structure, and considers additional factors such as context and evidence overlap. Section 3 presents experimental results for three target syntactic patterns in English, and Section 4 describes related work.</Paragraph>
  </Section>
  <Section position="4" start_page="67" end_page="25024" type="metho">
    <SectionTitle>
2 The Algorithm
</SectionTitle>
    <Paragraph position="0"> The input to the Memory-Based Sequence Learning (MBSL) algorithm is a sentence represented as a sequence of POS tags, and its output is a bracketed sentence, indicating which subsequences of the sentence are to be considered instances of the target pattern (target instances). MBSL determines the bracketing by first considering each subsequence of the sentence as a candidate to be a target instance.</Paragraph>
    <Paragraph position="1"> It computes a score for each candidate by comparing it to the training corpus, which consists of a set of pre-bracketed sentences. The algorithm then finds a consistent bracketing for the input sentence, giving preference to high scoring subsequences. In the remainder of this section we describe the scoring and bracketing methods in more detail.</Paragraph>
    <Section position="1" start_page="67" end_page="69" type="sub_section">
      <SectionTitle>
2.1 Scoring candidates
</SectionTitle>
      <Paragraph position="0"> We first describe the mechanism for scoring an individual candidate. The input is a candidate subsequence, along with its context, i.e., the other tags in the input sentence. The method is presented at two levels: a general memory-based learning schema and a particular instantiation of it. Further instantiations of the schema are expected in future work.</Paragraph>
      <Paragraph position="1">  The MBSL scoring algorithm works by considering situated candidates. A situated candidate is a sentence containing one pair of brackets, indicating a candidate to be a target instance. The portion of the sentence between the brackets is the candidate (as above), while the portion before and after the candidate is its context. (Although we describe the algorithm here for the general case of unlimited context, for computational reasons our implementation only considers a limited amount of context on either side of the candidate.) This subsection describes how to compute the score of a situated candidate from the training corpus.</Paragraph>
      <Paragraph position="2"> The idea of the MBSL scoring algorithm is to construct a tiling of subsequences of a situated candidate which covers the entire candidate. We consider as tiles subsequences of the situated candidate which contain a bracket. (We thus consider only tiles within or adjacent to the candidate that also include a candidate boundary.) Each tile is assigned a score based on its occurrence in the training memory. Since brackets correspond to the boundaries of potential target instances, it is important to consider how the bracket positions in the tile correspond to those in the training memory.</Paragraph>
      <Paragraph position="3"> For example, consider the training sentence \[ NN \] VB \[ ADJ NN NN \] ADV PP \[ NN \] We may now examine the occurrence in this sentence of several possible tiles:</Paragraph>
      <Paragraph position="5"> tence, since the bracket does not correspond.</Paragraph>
      <Paragraph position="6"> The positive evidence for a tile is measured by its positive count, the number of times the tile (including brackets) occurs in the training memory with corresponding brackets. Similarly, the negative evidence for a tile is measured by its negative count, the number of times that the POS sequence of the tile occurs in the training memory with noncorresponding brackets (either brackets in the training where they do not occur in the tile, or vice versa). The total count of a tile is its positive count plus its negative count, that is, the total count of the POS sequence of the tile, regardless of bracket position.</Paragraph>
      <Paragraph position="7"> The score \](t) of a tile t is a function of its positive and negative counts.</Paragraph>
      <Paragraph position="8">  Candidate: NN VB \[ ADJ NN NN \] ADV MTile I: VB \[ ADJ NN NN \] MTile 2: VB \[ ADJ MTile 3: \[ ADJ NN MTile 4: NN NN \] MTile 5: NN \] ADV  context, and 5 matching tiles found in the training corpus.</Paragraph>
      <Paragraph position="9"> The overall score of a situated candidate is generally a function of the scores of all the tiles for the candidate, as well as the relations between the tiles' positions. These relations include tile adjacency, overlap between tiles, the amount of context in a tile, and so on.</Paragraph>
      <Paragraph position="10">  In our instantiation of the MBSL schema, we define the score fit) of a tile t as the ratio of its positive count pos(t) and its total count total(t):</Paragraph>
      <Paragraph position="12"> for a predefined threshold O. Tiles with a score of 1, and so with sufficient positive evidence, are called matching tiles.</Paragraph>
      <Paragraph position="13"> Each matching tile gives supporting evidence that a part of the candidate can be a part of a target instance. In order to combine this evidence, we try to cover the entire candidate by a set of matching tiles, with no gaps. Such a covering constitutes evidence that the entire candidate is a target instance. For example, consider the matching tiles shown for the candidate in Figure 1. The set of matching tiles 2, 4, and 5 covers the candidate, as does the set of tiles 1 and 5. Also note that tile 1 constitutes a cover on its own.</Paragraph>
      <Paragraph position="14"> To make this precise, we first say that a tile T1 connects to a tile T2 if (i) T2 starts after T1 starts, (ii) there is no gap between the end of T1 and the start of T2 (there may be some overlap), and (iii) T2 ends after T1 (neither tile includes the other). For example, tiles 2 and 4 in the figure connect, while tiles 2 and 5 do not, and neither do tiles 1 and 4 (since tile 1 includes tile 4 as a subsequence).</Paragraph>
      <Paragraph position="15"> A cover for a situated candidate c is a sequence of matching tiles which collectively cover the entire candidate, including the boundary brackets, and possibly some context, such that each tile connects to the following one. A cover thus provides positive evidence for the entire sequence of tags in the candidate.</Paragraph>
      <Paragraph position="16"> The set of all the covers for a candidate summarizes all of the evidence for the candidate being a target instance. We therefore compute the score of a candidate as a function of some statistics of the set of all its covers. For example, if a candidate has many different covers, it is more likely to be a target instance, since many different pieces of evidence can be brought to bear.</Paragraph>
      <Paragraph position="17"> We have empirically found several statistics of the cover set to be useful. These include, for each cover, the number of tiles it contains, the total number of context tags it contains, and the number of positions which more than one tile covers (the amount of overlap). We thus compute, for the set of all covers of a candidate c, the  * Total number of different covers, num(c), * Minimum number of matches in any cover, minsize(c), * Maximum amount of context in any cover, maxcontext(c), and * Maximum total overlap between tiles for any cover, maxoverlap(c).</Paragraph>
      <Paragraph position="18">  Each of these items gives an indication regarding the overall strength of the cover-based evidence for the candidate.</Paragraph>
      <Paragraph position="19"> The score of the candidate is a linear function of</Paragraph>
      <Paragraph position="21"> If candidate c has no covers, we set f(c) = O. Note that minsize is weighted negatively, since a cover with fewer tiles provides stronger evidence for the candidate.</Paragraph>
      <Paragraph position="22"> In the current implementation, the weights were chosen so as to give a lexicographic ordering, preferring first candidates with more covers, then those with covers containing fewer tiles, then those with larger contexts, and finally, when all else is equal, preferring candidates with more overlap between tiles. We plan to investigate in the future a data-driven approach (based on the Winnow algorithm) for optimal selection and weighting of statistical features of the score.</Paragraph>
      <Paragraph position="23"> We compute a candidate's statistics efficiently by performing a depth-first traversal of the cover graph of the candidate. The cover graph is a directed acyclic graph (DAG) whose nodes represent matching tiles of the candidate, such that an arc exists between nodes n and n', if tile n connects to n'. A special start node is added as the root of the DAG, that connects to all of the nodes (tiles) that contain an open bracket. There is a cover corresponding to each path from the start node to a node (tile) that contains a close bracket. Thus the statistics of all the covers may be efficiently computed by traversing the cover graph.</Paragraph>
      <Paragraph position="24">  Given a candidate sequence and its context (a situ- null ated candidate): 1. Consider all the subsequences of the situated candidate which include a bracket as tiles; 2. Compute a tile score as a function of its positive count and total counts, by searching the training corpus. Determine which tiles are matching tiles; 3. Construct the set of all possible covers for the candidate, that is, sequences of connected matching tiles that cover the entire candidate; 4. Compute the candidate score based on the  statistics of its covers.</Paragraph>
    </Section>
    <Section position="2" start_page="69" end_page="69" type="sub_section">
      <SectionTitle>
2.2 Searching the training memory
</SectionTitle>
      <Paragraph position="0"> The MBSL scoring algorithm searches the training corpus for each subsequence of the sentence in order to find matching tiles. Implementing this search efficiently is therefore of prime importance. We do so by encoding the training corpus using suffix trees (Edward and McCreight, 1976), which provide string searching in time which is linear in the length of the searched string.</Paragraph>
      <Paragraph position="1"> Inspired by Satta (1997), we build two suffix trees for retrieving the positive and total counts for a tile. The first suffix tree holds all pattern instances from the training corpus surrounded by bracket symbols and a fixed amount of context. Searching a given tile (which includes a bracket symbol) in this tree yields the positive count for the tile. The second suffix tree holds an unbracketed version of the entire training corpus. This tree is used for searching the POS sequence of a tile, with brackets omitted, yielding the total count for the tile (recall that the negative count is the difference between the total and positive counts).</Paragraph>
    </Section>
    <Section position="3" start_page="69" end_page="25024" type="sub_section">
      <SectionTitle>
2.3 Selecting candidates
</SectionTitle>
      <Paragraph position="0"> After the above procedure, each situated candidate is assigned a score. In order to select a bracketing for the input sentence, we assume that target instances are non-overlapping (this is usually the case for the types of patterns with which we experimented). We use a simple constraint propagation algorithm that finds the best choice of non-overlapping candidates in an input sentence:  1. Examine each situated candidate c with f(c) &gt; 0, in descending order of f(c): (a) Add c's brackets to the sentence; (b) Remove all situated candidates overlapping with c which have not yet been examined.</Paragraph>
      <Paragraph position="1"> 2. Return the bracketed sentence.</Paragraph>
      <Paragraph position="2">  ber of patterns and average length in the training data.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="25024" end_page="25024" type="metho">
    <SectionTitle>
3 Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="25024" end_page="25024" type="sub_section">
      <SectionTitle>
3.1 The Data
</SectionTitle>
      <Paragraph position="0"> We have tested our algorithm in recognizing three syntactic patterns: noun phrase sequences (NP), verb-object (VO), and subject-verb (SV) relations.</Paragraph>
      <Paragraph position="1"> The NP patterns were delimited by ' \[' and '\]' symbols at the borders of the phrase. For VO patterns, we have put the starting delimiter before the main verb and the ending delimiter after the object head, thus covering the whole noun phrase comprising the object; for example: ... investigators started to \[ view the lower price levels \] as attractive ...</Paragraph>
      <Paragraph position="2"> We used a similar policy for SV patterns, defining the start of the pattern at the start of the subject noun phrase and the end at the first verb encountered (not including auxiliaries and medals); for ex- null SV; 0.1 &lt; 8 &lt; 0.99 The subject and object noun-phrase borders were those specified by the annotators, phrases which contain conjunctions or appositives were not further analyzed. null The training and testing data were derived from the Penn TreeBank. We used the NP data prepared by Ramshaw and Marcus (1995), hereafter RM95.</Paragraph>
      <Paragraph position="3"> The SV and VO data were obtained using T (Tree-Bank's search script language) scripts. 2 Table 1 summarizes the sizes of the training and test data sets and the number of examples in each.</Paragraph>
      <Paragraph position="4"> The T scripts did not attempt to match dependencies over very complex structures, since we are concerned with shallow, or local, patterns. Table 2 shows the distribution of pattern length in the train data. We also did not attempt to extract passive-voice VO relations.</Paragraph>
    </Section>
    <Section position="2" start_page="25024" end_page="25024" type="sub_section">
      <SectionTitle>
3.2 Testing Methodology
</SectionTitle>
      <Paragraph position="0"> The test procedure has two parameters: (a) maximum context size of a candidate, which limits what queries are performed on the memory, and (b) the threshold 8 used for establishing a matching tile, which determines how to make use of the query results. null Recall and precision figures were obtained for various parameter values. F~ (van Rijsbergen, 1979), a common measure in information retrieval, was used 2The scripts may be found at the URL http://www.cs.biu.ac.il/,-~yuvalk/MBSL.</Paragraph>
      <Paragraph position="1"> as a single-figure measure of performance:</Paragraph>
      <Paragraph position="3"> We use ~ = 1 which gives no preference to either recall or precision.</Paragraph>
    </Section>
    <Section position="3" start_page="25024" end_page="25024" type="sub_section">
      <SectionTitle>
3.3 Results
</SectionTitle>
      <Paragraph position="0"> Table 3 summarizes the optimal parameter settings and results for NP, VO, and SV on the test set. In order to find the optimal values of the context size and threshold, we tried 0.1 &lt; t~ &lt; 0.95, and maximum context sizes of 1,2, and 3. Our experiments used 5-fold cross-validation on the training data to determine the optimal parameter settings.</Paragraph>
      <Paragraph position="1"> In experimenting with the maximum context size parameter, we found that the difference between the values of F~ for context sizes of 2 and 3 is less than 0.5% for the optimal threshold. Scores for a context size of 1 yielded F~ values smaller by more than 1% than the values for the larger contexts.</Paragraph>
      <Paragraph position="2"> Figure 2 shows recall/precision curves for the three data sets, obtained by varying 8 while keeping the maximum context size at its optimal value. The difference between F~=I values for different thresholds was always less than 2%.</Paragraph>
      <Paragraph position="3"> Performance may be measured also on a word-by word basis, counting as a success any word which was identified correctly as being part of the target pattern. That method was employed, along with recall/precision, by RM95. We preferred to measure performance by recall and precision for complete patterns. Most errors involved identifications of slightly shifted, shorter or longer sequences. Given a pattern consisting of five words, for example, identifying only a four-word portion of this pattern would yield both a recall and precision errors. Tagassignment scoring, on the other hand, will give it a score of 80%. We hold the view that such an identification is an error, rather than a partial success. We used the datasets created by RM95 for NP learning; their results are shown in Table 3. 3 The F~ difference is small (0.4%), yet they use a richer feature set, which incorporates lexicai information as well. The method of Ramshaw and Marcus makes a decision per word, relying on predefined rule templates. The method presented here makes decisions on sequences and uses sequences as its memory, thereby attaining a dynamic perspective of the SNotice that our results, as well as those we cite from RM95, pertains to a training set of 229,000 words. RM95 report also results for a larger training set, of 950,000 words, for which recall/precision is 93.5%/93.1%, correspondingly (F~=93.3%). Our system needs to be further optimized in order to handle that amount of data, though our major concern in future work is to reduce the overall amount of labeled training data.</Paragraph>
      <Paragraph position="4">  last line shows the results of Ramshaw and Marcus (1995) (recognizing NP's) with the same train/test data. The optimal parameters were obtained by 5-fold cross-validation.</Paragraph>
      <Paragraph position="6"> examples (left) and words (right) pattern structure. We aim to incorporate lexical information as well in the future, it is still unclear whether that will improve the results.</Paragraph>
      <Paragraph position="7"> Figure 3 shows the learning curves by amount of training examples and number of words in the training data, for particular parameter settings.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="25024" end_page="25024" type="metho">
    <SectionTitle>
4 Related Work
</SectionTitle>
    <Paragraph position="0"> Two previous methods for learning local syntactic patterns follow the transformation-based paradigm introduced by Brill (1992). Vilain and Day (1996) identify (and classify) name phrases such as company names, locations, etc. Ramshaw and Marcus (1995) detect noun phrases, by classifying each word as being inside a phrase, outside or on the boundary between phrases.</Paragraph>
    <Paragraph position="1"> Finite state machines (FSMs) are a natural formalism for learning linear sequences. It was used for learning linguistic structures other than shallow syntax. Gold (1978) showed that learning regular languages from positive examples is undecidable in the limit. Recently, however, several learning methods have been proposed for restricted classes of FSM. OSTIA (Onward Subsequential Transducer Inference Algorithm; Oncina, Garcia, and Vidal 1993), learns a subsequential transducer in the limit. This algorithm was used for natural-language tasks by Vilar, Marzal, and Vidal (1994) for learning translation of a limited-domain language, as well as by Gildea and Jurafsky (1994) for learning phonological rules.</Paragraph>
    <Paragraph position="2"> Ahonen et al. (1994) describe an algorithm for learning (k,h)-contextual regular languages, which they use for learning the structure of SGML documents.</Paragraph>
    <Paragraph position="3"> Apart from deterministic FSMs, there are a number of algorithms for learning stochastic models, eg., (Stolcke and Omohundro, 1992; Carrasco and Oncina, 1994; Ron et al., 1995). These algorithms differ mainly by their state-merging strategies, used for generalizing from the training data.</Paragraph>
    <Paragraph position="4"> A major difference between the abovementioned learning methods and our memory-based approach is that the former employ generalized models that were created at training time while the latter uses the training corpus as-is and generalizes only at recognition time.</Paragraph>
    <Paragraph position="5"> Much work aimed at learning models for full parsing, i.e., learning hierarchical structures. We refer here only to the DOP (Data Oriented Parsing) method (Bod, 1992) which, like the present work, is a memory-based approach. This method constructs parse alternatives for a sentence based on combinations of subtrees in the training corpus. The MBSL approach may be viewed as a linear analogy to DOP in that it constructs a cover for a candidate based  on subsequences of training instances.</Paragraph>
    <Paragraph position="6"> Other implementations of the memory-based paradigm for NLP tasks include Daelemans et al.</Paragraph>
    <Paragraph position="7"> (1996), for POS tagging; Cardie (1993), for syntactic and semantic tagging; and Stanfill and Waltz (1986), for word pronunciation. In all these works, examples are represented as sets of features and the deduction is carried out by finding the most similar cases. The method presented here is radically different in that it makes use of the raw sequential form of the data, and generalizes by reconstructing test examples from different pieces of the training data.</Paragraph>
  </Section>
  <Section position="7" start_page="25024" end_page="25024" type="metho">
    <SectionTitle>
5 Conclusions
</SectionTitle>
    <Paragraph position="0"> We have presented a novel general schema and a particular instantiation of it for learning sequential patterns. Applying the method to three syntactic patterns in English yielded positive results, suggesting its applicability for recognizing local linguistic patterns. In future work we plan to investigate a data-driven approach for optimal selection and weighting of statistical features of candidate scores, as well as to apply the method to syntactic patterns of Hebrew and to domain-specific patterns for information extraction. null</Paragraph>
  </Section>
class="xml-element"></Paper>