<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1009">
  <Title>Man* vs. Machine: A Case Study in Base Noun Phrase Learning</Title>
  <Section position="4" start_page="65" end_page="67" type="metho">
    <SectionTitle>
3 Manual Rule Acquisition
</SectionTitle>
    <Paragraph position="0"> R&amp;M framed the base NP annotation problem as a word tagging problem. We chose instead to use regular expressions on words and part of speech tags to characterize the NPs, as well as the context surrounding the NPs, because this is both a more powerful representational language and more intuitive to a person. A person can more easily consider potential phrases as a sequence of words and tags, rather than looking at each individual word and deciding whether it is part of a phrase or not. The rule actions we allow are: 2 Add Add a base NP (bracket a sequence of words as a base NP) Kill Delete a base NP (remove a pair of parentheses) Transform Transform a base NP (move one or both parentheses to extend/contract a base NP) Merge Merge two base NPs As an example, we consider an actual rule from our experiments: Bracket all sequences of words of: one determiner (DT), zero or more adjectives (JJ, JJR, JJS), and one or more nouns (NN, NNP, NNS, NNPS), if they are followed by a verb (VB, VBD, VBG, VBN, VBP, VBZ).</Paragraph>
    <Paragraph position="1"> In our language, the rule is written thus: 3 A</Paragraph>
    <Paragraph position="3"> The first line denotes the action, in this case, Add a bracketing. The second line defines the context preceding the sequence we want to have bracketed -- in this case, we do not care what this sequence is. The third line defines the sequence which we want bracketed, and the last 2The rule types we have chosen are similar to those used by Vilain and Day (1996) in transformation-based parsing, but are more powerful.</Paragraph>
    <Paragraph position="4"> SA full description of the rule language can be found</Paragraph>
    <Paragraph position="6"> line defines the context following the bracketed sequence.</Paragraph>
    <Paragraph position="7"> Internally, the software then translates this rule into the more unwieldy Perl regular expression: null s( ( ( \['\s_\] +__DT\s+) ( \['\s_\] +__JJ \[RS\] \s+)* The actual system is located at http://nlp, cs. jhu. edu/~basenp/chunking. A screenshot of this system is shown in figure 4. The correct base NPs are enclosed in parentheses and those annotated by the human's rules in brackets.</Paragraph>
    <Paragraph position="9"> The base NP annotation system created by the humans is essentially a transformation-based system with hand-written rules. The user manually creates an ordered list of rules. A rule list can be edited by adding a rule at any position, deleting a rule, or modifying a rule.</Paragraph>
    <Paragraph position="10"> The user begins with an empty rule list. Rules are derived by studying the training corpus and NPs that the rules have not yet bracketed, as well as NPs that the rules have incorrectly bracketed. Whenever the rule list is edited, the efficacy of the changes can be checked by running the new rule list on the training set and seeing how the modified rule list compares to the unmodified list. Based on this feedback, the user decides whether, to accept or reject the changes that were made. One nice prop-erty of transformation-based learning is that in appending a rule to the end of a rule list, the user need not be concerned about how that rule may interact with other rules on the list. This is much easier than writing a CFG, for instance, where rules interact in a way that may not be readily apparent to a human rule writer.</Paragraph>
    <Paragraph position="11"> To make it easy for people to study the training set, word sequences are presented in one of four colors indicating that they:  1. are not part of an NP either in the truth or in the output of the person's rule set 2. consist of an NP both in the truth and in the output of the person's rule set (i.e. they constitute a base NP that the person's rules correctly annotated) 3. consist of an NP in the truth but not in the output of the person's rule set (i.e. they constitute a recall error) 4. consist of an NP in the output of the person's rule set but not in the truth (i.e. they constitute a precision error)</Paragraph>
    <Section position="1" start_page="65" end_page="67" type="sub_section">
      <SectionTitle>
4 Experimental Set-Up and Results
</SectionTitle>
      <Paragraph position="0"> The experiment of writing rule lists for base NP annotation was assigned as a homework set to a group of 11 undergraduate and graduate students in an introductory natural language processing course. 4 The corpus that the students were given from which to derive and validate rules is a 25k word subset of the R&amp;M training set, approximately ! the size of the full R&amp;M training set. The 8 reason we used a downsized training set was that we believed humans could generalize better from less data, and we thought that it might be possible to meet or surpass R&amp;M's results with a much smaller training set.</Paragraph>
      <Paragraph position="1"> Figure 1 shows the final precision, recall, F-measure and precision+recall numbers on the training and test corpora for the students.</Paragraph>
      <Paragraph position="2"> There was very little difference in performance on the training set compared to the test set.</Paragraph>
      <Paragraph position="3"> This indicates that people, unlike machines, seem immune to overtraining. The time the students spent on the problem ranged from less than 3 hours to almost 10 hours, with an average of about 6 hours. While it was certainly the case that the students with the worst results spent the least amount of time on the problem, it was not true that those with the best results spent the most time -- indeed, the average amount of time spent by the top three students was a little less than the overall average -- slightly over 5 hours. On average, people achieved 90% of their final performance after half of the total time they spent in rule writing. The number of rules in the final rule lists also varied, from as few as 16 rules to as many as 61 rules, with an average of 35.6 rules. Again, the average number for the top three subjects was a little under the average for everybody: 30.3 rules.</Paragraph>
      <Paragraph position="4"> 4These 11 students were a subset of the entire class. Students were given an option of participating in this experiment or doing a much more challenging final project. Thus, as a population, they tended to be the less motivated students.</Paragraph>
      <Paragraph position="5">  In the beginning, we believed that the students would be able to match or better the R&amp;M system's results, which are shown in figure 2. It can be seen that when the same training corpus is used, the best students do achieve performances which are close to the R&amp;M system's -- on average, the top 3 students' performances come within 0.5% precision and 1.1% recall of the machine's. In the following section, we will examine the output of both the manual and automatic systems for differences.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="67" end_page="69" type="metho">
    <SectionTitle>
5 Analysis
</SectionTitle>
    <Paragraph position="0"> Before we started the analysis of the test set, we hypothesized that the manually derived systems would have more difficulty with potential rifles that are effective, but fix only a very small number of mistakes in the training set.</Paragraph>
    <Paragraph position="1"> The distribution of noun phrase types, identified by their part of speech sequence, roughly obeys Zipf's Law (Zipf, 1935): there is a large tail of noun phrase types that occur very infrequently in the corpus. Assuming there is not a rule that can generalize across a large number of these low-frequency noun phrases, the only way noun phrases in the tail of the distribution can be learned is by learning low-count rules: in other words, rules that will only positively affect a small number of instances in the training corpus.</Paragraph>
    <Paragraph position="2"> Van der Dosch and Daelemans (1998) show that not ignoring the low count instances is often crucial to performance in machine learning systems for natural language. Do the human-written rules suffer from failing to learn these infrequent phrases? To explore the hypothesis that a primary difference between the accuracy of human and machine is the machine's ability to capture the low frequency noun phrases, we observed how the accuracy of noun phrase annotation of both human and machine derived rules is affected by the frequency of occurrence of the noun phrases in the training corpus. We reduced each base NP in the test set to its POS tag sequence as assigned by the POS tagger. For each POS tag sequence, we then counted the number of times it appeared in the training set and the recall achieved on the test set.</Paragraph>
    <Paragraph position="3"> The plot of the test set recall vs. the number of appearances in the training set of each tag sequence for the machine and the mean of the top 3 students is shown in figure 3. For instance, for base NPs in the test set with tag sequences that appeared 5 times in the training corpus, the students achieved an average recall of 63.6% while the machine achieved a recall of 83.5%.</Paragraph>
    <Paragraph position="4"> For base NPs with tag sequences that appear less than 6 times in the training set, the machine outperforms the students by a recall of 62.8% vs. 54.8%. However, for the rest of the base NPs -- those that appear 6 or more times -the performances of the machine and students are almost identical: 93.7% for the machine vs.</Paragraph>
    <Paragraph position="5"> 93.5% for the 3 students, a difference that is not statistically significant.</Paragraph>
    <Paragraph position="6"> The recall graph clearly shows that for the top 3 students, performance is comparable to the machine's on all but the low frequency constituents. This can be explained by the human's  reluctance or inability to write a rule that will only capture a small number of new base NPs in the training set. Whereas a machine can easily learn a few hundred rules, each of which makes a very small improvement to accuracy, this is a tedious task for a person, and a task which apparently none of our human subjects was willing or able to take on.</Paragraph>
    <Paragraph position="7"> There is one anomalous point in figure 3. For base NPs with POS tag sequences that appear 3 times in the training set, there is a large decrease in recall for the machine, but a large increase in recall for the students. When we looked at the POS tag sequences in question and their corresponding base NPs, we found that this was caused by one single POS tag sequence -- that of two successive numbers (CD). The  test set happened to include many sentences containing sequences of the type:</Paragraph>
    <Paragraph position="9"> while the training set had none. The machine ended up bracketing the entire sequence I/CD -~/CD to/T0 51/CD 1/2/CD as a base NP. None of the students, however, made this mistake.</Paragraph>
  </Section>
class="xml-element"></Paper>