XML Viewer - p06-2115

File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/p06-2115_metho.xml
Size: 18,199 bytes
Last Modified: 2025-10-06 14:10:29
<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2115">
  <Title>From Prosodic Trees to Syntactic Trees</Title>
  <Section position="3" start_page="0" end_page="899" type="metho">
    <SectionTitle>
1 The cantillation treebank
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="899" type="sub_section">
      <SectionTitle>
1.1 Cantillation marks
</SectionTitle>
      <Paragraph position="0"> The text of HB has been systematically annotated for more than a thousand years. By the end of the  th century, a group of Jewish scholars known as the Masoretes had developed a system for  marking the structures of the Bible verses. The system contains a set of cantillation marks  which indicate the division and subdivision of each verse, very much like the punctuation marks or the brackets we use to mark constituent structures. At that time, those cantillation marks were intended to record the correct way of reading or chanting the Hebrew text: how to group words into phrases and where to put pauses between intonational units. In the eyes of modern linguists, the hierarchical structures thus marked can be best understood as a prosodic representation of the verses (Dresher 1994). There are two types of cantillation marks: the conjunctive marks which group multiple words into single units and the disjunctive marks which divide and subdivide a verse in a binary fashion. The marking of Genesis 1:1, for example, is equivalent to the bracketing shown below.</Paragraph>
      <Paragraph position="1"> (English words are used here in place of Hebrew to make it easier for non-Hebrew-speakers to understand. OM stands for object marker.)</Paragraph>
      <Paragraph position="3"> This analysis resembles the prosodic structure in Selkirk (1984) and the performance structure in Gee and Grosjean (1983).</Paragraph>
      <Paragraph position="4"> 1.2. Parsing the prosodic structure The cantillation system in the Mesoretic text is a very complex one with dozens of diacritic symbols and complicated annotation rules. As a result, only a few trained scholars can decipher them and their practical use has been very limited. In order to make the information encoded by this system more accessible to both humans and  The cantillation marks show how a text is to be sung. See http://en.wikipedia.org/wiki/Cantillation. machines, we built a treebank where the prosodic structures of HB verses are explicitly represented as trees in XML format (Wu &amp; Lowery, 2006).</Paragraph>
      <Paragraph position="5"> There have been quite a few studies of the Masoretic cantillation system. After reviewing the existing analyses, such as Wickes (1881), Price (1990), Richter (2004) and Jacobson (2002), we adopted the binary analysis of British and Foreign Bible Society (BFBS 2002) which is based on the principle of dichotomy of Wicks (1881). The binary trees thus generated are best for extracting brackets that are syntactically significant.</Paragraph>
      <Paragraph position="6"> We found all the binary rules that underlie the annotation and coded them in a context-free grammar. This CFG was then used by the parser to automatically generate the prosodic trees. The input text to the parser was the MORPH database developed by Groves &amp; Lowery (2006) where the the cantillation marks are represented as numbers in its Michigan Code text.</Paragraph>
      <Paragraph position="7"> The following is the prosodic tree generated for Genesis 1:1, displayed in English glosses in the tree editor (going right-to-left according to the writing convention of Hebrew): Figure 1 The node labels &amp;quot;athnach&amp;quot;, &amp;quot;tiphcha&amp;quot;, &amp;quot;mereka&amp;quot; and &amp;quot;munach&amp;quot; in this tree are names of the cantillation marks that indicate the types of boundaries between the two chunks they dominate. Different types of boundaries have different (relative) boundary strengths. The &amp;quot;m&amp;quot; nodes are morphemes and the &amp;quot;w&amp;quot; nodes are words.</Paragraph>
      <Paragraph position="8">  1.3. A complete prosodic treebank Since the Mesoretic annotation is supposed to mark the structure of every verse unambiguously, we expect to parse every verse successfully with exactly one tree assigned to it, given that (1) the annotation is perfectly correct and (2) the CFG grammars correctly encoded the annotation rules. The actual results were close to our expectation: all the 23213 verses were successfully parsed, of which 23099 received exactly one complete tree. The success rate is 99.5 percent. The 174 verses that received multiple parse trees all have words that carry more than one cantillation mark. This can of course create boundary ambiguities and result in multiple parse trees. We have good reasons to believe that the grammars we used are correct. We would have failed to parse some verses if the grammars had been incomplete and we would have gotten multiple trees for a much greater number of verses if the grammars had been ambiguous.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="899" end_page="901" type="metho">
    <SectionTitle>
2 Phrase boundary extraction
</SectionTitle>
    <Paragraph position="0"> Now that a cantillation treebank is available, we can get brackets from those trees and use them in syntactic parsing. Although prosodic structures are not syntactic structures, they do correspond to each other in some systematic ways. Just as there are ways to transform syntactic structures to prosodic structures (e.g. Abney 1992), prosodic structures can also provide clues to syntactic structures. As we have discovered, some of the brackets in the cantillation trees can be directly mapped to syntactic boundaries, some can be mapped after some adjustment, and some have no syntactic significance at all.</Paragraph>
    <Section position="1" start_page="899" end_page="900" type="sub_section">
      <SectionTitle>
2.1 Direct correspondences
</SectionTitle>
      <Paragraph position="0"> Direct correspondences are most likely to be found at the clausal level. Almost all the clause boundaries can be found in the cantillation trees.</Paragraph>
      <Paragraph position="1"> Take Genesis 1:3 as an example: Figure 2 Here, the verse is first divided into two clauses: &amp;quot;God said let there be light&amp;quot; and &amp;quot;there was light&amp;quot;. The first clause is further divided into &amp;quot;God said&amp;quot; and &amp;quot;let there be light&amp;quot;. Such bracketing will prevent the wrong analysis where &amp;quot;let there be light&amp;quot; and &amp;quot;there was light&amp;quot; are conjoined to serve as the object of &amp;quot;God said&amp;quot;. Given the fact that there are no punctuation marks in HB, it is very difficult for the parser to rule out the wrong parse without the help of the cantillatioin information.</Paragraph>
      <Paragraph position="2"> Coordination is another area where the cantillation brackets are of great help. The syntactic ambiguity associated with coordination is well-known, but the ambiguity can often be resolved with help of prosodic cues. This is indeed what we find in the cantillation treebank.</Paragraph>
      <Paragraph position="3"> In Genesis 24:35, for example, we find the following sequence of words: &amp;quot;male servants and maid servants and camels and donkeys&amp;quot;.</Paragraph>
      <Paragraph position="4"> Common sense tells us that there are only two possible analyses for this sequence: (1) a flat structure where the four NPs are sisters, or (2) &amp;quot;male servant&amp;quot; conjoins with &amp;quot;maid servant, &amp;quot;camels&amp;quot; conjoins with &amp;quot;donkeys&amp;quot;, and then the two conjoined NPs are further conjoined as sisters. However, the computer is faced with 14 different choices. Fortunately, the cantillation tree can help us pick the correct structure:  Figure 3 The brackets extracted from this tree will force the parser to produce only the second analysis above.</Paragraph>
      <Paragraph position="5"> Good correspondences are also found for most base NPs and PPs. Here is an example from Genesis 1:4, which means &amp;quot;God separated the light from the darkness&amp;quot;: Figure 4 As we can see, the noun phrases and prepositional phrases all have corresponding brackets in this tree.</Paragraph>
    </Section>
    <Section position="2" start_page="900" end_page="901" type="sub_section">
      <SectionTitle>
2.2 Indirect correspondences
</SectionTitle>
      <Paragraph position="0"> Now we turn to prosodic structures that can be adjusted to correspond to syntactic structures.</Paragraph>
      <Paragraph position="1"> They usually involve the use of function words such as conjunctions, prepositions and determiners. Syntactically, these words are supposed to be attached to complete NPs, often resulting in trees where those single words are sisters to large NP chunks. Such &amp;quot;unbalanced&amp;quot; trees are rarely found in prosodic structures, however, where a sentence tends to be divided into chunks of similar length for better rhythm and flow of speech.</Paragraph>
      <Paragraph position="2"> This is certainly the case in the HB cantillation treebank. It must have already been noticed in the example trees we have seen so far that the conjunction &amp;quot;and&amp;quot; is always attached to the word that immediately follows it. As a matter of fact, the conjunction and the following word are often treated as a single word for phonological reasons. Prepositions are also traditionally treated as part of the following word. It is therefore not a surprise to find trees of the following kind: Figure 5 In this tree, the preposition &amp;quot;over&amp;quot; is attached to &amp;quot;surface of&amp;quot; instead of &amp;quot;surface of the waters&amp;quot;. We also see the conjunction &amp;quot;and&amp;quot; is attached to &amp;quot;spirit of&amp;quot; rather than to the whole clause. A similar situation is found for determiners, as can be seen in this sub-tree where &amp;quot;every of&amp;quot; is attached to &amp;quot;crawler of&amp;quot; instead of &amp;quot;crawler of the ground&amp;quot;.</Paragraph>
      <Paragraph position="3"> Figure 6  In all these cases, the differences between prosodic structures and syntactic structures are systematic and predictable. All of them can be adjusted to correspond better to syntactic structures by raising the function words out of their current positions and re-attach them to some higher nodes.</Paragraph>
    </Section>
    <Section position="3" start_page="901" end_page="901" type="sub_section">
      <SectionTitle>
2.3 Extracting the boundaries
</SectionTitle>
      <Paragraph position="0"> In the bracket extraction phase, we go through all the sub-trees and get their beginning and ending positions in the form of (begin, end). Given the tree in Figure 6, for example, we can extract the following brackets: (n, n+3), (n, n+1), (n+2, n+3), where n is the position of the first word in the sub-tree.</Paragraph>
      <Paragraph position="1"> For cases of indirect correspondence discussed in 2.2, we automatically adjust the brackets by removing the ones around the function word and its following word and adding a pair of brackets that start from the word following the function word and end in the last word of the phrase. After this adjustment, the brackets extracted from  For trees that start with &amp;quot;and&amp;quot;, we detach &amp;quot;and&amp;quot; and re-attach it to the highest node that covers the phrase starting with &amp;quot;and&amp;quot;. After this and other adjustments, the brackets we extract from Figure</Paragraph>
      <Paragraph position="3"> The cantillation trees also contain brackets that are not related to syntactic structures at all. Since it is difficult to identify those useless brackets automatically, we just leave them alone and let them be extracted anyway. Fortunately, as we will see in the next section, the parser does not depend completely on the extracted bracketing information. The useless brackets can simply be ignored in the parsing process.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="901" end_page="903" type="metho">
    <SectionTitle>
3 Building a syntactic treebank
</SectionTitle>
    <Paragraph position="0"> As we mentioned earlier, we use a parser to generate the treebank. This parser uses an augmented context-free grammar that encodes the grammatical knowledge of Biblical Hebrew.</Paragraph>
    <Paragraph position="1"> Each rule in this grammar has a number of grammatical conditions which must be satisfied in order for the rule to apply. In addition, it may have a bracketing condition which can either block the application of a rule or force a rule to apply.</Paragraph>
    <Paragraph position="2"> Besides serving as conditions in rule application,  the bracketing information is also used to rank trees in cases where more than one tree is generated.</Paragraph>
    <Section position="1" start_page="902" end_page="902" type="sub_section">
      <SectionTitle>
3.1 Brackets as rule conditions
</SectionTitle>
      <Paragraph position="0"> Bracketing information is used in some grammar rules to guide the parser in making syntactic decisions. In those rules, we have conditions that look at the beginning position and ending position of the sub-tree to be produced by the rule and check to see if those bracket positions are found in our phrase boundary database. The sub-tree will be built only if the bracketing conditions are satisfied.</Paragraph>
      <Paragraph position="1"> There are two types of bracketing conditions.</Paragraph>
      <Paragraph position="2"> One type serves as the necessary and sufficient condition for rule application. These conditions work in disjunction with grammatical conditions.</Paragraph>
      <Paragraph position="3"> A rule will apply when either the grammatical conditions or the bracketing conditions are satisfied. This is where the bracketing condition can force a rule to apply regardless of the grammatical conditions. The brackets consulted by this kind of conditions must be the manually approved ones or the automatically extracted ones that are highly reliable. Such conditions make it possible for us to override grammatical conditions that are too strict and build the structures that are known to be correct.</Paragraph>
      <Paragraph position="4"> The other type of bracketing conditions serves as the necessary conditions only. They work in conjunction with grammatical conditions to determine the applicability of a rule. The main function of those bracketing conditions is to block structures that the grammatical conditions fail to block because of lack of information.</Paragraph>
      <Paragraph position="5"> However, they cannot force a rule to apply. The sub-tree to be produced the rule will be built only if both the grammatical conditions and the bracketing conditions are met.</Paragraph>
      <Paragraph position="6"> The overall design of the rules and conditions are meant to build a linguistically motivated Hebrew grammar that is independent of the cantillation treebank while making use of its prosodic information.</Paragraph>
    </Section>
    <Section position="2" start_page="902" end_page="902" type="sub_section">
      <SectionTitle>
3.2 Brackets for tree ranking
</SectionTitle>
      <Paragraph position="0"> The use of bracketing conditions greatly reduces the number of trees the parser generates. In fact, many verses yield a single parse only. However, there are still cases where multiple trees are generated. In those cases, we use the bracketing information to help rank the trees.</Paragraph>
      <Paragraph position="1"> During tree ranking, the brackets of each tree are compared with the brackets in the cantillation trees to find the number of mismatches. Trees that have fewer mismatches are ranked higher than trees that have more mismatches. In most cases, the top-ranking tree is the correct parse.</Paragraph>
      <Paragraph position="2"> Theoretically, it should be possible to remove all the bracketing conditions from the rules, let the parser produce all possible trees, and use the bracketing information solely at the tree-ranking stage to select the correct trees. We can even use machine learning techniques to build a statistical parser. However, a Treebank of the Bible requires 100% accuracy but none of the statistical models are capable of that standard yet. As long as 100% accuracy is not guaranteed, manual checking will be required to fix all the individual errors. Such case-by-case fixes are easy to do in our current approach but are very difficult in statistical models.</Paragraph>
    </Section>
    <Section position="3" start_page="902" end_page="903" type="sub_section">
      <SectionTitle>
3.3 Evaluation
</SectionTitle>
      <Paragraph position="0"> Since only a very small fraction of the trees generated by our parser have been manually verified, there is not yet a complete golden standard to objectively evaluate the accuracy of the parser. However, some observations are obvious: (1) The parsing process can become intractable without the bracketing conditions. We tried parsing with those conditions removed from the rules to see how many more trees we will get. It turned out that parsing became so slow that we had to terminate it before it was finished. This shows that the bracketing conditions are playing an indispensable role in making syntactic decisions.</Paragraph>
      <Paragraph position="1">  (2) The number of edits needed to correct the trees in manual checking is minimal. Most trees generated by the machine are basically correct and only a few touches are necessary to make them perfect.</Paragraph>
      <Paragraph position="2"> (3) The boundary information extracted from the cantillation tree could take a long time to create if done by hand, and a great deal of manual work is saved by using the brackets from the cantillation treebank.</Paragraph>
      <Paragraph position="3"> Conclusion In this paper, we have demonstrated the use of prosodic information in syntactic parsing in a treebanking project. There are correlations between prosodic structures and syntactic structures. By using a parser that consults the prosodic phrase boundaries, the cost of building the treebank can be minimized.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML