<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2912"> <Title>Unsupervised Parsing with U-DOP</Title> <Section position="4" start_page="88" end_page="89" type="intro"> <SectionTitle> 3 Experiments </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="88" end_page="89" type="sub_section"> <SectionTitle> 3.1 Comparing U-DOP to previous work </SectionTitle> <Paragraph position="0"> Using the method described above, our parsing experiment with all p-o-s strings from the WSJ10 yields an f-score of 78.5%. We next tested U-DOP on two additional domains, Chinese and German, which were also used in Klein and Manning (2002, 2004): the Chinese treebank (Xue et al. 2002) and the NEGRA corpus (Skut et al. 1997).</Paragraph> <Paragraph position="1"> The CTB10 is the subset of p-o-s strings from the Penn Chinese treebank containing 10 words or fewer after removal of punctuation (2437 strings). The NEGRA10 is the subset of p-o-s strings of the same length from the NEGRA corpus, using the supplied conversion into Penn treebank format (2175 strings). Table 1 shows the results of U-DOP in terms of UP, UR and F1, compared to the results of the CCM model by Klein and Manning (2002), the DMV dependency learning model by Klein and Manning (2004), and their combined model DMV+CCM.</Paragraph> <Paragraph position="2"> U-DOP scores better than Klein and Manning's combined DMV+CCM model, although the differences are small (note that for Chinese the single DMV model scores better than the combined model and slightly better than U-DOP). But where Klein and Manning's combined model is based on both a constituency and a dependency model, U-DOP is, like CCM, based only on a notion of constituency. Compared to CCM alone, the all-subtrees approach employed by U-DOP shows a clear improvement (except perhaps for Chinese). It thus seems to pay off to use all subtrees rather than just all (contiguous) substrings in bootstrapping constituency. 
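As a rough illustration of the combinatorics behind this contrast (not part of the paper's own experiments), the following Python sketch compares the number of contiguous substrings of an n-word sentence, which grows only quadratically, with the number of distinct binary trees that can be assigned to it, which grows with the Catalan numbers. The function names are illustrative, not taken from the U-DOP implementation.

```python
from math import comb

def num_contiguous_substrings(n):
    """Number of contiguous substrings (spans) of a length-n string: n(n+1)/2."""
    return n * (n + 1) // 2

def num_binary_trees(n):
    """Number of distinct binary tree shapes over n leaves:
    the (n-1)-th Catalan number, C(2(n-1), n-1) / n."""
    return comb(2 * (n - 1), n - 1) // n

# Compare the two search spaces for short sentences.
for n in (2, 5, 10):
    print(n, num_contiguous_substrings(n), num_binary_trees(n))
```

Each of these trees in turn contributes many subtrees, so the space U-DOP draws on is vastly larger than the space of substrings CCM considers, even for the 10-word sentences of the WSJ10.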
It would be interesting to investigate an extension of U-DOP towards dependency parsing, but we leave this for future research. It is also noteworthy that U-DOP does not employ a separate class for non-constituents, so-called distituents, while CCM does. Good results can thus be obtained without keeping track of distituents, simply by assigning all binary trees to the strings and letting the DOP model decide which substrings are most likely to form constituents.</Paragraph> <Paragraph position="3"> To give an idea of the constituents learned by U-DOP for the WSJ10, table 2 shows the 10 most frequent constituents in the trees induced by U-DOP, together with the 10 most frequently occurring constituents in the WSJ10 itself and the 10 most frequently occurring part-of-speech sequences (bigrams) in the WSJ10.</Paragraph> <Paragraph position="4"> Rank | Most frequent U-DOP constituents | Most frequent WSJ10 constituents | Most frequent WSJ10 substrings</Paragraph> </Section> </Section> </Paper>