<?xml version="1.0" standalone="yes"?> <Paper uid="C92-3141"> <Title>HIGH-PROBABILITY SYNTACTIC LINKS</Title> <Section position="6" start_page="0" end_page="0" type="concl"> <SectionTitle> 5 Experiments </SectionTitle> <Paragraph position="0"> At present, after preliminary debugging and tuning of the rules, we have begun to carry out regular experiments with a homogeneous flow of Russian texts. The experiments make use of a computer-oriented combinatorial dictionary of Russian compiled by a group of linguists under the guidance of Ju. D. Apresjan (see Apresjan et al. 1992). It contains over 10,000 entries, mainly general scientific vocabulary and terms from computer science and electrical engineering.</Paragraph> <Paragraph position="1"> The number of rules in the system is now about 100. The total number of arcs in their transition graphs is about 2,000.</Paragraph> <Paragraph position="2"> As a source of texts, we have taken several issues of the journal Computer Science Abstracts (Referativnyj zhurnal Vychislitel'nyje Nauki, in Russian). Sentences are chosen at random. Sentences with formulas, occasional abbreviations, and non-Cyrillic words are excluded. Words absent from the dictionaries (about 8% of all word occurrences in these texts) are replaced by &quot;dummy&quot; words that have the syntactic properties most probable for the given category. At present, about 300 sentences have been processed.</Paragraph> <Paragraph position="3"> On the average, fragments produced by partial parsing include 3 - 4 words. It is not infrequent that they have 8 - 10 or more words, or present complete structures of sentences. On the other hand, a substantial part of fragments are isolated homonyms.</Paragraph> <Paragraph position="4"> For instance, subordinate conjunctions remain isolated in most cases because, as a rule, their links with other words are not considered as having high probability. 
Frequently enough, morphological, lexical, and structural ambiguity results in building 2 - 4 different fragments on the same segment. Sometimes their number is 8 - 12 and more, but such cases are relatively rare. The record is now equal to 72 fragments on a segment of 9 words. For such cases, packing techniques can be developed similar to those described by Tomita (1987). Another possible method is to employ numerical estimates of syntactic preference (see, for example, Tsejtin 1975; Kulagina 1987, 1990; Tsujii et al. 1988).</Paragraph> <Paragraph position="5"> On the average, the number of established links is 70 - 80 % of the total number of syntactic links in the sentence. These figures include links present both in the fragments built and in the semantically correct structure of the sentence; &quot;extra&quot; links that arise due to ambiguity of fragments are not included.</Paragraph> <Paragraph position="6"> Sometimes the fragments overlap, that is, their segments intersect. It happens approximately in one tenth of sentences. As a rule, in such cases the correct result is a combination of one of the overlapping fragments with its &quot;truncated&quot; competitor.</Paragraph> <Paragraph position="7"> A fragment is called correct for a given sentence if it is a subtree of the semantically correct dependency tree of this sentence (or of one of such trees, in the rare cases of real semantic ambiguity like (3) - (4)).</Paragraph> <Paragraph position="8"> A fragment is called feasible if it is a subtree of some dependency tree of some sentence of the given language. The algorithm makes an error in the following two cases: (a) if a non-feasible fragment is built; (b) if all fragments built on some segment are feasible but none is correct. (Here we do not take into account semantically abnormal sentences or the possibility of overlapping; these situations would require more accurate definitions.) 
In most cases, an error means that some link of a fragment is established erroneously, while all the others are correct. The experiments have shown that the frequency of errors for the algorithm described is fairly small. For the last 100 sentences, 12 errors were made (9 of the first type and 3 of the second), which is less than 1% of the total number of links established in correct fragments. A stable estimate is not yet obtained because at this stage of experiments tuning of the rules is continued, and the error frequency decreases steadily.</Paragraph> <Paragraph position="9"> Errors of the first type are caused by inaccuracy of the lexicographic descriptions and imperfection of the rules. In the presence of adequate lexicographic information, these errors in principle are avoidable, as the rules may fully control internal properties of the fragments being created.</Paragraph> <Paragraph position="10"> The second type of error is intrinsic to our approach. The rules employed are local in two respects: they take no (or almost no) account of the context outside the fragments being adjoined, and they take no account of a very large part of syntax that concerns less probable links. The first restriction means that fragments may appear which are grammatically feasible but do not agree with the context. The second one implies that we do not intend to obtain complete structures of sentences, and therefore shall not be able to reject a fragment for the reason that it is not engaged in any complete structure.</Paragraph> <Paragraph position="11"> In general, it is not at all surprising that a certain part of syntactic links can be reliably revealed by local mechanisms. Any flow of texts in any language must contain chains of words the parse of which weakly depends on the context (&quot;weakly&quot; can be understood here in the statistical sense: the share of those occurrences for which the parse differs from the most probable one is small). 
The possibility of examining fragments in any detail makes it possible to avoid situations in which the risk of creating a non-feasible fragment is too large.</Paragraph> <Paragraph position="12"> A more surprising fact is that the number of reliably established links is rather high, about 75 %. For the most part, these are links typical of the basic, most frequent syntactic constructions such as &quot;adjective + noun&quot;, &quot;preposition + noun&quot;, &quot;numeral + noun&quot;, &quot;adverb + verb&quot;, and also a large group of links connecting predicate words with their arguments. As regards the last type, preference for the predicate-argument interpretation of word combinations was often noted in the literature (this preference is a particular case of the Most Restrictive Context Principle proposed by Hobbs and Bear (1990)).</Paragraph> <!-- Page footer: Actes de COLING-92, Nantes, 23-28 août 1992, p. 933; Proc. of COLING-92, Nantes, Aug. 23-28, 1992 --> <Paragraph position="13"> Observations show that the number of established high-probability links noticeably depends on the type of text. The general trend is as follows: the more &quot;formal&quot; the text is, the more links are established. From this point of view, the language of scientific abstracts suits the given approach quite well.</Paragraph> <Paragraph position="14"> As regards the comparative frequency of high-probability links in different languages, it would be natural to expect these links to be more typical of languages with rich morphology than of analytical ones (such as English). Nevertheless, preliminary experiments have shown no substantial difference in this respect between English and Russian scientific texts.</Paragraph> <Paragraph position="15"> We suppose that in the case of high-probability links, the efficiency of the local approach is additionally augmented due to factors &quot;of the second order&quot; concerning general mechanisms of text comprehension and generation. 
This opinion is based on the following assumptions. If someone reading a text sees that a high-probability link is possible between certain words and this link is compatible with the previous part of the text, then he makes a conjecture that this link is correct; this conjecture is abandoned only if some counter-evidence is obtained. When people generate texts, they take into account this property of the comprehension mechanism and tend not to disappoint the expectations of the readers. In other words, they are careful not to create high-probability links that would prove to be incorrect. This can be regarded as an instance of cooperation in language performance (cf. the Cooperative Principle in pragmatics formulated by Grice (1975)).</Paragraph> </Section> </Paper>