<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1103">
  <Title>Use of Dependency Tree Structures for the Microcontext Extraction</Title>
  <Section position="3" start_page="23" end_page="24" type="metho">
    <SectionTitle>
2 Significance of word contexts
</SectionTitle>
    <Paragraph position="0"> Word senses are not something given a priori.</Paragraph>
    <Paragraph position="1"> Humans create word senses in the process of thinking and using language. Thinking forms language and language influences thinking. It is impossible to separate them. Word senses are products of their interaction. In our opinion, the effort to represent word senses as fixed elements in a textual information system is a methodological mistake.</Paragraph>
    <Paragraph position="2"> Many researchers consider the sense of a word as an average of its linguistic uses. Then, the investigation of sense distinctions is based on the knowledge of contexts in which a word appears in a text corpus. Sense representations are computed as groups of similar contexts. For instance, Schiitze (1998) creates sense clusters from a corpus rather than relying on a pre-established sense list. He makes up the clusters as the sets of contextually similar occurrences of an ambiguous word. These clusters are then interpreted as senses.</Paragraph>
    <Paragraph position="3"> According to how wide vicinity of the target word we include into the context we can speak about the local context and the topical context. The local or &amp;quot;micro&amp;quot;context is generally considered to be some small window of words surrounding a word occurrence in a text, from a few words of context to the entire sentence in which the target word appears. The topical context includes substantive words that co-occur  with a given word, usually within a window of several sentences. In contrast with the topical context, the microcontext may include information on word order, distance, grammatical inflections and syntactic strncture. In one study, Miller and Charles (1991) found evidence that human subjects determine the semandc similarity of words from the similarity of the contexts they are used in. They surnmarised this result in the so-called strong contextual hypothesis: Two words are semantically similar to the extent that their contextual representations are similar.</Paragraph>
    <Paragraph position="4"> The contextual representation of a word has been defined as a characterisation of the linguistic context in which a word appears.</Paragraph>
    <Paragraph position="5"> Leacock, Towell and Voorhees (1996) demonstrated that contextual representations consisting of both local and topical components are effective for resolving word senses and can be automatically extracted from sample texts. No doubt information from both microcontext and topical context contributes to sense selection, but the relative roles and importance of information from different contexts, and their interrelations, are not well understood yet.</Paragraph>
    <Paragraph position="6"> Not only computers but even humans learn, realise, get to know and understand the meanings of words from the contexts in which they meet them. The investigation of word contexts is the most important, essential, unique and indispensable means of understanding the sense of words and texts.</Paragraph>
  </Section>
  <Section position="4" start_page="24" end_page="28" type="metho">
    <SectionTitle>
3 Analysing Czech texts
</SectionTitle>
    <Paragraph position="0"> results of the analysis are the two target structures: the dependency microcontext structure (DMCS) which we use for the microcontext extraction and the tectogrammatical tree structure (TGTS) which represents the underlying syntactic structure of the sentence. As the main intention of this paper is to describe the DMCS, building of the TGTS is distinguished by dashed line in Figure 1; we mention it here only for completeness and for comparison with the DMCS.</Paragraph>
    <Paragraph position="1">  Key algorithms used in the process of the analysis are based on empirical methods and on previous statistical processing of training data, i.e. natural &amp;quot; language corpora providing statistically significant sample of correct decisions. Consequently, the ability of these procedures to provide a correct output has a stochastic character. These procedures were developed during the past years in the process of the Czech National Corpus and Prague Dependency Treebank creation. For a detailed descriptions see Haji6 (1998), Hladkfi (2000) and Collins, Haji6, Ram~haw, Tillmann (1999).</Paragraph>
    <Paragraph position="2"> As shown in Figure 1, the first procedure is tokenizafion. The output of tokenization is the text divided into lexical atoms or tokens, i.e. words, numbers, punctuation marks and special graphical symbols. At the same time the boundaries of sentences and paragraphs are determined.</Paragraph>
    <Paragraph position="3"> The following procedure, i.e. morphological tagging and lexical disambiguation, works in two stages. The first is the morphological analysis, which assigns each word its lemma, i.e. its basic word form, and its morphological tag. Since we often meet lexical ambiguity (i.e. it is not possible to determine the lemma and the tag uniquely without the knowledge of the word context), the morphological analyser often provides several alternatives. In the second stage, the result of the analysis is further used as an input for the lexical disambiguation assigning a given word form its unique lemma and morphological tag.</Paragraph>
    <Paragraph position="4"> The next procedures work with syntactic tree stn~ctures. This process is described in the following subsection.</Paragraph>
    <Section position="1" start_page="25" end_page="26" type="sub_section">
      <SectionTitle>
3.1 Syntactical analysis
</SectionTitle>
      <Paragraph position="0"> The first step of the syntactic tagging consists in the building of the anatytic tree structure (ATS) representing the surface syntactic dependency relations in the sentence. We use the statistical Collins's parser to create the stnlcture of the tree and then a statistical procedure to assign words their syntactic functions. Two examples of the ATS are given in figures  The automatically created ATS is a labelled oriented acyclic graph with a single root (dependency tree). In the ATS every word form and punctuation mark is explicitly represented as a node of the tree. Each node of the tree is annotated by a set of attribute-value pairs. One of the attributes is the analytic function that expresses the syntactic function of the word. The number of nodes in the graph is equal to the number of word form tokens in the sentence plus that of punctuation signs and a symbol for the sentence as such (the root of the tree). The graph edges represent surface syntactic relations within the sentence as defined in B6movfi et al (1997).</Paragraph>
      <Paragraph position="1"> The created ATS is further transformed either to the TGTS or to the DMCS. In the Prague Dependency Treebank annotation, the transduction of the ATS to the TGTS is performed (see BShmovfi and I-Iaji~ovfi 1999).</Paragraph>
      <Paragraph position="2"> For the sake of the construction of word contexts, we use the lemmas of word forms, their part of speech, their analytic function and we adapted the algorithms aiming towards the TGTS to build a similar structure, DMCS. Since, in comparison with the ATS, in both the TGTS and the DMCS only autosemantic words have nodes of their own, the first stage of this transformation (i.e. the pruning of the tree structure) is common.</Paragraph>
    </Section>
    <Section position="2" start_page="26" end_page="27" type="sub_section">
      <SectionTitle>
3.2 From ATS towards DMCS
</SectionTitle>
      <Paragraph position="0"> The transduction of the ATS to the DMCS consists of the four procedures:  1. Pruning of the tree structure, i.e. elimination of the auxiliary nodes and joining the complex word forms into one node.</Paragraph>
      <Paragraph position="1"> 2. Transformation of the structures of coordinations and appositions.</Paragraph>
      <Paragraph position="2"> 3. Transformation of the nominal predicates.</Paragraph>
      <Paragraph position="3"> 4. Transformation of the complements.  The first step of the transformation of the ATS to the respective DMCS is deletion of the auxiliary nodes. By the auxiliary nodes we understand nodes for prepositions, subordinate conjunctions, rhematizers (including negation) and punctuation. In case the deleted node is not a leaf of the tree, we reorganise the tree. For the IR purposes the auxiliary verbs do not carry any sense, so the analytical verb forms are treated as one single node with the lernma of the main verb. The purpose of the next three procedures is to obtain the context relations among words from the sentence, so we call them context transformations.</Paragraph>
      <Paragraph position="4"> The constructions of coordination and apposition are represented by a special node (usually the node of the coordinating conjunction or other expression) that is the governor of the coordinated subtrees and their common complementation in the ATS. The heads of the coordinated subtrees are marked by a special feature. In case of coordinated attributes, the transformation algorithm deletes the special node, which means that a separate microcontext (X, Atr, Y) is extracted for each member of coordination. The same procedure is used for adverbials, objects and subjects. If two clauses occur coordinated, the special node remains in the structure, as the clauses are handled separately.</Paragraph>
      <Paragraph position="5">  Probably the main difference from the syntactic analysis is the way we are dealing with the nominal predicate. We consider the nominal predicate to act as a normal predicate, though not expressed by a verb. This way of understanding a predicate is very close to predicate logic, where the sentence &amp;quot;The grass is green&amp;quot; is considered to express a formula such as &amp;quot;green(grass)&amp;quot;. In the ATS the complement (word syntactically depending both on the verb and the noun) is placed as a daughter node of the noun and marked by the analytical function of Atv. In the DMCS this node is copied and its analytical function is changed to Attr for the occurrence of the daughter of the noun and Adv for the new token of the daughter of the governing verb. As we cannot go into details here, we illustrate the DMCS by two examples given in figures 4 and 5. The nodes of the trees represent semantically significant words. The edges of the graphs are labelled by so called dependency types (see below).</Paragraph>
    </Section>
    <Section position="3" start_page="27" end_page="28" type="sub_section">
      <SectionTitle>
3.3 Extraction of microcontexts from the DMCS
</SectionTitle>
      <Paragraph position="0"> There are I0 parts of speech in Czech and 18 types of analytic function in ATSs. However, we will consider only four parts of speech, namely nouns (N), adjectives (A), verbs (V) and adverbs (D), and four types of analytic function, namely subject (Sb), object (Obj), adverbial (Adv) and attribute (Attr), because only these are significant for the purpose of retrieval.</Paragraph>
      <Paragraph position="1"> The construction of the dependency microcontext is based on the identification of significant dependency relationships (SDRs) in the sentence. An SDR consists of two words and a dependency type. An SDR is a triple \[wl, DT, w2\], where wl is a head word (lexical unit), DT is a dependency type and w2 is a depending word (lexical unit). A dependency type is a triple (P1, AF, P2), where Pi is the part of speech of the head word, AF is an analytic function and P2 is the part of speech of the depending word.</Paragraph>
      <Paragraph position="2"> For example, (A, Adv, D) is a dependency type expressing the relationship between words in expression &amp;quot;very large&amp;quot; where &amp;quot;very&amp;quot; is a depending adverb and &amp;quot;large&amp;quot; is a head adjective. \[large, (A, Adv, D), very\] is an example of an SDR.</Paragraph>
      <Paragraph position="3"> Considering 4 significant parts of speech and 4 analytic functions, we have 64 (= 4x4x4) possible distinct dependency types. In Czech, however, only 28 of them really occur. Thus, we have 28 distinct dependency types shown in Table 1. Table 2 surnmarises the number of dependency types for each part of speech. The dependency types marked by an asterisk are not the usual syntactic relations in Czech, they were added on account of the transformation of the nominal predicate.</Paragraph>
      <Paragraph position="4"> The number of SDRs extracted from one sentence is always only a little smaller than the number of significant, autosemantic words in the sentence, because almost all these words are depending on another word and make an SDR with it.</Paragraph>
      <Paragraph position="5"> Now we define the dependency word microcontext (DMC). A DMC of a given word w is a list of its microcontext elements (MCEs). An MCE is a pair consisting of a word and a dependency type. If a word w occurs in a sentence and forms an SDR with another word wl, i.e. if there is an SDR \[w, DT, wd or \[wl, DT', w\], then w~ and the dependency type DT or DT', respectively, constitute a mierocontext element \[DT, wd or \[wl, DT'\], respectively, of the word w. The first case implies that w is a head word in the SDR and in the second case the word w is a dependant.</Paragraph>
      <Paragraph position="6"> Thus, each SDR \[wl, DT, w2\] in a text produces two MCEs: \[w~, DT\] is an dement of the context of Wz and \[DT, w2\] is an element of the context of w~.</Paragraph>
      <Paragraph position="7">  In the following Table 3 we exemplify the microcontexts extracted from the sentences used in the examples above.</Paragraph>
      <Paragraph position="8"> Dependency types (N, Atr, N) (V, Sb, N) (V, Obj, N) (N, Atr, A) (V, Sb, V) (V, Obj, V) (N, Atr, V) (V, Sb, A) (A, Obj, A) (N, Adv, N)* (N, Sb, N)* (D, Obj, N) (N, Adv, V)&amp;quot; (N, Sb, A)* (A, Adv, A) (N, Adv, D)&amp;quot; (IN, Sb, V)* (A, Adv, D) (V, Adv, N) (A, Sb, N)* (A, Adv, N)* (V, Adv, V) (A, Sb, A)* (A, Adv, V)* (V, Adv, D) (A, Sb, V)* (D, Adv, D) (D, Adv, N)</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>