<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-4016">
  <Title>Sydney, July 2006. ©2006 Association for Computational Linguistics. TwicPen: Hand-held Scanner and Translation Software for non-Native Readers</Title>
  <Section position="4" start_page="61" end_page="61" type="metho">
    <SectionTitle>
2 Overview of TwicPen
</SectionTitle>
    <Paragraph position="0"> The TwicPen system is a natural follow-up of TWiC (Translation of Words in Context; see Wehrli, 2003, 2004), a system for on-line terminological help based on a full linguistic analysis of the source material. TwicPen uses very similar technology, but it runs on personal computers (or even PDAs) and uses a hand-held scanner to acquire the input material. In other words, TwicPen consists of (i) a simple hand-held scanner and (ii) parsing and translation software.</Paragraph>
    <Paragraph position="1"> TwicPen functions as follows: * The user scans a fragment of text, which can be as short as one word or as long as a whole sentence or even a whole paragraph.</Paragraph>
    <Paragraph position="2"> * The text appears in the user interface of the TwicPen system and is immediately parsed and tagged by the Fips parser described in the next section.</Paragraph>
    <Paragraph position="3"> * The user can either position the cursor on the specific word for which help is requested, or navigate word by word in the sentence.</Paragraph>
    <Paragraph position="4"> * For each word, the system retrieves from the tagged information the relevant lexeme and consults a bilingual dictionary to get one or several translations, which are then displayed in the user interface.</Paragraph>
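    <Paragraph> As an informal illustration of the last two steps, the following Python sketch looks up the word under the cursor in an already tagged sentence. It is not the TwicPen implementation: the names TaggedWord, BILINGUAL, translations and help_for are hypothetical, the dictionary is a toy, and the tags are written by hand rather than produced by Fips.</Paragraph>
    <Paragraph><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedWord:
    surface: str   # word form as it was scanned
    lexeme: str    # base form assigned by the parser (hand-written here)
    category: str  # part-of-speech tag


# Toy French-English dictionary keyed by (lexeme, category); illustrative only.
BILINGUAL = {
    ("battre", "Verb"): ["to beat", "to bang", "to rattle"],
    ("record", "Noun"): ["record"],
}


def translations(word, dictionary=BILINGUAL):
    """Return the translations of the word's lexeme (empty list if unknown)."""
    return dictionary.get((word.lexeme, word.category), [])


def help_for(sentence, cursor):
    """Look up the word at position `cursor` in a tagged sentence."""
    return translations(sentence[cursor])


# Hand-tagged fragment of "Paul a battu ..."; a real system would get this from the parser.
tagged = [
    TaggedWord("Paul", "Paul", "Noun"),
    TaggedWord("a", "avoir", "Verb"),
    TaggedWord("battu", "battre", "Verb"),
]
```
]]></Paragraph>
    <Paragraph> Keying the dictionary on the lexeme and its category, rather than on the surface form, is what lets an inflected form such as battu reach the entry for battre.</Paragraph>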
    <Paragraph position="5"> Figure 1 shows the user interface. The input text is the well-known German compound discussed by Kay et al. (1994) reproduced in (1):  Notice that the word Versicherungsgesellschaft (English insurance company, French compagnie d'assurance), which is a compound, has not been analyzed. This is because, like many common compounds, it has been lexicalized.</Paragraph>
  </Section>
  <Section position="5" start_page="61" end_page="62" type="metho">
    <SectionTitle>
3 The Fips parser
</SectionTitle>
    <Paragraph position="0"> Fips is a robust multilingual parser which is based on generative grammar concepts for its linguistic component and object-oriented design for its implementation. It uses a bottom-up parsing algorithm with parallel treatment of alternatives, as well as heuristics to rank alternatives (and cut their numbers when necessary).</Paragraph>
    <Paragraph position="1"> The syntactic structures built by Fips all follow the same pattern, namely [ XP L X R ], where L stands for the (possibly empty) list of left constituents, X for the (possibly empty) head of the phrase and R for the (possibly empty) list of right constituents. The possible values for X are the usual parts of speech: Adverb, Adjective, Noun, Determiner, Verb, Tense, Preposition, Complementizer, Interjection.</Paragraph>
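    <Paragraph> The uniform [ XP L X R ] pattern can be pictured as a small recursive data type. The sketch below is only an illustration under our own assumptions: the class name Phrase, the LABELS table (in particular the labels AdvP and IntP) and the bracketed method are hypothetical and do not come from the Fips code.</Paragraph>
    <Paragraph><![CDATA[
```python
from dataclasses import dataclass, field
from typing import List, Optional

# Phrase label for each head category; AdvP and IntP are assumed labels.
LABELS = {
    "Adverb": "AdvP", "Adjective": "AP", "Noun": "NP", "Determiner": "DP",
    "Verb": "VP", "Tense": "TP", "Preposition": "PP",
    "Complementizer": "CP", "Interjection": "IntP",
}


@dataclass
class Phrase:
    category: str                                          # value of X
    head: Optional[str] = None                             # None means an empty head
    left: List["Phrase"] = field(default_factory=list)     # L: left constituents
    right: List["Phrase"] = field(default_factory=list)    # R: right constituents

    def __post_init__(self):
        if self.category not in LABELS:
            raise ValueError(f"unknown category: {self.category}")

    def bracketed(self):
        """Render the phrase in labelled-bracket notation, writing e for an empty head."""
        parts = [p.bracketed() for p in self.left]
        parts.append(self.head if self.head is not None else "e")
        parts.extend(p.bracketed() for p in self.right)
        return "[" + LABELS[self.category] + " " + " ".join(parts) + "]"


# A determiner phrase with a noun phrase as its only right constituent.
dp = Phrase("Determiner", "le", right=[Phrase("Noun", "record")])
```
]]></Paragraph>
    <Paragraph> Because every phrase has the same three slots, a single recursive routine such as bracketed suffices to traverse any structure, whatever its category.</Paragraph>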
    <Paragraph position="2"> The parser makes use of three fundamental mechanisms: projection, merge and move.</Paragraph>
    <Section position="1" start_page="61" end_page="62" type="sub_section">
      <SectionTitle>
3.1 Projection
</SectionTitle>
      <Paragraph position="0"> The projection mechanism assigns a fully developed structure to each incoming word, based on its category and other inherent properties. Thus, a common noun is directly projected to an NP structure with the noun as its head, an adjective to an AP structure, a preposition to a PP structure, and so on. We assume that pronouns and, in some languages, proper nouns project to a DP structure, as illustrated in (2a). Furthermore, the occurrence of a tensed verb triggers a more elaborate projection, since a whole TP-VP structure is assigned. For instance, in French, tensed verbs occur in T position, as illustrated in (2b): (2)a. [DP Paul ], [DP elle ] b. [TP manges_i [VP e_i ] ]</Paragraph>
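      <Paragraph> A minimal sketch of projection, assuming a nested-list encoding of structures, is given below. The function names project and show, the category names Pronoun and ProperNoun and the flag proper_nouns_as_dp are ours and purely illustrative; the tensed-verb case follows the French pattern in (2b).</Paragraph>
      <Paragraph><![CDATA[
```python
# Phrase label projected by each plain category (a subset, for illustration).
SIMPLE = {"Noun": "NP", "Adjective": "AP", "Preposition": "PP", "Verb": "VP"}


def project(word, category, tensed=False, proper_nouns_as_dp=True):
    """Return the structure projected by one word as a nested list [label, child, ...]."""
    if category == "Verb" and tensed:
        # French-style projection: the tensed verb sits in T and is
        # coindexed with an empty head inside VP, as in (2b).
        return ["TP", word + "_i", ["VP", "e_i"]]
    if category == "Pronoun":
        return ["DP", word]
    if category == "ProperNoun":
        # Language-dependent choice: DP in some languages, NP otherwise.
        return ["DP" if proper_nouns_as_dp else "NP", word]
    return [SIMPLE[category], word]


def show(node):
    """Render a nested-list structure in labelled-bracket notation."""
    if isinstance(node, str):
        return node
    return "[" + " ".join(show(child) for child in node) + "]"
```
]]></Paragraph>
      <Paragraph> For example, show(project("manges", "Verb", tensed=True)) gives [TP manges_i [VP e_i]], while an untensed verb form receives a plain VP.</Paragraph>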
    </Section>
    <Section position="2" start_page="62" end_page="62" type="sub_section">
      <SectionTitle>
3.2 Merge
</SectionTitle>
      <Paragraph position="0"> The merge mechanism combines two adjacent constituents, A and B, either by attaching constituent A as a left constituent of B, or by attaching B as a right constituent of any active node of A (an active node is one that can still accept subconstituents). Merge operations are constrained by various, mostly language-specific, conditions, which can be described by means of procedural rules. Those rules are stated in a pseudo-formalism which attempts to be both intuitive for linguists and relatively straightforward to code (for the time being, this is done manually). The conditions take the form of boolean functions, as described in (3) for left attachments and in (4) for right attachments, where a and b refer, respectively, to the first and to the second constituent of a merge operation.</Paragraph>
      <Paragraph position="2"> Rule (3) states that a DP constituent (i.e. a traditional noun phrase) can (left-)merge with a TP constituent (i.e. an inflected verb phrase constituent) if (i) both constituents agree in number and person and (ii) the DP constituent can be interpreted as the subject of the TP constituent.</Paragraph>
      <Paragraph position="4"> Rule (4a) allows a noun phrase to be (right-)attached to a determiner phrase, under the conditions (i) that the head of the DP bears the selectional feature [+Ncomplement] (i.e. the determiner selects a noun), and (ii) that the determiner and the noun agree in gender and number. Finally, rule (4b) allows the attachment of a DP as a right subconstituent of a verb (i) if the verb is not an auxiliary or modal (i.e. it is a main verb) and (ii) if the DP can be interpreted as a direct object argument of the verb.</Paragraph>
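      <Paragraph> To make the idea of merge conditions as boolean functions concrete, the conditions described in prose above can be sketched in Python as follows. This is not the pseudo-formalism used in Fips: the Constituent class, the feature names (can_be_subject, selects_noun, auxiliary, modal, can_be_object) and the use of the label VP for the verbal constituent are assumptions made for the example.</Paragraph>
      <Paragraph><![CDATA[
```python
from dataclasses import dataclass, field


@dataclass
class Constituent:
    label: str                                      # e.g. "DP", "TP", "NP", "VP"
    features: dict = field(default_factory=dict)    # hypothetical feature bundle


def agree(a, b, *names):
    """True if both constituents carry each named feature with the same value."""
    return all(n in a.features and a.features[n] == b.features.get(n) for n in names)


def can_left_merge(a, b):
    """Condition in the spirit of rule (3): a subject DP attaches to the left of a TP."""
    return (a.label == "DP" and b.label == "TP"
            and agree(a, b, "number", "person")
            and a.features.get("can_be_subject", False))


def can_right_merge(a, b):
    """Conditions in the spirit of rules (4a) and (4b) for right attachment."""
    if a.label == "DP" and b.label == "NP":
        # (4a): the determiner selects a noun and agrees with it.
        return a.features.get("selects_noun", False) and agree(a, b, "gender", "number")
    if a.label == "VP" and b.label == "DP":
        # (4b): main verbs only, and the DP must be a possible direct object.
        is_main_verb = not (a.features.get("auxiliary") or a.features.get("modal"))
        return is_main_verb and b.features.get("can_be_object", False)
    return False


# Sample constituents with hand-assigned features.
paul = Constituent("DP", {"number": "sg", "person": 3, "can_be_subject": True})
mange = Constituent("TP", {"number": "sg", "person": 3})
mangent = Constituent("TP", {"number": "pl", "person": 3})
le = Constituent("DP", {"selects_noun": True, "gender": "m", "number": "sg"})
record = Constituent("NP", {"gender": "m", "number": "sg"})
```
]]></Paragraph>
      <Paragraph> In this sketch the interpretive conditions (subject or direct object interpretation) are reduced to pre-computed flags; in a real parser they would depend on the argument structure of the verb.</Paragraph>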
    </Section>
    <Section position="3" start_page="62" end_page="62" type="sub_section">
      <SectionTitle>
3.3 Move
</SectionTitle>
      <Paragraph position="0"> Although the general architecture of surface structures results from the combination of projection and merge operations, an additional mechanism is necessary to handle so-called extraposed elements and link them to empty constituents (noted e in the structural representation below) in canonical positions, thereby creating a chain between the base (canonical) position and the surface (extraposed) position of the &quot;moved&quot; constituent, as illustrated in the following example: (5)a. who did you invite ? b. [ CP [ DP who ]_i did_j [ TP [ DP you ] e_j [ VP invite [ DP e ]_i ] ] ]</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="62" end_page="63" type="metho">
    <SectionTitle>
4 Multi-word expressions
</SectionTitle>
    <Paragraph position="0"> Perhaps the most advanced feature of TwicPen is its ability to handle multiword expressions (idioms, collocations), including those in which the elements of the expression are not immediately adjacent to each other. Consider the French verb-object collocation battre-record (break-record), illustrated in (6a, b), as well as in Figure 3.</Paragraph>
    <Paragraph position="1"> (6)a. Paul a battu le record national.</Paragraph>
    <Paragraph position="2"> Paul broke the national record. b. L'ancien record de Bob Hayes a finalement été battu.</Paragraph>
    <Paragraph position="3"> Bob Hayes' old record was finally broken.</Paragraph>
    <Paragraph position="4"> The collocation is relatively easy to identify in (6a), where the verb and the direct object noun are almost adjacent and occur in the expected order. It is of course much harder to spot in sentence (6b), where the order is reversed (due to passivization) and the distance between the two elements of the collocation is seven words. Nevertheless, as Figure 3 shows, TwicPen is capable of identifying the collocation.</Paragraph>
    <Paragraph position="5"> The screenshot given in Figure 3 shows that the user selected the word battu, which is a form of the transitive verb battre, as indicated in the base form field of the user interface. This lexeme is commonly translated into English as to beat, to bang, to rattle, etc. However, the collocation field shows that battu in that sentence is part of the collocation battre-record, which is translated as break-record.</Paragraph>
    <Paragraph position="6"> The ability of TwicPen to handle expressions comes from the quality of the linguistic analysis provided by the multilingual Fips parser and of the collocation knowledge base (Seretan et al., 2004). A sample analysis is given in (7b), showing how extraposed elements are connected with canonical empty positions, as assumed by generative linguists. (7)a. The record that John broke was old.</Paragraph>
    <Paragraph position="7"> b. [ TP [ DP the [ NP record_i [ CP that_i [ TP [ DP John ] [ VP broke [ DP e ]_i ] ] ] ] ] [ VP was [ AP old ] ] ] In this analysis, notice that the noun record is coindexed with the relative pronoun that, which in turn is coindexed with the empty direct object of the verb broke. Given this antecedent-trace chain, it is relatively easy for the system to identify the verb-object collocation break-record.</Paragraph>
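    <Paragraph> The role of the antecedent-trace chain can be sketched as follows, under simplifying assumptions: each constituent is reduced to a lexeme and a coindexation label, and the collocation knowledge base is reduced to a set of (verb, object) lexeme pairs. The names Node, antecedent, find_collocation and LEXICON are hypothetical and do not reflect the actual TwicPen data structures.</Paragraph>
    <Paragraph><![CDATA[
```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    lexeme: Optional[str]          # None for an empty constituent (e)
    index: Optional[str] = None    # coindexation label, e.g. "i"
    is_pronoun: bool = False       # relative pronouns only relay the chain


def antecedent(node, nodes):
    """Return the overt, non-pronominal member of the chain that `node` belongs to."""
    if node.lexeme is not None and not node.is_pronoun:
        return node                # the node is itself a lexical noun
    if node.index is None:
        return None                # no chain to follow
    for other in nodes:
        if (other.index == node.index and other.lexeme is not None
                and not other.is_pronoun):
            return other
    return None


def find_collocation(verb_lexeme, direct_object, nodes, lexicon):
    """Return the (verb, object) pair if it is a known collocation, else None."""
    head = antecedent(direct_object, nodes)
    if head is None:
        return None
    pair = (verb_lexeme, head.lexeme)
    return pair if pair in lexicon else None


# Toy collocation base and the nominal constituents of (7b).
LEXICON = {("break", "record")}
record = Node("record", "i")
that = Node("that", "i", is_pronoun=True)
john = Node("John")
gap = Node(None, "i")              # empty direct object of "broke"
nodes = [record, that, john, gap]
```
]]></Paragraph>
    <Paragraph> Calling find_collocation("break", gap, nodes, LEXICON) resolves the empty object to record through the shared index and returns the pair (break, record), although the two words are neither adjacent nor in canonical order in (7a).</Paragraph>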
  </Section>
</Paper>