<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2410">
  <Title>Multiword Units in an MT Lexicon</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The multi-word unit continuum
</SectionTitle>
    <Paragraph position="0"> In order to develop first an intuitive grasp of the phenomena, consider the following examples.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="73" type="metho">
    <SectionTitle>
1) English-speaking population
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="73" type="sub_section">
      <SectionTitle>
French-speaking clients
Spanish-speaking students
</SectionTitle>
      <Paragraph position="0"> It would not be difficult to carry on with further examples, each embodying a pattern &lt;language-name&gt; speaking &lt;person&gt; or &lt;group of persons&gt;. It is a prototypical example for our purposes because the words are interdependent yet they admit of open-choice in the selection of lexical items for certain positions. The phrases *speaking students, English-speaking, or English population are either not well-formed or does not mean the same as the full expression. The meaning of the phrase is predominantly, if perhaps not wholly, compositional and for native language speakers the structure may seem entirely transparent. However, in a bilingual context this transparency does not necessarily carry over to the other language. For example, the phrases in  The Hungarian equivalent bears the same characteristics of semantic compositionality and structural transparency and is open-ended in the same points as the corresponding slots in the English  pattern. It would be extremely wasteful to capture the bilingual correspondences in an itemized manner, particularly as the set of expressions on both sides are open-ended anyway.</Paragraph>
      <Paragraph position="1"> At the other end of the scale in terms of productivity and compositionality one finds phrases like those listed in 3) 3) English breakfast French fries German measles Purely from a formal point of view, the phrases in 3) could be captured in the pattern &lt;language name&gt;&lt;noun&gt; but the co-occurrence relations between items in the two sets are limited to the extreme so that once they are defined properly, we are practically thrown back to the particular one-to-one combinations listed in 3).</Paragraph>
      <Paragraph position="2"> Note that if we had a set like 4), where one element is shared it would still not make sense make sense to factorize the shared word French because it enters into idiomatic semantic relations. In other words, the multi-word expressions are semantically non-compositional even in terms of English alone.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="73" end_page="73" type="metho">
    <SectionTitle>
4) French bread
</SectionTitle>
    <Paragraph position="0"> French horn French dressing The set of terms in 5) exemplifies the other end of the scale in terms of compositionality and syntactic transparency. They are adduced here to exemplify fully regular combinations of words in their literal meaning.</Paragraph>
    <Paragraph position="1">  In between the wholly idiosyncratic expressions which need to be listed in the lexicon and the set of completely open-choice expressions which form the province of syntax, there is a whole gamut of expressions that seem to straddle the lexicon-syntax divide. They are non-compositional in meaning to some extent and they also include elements that come from a more or less open set. Some of these open-choice slots in the expressions may be filled with items from sets that are either infinite (like numbers) or numerous enough to render them hopeless or wasteful for listing in a dictionary. For this reason, they are typically not fully specified in dictionaries, which have no of means of representing them explicitely in any other way than by listing. For want of anything better, lexicographers rely on the linguistic intelligence of their readers to infer from a partial list the correct set of items that a given lexical unit applies to. Bolinger (Bolinger 1965) elegantly sums up this approach as Dictionaries do not exist to define, but to help people grasp meaning, and for this purpose their main task is to supply a series of hints and associations that will relate the unknown to something known.</Paragraph>
    <Paragraph position="2"> Adroit use of this technique may be quite successful with human readers but is obviously not viable for NLP purposes. What is needed is some algorithmic module in order to model the encoding/decoding processing that humans do in applying their mental lexicon. The most economical and sometimes the only viable means to achieve this goal is to integrate some kind of rule-based mechanism that would support the recognition as well as generation of all the lexical units that conventional dictionaries evoke through well-chosen partial set of data.</Paragraph>
  </Section>
  <Section position="6" start_page="73" end_page="74" type="metho">
    <SectionTitle>
3 Local grammars
</SectionTitle>
    <Paragraph position="0"> Local Grammars, developed by Maurice Gross (Gross 1997), are heavily lexicalized finite state grammars devised to capture the intricacies of local syntactic or semantic phenomena. In the mid-nineties a very efficient tool, INTEX was developed at LADL, Paris VII, (Silberztein 1999) which has two components that are of primary importance to us: it contains a complex lexical component (Silberztein 1993) and a graphical interface which supports the development of finite state transducers in the form of graphs (Silberztein 1999).</Paragraph>
    <Paragraph position="1"> Local grammars are typically defined in graphs which are compiled into efficient finite state automata or transducers. Both the lexicon and the grammar are implemented in finite state transducers. This fact gives us the ideal tool to implement the very kind of lexicon we have been arguing for, one that includes both static entries and lexical grammars.</Paragraph>
    <Paragraph position="2"> The set of expressions discussed in 1) can be captured with the graph in Figure 1. It shows a simple finite state automaton of a single with through three nodes along the way from the initial symbol on the left to the end symbol on the right. It represents all the expressions that match as the graph is traversed between the two points. Words in angle brackets stand for the lemma form, the shaded box represent a subgraph that can freely be embedded in graphs. The facility of</Paragraph>
    <Section position="1" start_page="74" end_page="74" type="sub_section">
      <SectionTitle>
Figure 1 Graph covering expressions like English-speaking students
</SectionTitle>
      <Paragraph position="0"> graph embedding has the practical convenience that it allows the reuse of the subgraph in other contexts. At a more theoretical level, it introduces the power of recursion into grammars.</Paragraph>
      <Paragraph position="1"> Subgraphs may also be used to represent a semantic class, such as language name in the present case, and can be encoded in the dictionary with a semantic feature like +LANGNAME. IN-TEX/NOOJ dictionaries allow an arbitrary number of semantic features to be represented in the lexical entries and they can be used in the definition of local grammars as well. An alternative grammar using semantic features is displayed in  tic features Note that to render expressions like in 2) we use local grammars containing nodes that range from specific word forms through lemmas, lists of words, words defined by a semantic class in an ontology to syntactic class or even the completely general placeholder for any word. Such flexibility allows us to apply the constraint defined at the right level of generality required to cover exactly the set of expressions without overgeneration.</Paragraph>
      <Paragraph position="2"> The local grammars defining the kind of partially productive multi-word units that the present paper focuses on can typically be defined with the nodes being defined in terms of some natural semantic class such as the language names of examples 2) or names of colours or body parts illustrated in 6) 6a) the lady in black 6b) a fekete ruhas holgy the black clad lady The English expression in 6a) can be implemented with the graph in Figure 3, its Hungarian equivalent 6b) is displayed in Figure 4.</Paragraph>
      <Paragraph position="3"> Figure 3 Local grammar to cover the expressions like 6a) Figure 4 Local grammar to cover the expressions like 6b) The use of semantic features is merely the first step in building an efficient lexicon. At a more advanced level, the lexicon would include a system of semantic features arranged into typed hierarchy, which would allow use of multiple inheritence. null</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="74" end_page="75" type="metho">
    <SectionTitle>
4 Application of local grammars
</SectionTitle>
    <Paragraph position="0"> In the present section we provide some examples of how rendering multi-word units with local grammars can enhance a multi-lingual application. null</Paragraph>
    <Section position="1" start_page="74" end_page="75" type="sub_section">
      <SectionTitle>
4.1 Semantic disambiguation
</SectionTitle>
      <Paragraph position="0"> The use of transducers in INTEX/NOOJ provides an intuitive and user-friendly means of semantic disambiguation as illustrated in Figure 5. Here the appropriate meaning of the specific node is defined by its Hungarian equivalent, but of course one might just as well have used mono-lingual tags for the same purpose.</Paragraph>
    </Section>
    <Section position="2" start_page="75" end_page="75" type="sub_section">
      <SectionTitle>
4.2 Partial automatic translation
</SectionTitle>
      <Paragraph position="0"> On the analogy of shallow parsing, we may compile transducers that produce as output the target language equivalent of the chunks recognized. This is illustrated in Figure 6 where the expressions &amp;quot;trade/trading in dollar/yen&amp;quot; etc. are rendered as &amp;quot;dollarkereskedelem, jenkereskedelem&amp;quot; etc. whereas &amp;quot;trade/trading in Tokyo/London&amp;quot; etc. are translated as &amp;quot;tokioi/londoni kereskedes&amp;quot;. Note that the recognized words are stored in a variable captured by the labelled brackets and used in the compilation of the output.</Paragraph>
      <Paragraph position="1"> Figure 5 Partial translation transducers using variables</Paragraph>
    </Section>
    <Section position="3" start_page="75" end_page="75" type="sub_section">
      <SectionTitle>
4.3 Automatic lexical acquisition
</SectionTitle>
      <Paragraph position="0"> Local grammars can be used not only for recognition and generation but also for automated lexical acquisition. This can be achieved by suitably relaxing the constraints on one or more of the nodes in a graph and apply it to a large corpus. The resulting hit expressions can then be manually processed to find the semantic feature underlying the expressions or establish further subclasses etc.</Paragraph>
      <Paragraph position="1"> As an example, consider Figure 7 containing a graph designed to capture expressions describing various kinds of graters in English. As Figure 6 shows the entry for grater in the Oxford Advanced dictionary (Wehmeier 2005) uses only hints through specific examples as to what sort of graters there may be in English Figure 6 Part of the dictionary entry GRATE from OALD7 The node &lt;MOT&gt; matches an arbitrary word in INTEX, the symbol &lt;E&gt; covers an empty ele-</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>