<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-2180">
  <Title>United Kingdom</Title>
  <Section position="5" start_page="1028" end_page="1029" type="metho">
    <SectionTitle>
3 Extension for proper nouns : interactive tagging
</SectionTitle>
    <Paragraph position="0"> Proper nouns present another problem that falls under messy details. A small extract from the corpus used for the English grammar showed a wide range of possible proper noun configurations : &amp;quot;James Sledz&amp;quot;, &amp;quot;Racketeer Influenced and Corrupt Organizations&amp;quot;, &amp;quot;Sam A. Call&amp;quot;, &amp;quot;Mr. Yasuda&amp;quot;, &amp;quot;Mr. Genji Yasuda&amp;quot;, ...</Paragraph>
    <Paragraph position="1"> Regular expressions can catch several of those cases, but it is difficult to get certainty, e.g. &amp;quot;Then Yasuda ...&amp;quot; vs &amp;quot;Genji Yasuda&amp;quot; : one can never be sure that an English word is not a name in another language. Since this is a pre-processing treatment, there is no disambiguating information present, and fully automatic tagging cannot be done, unless the program can have access to some lookup facility and/or can interact with a human user.</Paragraph>
    <Paragraph position="2"> ... preceding expression, square brackets surround alternative characters (possibly specified as a range, e.g. &amp;quot;\[0-9\]&amp;quot;).</Paragraph>
    <Section position="1" start_page="1029" end_page="1029" type="sub_section">
      <SectionTitle>
3.1 Patterns for proper nouns
</SectionTitle>
      <Paragraph position="0"> For financial texts, the domain of our reference corpus, the proper nouns are company or institution names and person names. Product and company names can be very unconventional. Therefore the regular expressions need to be rather generous. The interaction with the user and the dictionaries will provide a way to tune the effect of these expressions.</Paragraph>
      <Paragraph position="1"> We defined the proper noun regular expression to be nearly anything, preceded by a capital. Person names can contain initials, and they might be modified by titles (&amp;quot;Mr&amp;quot;, ...) or functions; business names can be modified by some standard terminology (like &amp;quot;Ltd.&amp;quot;). Lower case words are allowed if they are not longer than three characters (for names containing &amp;quot;and&amp;quot; etc.).</Paragraph>
    </Section>
    <Section position="2" start_page="1029" end_page="1029" type="sub_section">
      <SectionTitle>
3.2 Interacting with the user
</SectionTitle>
      <Paragraph position="0"> Tagging proper nouns presents a special problem, since, unlike the case of numbers and dates, there is a great deal of uncertainty involved as to whether something is a proper noun or not. Therefore a natural extension to tagit was the implementation of an interactive capability for confirming certain tag types such as proper nouns. 2 If a proper noun is found, then the tagger first does some lookup to limit the number of interactions during the tagging. We used the two following heuristics : 1. Has it already been tagged as a proper noun ? If so, tag it again.</Paragraph>
      <Paragraph position="1"> 2. Has it already been offered as a proper noun, but was it rejected ? If so, and if it occurs at the beginning of a sentence, reject it again. Those two checks are kept exclusively disjunctive. If a word occurs both as a proper noun and as a &amp;quot;non-proper noun&amp;quot;, the user will be asked if he or she wants it to be tagged. This allows one to use different name dictionaries for different texts. If the program itself is certain that a proper noun is found, then it tags it and goes on to the next match. Otherwise it asks the user what to do with the match that was found. There are two possible answers to this question : (2 This interactive capability has been implemented in Tcl/Tk.)</Paragraph>
      <Paragraph position="2"> 1. The user accepts the match. When the match is not entirely a proper noun, the matching string can be edited. This consists of removing the words before and/or after the first proper noun in the match. 3 The remaining substring of the match is tagged as a proper noun and stored. The words before the first word are skipped (and also stored); everything that comes after the tagged proper noun is resubmitted.</Paragraph>
      <Paragraph position="3"> 2. The user rejects the match that is offered. The program stores it (as a &amp;quot;non-proper noun&amp;quot;) and proceeds.</Paragraph>
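The two lookup heuristics and the accept/reject interaction can be sketched as follows. The function and store names are assumptions for illustration, not the actual tagit implementation; the logic mirrors the description: two disjoint stores, silent re-tagging of confirmed names, silent re-rejection at sentence start, and a user query in every other case.

```python
accepted = set()   # words previously confirmed as proper nouns
rejected = set()   # words previously rejected by the user

def decide(word, sentence_initial, ask_user):
    """Return True if `word` should be tagged as a proper noun."""
    in_acc, in_rej = word in accepted, word in rejected
    if in_acc and not in_rej:
        return True        # heuristic 1: confirmed before, tag it again
    if in_rej and not in_acc and sentence_initial:
        return False       # heuristic 2: rejected before, reject it again
    answer = ask_user(word)            # in both stores, or unseen: ask
    (accepted if answer else rejected).add(word)
    return answer

# Example: "Yasuda" was confirmed once, so it is tagged again silently.
accepted.add("Yasuda")
print(decide("Yasuda", False, ask_user=lambda w: False))  # → True
```

Keeping the two stores disjoint in the checks means a word found in both always triggers a query, which is what allows different name dictionaries to coexist across texts.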
    </Section>
  </Section>
  <Section position="6" start_page="1029" end_page="1030" type="metho">
    <SectionTitle>
4 Integration with linguistic analysis
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1029" end_page="1030" type="sub_section">
      <SectionTitle>
</SectionTitle>
      <Paragraph position="0"> The ALEP platform (Alshawi et al., 1991) provides the user with a Text Handling (TH) component which allows a &amp;quot;pre-processing&amp;quot; of input. An ASCII text will first go through a processing chain consisting of an SGML-based tagging of the elements of the input. The default setup of the system defines the following processing chain : the text is first converted to an EDIF (Eurotra Document Interchange Format) format. Then three recognition processes are provided : paragraph recognition, sentence recognition and word recognition. The output from those processes consists of the input decorated with tags for the recognized elements : 'P' for paragraphs, 'S' for sentences, 'W' for words (in case of morphological analysis, the tag 'M' is provided for morphemes) and 'PT' for punctuation signs. Some specialized features are also provided for the tagged words, allowing them to be characterized more precisely, for example 'ACRO' for acronyms and so on.</Paragraph>
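The default chain can be sketched as follows. This is a minimal reconstruction, not the actual ALEP TH component: the tokenization and sentence-splitting rules are assumptions, only the tag inventory ('P', 'S', 'W', 'PT') comes from the description above.

```python
import re

def tag_words(sentence):
    """Decorate tokens with 'W' (word) or 'PT' (punctuation) tags."""
    parts = []
    for tok in re.findall(r"\w+|[^\w\s]", sentence):
        tag = "W" if tok[0].isalnum() else "PT"
        parts.append(f"<{tag}>{tok}</{tag}>")
    return " ".join(parts)

def tag_text(text):
    """Wrap the text in a 'P' tag and each sentence in an 'S' tag."""
    sents = re.split(r"(?<=[.!?])\s+", text.strip())
    body = " ".join(f"<S>{tag_words(s)}</S>" for s in sents)
    return f"<P>{body}</P>"

print(tag_text("John sees Mary."))
# → <P><S><W>John</W> <W>sees</W> <W>Mary</W> <PT>.</PT></S></P>
```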
      <Paragraph position="1"> So the single input &amp;quot;John sees Mary.&amp;quot; after being processed by the TH component will take the form &lt;P&gt;&lt;S&gt;&lt;W&gt;John&lt;/W&gt; &lt;W&gt;sees&lt;/W&gt; &lt;W&gt;Mary&lt;/W&gt; &lt;PT&gt;.&lt;/PT&gt;&lt;/S&gt;&lt;/P&gt;, where &lt;P&gt; and &lt;/P&gt; mark the beginning and the respective ending of the recognized paragraph structure. The other tags must be interpreted analogously. In the default case, it is this kind of information which is the input to the TH-LS component (Text-Handling to Linguistic Structure) of the system. Within this component, one specifies so-called 'ts_ls' (text structure to linguistic structure) rules, which transform the TH output into 3 To extend the matches, the user would need to change the regular expressions.</Paragraph>
      <Paragraph position="2"> partial linguistic structure (in ALEP terminology, this conversion is called lifting). The syntax of these lift rules is the following : ts_ls_rule( &lt;id&gt;, &lt;tag_name&gt;, \[&lt;features&gt;\], &lt;tag_content&gt; ) where : &lt;id&gt; is a Linguistic Description (LD); &lt;tag_name&gt; is the name of an SGML tag (e.g. 'S', 'W'); &lt;features&gt; is a list of feature-value descriptions of the tag's features; &lt;tag_content&gt; is the atomic content of the string within the tag (optional in the lift rule).</Paragraph>
      <Paragraph position="3"> This kind of mapping rule allows a flow of information between text structures and linguistic structures. So if the input already carries PoS information (as the result of a corpus tagging), the TH-LS component is the appropriate place to ensure the flow of that information. This allows a considerable improvement in parse time, since some information is already instantiated before the parse starts.</Paragraph>
      <Paragraph position="4"> The TH component of the ALEP platform also foresees the integration of user-defined tags. The tag &lt;USR&gt; is used if the text is tagged by a user-defined tagger, as is done when processing messy details.</Paragraph>
      <Paragraph position="5"> When tagit matches a pattern against the input, the matched string is replaced with an appropriate USR tag. Thus &amp;quot;Dreiundvierzig Milliarden Dollar&amp;quot; is matched by the pattern measure (see above), and is replaced by the SGML markup &lt;USR TYPE=&amp;quot;MEASURE&amp;quot; VAL=&amp;quot;Dreiundvierzig_Milliarden_Dollar&amp;quot;&gt;Dreiundvierzig_Milliarden_Dollar&lt;/USR&gt;. Note that the matched sequence is copied into the attribute VAL and that in the data content spaces are replaced by underscores. For some pattern types, a generalized representation of the matched sequence is computed and stored in an attribute CONV. For instance, when the pattern for dates matches the input &amp;quot;March 15, 1995&amp;quot;, CONV is assigned a standardized version, i.e. CONV=&amp;quot;95/03/15&amp;quot;.</Paragraph>
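The USR replacement step for dates can be sketched as follows. The date pattern and month table are assumptions for illustration (tagit's actual patterns are not shown in this section); the VAL and CONV conventions follow the description above: spaces become underscores in VAL, and CONV holds the normalized yy/mm/dd form.

```python
import re

MONTHS = {"January": 1, "February": 2, "March": 3, "April": 4, "May": 5,
          "June": 6, "July": 7, "August": 8, "September": 9,
          "October": 10, "November": 11, "December": 12}
DATE = re.compile(r"(%s) (\d{1,2}), (\d{4})" % "|".join(MONTHS))

def usr_tag(match):
    """Replace a matched date with a USR tag carrying VAL and CONV."""
    month, day, year = match.group(1), int(match.group(2)), match.group(3)
    val = match.group(0).replace(" ", "_")          # spaces → underscores
    conv = f"{year[2:]}/{MONTHS[month]:02d}/{day:02d}"  # normalized form
    return f'<USR TYPE="DATE" VAL="{val}" CONV="{conv}">{val}</USR>'

print(DATE.sub(usr_tag, "March 15, 1995"))
# → <USR TYPE="DATE" VAL="March_15,_1995" CONV="95/03/15">March_15,_1995</USR>
```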
      <Paragraph position="6"> This version with USR tags inserted is then processed by the set of lift rules. The following general lift rule does the conversion for all USR tags :
ts_ls_rule(
  id:{ sign =&gt; sign:{
         string =&gt; STRING,
         synsem =&gt; synsem:{
           syn =&gt; syn:{
             constype =&gt; morphol:{
               lemma =&gt; VALvalue,
               lu =&gt; TYPEvalue } } } } },
  'USR',
  \['TYPE'=&gt;TYPEvalue, 'VAL'=&gt;VALvalue\],
  STRING ).</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="1030" end_page="1030" type="metho">
    <SectionTitle>
</SectionTitle>
    <Paragraph position="0"> Here we can see the mapping of information between the user-defined USR tag (the attributes of which are listed in the last line of this rule) and the linguistic description ('ld', a structured type within the Typed Feature System), using the rule-internal variable TYPEvalue: the value of the attribute TYPE is assigned to the lexical unit ('lu') value of the linguistic description. After applying this rule to the result of matching &amp;quot;Dreiundvierzig Dollar&amp;quot;, the ld is the following :
ld:{ sign =&gt; sign:{
       string =&gt; 'Dreiundvierzig_Dollar',
       synsem =&gt; synsem:{
         syn =&gt; syn:{
           constype =&gt; morphol:{
             lemma =&gt; 'Dreiundvierzig_Dollar',
             lu =&gt; 'MEASURE' } } } } }
Although the original input sequence is available as the value of the feature lemma, further processing is based solely on the lu value 'MEASURE', thus making it possible to have a single lexical entry for handling all sequences matched by the pattern measure shown above. The definition of such generic entries in the lexicon keeps the lexicon smaller by dealing with what otherwise could only be coded with an infinite number of entries. In addition, treating such word constructs as a single unit gives a significant improvement in parsing runtime, since only the string 'MEASURE' is used as a basis for further processing, instead of the original sequence of three words. Finally, runtime is also improved and development eased by the fact that no grammar rules need be defined for parsing such sequences.</Paragraph>
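The effect of the general lift rule can be sketched with plain Python dictionaries standing in for ALEP typed feature structures. The nesting follows the ld shown above; the function name and dictionary encoding are assumptions for illustration, not ALEP's actual representation.

```python
def lift_usr(attrs, string):
    """Map a USR tag's TYPE/VAL attributes into a linguistic description."""
    return {"sign": {
        "string": string,
        "synsem": {"syn": {"constype": {
            "lemma": attrs["VAL"],   # original sequence kept as lemma
            "lu": attrs["TYPE"],     # generic lexical unit, e.g. 'MEASURE'
        }}}}}

ld = lift_usr({"TYPE": "MEASURE", "VAL": "Dreiundvierzig_Dollar"},
              "Dreiundvierzig_Dollar")
print(ld["sign"]["synsem"]["syn"]["constype"]["lu"])  # → MEASURE
```

Because further processing consults only the lu value, every sequence matched by the measure pattern is lifted to the same generic lexical entry.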
  </Section>
</Paper>