<?xml version="1.0" standalone="yes"?> <Paper uid="P03-2019"> <Title>Integrating Information Extraction and Automatic Hyperlinking</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Integrating Typed Feature Structures and Finite State Machines </SectionTitle> <Paragraph position="0"> The main motivation for developing SProUT comes from the need for a system that (i) allows a flexible integration of different processing modules and (ii) finds a good trade-off between processing efficiency and linguistic expressiveness. On the one hand, very efficient finite state devices have been successfully applied to real-world applications. On the other hand, unification-based grammars (UBGs) are designed to capture fine-grained syntactic and semantic constraints, resulting in better descriptions of natural language phenomena. In contrast to finite state devices, unification-based grammars are also assumed to be more transparent and more easily modifiable.</Paragraph> <Paragraph position="1"> SProUT's mission is to take the best from these two worlds, providing a finite state machine that operates on typed feature structures (TFSs). That is, transduction rules in SProUT do not operate on simple atomic symbols but on TFSs: the left-hand side of a rule is a regular expression over TFSs, representing the recognition pattern, and the right-hand side is a sequence of TFSs, specifying the output structure. Consequently, equality of atomic symbols is replaced by unifiability of TFSs, and the output is constructed using TFS unification with respect to a type hierarchy. Such rules not only recognize and classify patterns, but also extract fragments embedded in the patterns and fill output templates with them.</Paragraph> <Paragraph position="2"> Standard finite state techniques such as minimization and determinization are no longer applicable here, since the edges of our automata are annotated with TFSs instead of atomic symbols.</Paragraph> <Paragraph position="3"> However, not every outgoing edge in such an automaton needs to be inspected: TFS annotations can be arranged under subsumption, so the failure of a general edge automatically entails the failure of several more specialized edges without the unifiability test being applied. Such information can in fact be precompiled. This and other optimization techniques are described in (Krieger and Piskorski, 2003).</Paragraph> <Paragraph position="4"> Compared to symbol-based finite state approaches, our method leads to smaller grammars and automata, which usually better approximate a given language.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 XTDL - The Formalism in SProUT </SectionTitle> <Paragraph position="0"> XTDL combines two well-known frameworks, viz., typed feature structures and regular expressions. XTDL is defined on top of TDL, a definition language for TFSs (Krieger and Schäfer, 1994) that is used as a descriptive device in several grammar systems (LKB, PAGE, PET).</Paragraph> <Paragraph position="1"> Apart from its integration into the rule definitions, we also employ TDL in SProUT to establish a type hierarchy of linguistic entities. In the example definition below, the morph type inherits from sign and introduces three further morphologically motivated attributes with the corresponding typed values: morph := sign & [ POS atom, STEM atom, INFL infl ].
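The value types referenced in such a definition are themselves declared in the hierarchy. As a purely illustrative sketch (the attribute names CASE, NUMBER, and GENDER, the corresponding value types, and the most general type *top* are assumptions made for this example rather than definitions taken from the actual SProUT resources), the inflection type could be declared as infl := *top* & [ CASE case, NUMBER number, GENDER gender ]. Under such a declaration, every morph object carries an agreement-bearing INFL value that grammar rules can constrain and corefer to.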
A rule in XTDL is straightforwardly defined as a recognition pattern on the left-hand side, written as a regular expression, and an output description on the right-hand side. A named label serves as a handle to the rule. Regular expressions over TFSs describe sequential successions of linguistic signs. We provide the standard operators. Concatenation is expressed by consecutive items. Disjunction, Kleene star, Kleene plus, and optionality are represented by the operators |, *, +, and ?, respectively. {n} after an expression denotes an n-fold repetition. {m,n} repeats at least m and at most n times.</Paragraph> <Paragraph position="2"> The XTDL grammar rule sketched at the end of this section illustrates the syntax. It describes a sequence of morphologically analyzed tokens (of type morph). The first TFS matches zero or one item (?) with part-of-speech Determiner. Then, zero or more Adjective items are matched (*). Finally, one or two Noun items ({1,2}) are consumed. The use of a variable (e.g., #1) in different places establishes a coreference between features; this example enforces agreement in case, number, and gender for the matched items. The description on the RHS then creates a feature structure of type phrase whose category is coreferent with the category of the right-most (Noun) token(s) and whose agreement features corefer to features of the morph tokens.</Paragraph> <Paragraph position="3"> The choice of TDL has several advantages.</Paragraph> <Paragraph position="4"> TFSs as such provide a rich descriptive language over linguistic structures and allow for a fine-grained inspection of input items. They represent a generalization over pure atomic symbols. Unifiability as a test criterion on a transition is a generalization over symbol equality. Coreferences in feature structures express structural identity. Their properties are exploited in two ways. They provide greater expressiveness, since they create dynamic value assignments on the automaton transitions and thus exceed the strict locality of constraints in an atomic symbol approach. Furthermore, coreferences serve as a means of transporting information into the output description on the RHS of the rule. Finally, the choice of feature structures as primary citizens of the information domain makes composition of modules very simple, since input and output are of the same abstract data type.</Paragraph> <Paragraph position="5"> Functional (in contrast to regular) operators are a door to the outside world of SProUT. They either serve as predicates, performing complex tests that may cancel a rule application, or they construct new material from pieces of information on the LHS of a rule. The sketch of a rule below translates numerals into their corresponding digits using the functional operator normalize(), which is defined externally. For instance, &quot;one&quot; is mapped onto &quot;1&quot;, &quot;two&quot; onto &quot;2&quot;, etc.: ... numeral & [ SURFACE #surf, ... ] ... -> digit & [ ID #id, ... ], where #id = normalize(#surf).</Paragraph>
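<Paragraph position="6"> Putting these ingredients together, the noun phrase rule described earlier in this section might be written roughly as follows. This is an illustrative sketch only: the rule name np, the rule-definition operator :>, the attribute names CASE, NUMBER, and GENDER, and the part-of-speech values are assumptions made for the example and are not taken from the original grammar sources. np :> morph & [ POS Determiner, INFL [ CASE #c, NUMBER #n, GENDER #g ] ] ? (morph & [ POS Adjective, INFL [ CASE #c, NUMBER #n, GENDER #g ] ]) * (morph & [ POS Noun & #pos, INFL [ CASE #c, NUMBER #n, GENDER #g ] ]) {1,2} -> phrase & [ CAT #pos, AGR [ CASE #c, NUMBER #n, GENDER #g ] ]. The shared variables #c, #n, and #g enforce agreement in case, number, and gender across the matched tokens, while #pos transports the nominal category into the output structure.</Paragraph>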
</Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 The SProUT System </SectionTitle> <Paragraph position="0"> The core of SProUT comprises the following components: (i) a finite-state machine toolkit for building, combining, and optimizing finite-state devices; (ii) a flexible XML-based regular compiler for converting regular patterns into their corresponding compressed finite-state representation (Piskorski et al., 2002); (iii) a JTFS package, which provides standard operations for constructing and manipulating TFSs; and (iv) an XTDL grammar interpreter.</Paragraph> <Paragraph position="1"> Currently, SProUT offers three online components: a tokenizer, a gazetteer, and a morphological analyzer. The tokenizer maps character sequences to tokens and performs fine-grained token classification. The gazetteer recognizes named entities based on static named entity lexica.</Paragraph> <Paragraph position="2"> The morphology unit provides lexical resources for English, German (equipped with online shallow compound recognition), French, Italian, and Spanish, which were compiled from the full form lexica of MMorph (Petitpierre and Russell, 1995).</Paragraph> <Paragraph position="3"> For Slavic languages, we integrated a component for Czech presented in (Hajič, 2001) and Morfeusz (Przepiórkowski and Woliński, 2003) for Polish.</Paragraph> <Paragraph position="4"> For Asian languages, we integrated Chasen (Asahara and Matsumoto, 2000) for Japanese and Shanxi (Liu, 2000) for Chinese.</Paragraph> <Paragraph position="5"> The XTDL-based grammar engineering platform has been used to define grammars for English, German, French, Spanish, Chinese, and Japanese, allowing for named entity recognition and extraction. To guarantee comparable coverage and to ease evaluation, an extension of the MUC-7 standard for entities has been adopted.</Paragraph> <Paragraph position="7"> ... DESCRIPTOR string ].</Paragraph> <Paragraph position="8"> Given the expressiveness of XTDL expressions, MUC-7/MET-2 named entity types can be enhanced with more complex internal structures. For instance, a person name ne-person is defined as a subtype of enamex with the above structure (an illustrative sketch is given at the end of this section). The named entity grammars can handle types such as person, location, organization, time point, time span (instead of date and time as defined by MUC), percentage, and currency.</Paragraph> <Paragraph position="9"> The core system together with the grammars forms a basis for developing applications. SProUT is being used by several sites in both research and industrial contexts.</Paragraph> <Paragraph position="10"> A component for resolving coreferent named entities disambiguates and classifies incomplete named entities via dynamic lexicon search; e.g., Microsoft is coreferent with Microsoft Corporation and is thus correctly classified as an organization.</Paragraph>
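<Paragraph position="11"> Returning to the extended named-entity types, the ne-person structure referred to above might be declared along the following lines. This is an illustrative sketch: apart from DESCRIPTOR, which is taken from the fragment above, the attribute names are assumptions made for the example rather than the original type definition. ne-person := enamex & [ TITLE string, GIVEN_NAME string, SURNAME string, DESCRIPTOR string ]. A grammar rule recognizing a person name would then fill these attributes in its output description, in the same way as the phrase structure is built in the noun phrase sketch of Section 3.</Paragraph>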
</Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 ExtraLink: Integrating Information Extraction and Automatic Hyperlinking </SectionTitle> <Paragraph position="0"> A methodology for automatically enriching web documents with typed hyperlinks has been developed and applied to several domains, among them tourism information. A core component is a domain ontology describing tourist sites in terms of sights, accommodations, restaurants, cultural events, etc. The ontology was specialized for major European tourism sites and regions (see Figure 1): the concept a web document is associated with (here, the Isle of Capri) is shown on the left, together with neighboring concepts in the ontology, which the user can navigate through.</Paragraph> <Paragraph position="1"> The ontology concepts are associated with link targets that have been gathered, intellectually selected, and continuously verified. Although language technology could also be employed to prime target selection, for most applications quality requirements demand the expertise of a domain specialist. In the case of the tourism domain, the selection was performed by a travel business professional.</Paragraph> <Paragraph position="2"> The system is equipped with an XML interface and is accessible as a server.</Paragraph> <Paragraph position="3"> The ExtraLink GUI marks the relevant entities (usually locations) identified by SProUT (see the second window on the left in Figure 2). Clicking on a marked expression causes a query related to the entity to be shipped to the server. Coreferent concepts are handled as expanded queries. The server returns a set of links structured according to the ontology, which is presented in the ExtraLink GUI (Figure 2). The user can choose to visualize any link target in a new browser window that also shows the respective subsection of the ontology in an indented tree notation (see Figure 1).</Paragraph> <Paragraph position="4"> The link targets in the right window of Figure 2 are generated after clicking on the marked named entity for Lisbon (marked in dark); the bottom left window shows the SProUT result for &quot;Lissabon&quot;. The ExtraLink demonstrator has been implemented in Java and C++ and runs under both MS Windows and Linux. It is operational for German, but it can easily be extended to other languages covered by SProUT. This involves adapting the mapping into the ontology and providing a multilingual presentation of the ontology in the link target page.</Paragraph> </Section> </Paper>