<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1204">
  <Title>A Lexically-Intensive Algorithm for Domain-Specific Knowledge Acquisition René Schneider * Text Understanding Systems</Title>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
INPUT OUTPUT
</SectionTitle>
    <Paragraph position="0"/>
    <Paragraph position="2"> Therefore, the development of OCR systems and the improvement of their efficiency is still a major task in the area of document processing. But even with high quality scanners, the promised 99.9% recognition rate is difficult to achieve (Taghva et al., 1994) and remains the ideal case due to e.g. the use of different fonts, low quality print or paper, a low resolution etc.</Paragraph>
    <Paragraph position="3"> Besides the mistakes caused by OCR, a considerable number of documents include typographical or grammatical mistakes (misspellings, wrong inflection or word order), unusal expressions etc., which shows that natural language processing (NLP) needs more than just a grammar for grammatical expressions but indeed has to be fault-tolerant to process ~real-world&amp;quot; utterances. Though natural language itsdf has a lot to do with exceptions and irregularities, all these nuisances amplify the problems NLP is occupied with, but -- as a glance at text samples shows -- IE,-systems are faced with a considerable number of these additional irregularities that occur * as a result of low grammatical competence, e.g.</Paragraph>
    <Paragraph position="4"> whenever a non-native speaker is obliged to write a document or message in a second language; null * as careless slips, e.g. misspellings, missing punctuations etc.</Paragraph>
    <Paragraph position="5"> However, the most of all occurring errors are produced by OCR 1 and can be classified as follows: ZA brief example of an OCR-text: I/e would be vBry pleasd ifyou could send two 1992 annuai reports and a product brochure to: -..</Paragraph>
    <Paragraph position="6">  * Incorrect character recognition: - Merging or Splitting: Two or more characters are represented as one and vice versa.</Paragraph>
    <Paragraph position="7"> - Replacement: Characters are confused, e.g.</Paragraph>
    <Paragraph position="8"> 1 and 1.</Paragraph>
    <Paragraph position="9"> - Deletion: Characters are dropped (e.g. due to low print quality).</Paragraph>
    <Paragraph position="10"> -Insertion: Non-existing characters are added.</Paragraph>
    <Paragraph position="11"> * Incorrect word boundary recognition: - Agglutination: Two or more word bound null aries are not recognized, and distinct words are linked with each other.</Paragraph>
    <Paragraph position="12">  - Separation: A single word is split into two or more fragments.</Paragraph>
    <Paragraph position="13">  Therefore, in this study one of the principle goals was to find a new methodology that enables the computer to learn automatically from a very small data set with examples of both grammatically incorrect and orthographically ill-formed text.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Machine Learning in Information
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Extraction
3.1 Statistical Language Learning
</SectionTitle>
      <Paragraph position="0"> Machine learning techniques have been developed to acquire factual and conceptual knowledge automatically and all of them have been applied to natural language processing. The different techniques were derived from the fields of symbolic, connectionist, statistical and evolutionary computing and their application depends on the specific problem. Recent developments show that the consecutive or simultaneous combination of different learning approaches, i.e. hybrid strategies, often leads to better results than the single use of one. The methodology most frequently used to support other learning strategies are statistics, but in several occasion they are also used exclusively, esp. when the a priori knowledge about the content and the structure of the data is very low (Vapnik, 1995).</Paragraph>
      <Paragraph position="1"> In such cases, all that is needed to start with, is the knowledge about some functional properties of the data to deduce their dependencies. Simply speaking, an unordered or hidden structure is transformed into a systematic structure revealing the properties, relations and processes of the data. In the ideal case the discovery of these dependencies leads to the formulation of general principles or laws.</Paragraph>
      <Paragraph position="2"> In NLP, statistics are used to describe the processe of language acquisition, language change MIROMIR, an independant financial and economic research society, is making a study about Leasing in Europe. In order to make a prvsentation of your company, we would like to recieve your commorcial documents and your last snnual roports (from 1988 to 1991) in er~9~PSsh. If you have a mailing ~st would you kindly include our name for future issues of annual repords and information on your company. With our grateful thanks,  formation (OCR-text) or variation (Abney, 1996) using the methods of information- and probability theory (Charniak, 1993). Thus, the starting point of every investigation discovering these processes in order to &amp;quot;learn&amp;quot; a language or acquire knowledge about some language with statistical techniques is the hkelihood of words and their derivable distributions and functions.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Domain-Specific and Text-Relevant
Knowledge
</SectionTitle>
      <Paragraph position="0"> Besides that, the formulation of what has to be learned needs to be formulated and described precisely, esp. in IE where the different elements of the whole data set are not regarded with the same degree-of-interest and only a very small part of the whole information is extracted. Hence, the system has to learn to divide between the important or interesting and the unimportant or less interesting information. In case of OCR-errors, it has to be able to clean the text from noisy parts and restore those parts appropriately.</Paragraph>
      <Paragraph position="1"> The interesting parts of a text or a message, which have a high significance for IF_~-systems, can be divided into domain-specific and text-relevant data (or high level and low level patterns (Yangarber and Grishman, 1997)) as illustrated in Figure 3, where the domain-specific words are represented in bold and the corresponding text-relevant information in cursive letters. The domain-specific words can be seen as distinctive from all other words since they describe the domain and general purpose the text has been written for, whereas the text-relevant words stand in a close relation to the domain--specific data because they usually do not appear alone but determine exactly the meaning of the domain-specific words. In the case of our example in Figure 3 the domain-specific informa-Schneider 21 Lexically-lntensive Algorithm tion is represented by the words recieve, annual roports and include, mailing list. The text-relevant information MIROFIIR, we, from 1988 to 1991, english, our name specifies the numbers, years and language of the annual reports requested and of course the sender (which in the case of we and our name has to be unriddled by anapher resolution) that should be included into the mailing list.</Paragraph>
      <Paragraph position="2"> To illustrate the relationship between domain-specific and text-relevant information, their functions may be compared to those of constants and variables in a mathematical equation with the domain-specific words (representin.g the unvariable and basic components of the equation) and the text-relevant information representing the variables (as unstable and characteristic elements of the equation). Thinking in categories of natural language, the domain-specific information represents the pragmatic meaning and uses verbs and specific nouns to describe specific events while the text-relevant information is represented through names, numbers, dates etc. In any case it has to be considered that this distinction depends very much on the sharpness of the domain the IF_c-System is built for. Generally speaking, the more specific a domain is, the better does this distinction work and thus facilitates both the construction of the output structure (templates) and the extraction of the relevant text features.</Paragraph>
      <Paragraph position="3"> Unfortunately -- as will be seen in the next section -- text-relevant information is very difficult to learn automatically, particularly when the texts that are analyzed have been damaged by OCR: e.g.</Paragraph>
      <Paragraph position="4"> the differences of report and roport can easily be detected and resolved, whereas names of persons, streets etc. themselves have several spelling variants and the change of a single letter changes the whole meaning, as it happens for numbers, too.</Paragraph>
      <Paragraph position="5"> Therefore the main focus is on the the detection of domain-specific information with statistical methods that leads in a following step to the text-relevant information, i.e. the major task of the algorithm is to build automatically a knowledge base for the crucial words and the kernel phrases that represent the salient information of a given text.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="163" type="metho">
    <SectionTitle>
4 Lexically-Intensive Knowledge
Acquisition
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 An Outline of the Algorithm
</SectionTitle>
      <Paragraph position="0"> The algorithm as illustrated in Figure 4 is divided into the following major steps: First, a frequency list is computed from the training data, i.e. the raw text of a limited number of texts belonging to the same domain. Then, all word forms are compared with each other and the word forms with a low distance are grouped together. The results from these two procedures are combined and lead to the construction of a very compact core lexicon that consists of a limited number of entries with lexical prototypes and automatically assigned variants of the corpus' word forms. Afterwards the training data is trans-Figure 4: Building a Domain-Specific Lexicon formed so that it only consists of the automatically derived lexical prototypes. Then the most frequent syntagmatic patterns from a length of two to five lemmata are collected and weighted. In the last but one step similar patterns having at least one domain-specific lexeme in Common are collected to reveal the neighbourhoods of the most important words. The degree-of-interest of a word is computed from its frequency and the number of variants the word has.</Paragraph>
      <Paragraph position="1"> Finally the entries of the core lexicon are connected with one another and compressed into weighted regular expressions. The result is a domain-specific lexicon that is represented as a net of lexical entries covering the correct word and their variants and some of the possible incorrect variants and the syntactical relations that are commonly used in texts of a certain domain.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="163" type="sub_section">
      <SectionTitle>
4.2 Acquisition of Lexical Knowledge
</SectionTitle>
      <Paragraph position="0"> The construction of the core lexicon is based on the combination of a frequency list and a comparision of the distances of all word forms given in a corpus of  A computation of the relative size of unknown and already known word (see Figure 5) shows that after a very low number of texts generally 80 % of the information is confirmed, i.e. it appeared already in one of the former texts. These 80 % cover generally the functional words such as articles, conjunctions etc. and of course the domain-specific information.</Paragraph>
      <Paragraph position="1"> The residual 20 % consist of text-relevant information, unimportant and less interesting information, misspellings, and -- in OCR-texts -- noisy information. null  A closer look at the frequency lists strengthens this impression and allows the postulation of the following hypothesis: Hypothesis 1 The more frequent a word appears in a number of consecutively ordered texts or messages of a limited domain, the more probable will it represent the &amp;quot;lezical prototype&amp;quot; for a wordform and in OCR-texts the correct form of a prototype (or a lemma).</Paragraph>
      <Paragraph position="2"> To find out which possible variants exist for the whole number of word forms, the similarities (or distances) of the word forms are computed. An effective method for the measurement of word distances is the Levenshtein distance in combination with an  number of only 7,078 word forms distributed over 100 texts. Notice that the average size cannot be regarded as statistical significant due to the standard deviation of 42.32 and the text sizes ranging between 15 and 256 tokens.</Paragraph>
      <Paragraph position="3">  since the operations that are done to calculate this distance cover most of the phenomena (see 2.2) that occur through OCR. Any two words are compared with each other in a distance matrix, which measures the least effort of transforming one word into the other. Least effort means the lowest number of insertions, deletions, or replacements (as a combination of deletion and insertion). The effort is norrealized to the length of the longest word in order to obtain a ratio-scaled value. Table 1 gives the example of an unordered lexicon entry for the word form rbport with all similar words that were found in the corpus, having a Levenshtein distance lower than 0.9 3. As already postulated in Hypothesis 1, the number of correct and &amp;quot;deflected&amp;quot; forms is always higher than those of typical OCR-mistakes. In fact it must be asked, whether typical OCR mistakes exist at all due to the different types of reasons for these mistakes and the multitude of effects they may have.</Paragraph>
      <Paragraph position="4"> For every word with one or more similar words as determined by a threshold value of 0.9, a preliminary entry was created as illustrated in Table 1, covering the most important morphological deriva-STo facilitate and shorten the work of the algorithm, the alphabet was divided into interpretable signs (a-Z, 0-9, punctuation) and non-interpretable signs (like *, -, ^ etc.) which were converted into a middle point (,) . A word is considered to be everything between two empty spaces. A text or text body is everything that remained on the document after the elimination of head and foot structures (e.g. sender, address, signature, etc. ) Schneider 23 Lexically-Intensive Algorithm tions and graphemic alternations, whereas in none of the entries a distinction between lemma and variants is made so that the unordered lexicon bears a huge burden of redundant information. To diminish this redundancy, it is necessary * to drop those words having a high distance and showing no linguistic relation to the other words in the entries and * to make a clear distinction between a lemma and its variants.</Paragraph>
      <Paragraph position="5"> Therefore the multitude of preliminary lexical entries was reduced to a very compact core lexicon as exampled in Table 2 and described as follows.</Paragraph>
      <Paragraph position="6"> The algorithm processes successively through the frequency list, starting with the most frequent word and finishing with the last hapax legomenon. Each word that can be found in the frequency list is considered as the top of a new lexicon entry or lemma. Afterwards, the algorithm looks for the word forms in the preliminary lexicon, that are similar to this word (having a distance smaller than 0.4), assigns them as variants in the new entry and recursively looks for all variants of the previously assigned variants (having a distance smaller than 0.7). Each one of these variants can no longer be regarded as top of another entry and consequently is taken out of the frequency lists, that simultaneously shrinks more and more. The variants' frequency is added to that of the lemma.</Paragraph>
      <Paragraph position="7"> The results of the algorithm depend a lot on an a priori specified treshold value for the Levenshtein distances. In our tests, good results are achieved with a value of 0.4 for direct similarity and 0.7 for indirect similarity, meaning the newly computed distance of variants of a variant to a given lemma. The threshold value may depend on the language and the domains that are used. This aspect will be further investigated.</Paragraph>
      <Paragraph position="8"> The result of this process is a core lexicon that consists of * high frequent synsemantica or function words having no variants, * high frequent, domain specific autosemantica or content words and most of their occuring variants, null * middle and low frequency words and their variants, and * one single entry for all the remaining hapax legomena having no similarity to one of the preceding words lower than 0.4,  in order of their summarized frequencies. Hence, the number of entries in the core lexicon is at about one third of the total number of types 4. Table 2 shows the entry for report and the assigned variants.</Paragraph>
      <Paragraph position="9"> As follows, many of the wrongly analyzed combinations of e.g. your annual report that formerly lead to a rejection of the text, now can be transformed into their correct forms. This increases the number of documents that can be analyzed by the IE-system considerably. The wrong assignment of sports as a variant of report shows the domain dependency of the algorithm, but it has to be considered that the frequency of such wrong assignments generally is 1 and can be compensated by the extraction of syntactical patterns.</Paragraph>
    </Section>
    <Section position="3" start_page="163" end_page="163" type="sub_section">
      <SectionTitle>
4.3 Acquisition of Syntactical Knowledge
</SectionTitle>
      <Paragraph position="0"> The core lexicon bears the basic lexical knowledge that is needed for a morphological text analysis and furthermore can be used to &amp;quot;clean&amp;quot; documents from noisy sequences but it does not store any information about the syntagmatic relations or dependencies that exist in texts of a given domain. To reveal these dependencies, the original corpus was transformed into a lemmatized version, consisting only of the earlier derived prototypes and &amp;quot;Weighted Ranks&amp;quot; for words with the frequency 1 having no similarity to other words. Figure 6 shows the example text (see Figure 3) after the transformation into lemmata and &amp;quot;jokers&amp;quot;. As can be seen in Figure 6, the algorithm 4In the case of the english requests for annual reports the core lexicon comprised 537 entries with a total number of 1758 types and 7078 tokens in the training corpus.</Paragraph>
      <Paragraph position="2"> - , an independant financial and economic research - , is making a study about leasing in european . in order to make a - of your company , we would like to receive your commorcial documents and your latest ~ual report from - to 1991 in english . if you have a mailing list would you kind include our name for future issues of ~nnual report and information on your company . with our grateful thank , your- .</Paragraph>
      <Paragraph position="3">  * suppresses (in this format) the Hapax-Legomena like Miromir, society, prvsentation, 1988 and faithfully; * corrects the OCR-errors of the most important words, like roports --+ report, repods --+ report, information -+ information; * corrects misspelled words, like recieve -+ receive; * lemmatizes several less frequent words to their more frequent prototypes, like ropods and repods as plural forms of report, last -+ latest, kindly -+ kind, tb~ks -+ thank, yours --4 your 5.</Paragraph>
      <Paragraph position="4"> To enhance the importance of the lexical protoypes or lemmata, their frequencies were multiplicared with the number of their assigned variants as a result of the following hypothesis: Hypothesis 2 The more often a word appears in texts of a restricted category and the more morphological and graphemic variants it has, the more probable the word will represent some domain-specific information. null The multiplication of frequencies and the number of variants of a word (freqz varx) leads to a weighted frequency list (see Table 3) whose first ranks comprise the most relevant lemmata that are needed for the extraction of the salient syntactic patterns. Therefore, the texts are transformed parallely into a corpus of indices implying the ranks that are given to the lemmata after they have been weighted. Scommorcial is head of an entry including commercial as the single variant, both having the frequency 1. Thus, the distinction between stem and variant can not be done clearly by the algorithm (the same holds true for independant and independent). Nevertheless, the two forms and all newly occuring forms having a small distance value will be clustered together.</Paragraph>
      <Paragraph position="5">  The concluding analysis follows the Firthian notion of &amp;quot;knowing a word by the company it keeps&amp;quot; (Firth, 1957), a postulate which emphasizes the fact that certain words have a strong tendency to be used together. Thus, the algorithm retrieves all collocation patterns of different length (2 - 5) and matches them with one another. Repetitively the most frequent patterns are matched with the collocation patterns of a greater length (patterns of length 2 with patterns of length 3; patterns of length 3 with patterns of length 4 etc.) looking both left and right for high frequent lemmata in the neighbourhood of the already composed patterns. That means that the words from the top of the weighted frequency list are connected with the most common words that precede and succeed them. The result is a two--way finite-state automaton that may be analyzed using light parsing strategies (Grefenstette, 1996) with the salient words of the weighted frequency lists as starting points (see Figure 7).</Paragraph>
      <Paragraph position="6"> One attractive alternative to parse the text is a bottom-up island parser for the kernel phrases of a new text. Island Parsers are a useful tool especially in those cases where no sentence markers exist (as e.g. in speech recognition) or whenever they are not transmissed correctly or added (as in OCR-texts). Furthermore a full parse contradicts in a certain way Schneider 25 Lexically-lntensive Algorithm the real ambitions of IE-Systems (Grishman, 1996) and flat finite-state analyses are getting more and more popular and efficient (Bayer et al., 1997). Yet,  the statistical information that is represented in the ranks of the lexical stems should not be omitted, though they show evidence of the degree-of-interest that is needed for the parsing strategy. The neighbourhood of low indices, such as 3 +- 2 &lt;-- 1 (representing your annual report should be regarded as more representative for the corpus' syntax than e.g. 15 &lt;--- 3 -r6 representing of your company or even 95 e- 45 --+ 381 representing am currently doing. The weighted ranks represent the degree-of-interest that the words have for the IE-System. With the help of the weighted ranks, it is possible to compute a probabilistic value similar to transition likelihoods. Looking at a pattern or window of several words wi of a given pattern length n, we add up the ranks of the weighted frequency lists fw~ to ~, and compute the average rank. This value is divided by the overall frequency freq of the whole pattern (wl ..wn):</Paragraph>
      <Paragraph position="8"> The resulting value represents the weighted likelihood for the c(&gt;-occurrence C(wa..w.) of two (or more) words indicating how probable a word precedes or succeeds another word. To give an example the word pattern of two words like mailing list the equation is solved a follows:  or for longer patterns of a lower degree-of-interest, such as as any interim: ras -Jr rany -t- rinterim C ( as any interim) = 3 f r eq( as any interim) 53 + 87 + 28 - = 56.0 3-1 As already pointed out, the values for the co-occurrences of the different lemmata were only computed up to a length of 5 lemmata. Compared to other collocation measures, this value does not only take account of the words frequencies and the collocations frequencies (as e.g. Mutual Information (Church and P., 1990)) or their transition likelihood (as e.g. Markov chains (Thomason, 1986)) but combines these two properties with a third one: the word's different modalities as indicated by their number of variants, i.e. their weighted ranks. This last value weakens the influence of both less frequent and fimctional words and supports the degree-of-interest of domain-specific and correct words as determined in Hypothesis 1 and 2.</Paragraph>
      <Paragraph position="9"> The c(&gt;-occurrence values may be labeled to the arcs of the regular expressions that are generated during this acquisition process to make the parsing process more effective since a low transition value reflects a high significance or degree--of-interest in texts of a certain domain.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="163" end_page="163" type="metho">
    <SectionTitle>
5 Using the Domain-Specific Lexicon
</SectionTitle>
    <Paragraph position="0"> The connections that exist between the different lexical entries are also used to link the entries of the core lexicon, providing it with the syntactical information that is typical for a certain domain. The contents of the entries and their relations, i.e. the arcs connecting them, cover the essential statistical properties of lexemes and their syntactical relationship, enabling a robust lexical and syntactical analysis of new texts.</Paragraph>
    <Paragraph position="1"> First results show that word forms are deflected * -. in the past your companys report has been among those we collect . however , our records indicate we do not have a copy of your 1992 annual rbport .</Paragraph>
    <Paragraph position="2"> please help us complete our collection by sendhig a copy of your 1992 annual report to the followhig adess . .-.</Paragraph>
    <Paragraph position="3">  and corrected (as shown in Figure 6 and 9); kernel phrases are isolated by extracting the islands of the domain-specific words and their surroundings (as  * -. in the - your company ~epo~ has been among these we collection . however , our records indicating we do not have a copy of your 1992 ~nual ~epor~ please help us complete our collection by sending a copy of your 1992 annual ~eport to the following address .. null shown in Figure 7 and 10). Although some words are lemmatized in a quite strange way, as e.g. collect -+ collection or indicate -~ indicating due to their low frequence, the relevant patterns are converted to analyzable and weU formed strings. Given a new text with several occurences of a highly-ranked words, (see Figure 8), the text is lemmatized (see Figure 9 and browsed for the word with the highest degree--of-interest as indicated by the words' weighted ranks (in our example report).</Paragraph>
    <Paragraph position="4"> Afterwards the transition values for the three neighbourhoods of report are compared and ordered after the values of the weighted transition likelihood (see Figure 10 with immediate transition values i.e. a window length 2). In our case, the second phrase has the lowest transition values and would consequently extract and parse succesfully the most relevant phrase a copy of your 1992 annual report to the following address.</Paragraph>
    <Paragraph position="5">  The lexicon's dynamic structure enables the analysis of unknown texts and consequently updates the entries and the relations among them, i.e. whenever an unknown word or a new syntactical pattern appears, the Levenshtein-Distance to the already existing heads of the lexical entries is computed and the word either stored as a new variant or a new entry created. Similar to the lexical updating process the weights of the tokens that connect the lexical entries are either affirmed and strengthened with the repetition of every pattern that was already known to the system or the new pattern is added to the network.</Paragraph>
  </Section>
class="xml-element"></Paper>