<?xml version="1.0" standalone="yes"?> <Paper uid="H01-1036"> <Title>Information Extraction with Term Frequencies</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 3. TERM FREQUENCY ALGORITHM </SectionTitle> <Paragraph position="0"> The algorithm requires a set of passages that are likely to contain an answer, and a category for each question. This algorithm is similar to the information extraction technique used in the GuruQA system [8]. The key to the algorithm is using term frequencies to give individual terms a score. Important information is uncovered by looking at repeated terms in a set of passages. In addition, terms are scored based on their recurrence in the corpus. The system applies very simple patterns to discover individual words or numbers, allowing each term's frequency to be evaluated. This method proceeds in the following sequence: 1. Simplify the question category from the parser output.</Paragraph> <Paragraph position="1"> 2. Scan the passages for patterns matching the question category. 3. Assign each possible answer term an initial weight based on its rareness.</Paragraph> <Paragraph position="2"> 4. Modify each term weight depending on its distance from the centre and the rank of the passage.</Paragraph> <Paragraph position="3"> 5. Select the answer (in the 50-byte or 250-byte TREC-9 format) that maximizes the sum of the term weights found within it.</Paragraph> <Paragraph position="4"> 6. Set the weight of all terms in the selected answer to zero. 7. Repeat steps 5 and 6 until five answers are selected. The initial procedure simplifies the answer categories. The algorithm utilizes the question classification given by the parser in the following categories: Proper (person, name, company), Place (city, country, state), Time (date, time of day, weekday, month, duration, age), How (much, many, far, tall, etc.). The latter category is divided into sub-categories for monetary values, numbers, distances and other methods of measurement.</Paragraph> <Paragraph position="5"> Next, the passages are scanned using the patterns for the given question classification. The purpose of the patterns is to narrow the number of possible answers, which increases performance. It is important to note that the patterns do not contribute to the terms' weights. These simple patterns are hand-coded regular expressions. For example, the pattern for Proper is [^A-Za-z][A-Z][A-Za-z]+[^A-Za-z0-9], which matches a capital letter followed by one or more letters, surrounded by white space or punctuation. Each word in the passage either matches a pattern or does not. Patterns do not stretch over more than one word. In the passage &quot;Bank of America&quot; only &quot;Bank&quot; and &quot;America&quot; would be considered possible answers. The algorithm can still find the correct answer &quot;Bank of America&quot; by determining that &quot;Bank&quot; and &quot;America&quot; should be in the answer. When the question classification is unknown, the term frequency for all words in the passages is computed. The system was evaluated with no question classification and still achieved an MRR of 0.338. With no classification, only the term frequency equation is used to evaluate answers. This confirms the power of the term frequency equation (1). The patterns for each question classification are very naive, so in theory, improving the patterns would also improve the entire system.</Paragraph>
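To make the pattern-scanning step concrete, here is a minimal sketch of how such hand-coded, single-word patterns could be applied to collect candidate answer terms. The Proper pattern follows the regular expression quoted above (with a lookahead so consecutive capitalized words both match); the How pattern, the category table, and the function name are illustrative assumptions, not the system's actual code.

```python
import re

# Hypothetical table of hand-coded, single-word patterns per category.
# "Proper" mirrors the regular expression given in the text; "How" is
# an assumed stand-in for the numeric sub-categories.
PATTERNS = {
    "Proper": re.compile(r"[^A-Za-z]([A-Z][A-Za-z]+)(?=[^A-Za-z0-9])"),
    "How": re.compile(r"[^0-9]([0-9][0-9,.]*)(?=[^0-9])"),
}

def candidate_terms(passage, category):
    """Return the single-word candidate answers found in a passage.

    Patterns never span more than one word, so in "Bank of America"
    only "Bank" and "America" are candidates. With an unknown question
    classification, every word is a candidate and the term frequency
    equation alone decides among them.
    """
    pattern = PATTERNS.get(category)
    if pattern is None:               # unknown classification
        return passage.split()
    padded = " " + passage + " "      # supply the boundary characters
    return [m.group(1) for m in pattern.finditer(padded)]

print(candidate_terms("Bank of America announced a merger.", "Proper"))
# -> ['Bank', 'America']
```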
<Paragraph position="6"> Thirdly, the terms are differentiated by assigning each term a weight. The term weight is related to the term's rareness: the rarer the term, the higher its value. The power of the information extraction component is almost entirely derived from this step. Each term's weight is calculated by the following formula:</Paragraph> <Paragraph position="7"> w_t = p_t · log(N / c_t) (1) </Paragraph> <Paragraph position="8"> where c_t is the number of times the term occurs in the corpus, p_t is the number of times the term occurs in the set of passages, and N is the total number of terms in the corpus. Knowing the term's corpus frequency is important; however, the strength of the formula is drawn from the multiple occurrences of terms in the retrieved passages. An answer extract containing &quot;Bank of America&quot; will most likely be selected if &quot;Bank&quot; and &quot;America&quot; have high term frequency values. Essentially, this calculation combines the corpus term frequency with a voting scheme: the equation rewards the rarest term in the corpus that occurs most often in the passages retrieved.</Paragraph> <Paragraph position="9"> The fourth step modifies the term weight depending on its location. The centre of the passages is the centre of the query terms' locations. As a possible answer's distance from the centre increases, its relation to the query terms decreases. To utilize this information, the term weight is modified according to its distance from the centre of the passage: the farther from the centre, the more the term weight is decreased. The weight is then further modified according to the ranking of the passage in which the term was found; the lower the ranking, the more the weight is decreased. Step four is important because it distinguishes duplicate terms depending on each term's position. This means that if a possible answer occurs many times, each occurrence will have a different term weight. For example, the term &quot;Bank&quot; found in the best passage would have a higher term weight than a &quot;Bank&quot; found in a lower-ranking passage.</Paragraph> <Paragraph position="10"> For TREC-9, the system was required to produce 50- and 250-byte substrings. Each substring is assigned a score equal to the sum of the term weights within it. The best answer is the substring of the required length with the highest score. The weight of all terms appearing in the answer substring is then reduced to zero (step six). The final step is the selection of the next best substring; this process repeats until the desired number of substrings has been selected.</Paragraph> <Paragraph position="11"> Reducing the term weights to zero distinguishes the answers from one another, eliminating answers that are almost the same.</Paragraph> <Paragraph position="12"> When a term is part of a phrase like &quot;knowing is half the battle&quot;, the terms in the phrase will usually appear together in the retrieved passages. This means the phrase would be selected if &quot;knowing&quot;, &quot;half&quot;, and &quot;battle&quot; all scored highly.</Paragraph> <Paragraph position="13"> The idea behind the algorithm is to evaluate potential answers in the retrieved passages using the term frequency equation. The question classification patterns are used to limit the number of possible answers evaluated, which improves accuracy. The algorithm will select phrases even if not all of their words are possible answers.</Paragraph>
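Pulling steps 3 through 7 together, the sketch below assumes equation (1) has the reconstructed form w_t = p_t · log(N / c_t); the decay factors in adjust_weight and the use of word windows in place of 50- or 250-byte substrings are illustrative assumptions, not the paper's exact choices.

```python
import math

def term_weight(corpus_freq, passage_freq, corpus_size):
    """Step 3, equation (1): weight is high for a term that is rare in
    the corpus but occurs often in the retrieved passages."""
    return passage_freq * math.log(corpus_size / corpus_freq)

def adjust_weight(weight, dist_from_centre, passage_rank):
    """Step 4: decay an occurrence's weight with its distance from the
    centre of the query terms and with the rank of its passage
    (rank 1 = best). The decay form here is an assumed example."""
    return weight / ((1.0 + dist_from_centre) * passage_rank)

def select_answers(words, weights, window, n_answers=5):
    """Steps 5-7: repeatedly pick the word window with the highest
    summed term weight, then zero the chosen terms so near-duplicate
    answers are not selected twice."""
    answers = []
    for _ in range(n_answers):
        scores = [
            sum(weights.get(w, 0.0) for w in words[i:i + window])
            for i in range(len(words) - window + 1)
        ]
        if not scores:
            break
        start = max(range(len(scores)), key=scores.__getitem__)
        answer = words[start:start + window]
        answers.append(" ".join(answer))
        for w in answer:          # step 6: zero out the selected terms
            weights[w] = 0.0
    return answers
```

Because "Bank" and "America" would both carry high weights under equation (1), a window covering "Bank of America" outscores its neighbours even though "of" matched no pattern, which is how the algorithm recovers multi-word answers from single-word candidates.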
<Paragraph position="14"> The term frequency algorithm does not need to know the answer classification to perform proficiently. This is a very robust method for extracting answers, though knowing the question classification does improve the system's mean reciprocal rank considerably.</Paragraph> <Paragraph position="15"> In the future, term frequencies may be used in combination with Natural Language Processing (NLP) techniques such as named entity tagging to further enhance the system's results.</Paragraph> </Section> </Paper>