<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1057">
  <Title>Learning to Recognize Tables in Free Text Hwee Tou Ng</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Tables are present in many real-world texts. Some information, such as statistical data, is best presented in tabular form. A check on the more than 100,000 Wall Street Journal (WSJ) documents collected in the ACL/DCI CD-ROM reveals that at least an estimated one in 30 documents contains tables.</Paragraph>
    <Paragraph position="1"> Tables present a unique challenge to information extraction systems. At the very least, the presence of tables must be detected so that they can be skipped over. Otherwise, processing the lines that constitute tables as if they were normal "sentences" is at best misleading and at worst may lead to erroneous analysis of the text.</Paragraph>
    <Paragraph position="2"> As tables contain important data and information, it is critical for an information extraction system to be able to extract the information embodied in tables. This can be accomplished only if the structure of a table, including its rows and columns, is identified. That table recognition is an important step in information extraction has been recognized in (Appelt and Israel, 1997). Recently, there has also been a greater realization within the computational linguistics community that the layout and types of information (such as tables) contained in a document are important considerations in text processing (see the call for participation (Power and Scott, 1999) for the 1999 AAAI Fall Symposium Series).</Paragraph>
    <Paragraph position="3"> However, despite the omnipresence of tables and their importance, there is surprisingly little work in computational linguistics on algorithms to recognize tables. The only research that we are aware of is the work of (Hurst and Douglas, 1997; Douglas and Hurst, 1996; Douglas et al., 1995).</Paragraph>
    <Paragraph position="4"> Their method is essentially a deterministic algorithm that relies on spaces and special punctuation symbols to identify the presence and structure of tables. However, tables are notoriously idiosyncratic. The main difficulty in table recognition is that there are so many different and varied ways in which tables can show up in real-world texts. Any deterministic algorithm based on a fixed set of conditions is bound to fail on tables with unforeseen layout and structure in some domains.</Paragraph>
    <Paragraph position="5"> In contrast, we present a new approach in this paper that learns to recognize tables in free text. As our approach is adaptive and trainable, it is more flexible and easily adapted to texts in different domains with different table characteristics.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="444" type="metho">
    <SectionTitle>
2 Task Definition
</SectionTitle>
    <Paragraph position="0"> The input to our table recognition program consists of plain texts in ASCII characters. Examples of input texts are shown in Figures 1 to 3. They are document fragments that contain tables. Figures 1 and 2 are taken from the Wall Street Journal documents in the ACL/DCI CD-ROM, whereas Figure 3 is taken from the patent documents in the TIPSTER IR Text Research Collection Volume 3 CD-ROM. In Figure 1, we added horizontal 2-digit line numbers "Line nn:" and vertical single-digit line numbers "n" for ease of reference to any line in this document. We will use this document to illustrate the details of our learning approach throughout this paper. We refer to a horizontal line as hline and a vertical line as vline in the rest of this paper.</Paragraph>
    <Paragraph position="1"> Each input text may contain zero, one, or more tables. A table consists of one or more hlines. For example, in Figure 1, hlines 13-18 constitute a table. Each table is subdivided into columns and rows.</Paragraph>
    <Paragraph position="2"> [Figure 1 fragment] Last week's output fell 9.5% from the 1,804,000 tons produced a year earlier. The industry used 75.8% of its capability last week, compared with 71.9% the previous week and 72.3% a year earlier.
11: The American Iron and Steel Institute reported:
12:
13:                                   Net tons    Capability
14:                                   produced    utilization
15:   Week to March 14 ..............  1,633,000     75.8%
16:   Week to March 7 ...............  1,570,000     71.9%
17:   Year to date .................. 15,029,000     66.9%
18:   Year earlier to date .......... 18,431,000     70.8%
19: The capability utilization rate is a calculation designed
20: to indicate at what percent of its production capability the
21: industry is operating in a given week.</Paragraph>
    <Paragraph position="5"> Each column of a table consists of one or more vlines. For example, there are three columns in the table in Figure 1: vlines 4-23, 36-45, and 48-58. Each row of a table consists of one or more hlines. For example, there are five rows in the table in Figure 1: hlines 13-14, 15, 16, 17, and 18.</Paragraph>
    <Paragraph position="6"> More specifically, the task of table recognition is to identify the boundaries, columns, and rows of tables within an input text. For example, given the input text in Figure 1, our table recognition program will identify one table with the following boundary, columns, and rows:
1. Boundary: hlines 13-18
2. Columns: vlines 4-23, 36-45, and 48-58
3. Rows: hlines 13-14, 15, 16, 17, and 18
Figures 1 to 3 illustrate some of the difficulties of table recognition. The table in Figure 1 uses a string of contiguous punctuation symbols "." instead of blank space characters in between two columns. In Figure 2, the rows of the table can contain caption or title information, like "How Some Highly Conditional 'Bids' Fared", or header information, like "Stock's Initial Reaction***" and "Outcome", or body content information, like "$52" and "+5 3/8 to 49 1/8".
[Figure 3 fragment] ... side walls of the tray to provide even greater protection from convective heat transfer. Preferred construction materials are shown in Table 1: TABLE 1</Paragraph>
    <Section position="1" start_page="444" end_page="444" type="sub_section">
      <SectionTitle>
[Figure 3 fragment: the patent document's Table 1]
</SectionTitle>
      <Paragraph position="0"> Component        Material
Stiffener        Paperboard having a thickness of about 6 and 30 mil (between about 6 and 30 point chip board).
Insulation       Mineral wool, having a density of between 2.5 and 6.0 pounds per cubic foot and a thickness of between 1/4 and 1 and 1/4 inch.
Plastic sheets   Polyethylene, having a thickness of between 1 and 4 mil; coated with a reflective finish on the exterior surfaces, such as aluminum having a thickness of between 90 and 110 Angstroms applied using a standard technique such as vacuum deposition.
The stiffener 96 makes a smaller contribution to the insulation properties of the blanket 92 than does the insulator 98. As stated above, the ...</Paragraph>
      <Paragraph position="3"> Each row containing body content information consists of several hlines -- information on "Outcome" spans several hlines. In Figure 3, strings of contiguous dashes "-" occur within the table. Furthermore, the two columns within the table appear right next to each other -- there are no blank vlines separating the two columns. Worse still, some words from the first column, like "Insulation" and "Plastic sheets", spill over to the second column. Notice that there may or may not be any blank lines or delimiters that immediately precede or follow a table within an input text.</Paragraph>
      <Paragraph position="4"> In this paper, we assume that our input texts are plain texts that do not contain any formatting codes, such as those found in an SGML or HTML document. There is a large number of documents that fall under the plain text category, and these are the kinds of texts that our approach to table recognition handles. The work of (Hurst and Douglas, 1997; Douglas and Hurst, 1996; Douglas et al., 1995) also deals with plain texts.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="444" end_page="446" type="metho">
    <SectionTitle>
3 Approach
</SectionTitle>
    <Paragraph position="0"> A table appearing in plain text is essentially a two-dimensional entity. Typically, the author of the text uses the &lt;newline&gt; character to separate adjacent hlines, and a row is formed from one or more of such hlines. Similarly, blank space characters or special punctuation characters are used to delimit the columns. However, the specifics of how exactly this is done can vary widely across texts, as exemplified by the tables in Figures 1 to 3.</Paragraph>
    <Paragraph position="1"> Instead of resorting to an ad-hoc method to recognize tables, we present a new approach in this paper that learns to recognize tables in plain text. Our learning method uses purely surface features like the proportion of the kinds of characters and their relative locations in a line and across lines to recognize tables. It is domain independent and does not rely on any domain-specific knowledge. We want to investigate how high an accuracy we can achieve based purely on such surface characteristics.</Paragraph>
    <Paragraph position="2"> The problem of table recognition is broken down into three subproblems: recognizing table boundary, column, and row, in that order. Our learning approach treats each subproblem as a separate classification problem and relies on sample training texts in which the table boundaries, columns, and rows have been correctly identified. We built a graphical user interface in which such markup by human annotators can be readily done. With our X-window-based GUI, a typical table can be annotated with its boundary, column, and row demarcation within a minute.</Paragraph>
    <Paragraph position="3"> From these sample annotated texts, training examples in the form of feature-value vectors with correctly assigned classes are generated. One set of training examples is generated for each subproblem of recognizing table boundary, column, and row. Machine learning algorithms are used to build classifiers from the training examples, one classifier per subproblem. After training is completed, the table recognition program uses the learned classifiers to recognize tables in new, previously unseen input texts.</Paragraph>
    <Paragraph position="4"> We now describe in detail the feature extraction process, the learning algorithms, and how tables in new texts are recognized. The following classes of characters are referred to throughout the rest of this section: * Space character: the character " " (i.e., the character obtained by typing the space bar on the keyboard).</Paragraph>
    <Paragraph position="5"> * Alphanumeric character: one of the following characters: "A" to "Z", "a" to "z", and "0" to "9".</Paragraph>
    <Paragraph position="6"> * Special character: any character that is not a space character and not an alphanumeric character. * Separator character: one of the following characters: ".", "*", and "-".</Paragraph>
    <Section position="1" start_page="445" end_page="446" type="sub_section">
      <SectionTitle>
3.1 Feature Extraction
3.1.1 Boundary
</SectionTitle>
      <Paragraph position="0"> Every hline in an input text generates one training example for the subproblem of table boundary recognition. Every hline H within (outside) a table generates a positive (negative) example. Each training example consists of a set of 27 feature values.</Paragraph>
      <Paragraph position="1"> The first nine feature values are derived from the immediately preceding hline H-1, the second nine from the current hline H0, and the last nine from the immediately following hline H1. For a given hline H, its nine features and their associated values are given in Table 1.
Table 1: Features of a given hline H.
F1 Whether H consists of only space characters (t or f).
F2 The number of leading space characters in H.
F3 The first non-space character in H. Possible values are one of the special characters ()[]{}&lt;&gt;+-*/=~!@#$%^&amp; or N (if the first non-space character is not one of these special characters).
F4 The last non-space character in H. Possible values are the same as F3.
F5 Whether H consists entirely of one special character only. Possible values are either one of the special characters listed in F3 (if H only consists of that special character) or N (if H does not consist of one special character only).
F6 The number of segments in H with two or more contiguous space characters.
F7 The number of segments in H with three or more contiguous space characters.
F8 The number of segments in H with two or more contiguous separator characters.
F9 The number of segments in H with three or more contiguous separator characters.</Paragraph>
      <Paragraph position="2"> To illustrate, the feature values of the training example generated by line 16 in Figure 1 are: f, 3, N, %, N, 4, 3, 1, 1, f, 3, N, %, N, 4, 3, 1, 1, f, 3, N, %, N, 3, 3, 1, 1. Line 16 itself generated the feature values f, 3, N, %, N, 4, 3, 1, 1. Since line 16 does not consist of only space characters, the value of F1 is f. There are three space characters before the word "Week" in line 16, so the value of F2 is 3. Since the first non-space character in line 16 is "W" and it is not one of the listed special characters, the value of F3 is N. The last non-space character in line 16 is "%", which becomes the value of F4. Since line 16 does not consist of only special characters, the value of F5 is N. There are four segments in line 16 such that each segment consists of two or more contiguous space characters: a segment of three contiguous space characters before the word "Week"; a segment of two contiguous space characters after the punctuation characters "..." and before the number "1,570,000"; a segment of three contiguous space characters between the two numbers "1,570,000" and "71.9%"; and the last segment of contiguous space characters trailing the number "71.9%". The values of the remaining features of line 16 are similarly determined. Finally, lines 15 and 17 generated the feature values f, 3, N, %, N, 4, 3, 1, 1 and f, 3, N, %, N, 3, 3, 1, 1, respectively. (For the purpose of generating the feature values for the first and last hline in a text, we assume that the text is padded with a line of blank space characters before the first line and after the last line.)</Paragraph>
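To make the feature definitions concrete, the following is a minimal sketch of the per-hline feature extraction. The paper gives no code, so the function names and the regular-expression formulation are our own, and the separator set assumes the ".", "*", "-" reconstruction in Section 3:

    import re

    SEPARATOR_RE = re.compile(r"[.*\-]+")   # separator characters: ".", "*", "-"

    def is_special(c):
        # Section 3's definition: not a space and not alphanumeric. Table 1
        # enumerates a specific list of special characters; we simplify here.
        return c != " " and not c.isalnum()

    def hline_features(line):
        """Nine surface features F1-F9 of one hline, following Table 1."""
        stripped = line.strip(" ")
        f1 = "t" if not stripped else "f"              # F1: line is all spaces?
        f2 = len(line) - len(line.lstrip(" "))         # F2: number of leading spaces
        first = stripped[0] if stripped else ""
        last = stripped[-1] if stripped else ""
        f3 = first if first and is_special(first) else "N"   # F3: first non-space char
        f4 = last if last and is_special(last) else "N"      # F4: last non-space char
        f5 = first if stripped and set(stripped) == {first} and is_special(first) else "N"
        f6 = len(re.findall(r" {2,}", line))           # F6: segments of 2+ spaces
        f7 = len(re.findall(r" {3,}", line))           # F7: segments of 3+ spaces
        runs = SEPARATOR_RE.findall(line)
        f8 = sum(len(r) >= 2 for r in runs)            # F8: segments of 2+ separators
        f9 = sum(len(r) >= 3 for r in runs)            # F9: segments of 3+ separators
        return [f1, f2, f3, f4, f5, f6, f7, f8, f9]

    # Line 16 of Figure 1 should yield ['f', 3, 'N', '%', 'N', 4, 3, 1, 1]:
    print(hline_features("   Week to March 7 ...............  1,570,000   71.9%   "))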
      <Paragraph position="4"> The features attempt to capture some recurring characteristics of lines that constitute tables. Lines with only space characters or special characters tend to delimit tables or are part of tables. Lines within a table tend to begin with some number of leading space characters. Since columns within a table are separated by contiguous space characters or special characters, we use segments of such contiguous characters as features indicative of the presence of tables.</Paragraph>
3.1.2 Column
      <Paragraph position="5"> Every vline within a table generates one training example for the subproblem of table column recognition. Each vline can belong to exactly one of five classes:
1. Outside any column
2. First line of a column
3. Within a column (but neither the first nor last line)
4. Last line of a column
5. First and last line of a column (i.e., the column consists of only one line)
Note that it is possible for one column to immediately follow another (as is the case in Figure 3). Thus a two-class representation is not adequate here, since there would be no way to distinguish two adjoining columns from one contiguous column using only two classes. The start and end of a column in a table is typically characterized by a transition from a vline of space (or special) characters to a vline with mixed alphanumeric and space characters. That is, the transition of character types across adjacent vlines gives an indication of the demarcation of table columns.</Paragraph>
      <Paragraph position="6"> Thus, we use character type transition as the features to identify table columns.</Paragraph>
      <Paragraph position="7"> Each training example consists of a set of six feature values. The first three feature values are derived from comparing the immediately preceding vline V-1 and the current vline V0, while the last three feature values are derived from comparing V0 with the immediately following vline V1. Let Vj and Vj+1 be any two adjacent vlines.</Paragraph>
      <Paragraph position="8"> Suppose Vj = c1,j ... ci,j ... cm,j and Vj+1 = c1,j+1 ... ci,j+1 ... cm,j+1, where m is the number of hlines that constitute the table.</Paragraph>
      <Paragraph position="9"> Then the three feature values that are derived from the two vlines Vj and Vj+1 are determined by counting the proportion of two horizontally adjacent characters ci,j and ci,j+1 (1 &lt;= i &lt;= m) that satisfy some condition on the type of the two characters. The precise conditions on the three features are given in Table 2.
Table 2: Conditions on the two horizontally adjacent characters ci,j and ci,j+1.
F1 ci,j is a space character and ci,j+1 is a space character; or ci,j is a special character and ci,j+1 is a special character.
F2 ci,j is an alphanumeric character or a special character, and ci,j+1 is a space character.
F3 ci,j is a space character, and ci,j+1 is an alphanumeric character or a special character.</Paragraph>
      <Paragraph position="10"> To illustrate, the feature values of vline 4 in Figure 1 are: 0.333, 0, 0.667, 0.333, 0, 0, and its class is 2 (first line of a column). In deriving the feature values, only hlines 13-18, the lines that constitute the table, are considered (i.e., m = 6). For the first three feature values, F1 = 2/6, since there are two space-character-to-space-character transitions from vline 3 to 4 (namely, on hlines 13 and 14); F2 = 0, since there is no alphanumeric character or special character in vline 3; F3 = 4/6, since there are four space-character-to-alphanumeric-character transitions from vline 3 to 4 (namely, on hlines 15-18). Similarly, the last three feature values are derived by examining the character transitions from vline 4 to 5. (For the purpose of generating the feature values for the first and last vline in a table, we assume that the table is padded with a vline of blank space characters before the first vline and after the last vline.)</Paragraph>
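A minimal sketch of these transition features follows; char_type and transition_features are our own naming, and the F3 condition is inferred from the worked example above rather than recovered verbatim from Table 2:

    def char_type(c):
        """Character classes used throughout Section 3."""
        if c == " ":
            return "space"
        if c.isalnum():
            return "alnum"
        return "special"

    def transition_features(vline_a, vline_b):
        """Three Table 2 features for two adjacent vlines, each given as the
        string of m characters that the vline takes within the table."""
        m = len(vline_a)
        f1 = f2 = f3 = 0
        for a, b in zip(vline_a, vline_b):
            ta, tb = char_type(a), char_type(b)
            if ta == tb and ta in ("space", "special"):
                f1 += 1      # F1: space-to-space or special-to-special
            elif ta in ("alnum", "special") and tb == "space":
                f2 += 1      # F2: alphanumeric/special to space
            elif ta == "space" and tb in ("alnum", "special"):
                f3 += 1      # F3 (condition inferred from the worked example)
        return [f1 / m, f2 / m, f3 / m]

    # The six feature values of a vline V0 are the concatenation of
    # transition_features(V_prev, V0) and transition_features(V0, V_next).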
3.1.3 Row
    <Paragraph position="12"> Every hline within a table generates one training example for the subproblem of table row recognition.</Paragraph>
    <Paragraph position="13"> Unlike table columns, every hline within a table belongs to some row in our formulation of the row recognition problem. As such, each hline belongs to exactly one of two classes:
1. First hline of a row
2. Subsequent hline of a row (not the first line)
The layout of a typical table is such that its rows tend to record repetitive or similar data or information. We use this clue in designing the features for table row recognition. Since the information within a row may span multiple hlines, as the "Outcome" information in Figure 2 illustrates, we use the first hline of a row as the basis for comparison across rows. If two hlines are similar, then they belong to two separate rows; otherwise, they belong to the same row. Similarity is measured by character type transitions, as in the case of table column recognition.</Paragraph>
    <Paragraph position="14"> More specifically, to generate a training example for a hline H, we compare H with H', where H' is the first hline of the immediately preceding row if H is the first hline of the current row, and H' is the first hline of the current row if H is not the first hline of the current row. Each training example consists of a set of four feature values F1, ..., F4. F1, F2, and F3 are determined by comparing H and H', while F4 is determined solely from H. Let H = ci,1 ... ci,j ... ci,n and H' = ci',1 ... ci',j ... ci',n, where n is the number of vlines of the table. The values of F1, ..., F3 are determined by counting the proportion of the pairs of characters ci,j and ci',j (1 &lt;= j &lt;= n) that satisfy some condition on the type of the two characters, as listed in Table 3. Let ci,k be the first non-space character in H. Then the value of F4 is k/n.</Paragraph>
      <Paragraph position="18"> To illustrate, the feature values of hline 16 in Figure 1 are: 0.236, 0.018, 0.018, 0.018, and its class is 1 (first line of a row). There are 55 vlines in the table, so n = 55. (In generating the feature values for table row recognition, only the vlines enclosed within the identified first and last column of the table are considered.) Since hline 16 is the first line of a row, it is compared with hline 15, the first hline of the immediately preceding row, to generate F1, F2, and F3. F1 = 13/55, since there are 13 space-character-to-space-character transitions from hline 15 to 16. F2 = F3 = 1/55, since there is only one alphanumeric-character-to-space-character transition ("4" to space character in vline 19) and one space-character-to-special-character transition (space character to "." in vline 20). The first non-space character is "W" in the first vline within the table, so k = 1.</Paragraph>
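The sketch below mirrors this computation. Since Table 3's exact conditions did not survive extraction, the F1-F3 conditions here are inferred from the worked example and should be read as an approximation:

    def row_features(h, h_first, n):
        """Four features comparing hline h against h_first (the first hline of
        the relevant row), over the n vlines of the table."""
        def ctype(c):
            if c == " ":
                return "space"
            if c.isalnum():
                return "alnum"
            return "special"
        h, h_first = h.ljust(n), h_first.ljust(n)
        f1 = f2 = f3 = 0
        for a, b in zip(h_first, h):                  # vline-by-vline comparison
            ta, tb = ctype(a), ctype(b)
            if ta == tb == "space":
                f1 += 1                               # space stays space
            elif ta == "alnum" and tb == "space":
                f2 += 1                               # alphanumeric becomes space
            elif ta == "space" and tb == "special":
                f3 += 1                               # space becomes special
        k = len(h) - len(h.lstrip(" ")) + 1           # 1-indexed first non-space position
        return [f1 / n, f2 / n, f3 / n, k / n]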
    </Section>
    <Section position="2" start_page="446" end_page="446" type="sub_section">
      <SectionTitle>
3.2 Learning Algorithms
</SectionTitle>
      <Paragraph position="0"> We used the C4.5 decision tree induction algorithm (Quinlan, 1993) and the backpropagation algorithm for artificial neural nets (Rumelhart et al., 1986) as the learning algorithms to generate the classifiers.</Paragraph>
      <Paragraph position="1"> Both algorithms are representative state-of-the-art learning algorithms for symbolic and connectionist learning.</Paragraph>
      <Paragraph position="2"> We used all the default learning parameters in the C4.5 package. For backpropagation, the learning parameters are: hidden units = 2, epochs = 1000, learning rate = 0.35, and momentum term = 0.5. We also used log n-bit encoding for the symbolic features and normalized the numeric features to [0...1] for backpropagation.</Paragraph>
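The paper used Quinlan's C4.5 package and a backpropagation network directly; as a rough modern stand-in (an assumption, not the original setup), scikit-learn's DecisionTreeClassifier and MLPClassifier can be configured with the stated parameters:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neural_network import MLPClassifier

    # Dummy data in place of the real training examples: 27 boundary features
    # per hline (symbolic features assumed one-hot encoded, numeric features
    # scaled to [0, 1]) and binary in-table/out-of-table labels.
    rng = np.random.default_rng(0)
    X = rng.random((200, 27))
    y = rng.integers(0, 2, 200)

    tree = DecisionTreeClassifier()                  # stand-in for C4.5 defaults
    net = MLPClassifier(hidden_layer_sizes=(2,),     # 2 hidden units
                        solver="sgd",
                        learning_rate_init=0.35,     # learning rate 0.35
                        momentum=0.5,                # momentum term 0.5
                        max_iter=1000)               # 1000 epochs
    tree.fit(X, y)
    net.fit(X, y)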
    </Section>
    <Section position="3" start_page="446" end_page="446" type="sub_section">
      <SectionTitle>
3.3 Recognizing Tables in New Texts
3.3.1 Boundary
</SectionTitle>
      <Paragraph position="0"> Every hline generates a test example, and a classifier assigns the example as either positive (within a table) or negative (outside a table).</Paragraph>
    </Section>
    <Section position="4" start_page="446" end_page="446" type="sub_section">
      <SectionTitle>
3.3.2 Column
</SectionTitle>
      <Paragraph position="0"> After the table boundary has been identified, classification proceeds from the first (leftmost) vline to the last (rightmost) vline in a table. For each vline, a classifier will return one of five classes for the test example generated from the current vline.</Paragraph>
      <Paragraph position="1"> Sometimes, the class assigned by a classifier to the current vline may not be logically consistent with the classes assigned up to that point. For instance, it is not logically consistent if the previous vline is of class 1 (outside any column) and the current vline is assigned class 4 (last line of a column). When this happens, for the backpropagation algorithm, the class which is logically consistent and has the highest score is assigned to the current vline; for C4.5, one of the logically consistent classes is randomly chosen.</Paragraph>
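One plausible encoding of these consistency constraints is sketched below; the paper does not enumerate the legal transitions, so VALID_NEXT is our inference from the five class definitions:

    # Classes: 1 outside any column, 2 first vline, 3 within, 4 last vline,
    # 5 single-vline column.
    VALID_NEXT = {
        1: {1, 2, 5},   # outside: stay outside or start a column
        2: {3, 4},      # just opened: continue or close the column
        3: {3, 4},      # inside: continue or close
        4: {1, 2, 5},   # just closed: outside, or an adjoining column starts
        5: {1, 2, 5},   # a one-vline column behaves like "last"
    }

    def pick_class(prev_class, scores):
        """Highest-scoring class consistent with the previous vline's class,
        mirroring the backpropagation variant described in the text.
        scores maps each class 1-5 to the classifier's score."""
        legal = VALID_NEXT[prev_class]
        return max(legal, key=lambda c: scores.get(c, float("-inf")))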
    </Section>
    <Section position="5" start_page="446" end_page="446" type="sub_section">
      <SectionTitle>
3.3.3 Row
</SectionTitle>
      <Paragraph position="0"> The first hline of a table always starts a new active row (class 1). Thereafter, for a given hline, it is compared with the first hline of the current active row. If the classifier returns class 1 (first hline of a row), then a new active row is started and the current hline is the first hline of this new row. If the classifier returns class 2 (subsequent hline of a row), then the current active row grows to include the current hline.</Paragraph>
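A compact sketch of this procedure, with classify standing in for the trained row classifier (our naming):

    def group_rows(table_hlines, classify):
        """Group a table's hlines into rows. classify(h, first_of_row) returns
        1 if h starts a new row relative to the first hline of the current
        active row, and 2 if h continues that row."""
        rows = [[table_hlines[0]]]            # the first hline always starts a row
        for h in table_hlines[1:]:
            if classify(h, rows[-1][0]) == 1:
                rows.append([h])              # start a new active row
            else:
                rows[-1].append(h)            # grow the current active row
        return rows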
    </Section>
  </Section>
  <Section position="6" start_page="446" end_page="448" type="metho">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> To determine how well our learning approach performs on the task of table recognition, we selected 100 Wall Street Journal (WSJ) news documents from the ACL/DCI CD-ROM. After removing the SGML markup on the original documents, we manually annotated the plain-text documents with table boundary, column, and row information. The documents shown in Figures 1 and 2 are part of the 100 documents used for evaluation.</Paragraph>
    <Section position="1" start_page="446" end_page="448" type="sub_section">
      <SectionTitle>
4.1 Accuracy Definition
</SectionTitle>
      <Paragraph position="0"> To measure the accuracy of recognizing table boundary of a new text, we compare the class assigned by the human annotator to the class assigned by our table recognition program on every hline of the text.</Paragraph>
      <Paragraph position="1"> Let A be the number of hlines identified by the human annotator as being part of some table.</Paragraph>
    </Section>
    <Section position="2" start_page="448" end_page="448" type="sub_section">
      <Paragraph position="0"> Let B be the number of hlines identified by the program as being part of some table. Let C be the number of hlines identified by both the human annotator and the program as being part of some table. Then recall R = C/A and precision P = C/B. The accuracy of table boundary recognition is defined as the F measure, where F = 2RP/(R + P). The accuracy of recognizing table column (row) is defined similarly, by comparing the class assigned by the human annotator and the program to every vline (hline) within a table.</Paragraph>
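As a worked example of the definition (function naming ours):

    def f_measure(a, b, c):
        """F measure from the counts above: a = annotator's table hlines,
        b = program's table hlines, c = hlines marked by both."""
        recall, precision = c / a, c / b
        return 2 * recall * precision / (recall + precision)

    # If the program marks 60 hlines, 40 of which are among the 50 annotated:
    print(round(f_measure(50, 60, 40), 3))   # R = 0.8, P = 0.667, F = 0.727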
    </Section>
    <Section position="3" start_page="448" end_page="448" type="sub_section">
      <SectionTitle>
4.2 Deterministic Algorithms
</SectionTitle>
      <Paragraph position="0"> To determine how well our learning approach performs, we also implemented deterministic algorithms for recognizing table boundary, column, and row.</Paragraph>
      <Paragraph position="1"> The intent is to compare the accuracy achieved by our learning approach to that of the baseline deterministic algorithms. These deterministic algorithms are described below.</Paragraph>
      <Paragraph position="2"> 4.2.1 Boundary. An hline is considered part of a table if at least one of its characters is not a space character and if any of the following conditions is met:
* The ratio of the position of the first non-space character in the hline to the length of the hline exceeds some pre-determined threshold (0.25).
* The hline consists entirely of one special character.
* The hline contains three or more segments, each consisting of two or more contiguous space characters.
* The hline contains two or more segments, each consisting of two or more contiguous separator characters.</Paragraph>
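A sketch of this baseline (our formulation of the four conditions):

    import re

    def is_table_hline(hline, threshold=0.25):
        """Deterministic baseline: does this hline look like part of a table?"""
        stripped = hline.strip(" ")
        if not stripped:
            return False                           # must contain a non-space character
        first_pos = len(hline) - len(hline.lstrip(" ")) + 1
        if first_pos / len(hline) > threshold:
            return True                            # deeply indented first character
        if set(stripped) == {stripped[0]} and not stripped[0].isalnum():
            return True                            # entirely one special character
        if len(re.findall(r" {2,}", hline)) >= 3:
            return True                            # 3+ segments of 2+ spaces
        if sum(len(r) >= 2 for r in re.findall(r"[.*\-]+", hline)) >= 2:
            return True                            # 2+ segments of 2+ separators
        return False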
      <Paragraph position="3"> 4.2.2 Column. All vlines within a table that consist entirely of space characters are considered not part of any column. The remaining vlines within the table are then grouped together to form the columns.</Paragraph>
      <Paragraph position="4"> 4.2.3 Row. The deterministic algorithm to recognize table row is similar to the recognition algorithm of the learning approach given in Section 3.3.3, except that the classifier is replaced by one that computes the proportion of character type transitions. All characters in the two hlines under consideration are grouped into four types: space characters, special characters, alphabetic characters, or digits. If the proportion of characters that change type exceeds some pre-determined threshold (0.5), then the two hlines belong to the same row.</Paragraph>
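A sketch of this baseline check (naming ours; note that dissimilar hlines are grouped into the same row, matching the similarity logic of Section 3.1.3):

    def same_row(hline_a, hline_b, threshold=0.5):
        """Deterministic baseline: hlines whose characters largely change type
        are dissimilar, and dissimilar hlines belong to the same row."""
        def ctype(c):
            if c == " ":
                return "space"
            if c.isalpha():
                return "alpha"
            if c.isdigit():
                return "digit"
            return "special"
        n = max(len(hline_a), len(hline_b))
        a, b = hline_a.ljust(n), hline_b.ljust(n)    # pad to a common length
        changed = sum(ctype(x) != ctype(y) for x, y in zip(a, b))
        return changed / n > threshold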
    </Section>
    <Section position="4" start_page="448" end_page="448" type="sub_section">
      <SectionTitle>
4.3 Results
</SectionTitle>
      <Paragraph position="0"> We evaluated the accuracy of our learning approach on each subproblem of table boundary, column, and row recognition. For each subproblem, we conducted ten random trials and then averaged the accuracy over the ten trials. In each random trial, 20% of the texts are randomly chosen to serve as the texts for testing, and the remaining 80% are used for training. We plot the learning curve as each classifier is given an increasing number of training texts. Figures 4 to 6 summarize the average accuracy over ten random trials for each subproblem. Besides the accuracy for the C4.5 and backpropagation classifiers, we also show the accuracy of the deterministic algorithms.</Paragraph>
      <Paragraph position="1"> The results indicate that our learning approach outperforms the deterministic algorithms on all subproblems. The accuracy of the deterministic algorithms is about 70%, whereas the maximum accuracy achieved by the learning approach ranges from 85% to 95%. Neither learning algorithm clearly outperforms the other: C4.5 gives higher accuracy on recognizing table boundary and column, while backpropagation performs better at recognizing table row.</Paragraph>
      <Paragraph position="2"> To test the generality of our learning approach, we also evaluated it on 50 technical patent documents from the TIPSTER Volume 3 CD-ROM. To see how well a classifier that is trained on one domain of texts generalizes to a different domain, we also tested the accuracy of our learning approach on patent texts after training on WSJ texts only, and vice versa. Space constraints do not permit us to present the detailed empirical results in this paper, but suffice it to say that we found that our learning approach generalizes well to texts from different domains.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="448" end_page="449" type="metho">
    <SectionTitle>
5 Future Work
</SectionTitle>
    <Paragraph position="0"> Currently, our table row recognition does not distinguish among the different types of rows, such as title (or caption) row, header row, and content row.</Paragraph>
    <Paragraph position="1"> We would like to extend our method to make such distinctions. We would also like to investigate the effectiveness of other learning algorithms, such as exemplar-based methods, on the task of table recognition.</Paragraph>
  </Section>
</Paper>