<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2116">
  <Title>Sydney, July 2006. (c) 2006 Association for Computational Linguistics. A Grammatical Approach to Understanding Textual Tables using Two-Dimensional SCFGs</Title>
  <Section position="5" start_page="906" end_page="909" type="metho">
    <SectionTitle>
2 Data Models for Specifying Semantic
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="906" end_page="906" type="sub_section">
      <SectionTitle>
Interpretations
</SectionTitle>
      <Paragraph position="0"> To begin, some formal basis is needed to facilitate precise specification of the alternative semantic interpretations of a table, such that the exact logical relations between its elements are unambiguously specified. This will then enable us to design a table understanding model that attempts to map any given table (and, recursively, its subregions) to alternative data models, depending on which is most appropriate.</Paragraph>
      <Paragraph position="1"> The set of data models we define below is a more comprehensive and precise inventory than found in the previous table analysis models discussed in this paper. It describes all the common conventional patterns of logical relations we have found in the course of empirically analyzing tables from corpora. One advantage of this inventory of data models arises from our appropriation of relational database theory wherever possible to help describe the form of the data models (Silberschatz et al., 2002), allowing broad coverage of different table types without sacrificing precision as to the logical relations between entities.</Paragraph>
      <Paragraph position="2"> Each data model assigns a clear semantics in terms of logical relations between the table elements, thereby allowing extraction of relational facts. In contrast, previous work on table analysis tends to either classify a table using only one single limited data model (e.g., Hurst (2000)), or using data models which essentially are merely surface layout types whose semantics are vague and ambiguous (e.g., Yang (2002), Yang and Luk (2002), Wang et al. (2000), Yoshida et al. (2001)).</Paragraph>
      <Paragraph position="3"> A table is a logical view of a collection of interrelated items, usually presented as a row-column structure so that the reader's ability to access and compare information is enhanced, as also noted by Wang (1996).</Paragraph>
      <Paragraph position="4"> From a database management system perspective, each table can be considered as a (tiny) database. Like a program, the reader accesses the data. As a result, we consider that every table must correspond to a data model, and this model determines how the reader extracts information from the table.</Paragraph>
      <Paragraph position="5"> Each data model has a schema which, as we shall see below, may or may not surface (partially or completely) as a subset of cells in the table that describe attributes. Recognizing the data models of a table correctly therefore also implies that both attribute-value pairings and table structures have been recognized.</Paragraph>
      <Paragraph position="6"> At the top level, we categorize the data models into three broad types:
 * Flat model: A table is interpreted as a database table in a non-1NF relational model.</Paragraph>
      <Paragraph position="7"> * Nested model: A table is interpreted as a database table in an object-relational model, which allows complex types such as nested relations and concept hierarchies.</Paragraph>
      <Paragraph position="8"> * Dimensional data model: A table (usually cross-tabular) is interpreted as a data cube (multidimensional table) in a multidimensional data model.</Paragraph>
      <Paragraph position="9"> We now consider each of these types of data models in turn.</Paragraph>
    </Section>
    <Section position="2" start_page="906" end_page="907" type="sub_section">
      <SectionTitle>
2.1 Flat model
</SectionTitle>
      <Paragraph position="0"> A flat model is used for the semantic interpretation of any table as a relational database table in non-1NF. For example, tables such as Tables 2 and 3 are often interpreted by humans in terms of flat models. Table 3 can be viewed as a relational database table with a schema (Pos, Teams, Pld, Pts) and three records, because the table's surface form resembles how records are stored in a relational database table. Similarly, Table 2 resembles a relational database table, but transposed to a vertical orientation, with the first column as the schema (Date, Temp, RH, Weather) and the other columns as data records.</Paragraph>
      <Paragraph position="1"> Table 3:
 Pos | Teams      | Pld | Pts
 1.  | Chelsea    | 38  | 95
 2.  | Arsenal    | 38  | 83
 3.  | Man United | 38  | 77
</Paragraph>
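      <Paragraph> As a minimal sketch of the flat-model reading described above, the following Python fragment turns the rows of Table 3 into attribute-value records and shows that a vertically oriented table can be read the same way after transposition. The function names read_horizontal and read_vertical are illustrative assumptions and are not part of the original model; the vertical table is built here by transposing Table 3, since the cell values of Table 2 are not given in the text.
<![CDATA[
```python
from typing import Dict, List

Row = List[str]


def read_horizontal(table: List[Row]) -> List[Dict[str, str]]:
    """Treat the first row as the schema and every later row as a record."""
    schema, *records = table
    return [dict(zip(schema, record)) for record in records]


def read_vertical(table: List[Row]) -> List[Dict[str, str]]:
    """Treat the first column as the schema by transposing, then reading rows."""
    transposed = [list(column) for column in zip(*table)]
    return read_horizontal(transposed)


# Table 3 as given in the text: schema (Pos, Teams, Pld, Pts) and three records.
table3 = [
    ["Pos", "Teams", "Pld", "Pts"],
    ["1.", "Chelsea", "38", "95"],
    ["2.", "Arsenal", "38", "83"],
    ["3.", "Man United", "38", "77"],
]

records = read_horizontal(table3)

# A vertically oriented copy of the same table (built by transposition, purely
# for illustration) yields the same records under the vertical reading.
table3_vertical = [list(column) for column in zip(*table3)]
vertical_records = read_vertical(table3_vertical)
```
]]>
 Both readings produce the same three records, which illustrates why the flat model is treated as a semantic interpretation rather than as a surface layout.</Paragraph>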
      <Paragraph position="2"> The flat model is closest to the 1-dimensional table approach used by the majority of previous models, but our approach designates the flat model as a semantic representation, in contrast to the previous models which see 1-dimensional tables merely as a syntactic surface form (e.g., Yang (2002), Yang and Luk (2002), Wang et al. (2000), Yoshida et al. (2001)). While such previous models only recognize tables that are physically laid out in this form, our approach clearly delineates an explicit separation of syntax and semantics, which provides greater flexibility, allowing any table to be interpreted as a flat model regardless of its surface form (though the flat model interpretation is more common for some surface forms than others).</Paragraph>
      <Paragraph position="3"> As an example showing that any kind of table can be categorized as a flat model, consider Table 6. Even such a table can be semantically interpreted as a flat model because related attributes can join together to form a composite attribute, though humans would less naturally choose this semantic interpretation. Certainly there are hierarchical relationships between attributes; for example, Ass1 is a subtype of Assignments. However, it is also valid to consider the attributes along a hierarchical path as one composite attribute. For example, &quot;Mark -&gt; Assignments -&gt; Ass1&quot; becomes the single attribute &quot;Mark-Assignments-Ass1&quot;. Then the complete flat model schema is (Year, Team, Mark-</Paragraph>
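      <Paragraph> The composite-attribute idea above can be sketched as a small recursive flattening of a hierarchical schema. This is only an illustration: the function flatten_schema and the exact shape of the hierarchy are assumptions. Only the names Year, Team, Mark, Assignments, Ass1 and Midterm occur in the text, and the placement of Midterm under Mark is a guess.
<![CDATA[
```python
from typing import Dict, List, Union

# A schema node is either a leaf attribute name or a mapping from a parent
# attribute to its child nodes.
SchemaNode = Union[str, Dict[str, List["SchemaNode"]]]


def flatten_schema(nodes: List[SchemaNode], prefix: str = "") -> List[str]:
    """Join every root-to-leaf attribute path into one composite attribute."""
    flat: List[str] = []
    for node in nodes:
        if isinstance(node, str):
            flat.append(prefix + node)
        else:
            for parent, children in node.items():
                flat.extend(flatten_schema(children, prefix + parent + "-"))
    return flat


# Hypothetical hierarchy; the real attribute tree of Table 6 may differ.
nested_schema: List[SchemaNode] = [
    "Year",
    "Team",
    {"Mark": [{"Assignments": ["Ass1"]}, "Midterm"]},
]

flat_schema = flatten_schema(nested_schema)
```
]]>
 Under these assumptions the path Mark, Assignments, Ass1 collapses into the single attribute Mark-Assignments-Ass1, as in the example in the text.</Paragraph>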
    </Section>
    <Section position="3" start_page="907" end_page="907" type="sub_section">
      <SectionTitle>
2.2 Nested model
</SectionTitle>
      <Paragraph position="0"> With the exception of Hurst (2000), previous work has generally not considered nested models explicitly. Hurst's (2000) model is based on the abstract table model of Wang (1996), in which attributes may be related in a hierarchical way. On the other hand, Wang et al. (2000) oversimplify by treating nested models as 1-dimensional, thus missing the correct relationships between attributes and values.</Paragraph>
      <Paragraph position="1"> A nested model can be seen as a generalization of the flat model, in which attributes may be related through composition or inheritance. Table 6 is naturally interpreted as a nested data model because the attributes have an inheritance relationship. The corresponding schema is (Year, Team,  A nested model is not appropriate for tables without hierarchical structure, such as Table 2 and</Paragraph>
    </Section>
    <Section position="4" start_page="907" end_page="909" type="sub_section">
      <SectionTitle>
2.3 Dimensional model
</SectionTitle>
      <Paragraph position="0"> Our approach also handles dimensional models, which are generally handled quite weakly in previous models. A dimensional model refers to a table, such as Table 4, that resembles a view of a collection of data stored in a multidimensional data model. A multidimensional data cube, as described in the database literature (e.g., Han and Kamber (2000), Chaudhuri and Dayal (1997)), consists of a set of numeric measures (though in fact the data need not be numeric), each of which is determined by a set of dimensions.</Paragraph>
      <Paragraph position="1"> Each dimension is described by a set of attributes.</Paragraph>
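      <Paragraph> To make the notions of measure and dimension concrete, here is a small sketch of a data cube as a mapping from one value per dimension to a measure. The attribute values Phone and Computer appear in the text; the dimension names, the other attribute values, all revenue figures and the helper roll_up are invented for illustration only.
<![CDATA[
```python
from typing import Dict, List, Tuple

# Dimension names and their attribute values (mostly hypothetical).
dimensions: Dict[str, List[str]] = {
    "product": ["Phone", "Computer"],
    "year": ["2004", "2005"],
    "region": ["East", "West"],
}

# Each measure (here, revenue) is determined by one value per dimension.
cube: Dict[Tuple[str, str, str], float] = {
    ("Phone", "2004", "East"): 10.0,
    ("Phone", "2004", "West"): 12.0,
    ("Phone", "2005", "East"): 11.0,
    ("Phone", "2005", "West"): 15.0,
    ("Computer", "2004", "East"): 20.0,
    ("Computer", "2004", "West"): 18.0,
    ("Computer", "2005", "East"): 25.0,
    ("Computer", "2005", "West"): 22.0,
}


def roll_up(data: Dict[Tuple[str, ...], float], keep: List[int]) -> Dict[Tuple[str, ...], float]:
    """Sum the measure over every dimension whose index is not in `keep`."""
    totals: Dict[Tuple[str, ...], float] = {}
    for key, value in data.items():
        reduced = tuple(key[i] for i in keep)
        totals[reduced] = totals.get(reduced, 0.0) + value
    return totals


# A two-dimensional view over (product, year), aggregating away the region.
by_product_year = roll_up(cube, keep=[0, 1])
```
]]>
 A cross-tabular table can then be thought of as one printed view of such a cube, with the dimension attributes in the header cells and the measures in the body.</Paragraph>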
      <Paragraph position="2"> For example, Table 5 can be semantically interpreted using the multidimensional data model depicted in Figure 1. Likewise, the cross-tabular table in Table 4 can also be semantically interpreted using the same multidimensional data model in  Table 5 are the dimension attributes and the revenue values are the measures.</Paragraph>
      <Paragraph position="3"> In contrast, among previous models, Yang (2002) produces a semantically incorrect recognition of a multidimensional table that inappropriately presents the attributes in hierarchical structure. Yang and Luk (2002) and Wang et al.</Paragraph>
      <Paragraph position="4"> (2000) only recognize the simplest 2-dimensional case and apparently cannot handle 3 or more dimensions. Yoshida et al. (2001) only handle 1-dimensional cases.</Paragraph>
      <Paragraph position="5"> A dimensional model is an inappropriate interpretation for non-cross-tabular tables, such as Table 2 and Table 3. A dimensional model is also not valid for tables such as Table 6. Semantically, it is not possible for &quot;Assignments&quot; and &quot;Midterm&quot; to belong to different dimensions because it is incorrect to determine the score by both &quot;Assignments&quot; and &quot;Midterm&quot;. Syntactically, the texts in the last attribute row of Table 6 are all unique; however, the last attribute row of Table 4 is a repeating sequence of (&quot;Phone&quot;, &quot;Computer&quot;). Therefore, to a non-English reader, an English cross-tabular table which possesses repeating sequences in the attribute rows is likely to be semantically interpreted as a dimensional model, while a cross-tabular table which does not have this property is likely to be interpreted as a nested model.

 In this section, we will present our two-dimensional SCFG parsing model for table analysis, which has several advantages over the ad hoc approaches. First, the probabilistic grammar approach permits a cleaner encapsulation and generalization of the kind of knowledge that previous models attempted to capture within their ad hoc heuristics. Most previous works (e.g., Yang (2002), Yang and Luk (2002), Hurst (2000), Hurst (2002)) gradually built up their ad hoc heuristics manually by inspecting some set of training samples. This approach may work if tables are from limited domains of a similar nature. However, like text documents, the syntactic layout of textual tables may be determined by its context as well as its language. For instance, it is natural for an Arabic reader to read an Arabic table taking the right-most column as the attribute column, instead of the left-most column. Yoshida et al. (2001) use machine-learning techniques to analyze nine types of table structures, all 1-dimensional.
Our grammar-based approach allows the model to be readily adapted to different situations by applying different sets of grammar rules.</Paragraph>
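      <Paragraph> The syntactic cue mentioned above, a repeating sequence in the last attribute row, can be checked without understanding the cell text. The following sketch is an illustration only and is not the grammar-based model itself; the function names are assumptions, the number of repetitions in Table 4 is assumed to be two, and the label Ass2 is hypothetical.
<![CDATA[
```python
from typing import List, Sequence


def is_repeating_sequence(row: Sequence[str]) -> bool:
    """Return True if `row` is some shorter block repeated two or more times."""
    n = len(row)
    for period in range(1, n // 2 + 1):
        if n % period == 0 and all(row[i] == row[i % period] for i in range(n)):
            return True
    return False


def guess_model(last_attribute_row: List[str]) -> str:
    """Purely syntactic guess between the two cross-tabular interpretations."""
    return "dimensional" if is_repeating_sequence(last_attribute_row) else "nested"


# Last attribute row in the style of Table 4 (repetition count assumed).
table4_row = ["Phone", "Computer", "Phone", "Computer"]
# Last attribute row in the style of Table 6 (labels partly hypothetical).
table6_row = ["Ass1", "Ass2", "Midterm"]

table4_guess = guess_model(table4_row)
table6_guess = guess_model(table6_row)
```
]]>
 Such a check is only a heuristic; in the approach described in this paper, cues of this kind are meant to be expressed through grammar rules rather than hand-written tests.</Paragraph>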
      <Paragraph position="6"> Another advantage is that the grammatical approach can make more accurate decisions while being simpler to implement, because it requires only a single integrated parsing process to complete the entire table analysis. This includes classifying the function of each cell (as attribute or value), pairing attributes and values, and identifying the structure and the data model of a table. In contrast, previous works require several stages to complete the entire analysis, introducing complex problems that are difficult to resolve, such as premature commitment to incorrect early-stage decisions. To our knowledge, Wang et al. (2000) is the only textual table analysis model that uses a grammar to describe table structures. However, in that case, only a simple template matching analyzer is used.</Paragraph>
      <Paragraph position="7"> Their grammar notation is unable to show both the physical structure and the semantics of a table at the same time in a hierarchical manner. In contrast, information such as &quot;a data block contains three rows of data cell&quot; can be stored in the parse tree constructed by our parsing model.</Paragraph>
      <Paragraph position="8"> Outside of the table understanding literature, there exists a different 2D parsing technique called PLEX (Feder, 1971), (Costagliola et al., 1994) which allows an object to have finite sets of attaching points. PLEX is used to generate 2D diagrams such as molecular structures, circuit diagrams and flow charts in a grammatical way. However, we consider it too complex and computationally expensive for our application because it does not exploit the fact that a textual table cell has at most four attaching points in fixed directions.</Paragraph>
      <Paragraph position="9"> Our parser is a two-dimensional extension of the conventional probabilistic chart parsing algorithm (Lari and Young, 1990), (Goodman, 1998).</Paragraph>
      <Paragraph position="10"> Intuitively, consider a sentence as a vector of tokens that will be parsed horizontally; then a table is a matrix of tokens (like a crossword puzzle) that will be parsed both horizontally and vertically.</Paragraph>
      <Paragraph position="11"> Because of this, our parser must run in both directions. We achieve this by employing a grammar notation that specifies the direction of parsing.</Paragraph>
      <Paragraph position="12"> The two-dimensional grammar notation consists of a set of nonterminals, a set of terminals, and two generation operators &quot;-&gt;&quot; and &quot;|-&gt;&quot;. Let X be a nonterminal and let Y and Z be two symbols which may be either nonterminals or terminals. Then:</Paragraph>
      <Paragraph position="14"> X -&gt; Y Z is a horizontal production rule saying that the nonterminal X horizontally generates two symbols Y and Z.</Paragraph>
      <Paragraph position="16"> X |-&gt; Y Z is a vertical production rule saying that the nonterminal X vertically generates two symbols Y and Z.</Paragraph>
      <Paragraph position="18"> X -&gt; Y is a unary production rule saying that the nonterminal X generates a symbol Y.</Paragraph>
      <Paragraph position="19"> We assume that all rules are binary without loss of generality, since any grammar can be mechanically binarized without materially changing the parse tree structure, just as in the case of ordinary 1D grammars.</Paragraph>
      <Paragraph position="20"> The operators &quot;-&gt;&quot; and &quot;|-&gt;&quot; control the generation direction. In terms of table analysis, a nonterminal represents a matrix of tokens and a terminal represents a single token. Sub-matrices generated by a horizontal rule will have the same height but not necessarily the same width; similarly, sub-matrices generated by a vertical rule will have the same width but not necessarily the same height. In other words, a matrix is partitioned into two parts by the binary production rule.</Paragraph>
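      <Paragraph> The partitioning behaviour of the two kinds of binary rule can be sketched directly on a matrix of tokens. The helper names split_horizontal and split_vertical are assumptions made for this illustration; the sample matrix reuses the cells of Table 3.
<![CDATA[
```python
from typing import List, Tuple

Matrix = List[List[str]]


def split_horizontal(m: Matrix, col: int) -> Tuple[Matrix, Matrix]:
    """Horizontal rule: left and right parts share the same height."""
    return [row[:col] for row in m], [row[col:] for row in m]


def split_vertical(m: Matrix, row: int) -> Tuple[Matrix, Matrix]:
    """Vertical rule: top and bottom parts share the same width."""
    return m[:row], m[row:]


table3 = [
    ["Pos", "Teams", "Pld", "Pts"],
    ["1.", "Chelsea", "38", "95"],
    ["2.", "Arsenal", "38", "83"],
    ["3.", "Man United", "38", "77"],
]

# Split after the first column: a 4x1 part next to a 4x3 part.
left, right = split_horizontal(table3, 1)
# Split after the first row: a 1x4 part above a 3x4 part.
top, bottom = split_vertical(table3, 1)
```
]]>
 A parser has to consider every such split point for every sub-matrix, which is what makes the two-dimensional chart larger than its one-dimensional counterpart.</Paragraph>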
      <Paragraph position="21"> Probabilities are placed on each rule, as in ordinary 1D SCFGs. They are used to eliminate parses falling below a threshold, which also helps to reduce the time complexity in practice.</Paragraph>
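      <Paragraph> The following is a minimal sketch of how a probabilistic chart parser might be extended to two dimensions with threshold pruning. It is not the parser of this paper: the toy grammar, the symbol names, the pre-classified terminals attr and val, the Viterbi-style scoring and the threshold value are all assumptions chosen to keep the example short.
<![CDATA[
```python
from collections import defaultdict
from typing import Dict, List, Tuple

Span = Tuple[int, int, int, int]  # (top, bottom, left, right), end-exclusive
Chart = Dict[Span, Dict[str, float]]
Rules = Dict[Tuple[str, str], List[Tuple[str, float]]]

# Toy grammar, invented for illustration. Keys are right-hand sides; values
# list (left-hand side, probability) pairs.
HORIZONTAL: Rules = {  # X -> Y Z : Y lies to the left of Z
    ("Attr", "AttrRow"): [("AttrRow", 0.5)],
    ("Val", "ValRow"): [("ValRow", 0.5)],
}
VERTICAL: Rules = {  # X |-> Y Z : Y lies above Z
    ("AttrRow", "Records"): [("Flat", 1.0)],
    ("ValRow", "Records"): [("Records", 0.5)],
}
UNARY = {  # X -> Y, keyed by Y
    "Attr": [("AttrRow", 0.5)],
    "Val": [("ValRow", 0.5)],
    "ValRow": [("Records", 0.5)],
}
LEXICAL = {"attr": [("Attr", 1.0)], "val": [("Val", 1.0)]}


def parse(table: List[List[str]], threshold: float = 1e-6) -> Chart:
    """Fill a chart with the best score of each symbol over each sub-matrix."""
    rows, cols = len(table), len(table[0])
    chart: Chart = defaultdict(dict)

    def add(span: Span, symbol: str, prob: float) -> None:
        # Keep the best score per symbol and prune hypotheses below threshold.
        if prob >= threshold and prob > chart[span].get(symbol, 0.0):
            chart[span][symbol] = prob
            for parent, p in UNARY.get(symbol, []):
                add(span, parent, prob * p)

    def join(first: Span, second: Span, rules: Rules, span: Span) -> None:
        for y, py in list(chart[first].items()):
            for z, pz in list(chart[second].items()):
                for x, p in rules.get((y, z), []):
                    add(span, x, p * py * pz)

    # Visit sub-matrices from small to large so that parts precede wholes.
    for height in range(1, rows + 1):
        for width in range(1, cols + 1):
            for top in range(rows - height + 1):
                for left in range(cols - width + 1):
                    bottom, right = top + height, left + width
                    span = (top, bottom, left, right)
                    if height == 1 and width == 1:
                        for symbol, p in LEXICAL.get(table[top][left], []):
                            add(span, symbol, p)
                    for mid in range(left + 1, right):  # split by column
                        join((top, bottom, left, mid), (top, bottom, mid, right), HORIZONTAL, span)
                    for mid in range(top + 1, bottom):  # split by row
                        join((top, mid, left, right), (mid, bottom, left, right), VERTICAL, span)
    return chart


# One attribute row above two value rows, with cell functions already tagged.
table = [["attr", "attr"], ["val", "val"], ["val", "val"]]
chart = parse(table)
root = chart[(0, 3, 0, 2)]
```
]]>
 In this sketch the entry for the whole matrix contains the symbol Flat, meaning that the toy grammar derives the table as an attribute row stacked on a block of records. A fuller implementation would also store back-pointers so that the parse tree itself can be recovered.</Paragraph>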
      <Paragraph position="22"> Parsing with two-dimensional grammars can be conceptualized most easily via parse tree examples. Figure 2 shows a complete parse tree for parsing the table in Table 7 into a flat model. Figure 3 is a portion of a parse tree for parsing the table in Table 8 into a nested model, while Figure</Paragraph>
    </Section>
    <Section position="5" start_page="909" end_page="909" type="sub_section">
      <SectionTitle>
4 is a portion of parse trees for parsing Table 7 into
</SectionTitle>
      <Paragraph position="0"> a dimensional model. The following is the grammar fragment that gives the parse tree as Figure  Note that the internal nodes of the parse trees serve to label subregions with data models, thus assigning a semantic interpretation specifying the exact logical relations between table elements. None of the previous models construct declarative parse trees like these, which are necessary for many types of subsequent analysis, including information extraction applications.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="909" end_page="911" type="metho">
    <SectionTitle>
4 Experimental Method
</SectionTitle>
    <Paragraph position="0"> To the best of our knowledge, unfortunately none of the table corpora mentioned in previous work are available to the public. Thus, it was necessary to construct a corpus for our experiments.</Paragraph>
    <Paragraph position="1"> We collected a large sample of tables by issuing Google searches with a list of random keywords, for example, census age, confusion matrix, data table, movie ranking, MSFT, school ranking, telephone plan, tsunami numbers, weather report, and  For the blind evaluation, a human annotator independently and manually annotated a randomly chosen sample of 45 tables from the collection. All tables in the evaluation sample were previously unseen test cases, never inspected prior to the construction of the two-dimensional grammar.</Paragraph>
    <Paragraph position="2"> Each tokenized table was tagged by the human judge with a list of types T relevant to the table. Relevance is defined as follows: a data model is relevant to a table if and only if the human would agree that such a data model would naturally be hypothesized as an interpretation for that table (analogously to the way that word senses are manually annotated for WSD evaluations). Each type is a tuple of the form (R, O, S), where R is the relevant data model, O is the reading orientation of R, and S is a boolean saying whether a schema (i.e., attributes) exists in the table. Thus, Table 2 would be tagged as {(flat, vertical, true)} while Table 4 would be tagged as {(flat, horizontal, true), (flat, vertical, true), (dimensional, , true)}. But Table 9 may be tagged as {(flat, horizontal, false)}. The exceptions are that both the nested model and the dimensional model always have a schema, while the dimensional model does not have an orientation. In cases where multiple legitimate readings were possible, the table was tagged with multiple types. A total of 92 relevant types were generated from the tokenized tables.</Paragraph>
    <Paragraph position="3"> We processed the tokenized tables with the two-dimensional SCFG parser, and computed the precision and the recall rates against the judge's lists of tags for all the test cases.</Paragraph>
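    <Paragraph> As an illustration of this evaluation, the sketch below represents each type as an (R, O, S) tuple and computes precision and recall over sets of such tuples. The micro-averaged, exact-match definitions used here are an assumption, since the text does not spell out the formulas. The gold tags for Table 2 and Table 4 follow the text; the parser output is invented.
<![CDATA[
```python
from typing import List, Optional, Set, Tuple

# (R, O, S): data model, reading orientation (None if not applicable), schema flag.
TableType = Tuple[str, Optional[str], bool]


def precision_recall(
    gold: List[Set[TableType]], predicted: List[Set[TableType]]
) -> Tuple[float, float]:
    """Micro-averaged precision and recall over all tables."""
    correct = sum(len(g & p) for g, p in zip(gold, predicted))
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    return precision, recall


# Gold tags for Table 2 and Table 4 as given in the text.
gold = [
    {("flat", "vertical", True)},
    {("flat", "horizontal", True), ("flat", "vertical", True), ("dimensional", None, True)},
]
# Hypothetical parser output, invented for illustration.
predicted = [
    {("flat", "vertical", True)},
    {("flat", "horizontal", True), ("dimensional", None, True), ("nested", "horizontal", True)},
]

precision, recall = precision_recall(gold, predicted)
```
]]>
 With these made-up predictions, three of the four predicted types match a gold type and three of the four gold types are recovered.</Paragraph>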
  </Section>
</Paper>