<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1172">
  <Title>Emdros - a text database engine for analyzed or annotated text</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 History of Emdros
</SectionTitle>
    <Paragraph position="0"> Emdros springs out of a reformulation and implementation of the work done by Crist-Jan Doedens in his 1994 PhD thesis (Doedens, 1994). Doedens defined the MdF (Monads-dot-Features) text database model and the QL query language. He gave QL a denotational semantics and loaded it with features, which made it very difficult to implement.</Paragraph>
    <Paragraph position="1"> The present author later took Doedens' QL, scaled it down, and gave it an operational semantics, hence making it easier to implement, resulting in the MQL query language. I also took the MdF model and extended it slightly, resulting in the EMdF model.</Paragraph>
    <Paragraph position="2"> Later, I implemented both, resulting in the Emdros text database engine, which has been available as Open Source software since October 2001. The website has full source code and documentation.</Paragraph>
    <Paragraph position="3"> Emdros is a general-purpose engine, not a specific application. This means that Emdros must be incorporated into a specific software application before it can be made useful.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 The EMdF model
</SectionTitle>
    <Paragraph position="0"> The EMdF model is an extension of the MdF model developed in (Doedens, 1994). The EMdF (Extended MdF) model is based on four concepts: Monad, object, object type, and feature. I describe each of these in turn, and give a small example of an EMdF database.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Monad
</SectionTitle>
      <Paragraph position="0"> A monad is simply an integer. The sequence of the integers (1,2,3, etc.) dictates the sequence of the text. The monads do not impose a reading-direction (e.g., left-to-right, right-to-left), but merely a logical text-order.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Object
</SectionTitle>
      <Paragraph position="0"> An object is simply a set of monads with an associated object type. The set is arbitrary in the sense that there are no restrictions on the set. E.g., {1}, {2}, {1,2}, and {1,2,6,7} are all valid objects. This allows for objects with gaps, or discontiguous objects (e.g., discontiguous clauses). In addition, an object always has a unique integer id, separate from the object's monad set.</Paragraph>
      <Paragraph position="1"> Objects are the building blocks of the text itself, as well as the annotations or analyses in the database. To see how, we must introduce object types.</Paragraph>
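      <Paragraph position="2"> As an illustration only, the definition above can be sketched in Python. The class name TextObject and its fields are hypothetical and are not part of the Emdros API; the sketch assumes a non-empty monad set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextObject:
    """Hypothetical stand-in for an EMdF object; not part of the Emdros API."""

    object_id: int      # unique integer id, separate from the monad set
    object_type: str    # e.g. "Word", "Phrase", "Clause"
    monads: frozenset   # arbitrary, non-empty set of integers (monads)

    def has_gaps(self) -> bool:
        """Return True if the monad set is discontiguous."""
        span = max(self.monads) - min(self.monads) + 1
        return span != len(self.monads)


# The ids 1 and 2 are made up for this example.
word = TextObject(1, "Word", frozenset({1}))
clause = TextObject(2, "Clause", frozenset({1, 2, 6, 7}))
```

Here clause.has_gaps() is true because monads 3 to 5 are missing from {1,2,6,7}, whereas word.has_gaps() is false.</Paragraph>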
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Object type
</SectionTitle>
      <Paragraph position="0"> An object type groups a set of objects into such classes as Word, Phrase, Clause, Sentence, Paragraph, Chapter, Book, Quotation, Report, etc. Generally, when designing an Emdros database, one chooses a monad-granularity, which dictates the smallest object in the database, namely the one that corresponds to one monad. This smallest object is often Word, but could be Morpheme, Phoneme or even Grapheme. Thus, for example, Word number 1 might consist of the monad set {1}, and Word number 2 might consist of the monad set {2}, whereas the first Phrase in the database might consist of the set {1,2}.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Feature
</SectionTitle>
      <Paragraph position="0"> An object type can have any number of features. A feature is an attribute of an object, and always has a type. The type can be a string, an integer, an enumeration, or an object id. The latter allows for complex interrelationships among objects, with objects pointing to each other, e.g., a dependent pointing to a head.</Paragraph>
      <Paragraph position="1"> An enumeration is a set of labels with values. For example, one might define an enumeration psp (part of speech) with labels such as noun, verb, adjective, etc. Emdros supports arbitrary definition of enumeration label sets.</Paragraph>
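      <Paragraph position="2"> The feature types can be pictured with a short Python sketch. This is an analogy under stated assumptions, not Emdros syntax: the integer values of the labels, the object ids 1 and 2, and the feature name head are all made up for the example.

```python
from enum import Enum


class Psp(Enum):
    """Hypothetical part-of-speech enumeration: labels with integer values."""

    noun = 1
    verb = 2
    adjective = 3


# Features of two Word objects, keyed by hypothetical object ids.
# "surface" is a string feature, "psp" an enumeration feature, and "head"
# an assumed feature of type object id: the dependent points to its head.
words = {
    1: {"surface": "blue", "psp": Psp.adjective, "head": 2},
    2: {"surface": "door", "psp": Psp.noun, "head": None},
}

# Following the object-id feature leads from the dependent to its head.
head_of_blue = words[words[1]["head"]]
```

Looking up the head of the word with id 1 yields the word whose surface is "door".</Paragraph>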
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3.5 Example
</SectionTitle>
    <Paragraph position="0"> Consider Figure 1. It shows an EMdF database corresponding to one possible analysis of the sentence "The door was blue". There are three object types: Word, Phrase, and Clause. The Clause object type has no features. The Phrase object type has the feature phr_type (phrase type). The Word object type has the features surface and psp.</Paragraph>
    <Paragraph position="1"> The monad-granularity is Word, i.e., each monad corresponds to one word. Thus the word with id 10001 consists of the monad set {1}. The phrase with id 10005 consists of the monad set {1,2}. The single clause object consists of the monad set {1,2,3,4}.</Paragraph>
    <Paragraph position="2"> The text is encoded by the surface feature on the Word object type. One could add features such as lemma, number, gender, or any other feature relevant to the database under construction. The Phrase object type could be given features such as function, apposition_head, relative_head, etc. The Clause object type could be given features distinguishing such things as VSO order, tense of verbal form, illocutionary force, nominal clause/verbless clause, etc. It all depends on the theory used to describe the database, as well as the research goals.</Paragraph>
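    <Paragraph position="3"> A rough Python rendering of such a database may help to fix ideas. Only the ids 10001 and 10005 and the three monad sets given above are taken from the text; all other ids, the psp labels and the phr_type values are assumptions made for the sake of the sketch, as is the helper function text_of.

```python
# Word objects: id -> (monad set, features). Id 10001 and its monad set come
# from the text; the remaining ids and all psp labels are assumptions.
words = {
    10001: ({1}, {"surface": "The", "psp": "article"}),
    10002: ({2}, {"surface": "door", "psp": "noun"}),
    10003: ({3}, {"surface": "was", "psp": "verb"}),
    10004: ({4}, {"surface": "blue", "psp": "adjective"}),
}

# Phrase objects: id 10005 with monads {1, 2} comes from the text; the other
# phrases and every phr_type value are assumptions.
phrases = {
    10005: ({1, 2}, {"phr_type": "NP"}),
    10006: ({3}, {"phr_type": "VP"}),
    10007: ({4}, {"phr_type": "AP"}),
}

# The single Clause object has no features; its id is an assumption.
clauses = {10008: ({1, 2, 3, 4}, {})}


def text_of(monads):
    """Return the surface text covered by a monad set, in monad order."""
    covered = [
        (min(word_monads), features["surface"])
        for word_monads, features in words.values()
        if word_monads.issubset(monads)
    ]
    return " ".join(surface for _, surface in sorted(covered))
```

Because the text lives in the surface feature of the Word objects, the text of any larger object can be recovered from its monad set alone, e.g. text_of(phrases[10005][0]) for the first phrase.</Paragraph>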
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 The MQL query language
</SectionTitle>
    <Paragraph position="0"> MQL is based on two properties of text which are universal: sequence and embedding. All texts have sequence, dictated by the constraints of time and the limitation of our human vocal tract to produce only one sequence of words at any given time. In addition, all texts have, when analyzed linguistically, some element of embedding, as embodied in the notions of phrase, clause, sentence, paragraph, etc.</Paragraph>
    <Paragraph position="1"> MQL directly supports searching for sequence and embedding by means of the notion of topographicity, which originates in (Doedens, 1994). A (formal) language is topographic if and only if there is an isomorphism between the structure of an expression in the language and the objects which the expression denotes.</Paragraph>
    <Paragraph position="2"> MQL's basic building block is the object block.</Paragraph>
    <Paragraph position="3"> An object block searches for objects in the database of a given type, e.g., Word, Phrase or Clause. If two object blocks are adjacent, then the objects which they find must also be adjacent in the database. If an object block is embedded inside another object block, then the inner object must be embedded in the outer object in the database.</Paragraph>
    <Paragraph position="4"> Consider Figure 2. It shows two adjacent object blocks, with feature constraints. This would find two Phrase objects in the database where the first is an NP and the second is a VP. The objects must be adjacent in the database because the object blocks are adjacent.</Paragraph>
    <Paragraph position="5"> Now consider Figure 3. This query would find a clause, with the restriction that embedded inside the clause must be two phrases, a subject NP and a predicate VP, in that order. The .. operator means that space is allowed between the NP and the VP, but the space must be inside the limits of the surrounding clause. All of this presupposes an appropriately tagged database, of course.</Paragraph>
    <Paragraph position="6"> The restrictions of type phrase_type = NP refer to features (or attributes) of the objects in the database. The restriction expressions can be any Boolean expression (and/or/not/parentheses), allowing very complex restrictions at the object-level.</Paragraph>
    <Paragraph position="7"> Consider Figure 4. It shows how one can look for objects inside gaps in other objects. In some linguistic theories, the sentence "The door, which opened towards the East, was blue" would consist of one discontiguous clause ("The door ... was blue") with an intervening nonrestrictive relative clause, not part of the surrounding clause. For a sustained argument in favor of this interpretation, see (McCawley, 1982). The query in Figure 4 searches for structures of this kind. The surrounding context is a Sentence. Inside of this sentence, one must find a Clause. The first object in this clause must be a subject NP. Directly adjacent to this subject NP must be a gap in the surrounding context (the Clause). Inside of this gap must be a Clause whose clause type is nonrestr_rel. Directly after the close of the gap, one must find a VP whose function is predicate. Mapping this structure to the example sentence is left as an exercise for the reader.</Paragraph>
    <Paragraph position="8"> Lastly, objects can refer to each other in the query. This is useful for specifying such things as agreement and heads/dependents. In Figure 5, the AS keyword gives a name (w1) to the noun inside the NP, and this name can then be used inside the adjective in the AdjP to specify agreement.</Paragraph>
    <Paragraph position="9"> MQL provides a number of features not covered in this paper. For full documentation, see the website. The real power of MQL lies in its ability to express complex search restrictions both at the level of structure (sequence and embedding) and at the object-level.</Paragraph>
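    <Paragraph position="10"> The structural notions used in this section (adjacency, embedding, and gaps) can be approximated on plain monad sets. The following Python sketch is one plausible reading of these notions, not the definitions used inside Emdros; the function names are hypothetical, and the monad numbers assume Word granularity with one monad per word of the example sentence.

```python
def is_adjacent(first, second):
    """Assumed reading of adjacency: second starts right after first ends."""
    return min(second) == max(first) + 1


def is_embedded(inner, outer):
    """Assumed reading of embedding: every inner monad is also in outer."""
    return inner.issubset(outer)


def gaps(monads):
    """Return the monads missing between the first and last monad of a set."""
    return set(range(min(monads), max(monads) + 1)) - set(monads)


# "The door, which opened towards the East, was blue", one monad per word:
# The=1 door=2 which=3 opened=4 towards=5 the=6 East=7 was=8 blue=9
outer_clause = {1, 2, 8, 9}          # discontiguous: "The door ... was blue"
relative_clause = {3, 4, 5, 6, 7}    # "which opened towards the East"
subject_np = {1, 2}                  # "The door"
```

Under these assumptions, the relative clause fills exactly the gap of the outer clause, and the subject NP is embedded in the outer clause and directly followed by the relative clause.</Paragraph>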
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Application
</SectionTitle>
    <Paragraph position="0"> One prominent example of an Emdros database in use is the Werkgroep Informatica (WI) database of the Hebrew Bible developed under Prof. Dr. Eep Talstra at the Free University of Amsterdam. The WI database is a large text database comprising a syntactic analysis of the Hebrew Bible (also called the Old Testament in Hebrew and Aramaic). This is a 420,000 word corpus with about 1.4 million syntactic objects. The database has been analyzed up to clause level all the way through, and has been analyzed up to sentence level for large portions of the material. A complete description of the database and the underlying linguistic model can be found in (Talstra and Sikkel, 2000).</Paragraph>
    <Paragraph position="1"> In the book of Judges chapter 5 verse 1, we are told that Deborah and Barak sang a song. Deborah and Barak are clearly a plural entity, yet in Hebrew the verb is feminine singular. Was this an instance of bad grammar? Did only Deborah sing? Why is the verb not plural? In Hebrew, the rule seems to be that the verb agrees in number and gender with the first item in a compound subject, when the verb precedes the subject. This has been known at least since the 19th century, as evidenced by the Gesenius-Kautzsch grammar of Hebrew, paragraph 146g.</Paragraph>
    <Paragraph position="2"> With Emdros and the WI database, we can validate the rule above. The query in Figure 6 finds 234 instances, showing that the pattern was not uncommon, and inspection of the results shows that the verb most often agrees with the first member of the compound subject. The 234 hits are the bare results returned from the query engine. It is up to the researcher to actually look at the data and verify or falsify their hypothesis. Also, one would have to look for counterexamples with another query.</Paragraph>
    <Paragraph position="3"> The query finds clauses within which there are two phrases, the first being a predicate and the second being a subject. The phrases need not be adjacent. The predicate must contain a verb in the singular. The subject must first contain a noun, proper noun, or pronoun which agrees with the verb in number and gender. Then a conjunction must follow the noun, still inside the subject, but not necessarily adjacent to the noun.</Paragraph>
    <Paragraph position="4"> The WI database is the primary example of an Emdros database. Other databases stored in Emdros include the morphologically encoded Hebrew Bible produced at the Westminster Hebrew Institute in Philadelphia, Pennsylvania, and a corpus of 67 million words in use at the University of Illinois at Urbana-Champaign.</Paragraph>
  </Section>
</Paper>