<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-2003">
  <Title>LinguaStream: An Integrated Environment for Computational Linguistics Experimentation</Title>
  <Section position="2" start_page="0" end_page="95" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Several important trends have emerged recently in the NLP community. First of all, work on corpora is becoming the norm, which constitutes a fruitful area of convergence between task-driven, computational approaches and descriptive linguistic ones. Validation on corpora is becoming more and more important for theoretical models, and the accuracy of these models can be evaluated either with regard to their ability to account for the reality of a given corpus (pursuing descriptive aims) or with regard to their ability to analyse it accurately (pursuing operational aims).</Paragraph>
    <Paragraph position="1"> From this point of view, important questions have to be considered regarding which methods should be used to project linguistic models onto corpora efficiently and accurately.</Paragraph>
    <Paragraph position="2"> It is indeed less and less appropriate to consider corpora as raw material to which models and processes can be applied directly. On the contrary, the multiplicity of approaches, whether lexical, syntactic, semantic, rhetorical or pragmatic, and whether they focus on one of these dimensions or cross them, raises questions about how these different levels can be articulated within operational models, and how the related processing systems can be assembled, applied to a corpus, and evaluated within an experimental process.</Paragraph>
    <Paragraph position="3"> New NLP concerns confirm these needs: recent work on automatic discourse structure analysis, for example regarding thematic structures or rhetorical ones (Bilhaut, 2005; Widlöcher, 2004), shows that the results obtained from lower-grained analysers (such as part-of-speech taggers or local semantics analysers) can be successfully exploited to perform higher-grained analyses. Indeed, such work relies on non-trivial processing streams, where several modules collaborate on the basis of two principles: incremental enrichment of documents and progressive abstraction from surface forms. The LinguaStream platform (Widlöcher and Bilhaut, 2005; Ferrari et al., 2005), which is presented here, promotes and facilitates such practices. It allows complex processing streams to be designed and evaluated, assembling analysis components of various types and levels: part-of-speech, syntax, semantics, discourse or statistical. Each stage of the processing stream discovers and produces new information, on which the subsequent steps can rely. At the end of the stream, various tools allow analysed documents and their annotations to be conveniently visualised. The uses of the platform range from corpus exploration to the development of fully operational automatic analysers.</Paragraph>
    <Paragraph position="4"> Other platforms and tools pursue similar goals.</Paragraph>
    <Paragraph position="5"> We share some principles with GATE (Cunningham et al., 2002), HoG (Callmeier et al., 2004) and NOOJ (Muller et al., 2004), but one important difference is that the LinguaStream platform promotes the combination of purely declarative formalisms (whereas GATE is mostly based on the JAPE language and NOOJ focuses on a single formalism), and allows processing streams to be designed graphically as complex graphs (whereas GATE relies on the pipeline paradigm). Also, the low-level architecture of LinguaStream is comparable to the HoG middleware, but we are more interested in higher-level aspects such as analysis models and methodological concerns. Finally, whereas other platforms usually enforce the use of a dedicated document format, LinguaStream is able to process any XML document. On the other hand, LinguaStream is more targeted at experimentation tasks on small amounts of data, whereas tools such as GATE or NOOJ can process larger ones.</Paragraph>
  </Section>
</Paper>