<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2202"> <Title>Simple Information Extraction (SIE): A Portable and Effective IE System</Title> <Section position="3" start_page="9" end_page="9" type="intro"> <SectionTitle> 2 A Simple IE system </SectionTitle> <Paragraph position="0"> SIE has a modular system architecture. It is composed of a general-purpose machine learning algorithm combined with several customizable components. The system components are combined in a pipeline, where each module constrains the data structures provided by the previous ones.</Paragraph> <Paragraph position="1"> This modular specification brings significant advantages. First, a modular architecture is simpler to implement. Second, it makes it easy to integrate different machine learning algorithms.</Paragraph> <Paragraph position="2"> Finally, it allows, if necessary, fine-tuning to a specific task by simply specializing a few modules. Furthermore, it is worth noting that we tested SIE across different domains using the same basic configuration, without exploiting any domain-specific knowledge such as gazetteers or ad hoc pre/post-processing.</Paragraph> <Paragraph position="3"> The architecture of the system is shown in Figure 1. The information extraction task is performed in two phases. SIE learns a set of data models off-line from a specified labeled corpus; the models are then applied to tag new documents.</Paragraph> <Paragraph position="4"> In both phases, the Instance Filtering module (Section 3) removes certain tokens from the data set in order to speed up the whole process, while the Feature Extraction module (Section 4) extracts a predefined set of features from the tokens. In the training phase, the Learning module (Section 5) learns two distinct models for each entity, one for the beginning boundary and another for the end boundary (Ciravegna, 2000; Freitag and Kushmerick, 2000). 
In the recognition phase, as a consequence, the Classification module (Section 5) identifies the entity boundaries as distinct token classifications. A Tag Matcher module (Section 6) is used to match the boundary predictions made by the Classification module. Tasks with multiple entities are treated as multiple independent single-entity extraction tasks (i.e., SIE extracts only one entity at a time).</Paragraph> </Section> </Paper>