<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-2043">
  <Title>Illuminating Trouble Tickets with Sublanguage Theory</Title>
  <Section position="3" start_page="169" end_page="170" type="intro">
    <SectionTitle>
2 Related Research
</SectionTitle>
    <Paragraph position="0"> Sublanguage theory posits that texts produced within a certain discourse community exhibit shared, often unconventional, vocabulary and grammar (Grishman and Kittredge, 1986; Harris, 1991). Sublanguage theory has been successfully applied in biomedicine (Friedman et al., 2002; Liddy et al., 1993), software development (Etzkorn et al., 1999), weather forecasting (Somers, 2003), and other domains. Trouble tickets exhibit a special discourse structure, combining system-generated, structured data and free-text sections; a special lexicon, full of acronyms, abbreviations and symbols; and consistent &quot;bending&quot; of grammar rules in favor of speed writing (Johnson, 1992; Marlow, 2004). Our work has also been informed by research on machine classification techniques (Joachims, 2002; Yilmazel et al., 2005).</Paragraph>
    <Paragraph position="1"> 3 Development of the sublanguage model
The client provided us with a dataset of 162,105 trouble tickets dating from 1995 to 2005. An important part of data preprocessing was tokenizing text strings. The tokenizer was adapted to fit the special features of the trouble tickets' vocabulary and grammar: odd punctuation; name variants; and domain-specific terms, phrases, and abbreviations.</Paragraph>
    <Paragraph position="2"> Development of a sublanguage model began with manual annotation and analysis of a sample of 73 tickets, supplemented with n-gram analysis and contextual mining for particular terms and phrases.</Paragraph>
    <Paragraph position="3"> The analysis aimed to identify consistent linguistic patterns: domain-specific vocabulary (abbreviations, special terms); major ticket sections; and semantic components (people, organizations, locations, events, important concepts).</Paragraph>
    <Paragraph position="4"> The analysis resulted in a core domain lexicon, which includes acronyms for Trouble Types (SMH - smoking manhole); departments [...] Originally, the lexicon was intended to support the development of the sublanguage grammar, but, since no such lexicon existed in the company, it can now enhance the corporate knowledge base.</Paragraph>
    <Paragraph position="5"> Review of the data revealed a consistent structure for trouble ticket discourse. A typical ticket (Fig.1) consists of several text blocks, each ending with an operator's ID (12345 or JS). A ticket usually opens with a complaint (lines 001-002) that provides the original account of a problem and often contains the reporting entity (CONST MGMT), a timestamp, a short problem description, and a location. Field work (lines 009-010) normally includes the name of the assigned employee, new information about the problem, steps needed or taken, complications, etc. Lexical choices are limited and section-specific; for instance, reporting a problem typically [...] The resulting typical structure of a trouble ticket (Table 1) includes sections distinct in their content and data format.</Paragraph>
    <Paragraph position="6">  Analysis also identified recurring semantic components: people, locations, problem, timestamp, equipment, urgency, etc. The annotation of tickets by sections (Fig.2) and semantic components was validated with domain experts.</Paragraph>
    <Paragraph position="7">  The analysis became the basis for developing logical rules for automatic identification of ticket sections and selected semantic components.</Paragraph>
    <Paragraph position="8"> Evaluation of system performance on 70 manually annotated and 80 unseen tickets demonstrated high accuracy in automatic section identification, with an error rate of only 1.4%, and no significant difference between results on the annotated vs. unseen tickets. Next, the automatic annotator was run on the entire corpus of 162,105 tickets. The annotated dataset was used in further experiments. Identification of semantic components brings together variations in names and spellings under a single &quot;normalized&quot; term, thus streamlining and expanding coverage of subsequent data analysis.</Paragraph>
    <Paragraph position="9"> For example, the strings UNSAFE LADDER, HAZ (hazard), and PACM (Possible Asbestos Containing Material) are tagged and can thus be retrieved as hazard indicators. &quot;Normalization&quot; is also applied to name variants for streets and departments.</Paragraph>
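The normalization step described above can be illustrated with a minimal Python sketch. It is not the system's actual rule set: only the three hazard strings come from the text, while the tag name HAZARD, the function names, and the case-sensitive whole-word matching are assumptions made for illustration.

```python
import re

# Surface variants taken from the example in the text.
HAZARD_VARIANTS = ("UNSAFE LADDER", "HAZ", "PACM")

# Hypothetical normalized tag; the label actually used by the system is not given.
NORMALIZED_TAG = "HAZARD"

# Longest variants first so multi-word strings are tried before shorter ones.
# Word boundaries keep HAZ from matching inside a longer token such as HAZARD.
_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(
        re.escape(v) for v in sorted(HAZARD_VARIANTS, key=len, reverse=True)
    )
    + r")\b"
)


def tag_hazards(text):
    """Return (variant, normalized_tag) pairs for each hazard string found."""
    return [(m.group(0), NORMALIZED_TAG) for m in _PATTERN.finditer(text)]


def has_hazard(text):
    """Return True if the ticket text contains any known hazard indicator."""
    return _PATTERN.search(text) is not None
```

Under these assumptions, a query for the single tag HAZARD retrieves tickets containing any of the variants, which is the sense in which normalization expands the coverage of later analysis.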
    <Paragraph position="10"> The primary value of the annotation is in effective extraction of structured information from these unstructured free texts. Such information can then be fed into a database and integrated with other data attributes for further analysis. This will significantly expand the range and coverage of the data analysis techniques currently employed by the company.</Paragraph>
    <Paragraph position="11"> The high accuracy in automatic identification of ticket sections and semantic components can, to a significant extent, be explained by the relatively limited number and high consistency of the identified linguistic constructions, which enabled their successful translation into a set of logical rules.</Paragraph>
    <Paragraph position="12"> This also supported our initial view of the ticket texts as exhibiting sublanguage characteristics, such as a distinct shared vocabulary and constructions; extensive use of special symbols and abbreviations; and consistent bending of grammar in favor of shorthand. The sublanguage approach thus enables the system to effectively recognize a number of implicit semantic relationships in texts.</Paragraph>
    <Paragraph position="13"> 4 Leveraging pattern-based approaches with statistical techniques
Next, we assessed the potential of some knowledge discovery approaches to meet company needs and fit the nature of the data.</Paragraph>
    <Section position="1" start_page="170" end_page="170" type="sub_section">
      <SectionTitle>
4.1 Identifying Related Tickets
</SectionTitle>
      <Paragraph position="0"> When several reports relate to the same or recurring trouble, or to multiple problems affecting the same area, a note is made in each ticket, e.g.:</Paragraph>
    </Section>
  </Section>
</Paper>