File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/00/w00-0901_metho.xml

Size: 8,759 bytes

Last Modified: 2025-10-06 14:07:28

<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0901">
  <Title>Comparing Corpora using Frequency Profiling</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Method
</SectionTitle>
    <Paragraph position="0"> The word frequencies are arranged in a contingency table:
                       CORPUS ONE   CORPUS TWO    TOTAL
 Freq of word               a            b         a+b
 Freq of other words       c-a          d-b      c+d-a-b
 TOTAL                      c            d         c+d
Note that the value c corresponds to the number of words in corpus one, and d corresponds to the number of words in corpus two (N values). The values a and b are called the observed values (O). We need to calculate the expected values (E) according to the following formula: E_i = N_i * Σ_i O_i / Σ_i N_i.</Paragraph>
    <Paragraph position="2"> In our case N1 = c and N2 = d. So, for this word, E1 = c*(a+b) / (c+d) and E2 = d*(a+b) / (c+d). The calculation of the expected values takes account of the sizes of the two corpora, so we do not need to normalise the figures before applying the formula. We can then calculate the log-likelihood value according to this formula: -2 ln λ = 2 Σ_i O_i ln(O_i / E_i). This equates to calculating LL as follows: LL = 2*((a * ln(a/E1)) + (b * ln(b/E2))).</Paragraph>
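The expected-value and LL calculations above can be sketched in Python. This is a minimal illustration, not the authors' implementation; the function name `log_likelihood` and the guards for zero observed frequencies are our own choices (ln is undefined at zero, and the corresponding term contributes nothing):

```python
import math

def log_likelihood(a, b, c, d):
    """Log-likelihood score for one word.

    a: frequency of the word in corpus one
    b: frequency of the word in corpus two
    c: total number of words in corpus one
    d: total number of words in corpus two
    """
    # Expected frequencies: the corpus sizes act as the N values,
    # so no prior normalisation of the raw counts is needed.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll
```

A word occurring with identical relative frequency in two equal-sized corpora scores zero; any imbalance pushes the score upwards.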
    <Paragraph position="4"> The word frequency list is then sorted by the resulting LL values. This places the largest LL value at the top of the list, representing the word which has the most significant relative frequency difference between the two corpora. In this way, we can see the words most indicative (or characteristic) of one corpus, as compared to the other corpus, at the top of the list. The words which appear with roughly similar relative frequencies in the two corpora appear lower down the list. Note that we do not use the hypothesis test itself by comparing the LL values to a chi-squared distribution table. As Kilgarriff &amp; Rose (1998) note, even Pearson's X2 is suitable without the hypothesis-testing link: given the non-random nature of words in a text, we are always likely to find frequencies of words which differ across any two texts, and the higher the frequencies, the more information the statistical test has to work with. Hence, it is at this point that the researcher must intervene and qualitatively examine examples of the significant words highlighted by this technique.</Paragraph>
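The sorting step can be sketched as follows. The helper name `rank_words` is hypothetical; the inlined computation follows the LL formula given earlier, with zero-frequency terms contributing nothing:

```python
import math

def rank_words(freq1, freq2):
    """Sort the joint word list by log-likelihood, most characteristic first.

    freq1, freq2: dicts mapping word -> raw frequency in each corpus.
    """
    c, d = sum(freq1.values()), sum(freq2.values())

    def ll(a, b):
        e1 = c * (a + b) / (c + d)
        e2 = d * (a + b) / (c + d)
        return 2 * ((a * math.log(a / e1) if a else 0.0)
                    + (b * math.log(b / e2) if b else 0.0))

    words = set(freq1) | set(freq2)
    return sorted(((w, ll(freq1.get(w, 0), freq2.get(w, 0))) for w in words),
                  key=lambda kv: kv[1], reverse=True)
```

Words over- or under-represented in either corpus rise to the head of the ranking; words with similar relative frequencies sink towards the bottom, exactly as described above.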
    <Paragraph position="5"> We are not proposing a completely automated approach.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="4" type="metho">
    <SectionTitle>
3 Applications
</SectionTitle>
    <Paragraph position="0"> This method has already been applied to study social differentiation in the use of English vocabulary and profiling of learner English. In Rayson et al (1997), selective quantitative analyses of the demographically sampled spoken English component of the BNC were carried out.</Paragraph>
    <Paragraph position="1"> This is a subcorpus of circa 4.5 million words, in which speakers and respondents are identified by such factors as gender, age, social group and geographical region. Using the method, a comparison was performed of the vocabulary of speakers, highlighting those differences which are marked by a very high value of significant difference between different sectors of the corpus according to gender, age and social group.</Paragraph>
    <Paragraph position="2"> In Granger and Rayson (1998), two similarsized corpora of native and non-native writing were compared at the lexical level. The corpora were analysed by a part-of-speech tagger, and this permitted a comparison at the major word-class level. The patterns of significant overuse and underuse for POS categories demonstrated that the learner data displayed many of the stylistic features of spoken rather than written English.</Paragraph>
    <Paragraph position="3"> The same technique has more recently been applied to compare corpora analysed at the semantic level in a systems engineering domain and this is the main focus of this section. The motivation for this work is that despite natural language's well-documented shortcomings as a medium for precise technical description, its use in software-intensive systems engineering remains inescapable. This poses many problems for engineers who must derive problem understanding and synthesise precise solution descriptions from free text. This is true both for the largely unstructured textual descriptions from which system requirements are derived, and for more formal documents, such as standards, which impose requirements on system development processes. We describe an experiment that has been carried out in the REVERE project (Rayson et al, 2000) to investigate the use of probabilistic natural language processing techniques to provide systems engineering support.</Paragraph>
    <Paragraph position="4"> The target documents are field reports of a series of ethnographic studies at an air traffic control (ATC) centre. This formed part of a study of ATC as an example of a system that supports collaborative user tasks (Bentley et al, 1992). The documents consist of both the verbatim transcripts of the ethnographer's observations and interviews with controllers, and of reports compiled by the ethnographer for later analysis by a multi-disciplinary team of social scientists and systems engineers. The field reports form an interesting study because they exhibit many characteristics typical of documents seen by a systems engineer. The volume of information is fairly high (103 pages) and the documents are not structured in a way designed to help the extraction of requirements (say, around business processes or system architecture).</Paragraph>
    <Paragraph position="5"> The text is analysed by a part-of-speech tagger, CLAWS (Garside and Smith, 1997), and a semantic analyser (Rayson and Wilson, 1996) which assigns semantic tags that represent the semantic field (word-sense) of words from a lexicon of single words and an idiom list of multi-word combinations (e.g. 'as a rule'). These resources contain approximately 52,000 words and idioms.</Paragraph>
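The two-stage lookup described here (an idiom list of multi-word combinations consulted before a single-word lexicon) might be sketched as below. The lexicon entries, the idiom entry and its tag, and the fallback tag for unmatched tokens are illustrative stand-ins, not the actual 52,000-entry resources:

```python
# Toy stand-ins for the semantic lexicon and idiom list (hypothetical entries).
LEXICON = {"plane": "M5", "fly": "M5", "strip": "O2", "controller": "S7.1"}
IDIOMS = {("as", "a", "rule"): "A6.2"}  # multi-word entries, matched first

def tag(tokens):
    """Assign a semantic tag to each token, longest idiom match first."""
    tags, i = [], 0
    while i < len(tokens):
        matched = False
        for length in (3, 2):  # try longer idioms before shorter ones
            key = tuple(tokens[i:i + length])
            if key in IDIOMS:
                tags.extend([IDIOMS[key]] * length)
                i += length
                matched = True
                break
        if not matched:
            tags.append(LEXICON.get(tokens[i], "Z99"))  # Z99: no match found
            i += 1
    return tags
```

Matching idioms before single words ensures that 'as a rule' receives one idiomatic sense rather than three unrelated word senses.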
    <Paragraph position="6"> The normative corpus that we used was a 2.3 million-word subset of the BNC derived from the transcripts of spoken English. Using this corpus, the most over-represented semantic categories in the ATC field reports are shown in the accompanying table. The LL value is calculated as described in the previous section and represents the semantic tag's frequency deviation from the normative corpus. The higher the figure, the greater the deviation.</Paragraph>
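Applying the same LL ranking to semantic-tag frequencies rather than word frequencies can be sketched as follows; the tag counts below are invented purely for illustration, not figures from the study:

```python
from collections import Counter
import math

def category_profile(tag_sequences):
    """Aggregate per-token semantic tags into category frequencies."""
    counts = Counter()
    for tags in tag_sequences:
        counts.update(tags)
    return counts

def ll_score(a, b, c, d):
    """Same log-likelihood formula, applied to tag counts."""
    e1, e2 = c * (a + b) / (c + d), d * (a + b) / (c + d)
    s = (a * math.log(a / e1) if a else 0.0) + (b * math.log(b / e2) if b else 0.0)
    return 2 * s

# Hypothetical counts: target documents vs. normative spoken corpus.
field = category_profile([["M5", "O2"], ["O2", "S7.1"]])
norm = Counter({"M5": 1, "O2": 1, "S7.1": 1, "Z99": 7})
c, d = sum(field.values()), sum(norm.values())
ranked = sorted(field, key=lambda t: ll_score(field[t], norm.get(t, 0), c, d),
                reverse=True)
```

The head of `ranked` is the most over-represented category, mirroring how the table of deviant semantic tags is produced.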
    <Paragraph position="8"> With the exception of Y1 (an anomaly caused by an interviewee's initials being mistaken for the pH unit of acidity), all of these semantic categories include important objects, roles, functions, etc. in the ATC domain. The frequency with which some of these occur, such as M5 (flying), is unsurprising. Others are more revealing about the domain of ATC.</Paragraph>
    <Paragraph position="9"> Figure 1 shows some of the occurrences of the semantic category O2 (general objects). The important information extracted here is the importance of 'strips' (formally, flight strips).</Paragraph>
    <Paragraph position="10"> These are small pieces of cardboard with printed flight details that are the most fundamental artefact used by the air traffic controllers to manage their air space. Examination of other words in this category also shows that flight strips are held in 'racks' to organise them according to (for example) aircraft time-of-arrival. Similarly, browsing the context for Q1.2 (paper documents and writing) would allow us to discover that controllers annotate flight strips to record deviations from flight plans, and L1 (life, living things) would reveal that some strips are 'live'; that is, they refer to aircraft currently traversing the controller's sector. Notice also that the semantic categories' deviation from the normative corpus can be expected to reveal domain roles (actors). In this example, the frequency of S7.1 (power, organising) shows the importance of the roles of 'controllers' and 'chiefs'.</Paragraph>
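Browsing the occurrences of a semantic category, as described here, amounts to a keyword-in-context listing over the tagged text. A minimal sketch (the `concordance` helper and its parameters are our own, hypothetical naming):

```python
def concordance(tokens, tags, target_tag, width=4):
    """Return keyword-in-context lines for tokens carrying a given semantic tag.

    tokens: the token sequence of the text.
    tags:   the parallel sequence of semantic tags, one per token.
    width:  number of context tokens shown on each side.
    """
    lines = []
    for i, (tok, tg) in enumerate(zip(tokens, tags)):
        if tg == target_tag:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines
```

Scanning such lines for a category like O2 is how an engineer would spot that 'strips' are annotated, racked, and sometimes 'live'.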
    <Paragraph position="11"> Using the frequency profiling method does not automate the task of identifying abstractions, much less does it produce fully formed requirements that can be pasted into a specification document. Instead, it helps the engineer quickly isolate potentially significant domain abstractions that require closer analysis.</Paragraph>
  </Section>
</Paper>