<?xml version="1.0" standalone="yes"?> <Paper uid="E06-1030"> <Title>Web Text Corpus for Natural Language Processing</Title> <Section position="4" start_page="233" end_page="235" type="metho"> <SectionTitle> 3 Creating the Web Corpus </SectionTitle> <Paragraph position="0"> There are many challenges in creating a web corpus, as the World Wide Web is unstructured and without a definitive directory. No simple method exists to collect a large representative sample of the web. Two main approaches exist for collecting representative web samples - IP address sampling and random walks. The IP address sampling technique randomly generates IP addresses and explores any websites found (Lawrence and Giles, 1999). This method requires substantial resources as many attempts are made for each web-site found. Lawrence and Giles reported that 1 in 269 tries found a web server.</Paragraph> <Paragraph position="1"> Random walk techniques attempt to simulate a regular undirected web graph (Henzinger et al., 2000). In such a graph, a random walk would produce a uniform sample of the nodes (i.e. the web pages). However, only an approximation of such a graph is possible, as the web is directed (i.e. you cannot easily determine all web pages linking to a particular page). Most implementations of random walks approximates the number of backward links by using information from search engines.</Paragraph> <Section position="1" start_page="234" end_page="234" type="sub_section"> <SectionTitle> 3.1 Web Spidering </SectionTitle> <Paragraph position="0"> We created a 10 billion word Web Corpus by spidering the web. While the corpus is not designed to be a representative sample of the web, we attempt to sample a topic-diverse collection of web sites. Our web spider is seeded with links from the Open Directory1.</Paragraph> <Paragraph position="1"> The Open Directory has a broad coverage of many topics on the web and allows us to create a topic-diverse collection of pages. Before the directory can be use, we had to address several coverage skews. Some topics have many more links in the Open Directory than others, simply due to the availability of editors for different topics. For example, we found that the topic University of Connecticut has roughly the same number of links as Ontario Universities. We would normally expect universities in a whole province of Canada to have more coverage than a single university in the United States. The directory was also constructed without keeping more general topics higher in the tree. For example, we found that Chicken Salad is higher in the hierarchy than Catholicism. The Open Directory is flattened by a rule-based algorithm which is designed to take into account the coverage skews of some topics to produce a list of 358 general topics.</Paragraph> <Paragraph position="2"> From the seed URLs, the spider performs a breadth-first search. It randomly selects a topic nodefromthelistandnextunvisitedURLfromthe node. It visits the website associated from the link and samples pages within the same section of the website until a minimum number of words have been collected or all of the pages were visited.</Paragraph> <Paragraph position="3"> gardless of the actual topic of the link. Although websites of one topic tends to link to other websites of the same topic, this process contributes to a topic drift. 
</Section>
<Section position="2" start_page="234" end_page="235" type="sub_section"> <SectionTitle> 3.2 Text Cleaning </SectionTitle>
<Paragraph position="0"> Text cleaning is the term we use to describe the overall process of converting raw HTML found on the web into a form usable by NLP algorithms: white space delimited words, separated into one sentence per line. It consists of many low-level processes which are often accomplished by simple rule-based scripts. Our text cleaning process is divided into four major steps.</Paragraph>
<Paragraph position="1"> First, the different character encodings of HTML pages are transformed into ISO Latin-1 and HTML named entities (e.g. &nbsp; and &amp;) are translated into their single character equivalents.</Paragraph>
<Paragraph position="2"> Second, sentence boundaries are marked. Such boundaries are difficult to identify in web text, as it does not always consist of grammatical sentences. A section of a web page may be mathematical equations or lines of C++ code. Grammatical sentences need to be separated from each other and from other non-sentence text. Sentence boundary detection for web text is a much harder problem than for newspaper text.</Paragraph>
<Paragraph position="3"> We use a machine learning approach to identifying sentence boundaries. We trained a Maximum Entropy classifier following Ratnaparkhi (1998) to disambiguate sentence boundaries in web text, training on 153 manually marked web pages. Systems for newspaper text only use regular text features, such as words and punctuation. Our system for web text uses HTML tag features in addition to regular text features. HTML tag features are essential for marking sentence boundaries in web text, as many boundaries in web text are only indicated by HTML tags and not by the text. Our system using HTML tag features achieves 95.1% accuracy in disambiguating sentence boundaries in web text, compared to 88.9% without such features.</Paragraph>
<Paragraph position="4"> Third, tokenisation is accomplished using the sed script used for the Penn Treebank project (MacIntyre, 1995), modified to correctly tokenise URLs, emails, and other web-specific text.</Paragraph>
<Paragraph position="5"> The final step is filtering, where unwanted text is removed from the corpus. A rule-based component analyses each web page and each sentence within a page to identify sections that are unlikely to be useful text. Our rules are similar to those employed by Halacsy et al. (2004), where the percentage of non-dictionary words in a sentence or document helps identify non-Hungarian text. We classify tokens into dictionary words, word-like tokens, numbers, punctuation, and other tokens. Sentences or documents with too few dictionary words or too many numbers, punctuation, or other tokens are discarded.</Paragraph>
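<Paragraph position="6"> A minimal sketch of such a filter is shown below. The token classes follow the description above, but the regular expressions, the set passed as the dictionary, and the 0.5/0.3 thresholds are illustrative assumptions rather than the actual rules.

    import re

    WORD_RE = re.compile(r"[A-Za-z]+$")
    NUM_RE = re.compile(r"[0-9][0-9.,%]*$")
    PUNCT_RE = re.compile(r"[^A-Za-z0-9\s]+$")

    def classify(token, dictionary):
        # Bucket a token into one of the five classes used by the filter.
        # dictionary: a set of known lowercase words.
        if token.lower() in dictionary:
            return "dict"
        if WORD_RE.match(token):
            return "wordlike"
        if NUM_RE.match(token):
            return "number"
        if PUNCT_RE.match(token):
            return "punct"
        return "other"

    def keep_sentence(tokens, dictionary, min_dict=0.5, max_junk=0.3):
        # Discard sentences with too few dictionary words or too much noise.
        # The 0.5 and 0.3 thresholds are illustrative only.
        if not tokens:
            return False
        labels = [classify(tok, dictionary) for tok in tokens]
        dict_ratio = labels.count("dict") / len(labels)
        junk_ratio = sum(labels.count(c) for c in ("number", "punct", "other")) / len(labels)
        return dict_ratio >= min_dict and max_junk >= junk_ratio
</Paragraph>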
</Section> </Section>
<Section position="5" start_page="235" end_page="236" type="metho"> <SectionTitle> 4 Corpus Statistics </SectionTitle>
<Paragraph position="0"> Comparing the vocabulary of the Web Corpus and existing corpora is revealing. We compared it with the Gigaword Corpus, a 2 billion token collection (1.75 billion words before tokenisation) of newspaper text (Graff, 2003). For example, what types of tokens appear more frequently on the web than in newspaper text? From each corpus, we randomly selected a 1 billion word sample and classified the tokens into seven disjoint categories:</Paragraph>
<Section position="1" start_page="235" end_page="235" type="sub_section"> <SectionTitle> 4.1 Token Type Analysis </SectionTitle>
<Paragraph position="0"> An analysis by token type shows big differences between the two corpora (see Table 1). The same size samples of the Gigaword and the Web Corpus have very different numbers of token types. Title case tokens make up a significant percentage of the token types encountered in both corpora, possibly representing named entities in the text. There are also a significant number of tokens classified as other in the Web Corpus, possibly representing URLs and email addresses. While 2.2 million token types are found in the 1 billion word sample of the Gigaword, about twice as many (4.8 million) are found in an equivalent sample of the Web Corpus.</Paragraph>
</Section>
<Section position="2" start_page="235" end_page="236" type="sub_section"> <SectionTitle> 4.2 Misspelling </SectionTitle>
<Paragraph position="0"> One factor contributing to the larger number of token types in the Web Corpus, as compared with the Gigaword, is the misspelling of words. Web documents are authored by people with a widely varying command of English, and their pages are not as carefully edited as newspaper articles. Thus, we anticipate a significantly larger number of misspellings and typographical errors.</Paragraph>
<Paragraph position="1"> We identify some of the misspellings as letter combinations that are one transformation away from a correctly spelled word. Consider a target word, correctly spelled. Misspellings can be generated by inserting, deleting, or substituting one letter, or by reordering any two adjacent letters (although we keep the first letter of the original word, as very few misspellings change the first letter).</Paragraph>
<Paragraph position="2"> Table 2 shows some of the misspellings of the word receive found in the Gigaword and the Web Corpus. While only 5 such misspellings were found in the Gigaword, 16 were found in the Web Corpus. For all words found in the Unix dictionary, an average of 1.7 misspellings is found per word in the Gigaword by type. The proportion of mistakes found in the Web Corpus is roughly double that of the Gigaword, at 3.7 misspellings per dictionary word. However, misspellings only represent a small portion of tokens (5.6 million out of 699 million instances of dictionary words are misspellings in the Web Corpus).</Paragraph>
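<Paragraph position="3"> A sketch of this candidate generator is shown below. The edit operations and the fixed first letter follow the description above; the lowercase alphabet and the helper name are assumptions for the sketch.

    import string

    def one_edit_misspellings(word, alphabet=string.ascii_lowercase):
        # All strings one insertion, deletion, substitution or adjacent
        # transposition away from `word`, keeping the first letter unchanged.
        head, tail = word[0], word[1:]
        candidates = set()
        # operate only on the tail so the first letter is preserved
        for i in range(len(tail) + 1):
            for c in alphabet:
                candidates.add(head + tail[:i] + c + tail[i:])        # insertion
        for i in range(len(tail)):
            candidates.add(head + tail[:i] + tail[i + 1:])            # deletion
            for c in alphabet:
                candidates.add(head + tail[:i] + c + tail[i + 1:])    # substitution
        for i in range(len(tail) - 1):
            swapped = tail[:i] + tail[i + 1] + tail[i] + tail[i + 2:] # transposition
            candidates.add(head + swapped)
        candidates.discard(word)
        return candidates

    # e.g. "recieve" (an adjacent transposition) is in one_edit_misspellings("receive")
</Paragraph>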
</Section> </Section>
<Section position="6" start_page="236" end_page="236" type="metho"> <SectionTitle> 5 Context-Sensitive Spelling Correction </SectionTitle>
<Paragraph position="0"> A confusion set is a collection of words which are commonly misused, even by native speakers of a language, because of their similarity. For example, the words {it's, its}, {affect, effect}, and {weather, whether} are often mistakenly interchanged. Context-sensitive spelling correction is the task of selecting the correct confusion word in a given context. Two different metrics have been used to evaluate the performance of context-sensitive spelling correction algorithms. The Average Accuracy (AA) is the performance by type, whereas the Weighted Average Accuracy (WAA) is the performance by token.</Paragraph> </Section>
<Section position="7" start_page="236" end_page="237" type="metho"> <SectionTitle> 5.1 Related Work </SectionTitle>
<Paragraph position="0"> Golding and Roth (1999) used the Winnow multiplicative weight-updating algorithm for context-sensitive spelling correction. They found that when a system is tested on text from a different corpus from the training set, the performance drops substantially (see Table 3). Using the same algorithm and 80% of the Brown Corpus, the WAA dropped from 96.4% to 94.5% when tested on 40% WSJ instead of 20% Brown.</Paragraph>
<Paragraph position="1"> For cross corpus experiments, Golding and Roth devised a semi-supervised algorithm that is trained on a fixed training set but also extracts information from the same corpus as the testing set. Their experiments showed that even if up to 20% of the testing set is corrupted (using wrong confusion words), a system trained on both the training and testing sets outperformed a system trained only on the training set. The Winnow Semi-Supervised method increases the WAA back up to 96.6%.</Paragraph>
<Paragraph position="2"> Lapata and Keller (2005) utilised web counts from Altavista for confusion set disambiguation. Their unsupervised method uses collocation features (one word to the left and right) where co-occurrence estimates are obtained from web counts of bigrams. This method achieves a stated accuracy of 89.3% AA, similar to the cross corpus experiment for Unpruned Winnow.</Paragraph>
<Section position="1" start_page="236" end_page="236" type="sub_section"> <SectionTitle> 5.2 Implementation </SectionTitle>
<Paragraph position="0"> Context-sensitive spelling correction is an ideal task for unannotated web data, as unmarked text is essentially labelled data for this particular task: words in reasonably well-written text are positive examples of the correct usage of confusion words.</Paragraph>
<Paragraph position="1"> To demonstrate the utility of a large collection of web data on a disambiguation problem, we implemented the simple memory-based learner from Banko and Brill (2001). The learner trains on simple collocation features, keeping a count of (wi-1, wi+1), wi-1, and wi+1 for each confusion word wi. The classifier first chooses the confusion word which appears most frequently with the context bigram, followed by the left unigram, right unigram, and then the most frequent confusion word.</Paragraph>
<Paragraph position="2"> Three data sets were used in the experiments: the 2 billion word Gigaword Corpus, a 2 billion word sample of our 10 billion word Web Corpus, and the full 10 billion word Web Corpus.</Paragraph>
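<Paragraph position="3"> A compact sketch of a learner of this kind is given below. The backoff order follows the description above; the BOS/EOS padding, the tie-breaking behaviour of max, and the class interface are assumptions made for the sketch rather than details of the actual implementation.

    from collections import defaultdict

    class ConfusionSetClassifier:
        # Memory-based learner over collocation features, in the style of
        # Banko and Brill (2001).

        def __init__(self, confusion_set):
            self.words = set(confusion_set)
            self.bigram = defaultdict(lambda: defaultdict(int))  # (w_prev, w_next) -> word -> count
            self.left = defaultdict(lambda: defaultdict(int))    # w_prev -> word -> count
            self.right = defaultdict(lambda: defaultdict(int))   # w_next -> word -> count
            self.prior = defaultdict(int)                        # word -> count

        def train(self, tokens):
            for i, w in enumerate(tokens):
                if w not in self.words:
                    continue
                prev = tokens[i - 1] if i > 0 else "BOS"
                nxt = tokens[i + 1] if len(tokens) > i + 1 else "EOS"
                self.bigram[(prev, nxt)][w] += 1
                self.left[prev][w] += 1
                self.right[nxt][w] += 1
                self.prior[w] += 1

        def predict(self, prev, nxt):
            # Back off: bigram context, then left unigram, then right unigram,
            # then the most frequent confusion word overall.
            for table in (self.bigram[(prev, nxt)], self.left[prev],
                          self.right[nxt], self.prior):
                if table:
                    return max(table, key=table.get)
            return None

    # usage: clf = ConfusionSetClassifier({"weather", "whether"}); clf.train(corpus_tokens)
    # then clf.predict("the", "is") returns the predicted confusion word for that context.
</Paragraph>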
</Section>
<Section position="2" start_page="236" end_page="237" type="sub_section"> <SectionTitle> 5.3 Results </SectionTitle>
<Paragraph position="0"> Our experiments compare the results of training the same algorithm on each of the three corpora. The memory-based learner was tested using the 18 confusion word sets from Golding (1995) on the WSJ section of the Penn Treebank and on the Brown Corpus.</Paragraph>
<Paragraph position="1"> For the WSJ testing set, the 2 billion word Web Corpus does not achieve the performance of the Gigaword (see Table 4). However, the 10 billion word Web Corpus results approach those of the Gigaword. Training on the Gigaword and testing on WSJ is not considered a true cross-corpus experiment, as the two corpora belong to the same genre of newspaper text. Compared to the Winnow method, the 10 billion word Web Corpus outperforms the cross corpus experiment but not the semi-supervised method.</Paragraph>
<Paragraph position="2"> For the Brown Corpus testing set, the 2 billion word Web Corpus and the 2 billion word Gigaword achieved similar results. The 10 billion word Web Corpus achieved 95.4% WAA, higher than the 94.6% of the 2 billion word Gigaword. This and the above result on the WSJ suggest that the Web Corpus approach is comparable with training on a corpus of printed text such as the Gigaword. The 91.8% AA of the 10 billion word Web Corpus tested on the WSJ is better than the 89.3% AA achieved by Lapata and Keller (2005) using the Altavista search engine. This suggests that a web-collected corpus may be a more accurate method of estimating n-gram frequencies than search engine hit counts.</Paragraph>
</Section> </Section>
<Section position="8" start_page="237" end_page="238" type="metho"> <SectionTitle> 6 Thesaurus Extraction </SectionTitle>
<Paragraph position="0"> Thesaurus extraction is a word similarity task. It is a natural candidate for using web corpora, as most systems extract synonyms of a target word from an unlabelled corpus. Automatic thesaurus extraction is a good alternative to manual construction methods, as such thesauri can be updated more easily and quickly. They do not suffer from the bias, low coverage, and inconsistency that human creators of thesauri introduce.</Paragraph>
<Paragraph position="1"> Thesauri are useful in many NLP and Information Retrieval (IR) applications. Synonyms help expand the coverage of a system by providing alternatives to the input search terms. For n-gram estimation using search engine queries, some NLP applications can boost the hit count by offering alternative combinations of terms. This is especially helpful if the initial hit counts are too low to be reliable. In IR applications, synonyms of search terms help identify more relevant documents.</Paragraph>
<Section position="1" start_page="237" end_page="237" type="sub_section"> <SectionTitle> 6.1 Method </SectionTitle>
<Paragraph position="0"> We use the thesaurus extraction system implemented in Curran (2004). It operates on the distributional hypothesis that similar words appear in similar contexts. This system only extracts one word synonyms of nouns (and not multi-word expressions or synonyms of other parts of speech).</Paragraph>
<Paragraph position="1"> The extraction process is divided into two parts. First, target nouns and their surrounding contexts are encoded in relation pairs. Six different types of relationships are considered, including prepositional phrase relations. The nouns (including subjects and objects) are the target headwords, and the relationships are represented in context vectors. In the second stage of the extraction process, a comparison is made between the context vectors of headwords in the corpus to determine the most similar terms.</Paragraph>
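<Paragraph position="2"> A toy version of these two stages is sketched below. It builds context vectors from (headword, relation, context word) tuples and ranks candidates by cosine similarity over raw counts; the actual system uses more sophisticated weighting and similarity measures, so the relation names and the cosine measure here are illustrative assumptions only.

    import math
    from collections import defaultdict

    def build_context_vectors(relations):
        # relations: iterable of (headword, relation_type, context_word) tuples,
        # e.g. ("idea", "direct-obj", "reject"). Returns headword -> feature counts.
        vectors = defaultdict(lambda: defaultdict(int))
        for head, rel, ctx in relations:
            vectors[head][(rel, ctx)] += 1
        return vectors

    def cosine(v1, v2):
        shared = set(v1).intersection(v2)
        dot = sum(v1[f] * v2[f] for f in shared)
        norm = (math.sqrt(sum(c * c for c in v1.values()))
                * math.sqrt(sum(c * c for c in v2.values())))
        return dot / norm if norm else 0.0

    def synonyms(target, vectors, n=10):
        # Rank all other headwords by similarity of their context vectors to the target's.
        scores = [(cosine(vectors[target], vec), head)
                  for head, vec in vectors.items() if head != target]
        return [head for score, head in sorted(scores, reverse=True)[:n]]
</Paragraph>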
</Section>
<Section position="2" start_page="237" end_page="238" type="sub_section"> <SectionTitle> 6.2 Evaluation </SectionTitle>
<Paragraph position="0"> The evaluation of a list of synonyms of a target word is subject to human judgement. We use the evaluation method of Curran (2004), comparing against gold standard thesaurus results. The gold standard list is created by combining the terms found in four thesauri: Macquarie, Moby, Oxford and Roget's.</Paragraph>
<Paragraph position="1"> The inverse rank (InvR) metric allows a comparison to be made between the extracted ranked list of synonyms and the unranked gold standard list. For example, if the extracted terms at ranks 3, 5, and 28 are found in the gold standard list, then InvR = 1/3 + 1/5 + 1/28, which is approximately 0.57.</Paragraph>
<Paragraph position="2"> Table 7: Synonyms for home
Gigaword (24 matches out of 200): house apartment building run office resident residence headquarters victory native place mansion room trip mile family night hometown town win neighborhood life suburb school restaurant hotel store city street season area road homer day car shop hospital friend game farm facility center north child land weekend community loss return hour ...
Web Corpus (18 matches out of 200): page loan contact house us owner search finance mortgage office map links building faq equity news center estate privacy community info business car site web improvement extention heating rate directory room apartment family service rental credit shop life city school property place location job online vacation store facility library free ...</Paragraph>
<Paragraph position="3"> Gigaword (9 matches out of 200): store retailer supermarket restaurant outlet operator shop shelf owner grocery company hotel manufacturer retail franchise clerk maker discount business sale superstore brand clothing food giant shopping firm retailing industry drugstore distributor supplier bar insurer inc. conglomerate network unit apparel boutique mall electronics carrier division brokerage toy producer pharmacy airline inc ...
Web Corpus (53 matches out of 200): necklace supply bracelet pendant rope belt ring earring gold bead silver pin wire cord reaction clasp jewelry charm frame bangle strap sterling loop timing plate metal collar turn hook arm length string retailer repair strand plug diamond wheel industry tube surface neck brooch store molecule ribbon pump choker shaft body ...</Paragraph>
</Section> </Section> </Paper>