<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0608"> <Title>The Hinoki Sensebank -- A Large-Scale Word Sense Tagged Corpus of Japanese --</Title>
<Section position="4" start_page="0" end_page="63" type="metho"> <SectionTitle> 2 Corpus Design </SectionTitle>
<Paragraph position="0"> In this section we describe the overall design of the corpus and its constituent corpora. The basic aim is to combine structural semantic and lexical semantic markup in a single corpus. In order to make the first phase self-contained, we started with dictionary definition and example sentences.</Paragraph>
<Paragraph position="1"> We are currently adding other genres, starting with newspaper text, to make the language description more general.</Paragraph>
<Section position="1" start_page="0" end_page="62" type="sub_section"> <SectionTitle> 2.1 Lexeed: A Japanese Basic Lexicon </SectionTitle>
<Paragraph position="0"> We use word sense definitions from Lexeed: A Japanese Semantic Lexicon (Kasahara et al., 2004). It was built in a series of psycholinguistic experiments in which words from two existing machine-readable dictionaries were presented to subjects, who were asked to rate them on a familiarity scale from one to seven, with seven being the most familiar (Amano and Kondo, 1999). Lexeed consists of all words with a familiarity greater than or equal to five: 28,000 words in all. Many words have multiple senses, giving 46,347 senses in total. The definition sentences for these senses were rewritten to use only the 28,000 familiar words. In the final configuration, 16,900 different words (60% of all possible words) were actually used in the definition sentences. An example entry for the word ドライバー doraibā &quot;driver&quot; is given in Figure 1, with English glosses added.</Paragraph>
<Paragraph position="1"> This figure includes the sense annotation and the information derived from it that is described in this paper.</Paragraph>
<Paragraph position="2"> Table 1 shows the relation between polysemy and familiarity. The #WS column indicates the average number of word senses that polysemous words have. Lower familiarity words tend to be less ambiguous: 70% of words with a familiarity of less than 5.5 are monosemous. Most polysemous words have only two or three senses, as seen in Table 2.</Paragraph> </Section>
<Section position="2" start_page="62" end_page="62" type="sub_section"> <SectionTitle> 2.2 Ontology </SectionTitle>
<Paragraph position="0"> We also have an ontology built from the parse results of the definitions in Lexeed (Nichols and Bond, 2005). The ontology includes more than 50,000 relationships between word senses, e.g.</Paragraph>
<Paragraph position="1"> synonym, hypernym, abbreviation, etc.</Paragraph> </Section>
<Section position="3" start_page="62" end_page="63" type="sub_section"> <SectionTitle> 2.3 Goi-Taikei </SectionTitle>
<Paragraph position="0"> As part of the ontology verification, all nominal and most verbal word senses in Lexeed were linked to semantic classes in the Japanese thesaurus, Nihongo Goi-Taikei (Ikehara et al., 1997). Common nouns are classified into about 2,700 semantic classes, which are organized into a semantic hierarchy.</Paragraph> </Section>
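To make the data model concrete, the following is a minimal Python sketch of how a Lexeed entry, its senses, and their links to Goi-Taikei semantic classes could be represented, together with the familiarity cut-off used to select the basic vocabulary. The field names, sense IDs, familiarity score, and class numbers are illustrative assumptions, not the actual Lexeed or Goi-Taikei formats.

```python
# Hypothetical representation of a Lexeed entry (not the real data format).
from dataclasses import dataclass, field
from typing import List

@dataclass
class Sense:
    sense_id: str                                   # e.g. "doraibaa1" (made-up ID scheme)
    definition: str                                 # definition using only basic words
    goi_taikei_classes: List[int] = field(default_factory=list)  # linked semantic classes

@dataclass
class Entry:
    headword: str
    familiarity: float                              # mean rating on the 1-7 scale
    senses: List[Sense] = field(default_factory=list)

def basic_words(entries: List[Entry], threshold: float = 5.0) -> List[Entry]:
    """Lexeed keeps only words whose familiarity is >= 5 (about 28,000 words)."""
    return [e for e in entries if e.familiarity >= threshold]

# Illustrative entry for ドライバー "driver" (familiarity and class IDs invented).
driver = Entry("ドライバー", 6.1, [
    Sense("doraibaa1", "a tool for inserting and removing screws", [893]),
    Sense("doraibaa2", "someone who drives a vehicle", [292]),
])
print([e.headword for e in basic_words([driver])])   # ['ドライバー']
```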
<Section position="4" start_page="63" end_page="63" type="sub_section"> <SectionTitle> 2.4 Hinoki Treebank </SectionTitle>
<Paragraph position="0"> The Lexeed definition and example sentences are syntactically and semantically parsed with HPSG, and the correct results are manually selected (Tanaka et al., 2005). The grammatical coverage over all sentences is 86%. Around 12% of the parsed sentences were rejected by the treebankers due to an incomplete semantic representation. This process was done independently of the word sense annotation.</Paragraph> </Section>
<Section position="5" start_page="63" end_page="63" type="sub_section"> <SectionTitle> 2.5 Target Corpora </SectionTitle>
<Paragraph position="0"> We chose two types of corpus to mark up: a dictionary and two newspaper samples. Table 3 shows basic statistics of the target corpora.</Paragraph>
<Paragraph position="1"> The dictionary Lexeed, which defines the word senses, is also used as a target for sense tagging. Its definition (LXD-DEF) and example (LXD-EX) sentences consist of basic words and function words only, i.e. it is self-contained. Therefore, all content words have headwords in Lexeed, and all word senses appear in at least one example sentence.</Paragraph>
<Paragraph position="2"> Both newspaper corpora were taken from the Mainichi Daily News. One sample (Senseval2) was the text used for the Japanese dictionary task in Senseval-2 (Shirai, 2002), which has some words marked up with word sense tags defined in the Iwanami lexicon (Nishio et al., 1994).</Paragraph>
<Paragraph position="3"> The second sample consists of the sentences used in the Kyoto Corpus (Kyoto), which is marked up with dependency analyses (Kurohashi and Nagao, 2003). We chose these corpora so that we can compare our annotation with the existing annotation. Both corpora were thus already segmented and annotated with parts-of-speech. However, they used morphological analyzers different from the one used in Lexeed, so we had to do some remapping. For example, in Kyoto the copula is not split from nominal-adjectives, whereas in Lexeed it is: 元気だ genkida &quot;lively&quot; vs 元気 だ genki da. This could be done automatically after we had written a few rules (sketched below).</Paragraph>
<Paragraph position="4"> Although the newspapers contain many words other than basic words, only basic words have sense tags. Also, a word unit in the newspapers does not necessarily coincide with a headword in Lexeed, since the part-of-speech taggers used for annotation are different. We do not adjust the word segmentation, and leave such a word untagged at this stage, even if it is part of a basic word or consists of multiple basic words. For instance, Lexeed has the compound entry 貨幣価値 kahei-kachi &quot;monetary value&quot;; however, this word is split into two basic words in the corpora. In this case, the two words 貨幣 kahei &quot;money&quot; and 価値 kachi &quot;value&quot; are tagged individually.</Paragraph>
<Paragraph position="5"> The corpora are not fully balanced, but allow some interesting comparisons. There are effectively three genres: dictionary definitions, which tend to be fragments and are often syntactically highly ambiguous; dictionary example sentences, which tend to be short complete sentences and are easy to parse; and newspaper text from two different years.</Paragraph> </Section> </Section>
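The copula remapping mentioned in § 2.5 can be pictured with a small rule like the one below. This is only a sketch under the assumption that a list of Lexeed nominal-adjective headwords is available; it is not the actual set of conversion rules used for the Kyoto Corpus.

```python
# Sketch of a segmentation remapping rule (assumed, not the project's real rules):
# split nominal-adjective + copula tokens (Kyoto style, e.g. 元気だ) into the
# Lexeed-style units 元気 + だ so that tokens match Lexeed headwords.
NOMINAL_ADJECTIVES = {"元気"}          # in practice, all such headwords in Lexeed

def split_copula(token: str) -> list:
    """Return the token split into Lexeed-style units, or unchanged."""
    if token.endswith("だ") and token[:-1] in NOMINAL_ADJECTIVES:
        return [token[:-1], "だ"]
    return [token]

print(split_copula("元気だ"))   # ['元気', 'だ']
print(split_copula("走った"))   # ['走った'] -- unaffected
```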
<Section position="5" start_page="63" end_page="64" type="metho"> <SectionTitle> 3 Annotation </SectionTitle>
<Paragraph position="0"> Each word was annotated by five annotators.</Paragraph>
<Paragraph position="1"> We actually used 15 annotators, divided into 3 groups. None were professional linguists or lexicographers. All of them had a score above 60 on a Chinese-character-based vocabulary test (Amano and Kondo, 1998). We used multiple annotators to measure the confidence of tags and the degree of difficulty in identifying senses.</Paragraph>
<Paragraph position="2"> The target words for sense annotation are the 9,835 headwords having multiple senses in Lexeed (§ 2.1). They have 28,300 senses in all. Monosemous words were not annotated.</Paragraph>
<Paragraph position="3"> Annotation was done word by word. Annotators are presented with multiple sentences (up to 50) that contain the same target word, and they keep tagging that word until all of its occurrences are done. This enables them to compare the various contexts in which a target word appears and helps them to keep the annotation consistent.</Paragraph>
<Section position="1" start_page="63" end_page="64" type="sub_section"> <SectionTitle> 3.1 Tool </SectionTitle>
<Paragraph position="0"> A screen shot of the annotation tool is given in Figure 2; all information is stored in SQL tables. The left-hand frame lists the words being annotated. Each word is shown with some context: the surrounding paragraph, and the headword for definition and example sentences. These can be clicked on to get more context. The word being annotated is highlighted in red. For each word, the annotator chooses its senses or one or more of the other tags as clickable buttons. It is also possible to choose one tag as the default for all entries on the screen. The right-hand frame has the dictionary definitions for the word being tagged in the top frame, and a lower frame with instructions. A single word may be annotated with senses from more than one headword. For example, バス is divided into two headwords, basu &quot;bus&quot; and basu &quot;bass&quot;, both of which are presented.</Paragraph>
<Paragraph position="1"> As we used a tab-capable browser, it was easy for the annotators to call up more information in different tabs. This proved to be quite popular.</Paragraph> </Section>
<Section position="2" start_page="64" end_page="64" type="sub_section"> <SectionTitle> 3.2 Markup </SectionTitle>
<Paragraph position="0"> Annotators choose the most suitable sense in the given context from the senses that the word has in the lexicon. Preferably, they select a single sense for a word, although they can mark up multiple tags if the word has multiple meanings or is truly ambiguous in the context.</Paragraph>
<Paragraph position="1"> When they cannot choose a sense for some reason, they choose one or more of the following special tags.</Paragraph>
<Paragraph position="2"> o other sense: an appropriate sense is not found in the lexicon. Relatively novel concepts (e.g.</Paragraph>
<Paragraph position="3"> ドライバー doraibā &quot;driver&quot; for &quot;software driver&quot;) are given this tag.</Paragraph>
<Paragraph position="4"> c multiword expression (compound / idiom): the target word is part of a non-compositional compound or idiom.</Paragraph>
<Paragraph position="5"> p proper noun: the word is a proper noun.</Paragraph>
<Paragraph position="6"> x homonym: an appropriate entry is not found in the lexicon, because the target is different from the headwords in the lexicon (e.g. only the headword バス basu &quot;bus&quot; is present in the lexicon for バス basu &quot;bass&quot;).</Paragraph>
<Paragraph position="7"> e analysis error: the word segmentation or part-of-speech is incorrect due to errors in the pre-annotation of the corpus.</Paragraph> </Section>
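As an illustration of the annotation workflow described above (word-by-word tagging in batches of up to 50 occurrences, with the special tags as fall-backs), here is a hedged Python sketch. The data shapes are assumptions for illustration, not the tool's actual SQL schema.

```python
# Assumed data shapes, not the annotation tool's real schema.
from collections import defaultdict
from itertools import islice

SPECIAL_TAGS = {"o": "other sense", "c": "compound/idiom",
                "p": "proper noun", "x": "homonym", "e": "analysis error"}

def batches_by_target(occurrences, batch_size=50):
    """Group (target_word, sentence_id) pairs by word and yield batches of <= 50,
    so an annotator tags all occurrences of one word before moving on."""
    by_word = defaultdict(list)
    for word, sent_id in occurrences:
        by_word[word].append(sent_id)
    for word, sents in by_word.items():
        it = iter(sents)
        while batch := list(islice(it, batch_size)):
            yield word, batch

def is_valid_tag(tag, senses_of_word):
    """A tag is either one of the word's sense IDs or a special tag."""
    return tag in senses_of_word or tag in SPECIAL_TAGS

occs = [("バス", i) for i in range(70)]
for word, batch in batches_by_target(occs):
    print(word, len(batch))                        # バス 50, then バス 20
print(is_valid_tag("x", {"basu1", "basu2"}))       # True
```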
<Section position="3" start_page="64" end_page="64" type="sub_section"> <SectionTitle> 3.3 Feedback </SectionTitle>
<Paragraph position="0"> One of the things that the annotators found hard was not knowing how well they were doing. As they were creating a gold standard, there was initially no way of knowing how correct they were.</Paragraph>
<Paragraph position="1"> We also did not know at the start of the annotation how fast senses could or should be annotated (a test of the tool gave us an initial estimate of around 400 tokens/day).</Paragraph>
<Paragraph position="2"> To answer these questions, and to provide feedback for the annotators, twice a day we calculated and graphed the speed (in words/day) and majority agreement (how often an annotator agrees with the majority of annotators for each token, measured over all words annotated so far). Each annotator could see a graph with their own results labelled and the other annotators anonymized. The results are grouped into three groups of five annotators. Each group was annotating a different set of words, but we included them all in the feedback. The order within each group is sorted by agreement, as we wished to emphasise the importance of agreement over speed. An example of a graph is given in Figure 3. When this feedback was given, this particular annotator had the second worst agreement score in their subgroup (90.27%) and was reasonably fast (1,799 words/day) -- they should slow down and think more.</Paragraph>
<Paragraph position="3"> The annotators welcomed this feedback, and complained when our script failed to produce it.</Paragraph>
<Paragraph position="4"> There was an enormous variation in speed: the fastest annotator was four times as fast as the slowest, with no appreciable difference in agreement.</Paragraph>
<Paragraph position="5"> After providing the feedback, the average speed increased considerably, as the slowest annotators agonized less over their decisions. The final average speed was around 1,500 tokens/day, with the fastest annotator still almost twice as fast as the slowest.</Paragraph> </Section> </Section>
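The twice-daily majority-agreement statistic described in § 3.3 can be reconstructed roughly as follows. This is an assumed reimplementation for illustration, not the project's actual feedback script, and the speed tracking and graphing are omitted.

```python
# Assumed reconstruction of the majority-agreement feedback statistic.
from collections import Counter

def majority_agreement(tagged_tokens):
    """tagged_tokens: one dict {annotator: tag} per annotated token.
    Returns, per annotator, the fraction of their tokens on which they
    chose the same tag as the majority of annotators."""
    hits = Counter()
    totals = Counter()
    for tags in tagged_tokens:
        majority_tag, _ = Counter(tags.values()).most_common(1)[0]
        for annotator, tag in tags.items():
            totals[annotator] += 1
            hits[annotator] += int(tag == majority_tag)
    return {a: hits[a] / totals[a] for a in totals}

tokens = [{"a1": "s1", "a2": "s1", "a3": "s1", "a4": "s1", "a5": "s2"},
          {"a1": "s2", "a2": "s2", "a3": "s2", "a4": "s1", "a5": "s2"}]
print(majority_agreement(tokens))   # a4 and a5 each miss once: 0.5; others 1.0
```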
<Section position="6" start_page="64" end_page="67" type="metho"> <SectionTitle> 4 Inter-Annotator Agreement </SectionTitle>
<Paragraph position="0"> We employ inter-annotator agreement as our core measure of annotation consistency, in the same way we did for treebank evaluation (Tanaka et al., 2005). This agreement is calculated as the average of pairwise agreement. Let $w_i$ be a word in a set of content words $W$ and $w_{i,j}$ be the $j$th occurrence of the word $w_i$. The average pairwise agreement between the sense tags given to $w_{i,j}$ by each pair of annotators is:</Paragraph>
<Paragraph position="1"> $$a_{w_{i,j}} = \frac{\sum_{k} \binom{m_{i,j}(s_{ik})}{2}}{\binom{n_{w_{i,j}}}{2}}$$ </Paragraph>
<Paragraph position="2"> where $n_{w_{i,j}}$ ($\geq 2$) is the number of annotators that tag the word $w_{i,j}$, and $m_{i,j}(s_{ik})$ is the number of sense tags $s_{ik}$ given to the word $w_{i,j}$. Hence, the agreement for the word $w_i$ is the average of $a_{w_{i,j}}$ over all its occurrences in the corpus:</Paragraph>
<Paragraph position="3"> $$a_{w_i} = \frac{1}{N_{w_i}} \sum_{j=1}^{N_{w_i}} a_{w_{i,j}}$$ </Paragraph>
<Paragraph position="4"> where $N_{w_i}$ is the frequency of the word $w_i$ in the corpus.</Paragraph>
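The agreement measure above translates directly into code. The following sketch implements the per-occurrence pairwise agreement and its per-word average as reconstructed here; it is illustrative rather than the evaluation script actually used for the sensebank.

```python
# Sketch of average pairwise agreement: for each occurrence, agreement is the
# fraction of annotator pairs that chose the same sense tag; per-word agreement
# averages this over all occurrences of the word.
from collections import Counter
from math import comb

def occurrence_agreement(tags):
    """tags: the sense tags given to one occurrence by its annotators (len >= 2)."""
    n = len(tags)
    agreeing_pairs = sum(comb(m, 2) for m in Counter(tags).values())
    return agreeing_pairs / comb(n, 2)

def word_agreement(occurrences):
    """occurrences: list of tag lists, one per occurrence of the word in the corpus."""
    return sum(occurrence_agreement(t) for t in occurrences) / len(occurrences)

# Three occurrences of one word, each tagged by five annotators.
occs = [["s1", "s1", "s1", "s1", "s2"],
        ["s1", "s1", "s1", "s1", "s1"],
        ["s2", "s2", "s1", "s2", "s2"]]
print(round(word_agreement(occs), 3))   # 0.733
```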
<Paragraph position="5"> Table 4 shows statistics about the annotation results. The average number of word senses in the newspapers is lower than in the dictionary and, therefore, the token agreement for the newspapers is higher than for the dictionary sentences. %Unanimous indicates the ratio of tokens (and types) for which all annotators (normally five) chose the same sense. Snyder and Palmer (2004) report that 62% of all word types in the English all-words task at SENSEVAL-3 were labelled unanimously. It is hard to compare directly with our task, since their corpus has only 2,212 words, tagged by two or three annotators.</Paragraph>
<Section position="1" start_page="66" end_page="66" type="sub_section"> <SectionTitle> 4.1 Familiarity </SectionTitle>
<Paragraph position="0"> As seen in Table 5, the agreement per type does not vary much by familiarity. This was an unexpected result. Even though the average polysemy is high, there are still many highly familiar words with very good agreement.</Paragraph> </Section>
<Section position="2" start_page="66" end_page="66" type="sub_section"> <SectionTitle> 4.2 Part-of-Speech </SectionTitle>
<Paragraph position="0"> Table 6 shows the agreement according to part of speech. Nouns and verbal nouns (vn) have the highest agreement, similar to the results for the English all-words task at SENSEVAL-3 (Snyder and Palmer, 2004). In contrast, adjectives have agreement as low as verbs, although in English the agreement for adjectives was the highest and that for verbs the lowest. This partly reflects differences in the part-of-speech divisions between Japanese and English. Adjectives in Japanese are much closer in behaviour to verbs (e.g. they can head sentences) and include many words that are translated as verbs in English.</Paragraph> </Section>
<Section position="3" start_page="66" end_page="66" type="sub_section"> <SectionTitle> 4.3 Entropy </SectionTitle>
<Paragraph position="0"> Entropy is directly related to the difficulty in identifying senses, as shown in Table 7.</Paragraph> </Section>
<Section position="4" start_page="66" end_page="66" type="sub_section"> <SectionTitle> 4.4 Sense Lumping </SectionTitle>
<Paragraph position="0"> Low agreement words have some senses that are difficult to distinguish from each other: these senses often have the same hypernyms. For example, the agreement rate of 草花 kusabana &quot;grass/flower&quot; in LXD-DEF is only 33.7%. It has three senses whose semantic classes are similar: kusabana1 &quot;flower that blooms in grass&quot;, kusabana2 &quot;grass that has flowers&quot; and souka1 &quot;grass and flowers&quot; (hypernyms flower1, grass1 and flower1 & grass1 respectively).</Paragraph>
<Paragraph position="1"> In order to investigate the effect of semantic similarity on agreement, we lumped similar word senses based on hypernyms and semantic classes.</Paragraph>
<Paragraph position="2"> We use hypernyms from the ontology (§ 2.2) and semantic classes from Goi-Taikei (§ 2.3), regarding word senses that have the same hypernym or belong to the same semantic class as the same sense.</Paragraph>
<Paragraph position="3"> Table 8 shows the distribution after sense lumping. Table 9 shows the agreement with lumped senses. Note that this was done with an automatically derived ontology that has not been fully hand corrected.</Paragraph>
<Paragraph position="4"> As expected, the overall agreement increased, from 0.787 to 0.829 using the ontology, and to 0.835 using the coarse-grained Goi-Taikei semantic classes. For many applications, we expect that this level of disambiguation is all that is required.</Paragraph> </Section>
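A minimal sketch of the sense lumping in § 4.4: each sense tag is mapped to a coarser label (its hypernym, or alternatively a Goi-Taikei class) before agreement is recomputed with the same pairwise measure sketched earlier. The hypernym table below is a toy illustration, not taken from the automatically derived ontology.

```python
# Illustrative sense lumping (assumed data; the real hypernym links come from
# the automatically derived ontology and Goi-Taikei classes).
def lump(tags, coarse_label):
    """Replace each sense tag by its coarse label (hypernym or semantic class)."""
    return [coarse_label.get(t, t) for t in tags]

# Toy table for 草花: two of the three senses share the hypernym flower1.
coarse_label = {"kusabana1": "flower1", "kusabana2": "grass1", "souka1": "flower1"}

occurrence = ["kusabana1", "souka1", "kusabana1", "kusabana2", "souka1"]
print(lump(occurrence, coarse_label))
# ['flower1', 'flower1', 'flower1', 'grass1', 'flower1'] -- fewer distinctions,
# so the pairwise agreement for this occurrence rises from 0.2 to 0.6.
```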
<Section position="5" start_page="66" end_page="67" type="sub_section"> <SectionTitle> 4.5 Special Tags </SectionTitle>
<Paragraph position="0"> Table 10 shows the ratio of special tags and multiple tags to all tags. These results show the differences in corpus characteristics between the dictionary and the newspapers. The higher ratios of Other Sense and Homonym in the newspapers indicate that words whose surface form is in the dictionary are frequently used with different meanings in real text, e.g. 銀 gin &quot;silver&quot; is used as an abbreviation of 銀行 ginkou &quot;bank&quot;. %Multiple Tags is the percentage of tokens for which at least one annotator marked multiple tags.</Paragraph> </Section> </Section>
<Section position="7" start_page="67" end_page="68" type="metho"> <SectionTitle> 5 Discussion </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="67" end_page="67" type="sub_section"> <SectionTitle> 5.1 Comparison with the Senseval-2 corpus </SectionTitle>
<Paragraph position="0"> The Senseval-2 Japanese dictionary task annotation used senses from a different dictionary (Shirai, 2002). In the evaluation, 100 test words were selected from three groups with different entropy bands (Kurohashi and Shirai, 2001). Da is the highest entropy group, which contains the hardest words to tag, and Dc is the lowest entropy group.</Paragraph>
<Paragraph position="1"> We compare our results with theirs in Table 11. The Senseval-2 agreement figures are slightly higher than our overall figures. However, it is impossible to make a direct comparison, as the numbers of annotators (two or three annotators in Senseval-2 vs normally five in our work) and the sense inventories are different.</Paragraph> </Section>
<Section position="2" start_page="67" end_page="68" type="sub_section"> <SectionTitle> 5.2 Problems </SectionTitle>
<Paragraph position="0"> Two main problems came up when building the corpora: word segmentation and sense segmentation. Multiword expressions like compounds and idioms are tied closely to both problems.</Paragraph>
<Paragraph position="1"> Word segmentation is the problem of how to determine a unit that expresses a meaning. At the present stage, it is based on the headwords in Lexeed; in particular, only compounds listed in Lexeed are recognized, and we do not distinguish non-decomposable compounds from decomposable ones. However, if the headword units in the dictionary are inconsistent, the word sense tagging inherits this problem. For example, 一部 ichibu has two main usages: one + classifier, and a part of something. Lexeed has an entry covering both senses. However, the former is split into two words by our morphological analyser, in the same way as other numeral + classifier combinations.</Paragraph>
<Paragraph position="2"> The second problem is how to mark off metaphorical meanings from literal meanings.</Paragraph>
<Paragraph position="3"> Currently, this also depends on the Lexeed definitions, and it is not necessarily consistent either. Some words in institutional idioms (Sag et al., 2002) have the idiom sense in the lexicon, while most words do not. For instance, 尻尾 shippo &quot;tail of an animal&quot; has a sense for the reading &quot;weak point&quot; in the idiom 尻尾を掴む shippo-o tsukamu &quot;lit. to grasp the tail; idiom. to find one's weak point&quot;, while 汗 ase &quot;sweat&quot; does not have a sense for the applicable meaning in the idiom 汗を流す ase-o nagasu &quot;lit. to shed sweat; idiom. to work hard&quot;.</Paragraph> </Section> </Section> </Paper>