<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1089">
  <Title>Guessing Parts-of-Speech of Unknown Words Using Global Information</Title>
  <Section position="4" start_page="705" end_page="707" type="metho">
    <SectionTitle>
2 POS Guessing of Unknown Words with
Global Information
</SectionTitle>
    <Paragraph position="0"> We handle POS guessing of unknown words as a sub-task of POS tagging, in this paper. We assume that POS tags of known words are already determined beforehand, and positions in the document where unknown words appear are also identified.</Paragraph>
    <Paragraph position="1"> Thus, we focus only on prediction of the POS tags of unknown words.</Paragraph>
    <Paragraph position="2"> In the rest of this section, we first present a model for POS guessing of unknown words with global information. Next, we show how the test data is analyzed and how the parameters of the model are estimated. A method for incorporating unlabeled data with the model is also discussed.</Paragraph>
    <Section position="1" start_page="705" end_page="706" type="sub_section">
      <SectionTitle>
2.1 Probabilistic Model Using Global
Information
</SectionTitle>
      <Paragraph position="0"> We attempt to model the probability distribution of the parts-of-speech of all occurrences of the unknown words in a document which have thesame lexical form. We suppose that such parts-of-speech have correlation, and the part-of-speech of each occurrence is also affected by its local context. Similar situations to this are handled inphysics. For example, let us consider a case where a number of electrons with spins exist in a system.</Paragraph>
      <Paragraph position="1"> The spins interact with each other, and each spin is also affected by the external magnetic field. In the physical model, if the state of the system is s and the energy of the system is E(s), the probability distribution of s is known to be represented by the following Boltzmann distribution:</Paragraph>
      <Paragraph position="3"> where b is inverse temperature and Z is a normalizing constant defined as follows:</Paragraph>
      <Paragraph position="5"> Takamura et al. (2005) applied this model to an NLP task, semantic orientation extraction, and we apply it to POS guessing of unknown words here.</Paragraph>
      <Paragraph position="6"> Suppose that unknown words with the same lexical form appear K times in a document. Assume that the number of possible POS tags for unknown words is N, and they are represented by integers from 1 to N. Let tk denote the POS tag of the kth occurrence of the unknown words, let wk denotethe local context (e.g. the lexical forms and the POS tags of the surrounding words) of the kth occurrence of the unknown words, and let w and tdenote the sets of w k and tk respectively: w={w1,***,wK}, t={t1,***,tK}, tk[?]{1,***,N}.</Paragraph>
      <Paragraph position="7"> li,j is a weight which denotes strength of the interaction between parts-of-speech i and j, and is symmetric (li,j = lj,i). We define the energy where POS tags of unknown words given w are</Paragraph>
      <Paragraph position="9"> where p0(t|w) is an initial distribution (local model) of the part-of-speech t which is calculated with only the local context w, using arbitrary statistical models such as maximum entropy models.</Paragraph>
      <Paragraph position="10"> The right hand side of the above equation consists of two components; one represents global interactions between each pair of parts-of-speech, and the other represents the effects of local information.</Paragraph>
      <Paragraph position="11"> In this study, we fix the inverse temperature b = 1. The distribution of t is then obtained from Equation (1), (2) and (3) as follows:</Paragraph>
      <Paragraph position="13"> where T(w) is the set of possible configurations of POS tags given w. The size of T(w) is NK, because there are K occurrences of the unknownwords and each unknown word can have one of N POS tags. The above equations can be rewritten as follows by defining a function fi,j(t):</Paragraph>
      <Paragraph position="15"> where d(i,j) is the Kronecker delta:</Paragraph>
      <Paragraph position="17"> fi,j(t) represents the number of occurrences of the POS tag pair i and j in the whole document (divided by 2), and the model in Equation (8) is essentially a maximum entropy model with the document level features.</Paragraph>
      <Paragraph position="18"> As shown above, we consider the conditional joint probability of all the occurrences of the unknown words with the same lexical form in the document given their local contexts, P(t|w), in contrast to conventional approaches which assume independence of the sentences in the document and use the probabilities of all the words only in a sentence. Note that we assume independence between the unknown words with different lexical forms, and each set of the unknown words with the same lexical form is processed separately from the sets of other unknown words.</Paragraph>
    </Section>
    <Section position="2" start_page="706" end_page="707" type="sub_section">
      <SectionTitle>
2.2 Decoding
</SectionTitle>
      <Paragraph position="0"> Let us consider how to find the optimal POS tags t basing on the model, given K local contexts of the unknown words with the same lexical form (test data) w, an initial distribution p0(t|w) and a set of model parameters L = {l1,1,***,lN,N}. One way to do this is to find a set of POS tags which maximizes P(t|w) among all possible candidates of t. However, the number of all possible candidates of the POS tags is NK and the calculation is generally intractable. Although HMMs, MEMMs, and CRFs use dynamic programming and some studies with probabilistic models which have specific structures use efficient algorithms (Wang et al., 2005), such methods cannot be applied here because we are considering interactions (dependencies) between all POS tags, and their joint distribution cannot be decomposed. Therefore, we use a sampling technique and approximate the solution using samples obtained from the probability distribution.</Paragraph>
      <Paragraph position="1"> We can obtain a solution ^t = {^t1,***,^tK} as follows:</Paragraph>
      <Paragraph position="3"> where Pk(t|w) is the marginal distribution of the part-of-speech of the kth occurrence of the unknown words given a set of local contexts w, and is calculated as an expected value over the distri-bution of the unknown words as follows:</Paragraph>
      <Paragraph position="5"> Expected values can be approximately calculated using enough number of samples generated from the distribution (MacKay, 2003). Suppose that A(x) is a function of a random variable x, P(x)</Paragraph>
      <Paragraph position="7"/>
      <Paragraph position="9"> Thus, if we have M samples {t(1),***,t(M)} generated from the conditional joint distribution P(t|w), the marginal distribution of each POS tag is approximated as follows:</Paragraph>
      <Paragraph position="11"> d(t(m)k ,t). (14) Next, we describe how to generate samples from the distribution. We use Gibbs sampling for this purpose. Gibbs sampling is one of the Markov chain Monte Carlo (MCMC) methods, which can generate samples efficiently from high-dimensional probability distributions (Andrieu et al., 2003). The algorithm is shown in Figure 1. The algorithm firstly set the initial state t(1), then one new random variable is sampled at a time from the conditional distribution in which all othervariables are fixed, and new samples are created by repeating the process. Gibbs sampling is easy to implement and is guaranteed to converge to the true distribution. The conditional distri-bution P(t k|w,t1,***,tk[?]1,tk+1,***,tK) in Fig-ure 1 can be calculated simply as follows:  In later experiments, the number of samples M is set to 100, and the initial state t(1) is set to the POS tags which maximize p0(t|w).</Paragraph>
      <Paragraph position="12"> The optimal solution obtained by Equation (11) maximizes the probability of each POS tag given w, and this kind of approach is known as the maximum posterior marginal (MPM) estimate (Marroquin, 1985). Finkel et al. (2005) used simulated annealing with Gibbs sampling to find a solution in a similar situation. Unlike simulated annealing, this approach does not need to define a cooling  schedule. Furthermore, this approach can obtain not only the best solution but also the second best or the other solutions according to Pk(t|w), which are useful when this method is applied to semi-automatic construction of dictionaries because human annotators can check the ranked lists of candidates. null</Paragraph>
    </Section>
    <Section position="3" start_page="707" end_page="707" type="sub_section">
      <SectionTitle>
2.3 Parameter Estimation
</SectionTitle>
      <Paragraph position="0"> Let us consider how to estimate the param-</Paragraph>
      <Paragraph position="2"> from training data consisting of L examples; {&lt;w1,t1&gt; ,***,&lt;wL,tL&gt; } (i.e., the training data contains L different lexical forms of unknownwords). We define the following objective function LL, and find L which maximizes LL (the subscript L denotes being parameterized by L):</Paragraph>
      <Paragraph position="4"> where C is a constant and s is set to 1 in laterexperiments. The optimal L can be obtained by quasi-Newton methods using the above LL and [?]LL [?]li,j , and we use L-BFGS (Liu and Nocedal, 1989) for this purpose2. However, the calculation is intractable because ZL(wl) (see Equation (9)) in Equation (16) and a term in Equation (17) contain summations over all the possible POS tags. To cope with the problem, we use the sampling technique again for the calculation, as suggested by Rosenfeld et al. (2001). ZL(wl) can be approximated using M samples {t(1),***,t(M)} generated from p0(t|wl):</Paragraph>
      <Paragraph position="6"> 2In later experiments, L-BFGS often did not converge completely because we used approximation with Gibbs sampling, and we stopped iteration of L-BFGS in such cases.</Paragraph>
      <Paragraph position="8"> mated using M samples {t(1),***,t(M)} generated from PL(t|wl) with Gibbs sampling:</Paragraph>
      <Paragraph position="10"> In later experiments, the initial state t(1) in Gibbs sampling is set to the gold standard tags in the training data.</Paragraph>
    </Section>
    <Section position="4" start_page="707" end_page="707" type="sub_section">
      <SectionTitle>
2.4 Use of Unlabeled Data
</SectionTitle>
      <Paragraph position="0"> In our model, unlabeled data can be easily used by simply concatenating the test data and the unlabeled data, and decoding them in the testing phase.</Paragraph>
      <Paragraph position="1"> Intuitively, if we increase the amount of the test data, test examples with informative local features may increase. The POS tags of such examples can be easily predicted, and they are used as global features in prediction of other examples. Thus, this method uses unlabeled data in only the testing phase, and the training phase is the same as the case with no unlabeled data.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="707" end_page="710" type="metho">
    <SectionTitle>
3 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="707" end_page="708" type="sub_section">
      <SectionTitle>
3.1 Data and Procedure
</SectionTitle>
      <Paragraph position="0"> We use eight corpora for our experiments; the Penn Chinese Treebank corpus 2.0 (CTB), a part of the PFR corpus (PFR), the EDR corpus (EDR), the Kyoto University corpus version 2 (KUC), the RWCP corpus (RWC), the GENIA corpus 3.02p (GEN), the SUSANNE corpus (SUS) and the Penn Treebank WSJ corpus (WSJ), (cf. Table 1). All the corpora are POS tagged corpora in Chinese(C), English(E) or Japanese(J), and they are split into three portions; training data, test data and unlabeled data. The unlabeled data is used in experiments of semi-supervised learning, and POS tags of unknown words in the unlabeled data are eliminated. Table 1 summarizes detailed information about the corpora we used: the language, the number of POS tags, the number of open class tags (POS tags that unknown words can have, described later), the sizes of training, test and unlabeled data, and the splitting method of them.</Paragraph>
      <Paragraph position="1"> For the test data and the unlabeled data, unknown words are defined as words that do not appear in the training data. The number of unknown words in the test data of each corpus is shown in Table 1, parentheses. Accuracy of POS guessing of unknown words is calculated based on how many words among them are correctly POS-guessed.</Paragraph>
      <Paragraph position="2"> Figure 2 shows the procedure of the experiments. We split the training data into two parts; the first half as sub-training data 1 and the latter half as sub-training data 2 (Figure 2, *1). Then, we check the words that appear in the sub-training  data 1 but not in the sub-training data 2, or vice versa. We handle these words as (pseudo) unknown words in the training data. Such (two-fold) cross-validation is necessary to make training examples that contain unknown words3. POS tags that these pseudo unknown words have are defined as open class tags, and only the open class tags are considered as candidate POS tags for unknown words in the test data (i.e., N is equal to the number of the open class tags). In the training phase, we need to estimate two types of parameters; local model (parameters), which is necessary to calculate p0(t|w), and global model (parameters), i.e., li,j. The local model parameters are estimated using all the training data (Figure 2, *2). Local 3A major method for generating such pseudo unknown words is to collect the words that appear only once in a corpus (Nagata, 1999). These words are called hapax legomena and known to have similar characteristics to real unknown words (Baayen and Sproat, 1996). These words are interpreted as being collected by the leave-one-out technique (which is a special case of cross-validation) as follows: One word is picked from the corpus and the rest of the corpus is considered as training data. The picked word is regarded as an unknown word if it does not exist in the training data. This procedure is iterated for all the words in the corpus. However, this approach is not applicable to our experiments because those words that appear only once in the corpus do not have global information and are useless for learning the global model, so we use the two-fold cross validation method. model parameters and training data are necessary to estimate the global model parameters, but the global model parameters cannot be estimated from the same training data from which the local model parameters are estimated. In order to estimate the global model parameters, we firstly train sub-local models 1 and 2 from the sub-training data 1 and 2 respectively (Figure 2, *3). The sub-local models 1 and 2 are used for calculating p0(t|w) of unknown words in the sub-training data 2 and 1 respectively, when the global model parameters are estimated from the entire training data. In the testing phase, p0(t|w) of unknown words in the test data are calculated using the local model parameters which are estimated from the entire training data, and test results are obtained using the global model with the local model.</Paragraph>
      <Paragraph position="3"> Global information cannot be used for unknown words whose lexical forms appear only once in the training or test data, so we process only non-unique unknown words (unknown words whose lexical forms appear more than once) using the proposed model. In the testing phase, POS tags of unique unknown words are determined using only the local information, by choosing POS tags which maximize p0(t|w).</Paragraph>
      <Paragraph position="4"> Unlabeled data can be optionally used for semi-supervised learning. In that case, the test data and the unlabeled data are concatenated, and the best POS tags which maximize the probability of the mixed data are searched.</Paragraph>
    </Section>
    <Section position="2" start_page="708" end_page="708" type="sub_section">
      <SectionTitle>
3.2 Initial Distribution
</SectionTitle>
      <Paragraph position="0"> In our method, the initial distribution p0(t|w) is used for calculating the probability of t given lo-cal context w (Equation (8)). We use maximum entropy (ME) models for the initial distribution.</Paragraph>
      <Paragraph position="1"> p0(t|w) is calculated by ME models as follows (Berger et al., 1996):</Paragraph>
      <Paragraph position="3"> where gh(w,t) is a binary feature function. We assume that each local context w contains the following information about the unknown word: * The POS tags of the two words on each side of the unknown word: t[?]2,t[?]1,t+1,t+2.4 * The lexical forms of the unknown word itself and the two words on each side of the unknown word: o[?]2,o[?]1,o0,o+1,o+2.</Paragraph>
      <Paragraph position="4"> * The character types of all the characters composing the unknown word: ps1,***,ps|o0|.</Paragraph>
      <Paragraph position="5"> We use six character types: alphabet, numeral (Arabic and Chinese numerals), symbol, Kanji (Chinese character), Hiragana (Japanese script) and Katakana (Japanese script).</Paragraph>
      <Paragraph position="6"> A feature function gh(w,t) returns 1 if w and t satisfy certain conditions, and otherwise 0; for ex- null The features we use are shown in Table 2, which are based on the features used by Ratnaparkhi (1996) and Uchimoto et al. (2001).</Paragraph>
      <Paragraph position="7"> The parameters ah in Equation (20) are estimated using all the words in the training data whose POS tags are the open class tags.</Paragraph>
    </Section>
    <Section position="3" start_page="708" end_page="710" type="sub_section">
      <SectionTitle>
3.3 Experimental Results
</SectionTitle>
      <Paragraph position="0"> The results are shown in Table 3. In the table, lo-cal, local+global and local+global w/ unlabeled indicate that the results were obtained using only local information, local and global information, and local and global information with the extra unlabeled data, respectively. The results using only local information were obtained by choosing POS  Number of Correctly Tagged Words tags ^t = {^t1,***,^tK} which maximize the probabilities of the local model:</Paragraph>
      <Paragraph position="2"> The table shows the accuracies, the numbers of errors, the p-values of McNemar's test against the results using only local information, and the numbers of non-unique unknown words in the test data. On an Opteron 250 processor with 8GB of RAM, model parameter estimation and decoding without unlabeled data for the eight corpora took 117 minutes and 39 seconds in total, respectively.</Paragraph>
      <Paragraph position="3"> In the CTB, PFR, KUC, RWC and WSJ corpora, the accuracies were improved using global information (statistically significant at p &lt; 0.05), compared to the accuracies obtained using only lo-cal information. The increases of the accuracies on the English corpora (the GEN and SUS corpora) were small. Table 4 shows the increased/decreased number of correctly tagged words using global information in the PFR, RWC and SUS corpora.</Paragraph>
      <Paragraph position="4"> In the PFR (Chinese) and RWC (Japanese) corpora, many proper nouns were correctly tagged using global information. In Chinese and Japanese, proper nouns are not capitalized, therefore proper nouns are difficult to distinguish from common nouns with only local information. One reason that only the small increases were obtained with global information in the English corpora seems to be the low ambiguities of proper nouns. Many verbal nouns in PFR and a few sahen-nouns (Japanese verbal nouns) in RWC, which suffer from the problem of possibility-based POS tags, were also correctly tagged using global information. When the unlabeled data was used, the number of non-unique words in the test data increased. Compared with the case without the unlabeled data, the accu- null son to Simulated Annealing racies increased in several corpora but decreased in the CTB, KUC and WSJ corpora.</Paragraph>
      <Paragraph position="5"> Since our method uses Gibbs sampling in the training and the testing phases, the results are affected by the sequences of random numbers used in the sampling. In order to investigate the influence, we conduct 10 trials with different sequences of pseudo random numbers. We also conduct experiments using simulated annealing in decoding, as conducted by Finkel et al. (2005) for information extraction. We increase inverse temperature b in Equation (1) from b = 1 to b [?] [?] with the linear cooling schedule. The results are shown in Table 5. The table shows the mean values and the standard deviations of the accuracies for the 10 trials, and Marginal and S.A. mean that decoding is conducted using Equation (11) and simulated annealing respectively. The variances caused by random numbers and the differences of the accuracies between Marginal and S.A. are relatively small.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>