A System to Solve Language Tests for Second Grade Students 
Manami Saito 
Nagaoka University of Technology 
saito@nlp.nagaokaut.ac.jp 
Kazuhide Yamamoto 
Nagaoka University of Technology 
yamamoto@fw.ipsj.or.jp 
Satoshi Sekine 
New York University 
Language Craft 
sekine@cs.nyu.edu 
Hitoshi Isahara 
National Institute of Information and Com-
munications Technology 
isahara@nict.go.jp 
 
 
Abstract 
This paper describes a system which 
solves language tests for second grade 
students (7 years old). In Japan, there 
are materials for students to measure 
understanding of what they studied, 
just like SAT for high school students 
in US. We use textbooks for the stu-
dents as the target material of this study. 
Questions in the materials are classified 
into four types: questions about Chi-
nese character (Kanji), about word 
knowledge, reading comprehension, 
and composition. This program doesn’t 
resolve the composition and some other 
questions which are not easy to be im-
plemented in text forms. We built a 
subsystem for each finer type of ques-
tions. As a result, we achieved 55% - 
83% accuracy in answering questions 
in unseen materials. 
1 Introduction 
This paper describes a system which solves lan-
guage tests for second grade students (7 years 
old). We have the following two objections. 
First, we aim to realize the NLP technologies 
into the form which can be easily observed by 
ordinary people. It is difficult to evaluate NLP 
technology clearly by ordinary people. Thus, we 
set the target to answer second grade Japanese 
language test, as an example of intelligible ap-
plication to ordinary people. The ability of this 
program will be shown by scores which are fa-
miliar to ordinary people. 
Second aim is to observe the problems of the 
NLP technologies by degrading the level of tar-
get materials. Those of the current NLP research 
are usually difficult, such as newspapers or tech-
nological texts. They require high accuracy lan-
guage processing, complex world knowledge or 
semantic processing. The NLP problems would 
become more apparent when we degrade the 
target materials. Although questions for second 
grade students also require world knowledge, it 
is expected that the questions become simpler 
and are resolved without tangled techniques.  
2 Related Works 
Hirschman et al. (1999) and Charniak et al. 
(2000) proposed systems to solve “Reading 
Comprehension.” Hirschman et al. (1999) de-
veloped “Deep Read,” which is a system to se-
lect sentences in the text which include answers 
to a question. In their experiments, the types of 
questions are limited to “When,” “Who” and so 
on. The system is basically an information re-
trieval system which selects a sentence, instead 
of a document, based on the bag-of-words 
method. That system retrieves the sentence con-
taining the answer at 30-40% of the time on the 
tests of third to sixth grade materials. In short, 
Deep Read is very restricted compared to our 
system. Charniak et al. (2000) built a system 
improved over the Deep Read by giving more 
weights for verb and subject, and introduced 
heuristic rules for “Why” question. Though, the 
essential target and method are the same as that 
of Deep Read. 
43
3 Question Classification 
First, we bought five language test books for 
second grade students and one of them, pub-
lished by KUMON, was used as a training text 
to develop our system. The other four books are 
referred occasionally. Second, we classified the 
questions in the training text into four types: 
questions about Chinese character (Kanji), ques-
tions on word knowledge, reading comprehen-
sion, and composition. We will call these types 
as major types. Each of the “major types” is 
classified into several “minor types.” Table 1 
shows four major types and their minor types. In 
practice, each minor type farther has different 
style of questions; such as description question, 
choice question, and true-false question. The 
questions can be classified into approximately 
100 categories. We observed that some ques-
tions in other books are mostly similar; however 
there are several questions which are not cov-
ered by the categories. 
 
Major type Minor type 
Kanji Reading, Writing, Radical, The 
order of writing, Classification 
Word knowl-
edge 
Katakana, How to use Kana, Ap-
propriate Noun, To fill blanks for 
Verb, Adjective, and Adjunct, 
Synonym, Antonym, Particle, 
Conjunction, Onomatopoeia, Po-
lite Expression, Punctuation mark
Reading com-
prehension 
Who, What, When, Where, How, 
Why question, Extract specific 
phrases, Progress order of a story, 
Prose and Verse  
Composition Constructing sentence, How to 
write composition 
Table 1. Question types 
4 Answering questions and evaluation 
result 
In this section, we describe the programs to 
solve the questions for each of the 100 catego-
ries and these evaluation results. First we classi-
fied questions. Some questions are difficult to 
cover by the system such as the stroke order of 
writing Kanji. For about 90% of all the question 
types other than such questions, we created pro-
grams for basically one or two categories of 
questions. There are 47 programs for the catego-
ries found in the training data. 
In each section of 4.1 to 4.3, we describe 
how to solve questions of typical types and the 
evaluation results. The evaluation results of the 
total system will be reported in the following 
section. 
Table 2 to 4 show the distributions of minor 
types in each major type, “Kanji,” “Word 
knowledge,” and “Reading comprehension,” in 
the training data and the evaluation results. 
Training and evaluation data have no overlap.  
4.1 Kanji questions 
Most of the Kanji questions are categorized into 
“reading” and “writing.” Morphological analysis 
is used in the questions of both reading and writ-
ing Kanji; we found that large corpus is effec-
tive to choose the answer from Kanji candidates 
given from dictionary. Table 2 shows it in detail. 
This system cannot answer to the questions 
which are asking the order of writing Kanji, be-
cause it is difficult to put it into digital format.  
The system made 7 errors out of 334 ques-
tions. The most of the questions are the errors in 
reading Kanji by morphological analysis. 
In particular, morphological analysis is the 
only effective method to answer questions on 
this type. It would be very helpful, if we had a 
large corpus considering reading information, 
but there is no such corpus.  
 
Training 
data 
Test data Ques-
tion type
The rate 
of Q in 
training 
data[%]
The used 
knowledge 
and tools Correct 
ans. 
(total) 
Correct ans. 
(known type 
Q., total) 
Reading 27 Kanji dictionary, 
Morphological 
analysis 
 
96(100) 
 
6(8,8) 
Writing 61 
Word diction-
ary, Large 
corpus 
 
220(222) 
 
63(66,66) 
Order of 
writing 
6 
-  
0(20) 
 
0(0,2) 
Combi-
nation 
of Kanji 
parts 
3 
- 
 
0(10) 
 
0(0,0) 
Classify 
Kanji 
3 
Word diction-
ary, Thesaurus 
 
11(12) 
 
0(0,0) 
Total 100 - 327(364) 69(74,76) 
Table 2. Question types for Kanji 
4.2  Word knowledge questions 
The word knowledge question dealt with vo-
cabulary, different kinds of words and the struc-
ture of sentence. These don’t include Kanji 
questions and reading comprehension. Table 3 
shows different types of questions in this type. 
44
For the questions on antonym, the system can 
answer correctly by choosing most relevant an-
swer candidate using the large corpus out of 
multiple candidates found in the antonym dic-
tionary. 
The questions about synonyms ask relations 
of priority/inferiority between words and choos-
ing the word in a different group. These ques-
tions can usually be answered using thesaurus. 
Ex.1 shows a question about particle, Japa-
nese postposition, which asks to select the most 
appropriate particle for the sentence. 
The system produces all possible sentences 
with the particle choices, and finds most likely 
sentences in a corpus. In Ex.1, all combinations 
are not in a corpus, therefore shorter parts of the 
sentence are used to find in the corpus (e.g. “�
�]  (1) 
O { ”, “�T�  (2) 
O { ”). In this 
case, the most frequent particle for (1) is “� ” in 
a corpus, so this system outputs incorrect answer. 
Ex.1 [� /q /t ]wOjz ()t�O\ qy �  
	{Vs^ M{  
 Select particle which fits the sentence from 
{wo,to,ni} 
[1] ��]  (1) �T�  (2) 
O{  
apple-(1) orange-(2) buy 
(1) correct=q  (to) system=�  (wo) 
(2) correct=�  (wo) system=�  (wo) 
 
The questions of Katakana can be answered 
mostly by the dictionary. The accuracy of this 
type is not so high, found in Table 2, because 
there are questions asking the origin of the 
words, most Katakana words in Japanese has a 
few origins: borrowed words, onomatopoeia, 
and others. Because we don’t have such knowl-
edge, we could not answer those questions.   
The questions of onomatopoeia include those 
shown in Ex.2. The system uses co-occurrence 
of words in the given sentence and each answer 
candidate to choose the correct answer in Ex.2, 
“]�]� .” However, it was not chosen be-
cause the co-occurrence frequency of “�lX
� ,” the word in the sentence, and “w�w � ,” 
incorrect answer, is higher.  
Ex.2 mWw  �Ob�  K ��b  \q y  
�  [ ] T�  Q��pz ○ p  T\�s ^M {  
Choose the most appropriate onomatopoeia 
(1) GVs  �w U  �lX�  \�U�   
�Ob{ (A large object is rowing slowly) 
[\�\�~ ]�]� ~w �w� ] 
The questions of word knowledge are classi-
fied into 29 types. We made a subsystem for 
each type. As there are possibly more types in 
other books, making a subsystem for each type 
is very costly. One of the future directions of 
this study is to solve this problem. 
Training 
data 
Test data Question 
type 
The rate 
of Q in 
training 
data[%] 
The used 
knowledge and 
tools Correct 
ans. 
(total) 
Correct ans. 
(known type 
Q., total) 
Anonym 
18 
Antonym 
dictionary, 
Large corpus 
26(27) 12(15,21) 
Synonym 
11 
Thesaurus 
14(17) 34(44,83) 
Particle 19 Large corpus 25(28) 16(17,17) 
Katakana 
25 
Word diction-
ary, Morpho-
logical analysis 
18(37) 19(22,52) 
Onomato-
poeia 
19 Large corpus, 
Morphological 
analysis 
18(29) 16(20,31) 
Structure 
of sen-
tence 
5 
Morphological 
analysis 
7(7) 20(22,22) 
How to 
use kana 
2 - 0(3) 0(0,19) 
Dictation 
of verb 
2 - 0(3) 0(0,0) 
Total 100 - 108(151) 117(140,245)
Table 3. Question types for Word knowledge 
4.3 Reading comprehension questions 
The reading comprehension questions need to 
read a story and answer questions on the story. 
We will describe five typical techniques that are 
used at different types of questions, shown in 
Table 4. 
Pattern matching (a) is used in questions to 
fill blanks in an expression which describes a 
part of the story. In general, the sequence of 
word used in the matching is the entire expres-
sion, but if no match was found, smaller por-
tions are used in the matching. 
Ex.3 Fill blanks in the expression 
Story� partial� ��z ~�  hmq z  
fw  Vx  `��pz i� i�  l� M   
	�t  T�l o  MV�b {�  
(In a few days, the flower withers and gradu-
ally changes its color to black.) 
Expression � Vx  (1)z (2) 	�t  T��{�  
(The flower (1) and change its color to (2).) 
Answer (1) `��p  (withers) 
(2) l�M  (black) 
The effectiveness of this technique is found 
in this example. The other methods will be 
needed when questions will be more difficult. At 
the time, this technique is very useful to solve 
many questions in reading comprehension.  
45
When the question is “Why” question, key-
words such as “f� p  (thus)” and “iT�
(because)” are used. 
For example, when questions start with 
“When (Mm )” and “Where (r\ ),” we can 
restrict the type of answer word to time or loca-
tion, respectively. If the question includes the 
expression of “r\t ” (“t ” is a particle to 
indicate direction/specification), the answer is 
also likely to be expressed with “ni” right after 
the location in the story. (The kind of NE 
(Named Entity) and particle right after word 
(b)) 
For the questions asking the time or location 
about the entire story, this system outputs the 
appropriate type of the word which appeared 
first in the story. Although there are mistakes 
due to morphological analysis and NE extraction, 
this technique is also consistently very useful.  
The technique which is partial matching 
with keywords (c) is used to seek an answer 
from story for “how,” “why” or “of what” ques-
tions. Keywords from the question are used to 
locate the answer in the story. 
Ex.4 TQl hyT� w  {sxz r� X�M
w  GV^
Frequency in the large corpus is used to 
find the appropriate sentence conjunction. (d) 
Answer is chosen by comparing the mutual in-
formation of the candidate conjunctions and the 
last basic block of the previous sentence. How-
ever, this technique turns out to be not effective. 
Discourse analysis considering wider context is 
required to solve this question. 
The technique which uses distance between 
keywords in question and answers (e) is sup-
plementary to the abovementioned methods. If 
multiple answers are found, the answer candi-
date that is the closest in the story text to the 
keywords in questions is generated. These key-
words are content words and unknown words in 
the text. This technique is found very effective. 
In Table 4, the column “Used method” shows 
the techniques used to solve the type of ques-
tions, in the order of priority. “f” in the table 
denotes means that we use a method which was 
not mentioned above.  
pbT{ (How big are chicks when 
they hatched?) 
Text� partial� �TQ lhyT �w  {s  
 
 
xzS� �| w
�Y� Mw  GV^ pb {�  
(The size of chicks when they hatched is about 
the size of your thumb.) 
 
 
 
Answer S��|w  
� Y�M  (size of 
thumb) 
 
 
 5 I F  S B U F  P G  5 S B J O J O H  E B U B  5 F T U  E B U B
 R V F T U J P O T  J O  6 T F E  D P S S F O U  X O T   D P S S F O U  B O T 
 U S B J O J O H  E B U B  <  >  N F U I P E T  	 U P U B M 
  	 L O P X O  U Z Q F  2 
  U P U B M 

 8 I P  T B J E    C 
 B 
 G
 5 I F  P U I F S T   C 
 F G
 - J L F  X I B U  D 
 C 
 F
 0 G  X I B U  D 
 G 
 F
 8 I B U  E P J O H  D 
 B 
 G
 8 I B U  J T  B 
 F
 8 I B U  E P  "  T B Z  C 
 B 
 G 
 F
 8 I P M F  T U P S Z  C
 1 B S U  P G  T U P S Z  
 8 I P M F  T U P S Z  C
 1 B S U  P G  T U P S Z  C 
 G 
 D
    D 
 G    	   
   	  
   

    D   	   
   	  
   

   C 
 D 
 G   	  
   	  
   

    B    	   
   	  
   

      	  
   	  
   

   E   	  
   	  
   

    G   	   
   	  
   

    G   	   
   	  
   

      	  
   	  
   

         	    
    	   
    

 8 I Z
 ) P X
 ) P X  M P O H 
  I P X  P G U F O 
  I P X  M B S H F
 5 P U B M
 1 B S B H S B Q I
 5 I F  P U I F S T
 5 P  G J M M  C M B O L T
 / P U  I B W F  J O U F S S P H B U J W F  Q S P O P V O
 $ P O K V O D U J P O
 1 S P H S F T T  P S E F S  P G  B  T U P S Z
 8 I F S F     	 
 8 I F O  	 
 
 8 I P
 2 V F T U J P O  U Z Q F
   	   
   	  
   
 8 I B U   
  	  
   	  
   

 
 

 
 
Table 4. Question types for Reading comprehension 
46
5 Evaluation 
We collected questions for the test from differ-
ent books of the training data. The proposition 
of the number of questions for different sections 
is not the same as that of the training data. Table 
2 to 4 show the evaluation results in the test data 
for each type. Table 5 shows the summary of the 
evaluation result. In the test, we use only the 
questions of the type in training data. The tables 
also show the total number of questions, the 
number of questions which are solved correctly, 
and the number of questions which are not one 
of the types the system targeted (not a type in 
the training data). 
The ratio of the questions covered by the sys-
tem, questions in test data which have the same 
type in the training data are 97.4% in Kanji, 
57.1% in word knowledge, and 64.7% in read-
ing comprehension. It indicates that about a half 
of the questions in word knowledge isn’t cov-
ered. As the result, accuracy on the questions of 
the covered types in word knowledge is 83.6%, 
but it drops to 47.8% for the entire questions. It 
is because our system classified the questions 
into many small types and builds a subsystem 
for each type. 
The accuracy for the questions of covered 
type is 83.4%. In particular, for the questions of 
Kanji and word knowledge, the scores in the test 
data are nearly the same as those in the training 
data. It presents that the accuracy of the system 
is provided that the question is in the covered 
type. However, the score of reading comprehen-
sion is lower in the test data. We believe that 
this is mainly due to the small test data of read-
ing comprehension (only 34) and that the accu-
racy for “Who” questions and the questions to 
fill blanks in the test data are quite difficult com-
pared to the training data. 
 / V N   / V N   / V N   3 $ " 
 
 3 $ "
 
 3 $ "
 
 P G  P G  P G  	 L O P X O  	 U P U B M�  J O  U P U B M
 B M M  2  L O P X O   D P S S F O U  U Z Q F  2 
  <  >  P G  L O P X O
 U Z Q F  2  B O T   <  >  U Z Q F  2  <  >
 , B O K J                        
 8 P S E  L O P X M F E H F                           
 3 F B E J O H 
 $ P N Q S F I F O T J P O
 5 P U B M                           
                      
 
Table 5. Evaluations at test data 
 
* Rate of Correct Answer 
6 Discussions 
We will discuss the problems and future direc-
tions found by the experiments. 
6.1 Recognition and Classification of Ques-
tions 
In order to solve a question in language test, 
students have to recognize the type of the ques-
tion. The current system skips this process. In 
this system, we set up about 100 types of ques-
tions referring the training data and a subpro-
gram solves questions corresponding to each 
type. There are two problems to be solved. First, 
we have to design the appropriate classification 
and avoid unknown types in the test data. From 
the experiment, we found that the current types 
are not enough to solve this problem. Second, 
the program has to classify the questions auto-
matically. We are building this system and are 
forecasting it quite optimistically once a good 
format is provided. 
6.2 Effectiveness of Large Corpus 
The large corpus of newspapers and the Web are 
used effectively in many different cases. We 
will describe several examples. 
In Japanese, there are different Kanji for the 
same reading. For example, Kanji for “KO (au: 
to see, to solve correctly) are “qO (to see)” or 
“�O (to solve correctly)” for “
tKO (to see 
people)” and “t Q UKO (to solve an answer 
correctly),” respectively. This type of questions 
can be solved by counting the expressions with 
Kanji in the corpus. It is similar to word sense 
disambiguation. 
In the questions of particle complement, such 
as “T^  (umbrella) �t /q /�  (locative-, con-
junctive-, and objective particles) �H  (home) 
�t /q /�� �b � �  (to left) {  (Intentional 
sentence is ‘I left the umbrella at home’)”, it can 
be solved by counting the expressions with each 
particle in a corpus. This method is mentioned in 
Matsui� 2004� but the evaluation result was 
not reported. When the answer is not found for 
the entire expression, the answer is searched by 
deleting some contexts. Most questions of filling 
blank types, similar strategy is helpful to find 
the correct answer. 
In summary, the experiments showed that the 
large corpus is quite useful in several types of 
47
questions. We believe it would be quite difficult 
to achieve the same accuracy by compiled 
knowledge, such as a dictionary of verbs, anto-
nyms, synonyms, and relation words, and a the-
saurus.  
6.3 World Knowledge 
The questions sometimes need various types of 
world knowledge. For example, “A student en-
ters junior high school after graduated from 
elementary school.” And “People become happy, 
if he receives something nice from someone.” It 
is a difficult problem how to describe and how 
use that knowledge. Another type of world 
knowledge includes origin of words, such as 
foreign borrowed word or onomatopoeia. As far 
as we know, there is no comprehensive knowl-
edge of such in electronic form. It is required to 
design attributes of world knowledge and to use 
them flexibly when applying then to solve the 
questions. 
6.4 Difference between Reading Compre-
hension and Question Answering 
The current QA systems identify the NE type of 
questions and seek the answer candidate of the 
type. However, the questions in the reading 
comprehension don’t limit the answer types to 
person and organization, even if the question is 
“Who” type question. For example, “raccoon 
dog behind our house” or “the moon” can be the 
answer. Also, the answer is not always a noun 
phrase, but can be a clause, for example, “the 
time when new leaves growing on a branch” for 
questions asking time. There are different kinds 
of questions, which are asking not the time of 
specific event but the time or season of the en-
tire story. For example “When is this story 
about?” In this case, the question can’t be an-
swered by just extracting a noun phrase.  
However, at the moment, we can’t conclude 
if the question can or cannot be answered with-
out really understanding it. Sometime, we can 
find a correct answer without reading the story 
down the line or understanding the story per-
fectly. It is one of the future works. 
6.5 Other techniques: discourse and 
anaphora 
Some techniques other than morphological 
analysis, frequency of appearance in a corpus, 
and question answering methods are used in our 
system. We raise two issues. One of those is the 
discourse analysis. It is required in the questions 
to assign the order of paragraphs, and to select 
appropriate sentence conjunction. The other is 
anaphora analysis, which is very important, not 
only to indicate the antecedent, but also to find 
the link of mentions of entities.  
7 Conclusion 
 several inter-
esting NLP problems were found. 
Hi
Comprehension system”.  
Ch
er-based Language Understanding 
K.
of 
K. 
atural Language Processing, 2004, 
 
We develop a system to solve questions of sec-
ond grade language tests. Our objectives are to 
demonstrate the NLP technologies to ordinary 
people, and to observe the problems of NLP by 
degrading the level of target materials. We 
achieved 55% - 83% accuracy and
References 
rschman, L., Light, M., Breck, E. and Burger, J. D. 
“Deep READ: a Reading 
ACL, 1999, pp 325-332. 
arniak et al., “Reading Comprehension Programs 
in a Statistical-language-Processing Class”. Work-
shop on Reading Comprehension Tests as Evalua-
tion for Comput
Systems. 2000. 
 Matsui: “Search Technologies on WWW which 
utilize search engines”. (In Japanese) Journal 
Japanese Language, February, 2004, pp 34-43. 
Yoshihira, Y. Takeda, S. Sekine: “KWIC System 
on WEB documents”, (In Japanese) 10th Annual 
Meeting of N
pp 137-139. 
 
F  
(http://languagecraft.jp/dennou/) 
igure 1. A Snapshot of the system
48
