Candidate Terms Extracted Using NP-Chunking (1)

This directory contains candidate terms extracted using NP chunks. Sentences in the corpus are chunked with the Apache OpenNLP chunker (release version 1.5.2, http://opennlp.apache.org/). All noun phrases (NP chunks) of maximum length 5 (after removing determiners and stop words) are then considered candidate terms. For the candidate terms listed here, the frequency of each term in the corpus is computed independently of the NP chunk boundaries.

The structure of files and folders is similar to candid_term/pos_based/. However, the candidate terms listed in the file "_all_candid_term_by_np_chunk_1.zip" are also marked with the CHUNK_IDs of the chunks from which the terms were extracted. The CHUNK_IDs are listed after PARAGRAPH_FREQ, as described below.


Index of: CANDIDATE_TERM/


Size:Name:Description:
25.787.427  _ALL_CANDID_TERM_BY_NP_CHUNK_1.ZIP
This file contains all the extracted candidate terms. Each line of the file gives the following information:
  • TERM_ID: a universal integer id assigned to the candidate term (if a term also appears in other lists of extracted candidate terms, e.g. the pos-based list, its integer id is the same across these lists)
  • STRING LENGTH: the length of the candidate term
  • CORPUS_FREQ: the number of occurrences of the candidate term in the segmented pre-processed corpus (i.e. SEPID_CORPUS), in other words the term frequency (tf). As stated above, the boundaries of NP chunks are ignored when collecting term frequencies.
  • DOCUMENT_FREQ: the number of documents in which the candidate term occurs, i.e. the document frequency of the term, which can be used for calculating the inverse document frequency.
  • SECTION_FREQ: the number of sections in which the term occurs.
  • PARAGRAPH_FREQ: the number of paragraphs in which the term occurs.
  • CHUNK_ID: the integer id of the originating NP chunk in the SEPID_CORPUS.
33.777.930  _ALL_CANDID_TERM_BY_NP_CHUNK_1_DOCUMENT_INDEX.ZIP
An inverted index file that maps terms to documents in the corpus. Each line of the file shows a single occurrence of a term in the form of TERM_ID followed by DOCUMENT_ID (tab separated). Please note that DOCUMENT_ID is the integer id assigned to each document in the SEPID_CORPUS.
40.835.147  _ALL_CANDID_TERM_BY_NP_CHUNK_1_SECTION_INDEX.ZIP
As above, but for sections: an inverted index file that maps terms to sections in the corpus. Each line of the file shows a single occurrence of a term in the form of TERM_ID followed by SECTION_ID (tab separated).
51.301.685  _ALL_CANDID_TERM_BY_NP_CHUNK_1_PARAGRAPH_INDEX.ZIP
As above, but for paragraphs, i.e. TERM_ID followed by PARAGRAPH_ID (tab separated), where PARAGRAPH_ID is the paragraph id from SEPID_CORPUS.
125.578.997  _ALL_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
As above, but for sentences. Each line of the file consists of TERM_ID followed by SENTENCE_ID, followed by the START and END positions of the term. START and END are token numbers within the sentence.
236  README.TXT
A note on collecting frequencies.
<DIR>  CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX/
The (candidate-term-id, sentence-id) indices (i.e. those in _all_candid_term_by_np_chunk_1_sentence_index.zip) are grouped by the year of publication of the source documents. The first two characters of each filename give the year of publication. For instance, the file "84_candid_term_by_np_chunk_1_sentence_index.zip" contains all term-id to sentence mappings for sentences from publications of 1984. These files, together with the additional index files provided in SEPID_CORPUS, can be used to organize candidate terms in chronological order. There are currently 34 files, representing publications from 65 (i.e. 1965) to 06 (i.e. 2006).

Directory contains 277.281.422 Bytes in 6 Files
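
The record layout described for _ALL_CANDID_TERM_BY_NP_CHUNK_1.ZIP can be read along the following lines. This is only a sketch: it assumes that the fields of a line are tab separated, that they appear in the order listed above, and that one or more CHUNK_IDs follow PARAGRAPH_FREQ. The class name CandidateTerm, the function names, the sample record and the document count passed to the idf function are all hypothetical and are not taken from the data.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class CandidateTerm:
    term_id: int
    string_length: int
    corpus_freq: int      # tf, counted independently of NP chunk boundaries
    document_freq: int
    section_freq: int
    paragraph_freq: int
    chunk_ids: List[int]  # assumption: every remaining field is a CHUNK_ID


def parse_term_line(line: str) -> CandidateTerm:
    """Parse one record, assuming tab-separated integer fields in the listed order."""
    fields = [int(f) for f in line.strip().split("\t")]
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 fields, got {len(fields)}")
    return CandidateTerm(*fields[:6], chunk_ids=fields[6:])


def inverse_document_freq(term: CandidateTerm, n_documents: int) -> float:
    """Plain idf = log(N / df); the corpus size N must be supplied by the caller."""
    return math.log(n_documents / term.document_freq)


# Made-up record, for illustration only.
example = parse_term_line("17\t12\t40\t8\t15\t22\t1001\t1002")
print(example.chunk_ids)
print(inverse_document_freq(example, n_documents=800))
```

Other idf variants (smoothed, probabilistic) can be substituted in inverse_document_freq without changing the parsing step.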

Index of: CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX/


Size:Name:Description:
6.869.371  00_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 2000.
3.900.547  01_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 2001.
5.641.808  02_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 2002.
6.517.638  03_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 2003.
12.071.310  04_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 2004.
6.864.500  05_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 2005.
13.309.338  06_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 2006.
271.362  65_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1965.
218.459  67_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1967.
532.014  69_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1969.
355.245  73_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1973.
450.201  75_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1975.
585.869  78_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1978.
2.260.052  79_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1979.
1.512.396  80_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1980.
643.966  81_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1981.
1.309.543  82_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1982.
1.327.300  83_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1983.
1.353.830  84_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1984.
1.390.042  85_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1985.
2.661.252  86_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1986.
1.793.222  87_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1987.
3.528.650  88_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1988.
2.610.335  89_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1989.
3.933.601  90_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1990.
3.567.428  91_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1991.
5.468.993  92_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1992.
4.606.487  93_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1993.
5.943.821  94_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1994.
2.537.697  95_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1995.
5.079.275  96_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1996.
5.400.770  97_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1997.
6.998.228  98_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1998.
3.872.754  99_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1999.

Directory contains 125.387.304 Bytes in 34 Files
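
Because the per-year files are named with a two-digit year prefix, sorting them by name does not give chronological order (00 to 06 sort before 65 to 99). The sketch below shows one way to recover the four-digit year and to read a sentence-index line. It rests on assumptions: the pivot value 50 used to separate 19xx from 20xx is an arbitrary choice that happens to fit the 1965 to 2006 range of this listing, and the fields of a sentence-index line are assumed to be whitespace separated integers in the order TERM_ID, SENTENCE_ID, START, END. The names SentenceHit, publication_year and parse_sentence_hit are hypothetical, as is the sample line.

```python
from typing import NamedTuple


class SentenceHit(NamedTuple):
    term_id: int
    sentence_id: int
    start: int  # token number of the first token of the term in the sentence
    end: int    # token number of the last token of the term in the sentence


def publication_year(filename: str, pivot: int = 50) -> int:
    """Expand the two-digit filename prefix to a four-digit year (pivot is an assumption)."""
    two_digit = int(filename[:2])
    return (1900 if two_digit >= pivot else 2000) + two_digit


def parse_sentence_hit(line: str) -> SentenceHit:
    """Parse one line, assuming four whitespace-separated integer fields."""
    term_id, sentence_id, start, end = (int(f) for f in line.split())
    return SentenceHit(term_id, sentence_id, start, end)


names = [
    "00_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP",
    "84_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP",
    "65_CANDID_TERM_BY_NP_CHUNK_1_SENTENCE_INDEX.ZIP",
]
chronological = sorted(names, key=publication_year)
print([publication_year(n) for n in chronological])

# Made-up line, for illustration only.
hit = parse_sentence_hit("17\t4521\t3\t5")
print(hit)
```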

Total: 402.668.726 Bytes in 40 Files
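
The document, section and paragraph index files all share the same two-column form (TERM_ID followed by a unit id, tab separated), so a single reader can serve all three. The sketch below builds a term-to-units map from such lines and derives a per-term frequency by counting distinct units, which could be compared against DOCUMENT_FREQ, SECTION_FREQ or PARAGRAPH_FREQ in the main term file. The function name build_postings and the sample lines are hypothetical, and whether a (term, unit) pair can repeat in the real files is not stated above; the set is used so that repeats would be harmless either way.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_postings(lines: Iterable[str]) -> Dict[int, Set[int]]:
    """Map each TERM_ID to the set of unit ids (documents, sections, ...) it occurs in."""
    postings: Dict[int, Set[int]] = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        term_id, unit_id = line.split("\t")[:2]
        postings[int(term_id)].add(int(unit_id))
    return postings


# Made-up lines, for illustration only.
sample = ["17\t3", "17\t3", "17\t9", "42\t3"]
postings = build_postings(sample)
document_freq = {term: len(units) for term, units in postings.items()}
print(document_freq)
```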

This page last edited on 06 October 2025.



