SEPID Corpus: Segmented, Pre-Processed, Indexed ACL ARC 1.0


The SEPID CORPUS consists of the segmented, processed ACL ARC documents, represented in the data model shown in the figures below. In this representation, each linguistically well-defined unit, i.e. lexemes (part-of-speech-tagged, lemmatized words), sentences, paragraphs, sections, etc., is identified by a unique identifier. Moreover, units of a higher level of granularity than lexemes consist of a combination of linguistic units of a finer level of granularity. For instance, a sentence consists of a list of lexemes and their positions in the sentence; paragraphs are lists of sentences, and so on. This representation of the text is then serialized as a set of tab-separated text files; each text file represents a particular linguistic unit (a data-entity in the given diagrams).

The data-entity relationships diagram

Text data-entity relationships diagram at levels finer than paragraph
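As a minimal sketch of this model, the snippet below rebuilds a sentence from (lexeme_id, position) tuples. The two in-memory tables are hypothetical miniatures that follow the column layouts of the _all_lexicon and _all_sentence_lexeme files described in the file list on this page; the ids, words and the function name rebuild_sentences are made up for illustration.

```python
from collections import defaultdict

# Hypothetical miniature of _all_lexicon: LEXEME_ID -> (LEXEME_STRING, LEMMA, POS)
lexicon = {
    1: ("Parsers", "parser", "NNS"),
    2: ("are", "be", "VBP"),
    3: ("useful", "useful", "JJ"),
}

# Hypothetical miniature of _all_sentence_lexeme: (SENTENCE_ID, LEXEME_ID, POSITION)
sentence_lexeme = [
    (10, 3, 2),
    (10, 1, 0),
    (10, 2, 1),
]


def rebuild_sentences(sentence_lexeme, lexicon):
    """Group lexemes by sentence and order them by their position."""
    by_sentence = defaultdict(list)
    for sentence_id, lexeme_id, position in sentence_lexeme:
        by_sentence[sentence_id].append((position, lexeme_id))
    # Sorting works whether positions start at 0 or at 1.
    return {
        sentence_id: [lexicon[lexeme_id][0] for _, lexeme_id in sorted(items)]
        for sentence_id, items in by_sentence.items()
    }


sentences = rebuild_sentences(sentence_lexeme, lexicon)
print(" ".join(sentences[10]))
```

Paragraphs can be rebuilt the same way from (sentence_id, position) tuples.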


In order to model nested sections and subsections, text units of a granularity level higher than paragraphs are all considered content-units of a specific content-type. Each of these units is then a list of content-units, each at a specific position. Further information about a content-unit can be found in the file relevant to that text unit.

The data-entity relationships diagram

Text data-entity relationships diagram at levels higher than paragraph
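The nested content-unit model can be walked recursively. The sketch below collects the paragraph ids of a section in reading order from rows shaped like those of the _all_content_content and _all_content_type files described on this page. It assumes that content ids are unique across content types and that the type strings are literally "section" and "paragraph"; both are assumptions, as are the sample ids and the function name.

```python
from collections import defaultdict

# Hypothetical rows: (SUPER_CONTENT_ID, SUB_CONTENT_ID, POSITION)
content_content = [
    (100, 201, 1),  # section 100 contains subsection 201 at position 1
    (100, 301, 0),  # section 100 contains paragraph 301 at position 0
    (201, 302, 0),
    (201, 303, 1),
]

# Hypothetical rows: CONTENT_ID -> CONTENT_TYPE (type strings are assumed)
content_type = {
    100: "section",
    201: "section",
    301: "paragraph",
    302: "paragraph",
    303: "paragraph",
}

# Index the children of every super-content once.
children = defaultdict(list)
for super_id, sub_id, position in content_content:
    children[super_id].append((position, sub_id))


def paragraphs_in_order(content_id):
    """Return the paragraph ids below content_id, depth first, by position."""
    if content_type.get(content_id) == "paragraph":
        return [content_id]
    result = []
    for _, sub_id in sorted(children.get(content_id, [])):
        result.extend(paragraphs_in_order(sub_id))
    return result


section_paragraphs = paragraphs_in_order(100)
print(section_paragraphs)
```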


The data in the files listed here are derived from processing the cleansed text documents with the Stanford tokenizer and part-of-speech tagger (version released 9 July 2012), Apache OpenNLP's sentence splitter and chunker (version 1.5), and MaltParser (version 1.6), a data-driven dependency parser.

Each of the data-entities in the above figures is represented by a tab-separated text file. The first line of each file starts with the character "#" and describes the content of the records in the file. The corpus files can be downloaded from the list given below:


SEPID CORPUS


Size:  Name:  Description:
554.505.209  sepid_corpus.zip  All the files listed below in one zip file.
7.543.215  _all_lexicon.zip  All the lexemes, i.e. part-of-speech tagged, lemmatized words, that are extracted from the ACL ARC. The structure of this tab-separated file is as follows:
  • LEXEME_ID;
  • LEXEME_STRING: the extracted string/word as appeared in the corpus;
  • LEMMA: the assigned lemma to the word;
  • POS: the assigned part-of-speech tag to the word (a description of the employed Penn-style part-of-speech tags can be found in the Stanford tagger documentation);
  • FREQUENCY: the frequency of this lexeme in the corpus.
2.358.739  _all_sentence.zip  All the extracted sentences from the ACL ARC. This file contains only one column, i.e. the list of integers employed as SENTENCE_ID.
139.071.858  _all_sentence_lexeme.zip  This file defines the extracted sentences from the corpus as lists of (lexeme_id, lexeme_position) tuples. The structure of this tab-delimited file is as follows:
  • SENTENCE_ID: the employed id to identify individual sentences, which are also listed in the above _all_sentence file;
  • LEXEME_ID: the lexeme_ids of the words in the sentence. These lexeme_ids come from the above _all_lexicon file;
  • POSITION: the position of the lexeme in the sentence.
24.908.979  _all_chunk.zip  All the extracted chunks (phrases) from the ACL ARC. This tab-separated file has records in the form of:
  • CHUNK_ID;
  • TYPE: the type of chunk, e.g. NP, VP, etc;
  • LIST_OF_LEXEME_IDS: the list of lexeme_ids in the same order as they appear in the chunk. These lexeme_ids are separated by the space character and come from the above _all_lexicon file;
  • FREQUENCY: the frequency of the chunk in the corpus.
85.975.964  _all_sentence_chunk.zip  This file maps the extracted chunks to extracted sentences. This tab-separated file has records in the form of:
  • SENTENCE_ID;
  • CHUNK_ID: from the above _all_chunk file;
  • START_POSITION: the start position for the chunk in the sentence (i.e. token offset: the number of tokens from the beginning of the sentence);
  • END_POSITION: the end position of the chunk in the sentence.
54.683.904  _all_dependency.zip  All the extracted syntactic relations (dependencies) between lexemes in the corpus. The structure of the records in this tab-separated file is as follows:
  • DEPENDENCY_ID;
  • GOVERNOR_LEXEME_ID: the lexeme_id of the lexeme (i.e. part-of-speech tagged word) that appears in the governor position in the syntactic relation;
  • REGENT_LEXEME_ID: the lexeme_id of the lexeme (i.e. part-of-speech tagged word) that appears in the regent position in the syntactic relation;
  • DEPENDENCY_TYPE: the type of syntactic relation, e.g. auxpass, det, etc. (for further information on the types of syntactic relations, please see the MaltParser documentation);
  • FREQUENCY: the frequency of the specified syntactic relations between the given two lexemes in the corpus.
226.281.573  _all_sentence_dependency_parse.zip  This file maps the extracted syntactic relations to sentences. The structure of the records in this tab-separated file is as follows:
  • DEPENDENCY_ID;
  • SENTENCE_ID;
  • GOVERNOR_POSITION: the position of governor in the sentence;
  • REGENT_POSITION: the position of regent in the sentence;
  • DEPENDENCY_ID: the dependency_id from the above _all_dependency file;
  • DEPENDENCY_TYPE: the type of dependency (which is redundant, as it can be obtained from the _all_dependency file).
5.330.203  _all_paragraph_sentence.zip  This file identifies the extracted paragraphs from the corpus. Each paragraph is formed of a list of sentences at certain positions, i.e. a list of (sentence_id, sentence_position) tuples. The structure of the records in this tab-separated file is as follows:
  • PARAGRAPH_ID;
  • SENTENCE_ID;
  • POSITION: the position of the sentence in the paragraph.
439.649  _all_section.zip  All the extracted sections from the corpus. The structure of the records in this tab-separated file is as follows:
  • SECTION_ID: the assigned integer id to the section;
  • SECTION_TYPE: the type of the section, e.g. abstract, method, sub_section, etc.;
  • SENTENCE_ID: a sentence_id which can be used to obtain the title of the section.
In order to retrieve the section text, it is necessary to use the file _all_content_content (listed below) and traverse it recursively.
1.026.572  _all_content_type.zip  The list of all text units other than lexemes, sentences and paragraphs, i.e. sections, documents, figures, tables and equations. This file is used to recover the text and the structure of documents. The structure of the records in this tab-separated file is as follows:
  • CONTENT_ID: the assigned id to the content: these ids come from the files _all_document, _all_section, _all_equation, and _all_paragraph;
  • CONTENT_TYPE: the type of the content, i.e. document, section, etc. In other words, the origin of the listed id.
1.722.830  _all_content_content.zip  This file is used to retrieve/recover the structure and text for sections and documents. The structure of the records in this tab-separated file is as follows:
  • CONTENT_ID (SUPER_CONTENT): the content id of the text unit of higher level of granularity; for instance, for a section with a number of subsections, the section_id is listed as CONTENT_ID (SUPER_CONTENT);
  • CONTENT_ID (SUB_CONTENT): the content id of the text unit of finer level of granularity; for instance, for a section with a number of subsections, the sub_section_ids are listed as CONTENT_ID (SUB_CONTENT);
  • POSITION_OF_SUB_CONTENT_IN_SUPER_CONTENT: this determines the position of the sub-content in the content of higher level of granularity; e.g., the position of subsections in the section.
47.260  _all_document.zip  The list of documents from the ACL ARC that are processed and indexed successfully. The structure of the records in this tab-separated file is as follows:
  • DOCUMENT_ID: an integer number that identifies the document in the corpus;
  • SENTENCE_ID: the sentence_id that can be used to retrieve the document title.
106.842  _all_equation_caption.zip  All the extracted equations from the ACL ARC's sections. The structure of the records in this tab-separated file is as follows:
  • EQUATION_ID: an integer number that identifies the equation;
  • EQUATION_PARAGRAPH_ID: the paragraph_id that can be used to retrieve the text that may have accompanied the equation.
60.542  _all_figure_caption.zip  All the extracted figures from the ACL ARC. Please note that neither the figures themselves nor their positions are stored. The structure of the records in this tab-separated file is as follows:
  • FIGURE_ID;
  • PARAGRAPH_ID: the paragraph_id that can be used to retrieve the caption of the figure.
56.392  _all_section_figure.zip  Mapping between figures and documents in the corpus. The structure of the records in this tab-separated file is as follows:
  • SECTION_ID: the id of the origin section;
  • FIGURE_ID: from the above _all_figure_caption file.
40.852  _all_table_caption.zip  The extracted table captions from the corpus. The structure of the records in this tab-separated file is as follows:
  • TABLE_ID;
  • CAPTION_PARAGRAPH_ID: the assigned id to the caption paragraph; this paragraph_id can be used to retrieve the caption text.
37.576  _all_section_table.zip  Mapping between tables and documents in the corpus. The structure of the records in this tab-separated file is as follows:
  • SECTION_ID: the id of the origin section;
  • TABLE_ID: from the above _all_table_caption file.
41.977  _id_map_to_acl_arc.zip  This file maps the integer ids employed for documents in the SEPID_CORPUS to the original ACL ARC ids. The structure of the records in this tab-separated file is as follows:
  • DOC_ID: the integer id of a document in the SEPID_CORPUS;
  • ACL_ARC_ID: the id in the ACL ARC (publications' original ACL ID).
98.043  _all_affiliation.zip  Extracted affiliations from the ACL ARC corpus. This tab-separated file contains the following information:
  • AFFILIATION_ID;
  • AFFILIATION: text that is used to represent the affiliation.
1.004.429  _all_author.zip  The list of extracted authors from the ACL ARC. This tab-separated file contains the following information:
  • AUTHOR_ID;
  • FIRST_NAME;
  • MIDDLE_NAME;
  • LAST_NAME.
43.115  _all_author_affiliation.zip  Extracted affiliations for the authors who appear in the corpus. This tab-separated file has records in the form of:
  • AUTHOR_ID: the ids from the above _all_author file.
  • AFFILIATION_ID: these ids are from the above _all_affiliation file.
62.079  _all_email.zip  All the extracted email addresses from the ACL ARC. This tab-separated file has records in the form of:
  • EMAIL_ID
  • EMAIL
24.031  _all_author_email.zip  This file assigns the email addresses extracted from the ACL ARC to authors. This tab-separated file has records in the form of:
  • AUTHOR_ID: these ids are from the file _all_author.
  • EMAIL_ID: these ids are from the file _all_email
2.759.385  _all_citation.zip  The list of all extracted citations from the ACL ARC. The structure of the records in this tab-separated file is as follows:
  • CITATION_ID: the assigned id to the citation entry.
  • TITLE: a string that shows the title of the entry.
  • DATE: publication date.
Please note that a more reliable citation network is represented in the metadata that accompanies the ACL ARC distribution.
558.776  _all_citation_author.zip  The list of authors of the extracted citations. The structure of the records in this tab-separated file is as follows:
  • CITATION_ID: from the above _all_citation file.
  • AUTHOR_ID: from the above _all_author file.
220.974  _all_document_citation.zip  Indicates the list of citations for each document in the corpus. The structure of the records in this tab-separated file is as follows:
  • DOCUMENT_ID: from the above _all_document file.
  • CITATION_ID: from the above _all_citation file.
<DIR>  sepid_corpus_examples/  A short, truncated version of all the files listed above can be found in this folder.
<DIR>  redundant_index/  A set of additional redundant indexes that may come in handy, or make it easier to process the corpus, can be found in this folder.
<DIR>  sepid_corpus_by_year_type/  This folder contains the same data as the files listed above. However, each text file in this folder is broken into 34 different files; each file represents the text units that are extracted from articles published in a particular year, e.g. 67, 78, 92, and so on.

Directory contains 1.109.010.968 Bytes in 27 Files
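The files listed above can be loaded with a few lines of code. The following is a minimal sketch, assuming that the "#" line is the first line of the file and that records are split on the tab character; the function name read_tsv and the sample rows are hypothetical, and the exact wording of the "#" line in the real files may differ.

```python
import io


def read_tsv(handle):
    """Return (header, records) for a file whose first line starts with '#'."""
    header = None
    records = []
    for line_number, line in enumerate(handle):
        line = line.rstrip("\r\n")
        if line_number == 0 and line.startswith("#"):
            # Keep the description line as plain text, without the marker.
            header = line[1:].strip()
            continue
        if line:
            records.append(line.split("\t"))
    return header, records


# Made-up sample in the layout described for _all_lexicon.
sample = io.StringIO(
    "#LEXEME_ID\tLEXEME_STRING\tLEMMA\tPOS\tFREQUENCY\n"
    "1\tparsers\tparser\tNNS\t42\n"
    "2\tare\tbe\tVBP\t1000\n"
)
header, records = read_tsv(sample)
print(header)
print(records)
```

All field values are returned as strings; ids, positions and frequencies need an explicit int() conversion.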


Redundant Index Files

A set of redundant index files that can help to manipulate, search and process the corpus is provided. The files available for download are listed below.


Index of: redundant_index/


Size:  Name:  Description:
136.001.881  _redundant_index.zip  All the files listed below in one zip file.
63.877.459  _all_sentence_text.zip  All the extracted sentences from the corpus; each record is one line of the text file, in which the field values are separated by tab_character+<s>+tab_character. Each record has the following fields:
  • SENTENCE_ID: the assigned id to the sentence in SEPID_CORPUS.
  • SENTENCE_STRING: extracted string for the sentence.
59.163.252  _all_paragraph_text.zip  All the extracted paragraphs from the corpus; each record is one line of the text file, in which the field values are separated by tab_character+<p>+tab_character. Each record has the following fields:
  • PARAGRAPH_ID: the assigned id to the paragraph in the SEPID_CORPUS.
  • PARAGRAPH_STRING: extracted string for the paragraph.
5.516.793  _all_document_sentence.zip  Mapping between extracted sentences and documents. The structure of the records in this tab-separated file is as follows:
  • DOCUMENT_ID: the assigned id to the document in the SEPID_CORPUS.
  • SENTENCE_ID: the assigned id to the sentence in the SEPID_CORPUS.
  • POSITION: the absolute position of the sentence in the document (i.e. the order of appearance of sentences in the document, caption sentences are also included).
1.449.532  _all_document_paragraph.zip  Mapping between extracted paragraphs and documents. The structure of the records in this tab-separated file is as follows:
  • DOCUMENT_ID: the assigned id to the document in the SEPID_CORPUS.
  • PARAGRAPH_ID: the assigned id to the paragraph in the SEPID_CORPUS.
  • POSITION: the absolute position of the paragraph in the document (i.e. the order of appearance of paragraphs in the document, caption paragraphs are also included).
5.921.040  _all_section_sentence.zip  Mapping between extracted sentences and sections. The structure of the records in this tab-separated file is as follows:
  • SECTION_ID: the assigned id to the section in the SEPID_CORPUS.
  • SENTENCE_ID: the assigned id to the sentence in the SEPID_CORPUS.
  • POSITION: the absolute position of the sentence in the section (i.e. the order of appearance of sentences in the section, caption sentences are also included).
73.579  _all_orphan_sentence.zip  This file lists all the sentence ids that are not connected to any document. This problem is due to the incomplete indexing of some of the documents, e.g. because of bugs in the code, illegal characters in the documents, etc. This problem will be addressed in a future release. The records of this file thus consist only of SENTENCE_ID.

Directory contains 272.003.536 Bytes in 7 Files
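The text index files use a three-character separator (a tab, the marker, and another tab) rather than a single tab. A minimal sketch of splitting one record, assuming integer ids; the function name and the sample lines are made up for illustration:

```python
def parse_text_record(line, marker):
    """Split 'ID<tab>marker<tab>TEXT' into (int id, text)."""
    separator = "\t" + marker + "\t"
    # Split only once so the text itself may contain the marker or tabs.
    record_id, text = line.rstrip("\r\n").split(separator, 1)
    return int(record_id), text


# Hypothetical records in the layouts of _all_sentence_text and _all_paragraph_text.
sentence = parse_text_record("17\t<s>\tThis is a sentence .\n", "<s>")
paragraph = parse_text_record("5\t<p>\tFirst sentence . Second one .\n", "<p>")
print(sentence)
print(paragraph)
```

If these files also begin with a "#" description line, that line has to be skipped before calling the function.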

SEPID CORPUS Sectioned and Grouped by the Publication Year of Documents

Here you can download all the text units of the SEPID CORPUS listed above, with the files split and organized by the year of publication of their source documents. The structure of these files is exactly the same as described in the table above.

Each of the files listed in the SEPID CORPUS (the above table) is broken down into 34 different files, each of which represents the text units extracted from the documents published in a particular year. For example, the _all_lexicon file is broken down into 34 files; each file name starts with a two-digit number, e.g. 98, 87, 67 and so on, which shows the year of publication, followed by "_lexicon". In this way, the file "87_lexicon" contains all the lexemes extracted from the documents published in the year 87, and the file "98_lexicon" contains all the lexemes extracted from documents published in the year 98.

In the current release these 34 years are: '06', '05', '04', '03', '02', '01', '00', '99', '98', '97', '96', '95', '94', '93', '92', '91', '90', '89', '88', '87', '86', '85', '84', '83', '82', '81', '80', '79', '78', '75', '73', '69', '67', '65'.
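The naming scheme can be sketched as follows. The year prefixes are copied from the list above; the mapping to four-digit years is an assumption (prefixes '00' to '06' are read as 2000 to 2006, all others as 19xx), and both function names are hypothetical.

```python
YEAR_PREFIXES = [
    "06", "05", "04", "03", "02", "01", "00", "99", "98", "97", "96", "95",
    "94", "93", "92", "91", "90", "89", "88", "87", "86", "85", "84", "83",
    "82", "81", "80", "79", "78", "75", "73", "69", "67", "65",
]


def year_file_name(prefix, unit):
    """Build a per-year file name such as '87_lexicon'."""
    return f"{prefix}_{unit}"


def four_digit_year(prefix):
    """Assumed expansion: '00'-'06' -> 2000-2006, everything else -> 19xx."""
    value = int(prefix)
    return 2000 + value if value <= 6 else 1900 + value


lexicon_files = [year_file_name(prefix, "lexicon") for prefix in YEAR_PREFIXES]
print(len(lexicon_files), lexicon_files[:3])
print(sorted(four_digit_year(prefix) for prefix in YEAR_PREFIXES)[:3])
```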


Index of: sepid_corpus_by_year_type/


Size:  Name:  Description:
623.815.172  _sepid_corpus_by_year_type.zip  All the files listed below in one zip file.
132.336  affiliation.zip  Extracted affiliations, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
1.049.391  author.zip  Extracted author names, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
53.147  author_affiliation.zip  Extracted mappings between authors and affiliations, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
31.103  author_email.zip  Extracted mappings between author names and email addresses, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
30.850.371  chunk.zip  Extracted chunks, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
2.806.382  citation.zip  Extracted citations, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
596.417  citation_author.zip  Extracted mappings between authors and citations, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
1.960.407  content_content.zip  Extracted content mappings, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
1.043.264  content_type.zip  Extracted contents marked by their type, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
107.218.828  dependency.zip  Extracted syntactic relations between words, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
59.717  document.zip  Extracted documents, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
250.900  document_citation.zip  Extracted document citations, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
85.957  email.zip  Extracted email addresses, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
132.992  equation_caption.zip  Extracted equations, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
77.716  figure_caption.zip  Extracted figure captions, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
17.150.076  lexicon.zip  Extracted lexemes (part-of-speech tagged, lemmatized words), grouped by the year of publication of their source documents. For the structure of the records see the description given above.
5.570.395  paragraph_sentence.zip  Extracted mapping between paragraphs and sentences, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
458.957  section.zip  Extracted text sections, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
68.211  section_figure.zip  Extracted mappings between sections and figure captions, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
47.434  section_table.zip  Extracted mappings between sections and tables, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
1.706.071  sentence.zip  Extracted sentences, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
86.363.133  sentence_chunk.zip  Extracted mappings between sentences and chunks, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
226.439.393  sentence_dependency_parse.zip  Extracted mappings between syntactic dependencies and sentences, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
139.609.923  sentence_lexeme.zip  Extracted mappings between sentences and lexemes, grouped by the year of publication of their source documents. For the structure of the records see the description given above.
53.179  table_caption.zip  Extracted table captions, grouped by the year of publication of their source documents. For the structure of the records see the description given above.

Directory contains 1.247.630.872 Bytes in 26 Files

Example of Records in the SEPID CORPUS Index Files

You can explore the index files' structure in the truncated example files listed below.


Index of: sepid_corpus_examples/


Size:  Name:  Description:
27.501  _affiliation  Example of Affiliation Index File
153.585  _author  Example of Author Index File
9.369  _author_affiliation  Example of Mapping Between Author and Affiliation Index Files
4.761  _author_email  Example of Mapping Between Author and Email Index Files
395.388  _chunk  Example of Chunk Index File
8.856  _citation  Example of Citation Index File
2.421  _citation_author  Example of Mapping Between Citation and Author Index Files
20.783  _content_content  Example of Mapping Between Contents and Sub-Contents Index Files
22.735  _content_type  Example of Content Index File
4.806  _dependency  Example of Syntactic Dependency Index File
378  _document  Example of Document Index File
1.451  _document_citation  Example of Mapping Between Document and Citation Index Files
4.985  _email  Example of Email Index File
1.401  _equation_caption  Example of Equation Index File
1.564  _figure_caption  Example of Figure Caption Index File
5.017  _lexicon  Example of Lexeme Index File
3.463  _paragraph_sentence  Example of Mapping Between Paragraphs and Sentences
2.243  _section  Example of Section Index File
1.088  _section_figure  Example of Mapping Between Section and Figure Caption Index Files
1.492  _section_table  Example of Mapping Between Section and Table Caption Index Files
22.038  _sentence  Example of Sentence Index File
3.589  _sentence_chunk  Example of Mapping Between Sentence and Chunk Index Files
5.010  _sentence_dependency_parse  Example of Mapping Between Sentences and Indexed Syntactic Dependencies
3.306  _sentence_lexeme  Example of Mapping Between Lexeme Indices and Sentences
2.004  _table_caption  Example of Table Caption Index File

Directory contains 709.234 Bytes in 25 Files

Total: 2.629.354.610 Bytes in 85 Files
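The sizes on this page use "." as a thousands separator, so they need to be normalized before any arithmetic. As a small sketch (the function name parse_size is hypothetical), the four directory totals reported above can be parsed and summed to compare against the stated grand total:

```python
def parse_size(text):
    """Convert a size such as '1.109.010.968' to an integer number of bytes."""
    return int(text.replace(".", ""))


# Directory totals and file counts as stated on this page.
directory_sizes = ["1.109.010.968", "272.003.536", "1.247.630.872", "709.234"]
directory_file_counts = [27, 7, 26, 25]

total_bytes = sum(parse_size(size) for size in directory_sizes)
total_files = sum(directory_file_counts)
print(total_bytes == parse_size("2.629.354.610"), total_files)
```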

© Behrang QasemiZadeh Some Rights Reserved.

Creative Commons License

This page last edited on 16 October 2025.



