File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/06/w06-1639_intro.xml
Size: 3,312 bytes
Last Modified: 2025-10-06 14:04:00
<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1639"> <Title>floor-debate transcripts</Title> <Section position="4" start_page="328" end_page="328" type="intro"> <SectionTitle> 2 Corpus </SectionTitle> <Paragraph position="0"> This section outlines the main steps of the process by which we created our corpus (download site: www.cs.cornell.edu/home/llee/data/convote.html).</Paragraph> <Paragraph position="1"> GovTrack (http://govtrack.us) is an independent website run by Joshua Tauberer that collects publicly available data on the legislative and fundraising activities of U.S. congresspeople. Due to its extensive cross-referencing and collating of information, it was nominated for a 2006 &quot;Webby&quot; award. A crucial characteristic of GovTrack from our point of view is that the information is provided in a very convenient format; for instance, the floor-debate transcripts are broken into separate HTML files according to the subject of the debate, so we can trivially derive long sequences of speeches guaranteed to cover the same topic.</Paragraph> <Paragraph position="2"> We extracted from GovTrack all available transcripts of U.S. floor debates in the House of Representatives for the year 2005 (3268 pages of transcripts in total), together with voting records for all roll-call votes during that year. We concentrated on debates regarding &quot;controversial&quot; bills (ones in which the losing side generated at least 20% of the speeches) because these debates should presumably exhibit more interesting discourse structure.</Paragraph> <Paragraph position="3"> Each debate consists of a series of speech segments, where each segment is a sequence of uninterrupted utterances by a single speaker. 
Since speech segments represent natural discourse units, we treat them as the basic unit to be classified.</Paragraph> <Paragraph position="4"> Each speech segment was labeled by the vote (&quot;yea&quot; or &quot;nay&quot;) cast for the proposed bill by the person who uttered the speech segment.</Paragraph> <Paragraph position="5"> We automatically discarded those speech segments belonging to a class of formulaic, generally one-sentence utterances focused on the yielding of time on the House floor (for example, &quot;Madam Speaker, I am pleased to yield 5 minutes to the gentleman from Massachusetts&quot;), as such speech segments are clearly off-topic. We also removed speech segments containing the term &quot;amendment&quot;, since we found during initial inspection that these speeches generally reflect a speaker's opinion on an amendment, and this opinion may differ from the speaker's opinion on the underlying bill under discussion.</Paragraph> <Paragraph position="6"> We randomly split the data into training, test, and development (parameter-tuning) sets representing roughly 70%, 20%, and 10% of our data, respectively (see Table 1). The speech segments remained grouped by debate, with 38 debates assigned to the training set, 10 to the test set, and 5 to the development set; we require that the speech segments from an individual debate all appear in the same set because our goal is to examine classification of speech segments in the context of the surrounding discussion.</Paragraph> </Section> </Paper>
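The filtering and splitting steps described in the Corpus section can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the `(debate_id, text, vote)` tuple layout, the case-insensitive match on "amendment", and the rounding used to turn the 70/20/10 proportions into debate counts are all assumptions made for the example. The rule for discarding formulaic time-yielding utterances is not specified in the text, so it is left out.

```python
import random
from collections import defaultdict


def is_controversial(speech_votes, losing_vote, threshold=0.20):
    """Return True if the losing side produced at least `threshold` of the speeches.

    `speech_votes` holds one "yea"/"nay" label per speech; `losing_vote` is the
    side that lost the roll-call vote (both inputs are hypothetical formats).
    """
    if not speech_votes:
        return False
    losing = sum(1 for vote in speech_votes if vote == losing_vote)
    return losing / len(speech_votes) >= threshold


def keep_segment(text):
    """Drop segments containing the term "amendment".

    Matching case-insensitively is an assumption; the text only says
    'containing the term'.
    """
    return "amendment" not in text.lower()


def split_by_debate(segments, fractions=(0.7, 0.2, 0.1), seed=0):
    """Split (debate_id, text, vote) tuples into train/test/dev lists.

    Whole debates are assigned to a set, so segments from one debate never
    end up in different sets.
    """
    by_debate = defaultdict(list)
    for segment in segments:
        by_debate[segment[0]].append(segment)

    debate_ids = sorted(by_debate)
    random.Random(seed).shuffle(debate_ids)

    n_train = round(fractions[0] * len(debate_ids))
    n_test = round(fractions[1] * len(debate_ids))
    groups = (
        debate_ids[:n_train],
        debate_ids[n_train:n_train + n_test],
        debate_ids[n_train + n_test:],  # remainder becomes the development set
    )
    return tuple(
        [segment for debate_id in ids for segment in by_debate[debate_id]]
        for ids in groups
    )
```

Splitting on debate identifiers rather than on individual segments is what keeps each debate's speeches together. Because the cut is made over debates, the proportions of speech segments in each set only approximately match the target fractions, which is consistent with the "roughly" in the description.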