code4conference/Data

Data

Note: detailed steps for processing the Gigaword dataset will be released soon.

Step 1: Download the Gigaword dataset yourself (copyright prevents us from redistributing it).

  • It consists of 7 sources (afp_eng, apw_eng, cna_eng, ltw_eng, nyt_eng, wpb_eng, xin_eng), each split across multiple files:
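| Source  | #Files | Gzip-MB | Total-MB | K-words | #DOCs   |
|---------|--------|---------|----------|---------|---------|
| afp_eng | 146    | 1732    | 4937     | 738322  | 2479624 |
| apw_eng | 193    | 2700    | 7889     | 1186955 | 3107777 |
| cna_eng | 144    | 86      | 261      | 38491   | 145317  |
| ltw_eng | 127    | 651     | 1694     | 268088  | 411032  |
| nyt_eng | 197    | 3280    | 8938     | 1422670 | 1962178 |
| wpb_eng | 12     | 42      | 111      | 17462   | 26143   |
| xin_eng | 191    | 834     | 2518     | 360714  | 1744025 |
| TOTAL   | 1010   | 9325    | 26348    | 4032686 | 9876086 |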
  • Note that the #DOCs column totals 9,876,086 documents. The following is an example document released by the official Gigaword website:

  • We take the headline and the very first sentence of each document as one instance, so in theory there should be 9,876,086 instances.

Please note that each document corresponds to an ID (in the example, doc ID: XIN_ENG_20100920.0459). Because of copyright, we will release the corresponding labels keyed by these IDs instead of the first sentences themselves.
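As a rough illustration of Step 1, the sketch below pulls a (doc ID, headline, first sentence) triple out of each document. It assumes the SGML-like layout (`<DOC id="...">`, `<HEADLINE>`, `<TEXT>`, `<P>`) that Gigaword files are generally distributed in; that layout is not shown in this README, so treat it as an assumption. The function name `extract_instances` and the naive sentence splitter are hypothetical and are not the authors' actual pipeline.

```python
import re

# Assumed layout (not shown in this README):
# <DOC id="..." type="story"> <HEADLINE>...</HEADLINE> <TEXT> <P>...</P> ... </TEXT> </DOC>
DOC_RE = re.compile(r'<DOC id="([^"]+)"[^>]*>(.*?)</DOC>', re.DOTALL)
HEADLINE_RE = re.compile(r"<HEADLINE>(.*?)</HEADLINE>", re.DOTALL)
FIRST_PARA_RE = re.compile(r"<P>(.*?)</P>", re.DOTALL)
# Naive sentence boundary: whitespace that follows '.', '!' or '?'.
# This will mis-split abbreviations; a real splitter would do better.
SENT_END_RE = re.compile(r"(?<=[.!?])\s+")


def normalize(text):
    """Collapse newlines and repeated spaces into single spaces."""
    return " ".join(text.split())


def extract_instances(raw):
    """Yield (doc_id, headline, first_sentence) for each document in raw text."""
    for doc_id, body in DOC_RE.findall(raw):
        headline = HEADLINE_RE.search(body)
        para = FIRST_PARA_RE.search(body)
        if headline is None or para is None:
            continue  # skip documents without a headline or body paragraph
        first_paragraph = normalize(para.group(1))
        first_sentence = SENT_END_RE.split(first_paragraph, maxsplit=1)[0]
        yield doc_id, normalize(headline.group(1)), first_sentence
```

Since the table lists the files as Gzip-compressed, one would typically read each file with `gzip.open(path, "rt")` and pass its contents to `extract_instances`.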

Step 2: Data Cleansing

  • To exclude the effect of unusual punctuation, we keep a (sentence, headline) pair only if the sentence contains nothing but letters (a-z, A-Z), commas (","), periods (".") and the possessive "'s".

  • Filter out (sentence, headline) pairs whose sentence length is less than 10 or greater than 60. A sentence that is already short does not need compression.

After Step 2, around 1.02 million (sentence, headline) pairs should remain.
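The two filters in Step 2 can be sketched as follows. This is a minimal illustration, not the authors' released code. It assumes that length is counted in whitespace-separated tokens (the README does not state the unit), that the bounds 10 and 60 are inclusive, and that whitespace between tokens is allowed. The function names are hypothetical.

```python
import re

# Letters, commas, periods and whitespace; "'s" is handled separately below.
ALLOWED_RE = re.compile(r"[A-Za-z,.\s]+")

MIN_LEN = 10  # sentences shorter than this are dropped
MAX_LEN = 60  # sentences longer than this are dropped


def is_clean_sentence(sentence):
    """True if the sentence has only letters, ',', '.', "'s" and whitespace."""
    # Remove the permitted possessive marker first, then check what is left.
    stripped = sentence.replace("'s", "")
    return ALLOWED_RE.fullmatch(stripped) is not None


def has_valid_length(sentence):
    """True if the token count lies in [MIN_LEN, MAX_LEN] (assumed unit: tokens)."""
    n_tokens = len(sentence.split())
    return MIN_LEN <= n_tokens <= MAX_LEN


def keep_pair(sentence, headline):
    """Decide whether a (sentence, headline) pair survives Step 2."""
    # Only the sentence is inspected, as described in the two rules above.
    return is_clean_sentence(sentence) and has_valid_length(sentence)
```

Applying `keep_pair` to every instance from Step 1 and keeping the pairs for which it returns `True` gives the cleaned set. The exact count depends on the tokenization and length unit chosen, so it may not match the figure above exactly.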

Human Annotated Data
