GitHub - code4conference/Data: Gigaword data process

Data

Detailed steps for processing the Gigaword dataset will be released soon.

Step 1: Download the Gigaword dataset (copyright requirement).

It consists of 7 sources: afp_eng, apw_eng, cna_eng, ltw_eng, nyt_eng, wpb_eng and xin_eng.

Source	#Files	Gzip-MB	Totl-MB	K-wrds	#DOCs
afp_eng	146	1732	4937	738322	2479624
apw_eng	193	2700	7889	1186955	3107777
cna_eng	144	86	261	38491	145317
ltw_eng	127	651	1694	268088	411032
nyt_eng	197	3280	8938	1422670	1962178
wpb_eng	12	42	111	17462	26143
xin_eng	191	834	2518	360714	1744025
TOTAL	1010	9325	26348	4032686	9876086

Note that the #DOCs column gives a total of 9,876,086 documents. The following is an example document released on the official Gigaword website:

We take the headline and the first sentence of each document as one instance, so in theory we should have 9,876,086 instances.

Please note that each document corresponds to an ID (in the example, doc ID: XIN_ENG_20100920.0459). Due to copyright, we will release the labels keyed by these IDs instead of the first sentences themselves.
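
As a rough illustration of how one (sentence, headline) instance could be pulled from a document, here is a minimal sketch. It assumes the documents use SGML-like `<DOC id="...">`, `<HEADLINE>`, `<TEXT>` and `<P>` markup; the sample document, its ID, the helper name `extract_instance` and the naive sentence splitter are all hypothetical and are not the procedure used to build this dataset.

```python
import re

# Assumed layout of one Gigaword document (SGML-like markup). The sample
# text and ID below are invented for illustration, not taken from the corpus.
SAMPLE_DOC = """<DOC id="XXX_ENG_20000101.0001" type="story">
<HEADLINE>
Example headline goes here
</HEADLINE>
<TEXT>
<P>
This is the first sentence of the story. This is the second one.
</P>
<P>
Another paragraph.
</P>
</TEXT>
</DOC>"""

DOC_ID = re.compile(r'<DOC\s+id="([^"]+)"')
HEADLINE = re.compile(r"<HEADLINE>(.*?)</HEADLINE>", re.DOTALL)
FIRST_PARAGRAPH = re.compile(r"<P>(.*?)</P>", re.DOTALL)


def extract_instance(doc):
    """Return (doc_id, first_sentence, headline), or None if a field is missing."""
    doc_id = DOC_ID.search(doc)
    headline = HEADLINE.search(doc)
    paragraph = FIRST_PARAGRAPH.search(doc)
    if not (doc_id and headline and paragraph):
        return None
    # Collapse line breaks and repeated spaces inside each field.
    headline_text = " ".join(headline.group(1).split())
    paragraph_text = " ".join(paragraph.group(1).split())
    # Naive sentence split (an assumption): cut after the first ., ! or ?
    # that is followed by whitespace.
    first_sentence = re.split(r"(?<=[.!?])\s", paragraph_text, maxsplit=1)[0]
    return doc_id.group(1), first_sentence, headline_text
```

Keeping the document ID next to each pair is what allows released labels to be matched back to a locally extracted sentence.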

Step 2: Data Cleansing

To exclude the effect of unusual punctuation, we keep a (sentence, headline) pair only if the sentence contains nothing but letters (a-z, A-Z), commas (","), periods (".") and "'s".
We also filter out (sentence, headline) pairs whose sentence length is less than 10 or greater than 60. Sentences that are already very short do not need compression.

After Step 2, around 1.02 million (sentence, headline) pairs should remain.
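
The two filters above can be sketched as a single predicate. This is only one possible reading: treating spaces as allowed characters and measuring length in whitespace-separated words are assumptions, since the unit of length is not stated here, and the name `keep_pair` is hypothetical.

```python
import re

# Allowed material: letters, commas, periods, spaces and the possessive "'s".
# Allowing spaces and counting length in whitespace-separated words are
# assumptions made for this sketch.
ALLOWED = re.compile(r"(?:[A-Za-z,. ]|'s\b)+")
MIN_WORDS, MAX_WORDS = 10, 60


def keep_pair(sentence):
    """Return True if the sentence passes both Step 2 filters."""
    # Character filter: reject digits, quotes, hyphens and other symbols.
    if not ALLOWED.fullmatch(sentence):
        return False
    # Length filter: drop sentences shorter than 10 or longer than 60 words.
    n_words = len(sentence.split())
    return MIN_WORDS <= n_words <= MAX_WORDS
```

Under these assumptions, a sentence containing a digit or a hyphen is rejected by the character filter before its length is ever checked.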

Human Annotated Data

We asked human annotators to create 200 extractive compressions, one for each of the first 200 sentences.
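
An extractive compression keeps a subset of the original words in their original order. The sketch below shows the idea with one keep/delete flag per word. The format of the label files is not described here, so the 0/1 flags, the example sentence and the helper name `apply_compression` are purely illustrative assumptions.

```python
def apply_compression(tokens, keep):
    """Return the tokens whose keep flag is 1, in their original order."""
    if len(tokens) != len(keep):
        raise ValueError("need exactly one flag per token")
    return [tok for tok, flag in zip(tokens, keep) if flag]


# Invented sentence and flags, for illustration only.
tokens = "The council approved the new budget on Monday".split()
keep = [1, 1, 1, 1, 0, 1, 0, 0]
compressed = " ".join(apply_compression(tokens, keep))
```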

Files in this repository:

1.02M sentences in Gigaword.txt
LDC2011T07.png
README.md
first 200 sentences.txt
first 200-sentences labels.txt
first 200-sentences labels_1.txt
