Kočiský et al. (2017) created an awesome dataset that significantly extends their earlier dataset built from CNN and Daily Mail news articles. The new dataset focuses on a much larger context: an entire book or the full script of a movie.

The authors kindly released the scripts and accompanying documentation for generating the dataset (see here). Unfortunately, building a dataset from the web has a well-known issue: the internet is dynamic, and web pages change frequently, leading to inconsistencies between datasets created at different points in time.

I don't know (and perhaps don't want to know) why DeepMind couldn't release the full data with the crawled web pages, but I believe it is important to have one copy against which researchers all over the world can evaluate their approaches. Thus, I am making the generated dataset (or rather, the downloaded content) available here. I hope this allows the dataset to be used by a wider audience and leads to faster progress in Q&A research.

Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, Edward Grefenstette.
The NarrativeQA Reading Comprehension Challenge.
Transactions of the Association for Computational Linguistics. (to appear)

NarrativeQA

This tarball includes all the downloaded stories (and associated log files) in the ./narrativeqa/tmp directory. The files were downloaded using ./narrativeqa/download_stories.sh on Dec 20, 2017. ./narrativeqa/compare.sh was then run to check the size differences between the original files (used by the authors) and the downloaded files. According to this check, the differences are under 3.5%, except for article id 37c11f984cb14401d85abfc20e8305ca7a472c9f, for which the difference is almost 99%.
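The size check above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual compare.sh: the story ids and byte sizes below are made up, and a real check would read the recorded sizes from the dataset's documentation files.

```python
# Sketch of a size-difference check between the original files and the
# downloaded copies. All sizes and ids here are illustrative assumptions.

def size_diff_pct(original: int, downloaded: int) -> float:
    """Absolute size difference as a percentage of the original size."""
    return abs(original - downloaded) / original * 100.0

# made-up sizes in bytes, keyed by (shortened, hypothetical) story id
expected = {"abc123": 100_000, "37c11f9": 250_000}
actual = {"abc123": 98_000, "37c11f9": 2_600}

THRESHOLD = 3.5  # percent; the tolerance mentioned above

for story_id, orig in expected.items():
    pct = size_diff_pct(orig, actual[story_id])
    if pct > THRESHOLD:
        # a large deviation suggests the page changed (or vanished) on the web
        print(f"{story_id}: {pct:.1f}% difference")
```

A story whose source page was removed or heavily altered would show up here with a difference near 100%, as with the one outlier article noted above.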


Kyunghyun Cho