BookCorpus

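BookCorpus, also known as the Toronto Book Corpus, is a dataset consisting of the text of about 11,000 unpublished books collected from the Internet. It was the main corpus used to train GPT, an early language model by OpenAI^[1], and it also served as training data for other early large language models, including Google's BERT^[2]. The dataset contains about 985 million words and spans books in a wide range of genres, including romance, science fiction, and fantasy^[2].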

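The corpus was released in a 2015 paper by researchers at the University of Toronto and the Massachusetts Institute of Technology, "Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books". The authors described it as consisting of "free books written by yet unpublished authors"^[3]^[4]. The dataset was initially distributed from a University of Toronto web page^[4]. The official version of the original dataset is no longer publicly available, and BookCorpusOpen was created as a replacement^[5]. Although the original 2015 paper does not mention it, the books in the corpus are known to have been collected from the site Smashwords^[4]^[5].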

References

[1] "Improving Language Understanding by Generative Pre-Training". Archived from the original on 26 January 2021. Retrieved 9 June 2020.
[2] Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (11 October 2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805v2 [cs.CL].
[3] Zhu, Yukun; Kiros, Ryan; Zemel, Rich; Salakhutdinov, Ruslan; Urtasun, Raquel; Torralba, Antonio; Fidler, Sanja (2015). "Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books". Proceedings of the IEEE International Conference on Computer Vision (ICCV).
[4] Lea, Richard (28 September 2016). "Google swallows 11,000 novels to improve AI's conversation". The Guardian. Retrieved 9 March 2023.
[5] Bandy, John; Vincent, Nicholas (2021). "Addressing "Documentation Debt" in Machine Learning: A Retrospective Datasheet for BookCorpus" (PDF). Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks.

