BERT (language model)

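BERT (Bidirectional Encoder Representations from Transformers) is a family of language models introduced in 2018 by researchers at Google^[1]^[2]. A 2020 literature survey concluded that, in little over a year, BERT had become a ubiquitous baseline in natural language processing (NLP) experiments, with over 150 research publications analyzing and improving the model^[3].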

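Background

Directionality constraint

Before BERT, many language models used unidirectional tasks for pre-training^[4], so the representations they learned took into account context from one direction only. This constraint can heavily penalize performance on tasks that require context-level representations.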

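Architecture

Because MLM (described below) makes it possible to train a model that depends on context in both directions, BERT adopts a bidirectional Transformer architecture (the bidirectional encoder^[5] of the Transformer) as its network^[6]. That is, the network repeatedly applies self-attention, which draws in the preceding and following context, followed by a position-wise fully connected layer that transforms the representation at each position.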
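The difference between bidirectional and left-to-right attention can be illustrated with a minimal sketch. It is a simplification, not BERT's actual layer: the query, key, and value projections are assumed to be the identity, there is a single head, and the toy vectors are made up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors, causal=False):
    """Single-head self-attention with identity projections (a simplification).

    causal=False: every position attends to every position (encoder style).
    causal=True:  position i attends only to positions <= i (decoder style).
    """
    dim = len(vectors[0])
    outputs = []
    for i, query in enumerate(vectors):
        visible = range(i + 1) if causal else range(len(vectors))
        scores = [sum(q * k for q, k in zip(query, vectors[j])) / math.sqrt(dim)
                  for j in visible]
        weights = softmax(scores)
        outputs.append([sum(w * vectors[j][d] for w, j in zip(weights, visible))
                        for d in range(dim)])
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy embeddings, not real BERT vectors
bidirectional = self_attention(tokens, causal=False)
left_only = self_attention(tokens, causal=True)
```

With the causal setting, the first position can see only itself, so its output is its own input vector; in the bidirectional setting the same position also mixes in the tokens to its right.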

BERT uses WordPiece for tokenization, converting each English word into integer codes. Its vocabulary has 30,000 entries. Any token not in the vocabulary is replaced by [UNK], meaning "unknown".
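As a rough illustration of this lookup, the sketch below applies greedy longest-match-first splitting, the strategy WordPiece tokenizers use at inference time, to a tiny vocabulary. The vocabulary, the `encode` helper, and the lower-casing and whitespace splitting are all assumptions made for the example; the real vocabulary has about 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes [UNK]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary mapping each piece to its integer code.
vocab = {"[UNK]": 0, "my": 1, "dog": 2, "is": 3, "cute": 4, "play": 5, "##ing": 6}

def encode(text):
    """Lower-case, split on whitespace, and map every piece to its integer code."""
    tokens = [p for w in text.lower().split() for p in wordpiece_tokenize(w, vocab)]
    return tokens, [vocab[t] for t in tokens]

tokens, ids = encode("my dog is playing xyz")
```

Here "playing" is split into `play` and `##ing`, while "xyz" matches nothing in the toy vocabulary and becomes `[UNK]`.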

BERT was pre-trained on two tasks simultaneously^[7].

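Bidirectional task / MLM

To build a bidirectional language model free of the unidirectionality constraint, BERT adopts the masked language model (MLM) as its pre-training task and loss function^[8]. In MLM, a partially masked sequence is given as input, the model predicts the unmasked sequence, and training measures how well the outputs at the masked positions match the original tokens^[9]. The model therefore solves a pre-training task in which it must predict the masked parts from the unmasked information (the surrounding context) alone^[10].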

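15% of the tokens were selected for prediction, and the training objective was to predict each selected token given its context. A selected token was

replaced with a [MASK] token with probability 80%,
replaced with a random word token with probability 10%,
left unchanged with probability 10%.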
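The selection rule above can be sketched as follows. The function name `mask_tokens`, the toy vocabulary, and the use of an explicit random generator are assumptions made for illustration; this is not the original training code.

```python
import random

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted tokens, {position: original token}) for MLM training."""
    corrupted = list(tokens)
    targets = {}
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                          # not selected for prediction
        targets[i] = token                    # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"           # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, targets

vocab = ["my", "dog", "is", "cute", "happy", "cat"]  # toy vocabulary
corrupted, targets = mask_tokens(["my", "dog", "is", "cute"], vocab, random.Random(0))
```

Only the positions recorded in `targets` contribute to the loss; every other position passes through unchanged.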

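For example, in the sentence "my dog is cute", the fourth token might be selected for prediction. The input text to the model would then be

"my dog is [MASK]" with probability 80%,
"my dog is happy" with probability 10%,
"my dog is cute" with probability 10%.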

After processing the input text, the model's fourth output vector is passed to a separate neural network, which outputs a probability distribution over the 30,000-entry vocabulary.
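A minimal sketch of this prediction step is given below. It assumes the separate network is a single linear layer followed by a softmax and that training uses cross-entropy against the original token; the five-word vocabulary, the output vector, and the weights are made-up values for illustration.

```python
import math

def mlm_head(hidden, weight, bias):
    """Map one output vector to a probability distribution over the vocabulary."""
    logits = [sum(h * w for h, w in zip(hidden, row)) + b
              for row, b in zip(weight, bias)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["my", "dog", "is", "cute", "happy"]   # toy vocabulary
hidden = [0.5, -1.0, 2.0]                      # made-up fourth output vector
weight = [[0.1, 0.0, -0.2], [0.0, 0.3, 0.1], [0.2, 0.1, 0.0],
          [0.4, -0.5, 0.9], [0.3, -0.2, 0.6]]  # one row per vocabulary entry
bias = [0.0] * len(vocab)

probs = mlm_head(hidden, weight, bias)
predicted = vocab[probs.index(max(probs))]
loss = -math.log(probs[vocab.index("cute")])   # cross-entropy against the original token
```

With these toy weights the highest-probability entry is "cute", and the loss shrinks as the probability assigned to the original token grows.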

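Next sentence prediction

Given two spans, the model predicts whether they appear consecutively in the training corpus and outputs either [IsNext] or [NotNext]. The first span begins with the special token [CLS] (for "classify"). The two spans are separated by the special token [SEP] (for "separate"). After the two spans are processed, the first output vector (that is, the vector encoding [CLS]) is passed to a separate neural network, which performs a binary classification into [IsNext] or [NotNext].

For example, given "[CLS] my dog is cute [SEP] he likes playing", the model should output the token [IsNext].
For example, given "[CLS] my dog is cute [SEP] how do magnets work", the model should output the token [NotNext].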
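Training pairs for this task could be built roughly as follows. The function name and the four-sentence corpus are hypothetical, the string layout follows the examples above, and the 50/50 split between true and random successors comes from the original paper rather than from the text here.

```python
import random

def make_nsp_example(sentences, index, rng):
    """Build one next-sentence-prediction example starting at sentences[index]."""
    first = sentences[index]
    if rng.random() < 0.5:
        # Positive example: the span that really follows in the corpus.
        second, label = sentences[index + 1], "IsNext"
    else:
        # Negative example: any span other than the first one or its true successor.
        choices = [s for i, s in enumerate(sentences) if i not in (index, index + 1)]
        second, label = rng.choice(choices), "NotNext"
    return f"[CLS] {first} [SEP] {second}", label

corpus = ["my dog is cute", "he likes playing",
          "how do magnets work", "they attract iron"]
text, label = make_nsp_example(corpus, 0, random.Random(0))
```

The classifier on top of the [CLS] vector is then trained to recover `label` from `text`.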

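As a result of this training process, BERT learns latent representations of words and sentences in context. After pre-training, BERT can be fine-tuned on smaller datasets with fewer resources to optimize its performance on specific tasks, such as NLP tasks (language understanding, text classification) and sequence-to-sequence language generation tasks (question answering, conversational response generation)^[1]^[11]. The pre-training stage is far more computationally expensive than fine-tuning.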

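Performance

BERT was originally implemented for English in two model sizes^[1]:

BERT_BASE: 12 encoder layers, each with 12 bidirectional self-attention heads, for a total of 110 million parameters,
BERT_LARGE: 24 encoder layers, each with 16 bidirectional self-attention heads, for a total of 340 million parameters.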
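The parameter totals can be checked with some back-of-the-envelope arithmetic. The hidden sizes (768 and 1024), the feed-forward width of four times the hidden size, the 512 positions, the two segment types, and the 30,522-entry vocabulary are not stated above; they are assumptions taken from the original paper and the released English checkpoints.

```python
def bert_parameter_count(layers, hidden, vocab_size=30522, max_positions=512):
    """Approximate parameter count of a BERT encoder (weights plus biases)."""
    ffn = 4 * hidden                                  # assumed feed-forward width
    layer_norm = 2 * hidden                           # scale and shift vectors
    # Token, position and 2 segment-type embeddings, plus one LayerNorm.
    embeddings = (vocab_size + max_positions + 2) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)        # query, key, value, output
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    pooler = hidden * hidden + hidden                 # dense layer on the [CLS] vector
    return embeddings + layers * per_layer + pooler

base = bert_parameter_count(layers=12, hidden=768)
large = bert_parameter_count(layers=24, hidden=1024)
print(f"BERT_BASE:  {base / 1e6:.1f}M")
print(f"BERT_LARGE: {large / 1e6:.1f}M")
```

Under these assumptions the counts come out at roughly 109 million and 335 million, close to the rounded figures quoted above. The number of attention heads does not enter the calculation, because the heads only partition the hidden dimension.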

Both models were pre-trained on the Toronto BookCorpus^[12] (800 million words) and English Wikipedia (2,500 million words).

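When BERT was published, it achieved state-of-the-art performance on a number of natural language understanding tasks^[1]:

GLUE (General Language Understanding Evaluation) task set (consisting of 9 tasks)
SQuAD (Stanford Question Answering Dataset^[13]) v1.1 and v2.0
SWAG (Situations With Adversarial Generations^[14])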

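Analysis

The reasons for BERT's state-of-the-art performance on these natural language understanding tasks are not yet well understood^[15]^[16]. Current research focuses on investigating the relationships behind BERT's output through carefully chosen input sequences^[17]^[18], analysis of internal vector representations with probing classifiers^[19]^[20], and the relationships represented by attention weights^[15]^[16].

BERT's high performance may also be attributable to the fact that it is trained bidirectionally. Because BERT, which is based on the Transformer architecture, uses its self-attention mechanism to learn information from both the left and the right side of the text during training, it can gain a deep understanding of context. For example, the word fine can have two different meanings depending on the context: "I feel fine today" and "She has fine blond hair". BERT looks at the word sequence surrounding the target word fine from both the left and the right.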

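However, this comes at a cost. Because its architecture is encoder-only, with no decoder, BERT cannot be prompted and cannot generate text. Bidirectional models in general do not work effectively without the right-hand context, which makes them difficult to prompt, and generating even short texts requires sophisticated, computationally expensive techniques^[21].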

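In contrast to deep learning neural networks that require very large amounts of data to train, BERT has already been pre-trained; that is, it has learned representations of words and sentences as well as the basic semantic relations that connect them. BERT can then be fine-tuned on a smaller dataset for a specific task such as sentiment classification. The choice of pre-trained model therefore takes into account not only the content of the dataset used but also the goal of the task. For example, for a sentiment classification task on financial data, a model pre-trained for sentiment analysis of financial text should be chosen. The weights of the original pre-trained models are published on GitHub^[22].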

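History

BERT was originally published by Google researchers Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Its design has its origins in the pre-training of contextual representations, including semi-supervised sequence learning^[23], generative pre-training, ELMo^[24], and ULMFit^[25]. Unlike previous models, BERT is a fully bidirectional, unsupervised language representation, pre-trained using only a plain text corpus. Context-free models such as word2vec and GloVe generate a single word embedding for each word in the vocabulary, whereas BERT takes the context into account for each occurrence of a given word. For example, for the two sentences "He is running a company" and "He is running a marathon", word2vec gives "running" the same vector representation in both, while BERT produces a contextualized embedding that differs from sentence to sentence.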

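On October 25, 2019, Google Search announced that it had begun applying BERT models to English-language search queries within the US^[26]. On December 9, 2019, it was reported that BERT had been adopted by Google Search for more than 70 languages^[27]. By October 2020, almost every English-language query was processed by BERT^[28].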

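Recognition

The research paper describing BERT won a Best Paper Award at the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)^[29].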

Footnotes

References

[1] Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (11 October 2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805v2 [cs.CL].
[2] "Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing". Google AI Blog (2 November 2018). Retrieved 27 November 2019.
[3] Rogers, Anna; Kovaleva, Olga; Rumshisky, Anna (2020). "A Primer in BERTology: What We Know About How BERT Works". Transactions of the Association for Computational Linguistics 8: 842–866. arXiv:2002.12327. doi:10.1162/tacl_a_00349.
[4] "objective function during pre-training, where they use unidirectional language models to learn general language representations" Devlin (2018)
[5] "Critically ... the BERT Transformer uses bidirectional self-attention ... We note that in the literature the bidirectional Transformer is often referred to as a 'Transformer encoder' while the left-context-only version is referred to as a 'Transformer decoder' since it can be used for text generation." Devlin (2018)
[6] "the MLM objective enables the representation to fuse the left and the right context, which allows us to pretrain a deep bidirectional Transformer." Devlin (2018)
[7] "Summary of the models - transformers 3.4.0 documentation". huggingface.co. Retrieved 16 February 2023.
[8] "BERT alleviates the previously mentioned unidirectionality constraint by using a 'masked language model' (MLM) pre-training objective" Devlin (2018)
[9] "The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word" Devlin (2018)
[10] "predict the original vocabulary id of the masked word based only on its context." Devlin (2018)
[11] "BERT Explained: State of the art language model for NLP". Towards Data Science (2018). Retrieved 27 September 2021.
[12] Zhu, Yukun; Kiros, Ryan; Zemel, Rich; Salakhutdinov, Ruslan; Urtasun, Raquel; Torralba, Antonio; Fidler, Sanja (2015). "Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books". pp. 19–27. arXiv:1506.06724 [cs.CV].
[13] Rajpurkar, Pranav; Zhang, Jian; Lopyrev, Konstantin; Liang, Percy (10 October 2016). "SQuAD: 100,000+ Questions for Machine Comprehension of Text". arXiv:1606.05250 [cs.CL].
[14] Zellers, Rowan; Bisk, Yonatan; Schwartz, Roy; Choi, Yejin (15 August 2018). "SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference". arXiv:1808.05326 [cs.CL].
[15] Kovaleva, Olga; Romanov, Alexey; Rogers, Anna; Rumshisky, Anna (November 2019). "Revealing the Dark Secrets of BERT". Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 4364–4373. doi:10.18653/v1/D19-1445.
[16] Clark, Kevin; Khandelwal, Urvashi; Levy, Omer; Manning, Christopher D. (2019). "What Does BERT Look at? An Analysis of BERT's Attention". Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP (Stroudsburg, PA, USA: Association for Computational Linguistics): 276–286. doi:10.18653/v1/w19-4828.
[17] Khandelwal, Urvashi; He, He; Qi, Peng; Jurafsky, Dan (2018). "Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context". Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Stroudsburg, PA, USA: Association for Computational Linguistics): 284–294. arXiv:1805.04623. doi:10.18653/v1/p18-1027.
[18] Gulordava, Kristina; Bojanowski, Piotr; Grave, Edouard; Linzen, Tal; Baroni, Marco (2018). "Colorless Green Recurrent Networks Dream Hierarchically". Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (Stroudsburg, PA, USA: Association for Computational Linguistics): 1195–1205. arXiv:1803.11138. doi:10.18653/v1/n18-1108.
[19] Giulianelli, Mario; Harding, Jack; Mohnert, Florian; Hupkes, Dieuwke; Zuidema, Willem (2018). "Under the Hood: Using Diagnostic Classifiers to Investigate and Improve how Language Models Track Agreement Information". Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP (Stroudsburg, PA, USA: Association for Computational Linguistics): 240–248. arXiv:1808.08079. doi:10.18653/v1/w18-5426.
[20] Zhang, Kelly; Bowman, Samuel (2018). "Language Modeling Teaches You More than Translation Does: Lessons Learned Through Auxiliary Syntactic Task Analysis". Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP (Stroudsburg, PA, USA: Association for Computational Linguistics): 359–361. doi:10.18653/v1/w18-5448.
[21] Patel, Ajay; Li, Bryan; Rasooli, Mohammad Sadegh; Constant, Noah; Raffel, Colin; Callison-Burch, Chris (2022). "Bidirectional Language Models Are Also Few-shot Learners". arXiv:2209.14500 [cs.LG].
[22] "BERT". GitHub. Retrieved 28 March 2023.
[23] Dai, Andrew; Le, Quoc (4 November 2015). "Semi-supervised Sequence Learning". arXiv:1511.01432 [cs.LG].
[24] Peters, Matthew; Neumann, Mark; Iyyer, Mohit; Gardner, Matt; Clark, Christopher; Lee, Kenton; Zettlemoyer, Luke (15 February 2018). "Deep contextualized word representations". arXiv:1802.05365v2 [cs.CL].
[25] Howard, Jeremy; Ruder, Sebastian (18 January 2018). "Universal Language Model Fine-tuning for Text Classification". arXiv:1801.06146v5 [cs.CL].
[26] Nayak (25 October 2019). "Understanding searches better than ever before". Google Blog. Retrieved 10 December 2019.
[27] Montti (10 December 2019). "Google's BERT Rolls Out Worldwide". Search Engine Journal. Retrieved 10 December 2019.
[28] "Google: BERT now used on almost every English query". Search Engine Land (15 October 2020). Retrieved 24 November 2020.
[29] "Best Paper Awards". NAACL (2019). Retrieved 28 March 2020.
