Multi-armed bandit

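In probability theory and machine learning, the multi-armed bandit problem^[1] (sometimes called the K- or N-armed bandit problem^[2]) is a problem in which a fixed, limited set of resources must be allocated among competing (alternative) choices in a way that maximizes the expected gain, when each choice's properties are only partially known at the time of allocation and may become better understood as time passes or as resources are allocated to that choice. It is a classic reinforcement learning problem that exemplifies the exploration-exploitation tradeoff dilemma^[3]^[4]. The name comes from imagining a gambler sitting in front of a row of slot machines (each sometimes known as a "one-armed bandit"), who must decide which machines to play, how many times to play each machine, in which order to play them^[5], and whether to continue with the current machine or try a different one. The multi-armed bandit problem also falls into the broad category of stochastic scheduling.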

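In the problem, each machine provides a random reward drawn from a probability distribution specific to that machine, and this distribution is not known a priori. The gambler's objective is to maximize the sum of rewards earned through a sequence of lever pulls^[4]. The crucial tradeoff the gambler faces at each trial is between "exploitation" of the machine with the highest expected payoff and "exploration" to gain more information about the expected payoffs of the other machines^[3]. The same tradeoff between exploration and exploitation arises in machine learning. In practice, multi-armed bandits have been used to model problems such as managing research projects in a large organization, for example a science foundation or a pharmaceutical company^[3]^[4]. In early versions of the problem, the gambler begins with no knowledge about the machines.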
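The setting described above can be illustrated with a small simulation. The sketch below is not taken from the cited sources: it assumes Bernoulli machines (each pays 1 with a fixed hidden probability, otherwise 0) and uses the epsilon-greedy rule, one simple strategy among many, to balance exploration and exploitation. The function name `run_epsilon_greedy` and its parameters are hypothetical and chosen only for this example.

```python
import random


def run_epsilon_greedy(true_means, n_pulls, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy gambler on Bernoulli machines.

    true_means holds each machine's hidden payout probability (an
    assumption of this sketch); the gambler never reads it directly
    and only observes sampled rewards.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # number of pulls per machine
    estimates = [0.0] * n_arms   # running mean reward per machine
    total_reward = 0

    for _ in range(n_pulls):
        if rng.random() < epsilon:
            # Explore: pick a machine uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the machine with the best estimate so far.
            arm = max(range(n_arms), key=lambda a: estimates[a])

        # Draw a 0/1 reward from the chosen machine's hidden distribution.
        reward = 1 if rng.random() < true_means[arm] else 0

        # Update the running mean for that machine incrementally.
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return total_reward, counts, estimates


if __name__ == "__main__":
    total, counts, estimates = run_epsilon_greedy([0.2, 0.5, 0.8], n_pulls=5000)
    print("total reward:", total)
    print("pulls per machine:", counts)
    print("estimated payoffs:", [round(e, 2) for e in estimates])
```

With `epsilon=0` the gambler only exploits and can lock onto a poor machine; with `epsilon=1` the gambler only explores and never uses what has been learned. Intermediate values trade some immediate reward for better estimates.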

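Herbert Robbins, recognizing the importance of the problem in 1952, constructed convergent population selection strategies in "Some aspects of the sequential design of experiments"^[6]. The Gittins index theorem, first published by John C. Gittins, gives an optimal policy for maximizing the expected discounted reward^[7].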

References

[1] Auer, P.; Cesa-Bianchi, N.; Fischer, P. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning. 2002, 47 (2/3): 235–256. doi:10.1023/A:1013689704352.
[2] Katehakis, M. N.; Veinott, A. F. The Multi-Armed Bandit Problem: Decomposition and Computation. Mathematics of Operations Research. 1987, 12 (2): 262–268. S2CID 656323. doi:10.1287/moor.12.2.262.
[3] Citation error: no content was provided for the reference named Gittins89.
[4] Citation error: no content was provided for the reference named BF.
[5] Weber, Richard. On the Gittins index for multiarmed bandits. Annals of Applied Probability. 1992, 2 (4): 1024–1033. JSTOR 2959678. doi:10.1214/aoap/1177005588.
[6] Robbins, H. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society. 1952, 58 (5): 527–535. doi:10.1090/S0002-9904-1952-09620-8.
[7] Gittins, J. C. Bandit Processes and Dynamic Allocation Indices. Journal of the Royal Statistical Society. Series B (Methodological). 1979, 41 (2): 148–177. JSTOR 2985029. S2CID 17724147. doi:10.1111/j.2517-6161.1979.tb01068.x.

Further reading


Guha, S.; Munagala, K.; Shi, P., Approximation algorithms for restless bandit problems, Journal of the ACM, 2010, 58: 1–50, S2CID 1654066, arXiv:0711.3861, doi:10.1145/1870103.1870106.
Dayanik, S.; Powell, W.; Yamazaki, K., Index policies for discounted bandit problems with availability constraints, Advances in Applied Probability, 2008, 40 (2): 377–400, doi:10.1239/aap/1214950209.
Powell, Warren B., Chapter 10, Approximate Dynamic Programming: Solving the Curses of Dimensionality, New York: John Wiley and Sons, 2007, ISBN 978-0-470-17155-4.
Robbins, H., Some aspects of the sequential design of experiments, Bulletin of the American Mathematical Society, 1952, 58 (5): 527–535, doi:10.1090/S0002-9904-1952-09620-8.
Sutton, Richard; Barto, Andrew, Reinforcement Learning, MIT Press, 1998, ISBN 978-0-262-19398-6 (archived from the original on 2013-12-11).

Allesiardo, Robin, A Neural Networks Committee for the Contextual Bandit Problem, Neural Information Processing – 21st International Conference, ICONIP 2014, Malaysia, November 03-06, 2014, Proceedings, Lecture Notes in Computer Science 8834, Springer: 374–381, 2014, ISBN 978-3-319-12636-4, S2CID 14155718, arXiv:1409.8191, doi:10.1007/978-3-319-12637-1_47.

Weber, Richard, On the Gittins index for multiarmed bandits, Annals of Applied Probability, 1992, 2 (4): 1024–1033, JSTOR 2959678, doi:10.1214/aoap/1177005588.
Katehakis, M.; C. Derman, Computing optimal sequential allocation rules in clinical trials, Adaptive statistical procedures and related topics, Institute of Mathematical Statistics Lecture Notes - Monograph Series 8: 29–39, 1986, ISBN 978-0-940600-09-6, JSTOR 4355518, doi:10.1214/lnms/1215540286 .
Katehakis, M.; A. F. Veinott, Jr., The multi-armed bandit problem: decomposition and computation, Mathematics of Operations Research, 1987, 12 (2): 262–268, JSTOR 3689689, S2CID 656323, doi:10.1287/moor.12.2.262.

External links

MABWiser, open source Python implementation of bandit strategies that supports context-free, parametric and non-parametric contextual policies with built-in parallelization and simulation capability.
PyMaBandits, open source implementation of bandit strategies in Python and Matlab.
Contextual, open source R package facilitating the simulation and evaluation of both context-free and contextual Multi-Armed Bandit policies.
bandit.sourceforge.net Bandit project, open source implementation of bandit strategies.
Banditlib, open source implementation of bandit strategies in C++.
Leslie Pack Kaelbling and Michael L. Littman (1996). Exploitation versus Exploration: The Single-State Case.
Tutorial: Introduction to Bandits: Algorithms and Theory. Part1. Part2.
Feynman's restaurant problem, a classic example (with known answer) of the exploitation vs. exploration tradeoff.
Bandit algorithms vs. A-B testing.
S. Bubeck and N. Cesa-Bianchi, A Survey on Bandits.
A Survey on Contextual Multi-armed Bandits, a survey/tutorial for Contextual Bandits.
Blog post on multi-armed bandit strategies, with Python code.
Animated, interactive plots illustrating Epsilon-greedy, Thompson sampling, and Upper Confidence Bound exploration/exploitation balancing strategies.
