在概率论 和机器学习 中,多臂赌博机问题 (英語:multi-armed bandit problem )[ 1] 有时称为K- 或N-臂赌博机问题 (英語:K-or N-armed bandit problem )[ 2] ,是一个必须在竞争(替代)之间分配一组固定的有限资源的问题。当每个选择的属性在分配时仅部分已知时,以最大化其预期收益的方式进行选择,并且随着时间的推移或通过向该选择分配资源可能会更好地被理解。这是一个经典的强化学习问题,体现了探索-利用权衡困境[ 3] [ 4] 。这个名字来源于想象一个赌徒坐在一排赌博机 (或称角子机、老虎机)前(有时被称为“单臂賭博機”),他必须决定玩哪台机器,每台机器玩多少次以及玩的顺序[ 5] ,并且是否继续使用当前机器 或尝试不同的机器。多臂赌博机问题也属于随机调度的广义范畴。
在该问题中,每台机器根据该机器特定的概率分布提供随机奖励,该奖励是先验未知的。赌徒的目标是最大化通过一系列杠杆拉动所获得的奖励总和[ 4] 。赌徒在每次试验中面临的关键权衡是在“利用”具有最高预期收益的机器和“探索”以获得有关其他机器的预期收益的更多信息之间[ 3] 。机器学习也面临着探索和利用之间的权衡。在实践中,多臂赌博机已用于对诸如管理大型组织(如科学基金会 或制药公司 )中的研究项目等问题进行建模[ 3] [ 4] 。在问题的早期版本中,赌徒一开始对机器一无所知。
赫伯特·罗宾斯 于1952年认识到该问题的重要性,在“实验序贯设计的某些方面”中构建了收敛种群选择策略[ 6] 。约翰·C·吉廷斯 首次发表的吉廷斯指数定理给出了最大化预期折扣奖励的最优策略[ 7] 。
^ Auer, P.; Cesa-Bianchi, N.; Fischer, P. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning. 2002, 47 (2/3): 235–256. doi:10.1023/A:1013689704352 .
^ Katehakis, M. N.; Veinott, A. F. The Multi-Armed Bandit Problem: Decomposition and Computation . Mathematics of Operations Research. 1987, 12 (2): 262–268. S2CID 656323 . doi:10.1287/moor.12.2.262 .
^ 3.0 3.1 3.2 引用错误:没有为名为Gittins89
的参考文献提供内容
^ 4.0 4.1 4.2 引用错误:没有为名为BF
的参考文献提供内容
^ Weber, Richard, On the Gittins index for multiarmed bandits, Annals of Applied Probability , 1992, 2 (4): 1024–1033, JSTOR 2959678 , doi:10.1214/aoap/1177005588
^ Robbins, H. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society. 1952, 58 (5): 527–535. doi:10.1090/S0002-9904-1952-09620-8 .
^ J. C. Gittins . Bandit Processes and Dynamic Allocation Indices. Journal of the Royal Statistical Society. Series B (Methodological). 1979, 41 (2): 148–177. JSTOR 2985029 . S2CID 17724147 . doi:10.1111/j.2517-6161.1979.tb01068.x .
Guha, S.; Munagala, K.; Shi, P., Approximation algorithms for restless bandit problems, Journal of the ACM, 2010, 58 : 1–50, S2CID 1654066 , arXiv:0711.3861 , doi:10.1145/1870103.1870106
Dayanik, S.; Powell, W.; Yamazaki, K., Index policies for discounted bandit problems with availability constraints, Advances in Applied Probability, 2008, 40 (2): 377–400, doi:10.1239/aap/1214950209 .
Powell, Warren B., Chapter 10, Approximate Dynamic Programming: Solving the Curses of Dimensionality, New York: John Wiley and Sons, 2007, ISBN 978-0-470-17155-4 .
Robbins, H. , Some aspects of the sequential design of experiments, Bulletin of the American Mathematical Society , 1952, 58 (5): 527–535, doi:10.1090/S0002-9904-1952-09620-8 .
Sutton, Richard; Barto, Andrew, Reinforcement Learning , MIT Press, 1998, ISBN 978-0-262-19398-6 , (原始内容 存档于2013-12-11) .
Allesiardo, Robin, A Neural Networks Committee for the Contextual Bandit Problem, Neural Information Processing – 21st International Conference, ICONIP 2014, Malaisia, November 03-06,2014, Proceedings, Lecture Notes in Computer Science 8834 , Springer: 374–381, 2014, ISBN 978-3-319-12636-4 , S2CID 14155718 , arXiv:1409.8191 , doi:10.1007/978-3-319-12637-1_47 .
Weber, Richard, On the Gittins index for multiarmed bandits, Annals of Applied Probability , 1992, 2 (4): 1024–1033, JSTOR 2959678 , doi:10.1214/aoap/1177005588 .
Katehakis, M. ; C. Derman, Computing optimal sequential allocation rules in clinical trials, Adaptive statistical procedures and related topics, Institute of Mathematical Statistics Lecture Notes - Monograph Series 8 : 29–39, 1986, ISBN 978-0-940600-09-6 , JSTOR 4355518 , doi:10.1214/lnms/1215540286 .
Katehakis, M. ; A. F. Veinott, Jr., The multi-armed bandit problem: decomposition and computation , Mathematics of Operations Research, 1987, 12 (2): 262–268, JSTOR 3689689 , S2CID 656323 , doi:10.1287/moor.12.2.262 .
MABWiser , open source Python implementation of bandit strategies that supports context-free, parametric and non-parametric contextual policies with built-in parallelization and simulation capability.
PyMaBandits (页面存档备份 ,存于互联网档案馆 ), open source implementation of bandit strategies in Python and Matlab.
Contextual (页面存档备份 ,存于互联网档案馆 ), open source R package facilitating the simulation and evaluation of both context-free and contextual Multi-Armed Bandit policies.
bandit.sourceforge.net Bandit project (页面存档备份 ,存于互联网档案馆 ), open source implementation of bandit strategies.
Banditlib (页面存档备份 ,存于互联网档案馆 ), Open-Source implementation of bandit strategies in C++.
Leslie Pack Kaelbling and Michael L. Littman (1996). Exploitation versus Exploration: The Single-State Case .
Tutorial: Introduction to Bandits: Algorithms and Theory. Part1 (页面存档备份 ,存于互联网档案馆 ). Part2 (页面存档备份 ,存于互联网档案馆 ).
Feynman's restaurant problem , a classic example (with known answer) of the exploitation vs. exploration tradeoff.
Bandit algorithms vs. A-B testing (页面存档备份 ,存于互联网档案馆 ).
S. Bubeck and N. Cesa-Bianchi A Survey on Bandits (页面存档备份 ,存于互联网档案馆 ).
A Survey on Contextual Multi-armed Bandits (页面存档备份 ,存于互联网档案馆 ), a survey/tutorial for Contextual Bandits.
Blog post on multi-armed bandit strategies, with Python code (页面存档备份 ,存于互联网档案馆 ).
Animated, interactive plots (页面存档备份 ,存于互联网档案馆 ) illustrating Epsilon-greedy, Thompson sampling , and Upper Confidence Bound exploration/exploitation balancing strategies.
可微分计算
概论 概念 应用 硬件 软件库 架构
主题
分类