
Multi-Armed Bandits in Python: Epsilon Greedy, UCB1, Bayesian UCB, and EXP3

In this post I discuss the multi-armed bandit problem and implementations of four specific bandit algorithms in Python (epsilon greedy, UCB1, a Bayesian UCB, and EXP3). I evaluate their performance as content recommendation systems on a real-world movie ratings dataset and provide simple, reproducible code for applying these algorithms to other tasks.

Multi-armed bandits belong to a class of online learning algorithms that allocate a fixed amount of resources to a set of competing choices, attempting to learn an optimal resource allocation policy over time. The multi-armed bandit problem is often introduced via an analogy of a gambler playing slot machines. Imagine you’re at a casino and are presented with a row of \(k\) slot machines, with each machine having a hidden payoff function that determines how much it will pay out. You enter the casino with a fixed amount of money and want to learn the best strategy to maximize your profits. Initially you have no information about which machine is expected to pay out the most money, so you try one at random and observe its payout.
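To make the setting concrete, here is a minimal sketch of a \(k\)-armed Bernoulli bandit paired with the simplest of the four strategies, epsilon greedy. This is an illustrative toy, not the implementation used later in the post; the number of arms, the hidden payout probabilities, and the epsilon value are all arbitrary choices for the example.

```python
import random

random.seed(42)

# A k-armed Bernoulli bandit: each arm pays out 1 with a hidden
# probability that the gambler does not know.
k = 5
hidden_payout_probs = [random.random() for _ in range(k)]

counts = [0] * k    # number of times each arm has been pulled
values = [0.0] * k  # running average payout of each arm
epsilon = 0.1       # probability of exploring a random arm

for t in range(10_000):
    if t < k or random.random() < epsilon:
        # explore: pick an arm at random (always explore for the first k pulls)
        arm = random.randrange(k)
    else:
        # exploit: pick the arm with the best observed average payout
        arm = max(range(k), key=lambda a: values[a])

    reward = 1.0 if random.random() < hidden_payout_probs[arm] else 0.0

    # incremental update of the arm's average payout
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

best_estimated = max(range(k), key=lambda a: values[a])
best_true = hidden_payout_probs.index(max(hidden_payout_probs))
print(f"estimated best arm: {best_estimated}, true best arm: {best_true}")
```

The incremental-mean update keeps only a count and a running average per arm rather than storing every observed payout, which is the standard trick for bandit value estimates. The tension in the loop body, between pulling a random arm to gather information and pulling the best-looking arm to earn money, is exactly the exploration-exploitation tradeoff that the four algorithms in this post resolve in different ways.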
