Combinatorial bandits (CMAB) are a generalization of the well-known Multi-Armed Bandit framework, in which the learner chooses, at each round, a subset of the available arms that satisfies some known constraints. The learner observes the payoffs of each chosen arm and aims at maximizing the cumulative reward. We study, for the first time, CMAB settings with some form of correlation over the arms expected rewards. The arm correlation is crucial to allow algorithms to be effective when the space of the arms is large. In the present paper, we propose a bandit algorithm, namely Gaussian Combinatorial Bandit (GCB), designed for settings in which the arms are partitioned in subsets, and the payoff functions of the arms of each subset are jointly distributed as a Gaussian Process (GP). We provide two different variations of our algorithm (frequentist and Bayesian) that, under mild assumptions, their worst-case regret is O˜(C √ N), where C is the number of subsets of arms whose payoffs are correlated and N is the number of rounds.
Guglielmo Maria Accabi, Francesco Trovo, Alessandro Nuara, Nicola Gatti, Marcello Restelli
14th European Workshop on Reinforcement Learning
October, 2018