Improved Online Learning Algorithm for Multinomial Logit Contextual Bandits

Priyank Agrawal, Vashist Avadhanula, Theja Tulabandhula

2020

In this paper, we consider the contextual variant of the MNL-Bandit problem. More specifically, we consider a dynamic assortment optimization problem in which, in every round, a decision maker offers a subset (assortment) of products to a consumer and observes the consumer's response. Consumers purchase products so as to maximize their utility. We assume that the products are described by a set of attributes and that the mean utility of a product is linear in the values of these attributes. We model consumer choice behavior with the widely used Multinomial Logit (MNL) model and consider the decision maker's problem of dynamically learning the model parameters while optimizing cumulative revenue over the selling horizon $T$. Although this problem has attracted considerable attention recently, many existing methods and their theoretical guarantees depend on a problem-dependent parameter that can be prohibitively large. In particular, existing algorithms for this problem have regret bounded by $O(\sqrt{\kappa d T})$, where $\kappa$ is a problem-dependent constant that can depend exponentially on the number of attributes. In this paper, we propose a new algorithm with a carefully designed exploration strategy and show that its regret is bounded by $O(\sqrt{dT} + \kappa)$, a significant improvement over existing methods.
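
As a rough illustration of the choice model described in the abstract, the sketch below computes MNL purchase probabilities when each product's mean utility is linear in its attributes. It assumes the standard MNL form with a no-purchase option of utility zero, which the abstract does not spell out. The names `features`, `theta`, `revenues`, and both function names are hypothetical and chosen only for this example; this is not the paper's algorithm, only the choice model it learns.

```python
import numpy as np


def mnl_choice_probabilities(features, theta, assortment):
    """Return (per-item purchase probabilities, no-purchase probability).

    Assumes the standard MNL form with an outside option of utility 0:
        P(i | S) = exp(x_i . theta) / (1 + sum_{j in S} exp(x_j . theta))
    `assortment` is a non-empty list of product indices into `features`.
    """
    utilities = features[assortment] @ theta  # linear mean utilities
    # Subtract a common shift before exponentiating for numerical stability;
    # the outside option (utility 0) is shifted by the same amount.
    shift = max(utilities.max(), 0.0)
    weights = np.exp(utilities - shift)
    outside = np.exp(-shift)
    denom = outside + weights.sum()
    return weights / denom, outside / denom


def expected_revenue(features, theta, revenues, assortment):
    """Expected one-round revenue of offering `assortment` under the MNL model."""
    p_items, _ = mnl_choice_probabilities(features, theta, assortment)
    return float(p_items @ revenues[assortment])


if __name__ == "__main__":
    # Three hypothetical products with two attributes each.
    features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    revenues = np.array([3.0, 6.0, 9.0])
    theta = np.array([0.5, -0.2])  # made-up parameter vector
    probs, no_purchase = mnl_choice_probabilities(features, theta, [0, 2])
    print(probs, no_purchase, expected_revenue(features, theta, revenues, [0, 2]))
```

In the bandit setting, `theta` is unknown and must be estimated from observed choices; the sketch only shows how a given parameter vector maps an offered assortment to choice probabilities and expected revenue.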
