
Thompson sampling

Thompson sampling, named after William R. Thompson, is a heuristic for choosing actions that addresses the exploration-exploitation dilemma in the multi-armed bandit problem. It consists of choosing the action that maximizes the expected reward with respect to a randomly drawn belief.

Consider a set of contexts $\mathcal{X}$, a set of actions $\mathcal{A}$, and rewards in $\mathbb{R}$. In each round, the player obtains a context $x \in \mathcal{X}$, plays an action $a \in \mathcal{A}$, and receives a reward $r \in \mathbb{R}$ drawn from a distribution that depends on the context and the chosen action. The aim of the player is to play actions so as to maximize the cumulative reward.
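The action rule above can be sketched for the simplest setting, a context-free Bernoulli bandit with conjugate Beta priors. This is a minimal illustration, not a general implementation: the function name, the Beta(1, 1) uniform prior, and the simulated reward model are all choices made here for the example. Each round, one belief is sampled from every arm's posterior and the arm whose sampled mean reward is largest is played.

```python
import random

def thompson_sampling(true_probs, n_rounds=1000, seed=0):
    """Beta-Bernoulli Thompson sampling on a simulated Bernoulli bandit.

    Each arm's unknown success probability gets a Beta(1, 1) prior. Every
    round we draw one sample from each arm's posterior (a "randomly drawn
    belief"), play the arm whose sample is largest, and update that arm's
    posterior with the observed 0/1 reward.
    """
    rng = random.Random(seed)
    n_arms = len(true_probs)
    alphas = [1] * n_arms  # Beta alpha parameters: 1 + observed successes
    betas = [1] * n_arms   # Beta beta parameters: 1 + observed failures
    total_reward = 0
    for _ in range(n_rounds):
        # Sample a belief about each arm's mean reward from its posterior.
        samples = [rng.betavariate(alphas[a], betas[a]) for a in range(n_arms)]
        # Greedy step with respect to the sampled beliefs.
        arm = max(range(n_arms), key=lambda a: samples[a])
        # Simulated environment: Bernoulli reward with the arm's true probability.
        reward = 1 if rng.random() < true_probs[arm] else 0
        alphas[arm] += reward
        betas[arm] += 1 - reward
        total_reward += reward
    return total_reward, alphas, betas
```

Because sampling from the posterior naturally mixes exploration (uncertain arms produce widely spread samples) with exploitation (well-estimated good arms usually produce the highest sample), play concentrates on the best arm as evidence accumulates.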
