WORKING PAPER

Online Action Learning in High Dimensions: A New Exploration Rule for Contextual εₜ-Greedy Heuristics

2020

Marcelo Medeiros, Claudio Cardoso Flores.

TD n. 674


Bandit problems are pervasive in various fields of research and are also present in several practical applications. Examples, including dynamic pricing and assortment and the design of auctions and incentives, permeate a large number of sequential treatment experiments. Different applications impose distinct levels of restrictions on viable actions. Some favor diversity of outcomes, while others require harmful actions to be closely monitored or mainly avoided. In this paper, we extend one of the most popular bandit solutions, the original εₜ-greedy heuristics, to high-dimensional contexts. Moreover, we introduce a competing exploration mechanism that relies on searching sets based on order statistics. We view our proposals as alternatives for cases where pluralism is valued or, in the opposite direction, cases where the end-user should carefully tune the range of exploration of new actions. We find reasonable bounds for the cumulative regret of a decaying εₜ-greedy heuristic in both cases, and we provide an upper bound for the initialization phase implying that the regret bounds under order-statistics exploration are at most equal to, and typically better than, those obtained when random searching is the sole exploration mechanism. Additionally, we show that end-users have sufficient flexibility to avoid harmful actions, since any cardinality for the higher-order statistics can be used to achieve a stricter upper bound. In a simulation exercise, we show that the algorithms proposed in this paper outperform simple and adapted counterparts.
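To illustrate the idea sketched in the abstract, the following is a minimal, non-authoritative Python sketch of a decaying εₜ-greedy rule in which exploration draws are restricted to a searching set formed by the k highest-order statistics of the estimated rewards. All names and parameters here (the decay schedule, the cardinality k, the reward model) are illustrative assumptions, not the paper's exact specification:

```python
import numpy as np

rng = np.random.default_rng(0)

n_arms = 5
n_rounds = 2000
true_means = rng.uniform(0.0, 1.0, n_arms)  # unknown to the learner

counts = np.zeros(n_arms)      # number of pulls per arm
estimates = np.zeros(n_arms)   # running mean reward per arm
k = 3  # cardinality of the order-statistics searching set (assumed tuning knob)

for t in range(1, n_rounds + 1):
    eps = min(1.0, 5.0 / t)  # decaying exploration probability (assumed schedule)
    if rng.random() < eps:
        # Explore, but only among the k arms with the highest estimated
        # rewards (the "higher-order statistics" searching set), instead
        # of uniformly over all arms as in plain random exploration.
        searching_set = np.argsort(estimates)[-k:]
        arm = int(rng.choice(searching_set))
    else:
        # Exploit the current best estimate.
        arm = int(np.argmax(estimates))
    reward = true_means[arm] + rng.normal(0.0, 0.1)  # noisy reward draw
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]
```

Shrinking k narrows the searching set, which is the lever the abstract describes for end-users who need to keep harmful actions out of the exploration range; k = n_arms recovers uniform random exploration.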

 
