Abstract
Online learning is a popular framework for sequential prediction problems. The standard approach to analyzing the performance of an online learning algorithm (the learner) is in terms of its empirical regret, defined as the excess loss suffered by the learner compared to the best fixed action in hindsight. Regret is a meaningful measure when the learner's environment is oblivious; however, it becomes difficult to interpret when the environment is reactive. The problem stems from the fact that regret evaluates every competing strategy on the same sequence of losses that was generated in response to the learner's actions. This makes little sense in reactive environments such as financial markets, healthcare, and social networks, where the learner's actions affect potential future outcomes. To address this issue, we introduced policy regret, a counterfactual notion of regret designed to capture the reactive nature of the learning environment. In this talk, we will revisit policy regret in a partial-observation setting specified by a feedback graph, and in a game-theoretic setting where the environment is a self-interested agent rather than a malicious adversary.
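To make the contrast concrete, the following is a sketch of the two notions under the usual formalization from the policy-regret literature; the symbols (action set $\mathcal{A}$, horizon $T$, losses $\ell_t$, learner's actions $a_t$) are illustrative notation, not taken from the talk itself.

% Standard (external) regret: the competing action a is evaluated on the
% same loss sequence that was generated in response to the learner's plays.
\[
  R_T \;=\; \sum_{t=1}^{T} \ell_t(a_t) \;-\; \min_{a \in \mathcal{A}} \sum_{t=1}^{T} \ell_t(a)
\]

% Policy regret: against a reactive environment whose loss at time t may
% depend on the learner's entire action history, the benchmark is
% counterfactual; it asks what the losses would have been had the learner
% played the fixed action a from the start.
\[
  R_T^{\mathrm{policy}} \;=\; \sum_{t=1}^{T} \ell_t(a_1,\dots,a_t)
  \;-\; \min_{a \in \mathcal{A}} \sum_{t=1}^{T} \ell_t(a,\dots,a)
\]

Against an oblivious environment the two notions coincide, since the losses do not depend on the learner's past actions; against a reactive environment they can differ substantially.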