Selected Papers


PAIRED

NeurIPS 2020 Oral (top 1% of submissions)

We formally define Unsupervised Environment Design (UED) as the problem of automatically building environments to promote learning and transfer. We show that UED has deep connections to the field of decisions under ignorance and, borrowing from that field, aim to find high-regret environments that promote efficient learning and transfer.

Protagonist Antagonist Induced Regret Environment Design (PAIRED) provably finds minimax-regret policies in equilibrium by training an adversary to generate levels that are difficult for the protagonist agent but easy for an antagonist agent.

This motivates the adversary to build difficult but solvable levels, like the maze to the left, since an unsolvable maze is difficult for the antagonist too.
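As a minimal sketch of this objective (function and variable names are ours, not from the paper), the adversary's regret estimate for a level can be computed from episode returns roughly as follows:

```python
def paired_regret(antagonist_returns, protagonist_returns):
    """Regret estimate for a level: the antagonist's best episode return
    minus the protagonist's average return on the same level. The
    adversary is trained to maximize this quantity."""
    best_antagonist = max(antagonist_returns)
    mean_protagonist = sum(protagonist_returns) / len(protagonist_returns)
    return best_antagonist - mean_protagonist

# An unsolvable level yields zero return for both agents, so zero regret:
print(paired_regret([0.0, 0.0], [0.0, 0.0]))  # 0.0
# A hard-but-solvable level: the antagonist succeeds, the protagonist struggles.
print(paired_regret([1.0, 0.8], [0.2, 0.0]))  # high regret
```

Because an unsolvable level scores zero regret, the adversary gains nothing from generating it, which is what pushes it toward levels at the protagonist's frontier.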

Adversarial Policies

ICLR 2020
NeurIPS 2019 DeepRL Workshop Talk (top 6% of accepted papers)
ICML 2019 SPML Workshop Spotlight (top 6% of accepted papers)

Deep reinforcement learning (RL) policies are known to be vulnerable to adversarial perturbations to their observations, similar to adversarial examples for classifiers. However, an attacker is not usually able to directly modify another agent’s observations. This might lead one to wonder: is it possible to attack an RL agent simply by choosing an adversarial policy acting in a multi-agent environment so as to create natural observations that are adversarial?

We demonstrate the existence of adversarial policies in zero-sum games between simulated humanoid robots, against state-of-the-art victims. As shown in our demo videos, the adversarial policies reliably and dramatically defeat the victims despite behaving in a seemingly random and uncoordinated way.


ACCEL

NeurIPS 2021 DeepRL Workshop

Two important insights in Unsupervised Environment Design (UED) are:

  • High-regret levels promote efficient learning and transfer (see PAIRED, Robust PLR)
  • Evolution is more efficient at optimizing environments than RL (see POET)

In this work we combine these two threads, curating level edits to maximize regret and allowing evolution to compound this effect over time.

ACCEL achieves state-of-the-art performance in challenging domains including:

  • Bipedal Walker (used in POET)
  • Minigrid environments (used in PAIRED and Robust PLR)

Stress-test our method yourself with our interactive demo, right in your browser!
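At its core, the combination alternates curation and mutation. Here is a hedged toy sketch of one such iteration (the names and the integer "level" domain are purely illustrative, not from the paper):

```python
import random

def accel_step(buffer, mutate, regret, rng):
    """One simplified ACCEL-style iteration: replay the highest-regret
    level in the buffer, apply a small random edit, and keep the edit
    only if it does not decrease regret."""
    parent = max(buffer, key=regret)
    child = mutate(parent, rng)
    if regret(child) >= regret(parent):
        buffer.append(child)
    return buffer

# Toy domain: a "level" is an integer difficulty, and regret peaks at the
# agent's current frontier (here, difficulty 10); harder levels become
# unsolvable and regret falls off again.
toy_regret = lambda level: -abs(level - 10)
toy_mutate = lambda level, rng: level + rng.choice([-1, 1])

rng = random.Random(0)
buffer = [0]
for _ in range(50):
    accel_step(buffer, toy_mutate, toy_regret, rng)
print(max(buffer))  # small edits compound toward the frontier
```

In this toy, only edits that move difficulty toward the frontier are accepted, so the buffer accumulates progressively harder (but still solvable) levels over time.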

Robust PLR

NeurIPS 2021

Unsupervised Environment Design (UED) is a promising technique for accelerating training and promoting transfer performance. In this work we study two classes of UED techniques, active design such as PAIRED and passive curation such as PLR, casting them into a single framework of Dual Curriculum Design (DCD). DCD allows us to extend regret-based guarantees to combinations of existing methods.

Moreover, the theory suggests a counterintuitive conclusion: PLR can be improved by training on less data, a variant we call Robust PLR. Essentially, UED methods should focus on quality over quantity.

We validate Robust PLR empirically, showing it achieves state-of-the-art performance on challenging transfer tasks.
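A hedged sketch of the core change (names are ours, and this omits PLR's staleness and rank-based sampling): new levels are still evaluated in order to score them, but gradient updates come only from replayed high-regret levels.

```python
import random

def robust_plr_step(buffer, scores, sample_new, p_replay, rng):
    """Decide what to run next. Returns (level, do_train): with
    probability p_replay, replay the highest-scoring buffered level and
    train on it; otherwise evaluate a fresh random level to score it for
    the buffer, but skip the gradient update (the "Robust" change)."""
    if buffer and rng.random() < p_replay:
        level = max(buffer, key=lambda lvl: scores[lvl])
        return level, True
    level = sample_new(rng)
    buffer.append(level)
    scores[level] = 0.0  # scored properly after the evaluation rollout
    return level, False

rng = random.Random(0)
buffer, scores = [], {}
# First step: the buffer is empty, so a new level is evaluated, never trained on.
level, do_train = robust_plr_step(buffer, scores, lambda r: r.randint(0, 99), 0.5, rng)
print(do_train)  # False
scores[level] = 1.0  # pretend the evaluation rollout measured high regret
# With p_replay=1.0 the buffered level is replayed and trained on.
level2, do_train2 = robust_plr_step(buffer, scores, lambda r: r.randint(0, 99), 1.0, rng)
print(level2 == level, do_train2)  # True True
```

Withholding gradients on unscored random levels is what makes the data "less but better": the agent only ever updates on levels already known to have high regret.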


SAMPLR

NeurIPS 2021 DeepRL Workshop

For a policy to achieve good performance, it often needs to internalize the probabilities of uncertain events -- the chance the next card is an ace, the next coin flip turns up heads, or it rains the next day. However, curriculum techniques such as Unsupervised Environment Design (UED), which are often instrumental to a policy's performance, function precisely by changing these distributions during training to focus on the most informative experiences. We call this effect curriculum-induced covariate shift (CICS). CICS can cause the agent to overestimate the probability of the rare events the curriculum prioritized, internalizing incorrect probabilities and making wrong decisions.

SAMPLR corrects for this bias, without changing the curriculum distribution, by making the value estimate unbiased with respect to the true distribution of aleatoric parameters. We thus retain the curriculum's ability to sample informative experiences, while correcting the updates from that experience to remove the CICS the curriculum introduced.
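One simple way to see the correction is through importance weighting (an illustrative sketch of the bias and its fix, not the exact SAMPLR algorithm): reweight each sampled outcome by its probability under the true aleatoric distribution versus the curriculum's.

```python
def unbiased_value_estimate(outcomes, returns, p_true, p_curriculum):
    """Estimate the value under the true outcome distribution from
    samples drawn under the curriculum's shifted distribution."""
    weighted = sum(
        (p_true[o] / p_curriculum[o]) * r for o, r in zip(outcomes, returns)
    )
    return weighted / len(outcomes)

# A coin that truly lands heads 50% of the time, but the curriculum
# oversamples heads 90% of the time; the payoff is 1 for heads, 0 for tails.
outcomes = ["H"] * 9 + ["T"]
returns = [1.0] * 9 + [0.0]
naive = sum(returns) / len(returns)
corrected = unbiased_value_estimate(
    outcomes, returns, p_true={"H": 0.5, "T": 0.5}, p_curriculum={"H": 0.9, "T": 0.1}
)
print(naive)      # 0.9: biased by the curriculum's oversampling
print(corrected)  # ~0.5: matches the true coin
```

The naive average internalizes the curriculum's 90% heads rate; the reweighted estimate recovers the true 50%, which is the kind of unbiasedness SAMPLR targets for value estimates.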


EPIC

ICLR 2021 Spotlight (top 5% of submissions)

The evaluation of reward learning techniques, such as IRL, has traditionally relied on evaluating the behavior of a policy optimized for the learned reward. This process is slow, fails when the policy optimization fails, and restricts evaluation to the available test environments.

Equivalent-Policy Invariant Comparison (EPIC) distance evaluates reward functions directly, quickly, and reliably -- without requiring policy optimization.

Moreover, EPIC:

  • Is invariant to reward shaping
  • Is invariant to reward scaling
  • Bounds the regret incurred by a policy against the true reward function
  • Is invariant to dynamics, so the regret bound applies to all environments
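The heart of the metric is a Pearson-correlation distance. A minimal sketch (omitting the canonicalization step that removes potential-based shaping, and using names of our own choosing):

```python
import math

def pearson_distance(x, y):
    """Distance sqrt((1 - rho) / 2) between two reward vectors, where rho
    is the Pearson correlation. Positively rescaling or shifting either
    vector leaves the distance unchanged."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    rho = max(-1.0, min(1.0, cov / (sx * sy)))  # clamp away rounding error
    return math.sqrt((1 - rho) / 2)

r1 = [0.0, 1.0, 2.0, 3.0]
r2 = [5.0, 7.0, 9.0, 11.0]       # 2 * r1 + 5: rescaled and shifted
print(pearson_distance(r1, r2))  # ~0.0: distance ignores scale and shift
```

Because the Pearson correlation is unchanged by positive affine transformations, this step alone delivers the scaling invariance; the canonicalization step (not shown) is what additionally removes shaping.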