Michael Dennis

Real-world environments are complicated, too complicated to be completely specified in any simulation or model. Despite this, we need our agents to solve real-world problems and manage real-world complexity. I am interested in the intersection between Problem Specification and Open-Ended Complexity -- studying the boundary between what complexity must be described and what can be artificially generated.

To this end, we have formalized the problem of Unsupervised Environment Design (UED), which aims to build complex and challenging environments automatically to promote efficient learning and transfer. This framework has deep connections to decision theory, which allow us to make guarantees about how the resulting policies would perform in human-designed environments without ever having trained on them.

I'm currently a Research Scientist on Google DeepMind's Open-Endedness team. I was previously a Ph.D. student at the Center for Human-Compatible AI (CHAI), advised by Stuart Russell. Prior to my research in AI, I conducted research on computer science theory and computational geometry.

Selected Papers

PAIRED

NeurIPS 2020 Oral (top 1% of submissions)

We formally define Unsupervised Environment Design (UED) as the problem of automatically building environments to promote learning and transfer. We show that UED has deep connections to the field of decisions under ignorance and, borrowing from that field, aim to find high-regret environments to promote efficient learning and transfer.

Protagonist Antagonist Induced Regret Environment Design (PAIRED) provably finds minimax regret policies in equilibrium by training an adversary to generate levels which are difficult for the protagonist agent but easy for an antagonist agent.

This motivates the adversary to build difficult but solvable levels, like the maze to the left: an unsolvable maze is just as difficult for the antagonist as for the protagonist, so it produces no regret.
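
The incentive described above can be illustrated with a small sketch. It assumes regret on a level is approximated as the best antagonist return minus the mean protagonist return; the function name `estimate_regret` and all episode returns below are made up for illustration and are not taken from the PAIRED code.

```python
from statistics import mean


def estimate_regret(antagonist_returns, protagonist_returns):
    """Hypothetical regret proxy for one level.

    Assumption: regret is approximated as the best antagonist return
    minus the mean protagonist return on that level.
    """
    return max(antagonist_returns) - mean(protagonist_returns)


# Made-up episode returns in [0, 1] for three candidate levels.
levels = {
    "trivial": {"antagonist": [1.0, 1.0], "protagonist": [1.0, 1.0]},
    "hard_solvable": {"antagonist": [0.9, 0.8], "protagonist": [0.1, 0.0]},
    "unsolvable": {"antagonist": [0.0, 0.0], "protagonist": [0.0, 0.0]},
}

regrets = {
    name: estimate_regret(r["antagonist"], r["protagonist"])
    for name, r in levels.items()
}

# A regret-maximizing adversary prefers the level with the largest gap.
adversary_choice = max(regrets, key=regrets.get)
print(regrets)
print("adversary prefers:", adversary_choice)
```

In this toy setting both the trivial and the unsolvable level give zero regret, so only the difficult but solvable level is rewarding for the adversary.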

Michael Dennis*, Natasha Jaques*, Eugene Vinitsky, Alexandre Bayen, Stuart Russell, Andrew Critch, Sergey Levine

Resources:

Reception:

Adversarial Policies

ICLR 2020
NeurIPS 2019 DeepRL Workshop Talk (top 6% of accepted papers)
ICML 2019 SPML Workshop Spotlight (top 6% of accepted papers)

Deep reinforcement learning (RL) policies are known to be vulnerable to adversarial perturbations to their observations, similar to adversarial examples for classifiers. However, an attacker is not usually able to directly modify another agent’s observations. This might lead one to wonder: is it possible to attack an RL agent simply by choosing an adversarial policy acting in a multi-agent environment so as to create natural observations that are adversarial?

We demonstrate the existence of adversarial policies in zero-sum games between simulated humanoid robots, against state-of-the-art victims. As shown in our demo videos, the adversarial policies reliably and dramatically win against the victims with seemingly random and uncoordinated behavior.

Adam Gleave, Michael Dennis, Cody Wild, Neel Kant, Sergey Levine, Stuart Russell

Resources:

Reception:

ACCEL

NeurIPS 2021 DeepRL Workshop

Two important insights in Unsupervised Environment Design (UED) are:

High Regret Levels promote efficient learning and transfer (See PAIRED, Robust PLR)
Evolution is more efficient at optimizing environments than RL (See POET)

In this work we combine these two threads, curating level edits to maximize regret, and allowing evolution to compound this effect over time.
ACCEL achieves state-of-the-art performance in challenging domains including:

Bipedal Walker (used in POET)
Minigrid environments (used in PAIRED and Robust PLR)
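
The combination of regret-based curation and evolution described above can be sketched as a simple loop. This is a toy illustration under stated assumptions, not the ACCEL implementation: the names `accel_style_step`, `toy_regret`, and `toy_mutate` are hypothetical, a level is reduced to a single integer (number of obstacles), the regret score is a stand-in function, and agent training is omitted entirely.

```python
import random


def accel_style_step(buffer, mutate, regret, capacity, rng):
    """One curation step: edit a buffered level, keep the highest-regret levels.

    `buffer` is a list of (regret_score, level) pairs, highest score first.
    """
    # Sample a parent, favoring levels with higher estimated regret.
    weights = [score + 1e-3 for score, _ in buffer]
    _, parent = rng.choices(buffer, weights=weights, k=1)[0]
    # Apply a small edit and score the child level.
    child = mutate(parent, rng)
    buffer.append((regret(child), child))
    # Keep only the top-`capacity` levels by regret.
    buffer.sort(key=lambda entry: entry[0], reverse=True)
    del buffer[capacity:]


# Toy stand-ins (assumptions): harder levels have higher regret until
# they become unsolvable, at which point regret drops to zero.
SOLVABLE_LIMIT = 12


def toy_regret(level):
    return float(level) if level <= SOLVABLE_LIMIT else 0.0


def toy_mutate(level, rng):
    return max(0, level + rng.choice([-1, 1]))


rng = random.Random(0)
buffer = [(toy_regret(0), 0)]
best_scores = []
for _ in range(300):
    accel_style_step(buffer, toy_mutate, toy_regret, capacity=8, rng=rng)
    best_scores.append(buffer[0][0])

print("highest-regret level in buffer:", buffer[0][1])
```

Because each child is a small edit of a level that was already worth keeping, complexity can compound step by step while levels past the solvable limit are filtered out by their low score.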

Stress test our method yourself with our interactive demo right in your browser!

Jack Parker-Holder*, Minqi Jiang*, Michael Dennis, Mikayel Samvelyan, Jakob Foerster, Edward Grefenstette, Tim Rocktäschel

Resources:

Reception:

Robust PLR

NeurIPS 2021

Unsupervised Environment Design (UED) is a promising technique to accelerate training and promote transfer performance. In this work we study two classes of techniques for UED, active design like PAIRED and passive curation like PLR, casting them into one framework of Dual Curriculum Design (DCD). DCD allows us to extend regret-based guarantees to combinations of existing frameworks.

Moreover, the theory suggests a counterintuitive conclusion:
PLR can be improved by training on less data, a variant we call Robust PLR.
Essentially, UED methods should focus on quality over quantity.

We validate Robust PLR empirically, showing it achieves state of the art in challenging transfer tasks.
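
The "training on less data" idea can be sketched as follows. This is a simplified illustration, assuming the agent receives updates only on levels replayed from the curated buffer, while newly sampled levels are merely evaluated and scored. The names `robust_plr_step`, `toy_score`, and `toy_train` are hypothetical, the score is a stand-in for a regret estimate, and replay always picks the top-scoring level rather than sampling by priority.

```python
import random


def robust_plr_step(buffer, sample_new_level, score, train,
                    replay_prob, capacity, rng):
    """One step of a simplified Robust-PLR-style loop.

    `buffer` maps level -> score (a stand-in for estimated regret).
    """
    if buffer and rng.random() < replay_prob:
        # Replay: train only on a curated, high-scoring level.
        level = max(buffer, key=buffer.get)
        train(level)
        buffer[level] = score(level)  # refresh the score after the update
        return "replay"
    # Explore: evaluate a new level WITHOUT updating the agent.
    level = sample_new_level(rng)
    buffer[level] = score(level)
    if len(buffer) > capacity:
        del buffer[min(buffer, key=buffer.get)]  # evict the lowest score
    return "explore"


# Toy stand-ins (assumptions): a level is an integer id, and its score
# shrinks each time the agent trains on it.
explored, train_counts = set(), {}


def toy_new_level(rng):
    level = rng.randrange(100)
    explored.add(level)
    return level


def toy_score(level):
    return (level % 10 + 1) / (1 + train_counts.get(level, 0))


def toy_train(level):
    train_counts[level] = train_counts.get(level, 0) + 1


rng = random.Random(0)
buffer = {}
outcomes = [
    robust_plr_step(buffer, toy_new_level, toy_score, toy_train,
                    replay_prob=0.5, capacity=10, rng=rng)
    for _ in range(200)
]
print("updates:", outcomes.count("replay"), "of", len(outcomes), "steps")
```

Only the replay branch calls `train`, so a substantial fraction of the sampled levels never contribute an update, which is the sense in which the agent trains on less, but better curated, data.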

Minqi Jiang*, Michael Dennis*, Jack Parker-Holder, Jakob Foerster, Edward Grefenstette, Tim Rocktäschel

Links:

SAMPLR

NeurIPS 2021 DeepRL Workshop

For a policy to achieve good performance, it often needs to internalize the probabilities of uncertain events -- the chance the next card is an ace, the next coin flip turns up heads, or there is rain on the next day. However, curriculum techniques such as Unsupervised Environment Design (UED), which are often instrumental to a policy's performance, function precisely by changing these distributions during training to focus on the most informative experiences. We call this effect curriculum-induced covariate shift (CICS). CICS can cause the agent to overestimate the probability of the rare events the curriculum prioritized, thus internalizing incorrect probabilities and making wrong decisions. SAMPLR corrects for this bias without changing the curriculum distribution, by making the value estimate unbiased with respect to the true distribution of aleatoric parameters. Thus we keep the curriculum's ability to sample informative experience while correcting the updates from that experience for the CICS the curriculum introduced.
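
A small worked example may make the bias concrete. This is not the SAMPLR algorithm, only an illustration of CICS with made-up numbers: a bet pays +5 on a rare event and -1 otherwise, the rare event has true probability 0.1, and a hypothetical curriculum shows it half the time. Declining the bet is worth 0.

```python
def expected_value(p_rare, payoff_rare=5.0, payoff_common=-1.0):
    """Expected payoff of taking the bet when the rare event has probability p_rare."""
    return p_rare * payoff_rare + (1.0 - p_rare) * payoff_common


TRUE_P_RARE = 0.1        # assumption: ground-truth chance of the rare event
CURRICULUM_P_RARE = 0.5  # assumption: how often the curriculum presents it

# A value estimate fit naively to curriculum data converges to this.
value_under_curriculum = expected_value(CURRICULUM_P_RARE)
# A value estimate that is unbiased for the true distribution gives this.
value_under_truth = expected_value(TRUE_P_RARE)

# The agent takes the bet only if it looks better than declining (value 0).
takes_bet_naive = value_under_curriculum > 0
takes_bet_corrected = value_under_truth > 0

print("value under curriculum:", value_under_curriculum)
print("value under true distribution:", value_under_truth)
print("naive agent bets:", takes_bet_naive)
print("corrected agent bets:", takes_bet_corrected)
```

The naive estimate makes a losing bet look attractive, while an estimate that is unbiased with respect to the true distribution correctly declines it, even though both agents could have seen exactly the same curriculum.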

Minqi Jiang, Michael Dennis, Jack Parker-Holder, Andrei Lupu, Heinrich Kuttler, Edward Grefenstette, Tim Rocktäschel, Jakob Foerster

Links:

EPIC

ICLR 2021 Spotlight (top 5% of submissions)

The evaluation of reward learning techniques, such as IRL, has traditionally relied on evaluating the behavior of a policy optimizing the learned reward. This process is slow, fails when the policy optimization fails, and restricts our evaluation to test environments.

Equivalent-Policy Invariant Comparison (EPIC) distance evaluates reward functions directly, quickly, and reliably -- without requiring policy optimization.

Moreover, EPIC is:

Invariant to reward shaping
Invariant to reward scaling
Bounds the regret incurred by a policy against the true reward function
Invariant to dynamics, so the regret bound applies to all environments
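
The invariance properties listed above can be checked numerically with a minimal tabular sketch. It assumes small finite state and action spaces, uniform distributions over states, actions, and transitions, and rewards stored as nested lists `R[s][a][s_next]`. The function names are illustrative rather than taken from any released EPIC code, and non-constant rewards are assumed so that the correlation is defined.

```python
import math
import random


def canonicalize(R, gamma):
    """Remove potential shaping from a tabular reward R[s][a][s_next]."""
    n_s, n_a = len(R), len(R[0])
    # m[x]: mean reward when leaving state x (uniform action, next state).
    m = [sum(sum(row) for row in R[x]) / (n_a * n_s) for x in range(n_s)]
    overall = sum(m) / n_s  # mean reward over all transitions
    return [[[R[s][a][t] + gamma * m[t] - m[s] - gamma * overall
              for t in range(n_s)] for a in range(n_a)] for s in range(n_s)]


def pearson_distance(xs, ys):
    """sqrt((1 - rho) / 2), where rho is the Pearson correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    rho = cov / math.sqrt(vx * vy)
    # Clamp to [0, 1] to guard against floating-point rounding.
    return math.sqrt(min(1.0, max(0.0, (1.0 - rho) / 2.0)))


def epic_distance(Ra, Rb, gamma=0.9):
    def flatten(R):
        return [v for per_state in R for row in per_state for v in row]
    return pearson_distance(flatten(canonicalize(Ra, gamma)),
                            flatten(canonicalize(Rb, gamma)))


def shape_and_scale(R, phi, scale, gamma):
    """Return scale * R plus potential shaping gamma * phi[s_next] - phi[s]."""
    n_s, n_a = len(R), len(R[0])
    return [[[scale * R[s][a][t] + gamma * phi[t] - phi[s]
              for t in range(n_s)] for a in range(n_a)] for s in range(n_s)]


rng = random.Random(0)
n_s, n_a, gamma = 4, 3, 0.9
R = [[[rng.uniform(-1, 1) for _ in range(n_s)]
      for _ in range(n_a)] for _ in range(n_s)]
phi = [rng.uniform(-5, 5) for _ in range(n_s)]

R_equivalent = shape_and_scale(R, phi, 3.0, gamma)  # shaped and rescaled
R_opposite = [[[-v for v in row] for row in per_state] for per_state in R]

print("shaped and scaled copy:", epic_distance(R, R_equivalent, gamma))
print("negated reward:", epic_distance(R, R_opposite, gamma))
```

The shaped and rescaled copy should come out at approximately zero distance from the original, while the negated reward should come out at approximately the maximum distance of one. No policy is trained at any point, which is what makes this kind of comparison fast.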

Adam Gleave, Michael Dennis, Shane Legg, Stuart Russell, Jan Leike

Resources:

Reception: