Model predictive control (MPC) is an optimal control strategy that derives actions by planning through a learned model of the world.
In order to learn a good model, we need to explore the states that we'll actually end up in. One basic method would be the following:
- Run some base policy $\pi_0(a \mid s)$ to randomly collect tuples $(s, a, s')$ into a dataset $\mathcal{D}$.
- Fit a model of the world $f(s, a)$ on the transitions.
- Plan actions with some optimal control or optimization method like the Cross Entropy Method (a sketch follows this list).
- Execute the plan in the world, add the observations to $\mathcal{D}$, then repeat.
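Below is a minimal sketch of what the Cross Entropy Method planner could look like. It assumes we already have a learned dynamics function and a known reward function; the names `dynamics_fn`, `reward_fn`, and all hyperparameters are illustrative, not from the source.

```python
import numpy as np

def cem_plan(s0, dynamics_fn, reward_fn, action_dim, horizon=15,
             n_samples=500, n_elites=50, n_iters=5):
    """Return an action sequence (horizon, action_dim) that maximizes predicted reward."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # Sample candidate action sequences from the current Gaussian.
        actions = mean + std * np.random.randn(n_samples, horizon, action_dim)
        returns = np.zeros(n_samples)
        for i in range(n_samples):
            s = s0
            for t in range(horizon):
                returns[i] += reward_fn(s, actions[i, t])
                s = dynamics_fn(s, actions[i, t])  # roll forward in the learned model
        # Refit the Gaussian to the top-scoring (elite) sequences.
        elites = actions[np.argsort(returns)[-n_elites:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean  # the final mean is our planned action sequence
```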
The basic idea here is to alternate between fitting a model and executing actions generated by a planning algorithm. However, this method's main limitation is that we don't actually need to execute all the actions we planned. Rather, executing just the first action and observing the new state lets us replan and avoid compounding future planning errors. This observation gives us the MPC algorithm, which replans after each action:
1. Run a base policy $\pi_0(a \mid s)$ to collect $\mathcal{D} = \{(s, a, s')\}$.
2. Learn a dynamics model $f(s, a)$ from $\mathcal{D}$.
3. Plan through $f(s, a)$ to get a sequence of actions (see the sketch after this list).
4. Execute only the first planned action and observe the resulting state $s'$.
5. Add $(s, a, s')$ to $\mathcal{D}$ and go back to step 3; go back to step 2 every $N$ steps.
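Here is a rough sketch of that loop, reusing the `cem_plan` function above. It assumes a gym-style `env` with a 4-tuple `step` return and a hypothetical `train_dynamics(D)` routine that fits a model to the collected transitions; both are assumptions for illustration.

```python
def mpc(env, reward_fn, action_dim, total_steps=2000, retrain_every=50):
    D = []                                      # dataset of (s, a, s') transitions
    s = env.reset()
    for _ in range(500):                        # step 1: random base policy
        a = env.action_space.sample()
        s_next, _, done, _ = env.step(a)
        D.append((s, a, s_next))
        s = env.reset() if done else s_next

    dynamics_fn = train_dynamics(D)             # step 2: fit the dynamics model
    s = env.reset()
    for step in range(1, total_steps + 1):
        plan = cem_plan(s, dynamics_fn, reward_fn, action_dim)  # step 3: plan
        a = plan[0]                             # step 4: execute only the first action
        s_next, _, done, _ = env.step(a)
        D.append((s, a, s_next))                # step 5: add the new transition
        s = env.reset() if done else s_next
        if step % retrain_every == 0:           # refit the model every N steps
            dynamics_fn = train_dynamics(D)
    return dynamics_fn
```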
Uncertainty
At the start of training, our dynamics model may overfit the data, causing the planner to exploit random trajectories that aren't actually advantageous. This causes the future transitions added to $\mathcal{D}$ to come from these poorly chosen trajectories, compounding the model's errors.
To address this, we can learn a model that predicts a distribution over the next state rather than a single point estimate.
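One generic way to output a distribution is a network that predicts a Gaussian mean and log-variance for the next state, trained by negative log-likelihood. This is a sketch of that idea, not the specific architecture used in the source; note that, as discussed next, this kind of output distribution only captures the model's output entropy.

```python
import torch
import torch.nn as nn

class GaussianDynamics(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * state_dim),   # mean and log-variance of s'
        )

    def forward(self, s, a):
        mu, log_var = self.net(torch.cat([s, a], dim=-1)).chunk(2, dim=-1)
        return mu, log_var

    def loss(self, s, a, s_next):
        mu, log_var = self(s, a)
        # Gaussian negative log-likelihood (up to a constant).
        return ((s_next - mu) ** 2 / log_var.exp() + log_var).mean()
```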
Uncertainty-Aware Models
To build an uncertainty-aware model, first note that the output entropy of the dynamics model is not the kind of uncertainty we want to capture. There are two types:
- Statistical uncertainty, where the data has noisy labels for an input. Here, the model is uncertain about the data.
- Model uncertainty, where the model makes predictions on out-of-distribution inputs. Here, we're uncertain about the model.
A model's output entropy measures the first type of uncertainty, whereas we want to find the second. That is, the model may be overconfident on out-of-distribution inputs if there's little statistical uncertainty in the data.
To find model uncertainty, we estimate the posterior over model parameters $p(\theta \mid \mathcal{D})$, and we can predict the next state by marginalizing over it:

$$p(s_{t+1} \mid s_t, a_t) = \int p(s_{t+1} \mid s_t, a_t, \theta)\, p(\theta \mid \mathcal{D})\, d\theta.$$
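A common, simple approximation to this posterior is a bootstrap ensemble: train several independent models (ideally on resampled data) and treat them as samples from $p(\theta \mid \mathcal{D})$. The sketch below builds on the hypothetical `GaussianDynamics` class above; it is one possible implementation, not the source's.

```python
import torch

class EnsembleDynamics:
    def __init__(self, n_models, state_dim, action_dim):
        self.models = [GaussianDynamics(state_dim, action_dim) for _ in range(n_models)]

    def sample_model(self):
        # Sampling theta ~ p(theta | D) reduces to picking one ensemble member.
        return self.models[torch.randint(len(self.models), (1,)).item()]

    def predict_mean(self, s, a):
        # Approximate the marginal over theta: average the next-state means of all members.
        return torch.stack([m(s, a)[0] for m in self.models]).mean(dim=0)
```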
Planning
Given an uncertainty-aware model, our objective is the average reward over models drawn from the posterior,

$$J(a_1, \dots, a_H) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{H} r(s_{t,i}, a_t), \quad s_{t+1,i} \sim p(s_{t+1} \mid s_{t,i}, a_t, \theta_i).$$

In practice, this amounts to sampling some weights $\theta_i \sim p(\theta \mid \mathcal{D})$ (e.g., picking a member of the ensemble), rolling out the candidate action sequence under each sampled model, and averaging the resulting rewards.
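A small sketch of that evaluation step, reusing the hypothetical ensemble above: score a candidate action sequence by rolling it out under several sampled models and averaging the total reward. The `reward_fn` and the number of samples are assumptions for illustration.

```python
import torch

def expected_return(s0, actions, ensemble, reward_fn, n_samples=10):
    total = 0.0
    for _ in range(n_samples):
        model = ensemble.sample_model()          # theta_i ~ p(theta | D)
        s = s0
        for a in actions:
            total += float(reward_fn(s, a))      # accumulate r(s_t, a_t)
            mu, log_var = model(s, a)
            # Sample the next state from this model's predicted Gaussian.
            s = mu + (0.5 * log_var).exp() * torch.randn_like(mu)
    return total / n_samples                     # average reward across sampled models
```

A planner like `cem_plan` can then rank candidate action sequences by this averaged return instead of the return under a single model.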