Getting AI to be more cautious: Where do we go next, and can we change our minds if we don't like it?
The idea is to have an agent jointly learn a forward policy and a reset policy. The forward policy maximizes the task reward, while the reset policy learns actions that return the environment to a prior state. This teaches agents to avoid risky actions that would irreversibly commit them to states they can't undo.
"Before an action proposed by the forward policy is executed in the environment, it must be “approved” by the reset policy. In particular, if the reset policy’s Q value for the proposed action is too small, then an early abort is performed: the proposed action is not taken and the reset policy takes control," they write.
See the full story here: https://us13.campaign-archive.com/?u=67bd06787e84d73db24fb0aa5&id=43052be12b&utm_source=MIT+Technology+Review&utm_campaign=edc326f0a0-The_Download&utm_medium=email&utm_term=0_997ed6f472-edc326f0a0-153894145