|Context||NIPS 2015 Oral|
Can we synthesize complex movement from simple goals? Different character designs may call for different movement strategies, and ideally we would discover these strategies automatically; doing so would enable versatile robot control, virtual biomechanics studies, and character animation in computer graphics.

An interactive controller can be represented as a control policy: a parameterized function that maps the agent's state (for example, all joint angles and velocities) to actions (for example, joint torques). Applying an action produces a new state, for which the policy produces another action, and so on. Much prior work has sought such policies, either through general search/optimization, which is very expensive, or by incorporating structure based on the agent's design, which limits generality. This work combines trajectory optimization with supervised learning to learn control policies.
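A minimal sketch of what "policy" means here, assuming a small feedforward network (all sizes, names, and the initialization scheme are illustrative, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

n_joints = 4
state_dim = 2 * n_joints   # joint angles + joint velocities
action_dim = n_joints      # one torque per joint
hidden = 16                # hypothetical hidden-layer size

# Policy parameters: a one-hidden-layer network
params = {
    "W1": rng.normal(0, 0.1, (hidden, state_dim)),
    "b1": np.zeros(hidden),
    "W2": rng.normal(0, 0.1, (action_dim, hidden)),
    "b2": np.zeros(action_dim),
}

def policy(state, p):
    """Map the agent's state to an action (joint torques)."""
    h = np.tanh(p["W1"] @ state + p["b1"])
    return p["W2"] @ h + p["b2"]

# One step of the control loop: state -> action; applying the action
# in simulation would yield the next state, and the loop repeats.
state = np.zeros(state_dim)
action = policy(state, params)
```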
Given a character that starts in some initial configuration and a high-level task it should achieve, we would like to find a motion that accomplishes the task. The optimization variables are the actions required over the trajectory, subject to constraints such as kinematics, dynamics, and contact-invariance; these can be imposed as soft penalty terms in the objective. Trajectory optimization alone, however, yields an action sequence valid only for that particular initial state, not a general policy.
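A toy sketch of trajectory optimization with soft constraints (not the paper's contact-invariant formulation): find a force sequence driving a 1-D point mass to a goal, with the task, a stopping condition, and an effort regularizer all expressed as penalty terms in one objective. The dynamics, weights, and horizon are assumed for illustration.

```python
import numpy as np
from scipy.optimize import minimize

dt, mass, T, goal = 0.1, 1.0, 20, 1.0  # assumed timestep, mass, horizon, target

def rollout(forces):
    """Euler-integrate the point mass under a force sequence."""
    pos, vel, traj = 0.0, 0.0, []
    for f in forces:
        vel += dt * f / mass
        pos += dt * vel
        traj.append(pos)
    return np.array(traj), vel

def objective(forces):
    traj, final_vel = rollout(forces)
    task = 100.0 * (traj[-1] - goal) ** 2   # reach the goal (soft constraint)
    stop = 10.0 * final_vel ** 2            # come to rest (soft constraint)
    effort = 0.01 * np.sum(forces ** 2)     # prefer small forces
    return task + stop + effort

# Optimize over the actions themselves
res = minimize(objective, np.zeros(T), method="L-BFGS-B")
traj, _ = rollout(res.x)
# res.x is an open-loop action sequence tied to this one initial state
```

Note that the result is exactly the limitation described above: a single solved trajectory, not a controller that works from other starting configurations.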
Given many trajectories (tasks together with solutions found by trajectory optimization), we can collect state–action pairs and use them as training data for a policy. One problem is that this dataset can be inconsistent: there are, for example, many ways to walk forward one meter, and interpolating between the strategies is not always possible. To mitigate this, the policy and the trajectory optimizations can be solved jointly. This is a large collection of coupled optimization problems, but it can be solved by alternating between fitting the policy and re-optimizing the trajectories; the alternation can run asynchronously and without loading the full dataset into memory. Once trained, the policy generalizes to new situations. Injecting noise into the states during training is important because it teaches the policy what to do in nearby states.
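The alternating scheme can be sketched as follows. This is an illustrative stand-in, not the paper's algorithm: the "trajectory optimizer" is replaced by a fixed hypothetical expert mapping, the policy is linear and fit by least squares, a blending term stands in for the soft policy-agreement penalty that pulls trajectories toward the policy, and noise is injected into the states before regression.

```python
import numpy as np

rng = np.random.default_rng(1)
state_dim, action_dim, n_traj, T = 4, 2, 8, 25
W_expert = rng.normal(0, 0.5, (state_dim, action_dim))  # hypothetical optimum

def optimize_trajectory(task, W, coupling=1.0):
    """Stand-in for trajectory optimization on one task: produce states
    and actions, nudged toward the current policy's predictions (the
    policy-agreement term, here a simple blend)."""
    S = rng.normal(size=(T, state_dim)) + task
    A_free = S @ W_expert                       # task-optimal actions
    A_pol = S @ W                               # current policy's actions
    A = (A_free + coupling * A_pol) / (1.0 + coupling)
    return S, A

def fit_policy(S, A, noise=0.1):
    """Regression step: fit actions from noise-perturbed states, so the
    policy also knows what to do in nearby states."""
    S_noisy = S + rng.normal(0, noise, S.shape)
    W, *_ = np.linalg.lstsq(S_noisy, A, rcond=None)
    return W

W = np.zeros((state_dim, action_dim))
for it in range(5):  # alternate: solve trajectories, then refit the policy
    states, actions = [], []
    for task in np.linspace(-1, 1, n_traj):
        S, A = optimize_trajectory(task, W)
        states.append(S)
        actions.append(A)
    W = fit_policy(np.vstack(states), np.vstack(actions))
```

The policy-agreement blend is what resolves the inconsistency problem in this sketch: each round, the trajectories move toward actions the single policy can actually represent, so the regression targets become mutually consistent.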