Wednesday, February 15, 2012

Incremental exploration

Learning from demonstration typically means learning from training data in the form of a relatively small number of complex sequences of observations and, potentially, actions. The strength of this learning paradigm is that the data provided relate to the crucial areas of the problem space. In the case of reinforcement learning, this would mean the key reward states and effective paths to them from relevant starting states. However, because only a restricted part of the problem space can be covered this way, learning from demonstration typically produces brittle behaviours that cannot compensate for perturbations that push the robot outside the known area. One solution to this problem is to use the training data to learn a policy that generalises across large areas of the problem space, such as the Nonlinear Dynamical Systems presented by Ijspeert et al. [3]. Another approach is to hard-code a mechanism for returning to the known area, such as the extension to Gaussian Mixture Models presented by Calinon [2]. Abbeel and Ng [1] argued, from their experience in the domain of autonomous helicopter control, that an explicit exploration policy is not required in order to improve performance up to or beyond that of the teacher; the natural perturbations provide sufficient exploration.

Bill Smart and Leslie Kaelbling [5] developed the JAQL (Joystick and Q-learning?) algorithm to overcome this problem. JAQL has two distinct learning phases. In the first phase, the robot is driven through the "interesting" parts of the problem space by a hand-coded controller or by a human operating a joystick. In the second phase, the learned policy takes control and is responsible for further exploration, running in a more standard reinforcement learning mode. JAQL therefore has an explicit exploration policy designed to work with policies learned from demonstration.
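To make the two-phase structure concrete, here is a minimal Python sketch of my own (not code from [5]) using a tabular Q-learner and a hypothetical env object with reset() and step(); the real JAQL operates in continuous state and action spaces with function approximation, so this illustrates only the phase split, not the algorithm itself.

ALPHA, GAMMA = 0.1, 0.95
Q = {}  # (state, action) -> estimated value

def q_update(s, a, r, s_next, actions):
    # Standard one-step Q-learning backup.
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + ALPHA * (r + GAMMA * best_next - Q.get((s, a), 0.0))

def phase_one(demonstrations, actions):
    # Phase 1: the teacher (hand-coded controller or joystick) chooses the actions;
    # the learner only observes the transitions and updates its value estimates.
    for trajectory in demonstrations:
        for s, a, r, s_next in trajectory:
            q_update(s, a, r, s_next, actions)

def phase_two(env, actions, episodes=100):
    # Phase 2: the learned policy is in control and gathers its own experience.
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = max(actions, key=lambda a2: Q.get((s, a2), 0.0))  # greedy action
            s_next, r, done = env.step(a)
            q_update(s, a, r, s_next, actions)
            s = s_next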

The JAQL exploration policy creates slight deviations from the greedy action by adding a small amount of Gaussian noise [4], producing actions that are "similar to, but different from" the greedy action.
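As a rough illustration, the sketch below (the function name, noise scale and action bounds are mine, not taken from [4]) perturbs a continuous greedy action with zero-mean Gaussian noise and clips the result to the valid range, which is the spirit of the "similar to, but different from" idea.

import numpy as np

def explore(greedy_action, noise_std=0.05, action_low=-1.0, action_high=1.0):
    # Add a small zero-mean Gaussian perturbation to the greedy action.
    noise = np.random.normal(0.0, noise_std, size=np.shape(greedy_action))
    noisy_action = np.asarray(greedy_action) + noise
    # Keep the perturbed action inside the valid action range.
    return np.clip(noisy_action, action_low, action_high)

# Example: a two-dimensional continuous action jittered slightly before execution.
action = explore(np.array([0.4, -0.2]))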

Our RLSOM algorithm has so far been applied only to learning from demonstration, but it should be capable of handling learning from exploration without modifications other than a reasonable exploration policy. This is one of the most exciting directions in which to take our research.

[1] Pieter Abbeel and Andrew Y. Ng, Exploration and apprenticeship learning in reinforcement learning. In Proceedings of the 22nd International Conference on Machine Learning (ICML'05), pp. 1-8, August 7-11, Bonn, Germany, 2005.

[2] Sylvain Calinon, Robot Programming by Demonstration: A Probabilistic Approach. EPFL/CRC Press, 2009.

[3] Auke J. Ijspeert, Jun Nakanishi and Stefan Schaal, Movement imitation with nonlinear dynamical systems in humanoid robots. In Proceedings of the International Conference on Robotics and Automation (ICRA'02), pp. 1398-1403, May 11-15, Washington, DC, 2002.

[4] William D. Smart, Making Reinforcement Learning Work on Real Robots. Ph.D. thesis, Department of Computer Science, Brown University, 2002.

[5] William D. Smart and Leslie Pack Kaelbling, Reinforcement Learning for Robot Control. In Mobile Robots XVI (Proceedings of the SPIE 4573), pp. 92-103, Douglas W. Gage and Howie M. Choset (eds.), Boston, Massachusetts, 2001.
