Monday, September 27, 2010
This feature has been described by Ijspeert et al. (2002) as an attractor landscape, where the learned policy produces trajectories that move towards the ideal trajectory. This can be achieved by designing reward functions that promote the ideal trajectory; algorithms that explore the landscape around it will then produce an attractor landscape through the discounted rewards.
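As a minimal illustration of the attractor idea, here is a one-dimensional point attractor: a critically damped spring-damper system whose trajectories converge to a goal from any starting state. This is a simplification of my own, not Ijspeert et al.'s full dynamic movement primitive formulation, which adds a learned forcing term on top of such a system.

```python
# A one-dimensional point attractor: a critically damped spring-damper
# system integrated with Euler steps. Trajectories converge to the goal
# g from any starting state x0. The gain k is an arbitrary choice.
import math

def simulate_attractor(x0, g, k=25.0, steps=200, dt=0.01):
    """Integrate x'' = k*(g - x) - 2*sqrt(k)*x' and return the trajectory."""
    x, v = x0, 0.0
    d = 2.0 * math.sqrt(k)          # critical damping: no overshoot
    trajectory = [x]
    for _ in range(steps):
        a = k * (g - x) - d * v
        v += a * dt
        x += v * dt
        trajectory.append(x)
    return trajectory

traj = simulate_attractor(x0=0.0, g=1.0)
# The final states of the trajectory sit close to the goal g = 1.0,
# whatever the initial state was.
```

The point is the qualitative behaviour: perturb the start anywhere in the landscape and the dynamics still pull the state towards the goal.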
Thursday, September 23, 2010
The features so far are:
- Sequential - Sequences of events in time. This is something artificial neural networks (ANNs) are typically not very good at.
- Hierarchical - Hierarchical structures for long-term memory (LTM). This enables reuse of low-level behaviours and efficient encoding of observations. It is a well-studied problem in reinforcement learning (RL), and Barto and Mahadevan (2003) have published a good review paper.
- Incremental - Observations are processed in the order they are made, improving the existing system before further observations arrive, without storing and revisiting old observations. Reinke and Michalski (1988) first introduced this concept w.r.t. their incremental AQ algorithm for learning concept descriptions. Many ANNs are, in contrast, batch learners, i.e., they need to repeatedly pass through stored batches of observations.
- One-shot - Able to learn from a single, or a small number of, observations. Bayesian approaches to this problem have been published by Fei-Fei et al. (2003) and Maas and Kemp (2009). Instance-based learning algorithms such as the Nearest Sequence Memory (NSM) algorithm presented by McCallum (1996) are an extreme form of one-shot learning. Wu and Demiris (2010) recently published an algorithm that is both hierarchical and one-shot.
- Auto-associative - Retrieving memory content using the content itself as a reference, e.g., retrieving a stored image of a person's face using another image or a partially obscured image of that face. This is something neural networks are good at, and it makes them able to handle noisy real-world data. This is something RL algorithms are typically not so good at.
- Future reward prediction - Predicting what actions will optimise future rewards in what world states. This is a core feature of RL algorithms (Sutton & Barto 1998).
- Hidden state identification - Different world states can produce identical observations, e.g., two different corridors in a building can look exactly the same. The only way to tell these states apart is to remember observations from the past, e.g., you know what corridor you're in because you remember what floor you took the elevator to. Some, but not all, RL algorithms have this feature.
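The future reward prediction feature is the classic value update of RL. As a reminder, here is a minimal tabular Q-learning sketch; the toy two-state chain, learning rate and discount factor are arbitrary choices of mine for illustration.

```python
# One-step tabular Q-learning (Sutton & Barto 1998): the entry Q[s][a]
# drifts towards the immediate reward plus the discounted best value of
# the next state, so Q comes to predict discounted future rewards.
from collections import defaultdict

Q = defaultdict(lambda: defaultdict(float))
ALPHA, GAMMA = 0.5, 0.9   # learning rate and discount factor

def update(s, a, reward, s_next, actions_next):
    best_next = max((Q[s_next][b] for b in actions_next), default=0.0)
    Q[s][a] += ALPHA * (reward + GAMMA * best_next - Q[s][a])

# A two-step chain: from 'start', action 'go' reaches 'goal' with no
# reward; from 'goal', action 'stay' pays reward 1 and terminates.
for _ in range(50):
    update('goal', 'stay', 1.0, 'terminal', [])
    update('start', 'go', 0.0, 'goal', ['stay'])

# Q['goal']['stay'] converges to 1.0, and Q['start']['go'] to
# GAMMA * 1.0 = 0.9: the earlier state predicts the discounted reward.
```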
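To make the one-shot and hidden state identification features concrete, here is a toy sketch in the spirit of McCallum's NSM, using my own simplified data structures rather than his exact algorithm: every experience is stored verbatim, so a single demonstration suffices, and an ambiguous observation is disambiguated by how much of the preceding history matches a stored sequence.

```python
# Toy nearest-sequence matching in the spirit of McCallum's NSM.
# Experiences are stored as-is (one-shot, instance-based); an ambiguous
# observation is told apart by the length of matching history before it
# (hidden state identification).

observations = []   # stored observation sequence, in order
actions = []        # action taken at each stored step

def record(obs, action):
    observations.append(obs)
    actions.append(action)

def match_length(i, recent):
    """Length of the suffix of `recent` matching the stored steps ending at i."""
    n = 0
    while n < len(recent) and i - n >= 0 and observations[i - n] == recent[-1 - n]:
        n += 1
    return n

def suggest_action(recent):
    """Action taken at the stored step whose history best matches `recent`."""
    best = max(range(len(observations)), key=lambda i: match_length(i, recent))
    return actions[best]

# Two corridors produce the identical observation 'corridor'; only the
# remembered elevator floor tells them apart.
record('elevator_floor_1', 'exit')
record('corridor', 'go_left')
record('elevator_floor_2', 'exit')
record('corridor', 'go_right')

# Seeing 'corridor' alone is ambiguous, but the remembered floor
# observation selects the right stored instance.
```

This is an extreme form of one-shot learning: a single stored sequence per corridor is enough to act correctly, with no repeated training passes.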
Cohen et al. (2002) have presented the Constructivist Learning Architecture (CLA), which is sequential, hierarchical and auto-associative, using decaying node activities in a hierarchical self-organising map (SOM), or Kohonen network (Kohonen 2001).
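A sketch of the decaying-activity mechanism over a single (pre-trained) SOM: the winning node for each input is set to full activity while all activities decay, so the activity pattern encodes the recent input sequence. The tiny fixed codebook and the decay constant below are illustrative choices of mine, not values from the paper.

```python
# Decaying node activities over a SOM with a fixed 4-node codebook.
# Each input activates its nearest codebook node fully while older
# activations fade, so the activity vector is a recency-weighted trace
# of the input sequence.

codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
activity = [0.0] * len(codebook)
DECAY = 0.5

def present(x):
    """Present one input; return a copy of the updated activity trace."""
    dists = [sum((a - b) ** 2 for a, b in zip(node, x)) for node in codebook]
    winner = dists.index(min(dists))
    for i in range(len(activity)):
        activity[i] *= DECAY          # older inputs fade away
    activity[winner] = 1.0            # the newest input is strongest
    return list(activity)

present((0.0, 0.0))
present((1.0, 0.0))
trace = present((1.0, 1.0))
# trace now ranks the nodes by how recently each won, so the single
# activity vector encodes the order of the last few inputs.
```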
My paper (2010) presented a hierarchical extension of McCallum's NSM algorithm that is sequential, one-shot and hierarchical, with future reward prediction and hidden state identification capabilities.
Pierris and I (2010) published an algorithm that used a version of Cohen's CLA algorithm, the Compressed Sparse Code (CoSCo) SOM, to reproduce a humanoid robot motion demonstrated by a human teacher. This work went beyond the original work of Cohen et al. in that it repeatedly used the learned SOM for action selection when reproducing the motion.
In the future I aim to use the mechanisms I developed for the hierarchical NSM algorithm to extend Chaput's decaying-activity hierarchical SOMs so that they support discounted future rewards and hidden state identification. This would create an algorithm suitable for RL problems such as reaching.