## Monday, March 12, 2018

### MCTS - Further Improvements

There is a rich literature on how to improve MCTS beyond the different confidence bound values presented in my previous blog post on MCTS Confidence Bounds.

## Single Player Games

Most MCTS algorithms have been developed to search two-player games such as Go.  When applying MCTS principles to single-player games, a number of issues have been considered.

Some work discusses the option of discounting the reward value, using a discount factor, $\gamma$, during a backup.  This is considered important in two-player games, where the opponent may be unpredictable [2], but less so for single-player games, where action selection and state transition are more deterministic.

The more deterministic environment in single-player games also makes it more attractive to choose a child based on the maximum simulation reward for each child, rather than the mean reward.  This requires each node to record the maximum reward as well as the mean.  In CADIAPLAYER [2] the mean is still used during simulation, but the max is used for the final selection of an action to play.
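As a minimal sketch of this idea (illustrative names, not CADIAPLAYER's actual code), the per-branch statistics supporting both selection rules might look like this:

```python
class Node:
    """Per-branch statistics for single-player MCTS (illustrative sketch)."""
    def __init__(self):
        self.visits = 0
        self.total_reward = 0.0
        self.max_reward = float("-inf")

    def update(self, reward):
        # Backup records both the running mean (used in-tree during simulation)
        # and the maximum (used for the final action choice).
        self.visits += 1
        self.total_reward += reward
        self.max_reward = max(self.max_reward, reward)

    @property
    def mean_reward(self):
        return self.total_reward / self.visits if self.visits else 0.0

def final_action(children):
    """Final selection by highest maximum simulation reward, CADIAPLAYER-style.
    children maps actions to Node statistics."""
    return max(children, key=lambda a: children[a].max_reward)
```

Note how a branch with mean 0.5 but one perfect simulation beats a branch with a steady mean of 0.6; in a deterministic domain the best observed line can actually be replayed.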

CADIAPLAYER also adds the complete search path to the tree, rather than a single-node extension, for searches that lead to optimal solutions.

## Recursive Searches

Instead of doing a large number of simulations and selecting the final action based on the observed statistics, algorithms such as Nested Monte-Carlo Search (NMCS) [9] make further searches starting from the best node identified in the previous search.  This increases the proportion of searches around the most promising path and reports better results on some problems than MCTS.

Another recursive solution is Nested Rollout Policy Adaptation [10], which uses a policy to guide the child selection in the typically random rollout phase.  It also updates the given policy using a gradient step after each recursive call.
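The recursive NMCS scheme can be sketched as follows.  The `legal_moves`, `apply_move`, `score` and `is_terminal` callbacks are assumed to be supplied by the problem domain (higher scores better); this is a sketch of the published idea, not Cazenave's code:

```python
import random

def nested_search(state, level, legal_moves, apply_move, score, is_terminal):
    """Level-`level` Nested Monte-Carlo Search (sketch).
    Returns (score, move sequence) for a complete path to a terminal state."""
    if is_terminal(state):
        return score(state), []
    if level == 0:
        # Base case: a plain uniform-random rollout to a terminal state.
        seq = []
        while not is_terminal(state):
            move = random.choice(legal_moves(state))
            state = apply_move(state, move)
            seq.append(move)
        return score(state), seq
    best_score, best_seq, seq = float("-inf"), [], []
    while not is_terminal(state):
        # Evaluate each move with a level-(n-1) search from the child state.
        for move in legal_moves(state):
            s, rest = nested_search(apply_move(state, move), level - 1,
                                    legal_moves, apply_move, score, is_terminal)
            if s > best_score:
                best_score, best_seq = s, seq + [move] + rest
        # Follow the best complete sequence found so far.
        move = best_seq[len(seq)]
        state = apply_move(state, move)
        seq.append(move)
    return score(state), seq
```

Every candidate kept in `best_seq` is a complete path to a terminal state, so following `best_seq[len(seq)]` is always well-defined even when no new candidate beats the earlier best.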

## Combining MCTS with Learning

MCTS can be effectively combined with RL algorithms for learning RL policies or value functions.  Due to their ability to compress information and generalise over states and actions, such algorithms can provide global, long-term representations that complement the detailed, uncompressed, focused and short-term representation of MCTS.
A typical setup, used by AlphaZero [8], Expert Iteration [1] and Dyna-2 [7], is for MCTS to provide state, policy and/or reward samples for supervised training of long-term policies or value functions, while the policy and/or value functions provide heuristics to guide the action selection in MCTS.

## References

[1] Thomas Anthony and David Barber (2017) Thinking Fast and Slow with Deep Learning and Tree Search, NIPS 2017, pp5360-5370.

[2] Yngvi Bjornsson and Hilmar Finnsson (2009) CADIAPLAYER: A Simulation-Based General Game Player.

[3] B. Bouzy and B. Helmstetter (2004) Advances in Computer Games, vol 135. Springer.

[4] B. Brügmann (1993) Monte Carlo Go, technical report.

[5] Remi Coulom (2006) Efficient selectivity and backup operators in Monte-Carlo tree search, in the proceedings of the 5th International Conference on Computers and Games, pp72-83.

[6] Alan Fern and Paul Lewis (2011) Ensemble Monte-Carlo planning: an empirical study, in the Proceedings of the Twenty-First International Conference on International Conference on Automated Planning and Scheduling, pp58-65.

[7] David Silver, Richard S. Sutton and Martin Müller (2008) Sample-based learning and search with permanent and transient memories, in the proceedings of the 25th international conference on Machine learning (ICML), New York, USA, pp968-975.

[8] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, Demis Hassabis (2017) Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm, arXiv:1712.01815 [cs.AI].

[9] Tristan Cazenave (2009) Nested Monte-Carlo Search, in the Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI), pp456-461.

[10] Christopher D. Rosin (2011) Nested Rollout Policy Adaptation for Monte Carlo Tree Search, in the Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence (IJCAI), pp649-654.

### MCTS - Confidence Bounds

Most Monte-Carlo Tree Search algorithms use some version of a confidence bound on the value of a node to decide where to focus the search, i.e., which node to expand next.

## Upper Confidence Bounds for Trees (UCT)

UCT is one family of MCTS algorithms.  In particular, UCT algorithms use the maximum UCT value to choose between the children of a node.  The UCT value is defined in Equation \eqref{eq:uct}.

$$UCT = \bar{X_j}+2C_p\sqrt{\frac{2\ln n}{n_j} } \tag{1}\label{eq:uct}$$

where $\bar{X_j}$ is the mean reward from branch $j$, $n$ is the number of times the current (parent) node has been visited and $n_j$ is the number of times branch $j$ has been visited.  $C_p$ is a constant exploration term greater than 0 which decides the amount of exploration done.  An $n_j = 0$ implies a $UCT$ value of $\infty$, so unvisited branches will always be explored at least once.
Given enough resources, UCT converges to minimax and provides the optimal solution.

The $\sqrt{\frac{2\ln n}{n_j} }$ element of Equation \eqref{eq:uct} is the upper confidence bound term from the $UCB1$ bandit algorithm.
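Equation \eqref{eq:uct} translates directly into code.  In the sketch below, the default $C_p = 1/\sqrt{2}$ is an assumed value, often suggested for rewards in $[0, 1]$, and the function names are mine:

```python
import math

def uct_value(mean_reward, parent_visits, child_visits, cp=1 / math.sqrt(2)):
    """UCT value of one branch, following Equation (1).
    cp = 1/sqrt(2) is an assumed default for rewards in [0, 1]."""
    if child_visits == 0:
        # Unvisited branches get infinite value, so each is tried at least once.
        return math.inf
    return mean_reward + 2 * cp * math.sqrt(2 * math.log(parent_visits) / child_visits)

def select_child(stats, parent_visits):
    """stats: list of (mean_reward, visits) pairs, one per branch.
    Returns the index of the branch with the highest UCT value."""
    return max(range(len(stats)),
               key=lambda j: uct_value(stats[j][0], parent_visits, stats[j][1]))
```

With `stats = [(0.5, 10), (0.4, 5), (0.0, 0)]` the unvisited third branch is selected first; once all branches have visits, the bonus term steers selection towards under-sampled branches.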

## Improving on $UCB1$

It is possible to improve theoretically on $UCB1$, making tighter bounds, by replacing the $UCB1$ element with the expression in Equation \eqref{eq:ucb_tuned}.

$$UCB_{Tuned} = \sqrt{\frac{\ln n}{n_j} \min\left\{\frac{1}{4},V_j(n_j)\right\}} \tag{2}\label{eq:ucb_tuned}$$

where $V_j(s)$, an upper confidence bound on the variance of branch $j$ after $s$ plays out of $t$ in total, is given in Equation \eqref{eq:var}.

$$V_j(s) = \left(\frac{1}{s}\sum_{\tau=1}^s X_{j,\tau}^2\right) - \bar{X}_{j,s}^2 + \sqrt{\frac{2 \ln t}{s}} \tag{3}\label{eq:var}$$
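Assuming rewards in $[0, 1]$ and a per-branch running sum of squared rewards, Equations \eqref{eq:ucb_tuned} and \eqref{eq:var} can be computed as below; the function and argument names are mine:

```python
import math

def v_j(sum_sq, mean, s, t):
    """V_j(s) from Equation (3): sample variance of the branch after s plays
    plus an exploration term, with t total plays.  sum_sq is the running sum
    of squared rewards for the branch."""
    return sum_sq / s - mean ** 2 + math.sqrt(2 * math.log(t) / s)

def ucb_tuned(mean, sum_sq, s, t):
    """Branch mean plus the UCB1-Tuned exploration term from Equation (2).
    The min with 1/4 caps the variance estimate for [0, 1]-bounded rewards."""
    return mean + math.sqrt((math.log(t) / s) * min(0.25, v_j(sum_sq, mean, s, t)))
```

For a branch with constant reward 0.5 over 10 of 100 plays, the tuned bound is noticeably tighter than the plain $\sqrt{2 \ln t / s}$ exploration term of $UCB1$.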

Another alternative for improving on the bounds is to use a Bayesian framework where the algorithm maximises $B_i$ as defined in Equation \eqref{eq:Bayes}.

$$B_i = \mu_i + \sqrt{\frac{2\ln N}{n_i} } \sigma_i \tag{4}\label{eq:Bayes}$$

where $\mu_i$ is the mean of an extremum (minimax) distribution $P_i$ and $\sigma_i$ is the square root of the variance of $P_i$.  The Bayesian bounds have been shown to be tighter than $UCB1$ but are also slower to calculate.

## Handling Poor Early Estimates

It has also been suggested that it is a good idea to wait to follow the UCT values until the sample size behind them suggests that they are meaningful.  This is especially relevant when there are alternative policies available, e.g., heuristics or a separately learned policy.  This approach is called Progressive Bias.

Initialising a node's statistics with something other than zeros is, in general, called Search Seeding.  It has been found that such priors work best when provided by a function approximator, as done when combining MCTS with learning.

An effective amelioration of this issue was suggested in a post on the computer-go mailing list by Brian Lee, one of the contributors to the MiniGo implementation of the AlphaGo algorithm.  When expanding a node, the values of its children are set to that of their parent node, with appropriate mechanisms for handling the fact that these values do not imply an increase in the number of visits to the parent node.
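A sketch of this initialisation, with hypothetical names, might keep the seeded prior separate from the real visit statistics:

```python
class TreeNode:
    """Sketch of parent-value initialisation for newly expanded children.
    The seeded prior shapes the value estimate without counting as a visit."""
    def __init__(self, prior_value=0.0):
        self.visits = 0
        self.total_reward = 0.0
        self.prior_value = prior_value
        self.children = {}

    @property
    def value(self):
        # Fall back on the seeded prior until real simulations arrive.
        if self.visits == 0:
            return self.prior_value
        return self.total_reward / self.visits

    def expand(self, actions):
        # Children start from the parent's current estimate, not from zero.
        for a in actions:
            self.children[a] = TreeNode(prior_value=self.value)
```

Because the prior lives in a separate field, the parent's visit count is untouched and the child's first real backup replaces, rather than averages with, the seed.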

## Avoiding Disappearing Node Selection Probabilities

A problem with using UCT estimates as the foundation of a probability distribution for node selection is that it can be difficult to ensure exploration at the lower levels, as the combined probabilities across the levels get increasingly small.  First Play Urgency addresses this by introducing a constant value for untried nodes, ensuring each is explored at least once.


## Friday, March 02, 2018

### Monte Carlo Tree Search (MCTS) - Core Concepts

This is an introduction/summary of MCTS, mainly for myself, to learn about it.  MCTS is, of course, an integral part of AlphaZero [8], but it is also used in other successful algorithms such as Expert Iteration [1].

## Concepts and Properties

Fundamentally, MCTS is an algorithm for searching state-action spaces for optimal action sequences and is, in that respect, very similar to Reinforcement Learning (RL).  MCTS represents states as nodes and actions as edges.  The underlying assumption is that the true value of actions can be estimated through repeated sampling (Monte Carlo) and that the estimated values can be used to guide the search and make it approximate a best-first search.

MCTS is an aheuristic search algorithm in that it performs a focused search without requiring domain-specific heuristics.  This means MCTS can be applied effectively to problems where such heuristics are difficult to specify, e.g., the game of Go.  It is possible, however, to use MCTS with domain-specific heuristics in order to further improve its performance.

The problem state to be explored becomes the root of the tree and the tree is built incrementally by performing a number of simulations.  The tree is used to estimate the state values and as the tree grows, the estimated values become increasingly accurate.

A simulation is a sequence of actions starting from the root and ending when a given termination condition is met.  A simulation may contain four phases:
1. Selection (or the tree phase)
2. Expansion
3. Rollout (or playout)
4. Backup (or backpropagation)
The selection identifies the most promising leaf node in the current tree by traversing it according to the tree policy until a leaf is reached.

During rollouts, actions are selected according to a default policy, commonly according to a flat random distribution, until a termination state is reached.  The value of this state, e.g., reward, is then recorded and used to update the estimated value of the nodes in the tree.  The algorithm does not calculate the value of other nodes visited during the rollout.

The algorithm typically does not complete the search but stops after a given number of simulations, or when another limit on the computational budget is reached.  When the algorithm is done searching, it returns the best child, i.e., the action that has the highest value.  The best child can be defined according to one of the following criteria:
1. Max child - The child with the highest value
2. Robust child - The child with the highest visit count
3. Max-robust child - The child with both 1. and 2.  If no such child exists, keep searching until it does
4. Secure child - Select the child according to a maximum lower confidence bound
When using MCTS to solve a problem or play a game, the action returned would be taken and another search would be initiated with the resulting state as the root of a new search tree.
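The four phases and the final best-child selection can be put together in a minimal, deliberately simplified sketch.  The domain callbacks (`legal`, `step`, `terminal`, `reward`), the UCT constant and all names are assumptions for illustration; the robust child is returned:

```python
import math
import random

class Node:
    """One tree node: a state plus visit statistics (illustrative sketch)."""
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}            # action -> Node
        self.visits, self.total = 0, 0.0

def mcts(root_state, legal, step, terminal, reward, budget=200, cp=1 / math.sqrt(2)):
    """Minimal MCTS: selection, expansion, rollout and backup, repeated
    for a fixed simulation budget.  Rewards are assumed to be in [0, 1]."""
    root = Node(root_state)
    for _ in range(budget):
        node = root
        # 1. Selection: follow the UCT tree policy while fully expanded.
        while not terminal(node.state) and len(node.children) == len(legal(node.state)):
            n = node.visits
            node = max(node.children.values(),
                       key=lambda c: c.total / c.visits
                       + 2 * cp * math.sqrt(2 * math.log(n) / c.visits))
        # 2. Expansion: add one untried action as a new leaf.
        if not terminal(node.state):
            action = random.choice([a for a in legal(node.state)
                                    if a not in node.children])
            node.children[action] = Node(step(node.state, action), parent=node)
            node = node.children[action]
        # 3. Rollout: random default policy down to a terminal state.
        state = node.state
        while not terminal(state):
            state = step(state, random.choice(legal(state)))
        value = reward(state)
        # 4. Backup: propagate the rollout value up to the root.
        while node is not None:
            node.visits += 1
            node.total += value
            node = node.parent
    # Return the robust child: the most visited action at the root.
    return max(root.children, key=lambda a: root.children[a].visits)
```

On a toy problem, e.g. choosing three binary actions to maximise their sum, the robust child quickly settles on the better first action.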

### Selection

Selection is done within the search tree based on the tree policy.  In order to try all available actions while, at the same time, narrowing down on the most promising ones, the tree policy must select actions in a way that combines exploration and exploitation.

A range of values can be used to select a child.  The most common is probably Upper Confidence Bounds for Trees ($UCT$), as presented below.  This value requires each node in the tree to record two values: $n$, the number of times it has been selected, and $\bar{X_j}$, the average reward from the simulations that selected it.

The normalised $UCT$ values can act as a probability distribution for selection, potentially with a softmax selection to control the level of exploration/exploitation or a Boltzmann temperature to vary this level over time.

In progressive pruning, the standard deviation of the reward is also recorded so that statistical comparisons of child nodes can be done.  Nodes that are significantly inferior to other nodes are then pruned from the search tree.

With many or continuous actions, it is possible to let a node represent many actions and only split the node into separate nodes once it has been visited enough times that meaningful values for the new branches can be estimated.  This is called progressive widening [5].
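One common formulation (an assumed sketch, not taken verbatim from [5]) lets the permitted number of children grow as a power of the visit count:

```python
import math

def allowed_children(visits, alpha=0.5, c=1.0):
    """Progressive widening (sketch): the number of distinct children a node
    may expand grows slowly with its visit count.  alpha and c are assumed
    tuning constants; smaller alpha widens the node more slowly."""
    return max(1, math.ceil(c * visits ** alpha))
```

A node then only adds a new child when `len(children) < allowed_children(visits)`; with `alpha=0.5`, the second child appears after four visits and the tenth after a hundred.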

When a number of actions are similar, it can be beneficial to treat them in a single branch of the search tree, thus reducing the branching factor.  Childs, Brodeur & Kocsis introduced the concept of a move group to achieve this.

### Backup

During the backup phase, the MCTS algorithm records the values used by the tree policy so that the next simulation can make use of the information gained during the current simulation.

If new information is backed up immediately, the algorithm is an anytime algorithm, meaning it can return a valid action whenever it is interrupted.

Some algorithms, such as Gobble [4], back up the final value to any occurrence of a move within the search tree independent of the order/sequence of moves it occurs in.  This is called the all-moves-as-first heuristic and works when the order of moves does not significantly affect the value of a move.
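The all-moves-as-first backup can be sketched as follows, with illustrative names; here each distinct move occurring anywhere in a simulation is credited once with the final reward:

```python
def amaf_backup(stats, tree_moves, rollout_moves, reward):
    """All-moves-as-first backup (sketch): credit the simulation's final
    reward to every move that occurred in it, regardless of move order.
    stats maps a move to a [visits, total_reward] pair."""
    for move in set(tree_moves) | set(rollout_moves):
        visits_total = stats.setdefault(move, [0, 0.0])
        visits_total[0] += 1
        visits_total[1] += reward
```

Using a set means a move repeated within one simulation is credited once; in Go each point is played at most once anyway, which is part of why the heuristic suits that domain.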

## References

[1] Thomas Anthony and David Barber (2017) Thinking Fast and Slow with Deep Learning and Tree Search, NIPS 2017, pp5360-5370.

[2] Yngvi Bjornsson and Hilmar Finnsson (2009) CADIAPLAYER: A Simulation-Based General Game Player.

[3] B. Bouzy and B. Helmstetter (2004) Advances in Computer Games, vol 135. Springer.

[4] B. Brügmann (1993) Monte Carlo Go, technical report.

[5] Remi Coulom (2006) Efficient selectivity and backup operators in Monte-Carlo tree search, in the proceedings of the 5th International Conference on Computers and Games, pp72-83.

[6] Alan Fern and Paul Lewis (2011) Ensemble Monte-Carlo planning: an empirical study, in the Proceedings of the Twenty-First International Conference on International Conference on Automated Planning and Scheduling, pp58-65.

[7] David Silver, Richard S. Sutton and Martin Müller (2008) Sample-based learning and search with permanent and transient memories, in the proceedings of the 25th international conference on Machine learning (ICML), New York, USA, pp968-975.

[8] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, Demis Hassabis (2017) Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm, arXiv:1712.01815 [cs.AI].

[9] Tristan Cazenave (2009) Nested Monte-Carlo Search, in the Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI), pp456-461.

[10] Christopher D. Rosin (2011) Nested Rollout Policy Adaptation for Monte Carlo Tree Search, in the Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence (IJCAI), pp649-654.

## Wednesday, November 23, 2016

### Hierarchical Reinforcement Learning: A Literature Summary

This is a quick summary of current work on hierarchical reinforcement learning (RL) aimed at students choosing to do hierarchical RL projects under my supervision.

The most common formalisation of hierarchical RL, in terms of semi-MDPs, was given by Sutton, Precup and Singh.

There is also a summary of this area by

In 2015, Pierre-Luc Bacon, Jean Harb and Doina Precup published an article entitled 'The Option-Critic Architecture', describing an algorithm for automatically sub-dividing and solving an RL problem.

## Wednesday, October 26, 2016

### Spatio-Temporal Data from Reinforcement Learning

Applying RL algorithms in spatial POMDP domains produces spatio-temporal data that must be analysed and organised in order to produce effective control policies.

There has recently been a great amount of progress in analysing cortical representations of space and time in terms of place cells and grid cells.  This work has the potential to inform RL in terms of efficient encoding and reuse of spatial data.
The overlap between RL and the neuroscience of mapping local space is particularly interesting as RL can produce raw spatio-temporal data from local sensors.  This provides us with an opportunity to analyse, explore and identify the computational and behavioural principles that enable efficient learning of spatial behaviours.

### Neuroscience - Mapping local space

A great introduction to this work is available through the lectures from the three Nobel prize winners in this area.
There is also a TED talk from 2011 on this subject by Neil Burgess from UCL (in O'Keefe's group) entitled How your brain tells you where you are.  Burgess has a range of more general papers on spatial cognition.

A brief colloquial presentation of this research entitled 'Discovering grid cells' is available from the Kavli Institute of Systems Neuroscience's Centre for Neural Computation.

There was also a nice review article in the Annual Review of Neuroscience entitled 'Place Cells, Grid Cells, and the Brain's Spatial Representation System', Vol. 31:69-89, 2008, by Edvard I. Moser, Emilio Kropff and May-Britt Moser.

There was also a Hippocampus special issue on grid cells in 2008, edited by Michael E. Hasselmo, Edvard I. Moser and May-Britt Moser.

Recently there was another summary article in Nature Reviews Neuroscience entitled 'Grid cells and cortical representation', Vol. 15:466–481, 2014, by Edvard I. Moser, Yasser Roudi, Menno P. Witter, Clifford Kentros, Tobias Bonhoeffer and May-Britt Moser.

Further relevant work has recently been presented in an article entitled 'Grid Cells and Place Cells: An Integrated View of their Navigational and Memory Function' in Trends in Neurosciences, Vol. 38(12):763–775, 2015, by Honi Sanders, César Rennó-Costa, Marco Idiart and John Lisman.

A more general introduction to

### Computational Approaches

There is a review article on computational approaches to these issues entitled 'Place Cells, Grid Cells, Attractors, and Remapping' in Neural Plasticity, Vol. 2011, 2011 by Kathryn J. Jeffery.

Other relevant articles:

• 'Impact of temporal coding of presynaptic entorhinal cortex grid cells on the formation of hippocampal place fields' in Neural Networks, 21(2-3):303-310, 2008, by  Colin Molter and Yoko Yamaguchi.
• 'An integrated model of autonomous topological spatial cognition' in Autonomous Robots, 40(8):1379–1402, 2016, by Hakan Karaoğuz and Işıl Bozma.
• In 2003, in a paper entitled 'Subsymbolic action planning for mobile robots: Do plans need to be precise?', John Pisokas and Ulrich Nehmzow used the topology-preserving properties of self-organising maps to create spatial proto-maps that supported sub-symbolic action planning in a mobile robot.
• A paper entitled Emergence of multimodal action representations from neural network self-organization by German I. Parisi, Jun Tani, Cornelius Weber and Stefan Wermter includes an interesting section called 'A self-organizing spatiotemporal hierarchy' which addresses the automated structuring of spatio-temporal data.

## Monday, April 11, 2016

### A short bibliography BBAI and hierarchical RL with SOMs

This bibliography is meant for anyone who joins my research group to work on hierarchical reinforcement learning algorithms or related areas.

### My publications

• Georgios Pierris and Torbjørn S. Dahl, Learning Robot Control based on a Computational Model of Infant Cognition. In the IEEE Transactions on Cognitive and Developmental Systems, accepted for publication, 2016.
• Georgios Pierris and Torbjørn S. Dahl, Humanoid Tactile Gesture Production using a Hierarchical SOM-based Encoding. In the IEEE Transactions on Autonomous Mental Development, 6(2):153-167, 2014.
• Georgios Pierris and Torbjørn S. Dahl, A Developmental Perspective on Humanoid Skill Learning using a Hierarchical SOM-based Encoding. In the Proceedings of the 2014 International Joint Conference on Neural Networks (IJCNN'14), pp708-715, Beijing, China, July 6-11, 2014.
• Torbjørn S. Dahl, Hierarchical Traces for Reduced NSM Memory Requirements. In the Proceedings of the BCS SGAI International Conference on Artificial Intelligence, pp165-178, Cambridge, UK, December 14-16, 2010.

### Relevant papers

• Daan Wierstra, Alexander Forster, Jan Peters and Jurgen Schmidhuber, Recurrent Policy Gradients.  In Logic Journal of IGPL, 18:620-634, 2010. [pdf from IDSIA]
• Andrew G. Barto and Sridhar Mahadevan, Recent advances in hierarchical reinforcement learning, Discrete Event Dynamic Systems, 13(4):341-379, 2003. [pdf from Citeseer]
• Harold H. Chaput, Benjamin Kuipers and Risto Miikkulainen, Constructivist learning: A neural implementation of the schema mechanism.  In the Proceedings of the Workshop on Self-Organizing Maps (WSOM03), Kitakyushu, Japan, 2003. [pdf from Citeseer]
• Leslie B. Cohen, Harold H. Chaput and Cara H. Cashon, A constructivist model of infant cognition, Cognitive Development, 17:1323–1343, 2002 [pdf from ResearchGate]
• Richard S. Sutton, Doina Precup and Satinder Singh, Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning.  In Artificial Intelligence, 112:181–211, 1999. [pdf from the University of Alberta]
• Patti Maes, How to do the right thing, Connection Science Journal, 1:291-323, 1989. [pdf from Citeseer]
• Rodney A. Brooks, A Robust Layered Control System for a Mobile Robot, IEEE Journal of Robotics and Automation, 2(1):14-23, 1986. [pdf of MIT AI Memo 864]

### Books

• Joaquin M. Fuster, Cortex and mind: Unifying cognition, Oxford University Press, 2003. [pdf from ResearchGate]
• Richard S. Sutton and Andrew G. Barto, Reinforcement learning: An introduction, MIT Press, 1998. [pdf of unfinished 2nd edition]
• G. L. Drescher, Made-up minds, MIT Press, 1991 [pdf of MIT dissertation] - An actual constructivist architecture.

## Wednesday, January 06, 2016

### Sequence Similarity for Hidden State Estimation

Little work has been done on comparing long- and short-term memory (LTM and STM) traces in the context of hidden state estimation in POMDPs.  Belief-state algorithms use a probabilistic step-by-step approach which should be optimal, but doesn't scale well and has an unrealistic requirement for knowledge of the state space underlying observations.

The instance-based Nearest Sequence Memory (NSM) algorithm performs remarkably well without any knowledge of the underlying state space.  Instead it compares previously observed sequences of observations and actions in LTM with a recently observed sequence in STM to estimate the underlying state.  The NSM algorithm uses a count of matching observation-action records as a metric for sequence proximity.

In problems where certain observation-actions are particularly salient, e.g., passing through a 'door' in Sutton's world, or picking up a passenger in the Taxi problem, a simple match count is not a particularly good sequence proximity metric and, as a result, I have recently been casting around for other work on such metrics.