Intro
Monte Carlo Tree Search (MCTS) works well in practice but poses theoretical challenges. In this post, I describe the MCTS algorithm and why it works.
Open-loop planning algorithms such as MCTS plan future actions from an initial state $s_0$. They assume access to a model of the environment, either stochastic or deterministic. By contrast, a typical reinforcement learning algorithm is closed-loop: at each time step it selects an action based on the current state $s_t$.
Intuition
MCTS selects the action expected to yield the highest return. However, evaluating every state's value is usually infeasible; if all values were known, the agent could simply choose the best action. Thus we must estimate values, and rollouts make this possible. A single rollout can be inaccurate, but MCTS mitigates this by repeating rollouts and expanding the tree: as the number of visits increases, the estimates become more accurate.
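The averaging effect can be seen in a tiny, hypothetical simulation (the `noisy_rollout` generator and its true value of 10 are my own illustrative assumptions, not from the text): a single rollout can land anywhere in its noise band, while the mean of many rollouts concentrates near the true value.

```python
import random

def noisy_rollout(true_value=10.0, noise=5.0):
    """One Monte Carlo rollout: the true value plus simulation noise."""
    return true_value + random.uniform(-noise, noise)

def estimate(n_rollouts):
    """Average n independent rollouts, as MCTS does per node."""
    return sum(noisy_rollout() for _ in range(n_rollouts)) / n_rollouts

random.seed(0)
few = estimate(5)      # a handful of rollouts: noisy estimate
many = estimate(5000)  # many rollouts: typically much closer to 10
```

This is exactly the mechanism behind "more visits, better estimates": each node's value estimate is a sample mean whose error shrinks as its visit count grows.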
For deep trees, computation can be expensive: with a fixed branching factor $b$ (the number of actions), the cost of a full search grows as $b^d$ with depth $d$; for example, $b = 10$ and $d = 10$ already give $10^{10}$ leaves. I believe there are trade-off methodologies for managing this cost, and I plan to discuss them after further study.
Pseudocode
\begin{algorithm}
\caption{MCTS: Selection–Expansion–Rollout (from $s_0$)}
\begin{algorithmic}
\STATE current $\leftarrow$ $s_0$
\WHILE{not Leaf(current)}
\STATE current $\leftarrow$ $\arg\max_{s_i \in \mathcal{C}(\text{current})}\; \text{UCB1}(s_i)$
\ENDWHILE
\IF{$N(\text{current}) = 0$}
\RETURN Rollout(current) \COMMENT{unvisited leaf $\rightarrow$ rollout}
\ELSE
\FOR{each action $a$ available from current}
\STATE addNewStateToTree(current, $a$)
\ENDFOR
\STATE current $\leftarrow$ firstNewChild(current)
\RETURN Rollout(current) \COMMENT{expand then rollout}
\ENDIF
\end{algorithmic}
\end{algorithm}
\begin{algorithm}
\caption{Rollout$(s_i)$}
\begin{algorithmic}
\STATE $s \gets s_i$
\WHILE{true}
\IF{$\operatorname{Terminal}(s)$}
\RETURN $\operatorname{Value}(s)$
\ENDIF
\STATE $a \gets \operatorname{Random}(\operatorname{AvailableActions}(s))$
\STATE $s \gets \operatorname{Simulate}(a, s)$
\ENDWHILE
\end{algorithmic}
\end{algorithm}
Note. The pseudocode above shows how to run one MCTS iteration step by step so that, after enough iterations, the agent can choose the action with the highest estimated value. A node's value is typically the average of the returns of all rollouts that passed through it, accumulated during backpropagation.
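The two procedures above, plus the backpropagation the note describes, can be sketched in Python. This is a minimal illustration under my own assumptions: the environment interface (`available_actions`, `simulate`, `terminal`, `value`), the `Node` class, and the toy chain MDP are hypothetical and not part of the original text; `ucb1` is the selection score discussed below.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []   # C(current) in the pseudocode
        self.q = 0.0         # cumulative return Q
        self.n = 0           # visit count N

def ucb1(node, c=1.4):
    if node.n == 0:
        return float("inf")  # unvisited children are tried first
    return node.q / node.n + c * math.sqrt(math.log(node.parent.n) / node.n)

def rollout(state, env):
    """Random playout until a terminal state; return its value."""
    while not env.terminal(state):
        a = random.choice(env.available_actions(state))
        state = env.simulate(a, state)
    return env.value(state)

def iterate(root, env):
    # Selection: descend to a leaf by maximising UCB1.
    node = root
    while node.children:
        node = max(node.children, key=ucb1)
    # Expansion: a visited, non-terminal leaf gets one child per action.
    if node.n > 0 and not env.terminal(node.state):
        for a in env.available_actions(node.state):
            node.children.append(Node(env.simulate(a, node.state), node))
        node = node.children[0]
    # Rollout, then backpropagation of the return up to the root.
    value = rollout(node.state, env)
    while node is not None:
        node.n += 1
        node.q += value
        node = node.parent

# Toy deterministic chain MDP: move right (+1) or left (-1);
# terminal at |s| >= 3, value = final position.
class Chain:
    def available_actions(self, s): return [+1, -1]
    def simulate(self, a, s): return s + a
    def terminal(self, s): return abs(s) >= 3
    def value(self, s): return float(s)

random.seed(1)
env, root = Chain(), Node(0)
for _ in range(200):
    iterate(root, env)
best = max(root.children, key=lambda c: c.q / c.n)  # highest mean value
```

On this toy chain, moving right leads toward the +3 terminal state, so after enough iterations the right child's mean value should dominate; the final `max` over mean values is the action-selection rule from the intuition section.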
UCB1
\[\text{UCB1}(s_t) = \frac{Q(s_t)}{N(s_t)} + C\sqrt{\frac{\ln N(s_{t-1})}{N(s_t)}}\]
$Q(s_t)$: cumulative return
$N(s_t)$: the number of visits ($s_{t-1}$ is the parent of $s_t$)
The first term is the empirical mean (exploitation).
The second term encourages exploration of less-visited nodes and shrinks as $N(s_t)$ grows.
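A direct transcription of the formula, with illustrative numbers of my own (not from the text), shows the two terms trading off:

```python
import math

def ucb1(q, n, n_parent, c=1.4):
    """UCB1 score: exploitation mean + exploration bonus.
    q: cumulative return, n: child visits, n_parent: parent visits."""
    return q / n + c * math.sqrt(math.log(n_parent) / n)

# Two children of a parent visited 10 times: one well-explored and
# high-value, one barely explored.
a = ucb1(q=18.0, n=9, n_parent=10)  # mean 2.0 + small bonus
b = ucb1(q=1.0, n=1, n_parent=10)   # mean 1.0 + large bonus
# Here b > a: the barely-explored child is selected despite its lower mean.
```

As the barely-explored child accumulates visits, its bonus shrinks and selection shifts back toward the higher-mean child.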
Example
Step 0: Initialization

The diagrams below show the tree after each phase. Each node lists its value statistic Q and visit count N, two actions (0 and 1) are available in each state, and dashed edges labelled $\pi(a_t | s_t)$ denote random rollouts down to a terminal state s_ter.

```mermaid
graph TD
    t0_s0["s0<br/>Q=0<br/>N=0"]
```
Step 1: Expand

```mermaid
graph TD
    t1_s0["s0<br/>Q=0<br/>N=0"]
    t1_s0 -->|a1 = 0| t1_s1L["s1<br/>Q=0<br/>N=0"]
    t1_s0 -->|a1 = 1| t1_s1R["s1<br/>Q=0<br/>N=0"]
```
Step 2: Rollout & Backpropagation

```mermaid
graph TD
    t2_s0["s0<br/>Q=20<br/>N=1"]
    t2_s0 -->|a1 = 0| t2_s1L["s1<br/>Q=20<br/>N=1"]
    t2_s0 -->|a1 = 1| t2_s1R["s1<br/>Q=0<br/>N=0"]
    t2_s1L -.->|"π(a_t | s_t)"| t2_terL["s_ter"]
```
Step 3: Rollout & Backpropagation

```mermaid
graph TD
    t2_s0["s0<br/>Q=15<br/>N=2"]
    t2_s0 -->|a1 = 0| t2_s1L["s1<br/>Q=20<br/>N=1"]
    t2_s0 -->|a1 = 1| t2_s1R["s1<br/>Q=10<br/>N=1"]
    t2_s1R -.->|"π(a_t | s_t)"| t2_terR["s_ter"]
```
Step 4: Expand

```mermaid
graph TD
    t3_s0["s0<br/>Q=15<br/>N=2"]
    t3_s0 -->|a1 = 0| t3_s1L["s1<br/>Q=20<br/>N=1"]
    t3_s0 -->|a1 = 1| t3_s1R["s1<br/>Q=10<br/>N=1"]
    t3_s1L -->|a2 = 0| t3_s2LL["s2<br/>Q=0<br/>N=0"]
    t3_s1L -->|a2 = 1| t3_s2LR["s2<br/>Q=0<br/>N=0"]
```
Step 5: Rollout & Backpropagation

```mermaid
graph TD
    t4_s0["s0<br/>Q=10<br/>N=3"]
    t4_s0 -->|a1 = 0| t4_s1L["s1<br/>Q=10<br/>N=2"]
    t4_s0 -->|a1 = 1| t4_s1R["s1<br/>Q=10<br/>N=1"]
    t4_s1L -->|a2 = 0| t4_s2LL["s2<br/>Q=0<br/>N=1"]
    t4_s1L -->|a2 = 1| t4_s2LR["s2<br/>Q=0<br/>N=0"]
    t4_s2LL -.->|"π(a_t | s_t)"| t4_terL["s_ter"]
```
Step 6: Expand

```mermaid
graph TD
    t5_s0["s0<br/>Q=10<br/>N=3"]
    t5_s0 -->|a1 = 0| t5_s1L["s1<br/>Q=10<br/>N=2"]
    t5_s0 -->|a1 = 1| t5_s1R["s1<br/>Q=10<br/>N=1"]
    t5_s1L -->|a2 = 0| t5_s2LL["s2<br/>Q=0<br/>N=1"]
    t5_s1L -->|a2 = 1| t5_s2LR["s2<br/>Q=0<br/>N=0"]
    t5_s1R -->|a2 = 0| t5_s2RL["s2<br/>Q=0<br/>N=0"]
    t5_s1R -->|a2 = 1| t5_s2RR["s2<br/>Q=0<br/>N=0"]
```
Step 7: Rollout & Backpropagation

```mermaid
graph TD
    t6_s0["s0<br/>Q=11<br/>N=4"]
    t6_s0 -->|a1 = 0| t6_s1L["s1<br/>Q=10<br/>N=2"]
    t6_s0 -->|a1 = 1| t6_s1R["s1<br/>Q=12<br/>N=2"]
    t6_s1L -->|a2 = 0| t6_s2LL["s2<br/>Q=0<br/>N=1"]
    t6_s1L -->|a2 = 1| t6_s2LR["s2<br/>Q=0<br/>N=0"]
    t6_s1R -->|a2 = 0| t6_s2RL["s2<br/>Q=14<br/>N=1"]
    t6_s1R -->|a2 = 1| t6_s2RR["s2<br/>Q=0<br/>N=0"]
    t6_s2RL -.->|"π(a_t | s_t)"| t6_terL["s_ter"]
```
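The numbers at the root can be checked by hand: across Steps 2, 3, 5 and 7 the rollouts return 20, 10, 0 and 14, and the Q displayed at s0 appears to track the running mean of those returns (i.e. Q/N in the UCB1 notation). A few lines of Python confirm the arithmetic:

```python
# Rollout returns observed at the root across the four backpropagations.
returns = [20, 10, 0, 14]
means = []
total = 0
for i, r in enumerate(returns, start=1):
    total += r              # cumulative return Q
    means.append(total / i) # running mean Q/N at visit count N = i
# means == [20.0, 15.0, 10.0, 11.0], matching s0 across the steps
```

The same check works at the inner nodes, e.g. s1 on the right: returns 10 and 14 give a mean of 12 at N=2, as shown in Step 7.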