Jaehyun Jeong

RLPD: Reinforcement Learning with Prior Data

2026-04-02T00:00:00+09:00

Can we simply apply existing off-policy methods to leverage offline data when learning online, without offline RL pre-training or explicit imitation terms that privilege the prior offline data? The authors set out to answer this question, and doing so required addressing three main problems.

Expensive expert data from real robots
Sparsity of reward signal in robotics
Poor sample efficiency with offline data

Earlier methods address these problems with offline pre-training or explicit imitation terms. However, they are not sample-efficient and depend on a reliable data source, which is very expensive to obtain. On top of that, these algorithms are very sensitive to OOD (out-of-distribution) data due to their learning dynamics.

With these challenges in mind, RLPD aims to be robust to dataset quality while remaining sample-efficient. More precisely, RLPD can train on expert data, sub-optimal data, and even off-policy data. The authors build on SAC (Soft Actor-Critic) and propose three design choices to mitigate these problems: “Symmetric Sampling”, “Layer Normalization”, and “Random Ensemble Distillation”.

Symmetric Sampling

The overall idea of symmetric sampling is extremely simple: each batch is constructed with 50% of its samples from the online replay buffer and 50% from the offline data buffer. In spite of this simplicity, it handles the OOD problem in a stable manner and relaxes the requirement on data quality, so sub-optimal trajectories can be used.

Offline data: Expert Demonstration (small) + Sub-optimal trajectories (large) collected by sub-optimal policies.
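The 50/50 batch construction can be sketched in a few lines of plain Python. This is only a minimal illustration under my own assumptions: the function name `symmetric_sample`, the toy buffers, and sampling with replacement are not taken from the paper's code.

```python
import random


def symmetric_sample(online_buffer, offline_buffer, batch_size, rng):
    """Build one batch with half online and half offline transitions."""
    half = batch_size // 2
    # Sample with replacement so a small online buffer early in training still works.
    online_part = rng.choices(online_buffer, k=half)
    offline_part = rng.choices(offline_buffer, k=batch_size - half)
    return online_part + offline_part


rng = random.Random(0)
# Hypothetical toy buffers: each entry stands for one (s, a, r, s') transition.
online_buffer = [("online", i) for i in range(100)]
offline_buffer = [("offline", i) for i in range(10_000)]
batch = symmetric_sample(online_buffer, offline_buffer, batch_size=256, rng=rng)
num_online = sum(1 for source, _ in batch if source == "online")
```

Note that the ratio stays at 50% no matter how different the two buffer sizes are, which is the whole point of the method.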

Layer Normalization

Batch normalization normalizes each feature across the samples in a batch. Layer normalization instead normalizes each sample across the outputs of a layer. In other words, layer normalization computes the mean and standard deviation from a single sample’s activations across the layer’s outputs, rather than across the batch.

With this method, RLPD mitigates the Q-value overestimation problem for OOD observations. This is because layer normalization bounds the Q-value by the norm of the final-layer weights, as shown below.
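To make the difference in normalization axes concrete, here is a small pure-Python sketch. It is a simplified illustration: the learned scale and shift parameters of real layer/batch normalization are omitted, and the helper names are my own.

```python
import statistics


def normalize(values, eps=1e-5):
    """Shift to zero mean and scale to (approximately) unit variance."""
    mean = statistics.fmean(values)
    var = statistics.pvariance(values, mu=mean)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


def layer_norm(batch):
    # Statistics are computed per sample, across that sample's features (rows).
    return [normalize(row) for row in batch]


def batch_norm(batch):
    # Statistics are computed per feature, across the samples in the batch (columns).
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]


# Toy activations: 3 samples (rows) with 4 features (columns).
activations = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0, 30.0, 40.0],
    [-5.0, 0.0, 5.0, 10.0],
]
ln = layer_norm(activations)
bn = batch_norm(activations)
```

In this toy input the second sample is ten times the first, yet both map to almost the same vector under `layer_norm`. This insensitivity to the scale of a single sample is what keeps activations bounded for unusual inputs.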

$Q^*(s, a) = \sum_{s', r} p(s', r | s, a) \left[ r + \gamma \max_{a'} Q^*(s', a') \right]$
The max in the Bellman optimality equation can trigger Q-value overestimation

$\begin{aligned} \|Q_{\theta,w}(s, a)\| &= \|w^T \mathrm{relu}(\psi_\theta(s, a))\| \\ &\le \|w\| \|\mathrm{relu}(\psi_\theta(s, a))\| \le \|w\| \|\psi_\theta(s, a)\| \\ &\le \|w\| \quad (\because \text{LayerNorm}) \end{aligned}$
Layer normalization constrains Q-values within the weight norm

Random Ensemble Distillation + High UTD(update-to-data) ratio

UTD is the number of gradient updates per environment step. With a high UTD ratio, the algorithm reuses the collected data more, which means more sample-efficient learning. However, other studies have shown that it can lead to statistical overfitting (Li et al., 2022) due to repeated updates on the same samples. To ameliorate this, the authors use random ensemble distillation.

Random ensemble distillation addresses this in a way similar to DDQN and TD3, by maintaining multiple value functions. It maintains an ensemble of $E$ Q-models, randomly selects 2 of them for the update step, and averages all $E$ Q-models when updating the policy to estimate the true Q-value.
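As a rough sketch of how the ensemble is used, the snippet below computes a bootstrapped target from a random pair of critics and a policy value from the full ensemble. I am assuming that the pair is combined with a minimum (as in clipped double Q-learning), and I leave out the entropy term of SAC for brevity. The function names and numbers are illustrative only.

```python
import random


def td_target(reward, discount, next_q_values, rng, subset_size=2):
    """Bootstrapped target using the minimum over a random subset of the ensemble."""
    subset = rng.sample(next_q_values, subset_size)
    return reward + discount * min(subset)


def policy_value(q_values):
    """Value used for the policy update: the mean over all ensemble members."""
    return sum(q_values) / len(q_values)


rng = random.Random(0)
# Hypothetical outputs of E = 10 target critics for one next state-action pair.
next_q_values = [1.0 + 0.1 * i for i in range(10)]
target = td_target(reward=0.5, discount=0.99, next_q_values=next_q_values, rng=rng)
mean_q = policy_value(next_q_values)
```

Taking the minimum over a random pair keeps the target pessimistic, while the mean over all $E$ members gives the policy a lower-variance estimate.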

RLPD(Reinforcement Learning with Prior Data) Algorithm

Green lines refer to methods shared by all tasks, and purple lines are task-specific methods. The purple lines are optional and can be applied depending on the task.

As you can see in the pseudocode, the algorithm is a combination of SAC, TD3, and the features I introduced above. This incorporates the clipped double Q-learning from TD3 and entropy maximization from SAC.

Experiments

In the experiments, the authors tried to answer the following questions.

Is RLPD competitive with prior work despite using neither pre-training nor explicit constraints?
Does RLPD transfer to pixel-based environments?
Does LayerNorm mitigate value divergence?

Let’s see the detailed results and the analysis.

Is RLPD competitive with prior work without pre-training or explicit constraints?

SACfD initializes the online replay buffer with the offline data

RLPD achieves 2.5$\times$ the performance on the sparse Adroit ‘Door’ task.

Does RLPD transfer to pixels?

V-D4RL (Lu et al., 2022), an offline dataset with only pixel observations.

To evaluate the performance in pixel-based environments, they applied RLPD to V-D4RL (DeepMind Control Suite tasks with visual observations only). In these environments, the authors showed that RLPD provides consistent improvements over online approaches and greatly improves over a BC baseline as well.

Also, they demonstrate a remarkable improvement in performance with the offline dataset and a high UTD (update-to-data) ratio. It is worth noting that UTD=10 means 10 gradient updates per environment step.

Does LayerNorm mitigate value divergence?

In the Adroit domain, LayerNorm plays a crucial role in achieving strong performance. Removing LayerNorm increases variance and reduces mean performance. In addition, in the AntMaze and Humanoid Walk environments, LayerNorm curbs excessive extrapolation.

References

Ball, P. J., Smith, L., Kostrikov, I., & Levine, S. (2023). Efficient online reinforcement learning with offline data. In International Conference on Machine Learning (pp. 1577-1594). PMLR.

Decoding RECAP: A Theoretical Look at $π^{*}_{0.6}$’s Reinforcement Learning Approach

2026-03-29T00:00:00+09:00

In this post, I want to explore RECAP (RL with Experience and Corrections via Advantage-conditioned Policies), which combines advantage estimation with imitation learning, much like an actor-critic method in RL. In RECAP, the advantages of actions are computed with a value network, and this information is fed into the VLM backbone as an improvement indicator. I believe that’s the overall concept of this method. Simple as it is, this idea addresses the fundamental problem of combining RL with flow matching.

To begin with, I will explain why we need RL in the $\pi^{*}_{0.6}$ model and why combining RL with it was challenging. After that, I will go through the details of this method using equations.

Why RL?

In the field of Physical Intelligence, pretrained models have shown performance improvements on a number of tasks, such as folding laundry and assembling a box. Even so, the pretrain + fine-tune strategy is highly sensitive to the environment setting and has a performance ceiling. In addition, when the robot encounters unseen observations, it is susceptible to distribution shift due to the lack of data. Applying RL can overcome these problems through human intervention and the robot’s own experience, because an RL method can collect data with the existing policy and learn from both that data and human interventions.

Difficulties in RL + flow matching approach

To understand why flow matching is hard to combine with RL, we need to look at how the most popular RL methods work.

Probability distribution

PPO and SAC are among the most common methods in RL, and below are the key equations of these two algorithms. As you can verify, they need $ \pi_{\theta}(a_t \mid s_t) $, which is the action distribution given the state $s_t$.

$L^{CPI}(\theta) = \hat{\mathbb{E}}_{t} \left[ \frac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\theta_{\mathrm{old}}}(a_{t} \mid s_{t})} \hat{A}_{t} \right] = \hat{\mathbb{E}}_{t} \left[ r_{t}(\theta) \hat{A}_{t} \right]$
PPO’s surrogate objective
$J_{\pi}(\phi)=\mathbb{E}_{s_{t}\sim\mathcal{D}}[\mathbb{E}_{a_{t}\sim\pi_{\phi}}[\alpha \log(\pi_{\phi}(a_{t}|s_{t}))-Q_{\theta}(s_{t},a_{t})]]$
SAC’s objective function

As you already know, a flow matching policy generates continuous actions by integrating a learned vector field, so computing the exact action log-probability is not efficient. This is the primary reason why naively combining RL with flow matching does not work.

Theoretical details of RECAP

However, the authors proposed a straightforward approach to avoid this problem. They train a value network and use it to label actions as “positive” or “negative”. To be more precise, the value network predicts the value, which is the expected cumulative sum of rewards. From it, the advantage (how good an action is compared to the expected value) is calculated. Finally, the top 30% of advantages are categorized as positive and the bottom 30% as negative, and this signal is passed to the VLM backbone.

To train this value function, they used the reward function below. Note that success and failure are decided by a human. $C_{\text{fail}}$ is chosen large enough that the value of a failed episode can be distinguished from that of a successful one, and the agent receives a reward of -1 at every other step to encourage reaching the goal as quickly as possible.

\[r_t = \begin{cases} 0 & \text{if } t = T \text{ and success} \\ -C_{\text{fail}} & \text{if } t = T \text{ and failure} \\ -1 & \text{otherwise.} \end{cases}\]
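The reward above is easy to turn into code. The sketch below also computes the return-to-go for each step, which is what the value function has to predict. I assume undiscounted returns and time steps $t = 0, \dots, T$; the value `c_fail = 100` is arbitrary and only for illustration.

```python
def sparse_reward(t, horizon, success, c_fail):
    """Reward at step t of an episode that ends at step `horizon` (T in the equation)."""
    if t == horizon:
        return 0.0 if success else -c_fail
    return -1.0


def returns_to_go(horizon, success, c_fail):
    """Undiscounted sum of rewards from each step t = 0..T to the end of the episode."""
    rewards = [sparse_reward(t, horizon, success, c_fail) for t in range(horizon + 1)]
    returns, running = [], 0.0
    for r in reversed(rewards):
        running += r
        returns.append(running)
    return returns[::-1]


# c_fail = 100 is an arbitrary illustrative value, not taken from the paper.
success_returns = returns_to_go(horizon=5, success=True, c_fail=100.0)
failure_returns = returns_to_go(horizon=5, success=False, c_fail=100.0)
```

For a successful episode the return is simply minus the number of remaining steps, so the value can be read as "how many steps until success". A failed episode is shifted down by `c_fail` at every step.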

For the value function, they used a distributional value function, which is slightly different from a standard value function that returns a single real number. They divided the value range into $B = 201$ bins and trained the value function like a classifier over bins; the real-valued estimate is then the expected value over the bins. The detailed equations are as follows.

\[\begin{flalign} & \min_{\phi} \mathbb{E}_{\tau \in \mathcal{D}} \left[ \sum_{\mathbf{o}_{t} \in \tau} H(R_{t}^{B}(\tau), p_{\phi}(V \mid \mathbf{o}_{t}, \ell)) \right] \\ & V(\mathbf{o}_{t}, \ell) = \sum_{b=1}^{B} p_{\phi}(V=b \mid \mathbf{o}_{t}, \ell) \cdot v(b) \\ & v(b) = V_{\min} + (b - 1) \frac{V_{\max} - V_{\min}}{B - 1} \quad \text{for } b \in \{1, 2, \dots, B\} \end{flalign}\]
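The three equations can be mirrored almost line by line in Python. This is a minimal sketch: the value range $[-1, 0]$, the nearest-bin discretization used for $R_t^B$, and all function names are my own assumptions for illustration.

```python
import math


def bin_value(b, v_min, v_max, num_bins):
    """v(b): value represented by bin b, for b in 1..num_bins."""
    return v_min + (b - 1) * (v_max - v_min) / (num_bins - 1)


def expected_value(probs, v_min, v_max):
    """V = sum_b p(V = b) * v(b), with probs[0] holding the probability of bin 1."""
    num_bins = len(probs)
    return sum(
        p * bin_value(b, v_min, v_max, num_bins)
        for b, p in enumerate(probs, start=1)
    )


def nearest_bin(ret, v_min, v_max, num_bins):
    """Discretize a return into the closest bin index (one possible choice of R_t^B)."""
    clipped = min(max(ret, v_min), v_max)
    return 1 + round((clipped - v_min) * (num_bins - 1) / (v_max - v_min))


def cross_entropy(target_bin, probs):
    """H(one-hot(target_bin), probs) = -log p(target_bin)."""
    return -math.log(probs[target_bin - 1])


B, V_MIN, V_MAX = 201, -1.0, 0.0  # value range chosen for illustration only
uniform = [1.0 / B] * B
```

With a uniform prediction, `expected_value(uniform, V_MIN, V_MAX)` is the midpoint of the value range, and the loss for any target bin is $\log B$. Training pushes probability mass toward the bin of the observed return.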

Once the value function is trained, advantages are computed and the top 30% are labeled as “positive”.

\[A^{\pi}(\mathbf{o}_t, \mathbf{a}_t) = \mathbb{E}_{\rho_{\pi}(\tau)} \left[ \sum_{t'=t}^{t+N-1} r_{t'} + V^{\pi}(\mathbf{o}_{t+N}) \right] - V^{\pi}(\mathbf{o}_t)\]
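For a single recorded trajectory, the expectation in the advantage can be approximated by the observed rewards. The sketch below does exactly that. Truncating the lookahead at the end of the episode and the toy numbers are my own assumptions.

```python
def n_step_advantage(rewards, values, t, n):
    """A(o_t, a_t) ~= sum_{t'=t}^{t+n-1} r_{t'} + V(o_{t+n}) - V(o_t) for one trajectory.

    `values` has one more entry than `rewards`; its last entry is the value after
    the final step (0 for a terminal state).
    """
    end = min(t + n, len(rewards))  # truncate the lookahead at the episode end
    return sum(rewards[t:end]) + values[end] - values[t]


# Toy trajectory with the -1 per-step reward and hypothetical value predictions.
rewards = [-1.0, -1.0, -1.0, 0.0]
values = [-3.5, -2.0, -1.0, 0.0, 0.0]
advantages = [n_step_advantage(rewards, values, t, n=2) for t in range(len(rewards))]
```

Here the value network predicted $-3.5$ at the first step, but the episode actually needed only 3 more steps, so the first action gets a positive advantage of $0.5$.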

By labeling actions as positive or negative, the model learns to distinguish good actions from bad ones without computing explicit action log-probabilities. In conclusion, RECAP elegantly circumvents the core incompatibility between flow matching and standard RL objectives.

Real application of RECAP

\begin{algorithm}
\caption{RL with Experience and Corrections via Advantage-conditioned Policies (RECAP)}
\begin{algorithmic}
\REQUIRE multi-task demonstration dataset $\mathcal{D}_{\text{demo}}$
\STATE Train $V_{\text{pre}}$ on $\mathcal{D}_{\text{demo}}$ using Eq. 1
\STATE Train $\pi_{\text{pre}}$ on $\mathcal{D}_{\text{demo}}$ using Eq. 3 and $V_{\text{pre}}$
\STATE Initialize $\mathcal{D}_\ell$ with demonstrations for $\ell$
\STATE Train $V_\ell^0$ from $V_{\text{pre}}$ on $\mathcal{D}_\ell$ using Eq. 1
\STATE Train $\pi_\ell^0$ from $\pi_{\text{pre}}$ on $\mathcal{D}_\ell$ using Eq. 3 and $V_\ell^0$
\FOR{$k = 1$ to $K$}
  \STATE Collect data with $\pi_\ell^{k-1}$, add it to $\mathcal{D}_\ell$
  \STATE Train $V_\ell^k$ from $V_{\text{pre}}$ on $\mathcal{D}_\ell$ using Eq. 1
  \STATE Train $\pi_\ell^k$ from $\pi_{\text{pre}}$ on $\mathcal{D}_\ell$ using Eq. 3 and $V_\ell^k$
\ENDFOR
\end{algorithmic}
\end{algorithm}

The algorithm starts by training the value network on the pretraining data and then training the policy. After that, it fine-tunes both the value function and the policy for each task. Then, in the for loop over $k$, more data is collected with the current policy: the robot acts with its policy, but a human intervenes and corrects its actions when they appear unsafe or clearly wrong.

In line 7 of the algorithm, human interventions are always labeled as positive, under the assumption that actions provided by a human are correct. The other actions, generated by the policy, are classified with the advantage function: if an action is in the top 30%, it is positive; otherwise, it is negative.
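The labeling rule described here can be sketched as follows. I assume the threshold is taken over the advantages in the given batch and that ties at the threshold count as positive. The function name and the numbers are hypothetical.

```python
def label_actions(advantages, is_human_correction, positive_fraction=0.3):
    """Return True ("positive") or False ("negative") for each action."""
    ranked = sorted(advantages, reverse=True)
    # Advantage of the last sample that still falls in the top fraction.
    cutoff_index = max(round(len(ranked) * positive_fraction) - 1, 0)
    threshold = ranked[cutoff_index]
    return [
        corrected or adv >= threshold
        for adv, corrected in zip(advantages, is_human_correction)
    ]


advantages = [0.9, -0.2, 0.1, 0.5, -0.7, 0.3, -0.1, 0.0, 0.2, -0.4]
is_human_correction = [False] * 9 + [True]  # the last action came from a human
labels = label_actions(advantages, is_human_correction)
```

The last action has a negative advantage but is still labeled positive, because it was provided by a human.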

Conclusion

At first glance, I thought that this is not really RL, since it pretrains the policy with imitation learning and, even at the last stage, data collection involves human intervention. On top of that, it is not a reward-maximizing algorithm; the reward function is only used for training the value function. I believe this shows how hard it is to reach a goal without human guidance. With a full RL method like PPO or SAC, a humanoid walking environment can be solved with motion that is far from a natural human gait. In robotics, tasks like pick-and-place and folding laundry are very difficult compared to walking, and they give very sparse rewards, since the reward is only provided when the task is done. As a result, researchers have built sophisticated imitation learning methods inspired by RL, and a fully automated data collection and training loop remains a future challenge for general-purpose robotic systems.

References

Physical Intelligence et al., “$\pi^{*}_{0.6}$: a VLA That Learns From Experience,” arXiv:2511.14759, 2025. https://arxiv.org/abs/2511.14759
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O., “Proximal Policy Optimization Algorithms,” arXiv:1707.06347, 2017. https://arxiv.org/abs/1707.06347
Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., and Levine, S., “Soft Actor-Critic Algorithms and Applications,” arXiv:1812.05905, 2018. https://arxiv.org/abs/1812.05905

Cantor’s Diagonal Argument: Not All Infinities Are Equal

2026-01-13T00:00:00+09:00

One of the biggest surprises that I encountered while majoring in applied mathematics was the statement that “the cardinal numbers of $\mathbb{N}$ (the set of natural numbers) and $\mathbb{Z}$ (the set of integers) are equal”. The cardinal number of a set is defined as the size of the set. For finite sets, if a set $A$ is empty then the cardinal number of $A$ is 0, and if the set $A$ has $k$ elements then the cardinal number of $A$ is $k$. For infinite sets $A$ and $B$, however, their cardinal numbers are equal if and only if there exists a one-to-one correspondence (a bijection) from $A$ to $B$. From this definition, we can deduce that the aforementioned statement is true. The really interesting part of this theorem is the mismatch between our intuition and a rigorous mathematical concept, since our brains tend to believe, intuitively, that the size of $\mathbb{Z}$ should be twice the size of $\mathbb{N}$, plus one.

It is also worth noting that “the cardinal numbers of $\mathbb{N}$ and $\mathbb{R}$ are different”. This also blew my mind, and it shows that countable and uncountable sets really are different in size. In this post, I want to prove that the cardinal numbers of $\mathbb{N}$ and $\mathbb{R}$ are different by proving the theorems below.
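One concrete bijection between $\mathbb{N}$ and $\mathbb{Z}$ sends even numbers to the positive integers and odd numbers to zero and the negative integers. The sketch below assumes $\mathbb{N}$ starts at 1; the function names are my own.

```python
def nat_to_int(n):
    """Map 1, 2, 3, 4, 5, ... to 0, 1, -1, 2, -2, ..."""
    return n // 2 if n % 2 == 0 else -(n // 2)


def int_to_nat(z):
    """Inverse of nat_to_int."""
    return 2 * z if z > 0 else -2 * z + 1


first_values = [nat_to_int(n) for n in range(1, 8)]
```

Since `int_to_nat` undoes `nat_to_int` and vice versa, every integer is hit exactly once, which is all the definition of equal cardinality asks for.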

An open interval (0, 1) is not denumerable.
The sets (0, 1) and $\mathbb{R}$ are equipotent. (have the same size)

Def) Denumerable set

A set $S$ is denumerable if and only if there exists a bijective function from $S$ to $\mathbb{N}$.

NOTE: in this case, we write $S \sim \mathbb{N}$, and we say the sets $S$ and $\mathbb{N}$ are equipotent.

Th 1) The open unit interval (0, 1) of real numbers is nondenumerable

$ \underline{\text{proof}} $

$ \forall x \in (0, 1) \quad \exists x_1, x_2, x_3, \dots \in \{0, 1, \dots, 9\} \quad \text{s.t.} \quad x = 0.x_1 x_2 x_3 \dots $

$ (\text{For example, } x = \frac{1}{3} = 0.333\dots \implies x_1 = 3 \land x_2 = 3 \land \dots) $

$ \text{we will treat repeating zeros } (\text{such as } \frac{1}{4} = 0.25000\dots) \text{ by decreasing the last non-zero digit by 1 and replacing the zeros with repeating 9s } $

$ (\text{as } \frac{1}{4} = 0.24999\dots) $

$ \text{Under this agreement, assume that } (0, 1) \text{ is denumerable so that} $

$ \exists \text{ bijective } f: \mathbb{N} \to (0, 1) \quad \text{s.t.} $

$ f(1) = 0.x_{11} x_{12} x_{13} \dots $

$ f(2) = 0.x_{21} x_{22} x_{23} \dots $

$ f(3) = 0.x_{31} x_{32} x_{33} \dots $

$ \vdots $

$ f(k) = 0.x_{k1} x_{k2} x_{k3} \dots $

$ \text{and let } z \in (0, 1) \text{ be defined as follows:} $

\[z = 0.z_1 z_2 z_3 \dots \quad \text{s.t.} \quad \forall k \in \mathbb{N}, \quad \begin{cases} z_k = 1 & (x_{kk} \neq 1) \\ z_k = 2 & (x_{kk} = 1) \end{cases}\]

$ \text{then } \forall n \in \mathbb{N}, \quad f(n) \neq z \quad (\because \forall n \in \mathbb{N}, \quad x_{nn} \neq z_n \implies f(n) \neq z) $

$ \therefore \text{This contradicts our assumption} \quad \blacksquare $
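The construction of $z$ can be tried out on a finite prefix of a hypothetical listing. This is only an illustration of the digit rule, since the actual proof works on infinitely many digits. The rows below are made-up examples.

```python
def diagonal_number(digit_rows):
    """Build the digits of z from the diagonal of a (finite) list of expansions."""
    return [2 if row[k] == 1 else 1 for k, row in enumerate(digit_rows)]


# First 4 digits of 4 hypothetical listed numbers f(1)..f(4).
rows = [
    [3, 3, 3, 3],  # 0.3333...
    [2, 4, 9, 9],  # 0.2499...
    [1, 1, 1, 1],  # 0.1111...
    [5, 0, 7, 1],  # 0.5071...
]
z_digits = diagonal_number(rows)
```

By construction, the $k$-th digit of $z$ differs from the $k$-th digit of the $k$-th row, so $z$ cannot be equal to any row of the list.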

Th 2) Open intervals (0, 1) and (-1, 1) are equipotent.

$ 1) \ (0,1) \sim (-1,1) $

$ \underline{\text{proof}} $

$ \text{The function } f: (0,1) \to (-1,1) \text{ given by } f(x) = 2x - 1 \text{ is one-to-one correspondence.} $

$ \because $

$ \forall x_1, x_2 \in (0,1) \quad \text{s.t.} \quad x_1 \neq x_2 $

$ f(x_1) = 2x_1 - 1 \neq 2x_2 - 1 = f(x_2) \ (\because 2x_1 - 1 = 2x_2 - 1 \iff 2x_1 = 2x_2 \iff x_1 = x_2) $

$ \therefore f \text{ is injective} $

$ \forall y \in (-1,1) \ \exists x \in (0,1) \quad \text{s.t.} \quad $

$ y = 2x - 1 \ (\because -1 \lt y \lt 1 \Rightarrow 0 \lt y+1 \lt 2 \Rightarrow 0 \lt \frac{y+1}{2} \lt 1) $

$ \therefore f \text{ is surjective} \quad \blacksquare $

Th 3) The open intervals (-1, 1) and $\mathbb{R}$ are equipotent.

$ 2) \ (-1,1) \sim \mathbb{R} $

$ \underline{\text{proof}} $

$ \text{The function } g: (-1,1) \to \mathbb{R} \text{ given by } g(x) = \tan(\frac{\pi}{2}x) \text{ is one-to-one correspondence} $

$ \because $

$ \forall x_1, x_2 \in (-1,1) \quad \text{s.t.} \quad x_1 \neq x_2 $

$ g(x_1) \neq g(x_2) \ (\because \tan \text{ is injective on } (-\frac{\pi}{2}, \frac{\pi}{2}) \text{ and } \frac{\pi}{2}x_1 \neq \frac{\pi}{2}x_2) $

$ \therefore g \text{ is injective} $

$ \forall y \in \mathbb{R} \ \exists x \in (-1,1) \quad \text{s.t.} \quad g(x) = y $

$ \left( \because \exists x' \in (-\frac{\pi}{2}, \frac{\pi}{2}) \quad \text{s.t.} \quad \tan(x')=y \text{, and } \frac{\pi}{2}x = x' \Rightarrow x = \frac{2}{\pi}x' \in (-1, 1) \right) $

$ \therefore g \text{ is surjective} \quad \blacksquare $

Conclusion

$ \text{By Th 1, Th 2, and Th 3} $

$ \mathbb{N} \nsim (0,1) \text{ and } (0,1) \sim \mathbb{R} $

$ \therefore \mathbb{N} \nsim \mathbb{R} \quad \blacksquare $
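As a numerical sanity check of Th 2 and Th 3, the composition $h = g \circ f$ maps $(0, 1)$ onto $\mathbb{R}$ and can be inverted with the arctangent. This is a floating-point illustration, not a proof. `h` and `h_inverse` are names I introduce here.

```python
import math


def f(x):
    """(0, 1) -> (-1, 1)"""
    return 2 * x - 1


def g(x):
    """(-1, 1) -> R"""
    return math.tan(math.pi / 2 * x)


def h(x):
    """(0, 1) -> R, the composition g(f(x))."""
    return g(f(x))


def h_inverse(y):
    """R -> (0, 1): undo g with atan, then undo f."""
    return (2 / math.pi * math.atan(y) + 1) / 2


samples = [0.001, 0.25, 0.5, 0.75, 0.999]
round_trip = [h_inverse(h(x)) for x in samples]
```

Every sample comes back to itself after the round trip, and even very large inputs such as `h_inverse(1e6)` stay strictly inside $(0, 1)$.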

Basic Guide to build and run ROS 2 Services (Python & C++)

2025-12-21T00:00:00+09:00

If you don’t know about ROS 2 Topics, go to this page and learn.

Topics are used for data streams (unidirectional), and Services are used for client/server interactions (bidirectional).

First, Services can work in a synchronous or asynchronous manner. If the service is synchronous, the client sends a Request and blocks until it receives a Response. If the service is asynchronous, the client sends a Request, registers a callback function for the Response, and continues its execution. When the server responds, the callback function is triggered.

Furthermore, you define a service by a name and a pair of messages. One message is the Request and the other is the Response.

Finally, only one server can exist for a given service name.

Simple Python code

Server

#!/usr/bin/env python3
import rclpy
from rclpy.node import Node
from example_interfaces.srv import AddTwoInts


class AddTwoIntsServerNode(Node):
    def __init__(self):
        super().__init__("add_two_ints_server")
        self.server_ = self.create_service(
            AddTwoInts,
            "add_two_ints",  # Use a verb for service name
            self.callback_add_two_ints,
        )
        self.get_logger().info("Add Two Ints server has been started")

    def callback_add_two_ints(
        self,
        request: AddTwoInts.Request,
        response: AddTwoInts.Response
    ):
        response.sum = request.a + request.b
        self.get_logger().info(
            str(request.a) + " + " + str(request.b) + " = " + str(response.sum)
        )
        return response


def main(args=None):
    rclpy.init(args=args)
    node = AddTwoIntsServerNode()
    rclpy.spin(node)
    rclpy.shutdown()


if __name__ == "__main__":
    main()

Client

Non-OOP method

#!/usr/bin/env python3
import rclpy
from rclpy.node import Node
from example_interfaces.srv import AddTwoInts


def main(args=None):
    rclpy.init(args=args)
    node = Node("add_two_ints_client_no_oop")

    client = node.create_client(
        AddTwoInts,
        "add_two_ints"
    )
    while not client.wait_for_service(1.0):
        node.get_logger().warn("Waiting for Add Two Ints server...")

    request = AddTwoInts.Request()
    request.a = 3
    request.b = 8

    future = client.call_async(request)  # client.call (sync)
    # Spin until getting the response
    rclpy.spin_until_future_complete(node, future)

    response = future.result()
    node.get_logger().info(
        str(request.a) + " + " + str(request.b) + " = " + str(response.sum)
    )

    rclpy.shutdown()


if __name__ == "__main__":
    main()

OOP method

#!/usr/bin/env python3
import rclpy
from rclpy.node import Node
from example_interfaces.srv import AddTwoInts
from functools import partial


class AddTwoIntsClient(Node):
    def __init__(self):
        super().__init__("add_two_ints_client")
        self.client_ = self.create_client(AddTwoInts, "add_two_ints")

    def call_add_two_ints(self, a, b):
        while not self.client_.wait_for_service(1.0):
            self.get_logger().warn("Waiting for Add Two Ints server...")

        request = AddTwoInts.Request()
        request.a = a
        request.b = b

        future = self.client_.call_async(request)
        # To add another argument, arguments must be wrapped with partial
        future.add_done_callback(partial(
            self.callback_call_add_two_ints, request=request
        ))

    def callback_call_add_two_ints(self, future, request):
        response = future.result()
        self.get_logger().info(
            str(request.a) + " + " + str(request.b) + " = " + str(response.sum)
        )


def main(args=None):
    rclpy.init(args=args)
    node = AddTwoIntsClient()
    node.call_add_two_ints(2, 7)
    node.call_add_two_ints(1, 4)
    node.call_add_two_ints(10, 20)
    rclpy.spin(node)
    rclpy.shutdown()


if __name__ == "__main__":
    main()

NOTE: In the OOP method, “rclpy.spin_until_future_complete(node, future)” is not required since the node is already being spun by “rclpy.spin(node)” in main. Instead, a callback function must be registered with “future.add_done_callback”.
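The `add_done_callback` + `partial` pattern can be tried without a ROS 2 installation. The sketch below uses `concurrent.futures.Future` from the Python standard library as a stand-in, so it is an analogy rather than rclpy code. `on_response` and the tuple-shaped request are hypothetical.

```python
from concurrent.futures import Future
from functools import partial

results = []


def on_response(future, request):
    # `future` is supplied by the Future itself; `request` was bound with partial.
    a, b = request
    results.append(f"{a} + {b} = {future.result()}")


request = (2, 7)
future = Future()
# The callback passed to add_done_callback receives only the future,
# so the extra `request` argument is bound in advance.
future.add_done_callback(partial(on_response, request=request))

# Nothing has run yet; the callback fires once a result is set.
pending_before = len(results)
future.set_result(sum(request))  # stands in for the server's response
```

The caller is free to continue after registering the callback, which is the asynchronous behavior described at the top of this post.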

Simple C++ code

Server

#include "rclcpp/rclcpp.hpp"
#include "example_interfaces/srv/add_two_ints.hpp"

using namespace std::placeholders;


class AddTwoIntsServerNode : public rclcpp::Node{
public:
    AddTwoIntsServerNode() : Node("add_two_ints_server")
    {
        server_ = this->create_service<example_interfaces::srv::AddTwoInts>(
            "add_two_ints",
            std::bind(&AddTwoIntsServerNode::callbackAddTwoInts, this, _1, _2)
        );
        RCLCPP_INFO(this->get_logger(), "Add Two Ints Service has been started");
    }
private:
    void callbackAddTwoInts(
        const example_interfaces::srv::AddTwoInts::Request::SharedPtr request,
        const example_interfaces::srv::AddTwoInts::Response::SharedPtr response)
    {
        response->sum = request->a + request->b;
        RCLCPP_INFO(this->get_logger(), "%d + %d = %d", (int)request->a, (int)request->b, (int)response->sum);
    }

    rclcpp::Service<example_interfaces::srv::AddTwoInts>::SharedPtr server_;
};


int main(int argc, char **argv){
    rclcpp::init(argc, argv);
    auto node = std::make_shared<AddTwoIntsServerNode>();
    rclcpp::spin(node);
    rclcpp::shutdown();
    return 0;
}

Client

Non-OOP method

#include "rclcpp/rclcpp.hpp"
#include "example_interfaces/srv/add_two_ints.hpp"

using namespace std::chrono_literals;


int main(int argc, char **argv){
    rclcpp::init(argc, argv);
    auto node = std::make_shared<rclcpp::Node>("add_two_ints_client_no_oop");

    auto client = node->create_client<example_interfaces::srv::AddTwoInts>("add_two_ints");
    while(!client->wait_for_service(1s)){
        RCLCPP_WARN(node->get_logger(), "Waiting for the server...");
    }

    auto request = std::make_shared<example_interfaces::srv::AddTwoInts::Request>();
    request->a = 6;
    request->b = 2;

    auto future = client->async_send_request(request);
    rclcpp::spin_until_future_complete(node, future);

    auto response = future.get();
    RCLCPP_INFO(node->get_logger(), "%d + %d = %d", (int)request->a, (int)request->b, (int)response->sum);

    rclcpp::shutdown();
    return 0;
}

OOP method

#include "rclcpp/rclcpp.hpp"
#include "example_interfaces/srv/add_two_ints.hpp"

using namespace std::chrono_literals;
using namespace std::placeholders;


class AddTwoIntsClientNode : public rclcpp::Node{
public:
    AddTwoIntsClientNode() : Node("add_two_ints_client")
    {
        client_ = this->create_client<example_interfaces::srv::AddTwoInts>("add_two_ints");
    }

    void callAddTwoInts(int a, int b){
        while(!this->client_->wait_for_service(1s)){
            RCLCPP_WARN(this->get_logger(), "Waiting for the server...");
        }

        auto request = std::make_shared<example_interfaces::srv::AddTwoInts::Request>();
        request->a = a;
        request->b = b;

        client_->async_send_request(
            request,
            std::bind(&AddTwoIntsClientNode::callbackCallAddInts, this, _1)
        );
    }

private:

    void callbackCallAddInts(rclcpp::Client<example_interfaces::srv::AddTwoInts>::SharedFuture future){
        auto response = future.get();
        RCLCPP_INFO(this->get_logger(), "Sum: %d", (int)response->sum);
    }

    rclcpp::Client<example_interfaces::srv::AddTwoInts>::SharedPtr client_;
};


int main(int argc, char **argv){
    rclcpp::init(argc, argv);
    auto node = std::make_shared<AddTwoIntsClientNode>();
    node->callAddTwoInts(10, 5);
    node->callAddTwoInts(10, 15);
    node->callAddTwoInts(12, 7);
    rclcpp::spin(node);
    rclcpp::shutdown();

    return 0;
}

ROS 2 commands for services

ros2 service -h

ros2 service list
# OUT (other services omitted)
# /add_two_ints

ros2 service type /add_two_ints
# OUT
# example_interfaces/srv/AddTwoInts

# put the output into ros2 interface command
ros2 interface show example_interfaces/srv/AddTwoInts
# OUT
# int64 a
# int64 b
# ---
# int64 sum

# Then you can test this server like the below command
ros2 service call /add_two_ints example_interfaces/srv/AddTwoInts "{a: 7, b: 3}"
# OUT
#
# waiting for service to become available...
# requester: making request: example_interfaces.srv.AddTwoInts_Request(a=7, b=3)
#
# response:
# example_interfaces.srv.AddTwoInts_Response(sum=10)

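# The server's service name can be changed at run time with the remap argument below
ros2 run <package_name> <executable_name> --ros-args -r <old_service_name>:=<new_service_name>
# The client can change the service name it calls in the same way
ros2 run <package_name> <executable_name> --ros-args -r <old_service_name>:=<new_service_name>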

Basic Guide to build and run ROS 2 Topics (Python & C++)

2025-12-09T00:00:00+09:00

A Topic is a named channel that carries messages from publishers (nodes) to subscribers (nodes). A publisher sends data to the topic without knowing which subscribers receive it. Similarly, subscribers do not know which nodes send the data to the topic. On top of that, a node is not restricted to a single topic; it can publish to multiple different topics. Finally, the data stream is unidirectional: data can be sent to a subscriber but cannot be returned to the publisher.

Technically, ROS 2 messages are transferred using middleware named DDS. However, users do not need to handle DDS as libraries such as RCL provide abstraction.
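The decoupling between publishers and subscribers can be pictured with a tiny in-process model. `ToyTopic` below is purely hypothetical and is not part of any ROS 2 API. It ignores DDS, queues, and everything else the real middleware does.

```python
class ToyTopic:
    """In-process stand-in for a topic; not part of any ROS 2 API."""

    def __init__(self, name):
        self.name = name
        self._callbacks = []

    def subscribe(self, callback):
        self._callbacks.append(callback)

    def publish(self, message):
        # The publisher only hands the message to the topic; it never sees the subscribers.
        for callback in self._callbacks:
            callback(message)


received_by_phone, received_by_logger = [], []
robot_news = ToyTopic("robot_news")
robot_news.subscribe(received_by_phone.append)
robot_news.subscribe(received_by_logger.append)
robot_news.publish("Hi, this is C3PO from the robot news station.")
```

Both subscribers receive the same message, and the code that calls `publish` never refers to them, which mirrors the anonymity described above.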

Simple Python code

Publisher

#!/usr/bin/env python3
import rclpy
from rclpy.node import Node
from example_interfaces.msg import String


class RobotNewsStationNode(Node):
    def __init__(self):
        super().__init__("robot_news_station")  # Choosing the same node name with file name is quite common.
        self.robot_name = "C3PO"
        self.publisher_ = self.create_publisher(
            String,
            "robot_news",
            10
        )
        # A period of 0.5 s means publishing twice per second
        self.timer_ = self.create_timer(0.5, self.publish_news)
        self.get_logger().info("Robot News Station has been started.")

    def publish_news(self):
        msg = String()
        msg.data = f"Hi, this is {self.robot_name} from the robot news station."
        self.publisher_.publish(msg)


def main(args=None):
    rclpy.init(args=args)
    node = RobotNewsStationNode()
    rclpy.spin(node)
    rclpy.shutdown()


if __name__ == "__main__":
    main()

NOTE: Do not forget to add the “example_interfaces” dependency to the package.xml file for the String message type, and to install the node in setup.py.

Subscriber

#!/usr/bin/env python3
import rclpy
from rclpy.node import Node
from example_interfaces.msg import String


class SmartphoneNode(Node):
    def __init__(self):
        super().__init__("smartphone")
        self.subscriber_ = self.create_subscription(
            String,
            "robot_news",
            # When the subscriber receives the message
            self.callback_robot_news,
            # queue size
            10
        )
        self.get_logger().info("Smartphone has been started.")

    def callback_robot_news(self, msg: String):
        self.get_logger().info(msg.data)


def main(args=None):
    rclpy.init(args=args)
    node = SmartphoneNode()
    rclpy.spin(node)
    rclpy.shutdown()


if __name__ == "__main__":
    main()

Simple C++ code

Publisher

#include "rclcpp/rclcpp.hpp"
#include "example_interfaces/msg/string.hpp"

using namespace std::chrono_literals;

class RobotNewsStationNode : public rclcpp::Node{
public:
    RobotNewsStationNode() : Node("robot_news_station"), robot_name_("R2D2")
    {
        publisher_ = this->create_publisher<example_interfaces::msg::String>("robot_news", 10);
        timer_ = this->create_wall_timer(0.5s, std::bind(&RobotNewsStationNode::publishNews, this));
        RCLCPP_INFO(this->get_logger(), "Robot News Station has been started");
    }

private:
    void publishNews(){
        auto msg = example_interfaces::msg::String();
        msg.data = std::string("Hi, this is ") + robot_name_ + std::string(" from the robot news station.");
        publisher_->publish(msg);
    }

    std::string robot_name_;
    rclcpp::Publisher<example_interfaces::msg::String>::SharedPtr publisher_;
    rclcpp::TimerBase::SharedPtr timer_;
};

int main(int argc, char **argv){
    rclcpp::init(argc, argv);
    auto node = std::make_shared<RobotNewsStationNode>();
    rclcpp::spin(node);
    rclcpp::shutdown();
    return 0;
}

Subscriber

#include "rclcpp/rclcpp.hpp"
#include "example_interfaces/msg/string.hpp"

using namespace std::placeholders;

class SmartphoneNode : public rclcpp::Node{
public:
    SmartphoneNode() : Node("smartphone")
    {
        subscriber_ = this->create_subscription<example_interfaces::msg::String>(
            "robot_news",
            10,
            std::bind(&SmartphoneNode::callbackRobotNews, this, _1)
        );
        RCLCPP_INFO(this->get_logger(), "Smartphone has been started.");
    }

private:
    void callbackRobotNews(const example_interfaces::msg::String::SharedPtr msg){
        RCLCPP_INFO(this->get_logger(), "%s", msg->data.c_str());
    }

    rclcpp::Subscription<example_interfaces::msg::String>::SharedPtr subscriber_;
};

int main(int argc, char **argv){
    rclcpp::init(argc, argv);
    auto node = std::make_shared<SmartphoneNode>();
    rclcpp::spin(node);
    rclcpp::shutdown();
    return 0;
}

Bags

Suppose you are building robot software with ROS 2. Normally you would need the physical robot at hand to develop and test your code. This is where a “Bag” comes in handy: a ROS 2 bag can record the data published on topics for any length of time and then replay that data as many times as you want.

# Help
ros2 bag -h

# Record topics
ros2 bag record <topic_1> <topic_2> ...
# Record topics with custom record name
ros2 bag record -o <bag_name> <topic_1> <topic_2> ...
# Record all topics
ros2 bag record -a

# Play a record
ros2 bag play <bag_name>

# Print record information
ros2 bag info <bag_name>

Basic Commands for ROS 2

2025-12-06T00:00:00+09:00

Run nodes

ros2 run <package_name> <executable_name>

NOTE: The “-h” option shows the available arguments and options, as in the commands below.

ros2 -h
ros2 run -h
ros2 node -h

Checking running nodes

ros2 node list

Check running nodes

ros2 node info <node_name>

WARNING: Running two nodes with identical names is discouraged. Both can run at the same time, but ROS 2 prints a message like the one below.

WARNING: Be aware that there are nodes in the graph that share an exact name, which can have unintended side effects.

Running nodes with the same node name

ros2 run <package_name> <executable_name> --ros-args -r __node:=<new_node_name>

“-r” can be replaced with “--remap”

Building commands

The basic build command is

colcon build

Build all packages

NOTE: The build command should only be executed in the project folder that contains the src folder

The command below builds only the selected package.

colcon build --packages-select <package_name>

Building commands only for Python

colcon build --packages-select <package_name> --symlink-install

The “--symlink-install” option makes the installed package link to the source files instead of copying them. Therefore, rebuilding is unnecessary when a Python file changes.

ROS 2 with GUI

The commands below open the GUI tools for ROS 2.

rqt
rqt_graph  # Shows a graph of nodes and topics

Topics

The command below shows currently running topics.

ros2 topic list

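To see the messages being published on a topic, run the command below.

ros2 topic echo <topic_name>

ros2 topic info <topic_name>

The commands below print frequency and bandwidth of the topic.

ros2 topic hz <topic_name>
ros2 topic bw <topic_name>

The command below publishes messages to a topic directly from the terminal.

ros2 topic pub -r <rate_hz> <topic_name> <message_type> "<data>"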
# Like this one
ros2 topic pub -r 5 /robot_news example_interfaces/msg/String "{data: 'Hello from the terminal'}"

The command below can change the topic name.

ros2 run my_py_pkg robot_news_station --ros-args -r __node:=my_station -r robot_news:=abc

robot_news to abc

NOTE: In the same way, nodes, topic publishers, and topic subscribers can be remapped with the -r option.

Interfaces

The command below returns the interface information.

ros2 interface

ros2 interface show geometry_msgs/msg/Twist

Bags

# Help
ros2 bag -h

# Record topics
ros2 bag record <topic_1> <topic_2> ...
# Record topics with custom record name
ros2 bag record -o <bag_name> <topic_1> <topic_2> ...
# Record all topics
ros2 bag record -a

# Play a record
ros2 bag play <bag_name>

# Print record Information
ros2 bag info <bag_name>

Services

ros2 service call <service_name> <service_type> "<request>"
# Like this one
ros2 service call /add_two_ints example_interfaces/srv/AddTwoInts "{a: 3, b: 7}"

Basic Guide to build and run ROS 2 Nodes (Python & C++)

2025-12-06T00:00:00+09:00

Nodes are subprograms of an application, and each node is responsible for only one thing. Nodes communicate with each other through topics, services, and parameters. Much like classes in OOP, nodes reduce code complexity, and they improve fault tolerance because a crash in one node does not necessarily bring down the others. Nodes can also be written in different programming languages, including Python and C++. In short, each node should have a single purpose and communicate with the other nodes.

In ROS 2, a package is an independent unit of an application. Packages contain nodes, and nodes in different packages can communicate with each other.

flowchart TB

  %% -------------------------
  %% Package 1
  %% -------------------------
  subgraph Pkg1["sensing_pkg"]
    direction TB
    S1["Node: sensor_reader"]
    S2["Node: imu_reader"]
  end

  %% -------------------------
  %% Package 2
  %% -------------------------
  subgraph Pkg2["processing_pkg"]
    direction TB
    P1["Node: data_filter"]
    P2["Node: state_estimator"]
  end

  %% -------------------------
  %% Package 3
  %% -------------------------
  subgraph Pkg3["control_pkg"]
    direction TB
    C1["Node: controller"]
  end

  %% -------------------------
  %% Package 4
  %% -------------------------
  subgraph Pkg4["output_pkg"]
    direction TB
    O1["Node: actuator_driver"]
    O2["Node: logger"]
  end

  %% -------------------------
  %% Node-to-node communications
  %% -------------------------
  S1 -- "/sensor_data" --> P1
  S2 -- "/imu_data" --> P1

  P1 -- "/filtered_data" --> P2
  P2 -- "/state" --> C1

  C1 -- "/control_cmd" --> O1
  C1 -- "/status" --> O2

Write code for nodes. (Python)

#!/usr/bin/env python3
import rclpy
from rclpy.node import Node


class MyNode(Node):
    def __init__(self):
        super().__init__("py_test")  # Create a node
        self.get_logger().info("Hello world")  # Logging with the node
        # Run timer_callback every 1 second.
        self.create_timer(1.0, self.timer_callback)

    def timer_callback(self):
        self.get_logger().info("Hello")


def main(args=None):
    rclpy.init(args=args)

    node = MyNode()
    rclpy.spin(node)  # makes the node keep running

    rclpy.shutdown()

if __name__ == "__main__":
    main()

Install a node to a package.

from setuptools import find_packages, setup

package_name = 'my_py_pkg'

setup(
    name=package_name,
    version='0.0.0',
    packages=find_packages(exclude=['test']),
    data_files=[
        ('share/ament_index/resource_index/packages',
            ['resource/' + package_name]),
        ('share/' + package_name, ['package.xml']),
    ],
    install_requires=['setuptools'],
    zip_safe=True,
    maintainer='jj',
    maintainer_email='jj@todo.todo',
    description='TODO: Package description',
    license='TODO: License declaration',
    extras_require={
        'test': [
            'pytest',
        ],
    },
    # Where I've changed
    entry_points={
        'console_scripts': [
            "py_node = my_py_pkg.my_first_node:main"  # node_name = path_to_py_file:function_name
        ],
    },
)

setup.py

Then run the commands below to build and run the node.

colcon build --packages-select my_py_pkg  # Build with a node
source ./install/setup.bash  # Run setup.bash whenever finished building.
ros2 run my_py_pkg py_node

You should see output similar to the following.

[INFO] [1764994343.482507922] [py_test]: Hello world

NOTE: py_test is the “node name” and py_node is the “executable name”. The node name is defined in the node’s Python code, and the executable name is defined in the setup.py file.
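The entry-point string in setup.py can be read as three parts. The sketch below splits it by hand only to illustrate the naming; `parse_console_script` is a hypothetical helper, and the real parsing is done by setuptools, not by this code.

```python
def parse_console_script(entry: str):
    """Split 'executable = package.module:function' into its three parts."""
    executable, target = (part.strip() for part in entry.split("=", 1))
    module, function = target.split(":", 1)
    return executable, module, function


executable, module, function = parse_console_script(
    "py_node = my_py_pkg.my_first_node:main"
)
print(executable)  # name used with `ros2 run my_py_pkg ...`
print(module)      # Python module that is imported
print(function)    # function that is called
```

The node name ("py_test") appears nowhere in this string, which is why it can differ from the executable name.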

NOTE: Remember to run the commands below every time the node’s code changes.

colcon build --packages-select my_py_pkg
source ./install/setup.bash
ros2 run my_py_pkg py_node

Write code for nodes. (C++)

#include "rclcpp/rclcpp.hpp"

class MyNode : public rclcpp::Node{
public:
    MyNode() : Node("cpp_test"), counter_(0)
    {
        RCLCPP_INFO(this->get_logger(), "Hello world");
        timer_ = this->create_wall_timer(
            std::chrono::seconds(1),
            std::bind(&MyNode::timerCallback, this)
        );
    }
private:
    void timerCallback(){
        RCLCPP_INFO(this->get_logger(), "Hello %d", counter_);
        counter_++;
    }
    rclcpp::TimerBase::SharedPtr timer_;
    int counter_;
};

int main(int argc, char **argv){
    rclcpp::init(argc, argv);

    auto node = std::make_shared<MyNode>();
    rclcpp::spin(node);

    rclcpp::shutdown();
    return 0;
}

cmake_minimum_required(VERSION 3.8)
project(my_cpp_pkg)

if(CMAKE_COMPILER_IS_GNUCXX OR CMAKE_CXX_COMPILER_ID MATCHES "Clang")
  add_compile_options(-Wall -Wextra -Wpedantic)
endif()

# find dependencies
find_package(ament_cmake REQUIRED)
find_package(rclcpp REQUIRED)

# Where I've changed
add_executable(cpp_node src/my_first_node.cpp)
ament_target_dependencies(cpp_node rclcpp)

install(TARGETS
  cpp_node
  DESTINATION lib/${PROJECT_NAME}
)

ament_package()

my_cpp_pkg/CMakeLists.txt

Monte Carlo Tree Search

2025-09-24T00:00:00+09:00

Intro

Monte Carlo Tree Search (MCTS) works well in practice but poses theoretical challenges. In this post, I describe the MCTS algorithm and why it works.

Open-loop planning algorithms like MCTS can plan future actions from an initial state $s_0$. They assume access to a model of the environment, which may be stochastic or deterministic. By contrast, a typical reinforcement learning algorithm is closed-loop: at each time step it selects an action based on the current state $s_t$.

Intuition

MCTS selects the action expected to yield the highest return. However, evaluating every state’s value is usually infeasible; if all values were known, the agent could simply choose the best action. Thus we must estimate values, and rollouts make this possible. A single rollout can be inaccurate, but MCTS mitigates this by repeating rollouts and expanding the tree: as the number of visits increases, the estimates become more accurate.
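The claim that repeated rollouts sharpen the estimate can be illustrated with a small simulation. This is only a sketch under an assumed toy setting: each rollout returns 1 with probability 0.3 and 0 otherwise, so the true value is 0.3. The function name `estimate_value` and the probability are assumptions, not part of MCTS itself.

```python
import random


def estimate_value(num_rollouts: int, rng: random.Random) -> float:
    """Average the returns of num_rollouts simulated rollouts."""
    # Hypothetical rollout: returns 1 with probability 0.3, otherwise 0,
    # so the true value being estimated is 0.3.
    returns = [1.0 if rng.random() < 0.3 else 0.0 for _ in range(num_rollouts)]
    return sum(returns) / num_rollouts


rng = random.Random(0)
for n in (1, 10, 100, 10_000):
    print(n, estimate_value(n, rng))
```

With a single rollout the estimate can only be 0 or 1, while with many rollouts it concentrates near 0.3.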

For deep trees, computation can be expensive: with a fixed branching factor (number of actions), the cost grows exponentially with depth. I believe there are methods that trade accuracy for computation here, and I plan to discuss them after further study.
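To see the exponential growth in numbers, the sketch below counts the nodes of a complete tree with branching factor $b$ and depth $d$, which is $\sum_{k=0}^{d} b^k$. The helper name `full_tree_size` is hypothetical.

```python
def full_tree_size(branching: int, depth: int) -> int:
    """Number of nodes in a complete tree: sum of b**k for k = 0..depth."""
    return sum(branching ** k for k in range(depth + 1))


for depth in (2, 5, 10):
    print(depth, full_tree_size(2, depth))
# prints:
# 2 7
# 5 63
# 10 2047
```

Even with only two actions per state, going from depth 5 to depth 10 multiplies the tree size by more than 30.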

Pseudocode

\begin{algorithm}
\caption{MCTS: Selection–Expansion–Rollout (from $s_0$)}
\begin{algorithmic}
\STATE current $\leftarrow$ $s_0$
\WHILE{not Leaf(current)}
  \STATE current $\leftarrow$ $\arg\max_{s_i \in \mathcal{C}(\text{current})}\; \text{UCB1}(s_i)$
\ENDWHILE
\IF{$N(\text{current}) = 0$}
  \RETURN Rollout(current) \COMMENT{unvisited leaf $\rightarrow$ rollout}
\ELSE
  \FOR{each action $a$ available from current}
    \STATE addNewStateToTree(current, $a$)
  \ENDFOR
  \STATE current $\leftarrow$ firstNewChild(current)
  \RETURN Rollout(current) \COMMENT{expand then rollout}
\ENDIF
\end{algorithmic}
\end{algorithm}

\begin{algorithm}
\caption{Rollout$(s_i)$}
\begin{algorithmic}
\STATE $s \gets s_i$
\WHILE{true}
  \IF{$\operatorname{Terminal}(s)$}
    \RETURN $\operatorname{Value}(s)$
  \ENDIF
  \STATE $a \gets \operatorname{Random}(\operatorname{AvailableActions}(s))$
  \STATE $s \gets \operatorname{Simulate}(a, s)$
\ENDWHILE
\end{algorithmic}
\end{algorithm}

Note. The pseudocode above shows one iteration of MCTS; repeating it lets the agent choose the action with the highest estimated value. After each rollout, the return is backpropagated along the path to the root, and a node’s value is typically estimated as the average of all returns backed up through it, including its own rollouts.
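The pseudocode can be turned into a short runnable sketch. Everything about the environment here is an assumption for illustration: the agent picks three binary actions and the terminal value is the number of 1s chosen, so the best first action is 1. The names `TreeNode`, `iterate`, and `best_action`, the constant c = 1.4, and the choice to recommend the most visited child are likewise assumptions, not taken from a specific library.

```python
import math
import random

DEPTH = 3          # hypothetical toy problem: choose 3 binary actions
ACTIONS = (0, 1)


class TreeNode:
    def __init__(self, state, parent=None):
        self.state = state        # tuple of actions taken so far
        self.parent = parent
        self.children = []
        self.q = 0.0              # cumulative return
        self.n = 0                # visit count


def terminal(state):
    return len(state) == DEPTH


def value(state):
    return float(sum(state))      # reward = number of 1s chosen


def ucb1(node, c=1.4):
    if node.n == 0:
        return math.inf           # unvisited children are tried first
    return node.q / node.n + c * math.sqrt(math.log(node.parent.n) / node.n)


def rollout(state, rng):
    # Play random actions until a terminal state is reached.
    while not terminal(state):
        state = state + (rng.choice(ACTIONS),)
    return value(state)


def iterate(root, rng):
    node = root
    while node.children:                          # selection
        node = max(node.children, key=ucb1)
    if node.n > 0 and not terminal(node.state):   # expansion
        node.children = [TreeNode(node.state + (a,), node) for a in ACTIONS]
        node = node.children[0]
    ret = rollout(node.state, rng)                # rollout
    while node is not None:                       # backpropagation
        node.q += ret
        node.n += 1
        node = node.parent


def best_action(iterations=500, seed=0):
    rng = random.Random(seed)
    root = TreeNode(())
    for _ in range(iterations):
        iterate(root, rng)
    # Recommend the most visited child of the root.
    return max(root.children, key=lambda ch: ch.n).state[-1]


print(best_action())
```

The `iterate` function follows the same branches as the pseudocode (rollout on an unvisited leaf, otherwise expand and roll out from the first new child) and adds the backpropagation step that the note describes.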

UCB1
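\[\text{UCB1}(s_t) = \frac{Q(s_t)}{N(s_t)} + C\sqrt{\frac{\ln N(s_{t-1})}{N(s_t)}}\]
$ Q(s_t) $: cumulative return of node $s_t$ (the diagrams in the example below display the mean return, i.e. this value divided by $N$, as Q)

$ N(s_t) $: the number of visits of node $s_t$; $N(s_{t-1})$ is the number of visits of its parent

$ C $: exploration constant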

The first term is the empirical mean (exploitation).
The second term encourages exploration of less-visited nodes and shrinks as $N(s_t)$ grows.
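As a small worked example of the formula, the sketch below scores two sibling nodes that have the same mean return but different visit counts. The default $C = \sqrt{2}$ and the convention of returning infinity for an unvisited node are common choices that I assume here; the numbers are illustrative.

```python
import math


def ucb1(q_total: float, n: int, n_parent: int, c: float = math.sqrt(2)) -> float:
    """UCB1 score from cumulative return q_total, visits n, parent visits n_parent."""
    if n == 0:
        return math.inf  # convention: an unvisited node is always selected first
    return q_total / n + c * math.sqrt(math.log(n_parent) / n)


# Two children with the same mean return (10) but different visit counts.
often_visited = ucb1(q_total=20.0, n=2, n_parent=3)
rarely_visited = ucb1(q_total=10.0, n=1, n_parent=3)
print(often_visited, rarely_visited)
```

Both nodes have the same exploitation term, so the less-visited node gets the higher score purely from the exploration term.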

Example

Step 0: Initialization

%%{init:{'flowchart':{'useMaxWidth':false,'htmlLabels':true}}}%%
flowchart TB
  subgraph T0[" "]
    direction TB
    t0_s0["s0<br/>Q=0<br/>N=0"]
  end

Step 1: Expand

%%{init:{'flowchart':{'useMaxWidth':false,'htmlLabels':true}}}%%
flowchart TB
  subgraph T1[" "]
    direction TB
    t1_s0["s0<br/>Q=0.00<br/>N=0"]
    t1_s0 -->|a1 = 0| t1_s1L["s1<br/>Q=0<br/>N=0"]
    t1_s0 -->|a1 = 1| t1_s1R["s1<br/>Q=0<br/>N=0"]
  end

Step 2: Rollout & Backpropagation

%%{init:{'flowchart':{'useMaxWidth':false,'htmlLabels':true}}}%%
flowchart TB
  subgraph T2[" "]
    direction TB
    t2_s0["s0<br/>Q=20<br/>N=1"]
    t2_s0 -->|a1 = 0| t2_s1L["s1<br/>Q=20<br/>N=1"]
    t2_s0 -->|a1 = 1| t2_s1R["s1<br/>Q=0<br/>N=0"]
    t2_s1L -.->|"π(a_t | s_t)"| t2_terL["s_ter"]
  end

Step 3: Rollout & Backpropagation

%%{init:{'flowchart':{'useMaxWidth':false,'htmlLabels':true}}}%%
flowchart TB
  subgraph T2[" "]
    direction TB
    t2_s0["s0<br/>Q=15<br/>N=2"]
    t2_s0 -->|a1 = 0| t2_s1L["s1<br/>Q=20<br/>N=1"]
    t2_s0 -->|a1 = 1| t2_s1R["s1<br/>Q=10<br/>N=1"]
    t2_s1R -.->|"π(a_t | s_t)"| t2_terR["s_ter"]
  end

Step 4: Expand

%%{init:{'flowchart':{'useMaxWidth':false,'htmlLabels':true}}}%%
flowchart TB
  subgraph T3[" "]
    direction TB
    t3_s0["s0<br/>Q=15<br/>N=2"]
    t3_s0 -->|a1 = 0| t3_s1L["s1<br/>Q=20<br/>N=1"]
    t3_s0 -->|a1 = 1| t3_s1R["s1<br/>Q=10<br/>N=1"]
    t3_s1L -->|a2 = 0| t3_s2LL["s2<br/>Q=0<br/>N=0"]
    t3_s1L -->|a2 = 1| t3_s2LR["s2<br/>Q=0<br/>N=0"]
  end

Step 5: Rollout & Backpropagation

%%{init:{'flowchart':{'useMaxWidth':false,'htmlLabels':true}}}%%
flowchart TB
  subgraph T4[" "]
    direction TB
    t4_s0["s0<br/>Q=10<br/>N=3"]
    t4_s0 -->|a1 = 0| t4_s1L["s1<br/>Q=10<br/>N=2"]
    t4_s0 -->|a1 = 1| t4_s1R["s1<br/>Q=10<br/>N=1"]
    t4_s1L -->|a2 = 0| t4_s2LL["s2<br/>Q=0<br/>N=1"]
    t4_s1L -->|a2 = 1| t4_s2LR["s2<br/>Q=0<br/>N=0"]
    t4_s2LL -.->|"π(a_t | s_t)"| t4_terL["s_ter"]
  end

Step 6: Expand

%%{init:{'flowchart':{'useMaxWidth':false,'htmlLabels':true}}}%%
flowchart TB
  subgraph T5[" "]
    direction TB
    t5_s0["s0<br/>Q=10<br/>N=3"]
    t5_s0 -->|a1 = 0| t5_s1L["s1<br/>Q=10<br/>N=2"]
    t5_s0 -->|a1 = 1| t5_s1R["s1<br/>Q=10<br/>N=1"]
    t5_s1L -->|a2 = 0| t5_s2LL["s2<br/>Q=0<br/>N=1"]
    t5_s1L -->|a2 = 1| t5_s2LR["s2<br/>Q=0<br/>N=0"]
    t5_s1R -->|a2 = 0| t5_s2RL["s2<br/>Q=0<br/>N=0"]
    t5_s1R -->|a2 = 1| t5_s2RR["s2<br/>Q=0<br/>N=0"]
  end

Step 7: Rollout & Backpropagation

%%{init:{'flowchart':{'useMaxWidth':false,'htmlLabels':true}}}%%
flowchart TB
  subgraph T6[" "]
    direction TB
    t6_s0["s0<br/>Q=11<br/>N=4"]
    t6_s0 -->|a1 = 0| t6_s1L["s1<br/>Q=10<br/>N=2"]
    t6_s0 -->|a1 = 1| t6_s1R["s1<br/>Q=12<br/>N=2"]
    t6_s1L -->|a2 = 0| t6_s2LL["s2<br/>Q=0<br/>N=1"]
    t6_s1L -->|a2 = 1| t6_s2LR["s2<br/>Q=0<br/>N=0"]
    t6_s1R -->|a2 = 0| t6_s2RL["s2<br/>Q=14<br/>N=1"]
    t6_s1R -->|a2 = 1| t6_s2RR["s2<br/>Q=0<br/>N=0"]
    t6_s2RL -.->|"π(a_t | s_t)"| t6_terL["s_ter"]
  end
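The Q and N values in the final diagram can be reproduced by backing up the four rollout returns seen in the example (20, 10, 0, 14) along their paths and averaging. This is a minimal sketch: the node labels such as `s1L` and `s2RL` are names I use for the left and right branches in the diagrams.

```python
from collections import defaultdict

# (path from the root to the node where the rollout started, rollout return)
rollouts = [
    (["s0", "s1L"], 20.0),
    (["s0", "s1R"], 10.0),
    (["s0", "s1L", "s2LL"], 0.0),
    (["s0", "s1R", "s2RL"], 14.0),
]

total = defaultdict(float)   # cumulative return per node
visits = defaultdict(int)    # visit count per node
for path, ret in rollouts:
    for name in path:        # backpropagation: update every node on the path
        total[name] += ret
        visits[name] += 1

mean = {name: total[name] / visits[name] for name in total}
print(mean["s0"], mean["s1L"], mean["s1R"])  # prints 11.0 10.0 12.0
```

The means match the Q values shown for s0 and its two children after Step 7, which confirms that the diagrams display the averaged return.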

Pinsker’s Inequality

2025-09-06T00:00:00+09:00

Th) Pinsker’s Inequality

$\forall P, Q$: probability distributions on a measurable space $(U, \Sigma)$,

$\delta(P, Q) \leq \sqrt{\frac{1}{2} D_{\text{KL}}(P \| Q)}$

$\delta(P, Q)$ : Total variation distance, $\delta(P, Q) = \sup_{A \in \Sigma} |P(A) - Q(A)|$
$D_{\text{KL}}(P \| Q)$ : KL divergence
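Before the proof, the inequality can be checked numerically on random discrete distributions. This is only a sanity check, not a proof. The sketch assumes the natural logarithm for the KL divergence and uses $\delta(P,Q) = \frac{1}{2}\sum_x |p(x) - q(x)|$ for the discrete case; the helper names are hypothetical.

```python
import math
import random


def total_variation(p, q):
    # For discrete distributions, delta(P, Q) = 0.5 * sum |p(x) - q(x)|.
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kl_divergence(p, q):
    # Natural-log KL divergence; terms with p(x) = 0 contribute 0.
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def random_distribution(k, rng):
    # Strictly positive weights, normalized to sum to 1.
    weights = [rng.random() + 1e-9 for _ in range(k)]
    s = sum(weights)
    return [w / s for w in weights]


rng = random.Random(0)
ok = True
for _ in range(1000):
    p = random_distribution(5, rng)
    q = random_distribution(5, rng)
    ok = ok and total_variation(p, q) <= math.sqrt(0.5 * kl_divergence(p, q)) + 1e-12
print(ok)
```

Measuring the KL divergence in bits (log base 2) only makes the right-hand side larger, so the check would pass in that convention as well.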

Proof)

I prove only the discrete case.

A special case of Pinsker’s inequality is proved first and then used in the full proof.

Special case of Pinsker’s Inequality

$ P = \begin{cases} 1 & \text{w.p. } p \\ 0 & \text{w.p. } 1-p \end{cases} $

$ Q = \begin{cases} 1 & \text{w.p. } q \\ 0 & \text{w.p. } 1-q \end{cases} $

s.t. $ p \ge q $

$ |P-Q|_1 = |p-q| + |(1-p) - (1-q)| = 2|p-q| = 2(p-q) \quad (\because p \geq q)$

$f(p,q) = p \log \frac{p}{q} + (1-p)\log \frac{1-p}{1-q} - \frac{1}{2 \ln 2}(2(p-q))^2$

and

$\frac{\partial f}{\partial q} = \frac{\partial}{\partial q}\left(p\log p - p\log q\right) + \frac{\partial}{\partial q}\left((1-p)(\log(1-p) - \log(1-q))\right) - \frac{\partial}{\partial q}\frac{1}{2\ln 2}(2(p-q))^2$

$= -\frac{p}{q \ln 2} + \frac{1-p}{(1-q)\ln 2} - \frac{1}{2\ln 2}\cdot 2(2(p-q))(-2)$

$= \frac{1}{\ln 2}\left(-\frac{p}{q} + \frac{1-p}{1-q}\right) + \frac{4}{\ln 2}(p-q)$

$= \frac{1}{\ln 2}\left(-\frac{p}{q} + \frac{1-p}{1-q} + 4(p-q)\right)$

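$= -\frac{p-q}{\ln 2}\left(\frac{1}{q(1-q)} - 4\right) \le 0 \quad (\because p \ge q \land \frac{1}{q(1-q)} \ge 4)$

Here $\frac{1}{q(1-q)} \ge 4$ holds for $q \in (0,1)$ because $\frac{1}{4} - q(1-q) = \left(q - \frac{1}{2}\right)^2 \ge 0$.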

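and

$q = p \implies f(p,q)=0$

Since $f$ is non-increasing in $q$ and vanishes at $q = p$, it is non-negative for every $q \le p$:

$\therefore f(p,q)\ge 0 \quad (p \ge q)$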

which means that

$f(p,q) = D_{\mathrm{KL}}(P \| Q) - \tfrac{1}{2 \ln 2} |P - Q|_1^2 \ge 0$

$\therefore D_{\mathrm{KL}}(P \| Q) \ge \tfrac{1}{2 \ln 2} |P - Q|_1^2 \quad \cdots \quad (1)$
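The special case can also be checked numerically by evaluating $f(p,q)$ on a grid with $p \ge q$. The sketch below uses log base 2, matching the $\frac{1}{2\ln 2}$ constant above; the grid resolution is an arbitrary choice.

```python
import math


def f(p: float, q: float) -> float:
    """f(p, q) = KL(Bern(p) || Bern(q)) in bits - (2(p - q))^2 / (2 ln 2)."""
    kl = 0.0
    if p > 0:
        kl += p * math.log2(p / q)
    if p < 1:
        kl += (1 - p) * math.log2((1 - p) / (1 - q))
    return kl - (2 * (p - q)) ** 2 / (2 * math.log(2))


grid = [i / 100 for i in range(1, 100)]
smallest = min(f(p, q) for p in grid for q in grid if p >= q)
print(smallest)
```

The minimum over the grid should not be negative beyond floating-point error, in line with inequality (1).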

Let

$p(x) := P_P(x)$
$q(x) := P_Q(x)$
$ A := \{x \mid p(x) \geq q(x) \} $

then define random variable

$ Z(x) := \begin{cases} 1 & (x \in A) \\ 0 & (x \notin A) \end{cases} $

then, below holds

Th) chain rule of KL divergence

$D_{\text{KL}}(P \| Q) = D_{\text{KL}}(P(Z) \| Q(Z)) + D_{\text{KL}}(P \| Q \mid Z)$

Proof)

$D_{\text{KL}}(P(Z) \| Q(Z))$

$= P(A) \log \frac{P(A)}{Q(A)} + P(A^c) \log \frac{P(A^c)}{Q(A^c)}$

$= \sum_{x \in A} p(x) \log \frac{P(A)}{Q(A)} + \sum_{x \notin A} p(x) \log \frac{P(A^c)}{Q(A^c)} \quad\cdots\quad \text{(2)}$

and

$D_{\text{KL}}(P \| Q \mid Z)$

$ = \mathbb{E}_{Z \sim P(Z)}\!\left[ D_{\mathrm{KL}}\!\left( P(P \mid Z=z)\,\|\,P(Q \mid Z=z) \right) \right] \quad (\text{KL divergence between two conditional probability distributions}) $

$= P(A) D_{\text{KL}}(P(P \mid Z=1) \,\|\, P(Q \mid Z=1)) + P(A^c) D_{\text{KL}}(P(P \mid Z=0) \| P(Q \mid Z=0))$

$= P(A) \sum_{x \in A} p(x \mid Z=1) \log \frac{p(x \mid Z=1)}{q(x \mid Z=1)} + P(A^c) \sum_{x \notin A} p(x \mid Z=0) \log \frac{p(x \mid Z=0)}{q(x \mid Z=0)}$

$ = \sum_{x\in A} p(x)\,\log\left(\frac{p(x)}{q(x)}\cdot\frac{Q(A)}{P(A)}\right)+\sum_{x\notin A} p(x)\,\log\left(\frac{p(x)}{q(x)}\cdot\frac{Q(A^c)}{P(A^c)}\right) \quad\cdots\quad \text{(3)}$

$ \left( \because p(x \mid Z=1) = \frac{p(x)}{P(A)} \text{, } q(x \mid Z=1) = \frac{q(x)}{Q(A)} \text{, } p(x \mid Z=0) = \frac{p(x)}{P(A^{c})} \text{, } q(x \mid Z=0) = \frac{q(x)}{Q(A^{c})}\right) $

Combine (2), (3), then

$D_{\text{KL}}(P(Z) \| Q(Z)) + D_{\text{KL}}(P \| Q \mid Z)$

$= \sum_{x \in A} p(x) \log \frac{P(A)}{Q(A)} + \sum_{x \notin A} p(x) \log \frac{P(A^c)}{Q(A^c)} + \sum_{x \in A} p(x) \log\left(\frac{p(x)}{q(x)} \cdot \frac{Q(A)}{P(A)}\right) + \sum_{x \notin A} p(x) \log\left(\frac{p(x)}{q(x)} \cdot \frac{Q(A^c)}{P(A^c)}\right)$

$= \sum_{x \in A} p(x) \left( \log\left(\frac{p(x)}{q(x)} \cdot \frac{Q(A)}{P(A)}\right) + \log \frac{P(A)}{Q(A)} \right) + \sum_{x \notin A} p(x) \left( \log\left(\frac{p(x)}{q(x)} \cdot \frac{Q(A^c)}{P(A^c)}\right) + \log \frac{P(A^c)}{Q(A^c)} \right)$

$= \sum_{x \in U} p(x) \log \frac{p(x)}{q(x)}$

$= D_{\text{KL}}(P \| Q)$

$\blacksquare$
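The chain rule can be verified on a concrete example. The two distributions below are arbitrary illustrative values; the code builds $A = \{x \mid p(x) \ge q(x)\}$, computes the marginal and conditional terms separately, and compares their sum with $D_{\text{KL}}(P \| Q)$.

```python
import math


def kl(p, q):
    # KL divergence in bits; terms with p(x) = 0 contribute 0.
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


p = [0.5, 0.2, 0.2, 0.1]
q = [0.25, 0.25, 0.25, 0.25]

# A = {x : p(x) >= q(x)}, Z = 1 on A and 0 elsewhere.
in_a = [a >= b for a, b in zip(p, q)]
pa = sum(a for a, z in zip(p, in_a) if z)
qa = sum(b for b, z in zip(q, in_a) if z)

# Marginal term: KL between the two distributions of Z.
marginal = kl([pa, 1 - pa], [qa, 1 - qa])

# Conditional term: expectation over Z ~ P(Z) of the KL between conditionals.
conditional = 0.0
for z_value, pz, qz in ((True, pa, qa), (False, 1 - pa, 1 - qa)):
    p_cond = [a / pz for a, z in zip(p, in_a) if z == z_value]
    q_cond = [b / qz for b, z in zip(q, in_a) if z == z_value]
    conditional += pz * kl(p_cond, q_cond)

print(kl(p, q), marginal + conditional)
```

The two printed numbers agree up to rounding, and the conditional term is non-negative, which is what the later step of the proof relies on.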

Let

$ P_A := \begin{cases} 1 & \text{w.p. } \sum_{x \in A} p(x) \\ 0 & \text{w.p. } \sum_{x \notin A} p(x) \end{cases} $

$ Q_A := \begin{cases} 1 & \text{w.p. } \sum_{x \in A} q(x) \\ 0 & \text{w.p. } \sum_{x \notin A} q(x) \end{cases} $

Then,

$ |P - Q|_1 $

$ = \sum_x |p(x) - q(x)| $

$ = \sum_{x \in A} (p(x) - q(x)) + \sum_{x \notin A} (q(x) - p(x)) \quad (\because p(x) \geq q(x) \, \forall x \in A) $

$ = \left| \sum_{x \in A} p(x) - \sum_{x \in A} q(x) \right| + \left| \sum_{x \notin A} q(x) - \sum_{x \notin A} p(x) \right| $

$ = |P(P_A = 1) - P(Q_A = 1)| + |P(P_A = 0) - P(Q_A = 0)| $

$ = \sum_{x \in \{0,1\}} |P(P_A = x) - P(Q_A = x)| $

$ = |P(P_A) - P(Q_A)|_1 \quad\cdots\quad (4) $

Therefore, below holds

$ D_{\mathrm{KL}}(P \| Q) $

$ \ge D_{\mathrm{KL}}(P(Z) \| Q(Z)) \quad (\because\text{Chain rule of KL divergence and } D_{\mathrm{KL}}(P \| Q \mid Z) \ge 0) $

$ = D_{\mathrm{KL}}(P(P_A) \| P(Q_A)) \quad (\because Z \text{ is distributed as } P_A \text{ under } P \text{ and as } Q_A \text{ under } Q) $

$ \Rightarrow D_{\mathrm{KL}}(P \| Q) \ge D_{\mathrm{KL}}(P(P_A) \| P(Q_A)) $

$ \ge \frac{1}{2 \ln 2} |P(P_A) - P(Q_A)|_1^2 \quad (\because\text{Special case of Pinsker’s inequality}) $

$ = \frac{1}{2 \ln 2} |P - Q|_1^2 \quad (\because (4)) $

$ \Rightarrow \sqrt{\tfrac{1}{2} D_{\mathrm{KL}}(P \| Q)} \ge \sqrt{\tfrac{1}{4 \ln 2}} \, |P - Q|_1 \quad\cdots\quad (5) $

$ \text{Let } A^\ast \in \Sigma \quad\text{s.t.}\quad \sup_{A \in \Sigma} |P(A) - Q(A)| = |P(A^\ast) - Q(A^\ast)| \quad (\because\text{Hahn decomposition theorem}) $

$ \text{then let } p := P(A^\ast), \; q := Q(A^\ast) $

$ |P - Q|_1 = |P(A^\ast) - Q(A^\ast)| + |P((A^{\ast})^c) - Q((A^{\ast})^c)| $

$ = |P(A^\ast) - Q(A^\ast) - (P((A^{\ast})^c) - Q((A^{\ast})^c))| $

$ = |p - q - (1-p) + (1-q)| $

$ = 2(p - q) $

$ = 2 \delta(P,Q) \quad\cdots\quad (6) $

$ \therefore \sqrt{\tfrac{1}{2} D_{\mathrm{KL}}(P \| Q)} \;\;\ge\;\; \sqrt{\tfrac{1}{4 \ln 2}} \, |P - Q|_1 \quad (\because(5)) $

$ = \sqrt{\tfrac{1}{4 \ln 2}} \cdot 2 \delta(P,Q) = \tfrac{1}{\sqrt{\ln 2}}\,\delta(P,Q) \;\ge\; \delta(P,Q) \quad (\because(6),\ \ln 2 < 1) $

$ \Rightarrow \sqrt{\tfrac{1}{2} D_{\mathrm{KL}}(P \| Q)} \ge \delta(P,Q) $

$\blacksquare$