
🧠 Entropy in Reinforcement Learning

🚀 Intuition

  • Probability tells us how certain an agent is about taking an action.

  • Surprise captures how unexpected an outcome is.

  • So, a natural idea is:

\[ \text{Surprise}(a) = \frac{1}{P(a)} \]

But this isn't ideal:

  • If \(P(a) = 1\), the action is fully expected, so the surprise should be 0, yet \(\frac{1}{1} = 1\).
  • Instead, we define surprise using the logarithm:

📐 Surprise = Log Inverse Probability

\[ \text{Surprise}(a) = \log\left(\frac{1}{P(a)}\right) = -\log P(a) \]

So, the less likely the action, the greater the surprise.
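
For example, an action with \(P(a) = 0.5\) carries a surprise of \(-\log 0.5 = \log 2 \approx 0.69\) nats, while a near-certain action with \(P(a) = 0.99\) carries only about \(0.01\) nats.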


📊 Entropy = Expected Surprise

Entropy is the expected surprise over all possible actions:

\[ \begin{align*} \text{Entropy}(\pi) &= \sum_{a \in \mathcal{A}} P(a) \cdot \text{Surprise}(a) \\ &= \sum_{a \in \mathcal{A}} P(a) \cdot (-\log P(a)) \\ &= -\sum_{a \in \mathcal{A}} P(a) \cdot \log P(a) \end{align*} \]
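
As a sanity check, a uniform policy over \(|\mathcal{A}|\) actions assigns \(P(a) = \frac{1}{|\mathcal{A}|}\) to every action, so

\[ \text{Entropy}(\pi_{\text{uniform}}) = -\sum_{a \in \mathcal{A}} \frac{1}{|\mathcal{A}|} \log \frac{1}{|\mathcal{A}|} = \log |\mathcal{A}| \]

which is the maximum possible value, while a deterministic policy has entropy 0.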

🔁 What Entropy Tells Us

| Distribution       | Entropy | Notes                         |
|--------------------|---------|-------------------------------|
| [1.0, 0.0, 0.0]    | 0       | Fully deterministic           |
| [0.7, 0.2, 0.1]    | Low     | Fairly confident              |
| [0.33, 0.33, 0.34] | High    | Very uncertain (max entropy)  |

📌 Entropy is highest when all actions are equally likely (pure exploration), and lowest when the policy is deterministic (pure exploitation).
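
The table's values can be verified directly; here is a minimal check using the natural log, so the maximum for 3 actions is \(\ln 3 \approx 1.10\):

```python
import math

def entropy(probs):
    """Shannon entropy in nats: -sum(p * ln p), skipping zero-probability terms."""
    return -sum(p * math.log(p) for p in probs if p > 0)

print(entropy([1.0, 0.0, 0.0]))     # 0.0                -> fully deterministic
print(entropy([0.7, 0.2, 0.1]))     # ~0.80              -> fairly confident
print(entropy([0.33, 0.33, 0.34]))  # ~1.10 (about ln 3) -> near-uniform, max entropy
```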


🧪 PyTorch: Compute Entropy

If you have a probability distribution (e.g. from softmax), you can compute entropy like this:

```python
import torch
import torch.nn.functional as F

# Example: logits for 3 actions
logits = torch.tensor([1.0, 0.5, -0.5])

# Get action probabilities
probs = F.softmax(logits, dim=-1)

# Compute entropy (the +1e-8 inside the log is for numerical stability)
entropy = -torch.sum(probs * torch.log(probs + 1e-8))

print("Entropy:", entropy.item())
```
  • Or, equivalently, using torch.distributions.Categorical:

```python
import torch

# Example logits for a single state with 3 actions
logits = torch.tensor([1.0, 0.5, -0.5])

# Create a categorical distribution over the actions
dist = torch.distributions.Categorical(logits=logits)

# Compute entropy
entropy = dist.entropy()

print("Entropy:", entropy.item())
```

⚙️ When is Entropy Used in RL?

  • Policy Gradient Methods (PPO, A2C, etc.):

      • Add an entropy bonus to the loss (see the sketch after this list):

        total_loss = ppo_loss - entropy_coeff * entropy

      • Prevents the policy from collapsing too early into deterministic behavior.

      • Encourages ongoing exploration, especially in early training.

  • Entropy Coefficient (hyperparameter):

      • Typically a small value (e.g. 0.01 or 0.001).

      • Can be annealed (decayed) over time.
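
Below is a minimal sketch of how the entropy bonus typically enters a policy-gradient loss. The rollout tensors are dummy data, and the plain REINFORCE-style surrogate stands in for a full PPO clipped objective; the entropy lines mirror the bonus described above:

```python
import torch

# Dummy rollout data: 4 states, 3 actions (stand-ins for real collected experience)
logits     = torch.randn(4, 3, requires_grad=True)   # policy network output
actions    = torch.tensor([0, 2, 1, 0])              # actions taken during the rollout
advantages = torch.tensor([0.5, -0.2, 1.0, 0.3])     # estimated advantages

entropy_coeff = 0.01  # small hyperparameter, often annealed over training

dist      = torch.distributions.Categorical(logits=logits)
log_probs = dist.log_prob(actions)

# Simple policy-gradient surrogate (a real PPO loss would use the clipped probability ratio)
ppo_loss = -(log_probs * advantages).mean()

# Entropy bonus: subtracting it rewards higher-entropy (more exploratory) policies
entropy    = dist.entropy().mean()
total_loss = ppo_loss - entropy_coeff * entropy

total_loss.backward()  # gradients flow back to the policy parameters (here, the logits)
print("Loss:", total_loss.item(), "Entropy:", entropy.item())
```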

🧠 Summary

  • Entropy is a measure of uncertainty in the policy.
  • Encouraging entropy helps with exploration in RL.
  • PPO uses an entropy bonus to maintain a balance between exploring and exploiting.