Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents

Jiawei Wang1,2,*, Jiacai Liu1,3,*, Yuqian Fu1,4,*, Yingru Li1, Xintao Wang1,3,*, Yuan Lin1,*, Yu Yue1,*, Lin Zhang5, Yang Wang1,*,†, Ke Wang1,*,†
1ByteDance, 2University of Science and Technology of China, 3Fudan University, 4Institute of Automation, Chinese Academy of Sciences
*Work done at ByteDance Seed. †Corresponding authors.

Abstract

Large Language Model (LLM) agents face a significant challenge in long-horizon tasks due to sparse, outcome-based rewards that make credit assignment for intermediate steps difficult. We identify a fundamental problem in the learning dynamics itself: a policy gradient's magnitude is inherently coupled with its entropy, leading to inefficiently small updates for confident correct actions and potentially destabilizing large updates for uncertain ones. To resolve this, we propose Entropy-Modulated Policy Gradients (EMPG), a framework that re-calibrates the learning signal based on step-wise uncertainty and the final task outcome. EMPG amplifies updates for confident correct actions, strongly penalizes confident errors to combat "hallucinated confidence," and attenuates updates from uncertain steps to stabilize exploration. We further introduce a "future clarity" intrinsic bonus that encourages the agent to find more predictable solution paths. Through comprehensive experiments on challenging benchmarks like WebShop, ALFWorld, and Deep Search, we demonstrate that EMPG achieves substantial performance gains and significantly outperforms strong policy gradient baselines.

Theoretical Motivation

Our approach is motivated by a fundamental analysis of the relationship between a policy's gradient and its predictive uncertainty. Standard policy gradients possess an inherent dynamic where high-entropy (uncertain) actions naturally produce large gradients, while low-entropy (confident) actions produce small ones. This presents a dual challenge:

  1. Confident and correct steps, which should be strongly reinforced, receive small updates, limiting learning speed.
  2. Uncertain exploratory steps can introduce large, noisy gradients that destabilize training.

This dynamic is formally characterized by the following proposition:

Proposition 1. For a softmax policy, the expected squared L2-norm of the score function is a direct function of the policy's Rényi-2 entropy, H₂(π):
E_{a~π}[ ||∇_z log π(a|s)||² ] = 1 − exp(−H₂(π))
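
As a sanity check, Proposition 1 can be verified numerically. The following sketch (arbitrary logits, purely illustrative) compares the expected squared score norm of a random softmax policy with 1 − exp(−H₂(π)):

import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=8)                      # arbitrary softmax logits z
pi = np.exp(logits) / np.exp(logits).sum()       # policy probabilities

# Score of a softmax policy: d log pi(a) / d z_j = 1{a == j} - pi_j
score_sq_norms = np.array([
    np.sum((np.eye(len(pi))[a] - pi) ** 2) for a in range(len(pi))
])
lhs = np.dot(pi, score_sq_norms)                 # E_{a~pi}[ ||grad_z log pi(a)||^2 ]

renyi2 = -np.log(np.sum(pi ** 2))                # Renyi-2 entropy H_2(pi)
rhs = 1.0 - np.exp(-renyi2)

print(np.isclose(lhs, rhs))                      # True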

Proposition 1 shows that the expected squared gradient norm is monotonically coupled with policy entropy. Our goal is to re-calibrate this learning signal; EMPG provides a two-part re-calibration:

1. Self-Calibrating Gradient Scaling

This component directly addresses the issue by re-calibrating the magnitude of the update based on current-step uncertainty, amplifying reliable signals and attenuating noisy ones.
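
A minimal sketch of this scaling, mirroring the g(H) construction in the pseudocode below (batch-level min-max normalization of step entropies, an exponential map with strength k, and a mean-preserving rescale):

import numpy as np

def self_calibrating_scale(step_entropies, k=1.0, eps=1e-8):
    """Map per-step entropies to multiplicative advantage scales g(H)."""
    H = np.asarray(step_entropies, dtype=np.float64)
    H_norm = (H - H.min()) / (H.max() - H.min() + eps)  # batch-level min-max normalization
    g = np.exp(-k * H_norm)                             # lower entropy -> larger scale
    return g / (g.mean() + eps)                         # mean-preserving normalization

# Confident (low-entropy) steps are scaled above 1, uncertain steps below 1.
print(self_calibrating_scale([0.2, 1.0, 2.5]))          # approx [1.45, 1.02, 0.53]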

2. Future Clarity Bonus

However, re-calibrating the update magnitude is only half the solution. A truly effective learning signal must also guide the agent in a useful direction. This motivates our second component, the Future Clarity Bonus, which can be formally justified through the lens of information theory. By providing an intrinsic motivation for the agent to seek low-entropy next states, the bonus encourages actions that yield high Information Gain about the optimal future path. This corresponds to a local, step-wise objective of minimizing the policy's entropy at the next state:

min H(π_θ(· | s_{t+1}))

This objective imbues the agent with a generalizable meta-skill: to actively seek clarity in the face of ambiguity.
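
A minimal sketch of turning this objective into a per-step intrinsic bonus, mirroring the f(H) term in the pseudocode below (shown here for a single trajectory; step t receives a larger bonus the lower the entropy at step t+1):

import numpy as np

def future_clarity_bonus(step_entropies, k_f=1.0, zeta=0.1, eps=1e-8):
    """Intrinsic bonus for step t based on the entropy of step t+1."""
    H = np.asarray(step_entropies, dtype=np.float64)
    H_norm = (H - H.min()) / (H.max() - H.min() + eps)
    f = np.exp(-k_f * H_norm)          # lower next-step entropy -> larger bonus
    bonus = np.zeros_like(H)
    bonus[:-1] = zeta * f[1:]          # step t is rewarded for a "clear" step t+1
    return bonus                       # the final step has no successor, so no bonus

print(future_clarity_bonus([1.5, 0.4, 0.1]))   # approx [0.08, 0.10, 0.00]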

The EMPG Framework

Entropy-Modulated Credit Assignment

Standard policy gradients assign the same outcome-based credit to every step in a trajectory, yet the resulting updates are far from uniform: high-entropy (uncertain) actions produce large gradients, while low-entropy (confident) actions produce small ones. This can destabilize training and slow learning on correct, confident steps.

EMPG solves this by modulating the credit assignment based on step-wise entropy. For a successful trajectory, it amplifies updates for confident steps and attenuates updates for uncertain ones, accelerating learning. For a failed trajectory, it strongly penalizes confident errors to fight "hallucinated confidence" while avoiding harsh penalties for uncertain exploration.
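
As a toy illustration (numbers chosen for exposition, not taken from the paper), a single entropy-dependent scale interacts with the sign of the outcome-based advantage as follows:

import numpy as np

# Toy normalized entropies for a 3-step trajectory: confident -> uncertain
H_norm = np.array([0.1, 0.5, 0.9])
scale = np.exp(-1.0 * H_norm)      # entropy-dependent scale (before mean-normalization)

success_adv = +1.0 * scale         # success: confident steps reinforced most strongly
failure_adv = -1.0 * scale         # failure: confident errors penalized most strongly
print(success_adv)                 # approx [ 0.90,  0.61,  0.41]
print(failure_adv)                 # approx [-0.90, -0.61, -0.41]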

Future Clarity Bonus

EMPG also introduces a "future clarity" bonus. This intrinsic reward encourages the agent to take actions that lead to less uncertain (lower entropy) subsequent states. This guides the agent towards more predictable and stable solution paths, improving overall performance and reliability.

Figure: EMPG's entropy-modulated credit assignment contrasted with the baseline's uniform credit assignment.

Experimental Results

EMPG consistently outperforms strong policy gradient baselines across challenging long-horizon benchmarks.

Performance on ALFWorld and WebShop

Table: Results on the ALFWorld and WebShop benchmarks.

Performance on Deep Search

Table: Results on the Deep Search benchmark.

Analysis

Training Stability

Figure: KL loss dynamics during training for the Qwen2.5-32B-Instruct model. The baseline agent (orange) suffers from late-stage instability, while the EMPG-enhanced agent (blue) remains stable throughout training.

Step-Level Entropy Dynamics

Figure: Average entropy change after RL fine-tuning at each entropy percentile. Unlike token-level findings, even low-entropy steps undergo significant changes, validating our step-level analysis approach.

PyTorch-Style Pseudocode

This pseudocode outlines the core logic for calculating the EMPG advantage, demonstrating how to modulate credits based on step-level entropy.


import numpy as np
import torch

def compute_empg_advantage(tokenizer, batch, k=1.0, k_f=1.0, zeta=0.1):
    """
    Args:
        tokenizer: The tokenizer for identifying response segments.
        batch: A data batch with 'responses', 'old_entropy', 'advantages'.
        k (float): Hyperparameter for self-calibrating gradient scaling.
        k_f (float): Hyperparameter for the future clarity bonus.
        zeta (float): Hyperparameter for the future clarity bonus.
    """
    # --- 1. First Pass: Collect Step-Level Entropies ---
    all_step_entropies = []
    # segments_to_modify stores {'sample_idx', 'start', 'end'} for each step
    segments_to_modify = [] 

    for i in range(batch.batch.batch_size[0]):
        # Find "assistant" segments, which correspond to agent steps.
        token_segments = process_token_sequences(
            batch.batch['responses'][i], 
            tokenizer.encode("<|im_start|>assistant\n"), 
            tokenizer.encode('<|im_end|>')
        )
        for start, end in token_segments:
            if start >= end: continue
            
            # Calculate the average token-level entropy for the step
            step_entropy = batch.batch['old_entropy'][i][start:end].mean().item()
            all_step_entropies.append(step_entropy)
            segments_to_modify.append({'sample_idx': i, 'start': start, 'end': end})

    if not all_step_entropies: return

    # --- 2. Calculate Modulated Advantage Components ---
    H = np.array(all_step_entropies)
    
    # Batch-level entropy normalization (Eq. 12) with epsilon = 1e-8
    min_H, max_H = np.min(H), np.max(H)
    H_norm = (H - min_H) / (max_H - min_H + 1e-8)

    # Self-calibrating gradient scaling g(H) (Eq. 10)
    g_H_unnormalized = np.exp(-k * H_norm)
    mean_g_H = np.mean(g_H_unnormalized)
    g_H = g_H_unnormalized / (mean_g_H + 1e-8)
    
    # Future clarity bonus f(H) (Eq. 11)
    f_H = np.exp(-k_f * H_norm)

    # Convert to tensors for PyTorch operations
    g_H = torch.tensor(g_H, device=batch.batch['advantages'].device, dtype=torch.float32)
    f_H = torch.tensor(f_H, device=batch.batch['advantages'].device, dtype=torch.float32)

    # --- 3. Second Pass: Apply Advantage Modulation (Eq. 8) ---
    step_advantages = []
    for i, segment in enumerate(segments_to_modify):
        idx, start, end = segment['sample_idx'], segment['start'], segment['end']
        
        # Apply self-calibrating gradient scaling
        batch.batch['advantages'][idx][start:end] *= g_H[i]
        
        # Add future clarity bonus if there is a next step
        next_seg = segments_to_modify[i+1] if i+1 < len(segments_to_modify) else None
        if next_seg and next_seg['sample_idx'] == idx:
            batch.batch['advantages'][idx][start:end] += zeta * f_H[i+1]
        # Record the first token's advantage as the step-level representative for centering
        step_advantages.append(batch.batch['advantages'][idx][start])
            
    # --- 4. Final Advantage Normalization (Eq. 7) ---
    if step_advantages:
        final_adv_mean = torch.mean(torch.stack(step_advantages))
        batch.batch['advantages'] -= final_adv_mean

Paper & Citation

For a detailed description of the method, experiments, and analysis, please refer to our paper.

BibTeX

@misc{wang2025harnessinguncertaintyentropymodulatedpolicy,
  title={Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents},
  author={Jiawei Wang and Jiacai Liu and Yuqian Fu and Yingru Li and Xintao Wang and Yuan Lin and Yu Yue and Lin Zhang and Yang Wang and Ke Wang},
  year={2025},
  eprint={2509.09265},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2509.09265},
}