Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents

Jiawei Wang1,2,*, Jiacai Liu1,3,*, Yuqian Fu1,4,*, Yingru Li1, Xintao Wang1,3,*, Yuan Lin1,*, Yu Yue1,*, Lin Zhang5, Yang Wang1,*,†, Ke Wang1,*,†
1ByteDance, 2University of Science and Technology of China, 3Fudan University, 4Institute of Automation, Chinese Academy of Sciences
*Work done at ByteDance Seed. †Corresponding authors.

Abstract

Large Language Model (LLM) agents face a significant challenge in long-horizon tasks due to sparse, outcome-based rewards that make credit assignment for intermediate steps difficult. We identify a fundamental problem in the learning dynamics itself: a policy gradient's magnitude is inherently coupled with its entropy, leading to inefficiently small updates for confident correct actions and potentially destabilizing large updates for uncertain ones. To resolve this, we propose Entropy-Modulated Policy Gradients (EMPG), a framework that re-calibrates the learning signal based on step-wise uncertainty and the final task outcome. EMPG amplifies updates for confident correct actions, strongly penalizes confident errors to combat "hallucinated confidence," and attenuates updates from uncertain steps to stabilize exploration. We further introduce a "future clarity" intrinsic bonus that encourages the agent to find more predictable solution paths. Through comprehensive experiments on challenging benchmarks like WebShop, ALFWorld, and Deep Search, we demonstrate that EMPG achieves substantial performance gains and significantly outperforms strong policy gradient baselines.

Theoretical Motivation

Our approach is motivated by a fundamental analysis of the relationship between a policy's gradient and its predictive uncertainty. Standard policy gradients possess an inherent dynamic where high-entropy (uncertain) actions naturally produce large gradients, while low-entropy (confident) actions produce small ones. This presents a dual challenge:

  1. Confident and correct steps, which should be strongly reinforced, receive small updates, limiting learning speed.
  2. Uncertain exploratory steps can introduce large, noisy gradients that destabilize training.

This dynamic is formally characterized by the following proposition:

Proposition 1. For a softmax policy, the expected squared L2-norm of the score function is a direct function of the policy's Rényi-2 entropy, H₂(π):
E_{a~π}[ ||∇_z log π(a|s)||² ] = 1 − exp(−H₂(π))
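
As a sanity check, Proposition 1 can be verified numerically. The following sketch (arbitrary logits, purely illustrative) compares the expected squared score norm of a random softmax policy with 1 − exp(−H₂(π)):

import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=8)                      # arbitrary softmax logits z
pi = np.exp(logits) / np.exp(logits).sum()       # policy probabilities

# Score of a softmax policy: d log pi(a) / d z_j = 1{a == j} - pi_j
score_sq_norms = np.array([
    np.sum((np.eye(len(pi))[a] - pi) ** 2) for a in range(len(pi))
])
lhs = np.dot(pi, score_sq_norms)                 # E_{a~pi}[ ||grad_z log pi(a)||^2 ]

renyi2 = -np.log(np.sum(pi ** 2))                # Renyi-2 entropy H_2(pi)
rhs = 1.0 - np.exp(-renyi2)

print(np.isclose(lhs, rhs))                      # True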

Proposition 1 shows that the expected squared gradient norm is monotonically coupled with policy entropy. Our goal is to re-calibrate this learning signal; EMPG provides a two-part re-calibration:

1. Self-Calibrating Gradient Scaling

This component directly addresses the issue by re-calibrating the magnitude of the update based on current-step uncertainty, amplifying reliable signals and attenuating noisy ones.
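
A minimal sketch of this scaling, mirroring the g(H) construction in the pseudocode below (batch-level min-max normalization of step entropies, an exponential map with strength k, and a mean-preserving rescale):

import numpy as np

def self_calibrating_scale(step_entropies, k=1.0, eps=1e-8):
    """Map per-step entropies to multiplicative advantage scales g(H)."""
    H = np.asarray(step_entropies, dtype=np.float64)
    H_norm = (H - H.min()) / (H.max() - H.min() + eps)  # batch-level min-max normalization
    g = np.exp(-k * H_norm)                             # lower entropy -> larger scale
    return g / (g.mean() + eps)                         # mean-preserving normalization

# Confident (low-entropy) steps are scaled above 1, uncertain steps below 1.
print(self_calibrating_scale([0.2, 1.0, 2.5]))          # approx [1.45, 1.02, 0.53]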

2. Future Clarity Bonus

However, re-calibrating the update magnitude is only half the solution. A truly effective learning signal must also guide the agent in a useful direction. This motivates our second component, the Future Clarity Bonus, which can be formally justified through the lens of information theory. By providing an intrinsic motivation for the agent to seek low-entropy next states, the bonus encourages actions that yield high Information Gain about the optimal future path. This corresponds to a local, step-wise objective of minimizing the policy's entropy at the next state:

min H(π_θ(· | s_{t+1}))

This objective imbues the agent with a generalizable meta-skill: to actively seek clarity in the face of ambiguity.
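
A minimal sketch of turning this objective into a per-step intrinsic bonus, mirroring the f(H) term in the pseudocode below (shown here for a single trajectory; step t receives a larger bonus the lower the entropy at step t+1):

import numpy as np

def future_clarity_bonus(step_entropies, k_f=1.0, zeta=0.1, eps=1e-8):
    """Intrinsic bonus for step t based on the entropy of step t+1."""
    H = np.asarray(step_entropies, dtype=np.float64)
    H_norm = (H - H.min()) / (H.max() - H.min() + eps)
    f = np.exp(-k_f * H_norm)          # lower next-step entropy -> larger bonus
    bonus = np.zeros_like(H)
    bonus[:-1] = zeta * f[1:]          # step t is rewarded for a "clear" step t+1
    return bonus                       # the final step has no successor, so no bonus

print(future_clarity_bonus([1.5, 0.4, 0.1]))   # approx [0.08, 0.10, 0.00]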

The EMPG Framework

Entropy-Modulated Credit Assignment

Standard policy gradients assign the same outcome-based credit to every step in a trajectory, yet the resulting updates are far from uniform: high-entropy (uncertain) actions produce large gradients, while low-entropy (confident) actions produce small ones. This can destabilize training and slow learning on correct, confident steps.

EMPG solves this by modulating the credit assignment based on step-wise entropy. For a successful trajectory, it amplifies updates for confident steps and attenuates updates for uncertain ones, accelerating learning. For a failed trajectory, it strongly penalizes confident errors to fight "hallucinated confidence" while avoiding harsh penalties for uncertain exploration.
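
As a toy illustration (numbers chosen for exposition, not taken from the paper), a single entropy-dependent scale interacts with the sign of the outcome-based advantage as follows:

import numpy as np

# Toy normalized entropies for a 3-step trajectory: confident -> uncertain
H_norm = np.array([0.1, 0.5, 0.9])
scale = np.exp(-1.0 * H_norm)      # entropy-dependent scale (before mean-normalization)

success_adv = +1.0 * scale         # success: confident steps reinforced most strongly
failure_adv = -1.0 * scale         # failure: confident errors penalized most strongly
print(success_adv)                 # approx [ 0.90,  0.61,  0.41]
print(failure_adv)                 # approx [-0.90, -0.61, -0.41]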

Future Clarity Bonus

EMPG also introduces a "future clarity" bonus. This intrinsic reward encourages the agent to take actions that lead to less uncertain (lower entropy) subsequent states. This guides the agent towards more predictable and stable solution paths, improving overall performance and reliability.

Figure: EMPG's entropy-modulated credit assignment contrasted with the baseline's uniform credit assignment.

Experimental Results

EMPG consistently outperforms strong policy gradient baselines across challenging long-horizon benchmarks.

Performance on ALFWorld and WebShop

Table: Results on the ALFWorld and WebShop benchmarks.

Performance on Deep Search

Table: Results on the Deep Search benchmark.

Analysis

Training Stability

Figure: KL loss dynamics during training for the Qwen2.5-32B-Instruct model. The baseline agent (orange) suffers from late-stage instability, while the EMPG-enhanced agent (blue) remains stable throughout training.

Step-Level Entropy Dynamics

Figure: Average entropy change after RL fine-tuning at each entropy percentile. Unlike token-level findings, even low-entropy steps undergo significant changes, validating our step-level analysis approach.

PyTorch-Style Pseudocode

This pseudocode outlines the core logic for calculating the EMPG advantage, demonstrating how to modulate credits based on step-level entropy.


import numpy as np
import torch

def compute_empg_advantage(tokenizer, batch, k=1.0, k_f=1.0, zeta=0.1):
    """
    Args:
        tokenizer: The tokenizer for identifying response segments.
        batch: A data batch with 'responses', 'old_entropy', 'advantages'.
        k (float): Hyperparameter for self-calibrating gradient scaling.
        k_f (float): Hyperparameter for the future clarity bonus.
        zeta (float): Hyperparameter for the future clarity bonus.
    """
    # --- 1. First Pass: Collect Step-Level Entropies ---
    all_step_entropies = []
    # segments_to_modify stores {'sample_idx', 'start', 'end'} for each step
    segments_to_modify = [] 

    for i in range(batch.batch.batch_size[0]):
        # Find "assistant" segments, which correspond to agent steps.
        token_segments = process_token_sequences(
            batch.batch['responses'][i], 
            tokenizer.encode("<|im_start|>assistant\n"), 
            tokenizer.encode('<|im_end|>')
        )
        for start, end in token_segments:
            if start >= end: continue
            
            # Calculate the average token-level entropy for the step
            step_entropy = batch.batch['old_entropy'][i][start:end].mean().item()
            all_step_entropies.append(step_entropy)
            segments_to_modify.append({'sample_idx': i, 'start': start, 'end': end})

    if not all_step_entropies: return

    # --- 2. Calculate Modulated Advantage Components ---
    H = np.array(all_step_entropies)
    
    # Batch-level entropy normalization (Eq. 12) with epsilon = 1e-8
    min_H, max_H = np.min(H), np.max(H)
    H_norm = (H - min_H) / (max_H - min_H + 1e-8)

    # Self-calibrating gradient scaling g(H) (Eq. 10)
    g_H_unnormalized = np.exp(-k * H_norm)
    mean_g_H = np.mean(g_H_unnormalized)
    g_H = g_H_unnormalized / (mean_g_H + 1e-8)
    
    # Future clarity bonus f(H) (Eq. 11)
    f_H = np.exp(-k_f * H_norm)

    # Convert to tensors for PyTorch operations
    g_H = torch.tensor(g_H, device=batch.batch['advantages'].device, dtype=torch.float32)
    f_H = torch.tensor(f_H, device=batch.batch['advantages'].device, dtype=torch.float32)

    # --- 3. Second Pass: Apply Advantage Modulation (Eq. 8) ---
    step_advantages = []
    for i, segment in enumerate(segments_to_modify):
        idx, start, end = segment['sample_idx'], segment['start'], segment['end']
        
        # Apply self-calibrating gradient scaling
        batch.batch['advantages'][idx][start:end] *= g_H[i]
        
        # Add future clarity bonus if there is a next step
        next_seg = segments_to_modify[i+1] if i+1 < len(segments_to_modify) else None
        if next_seg and next_seg['sample_idx'] == idx:
            batch.batch['advantages'][idx][start:end] += zeta * f_H[i+1]
        # Record the first token's advantage as the step-level representative for centering
        step_advantages.append(batch.batch['advantages'][idx][start])
            
    # --- 4. Final Advantage Normalization (Eq. 7) ---
    if step_advantages:
        final_adv_mean = torch.mean(torch.stack(step_advantages))
        batch.batch['advantages'] -= final_adv_mean

Paper & Citation

For a detailed description of the method, experiments, and analysis, please refer to our paper.

BibTeX

@misc{wang2025harnessinguncertaintyentropymodulatedpolicy,
  title={Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents},
  author={Jiawei Wang and Jiacai Liu and Yuqian Fu and Yingru Li and Xintao Wang and Yuan Lin and Yu Yue and Lin Zhang and Yang Wang and Ke Wang},
  year={2025},
  eprint={2509.09265},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2509.09265},
}