Reinforcement Learning Algorithms: Proximal Policy Optimization (PPO)

news 2026/6/16 17:26:13

1. Technical Analysis

1.1 PPO Overview

PPO is one of the most widely used reinforcement learning algorithms:

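PPO characteristics
  Proximal policy optimization: limits the size of each policy update
  Trust-region optimization: a simplified version of TRPO
  Stable training: less prone to divergence
Core idea
  Clip the objective function
  Avoid abrupt policy changes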
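The clipping idea can be made concrete with a few numbers. The following is a minimal sketch, assuming the usual PPO-Clip form in which the probability ratio is clipped to [1 - clip_ratio, 1 + clip_ratio] and the smaller of the clipped and unclipped terms is kept; the function name clipped_surrogate is illustrative and does not come from the article.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, clip_ratio=0.2):
    """Per-sample clipped objective (a quantity to maximize).

    ratio     : new_prob / old_prob for the action that was taken
    advantage : advantage estimate for that action
    """
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * advantage
    # Taking the minimum removes any incentive to push the ratio
    # outside the clipping interval.
    return np.minimum(unclipped, clipped)

# ratio 1.5 with a positive advantage is capped at 1.2 * advantage;
# ratio 0.5 with a negative advantage is capped at 0.8 * advantage;
# ratio 1.0 is left unchanged.
objective = clipped_surrogate([1.5, 0.5, 1.0], [2.0, -1.0, 3.0])
print(objective)
```

Once the ratio leaves the interval in the direction that would increase the objective, the objective stops changing with it, which is what keeps a single update from moving the policy too far.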

1.2 Advantages of PPO

Property	PPO	TRPO
Implementation complexity	Low	High
Computational efficiency	High	Low
Stability	High	Very high

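1.3 PPO Variants

PPO variants
  PPO-Clip: clips the objective function
  PPO-Penalty: adds a KL-divergence penalty
  PPO-Adapter: adapts the KL penalty coefficient during training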
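The KL-based variants can be sketched in a few lines. This is an illustration under stated assumptions, not the article's implementation: the helper names kl_penalized_objective and update_kl_coeff are hypothetical, and the factors 1.5 and 2 are the heuristic values commonly quoted for the adaptive scheme.

```python
import numpy as np

def kl_penalized_objective(ratio, advantage, kl, beta):
    """Surrogate objective with a KL penalty instead of clipping."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    return float(np.mean(ratio * advantage) - beta * kl)

def update_kl_coeff(beta, observed_kl, target_kl=0.01):
    """Adapt the penalty coefficient after each policy update (heuristic)."""
    if observed_kl < target_kl / 1.5:
        return beta / 2.0   # policy barely moved: relax the penalty
    if observed_kl > target_kl * 1.5:
        return beta * 2.0   # policy moved too far: strengthen the penalty
    return beta

beta = 1.0
beta = update_kl_coeff(beta, observed_kl=0.05)   # too large a step, beta doubles
print(beta)
```

With a fixed beta this corresponds to the penalty variant; updating beta after every policy update corresponds to the adaptive variant.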

2. Core Implementation

2.1 The PPO Algorithm

    import numpy as np


    class PPO:
        def __init__(self, policy, value_function, optimizer, clip_ratio=0.2,
                     gamma=0.99, lambda_=0.95, epochs=10, batch_size=64):
            # policy(state) returns action probabilities;
            # value_function(state) returns a scalar value estimate.
            self.policy = policy
            self.value_function = value_function
            self.optimizer = optimizer
            self.clip_ratio = clip_ratio
            self.gamma = gamma
            self.lambda_ = lambda_
            self.epochs = epochs
            self.batch_size = batch_size

        def compute_advantages(self, rewards, values, dones):
            # Generalized advantage estimation (GAE). `values` holds one extra
            # bootstrap entry, so values[i + 1] is always valid.
            advantages = []
            running_advantage = 0
            for i in reversed(range(len(rewards))):
                not_done = 1 - dones[i]
                # One-step TD error; no bootstrapping past a terminal step.
                delta = rewards[i] + self.gamma * not_done * values[i + 1] - values[i]
                running_advantage = delta + self.gamma * self.lambda_ * not_done * running_advantage
                advantages.insert(0, running_advantage)
            advantages = np.array(advantages)
            # Normalize to zero mean and unit variance.
            advantages = (advantages - np.mean(advantages)) / (np.std(advantages) + 1e-9)
            return advantages

        def compute_returns(self, rewards, values, dones):
            # Discounted returns, reset at episode boundaries.
            returns = []
            running_return = 0
            for i in reversed(range(len(rewards))):
                if dones[i]:
                    running_return = 0
                running_return = rewards[i] + self.gamma * (1 - dones[i]) * running_return
                returns.insert(0, running_return)
            return np.array(returns)

        def train(self, env, episodes=1000):
            for episode in range(episodes):
                states = []
                actions = []
                rewards = []
                dones = []
                values = []

                state = env.reset()
                done = False

                # Collect one full episode with the current policy.
                while not done:
                    action_probs = self.policy(state)
                    action = np.random.choice(len(action_probs), p=action_probs)
                    value = self.value_function(state)

                    # Assumes env.step returns (next_state, reward, done).
                    next_state, reward, done = env.step(action)

                    states.append(state)
                    actions.append(action)
                    rewards.append(reward)
                    dones.append(done)
                    values.append(value)

                    state = next_state

                # Bootstrap value for the state after the last step.
                values.append(self.value_function(state))

                advantages = self.compute_advantages(rewards, values, dones)
                returns = self.compute_returns(rewards, values, dones)

                self._update_policy(states, actions, advantages, returns)

        def _update_policy(self, states, actions, advantages, returns):
            # Probabilities under the policy that collected the data.
            old_probs = np.array([self.policy(s)[a] for s, a in zip(states, actions)])

            for _ in range(self.epochs):
                indices = np.random.permutation(len(states))

                for i in range(0, len(states), self.batch_size):
                    batch_indices = indices[i:i+self.batch_size]
                    batch_states = np.array([states[j] for j in batch_indices])
                    batch_actions = np.array([actions[j] for j in batch_indices])
                    batch_advantages = advantages[batch_indices]
                    batch_returns = returns[batch_indices]
                    batch_old_probs = old_probs[batch_indices]

                    new_probs = np.array([self.policy(s)[a] for s, a in zip(batch_states, batch_actions)])
                    ratio = new_probs / (batch_old_probs + 1e-9)

                    # Clipped surrogate objective, negated to form a loss.
                    clip_adv = np.clip(ratio, 1 - self.clip_ratio, 1 + self.clip_ratio) * batch_advantages
                    policy_loss = -np.mean(np.minimum(ratio * batch_advantages, clip_adv))

                    new_values = np.array([self.value_function(s) for s in batch_states])
                    value_loss = np.mean((batch_returns - new_values) ** 2)

                    loss = policy_loss + 0.5 * value_loss
                    # NumPy has no automatic differentiation; the optimizer is
                    # assumed to obtain gradients of this loss by its own means.
                    self.optimizer.step(loss)
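The advantage step is the part of this loop that is easiest to get subtly wrong, so a standalone reference check can help. The sketch below assumes the standard generalized advantage estimation (GAE) recursion, delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) and A_t = delta_t + gamma * lambda * A_{t+1}, with bootstrapping switched off at terminal steps. The function name gae and the toy numbers are illustrative only.

```python
import numpy as np

def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Unnormalized GAE advantages.

    `values` has len(rewards) + 1 entries; the last one is the bootstrap value.
    """
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * not_done * values[t + 1] - values[t]
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
    return advantages

rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.4, 0.3, 0.0]
dones = [False, False, True]

# With lam = 1 the estimator reduces to (discounted return - value baseline).
adv_mc = gae(rewards, values, dones, gamma=0.9, lam=1.0)
returns = np.array([1 + 0.9 + 0.81, 1 + 0.9, 1.0])
print(np.allclose(adv_mc, returns - np.array(values[:-1])))

# With lam = 0 it reduces to the one-step TD error.
adv_td = gae(rewards, values, dones, gamma=0.9, lam=0.0)
print(adv_td)
```

The two limiting cases bracket the bias-variance trade-off that lambda_ controls: values near 1 lean on sampled returns, values near 0 lean on the value function.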

2.2 PPO Policy Network

    class PPOPolicyNetwork:
        def __init__(self, state_dim, action_dim, hidden_dim=64):
            self.state_dim = state_dim
            self.action_dim = action_dim

            # Two hidden layers with small random initial weights.
            self.W1 = np.random.randn(state_dim, hidden_dim) * 0.01
            self.b1 = np.zeros(hidden_dim)
            self.W2 = np.random.randn(hidden_dim, hidden_dim) * 0.01
            self.b2 = np.zeros(hidden_dim)
            self.W3 = np.random.randn(hidden_dim, action_dim) * 0.01
            self.b3 = np.zeros(action_dim)

        def forward(self, state):
            h1 = np.maximum(0, state @ self.W1 + self.b1)  # ReLU
            h2 = np.maximum(0, h1 @ self.W2 + self.b2)     # ReLU
            logits = h2 @ self.W3 + self.b3

            # Softmax; subtracting the max logit avoids overflow in exp.
            exp_logits = np.exp(logits - np.max(logits))
            probs = exp_logits / np.sum(exp_logits)
            return probs

        def __call__(self, state):
            # Lets the PPO class invoke the network as policy(state).
            return self.forward(state)


    class PPOValueNetwork:
        def __init__(self, state_dim, hidden_dim=64):
            self.state_dim = state_dim

            self.W1 = np.random.randn(state_dim, hidden_dim) * 0.01
            self.b1 = np.zeros(hidden_dim)
            self.W2 = np.random.randn(hidden_dim, hidden_dim) * 0.01
            self.b2 = np.zeros(hidden_dim)
            self.W3 = np.random.randn(hidden_dim, 1) * 0.01
            self.b3 = np.zeros(1)

        def forward(self, state):
            h1 = np.maximum(0, state @ self.W1 + self.b1)  # ReLU
            h2 = np.maximum(0, h1 @ self.W2 + self.b2)     # ReLU
            value = h2 @ self.W3 + self.b3
            # Return a scalar for a single (1-D) state.
            return value[0]

        def __call__(self, state):
            # Lets the PPO class invoke the network as value_function(state).
            return self.forward(state)
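The policy's forward pass subtracts the largest logit before exponentiating. A small standalone sketch shows why that matters and how an action is then sampled. The softmax helper and the extreme logits are illustrative, and the seeded generator is an assumption made only to keep the example reproducible.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.asarray(logits, dtype=float)
    z = z - np.max(z)          # largest entry becomes 0, so exp cannot overflow
    e = np.exp(z)
    return e / np.sum(e)

# np.exp(1000.0) alone overflows to inf; the shifted version stays finite.
probs = softmax([1000.0, 1001.0, 1002.0])
print(probs, probs.sum())

# Sample a discrete action from the resulting distribution.
rng = np.random.default_rng(0)
action = rng.choice(len(probs), p=probs)
print(action)
```

Shifting all logits by the same constant leaves the probabilities unchanged, so the stabilized form gives the same policy as the naive one wherever the naive one is computable.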

2.3 PPO with Continuous Actions

    class PPOContinuousPolicy:
        def __init__(self, state_dim, action_dim, hidden_dim=64):
            self.state_dim = state_dim
            self.action_dim = action_dim

            self.W1 = np.random.randn(state_dim, hidden_dim) * 0.01
            self.b1 = np.zeros(hidden_dim)
            # Separate output heads for the mean and the log standard deviation.
            self.W_mu = np.random.randn(hidden_dim, action_dim) * 0.01
            self.b_mu = np.zeros(action_dim)
            self.W_log_std = np.random.randn(hidden_dim, action_dim) * 0.01
            self.b_log_std = np.zeros(action_dim)

        def forward(self, state):
            h = np.maximum(0, state @ self.W1 + self.b1)  # ReLU
            mu = h @ self.W_mu + self.b_mu
            log_std = h @ self.W_log_std + self.b_log_std
            std = np.exp(log_std)  # exponentiate so that std is always positive
            return mu, std

        def sample_action(self, state):
            mu, std = self.forward(state)
            # Draw from a diagonal Gaussian centered at mu.
            action = mu + np.random.normal(0, std, size=mu.shape)
            log_prob = self._compute_log_prob(action, mu, std)
            return action, log_prob

        def _compute_log_prob(self, action, mu, std):
            # Log-density of a diagonal Gaussian, summed over action dimensions.
            var = std ** 2
            log_prob = -0.5 * np.sum(np.log(2 * np.pi * var) + (action - mu) ** 2 / var)
            return log_prob
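With a Gaussian policy the probability ratio used by PPO is usually formed from log-probabilities rather than from densities directly. The sketch below assumes that convention; gaussian_log_prob repeats the diagonal-Gaussian formula from the listing as a free function, and the action and policy parameters are made-up numbers.

```python
import numpy as np

def gaussian_log_prob(action, mu, std):
    """Log-density of a diagonal Gaussian, summed over action dimensions."""
    var = std ** 2
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (action - mu) ** 2 / var)

action = np.array([0.3, -0.1])

# Hypothetical policy parameters before and after an update.
old_mu, old_std = np.array([0.0, 0.0]), np.array([1.0, 1.0])
new_mu, new_std = np.array([0.1, 0.0]), np.array([1.0, 1.0])

old_logp = gaussian_log_prob(action, old_mu, old_std)
new_logp = gaussian_log_prob(action, new_mu, new_std)

# ratio = pi_new(a|s) / pi_old(a|s), computed in log space for stability.
ratio = np.exp(new_logp - old_logp)
print(ratio)   # slightly above 1: the new mean moved toward the action
```

Working in log space avoids dividing two very small densities, which becomes a practical concern as the number of action dimensions grows.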

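3. Performance Comparison

3.1 Comparison of PPO Variants

Variant	Stability	Performance	Complexity
PPO-Clip	High	High	Low
PPO-Penalty	Very high	High	Medium
PPO-Adapter	Very high	Very high	High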

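3.2 PPO vs. Other Algorithms

Algorithm	Sample efficiency	Stability	Typical use
PPO	High	High	General-purpose
DQN	Medium	Medium	Discrete actions
DDPG	Medium	Medium	Continuous actions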

3.3 Effect of PPO Hyperparameters

Parameter	Default	Effect
clip_ratio	0.2	Size of the policy update
gamma	0.99	Discount factor
lambda_	0.95	GAE weighting
epochs	10	Number of update epochs
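A little arithmetic makes the default values more tangible. The quantities below are rough rules of thumb rather than exact properties of the algorithm, and the variable names are illustrative.

```python
gamma, lambda_, clip_ratio = 0.99, 0.95, 0.2

# Rewards k steps ahead are weighted by gamma**k, so the weights sum to
# 1 / (1 - gamma): an effective horizon of roughly 100 steps.
effective_horizon = 1.0 / (1.0 - gamma)

# GAE weights the k-th future TD error by (gamma * lambda_)**k, which gives
# a shorter effective window of roughly 17 steps.
gae_horizon = 1.0 / (1.0 - gamma * lambda_)

# The probability ratio only influences the objective inside this interval.
clip_range = (1.0 - clip_ratio, 1.0 + clip_ratio)

print(effective_horizon, gae_horizon, clip_range)
```

Raising gamma or lambda_ lengthens these windows, which generally increases the variance of the advantage estimates. Widening clip_ratio permits larger policy changes per update.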

4. Best Practices

4.1 PPO Configuration

    def configure_ppo(task_type):
        # Default hyperparameters per task type.
        configs = {
            'discrete': {
                'clip_ratio': 0.2,
                'gamma': 0.99,
                'lambda_': 0.95,
                'epochs': 10
            },
            'continuous': {
                'clip_ratio': 0.2,
                'gamma': 0.99,
                'lambda_': 0.95,
                'epochs': 10,
                'target_kl': 0.01
            }
        }
        # Unknown task types fall back to the discrete configuration.
        return configs.get(task_type, configs['discrete'])


    class PPOConfigGenerator:
        @staticmethod
        def from_task(task_type):
            return configure_ppo(task_type)
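The continuous configuration includes target_kl, but the listings do not show how it is consumed. One common convention, assumed here, is to estimate the KL divergence between the old and new policies after each update epoch and to stop the remaining epochs once it exceeds a multiple of target_kl. The helper names and the tolerance factor of 1.5 are assumptions, not part of the article's code.

```python
import numpy as np

def approx_kl(old_log_probs, new_log_probs):
    """Sample-based estimate of KL(old || new) from log-probabilities of taken actions."""
    old_log_probs = np.asarray(old_log_probs, dtype=float)
    new_log_probs = np.asarray(new_log_probs, dtype=float)
    return float(np.mean(old_log_probs - new_log_probs))

def should_stop_early(old_log_probs, new_log_probs, target_kl=0.01, tolerance=1.5):
    """True when the policy has drifted further than tolerance * target_kl."""
    return approx_kl(old_log_probs, new_log_probs) > tolerance * target_kl

old_lp = np.log([0.5, 0.5])

# Unchanged policy: estimated KL is 0, so the update epochs continue.
print(should_stop_early(old_lp, old_lp))

# The taken actions became noticeably less likely: stop the remaining epochs.
new_lp = np.log([0.4, 0.45])
print(should_stop_early(old_lp, new_lp))
```

In a training loop this check would sit at the end of each pass over the batch, breaking out of the epoch loop when it returns True.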

4.2 Training Tips

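    class PPOTrainingTips:
        @staticmethod
        def gae_advantage():
            # Use GAE with lambda_ = 0.95 for advantage estimation.
            return {'lambda_': 0.95}

        @staticmethod
        def entropy_bonus(coeff=0.01):
            # Entropy coefficient that encourages exploration.
            return {'entropy_coeff': coeff}

        @staticmethod
        def gradient_clipping(max_norm=0.5):
            # Upper bound on the gradient norm.
            return {'max_norm': max_norm}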
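The tips above only return settings. The sketch below shows one plausible way the entropy coefficient and the gradient-norm limit could be applied, assuming a categorical policy and gradients held as a list of NumPy arrays. The helper names and the example loss values are hypothetical.

```python
import numpy as np

def categorical_entropy(probs, eps=1e-9):
    """Entropy of a categorical distribution; higher means more exploration."""
    probs = np.asarray(probs, dtype=float)
    return -np.sum(probs * np.log(probs + eps), axis=-1)

def clip_grad_norm(grads, max_norm=0.5):
    """Rescale a list of gradient arrays so their global L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

# Entropy bonus: subtracting it from the loss rewards less deterministic policies.
entropy_coeff = 0.01
policy_loss, value_loss = 0.2, 0.4          # made-up numbers for illustration
entropy = categorical_entropy([0.25, 0.25, 0.25, 0.25])
loss = policy_loss + 0.5 * value_loss - entropy_coeff * entropy

# Gradient clipping: a gradient of norm 5 is scaled down to norm 0.5.
clipped, norm_before = clip_grad_norm([np.array([3.0, 4.0])], max_norm=0.5)
print(norm_before, np.linalg.norm(clipped[0]))
```

Gradients whose global norm is already below max_norm pass through unchanged, so the clip only acts on unusually large updates.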