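Reinforcement Learning Algorithms: Proximal Policy Optimization (PPO)
1. Technical Analysis
1.1 PPO Overview
PPO is one of the most widely used reinforcement learning algorithms today: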
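Key characteristics of PPO:
- Proximal policy optimization: limits the size of each policy update
- Trust-region optimization: a simplified alternative to TRPO
- Stable training: less prone to divergence

Core idea:
- Clip the objective function
- Avoid abrupt changes in the policy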
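The clipping idea can be illustrated with a minimal sketch. The function name `clipped_surrogate` and the sample ratios are illustrative assumptions, not part of the article's implementation; the formula itself is the standard PPO-Clip objective, min(r * A, clip(r, 1 - eps, 1 + eps) * A), where r is the probability ratio between the new and old policy and A is the advantage.

```python
import numpy as np


def clipped_surrogate(ratio, advantage, clip_ratio=0.2):
    """Per-sample PPO-Clip objective (hypothetical helper for illustration)."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    clipped = np.clip(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio)
    # The element-wise minimum makes the objective a pessimistic bound.
    return np.minimum(ratio * advantage, clipped * advantage)


# Positive advantage: the objective stops growing once the ratio passes 1 + eps.
positive = clipped_surrogate([0.5, 1.0, 1.5], [1.0, 1.0, 1.0])
# Negative advantage: the objective is floored once the ratio drops below 1 - eps.
negative = clipped_surrogate([0.5, 1.0, 1.5], [-1.0, -1.0, -1.0])
print(positive, negative)
```

With `clip_ratio=0.2`, a ratio of 1.5 on a positive advantage is credited only as 1.2, so the optimizer gains nothing from pushing the policy further in that direction.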
1.2 Advantages of PPO

| Property | PPO | TRPO |
|---|---|---|
| Implementation complexity | Low | High |
| Computational efficiency | High | Low |
| Stability | High | Very high |

1.3 PPO Variants
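PPO variants:
- PPO-Clip: clips the objective function
- PPO-Penalty: adds a KL-divergence penalty to the objective
- PPO-Adapter: adjusts the KL penalty coefficient adaptively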
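The adaptive KL idea can be sketched as follows. The function names and the default `target_kl=0.01` are assumptions made for illustration; the halve-or-double rule with a factor of 1.5 follows the heuristic described in the original PPO paper, and other schedules are possible.

```python
import numpy as np


def kl_penalized_objective(ratio, advantage, kl, beta):
    """Surrogate objective with a KL penalty: mean(r * A) - beta * KL."""
    return float(np.mean(np.asarray(ratio) * np.asarray(advantage)) - beta * kl)


def update_kl_coeff(beta, kl, target_kl=0.01):
    """Adapt the penalty coefficient after a policy update (hypothetical helper)."""
    if kl < target_kl / 1.5:
        return beta / 2.0   # policy moved too little: relax the penalty
    if kl > target_kl * 1.5:
        return beta * 2.0   # policy moved too far: strengthen the penalty
    return beta


beta = 1.0
beta_after_small_step = update_kl_coeff(beta, kl=0.001)
beta_after_large_step = update_kl_coeff(beta, kl=0.05)
print(beta_after_small_step, beta_after_large_step)
```

Compared with clipping, this variant has one more quantity to track (the coefficient `beta`), which is one reason the clipped form is usually simpler to implement.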
2. Core Implementation
2.1 The PPO Algorithm

    import numpy as np


    class PPO:
        """PPO-Clip trainer for a discrete-action policy.

        `policy(state)` must return a vector of action probabilities and
        `value_function(state)` a scalar value estimate.
        """

        def __init__(self, policy, value_function, optimizer, clip_ratio=0.2,
                     gamma=0.99, lambda_=0.95, epochs=10, batch_size=64):
            self.policy = policy
            self.value_function = value_function
            self.optimizer = optimizer
            self.clip_ratio = clip_ratio
            self.gamma = gamma
            self.lambda_ = lambda_
            self.epochs = epochs
            self.batch_size = batch_size

        def compute_advantages(self, rewards, values, dones):
            # Generalized Advantage Estimation (GAE). `values` holds one extra
            # bootstrap entry for the state after the last step.
            advantages = np.zeros(len(rewards))
            running_advantage = 0.0
            for i in reversed(range(len(rewards))):
                not_done = 1.0 - float(dones[i])
                # One-step TD error; a terminal step does not bootstrap.
                delta = rewards[i] + self.gamma * not_done * values[i + 1] - values[i]
                running_advantage = delta + self.gamma * self.lambda_ * not_done * running_advantage
                advantages[i] = running_advantage
            # Normalize to zero mean and unit variance.
            advantages = (advantages - np.mean(advantages)) / (np.std(advantages) + 1e-9)
            return advantages

        def compute_returns(self, rewards, values, dones):
            # Discounted Monte Carlo returns, used as value-function targets.
            returns = np.zeros(len(rewards))
            running_return = 0.0
            for i in reversed(range(len(rewards))):
                # The (1 - done) factor resets the return at episode boundaries.
                running_return = rewards[i] + self.gamma * (1.0 - float(dones[i])) * running_return
                returns[i] = running_return
            return returns

        def train(self, env, episodes=1000):
            # Assumes a simplified environment API: reset() returns a state and
            # step(action) returns (next_state, reward, done).
            for episode in range(episodes):
                states, actions, rewards, dones, values = [], [], [], [], []
                state = env.reset()
                done = False
                while not done:
                    action_probs = self.policy(state)
                    action = np.random.choice(len(action_probs), p=action_probs)
                    value = self.value_function(state)
                    next_state, reward, done = env.step(action)
                    states.append(state)
                    actions.append(action)
                    rewards.append(reward)
                    dones.append(done)
                    values.append(value)
                    state = next_state
                # Bootstrap value for the state after the final step.
                values.append(self.value_function(state))
                advantages = self.compute_advantages(rewards, values, dones)
                returns = self.compute_returns(rewards, values, dones)
                self._update_policy(states, actions, advantages, returns)

        def _update_policy(self, states, actions, advantages, returns):
            # Probabilities of the taken actions under the pre-update policy.
            old_probs = np.array([self.policy(s)[a] for s, a in zip(states, actions)])
            for _ in range(self.epochs):
                indices = np.random.permutation(len(states))
                for i in range(0, len(states), self.batch_size):
                    batch_indices = indices[i:i + self.batch_size]
                    batch_states = np.array([states[j] for j in batch_indices])
                    batch_actions = np.array([actions[j] for j in batch_indices])
                    batch_advantages = advantages[batch_indices]
                    batch_returns = returns[batch_indices]
                    batch_old_probs = old_probs[batch_indices]
                    new_probs = np.array([self.policy(s)[a]
                                          for s, a in zip(batch_states, batch_actions)])
                    # Probability ratio between the updated and the old policy.
                    ratio = new_probs / (batch_old_probs + 1e-9)
                    clip_adv = np.clip(ratio, 1 - self.clip_ratio,
                                       1 + self.clip_ratio) * batch_advantages
                    policy_loss = -np.mean(np.minimum(ratio * batch_advantages, clip_adv))
                    new_values = np.array([self.value_function(s) for s in batch_states])
                    value_loss = np.mean((batch_returns - new_values) ** 2)
                    loss = policy_loss + 0.5 * value_loss
                    # NumPy has no automatic differentiation, so `optimizer.step`
                    # is assumed to compute and apply the parameter update itself.
                    self.optimizer.step(loss)
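As a check on the advantage computation, the standard GAE recursion can be written as a standalone function and run on a tiny trajectory. This is a sketch under stated assumptions: the function name `gae` and the three-step trajectory are made up for illustration, and normalization is left out so the raw numbers stay readable.

```python
import numpy as np


def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Unnormalized GAE; `values` has one more entry than `rewards`."""
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - float(dones[t])
        # TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * not_done * values[t + 1] - values[t]
        # A_t = delta_t + gamma * lambda * A_{t+1}
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]   # last entry is the bootstrap value
dones = [False, False, True]

# lam=1 recovers Monte Carlo return minus value: [2.5, 1.5, 0.5]
adv_mc = gae(rewards, values, dones, gamma=1.0, lam=1.0)
# lam=0 recovers the one-step TD errors: [1.0, 1.0, 0.5]
adv_td = gae(rewards, values, dones, gamma=1.0, lam=0.0)
print(adv_mc, adv_td)
```

The two extremes show what `lambda_` interpolates between: low-bias, high-variance Monte Carlo estimates at 1 and high-bias, low-variance one-step estimates at 0.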
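2.2 PPO Policy and Value Networks

    import numpy as np


    class PPOPolicyNetwork:
        """Two-hidden-layer MLP that maps a state to action probabilities."""

        def __init__(self, state_dim, action_dim, hidden_dim=64):
            self.state_dim = state_dim
            self.action_dim = action_dim
            # Small random weights keep the initial policy close to uniform.
            self.W1 = np.random.randn(state_dim, hidden_dim) * 0.01
            self.b1 = np.zeros(hidden_dim)
            self.W2 = np.random.randn(hidden_dim, hidden_dim) * 0.01
            self.b2 = np.zeros(hidden_dim)
            self.W3 = np.random.randn(hidden_dim, action_dim) * 0.01
            self.b3 = np.zeros(action_dim)

        def forward(self, state):
            h1 = np.maximum(0, state @ self.W1 + self.b1)   # ReLU
            h2 = np.maximum(0, h1 @ self.W2 + self.b2)      # ReLU
            logits = h2 @ self.W3 + self.b3
            # Subtract the max logit for a numerically stable softmax.
            exp_logits = np.exp(logits - np.max(logits))
            probs = exp_logits / np.sum(exp_logits)
            return probs

        def __call__(self, state):
            # Lets the network be passed to PPO as a callable `policy`.
            return self.forward(state)


    class PPOValueNetwork:
        """Two-hidden-layer MLP that maps a state to a scalar value estimate."""

        def __init__(self, state_dim, hidden_dim=64):
            self.state_dim = state_dim
            self.W1 = np.random.randn(state_dim, hidden_dim) * 0.01
            self.b1 = np.zeros(hidden_dim)
            self.W2 = np.random.randn(hidden_dim, hidden_dim) * 0.01
            self.b2 = np.zeros(hidden_dim)
            self.W3 = np.random.randn(hidden_dim, 1) * 0.01
            self.b3 = np.zeros(1)

        def forward(self, state):
            h1 = np.maximum(0, state @ self.W1 + self.b1)
            h2 = np.maximum(0, h1 @ self.W2 + self.b2)
            value = h2 @ self.W3 + self.b3
            # Return a scalar for a single (1-D) state vector.
            return value[0]

        def __call__(self, state):
            # Lets the network be passed to PPO as a callable `value_function`.
            return self.forward(state)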
2.3 PPO for Continuous Actions

    import numpy as np


    class PPOContinuousPolicy:
        """Diagonal Gaussian policy with a state-dependent mean and std."""

        def __init__(self, state_dim, action_dim, hidden_dim=64):
            self.state_dim = state_dim
            self.action_dim = action_dim
            self.W1 = np.random.randn(state_dim, hidden_dim) * 0.01
            self.b1 = np.zeros(hidden_dim)
            self.W_mu = np.random.randn(hidden_dim, action_dim) * 0.01
            self.b_mu = np.zeros(action_dim)
            self.W_log_std = np.random.randn(hidden_dim, action_dim) * 0.01
            self.b_log_std = np.zeros(action_dim)

        def forward(self, state):
            h = np.maximum(0, state @ self.W1 + self.b1)   # ReLU
            mu = h @ self.W_mu + self.b_mu
            # Predict log(std) so that std = exp(log_std) is always positive.
            log_std = h @ self.W_log_std + self.b_log_std
            std = np.exp(log_std)
            return mu, std

        def sample_action(self, state):
            mu, std = self.forward(state)
            action = mu + np.random.normal(0, std, size=mu.shape)
            log_prob = self._compute_log_prob(action, mu, std)
            return action, log_prob

        def _compute_log_prob(self, action, mu, std):
            # Log-density of a diagonal Gaussian, summed over action dimensions.
            var = std ** 2
            log_prob = -0.5 * np.sum(np.log(2 * np.pi * var) + (action - mu) ** 2 / var)
            return log_prob
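For continuous actions the probability ratio is usually formed from log-probabilities, since densities can be very small or very large. The following is a minimal sketch of that step; the helper names and the example means are assumptions for illustration, and the log-density formula is the same diagonal Gaussian expression used in `_compute_log_prob`.

```python
import numpy as np


def gaussian_log_prob(action, mu, std):
    """Log-density of a diagonal Gaussian, summed over action dimensions."""
    var = std ** 2
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (action - mu) ** 2 / var)


def probability_ratio(new_log_prob, old_log_prob):
    """pi_new(a|s) / pi_old(a|s), computed in log space for stability."""
    return np.exp(new_log_prob - old_log_prob)


action = np.array([0.3, -0.1])
std = np.ones(2)
old_lp = gaussian_log_prob(action, mu=np.zeros(2), std=std)
# After a hypothetical update the mean moves toward the sampled action.
new_lp = gaussian_log_prob(action, mu=np.array([0.1, 0.0]), std=std)

ratio = probability_ratio(new_lp, old_lp)
print(ratio)  # slightly above 1: the action became more likely
```

The resulting ratio plugs into the same clipped objective used for discrete actions, so only the policy's log-probability computation differs between the two cases.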
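3. Performance Comparison
3.1 Comparison of PPO Variants

| Variant | Stability | Performance | Complexity |
|---|---|---|---|
| PPO-Clip | High | High | Low |
| PPO-Penalty | Very high | High | Medium |
| PPO-Adapter | Very high | Very high | High |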
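3.2 PPO vs. Other Algorithms

| Algorithm | Sample efficiency | Stability | Typical use case |
|---|---|---|---|
| PPO | Medium (on-policy) | High | General purpose |
| DQN | Medium to high (off-policy, replay buffer) | Medium | Discrete actions |
| DDPG | Medium to high (off-policy, replay buffer) | Medium | Continuous actions |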
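3.3 Effect of PPO Hyperparameters

| Parameter | Default | Effect |
|---|---|---|
| clip_ratio | 0.2 | Bounds how far the policy can move in one update |
| gamma | 0.99 | Discount factor for future rewards |
| lambda_ | 0.95 | GAE weight (bias-variance trade-off) |
| epochs | 10 | Number of optimization passes over each batch of collected data |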
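To make the `gamma` and `lambda_` rows more concrete, a rough rule of thumb is that geometric weights with decay d sum to 1 / (1 - d), which can be read as an effective horizon in steps. The helper below is a hypothetical illustration of that arithmetic, not a tuning recipe.

```python
def effective_horizon(decay):
    """Sum of the geometric weights 1 + d + d**2 + ... = 1 / (1 - d)."""
    return 1.0 / (1.0 - decay)


gamma, lambda_ = 0.99, 0.95

# Rewards are discounted by gamma**k: roughly a 100-step horizon.
reward_horizon = effective_horizon(gamma)
# GAE weights TD errors by (gamma * lambda_)**k: roughly a 17-step horizon.
gae_horizon = effective_horizon(gamma * lambda_)
print(reward_horizon, gae_horizon)
```

With the defaults, the advantage estimate therefore leans mostly on the next couple of dozen TD errors, even though rewards themselves are discounted over a much longer window.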
4. Best Practices
4.1 PPO Configuration

    def configure_ppo(task_type):
        """Return default PPO hyperparameters for a task type."""
        configs = {
            'discrete': {
                'clip_ratio': 0.2,
                'gamma': 0.99,
                'lambda_': 0.95,
                'epochs': 10
            },
            'continuous': {
                'clip_ratio': 0.2,
                'gamma': 0.99,
                'lambda_': 0.95,
                'epochs': 10,
                'target_kl': 0.01
            }
        }
        # Fall back to the discrete defaults for unknown task types.
        return configs.get(task_type, configs['discrete'])


    class PPOConfigGenerator:
        @staticmethod
        def from_task(task_type):
            return configure_ppo(task_type)
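The `target_kl` entry is typically used to stop the update epochs early once the new policy has drifted too far from the one that collected the data. A minimal sketch of that check follows; the function names and the `tolerance=1.5` factor are assumptions for illustration (a commonly seen convention), and the KL value is a sample estimate from the log-probabilities of the actions actually taken.

```python
import numpy as np


def approx_kl(old_log_probs, new_log_probs):
    """Sample estimate of KL(old || new) from log-probs of the taken actions."""
    return float(np.mean(np.asarray(old_log_probs) - np.asarray(new_log_probs)))


def should_stop_early(old_log_probs, new_log_probs, target_kl=0.01, tolerance=1.5):
    """True once the estimated KL exceeds tolerance * target_kl (hypothetical rule)."""
    return approx_kl(old_log_probs, new_log_probs) > tolerance * target_kl


old_lp = np.log([0.5, 0.5, 0.5])
unchanged = should_stop_early(old_lp, np.log([0.5, 0.5, 0.5]))
drifted = should_stop_early(old_lp, np.log([0.4, 0.4, 0.4]))
print(unchanged, drifted)
```

In a training loop this check would sit inside the epoch loop, breaking out of the remaining passes over the batch when it returns True.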
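4.2 Training Tips

    class PPOTrainingTips:
        """Commonly used settings for stabilizing PPO training."""

        @staticmethod
        def gae_advantage():
            # GAE weight for advantage estimation.
            return {'lambda_': 0.95}

        @staticmethod
        def entropy_bonus(coeff=0.01):
            # Coefficient of the entropy term that encourages exploration.
            return {'entropy_coeff': coeff}

        @staticmethod
        def gradient_clipping(max_norm=0.5):
            # Upper bound on the global gradient norm.
            return {'max_norm': max_norm}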
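How the entropy coefficient and `max_norm` might be applied can be sketched as below. The helper names are hypothetical, the value-loss weight of 0.5 mirrors the loss used in section 2.1, and gradients are assumed to be available as a list of NumPy arrays from whatever framework computes them.

```python
import numpy as np


def categorical_entropy(probs, eps=1e-9):
    """Entropy of each row of action probabilities."""
    probs = np.asarray(probs, dtype=float)
    return -np.sum(probs * np.log(probs + eps), axis=-1)


def loss_with_entropy_bonus(policy_loss, value_loss, probs,
                            entropy_coeff=0.01, value_coeff=0.5):
    """Subtracting the entropy term rewards policies that keep exploring."""
    entropy = np.mean(categorical_entropy(probs))
    return policy_loss + value_coeff * value_loss - entropy_coeff * entropy


def clip_grad_norm(grads, max_norm=0.5):
    """Rescale all gradients so their global L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm


uniform_entropy = categorical_entropy([0.5, 0.5])          # close to log(2)
clipped, norm_before = clip_grad_norm([np.array([3.0, 4.0])])
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(uniform_entropy, norm_before, norm_after)
```

Clipping by the global norm keeps the direction of the update and only shrinks its length, which is why it pairs well with the already conservative clipped objective.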
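5. Summary
PPO is one of the most practical reinforcement learning algorithms in use today:
- Clipped objective: limits the size of each policy update
- GAE: estimates the advantage function
- Stable training: less prone to divergence
- Broad applicability: supports both discrete and continuous actions

Key takeaways from the comparisons above:
- PPO-Clip is the most commonly used variant
- clip_ratio=0.2 is the standard choice
- GAE with lambda=0.95 reduces the variance of advantage estimates, which helps sample efficiency
- PPO is a recommended default reinforcement learning algorithm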