当前位置：首页 > news >正文

用Python和TensorFlow训练AI玩贪吃蛇：从游戏逻辑到DQN算法实战（附完整代码）

news 2026/6/11 2:58:47

用Python和TensorFlow训练AI玩贪吃蛇：从游戏逻辑到DQN算法实战

贪吃蛇这个经典游戏，几乎每个人都玩过。但你是否想过，让AI来玩这个游戏会是什么样子？本文将带你从零开始，用Python和TensorFlow构建一个能够自主玩贪吃蛇的AI系统。不同于简单的规则式AI，我们将使用深度强化学习中的DQN算法，让AI真正"学会"如何玩这个游戏。

1. 项目准备与环境搭建

在开始编码之前，我们需要准备好开发环境。这个项目需要以下几个主要组件：

Python 3.7或更高版本
Pygame库（用于游戏界面）
TensorFlow 2.x（用于构建和训练神经网络）
NumPy（用于数值计算）

安装这些依赖非常简单，只需在命令行中执行以下命令：

pip install pygame tensorflow numpy

对于硬件要求，虽然可以在CPU上运行，但如果有NVIDIA显卡并安装了CUDA，训练速度会显著提升。建议至少4GB内存，因为神经网络训练过程会比较消耗资源。

项目目录结构建议如下：

/snake_ai /game __init__.py snake.py # 游戏逻辑 render.py # 游戏渲染 /rl __init__.py dqn.py # DQN算法实现 memory.py # 经验回放缓冲区 config.py # 配置文件 train.py # 训练脚本 play.py # 人类游玩脚本

2. 贪吃蛇游戏逻辑实现

首先我们需要构建贪吃蛇游戏的基本框架。使用Pygame可以方便地创建游戏窗口和处理用户输入。

2.1 游戏核心类设计

我们创建三个主要类：Snake、Food和Game。下面是Snake类的核心代码：

class Snake: def __init__(self, block_size=20, width=800, height=600): self.length = 3 self.positions = [(width // 2, height // 2)] self.direction = random.choice([(0, 1), (0, -1), (1, 0), (-1, 0)]) self.block_size = block_size self.width = width self.height = height self.color = (0, 255, 0) # 绿色 def get_head_position(self): return self.positions[0] def turn(self, new_direction): # 防止180度转弯 if (new_direction[0] * -1, new_direction[1] * -1) != self.direction: self.direction = new_direction def move(self): head = self.get_head_position() x, y = self.direction new_x = (head[0] + (x * self.block_size)) % self.width new_y = (head[1] + (y * self.block_size)) % self.height new_position = (new_x, new_y) self.positions.insert(0, new_position) if len(self.positions) > self.length: self.positions.pop() def reset(self): self.length = 3 self.positions = [(self.width // 2, self.height // 2)] self.direction = random.choice([(0, 1), (0, -1), (1, 0), (-1, 0)]) def draw(self, surface): for p in self.positions: rect = pygame.Rect((p[0], p[1]), (self.block_size, self.block_size)) pygame.draw.rect(surface, self.color, rect) pygame.draw.rect(surface, (0, 0, 0), rect, 1)

2.2 游戏主循环

游戏主循环负责处理输入、更新游戏状态和渲染画面：

class Game: def __init__(self, width=800, height=600, block_size=20): pygame.init() self.screen = pygame.display.set_mode((width, height)) self.clock = pygame.time.Clock() self.snake = Snake(block_size, width, height) self.food = Food(block_size, width, height) self.width = width self.height = height self.block_size = block_size self.score = 0 def run(self): running = True while running: for event in pygame.event.get(): if event.type == pygame.QUIT: running = False elif event.type == pygame.KEYDOWN: if event.key == pygame.K_UP: self.snake.turn((0, -1)) elif event.key == pygame.K_DOWN: self.snake.turn((0, 1)) elif event.key == pygame.K_LEFT: self.snake.turn((-1, 0)) elif event.key == pygame.K_RIGHT: self.snake.turn((1, 0)) self.snake.move() # 检测是否吃到食物 if self.snake.get_head_position() == self.food.position: self.snake.length += 1 self.score += 1 self.food = Food(self.block_size, self.width, self.height) # 检测碰撞 if self.snake.get_head_position() in self.snake.positions[1:]: print(f"Game Over! Score: {self.score}") self.snake.reset() self.score = 0 # 渲染 self.screen.fill((255, 255, 255)) self.snake.draw(self.screen) self.food.draw(self.screen) pygame.display.update() self.clock.tick(10) # 控制游戏速度 pygame.quit()

3. DQN算法原理与实现

深度Q网络(DQN)是强化学习中的一种重要算法，它结合了Q-learning和深度神经网络的优点。

3.1 DQN核心概念

DQN的核心思想是使用神经网络来近似Q函数，即状态-动作值函数。Q函数表示在某个状态下采取某个动作所能获得的预期回报。

DQN有几个关键组件：

经验回放(Experience Replay)：存储智能体的经验(状态,动作,奖励,新状态)在记忆库中，训练时从中随机采样，打破数据间的相关性。
目标网络(Target Network)：使用一个独立的网络来计算目标Q值，提高训练稳定性。
ε-贪婪策略(ε-Greedy Policy)：在探索和利用之间取得平衡，开始时更多探索，逐渐增加利用。

3.2 DQN实现代码

下面是DQN的核心实现：

import numpy as np import tensorflow as tf from collections import deque import random class DQNAgent: def __init__(self, state_size, action_size): self.state_size = state_size self.action_size = action_size self.memory = deque(maxlen=2000) self.gamma = 0.95 # 折扣因子 self.epsilon = 1.0 # 探索率 self.epsilon_min = 0.01 self.epsilon_decay = 0.995 self.learning_rate = 0.001 self.model = self._build_model() self.target_model = self._build_model() self.update_target_model() def _build_model(self): model = tf.keras.Sequential() model.add(tf.keras.layers.Dense(24, input_dim=self.state_size, activation='relu')) model.add(tf.keras.layers.Dense(24, activation='relu')) model.add(tf.keras.layers.Dense(self.action_size, activation='linear')) model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam(lr=self.learning_rate)) return model def update_target_model(self): self.target_model.set_weights(self.model.get_weights()) def remember(self, state, action, reward, next_state, done): self.memory.append((state, action, reward, next_state, done)) def act(self, state): if np.random.rand() <= self.epsilon: return random.randrange(self.action_size) act_values = self.model.predict(state) return np.argmax(act_values[0]) def replay(self, batch_size): if len(self.memory) < batch_size: return minibatch = random.sample(self.memory, batch_size) states = np.array([i[0] for i in minibatch]) actions = np.array([i[1] for i in minibatch]) rewards = np.array([i[2] for i in minibatch]) next_states = np.array([i[3] for i in minibatch]) dones = np.array([i[4] for i in minibatch]) states = np.squeeze(states) next_states = np.squeeze(next_states) targets = rewards + self.gamma * (np.amax(self.target_model.predict_on_batch(next_states), axis=1)) * (1 - dones) targets_full = self.model.predict_on_batch(states) ind = np.array([i for i in range(batch_size)]) targets_full[[ind], [actions]] = targets self.model.fit(states, targets_full, epochs=1, verbose=0) if self.epsilon > self.epsilon_min: self.epsilon *= self.epsilon_decay def load(self, name): self.model.load_weights(name) def save(self, name): self.model.save_weights(name)

4. 训练AI玩贪吃蛇

现在我们将游戏环境和DQN算法结合起来，训练AI玩贪吃蛇。

4.1 状态表示

我们需要定义如何将游戏状态表示为神经网络可以理解的输入。对于贪吃蛇游戏，状态可以包括：

蛇头周围四个方向是否有障碍（蛇身或墙壁）
食物相对于蛇头的位置（左/右/上/下）
蛇当前的移动方向

def get_state(self): head = self.snake.get_head_position() food = self.food.position # 计算四个方向的点 point_l = (head[0] - self.block_size, head[1]) point_r = (head[0] + self.block_size, head[1]) point_u = (head[0], head[1] - self.block_size) point_d = (head[0], head[1] + self.block_size) # 当前移动方向 dir_l = self.snake.direction == (-1, 0) dir_r = self.snake.direction == (1, 0) dir_u = self.snake.direction == (0, -1) dir_d = self.snake.direction == (0, 1) state = [ # 危险直行 (dir_r and self.is_collision(point_r)) or (dir_l and self.is_collision(point_l)) or (dir_u and self.is_collision(point_u)) or (dir_d and self.is_collision(point_d)), # 危险右转 (dir_u and self.is_collision(point_r)) or (dir_d and self.is_collision(point_l)) or (dir_l and self.is_collision(point_u)) or (dir_r and self.is_collision(point_d)), # 危险左转 (dir_d and self.is_collision(point_r)) or (dir_u and self.is_collision(point_l)) or (dir_r and self.is_collision(point_u)) or (dir_l and self.is_collision(point_d)), # 移动方向 dir_l, dir_r, dir_u, dir_d, # 食物位置 food[0] < head[0], # 食物在左 food[0] > head[0], # 食物在右 food[1] < head[1], # 食物在上 food[1] > head[1] # 食物在下 ] return np.array(state, dtype=int)

4.2 奖励函数设计

奖励函数是强化学习中最关键的部分之一，它告诉AI什么是好的行为，什么是坏的行为。对于贪吃蛇游戏，我们可以设计如下奖励：

吃到食物：+10
撞到自己或墙壁：-10
靠近食物：+1
远离食物：-1
每移动一步：-0.1（鼓励高效）

def get_reward(self, snake, food, done): if done: return -10 if snake.get_head_position() == food.position: return 10 # 计算与食物的距离 head = snake.get_head_position() food_pos = food.position new_dist = abs(head[0] - food_pos[0]) + abs(head[1] - food_pos[1]) # 如果距离减小，给予正奖励；否则负奖励 if new_dist < self.prev_distance: reward = 1 else: reward = -1 self.prev_distance = new_dist # 每步的小惩罚 reward -= 0.1 return reward

4.3 训练过程

训练过程主要包括以下步骤：

初始化环境和智能体
获取当前状态
智能体选择动作
执行动作，获取新状态和奖励
存储经验到记忆库
训练智能体
定期更新目标网络

def train(): pygame.init() width, height, block_size = 800, 600, 20 game = Game(width, height, block_size) agent = DQNAgent(state_size=11, action_size=3) # 3动作：直行、右转、左转 episodes = 1000 batch_size = 32 for e in range(episodes): game.reset() state = game.get_state() state = np.reshape(state, [1, 11]) total_reward = 0 while True: action = agent.act(state) # 执行动作 if action == 0: # 直行 pass elif action == 1: # 右转 if game.snake.direction == (0, -1): game.snake.turn((1, 0)) elif game.snake.direction == (1, 0): game.snake.turn((0, 1)) elif game.snake.direction == (0, 1): game.snake.turn((-1, 0)) elif game.snake.direction == (-1, 0): game.snake.turn((0, -1)) elif action == 2: # 左转 if game.snake.direction == (0, -1): game.snake.turn((-1, 0)) elif game.snake.direction == (-1, 0): game.snake.turn((0, 1)) elif game.snake.direction == (0, 1): game.snake.turn((1, 0)) elif game.snake.direction == (1, 0): game.snake.turn((0, -1)) game.snake.move() # 检查游戏状态 done = False if game.snake.get_head_position() in game.snake.positions[1:]: done = True # 检查是否吃到食物 if game.snake.get_head_position() == game.food.position: game.snake.length += 1 game.food = Food(block_size, width, height) # 获取奖励和新状态 reward = game.get_reward(game.snake, game.food, done) total_reward += reward next_state = game.get_state() next_state = np.reshape(next_state, [1, 11]) # 存储经验 agent.remember(state, action, reward, next_state, done) state = next_state if done: print(f"Episode: {e}/{episodes}, Score: {game.snake.length}, Total reward: {total_reward}, Epsilon: {agent.epsilon:.2f}") break if len(agent.memory) > batch_size: agent.replay(batch_size) # 定期更新目标网络 if e % 10 == 0: agent.update_target_model() # 定期保存模型 if e % 100 == 0: agent.save(f"snake_dqn_{e}.h5") agent.save("snake_dqn_final.h5")

5. 调优与改进

训练过程中，你可能会遇到AI表现不佳的情况。以下是几个常见的调优方向：

5.1 奖励函数调整

奖励函数的设计对训练效果影响巨大。可以尝试以下调整：

增加对长时间存活的奖励
调整靠近/远离食物的奖励幅度
增加对形成循环移动的惩罚

5.2 网络结构优化

可以尝试更复杂的网络结构：

def _build_model(self): model = tf.keras.Sequential([ tf.keras.layers.Dense(64, input_dim=self.state_size, activation='relu'), tf.keras.layers.Dropout(0.2), tf.keras.layers.Dense(64, activation='relu'), tf.keras.layers.Dropout(0.2), tf.keras.layers.Dense(self.action_size, activation='linear') ]) model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam(lr=self.learning_rate)) return model

5.3 训练参数调整

关键训练参数包括：

参数	建议值	说明
γ (gamma)	0.9-0.99	折扣因子，越大表示越重视长期奖励
ε (epsilon)	1.0→0.01	探索率，初始高探索，逐渐降低
ε衰减	0.995	控制探索率降低速度
学习率	0.0001-0.001	影响权重更新幅度
批次大小	32-64	每次训练的样本数量
记忆容量	1000-10000	经验回放缓冲区大小

5.4 高级技巧

双DQN(Double DQN)：使用两个网络分别选择动作和评估动作，减少过高估计问题。
优先级经验回放(Prioritized Experience Replay)：给重要的经验样本更高采样概率。
决斗网络架构(Dueling Network)：将Q值分解为状态值和优势函数。

实现双DQN只需修改replay方法：

def replay(self, batch_size): if len(self.memory) < batch_size: return minibatch = random.sample(self.memory, batch_size) states = np.array([i[0] for i in minibatch]) actions = np.array([i[1] for i in minibatch]) rewards = np.array([i[2] for i in minibatch]) next_states = np.array([i[3] for i in minibatch]) dones = np.array([i[4] for i in minibatch]) states = np.squeeze(states) next_states = np.squeeze(next_states) # 双DQN修改部分 next_actions = np.argmax(self.model.predict_on_batch(next_states), axis=1) q_values_next = self.target_model.predict_on_batch(next_states) targets = rewards + self.gamma * q_values_next[np.arange(batch_size), next_actions] * (1 - dones) targets_full = self.model.predict_on_batch(states) ind = np.array([i for i in range(batch_size)]) targets_full[[ind], [actions]] = targets self.model.fit(states, targets_full, epochs=1, verbose=0) if self.epsilon > self.epsilon_min: self.epsilon *= self.epsilon_decay

查看全文

http://www.zskr.cn/news/1501246.html