FlappyBird AI Based on Q-Learning
The AI is trained on top of the FlappyBird implementation from birdbot. That implementation wraps the game in a simple interface, so the game state can be read conveniently to drive the algorithm. It can also show the game window, which makes debugging easier and lets you watch how the algorithm performs; alternatively, the window and sound can be turned off while the game keeps running normally, which is typically done during training to reduce CPU usage.
The implementation follows SarvagyaVaish's Flappy Bird RL.
Q-Learning
Q-Learning is a value-based reinforcement learning algorithm.
Q, that is Q(s, a), is the expected return obtained by taking action a (a ∈ A) in state s (s ∈ S) at a given moment; the environment then feeds back a reward for the agent's action. The core idea of the algorithm is to build a Q-table over States and Actions to store the Q values, and then pick the action with the largest expected return according to those Q values.
Q-Table | a1 | a2 |
---|---|---|
s1 | q(s1,a1) | q(s1,a2) |
s2 | q(s2,a1) | q(s2,a2) |
s3 | q(s3,a1) | q(s3,a2) |
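In code, such a Q-table can be a plain dictionary mapping each state to its per-action values, which is also how the bot further down stores Q. A minimal sketch, with made-up states and numbers:

# Q-table as a dict: state -> {action: estimated return}
# (states, actions and values here are invented purely for illustration)
Q = {
    's1': {'a1': 0.2, 'a2': -0.5},
    's2': {'a1': 0.0, 'a2': 1.3},
}

def best_action(state):
    # greedy policy: pick the action with the largest Q value in this state
    Q.setdefault(state, {'a1': 0.0, 'a2': 0.0})
    return max(Q[state], key=lambda a: Q[state][a])

print(best_action('s2'))  # -> 'a2'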
Algorithm flow
During the update a learning rate α (alpha) is introduced; it controls how much of the previous Q value is kept and how much of the new estimate is taken over.
γ is the discount factor, with 0 ≤ γ < 1: γ = 0 means only the immediate reward counts, while γ approaching 1 puts more weight on future rewards, so γ determines how strongly the timing of a reward affects its contribution.
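Putting α and γ together, the standard Q-Learning update is Q(s,a) ← Q(s,a) + α*(r + γ*max_a' Q(s',a') - Q(s,a)). A single hand-computed step with made-up numbers (α matches the 0.7 used in the code below, γ is chosen arbitrarily):

# one Q-Learning update step with made-up numbers
alpha, gamma = 0.7, 0.9      # learning rate and discount factor
q_sa = 2.0                   # current estimate Q(s, a)
reward = 1.0                 # reward r observed after taking a in s
max_q_next = 5.0             # max over a' of Q(s', a')

q_sa = q_sa + alpha * (reward + gamma * max_q_next - q_sa)
print(q_sa)                  # 2.0 + 0.7 * (1.0 + 4.5 - 2.0) = 4.45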
For a step-by-step walkthrough of Q-Learning, the following tutorial is a good reference:
A Painless Q-learning Tutorial (a concise tutorial on the Q-learning algorithm)
Application to FlappyBird
State space
- Vertical distance measured from the lower pipe
- Horizontal distance measured from the next pair of pipes
- Bird: dead or alive
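In the bot below, these two distances are measured relative to the next pair of pipes and then coarsened into buckets so the Q-table stays small. A minimal sketch of that discretization (the bucket size 10 is the scale used in the code; the raw pixel distances are made up):

# discretize the bird's distances to the next pipe pair into coarse buckets
scale = 10                   # bucket size, as in the bot's plan()
h_raw, v_raw = 137, -42      # made-up horizontal / vertical distances in pixels
state = (h_raw // scale, v_raw // scale)
print(state)                 # -> (13, -5)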
Actions
In every state there are two possible actions:
- Tap
- Do nothing
Reward
The reward scheme is based entirely on whether the bird survives:
- +1 if the bird is still alive
- -1000 if the bird dies
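In the bot's plan() further down, this scheme is a one-liner on the bird's status; a self-contained sketch (bird_status stands for state['bird'][2] in the full code):

# reward: +1 for every step survived, -1000 on death
bird_status = 'alive'        # made-up example value
reward = -1000 if bird_status == 'dead' else 1
print(reward)                # -> 1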
Procedure
Pseudocode
Initialize Q = {}
while Q has not converged:
    Initialize the bird's position S and start a new round of the game
    while S != dead state:
        Using policy π, obtain action a = π(S)
        Play one step of the game with action a, obtaining the bird's new position S' and reward R(S, a)
        Q[S, a] ← (1-α)*Q[S, a] + α*(R(S, a) + γ*max_a' Q[S', a'])   // update Q
        S ← S'
- Observe which state Flappy Bird is in and take the action that maximizes the expected reward. Let the game keep running and obtain the next state s'.
- Observe the new state s' and the reward associated with it: +1 or -1000.
- Update the Q array according to the Q-Learning rule: Q[s, a] ← Q[s, a] + α*(r + γ*V(s') - Q[s, a]), where V(s') = max_a' Q[s', a'].
- Set the current state to s' and start over.
Code
import pyglet
import random
import pickle
import atexit
import os

from pybird.game import Game


class Bot:
    def __init__(self, game):
        self.game = game
        # constants
        self.WINDOW_HEIGHT = Game.WINDOW_HEIGHT
        self.PIPE_WIDTH = Game.PIPE_WIDTH
        # this flag is used to make sure at most one tap during
        # every call of run()
        self.tapped = False

        self.game.play()

        # variables for plan
        self.Q = {}
        self.alpha = 0.7
        self.explore = 100
        self.pre_s = (9999, 9999)
        self.pre_a = 'do_nothing'

        self.absolute_path = os.path.split(os.path.realpath(__file__))[0]
        self.memo = self.absolute_path + '/memo'
        if os.path.isfile(self.memo):
            # open in binary mode to match the 'wb' dump below
            _dict = pickle.load(open(self.memo, 'rb'))
            self.Q = _dict["Q"]
            self.game.record.iters = _dict.get("iters", 0)
            self.game.record.best_iter = _dict.get("best_iter", 0)

        def do_at_exit():
            _dict = {"Q": self.Q,
                     "iters": self.game.record.iters,
                     "best_iter": self.game.record.best_iter}
            pickle.dump(_dict, open(self.memo, 'wb'))

        atexit.register(do_at_exit)

    # this method is called automatically every 0.05 s by pyglet
    def run(self):
        if self.game.state == 'PLAY':
            self.tapped = False
            # call plan() to execute your plan
            self.plan(self.get_state())
        else:
            state = self.get_state()
            bird_state = list(state['bird'])
            bird_state[2] = 'dead'
            state['bird'] = bird_state
            # do NOT allow tap
            self.tapped = True
            self.plan(state)
            # restart game
            print 'iters:', self.game.record.iters, 'score:', self.game.record.get(), 'best:', self.game.record.best_score
            self.game.record.inc_iters()
            self.game.restart()
            self.game.play()

    # get the state that the robot needs
    def get_state(self):
        state = {}
        # bird's position and status (dead or alive)
        state['bird'] = (int(round(self.game.bird.x)),
                         int(round(self.game.bird.y)), 'alive')
        state['pipes'] = []
        # pipes' positions
        for i in range(1, len(self.game.pipes), 2):
            p = self.game.pipes[i]
            if p.x < Game.WINDOW_WIDTH:
                # this pair of pipes shows on screen
                x = int(round(p.x))
                y = int(round(p.y))
                state['pipes'].append((x, y))
                state['pipes'].append((x, y - Game.PIPE_HEIGHT_INTERVAL))
        return state

    # simulate the click action, the bird flies higher when tapped
    # It can be called only once every time slice (every execution cycle of plan())
    def tap(self):
        if not self.tapped:
            self.game.bird.jump()
            self.tapped = True

    # That's where the robot actually works
    # NOTE Put your code here
    def plan(self, state):
        x = state['bird'][0]
        y = state['bird'][1]
        if len(state['pipes']) == 0:
            # no pipe on screen yet: just keep the bird around mid-height
            if y < self.WINDOW_HEIGHT / 2:
                self.tap()
            return
        h, v = 9999, 9999
        reward = -1000 if state['bird'][2] == 'dead' else 1
        # distances to the next pair of pipes define the state
        for i in range(1, len(state['pipes']), 2):
            p = state['pipes'][i]
            if x <= p[0] + self.PIPE_WIDTH:
                h = p[0] + self.PIPE_WIDTH - x
                v = p[1] - y
                break
        # discretize the distances into coarse buckets to keep the table small
        scale = 10
        h /= scale
        v /= scale
        self.Q.setdefault((h, v), {'tap': 0, 'do_nothing': 0})
        self.Q.setdefault(self.pre_s, {'tap': 0, 'do_nothing': 0})
        tap_v = self.Q[(h, v)]['tap']
        nothing_v = self.Q[(h, v)]['do_nothing']
        # Q-Learning update for the previous state/action pair (gamma is effectively 1 here)
        self.Q[self.pre_s][self.pre_a] += self.alpha * (reward + max(tap_v, nothing_v) - self.Q[self.pre_s][self.pre_a])
        self.pre_s = (h, v)
        if random.randint(0, self.explore) > 100:
            # explore: pick a random action
            self.pre_a = "do_nothing" if random.randint(0, 1) else "tap"
        else:
            # exploit: pick the action with the larger Q value
            tap_v = self.Q[self.pre_s]['tap']
            nothing_v = self.Q[self.pre_s]['do_nothing']
            self.pre_a = "do_nothing" if tap_v <= nothing_v else "tap"
        if self.pre_a == 'tap':
            self.tap()


if __name__ == '__main__':
    show_window = True
    enable_sound = False

    game = Game()
    game.set_sound(enable_sound)
    bot = Bot(game)

    def update(dt):
        game.update(dt)
        bot.run()

    pyglet.clock.schedule_interval(update, Game.TIME_INTERVAL)

    if show_window:
        window = pyglet.window.Window(Game.WINDOW_WIDTH, Game.WINDOW_HEIGHT, vsync=False)

        @window.event
        def on_draw():
            window.clear()
            game.draw()

        pyglet.app.run()
    else:
        pyglet.app.run()
The full code is available in the GitHub repository.
References
- Flappy Bird RL
- How to explain the Q-learning process with a simple example? (牛阿's answer on Zhihu)
- Q-Learning 算法详解 (a detailed explanation of the Q-Learning algorithm)