A Q-Learning-Based FlappyBird AI
The AI is trained on top of the FlappyBird implementation from birdbot. That implementation wraps the game in a simple interface, so the game state can be read easily to support the algorithm. The game window can be shown for debugging, which makes it easy to see how well the algorithm performs; the window and sound can also be turned off while the game keeps running normally, which is typically done during training to reduce CPU usage.
The implementation follows SarvagyaVaish's Flappy Bird RL.
Q-Learning
Q-Learning is a value-based reinforcement learning algorithm.
Q stands for Q(s, a): the expected return of taking action a (a ∈ A) in state s (s ∈ S) at a given moment. The environment feeds back a reward for each of the agent's actions, so the core idea of the algorithm is to build a Q-table indexed by State and Action to store the Q values, and then choose the action with the largest expected return according to those values.
| Q-Table | a1 | a2 |
|---|---|---|
| s1 | q(s1,a1) | q(s1,a2) |
| s2 | q(s2,a1) | q(s2,a2) |
| s3 | q(s3,a1) | q(s3,a2) |
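In code, this table is just a nested mapping from state to action values; the bot later in this post stores its Q values the same way, keyed by a discretized state tuple. A minimal sketch (the state and action names here are placeholders, not part of the original implementation):

```python
# Q-table as a nested dict: Q[state][action] -> estimated return
Q = {
    's1': {'a1': 0.0, 'a2': 0.0},
    's2': {'a1': 0.0, 'a2': 0.0},
    's3': {'a1': 0.0, 'a2': 0.0},
}

def best_action(Q, s):
    """Pick the action with the largest Q value in state s."""
    return max(Q[s], key=Q[s].get)
```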
Algorithm flow
The update introduces a learning rate α, which controls how much of the difference between the previous Q value and the new Q value is kept.
γ is the discount factor, 0 ≤ γ < 1. γ = 0 means only the immediate reward counts, while values close to 1 weight future rewards more heavily, so γ determines how strongly the timing of a reward affects its value.
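Written out with both parameters, the standard Q-Learning update (the same rule used in the pseudocode later in this post) is:

$$ Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left(r + \gamma \max_{a'} Q(s',a')\right) $$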
For a step-by-step walkthrough of Q-Learning, see the tutorial below:
A Painless Q-learning Tutorial (a concise tutorial on the Q-learning algorithm)
Application to FlappyBird
State space
- Vertical distance from the lower pipe
- Horizontal distance to the next pair of pipes (both distances are discretized; see the sketch after this list)
- Bird: dead or alive
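As a rough illustration of how these distances become a Q-table key, here is a minimal sketch; the argument names are hypothetical, but the arithmetic mirrors what the bot code later in this post does, dividing by a scale of 10 so that nearby positions fall into the same table entry:

```python
def discretize_state(bird_x, bird_y, pipe_x, pipe_y, pipe_width, scale=10):
    """Map raw pixel coordinates to a coarse (horizontal, vertical) grid cell.

    pipe_x / pipe_y are assumed to describe the next lower pipe the bird
    has not yet passed.
    """
    h = (pipe_x + pipe_width - bird_x) // scale  # horizontal distance still to clear
    v = (pipe_y - bird_y) // scale               # vertical distance to the lower pipe
    return (h, v)
```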
Actions
In each state there are two possible actions:
- Tap
- Do nothing
Rewards
The reward scheme is based entirely on whether the bird survives:
- +1 if the bird is still alive
- -1000 if the bird dies
Flow
Pseudocode
```
Initialize Q = {}
while Q has not converged:
    Initialize the bird's state S and start a new round of the game
    while S != dead:
        Use the policy π to get an action a = π(S)
        Perform action a, obtaining the bird's new state S' and the reward R(S, a)
        Q[S, a] ← (1 - α) * Q[S, a] + α * (R(S, a) + γ * max_a' Q[S', a'])   # update Q
        S ← S'
```
- Observe which state Flappy Bird is in and take the action that maximizes the expected reward. Let the game run one step and obtain the next state s'.
- Observe the new state s' and the reward associated with it: +1 or -1000.
- Update the Q array according to the Q-Learning rule:

  Q[s,a] ← Q[s,a] + α(r + γ·V(s') − Q[s,a]), where V(s') = max over a' of Q[s',a']

- Set the current state to s' and start over (a minimal code sketch of this step follows this list).
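Mapping these steps onto code, here is a minimal sketch of a single Q-Learning step under the same reward scheme and action set. The helper names and the ε-style exploration knob are illustrative, not part of the original bot, which embeds the equivalent logic in plan() and effectively uses γ = 1:

```python
import random

ACTIONS = ('tap', 'do_nothing')

def q_update(Q, s, a, r, s_next, alpha=0.7, gamma=1.0):
    """One Q-Learning step: Q[s][a] += alpha * (r + gamma * max_a' Q[s_next][a'] - Q[s][a])."""
    Q.setdefault(s, {act: 0.0 for act in ACTIONS})
    Q.setdefault(s_next, {act: 0.0 for act in ACTIONS})
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])

def choose_action(Q, s, explore=0.0):
    """Mostly greedy; with probability `explore`, take a random action instead."""
    Q.setdefault(s, {act: 0.0 for act in ACTIONS})
    if random.random() < explore:
        return random.choice(ACTIONS)
    return max(Q[s], key=Q[s].get)

# Example: one transition where the bird survived (+1 reward)
Q = {}
q_update(Q, s=(5, -3), a='do_nothing', r=1, s_next=(4, -2))
```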
Code
```python
import pyglet
import random
import pickle
import atexit
import os
from pybird.game import Game


class Bot:
    def __init__(self, game):
        self.game = game
        # constants
        self.WINDOW_HEIGHT = Game.WINDOW_HEIGHT
        self.PIPE_WIDTH = Game.PIPE_WIDTH
        # this flag is used to make sure at most one tap during
        # every call of run()
        self.tapped = False

        self.game.play()

        # variables for plan
        self.Q = {}
        self.alpha = 0.7
        self.explore = 100
        self.pre_s = (9999, 9999)
        self.pre_a = 'do_nothing'

        # load the persisted Q-table (and statistics) if it exists
        self.absolute_path = os.path.split(os.path.realpath(__file__))[0]
        self.memo = self.absolute_path + '/memo'
        if os.path.isfile(self.memo):
            _dict = pickle.load(open(self.memo, 'rb'))
            self.Q = _dict["Q"]
            self.game.record.iters = _dict.get("iters", 0)
            self.game.record.best_iter = _dict.get("best_iter", 0)

        def do_at_exit():
            _dict = {"Q": self.Q,
                     "iters": self.game.record.iters,
                     "best_iter": self.game.record.best_iter}
            pickle.dump(_dict, open(self.memo, 'wb'))

        atexit.register(do_at_exit)

    # this method is called automatically every 0.05s by pyglet
    def run(self):
        if self.game.state == 'PLAY':
            self.tapped = False
            # call plan() to execute your plan
            self.plan(self.get_state())
        else:
            state = self.get_state()
            bird_state = list(state['bird'])
            bird_state[2] = 'dead'
            state['bird'] = bird_state
            # do NOT allow tap
            self.tapped = True
            self.plan(state)
            # restart game
            print('iters:', self.game.record.iters,
                  ' score:', self.game.record.get(),
                  'best: ', self.game.record.best_score)
            self.game.record.inc_iters()
            self.game.restart()
            self.game.play()

    # get the state that the bot needs
    def get_state(self):
        state = {}
        # bird's position and status (dead or alive)
        state['bird'] = (int(round(self.game.bird.x)),
                         int(round(self.game.bird.y)), 'alive')
        state['pipes'] = []
        # pipes' positions
        for i in range(1, len(self.game.pipes), 2):
            p = self.game.pipes[i]
            if p.x < Game.WINDOW_WIDTH:
                # this pair of pipes shows on screen
                x = int(round(p.x))
                y = int(round(p.y))
                state['pipes'].append((x, y))
                state['pipes'].append((x, y - Game.PIPE_HEIGHT_INTERVAL))
        return state

    # simulate the click action, the bird flies higher when tapped
    # It can be called only once every time slice (every execution cycle of plan())
    def tap(self):
        if not self.tapped:
            self.game.bird.jump()
            self.tapped = True

    # That's where the bot actually works
    # NOTE Put your code here
    def plan(self, state):
        x = state['bird'][0]
        y = state['bird'][1]
        if len(state['pipes']) == 0:
            if y < self.WINDOW_HEIGHT / 2:
                self.tap()
            return
        h, v = 9999, 9999
        reward = -1000 if state['bird'][2] == 'dead' else 1
        # distance to the next lower pipe that has not been passed yet
        for i in range(1, len(state['pipes']), 2):
            p = state['pipes'][i]
            if x <= p[0] + self.PIPE_WIDTH:
                h = p[0] + self.PIPE_WIDTH - x
                v = p[1] - y
                break
        # discretize the state so that nearby positions share a Q-table entry
        scale = 10
        h //= scale
        v //= scale
        self.Q.setdefault((h, v), {'tap': 0, 'do_nothing': 0})
        self.Q.setdefault(self.pre_s, {'tap': 0, 'do_nothing': 0})
        tap_v = self.Q[(h, v)]['tap']
        nothing_v = self.Q[(h, v)]['do_nothing']
        # Q-Learning update for the previous state-action pair
        self.Q[self.pre_s][self.pre_a] += self.alpha * (
            reward + max(tap_v, nothing_v) - self.Q[self.pre_s][self.pre_a])
        self.pre_s = (h, v)
        if random.randint(0, self.explore) > 100:
            # explore: pick a random action
            self.pre_a = "do_nothing" if random.randint(0, 1) else "tap"
        else:
            # exploit: pick the action with the larger Q value
            tap_v = self.Q[self.pre_s]['tap']
            nothing_v = self.Q[self.pre_s]['do_nothing']
            self.pre_a = "do_nothing" if tap_v <= nothing_v else "tap"
        if self.pre_a == 'tap':
            self.tap()


if __name__ == '__main__':
    show_window = True
    enable_sound = False
    game = Game()
    game.set_sound(enable_sound)
    bot = Bot(game)

    def update(dt):
        game.update(dt)
        bot.run()

    pyglet.clock.schedule_interval(update, Game.TIME_INTERVAL)
    if show_window:
        window = pyglet.window.Window(Game.WINDOW_WIDTH, Game.WINDOW_HEIGHT, vsync=False)

        @window.event
        def on_draw():
            window.clear()
            game.draw()

        pyglet.app.run()
    else:
        pyglet.app.run()
```
The complete code is available in the GitHub repository.
References
- Flappy Bird RL
- 如何用简单例子讲解 Q-learning 的具体过程? (How to explain the Q-learning process with a simple example?) - 牛阿's answer on Zhihu
- Q-Learning算法详解 (A detailed explanation of the Q-Learning algorithm)