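Preface

Atari games play a major role in deep reinforcement learning experiments. Today the usual way to interact with these environments is the Gym package developed by OpenAI. This post walks through some common preprocessing steps for Atari games.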

All of the wrappers discussed here come from OpenAI baselines:
https://github.com/openai/gym…

Common Wrappers Explained

Noop Reset

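# Imports shared by the wrappers below
import cv2
import gym
import numpy as np
from collections import deque
from gym import spaces

class NoopResetEnv(gym.Wrapper):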
    def __init__(self, env, noop_max=30):
        """Sample initial states by taking random number of no-ops on reset.
        No-op is assumed to be action 0.
        """
        gym.Wrapper.__init__(self, env)
        self.noop_max = noop_max
        self.override_num_noops = None
        self.noop_action = 0
        assert env.unwrapped.get_action_meanings()[0] == 'NOOP'

    def reset(self, **kwargs):
        """ Do no-op action for a number of steps in [1, noop_max]."""
        self.env.reset(**kwargs)
        if self.override_num_noops is not None:
            noops = self.override_num_noops
        else:
            noops = self.unwrapped.np_random.randint(1, self.noop_max + 1)
        assert noops > 0
        obs = None
        for _ in range(noops):
            obs, _, done, _ = self.env.step(self.noop_action)
            if done:
                obs = self.env.reset(**kwargs)
        return obs

    def step(self, ac):
        return self.env.step(ac)

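This wrapper samples the initial state by executing a random number of no-op actions (assumed to be action 0) whenever the environment is reset. If the environment returns done partway through, it is reset again. This adds randomness to the first frame and lowers the risk of overfitting.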
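The effect of the random no-ops can be seen without Gym at all. The sketch below is a simplified illustration, not the baselines code: `CounterEnv` is a hypothetical toy environment whose observation is just the step count, and `noop_reset` is an assumed helper that mirrors the reset logic above using Python's `random` module in place of `np_random`.

```python
import random


class CounterEnv:
    """Hypothetical toy environment: the observation is the number of steps taken."""

    def __init__(self, episode_len=100):
        self.episode_len = episode_len
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_len
        return self.t, 0.0, done, {}


def noop_reset(env, rng, noop_max=30, noop_action=0):
    """Reset, then take between 1 and noop_max no-op steps (both ends inclusive)."""
    obs = env.reset()
    noops = rng.randint(1, noop_max)
    for _ in range(noops):
        obs, _, done, _ = env.step(noop_action)
        if done:
            obs = env.reset()
    return obs


rng = random.Random(0)
env = CounterEnv(episode_len=100)
# Each reset now starts the agent from a different point in the game.
starts = [noop_reset(env, rng, noop_max=30) for _ in range(200)]
print(sorted(set(starts)))
```

Because the toy observation equals the step count, the printed values are exactly the different starting points the agent would see.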

Fire Reset

class FireResetEnv(gym.Wrapper):
    def __init__(self, env):
        """Take action on reset for environments that are fixed until firing."""
        gym.Wrapper.__init__(self, env)
        assert env.unwrapped.get_action_meanings()[1] == 'FIRE'
        assert len(env.unwrapped.get_action_meanings()) >= 3

    def reset(self, **kwargs):
        self.env.reset(**kwargs)
        obs, _, done, _ = self.env.step(1)
        if done:
            self.env.reset(**kwargs)
        obs, _, done, _ = self.env.step(2)
        if done:
            self.env.reset(**kwargs)
        return obs

    def step(self, ac):
        return self.env.step(ac)

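Some Atari games, such as Space Invaders, have a fire button, and as the docstring notes, some environments stay fixed until it is pressed. On reset, this wrapper takes the fire action (action 1) followed by action 2, resetting again if either step ends the episode, so that the state it returns is a post-fire state that is not done.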
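To make the motivation concrete, here is a minimal sketch with a hypothetical toy environment, `FrozenUntilFireEnv`, in which nothing moves until FIRE is pressed. The `fire_reset` helper is an assumption for illustration; it follows the same two-step pattern as the wrapper's `reset`.

```python
class FrozenUntilFireEnv:
    """Hypothetical toy game: the ball position only advances after FIRE."""

    NOOP, FIRE, RIGHT = 0, 1, 2

    def __init__(self):
        self.launched = False
        self.ball = 0

    def reset(self):
        self.launched = False
        self.ball = 0
        return self.ball

    def step(self, action):
        if action == self.FIRE:
            self.launched = True
        if self.launched:
            self.ball += 1
        return self.ball, 0.0, False, {}


def fire_reset(env):
    """Reset, press FIRE (action 1), then take action 2, as the wrapper does."""
    env.reset()
    obs, _, done, _ = env.step(1)
    if done:
        env.reset()
    obs, _, done, _ = env.step(2)
    if done:
        env.reset()
    return obs


# Without FIRE, an agent that only sends no-ops never sees the game start.
plain = FrozenUntilFireEnv()
plain.reset()
for _ in range(5):
    stuck_obs, _, _, _ = plain.step(FrozenUntilFireEnv.NOOP)

# With the fire reset, the returned state is already in play.
started_obs = fire_reset(FrozenUntilFireEnv())
```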

Episodic Life

class EpisodicLifeEnv(gym.Wrapper):
    def __init__(self, env):
        """Make end-of-life == end-of-episode, but only reset on true game over.
        Done by DeepMind for the DQN and co. since it helps value estimation.
        """
        gym.Wrapper.__init__(self, env)
        self.lives = 0
        self.was_real_done  = True

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.was_real_done = done
        # check current lives, make loss of life terminal,
        # then update lives to handle bonus lives
        lives = self.env.unwrapped.ale.lives()
        if lives < self.lives and lives > 0:
            # for Qbert sometimes we stay in lives == 0 condition for a few frames
            # so it's important to keep lives > 0, so that we only reset once
            # the environment advertises done.
            done = True
        self.lives = lives
        return obs, reward, done, info

    def reset(self, **kwargs):
        """Reset only when lives are exhausted.
        This way all states are still reachable even though lives are episodic,
        and the learner need not know about any of this behind-the-scenes.
        """
        if self.was_real_done:
            obs = self.env.reset(**kwargs)
        else:
            # no-op step to advance from terminal/lost life state
            obs, _, _, _ = self.env.step(0)
        self.lives = self.env.unwrapped.ale.lives()
        return obs

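In many games the player's character has more than one life. To speed up training and push the agent to avoid dying, the wrapper sets done to True each time a life is lost, and uses the attribute self.was_real_done to record the true done that occurs only once all lives are used up.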
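The bookkeeping is easier to follow on a toy game. The sketch below is an illustration under stated assumptions: `LivesEnv` is a hypothetical environment that loses a life every third step, and `EpisodicLife` is a Gym-free copy of the logic above that reads `env.lives` directly instead of `ale.lives()`.

```python
class LivesEnv:
    """Hypothetical toy game: every 3rd step costs a life; game over at 0 lives."""

    def __init__(self, lives=3):
        self.start_lives = lives
        self.lives = lives
        self.t = 0

    def reset(self):
        self.lives = self.start_lives
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        if self.t % 3 == 0:
            self.lives -= 1
        return self.t, 0.0, self.lives == 0, {}


class EpisodicLife:
    """Gym-free sketch of the end-of-life == end-of-episode logic."""

    def __init__(self, env):
        self.env = env
        self.lives = 0
        self.was_real_done = True

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.was_real_done = done
        lives = self.env.lives
        if 0 < lives < self.lives:
            done = True  # a life was lost: end the episode for the learner
        self.lives = lives
        return obs, reward, done, info

    def reset(self):
        if self.was_real_done:
            obs = self.env.reset()  # true game over: really reset
        else:
            obs, _, _, _ = self.env.step(0)  # otherwise just continue the game
        self.lives = self.env.lives
        return obs


env = EpisodicLife(LivesEnv(lives=3))
episodes = 0
while True:
    env.reset()
    done = False
    while not done:
        _, _, done, _ = env.step(0)
    episodes += 1
    if env.was_real_done:
        break
print(episodes, env.env.t)
```

One real game of nine steps is presented to the learner as three short episodes, while the underlying game is reset only once.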

Max and Skip

class MaxAndSkipEnv(gym.Wrapper):
    def __init__(self, env, skip=4):
        """Return only every `skip`-th frame"""
        gym.Wrapper.__init__(self, env)
        # most recent raw observations (for max pooling across time steps)
        self._obs_buffer = np.zeros((2,)+env.observation_space.shape, dtype=np.uint8)
        self._skip       = skip

    def step(self, action):
        """Repeat action, sum reward, and max over last observations."""
        total_reward = 0.0
        done = None
        for i in range(self._skip):
            obs, reward, done, info = self.env.step(action)
            if i == self._skip - 2: self._obs_buffer[0] = obs
            if i == self._skip - 1: self._obs_buffer[1] = obs
            total_reward += reward
            if done:
                break
        # Note that the observation on the done=True frame
        # doesn't matter
        max_frame = self._obs_buffer.max(axis=0)

        return max_frame, total_reward, done, info

    def reset(self, **kwargs):
        return self.env.reset(**kwargs)

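This wrapper implements frame skipping: it returns an environment transition tuple only once every skip frames. The same action is repeated on the skipped frames, their rewards are summed, and the returned observation is the pixel-wise maximum of the last two frames. In Atari games some objects are drawn only on odd frames, which is why the maximum over the last two frames is taken.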
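The flicker problem and its fix can be sketched with plain lists. `FlickerEnv` is a hypothetical toy environment whose 4-pixel "frame" shows one object on odd frames and another on even frames; `max_and_skip_step` is an assumed helper that follows the same buffer logic as the wrapper's `step`.

```python
class FlickerEnv:
    """Hypothetical toy game: object A is drawn on odd frames, object B on even frames."""

    width = 4

    def __init__(self):
        self.t = 0

    def step(self, action):
        self.t += 1
        frame = [255, 0, 0, 0] if self.t % 2 else [0, 0, 0, 255]
        return frame, 1.0, False, {}


def max_and_skip_step(env, action, skip=4):
    """Repeat the action `skip` times, sum rewards, max over the last two frames."""
    buffer = [[0] * env.width, [0] * env.width]
    total_reward = 0.0
    done, info = False, {}
    for i in range(skip):
        obs, reward, done, info = env.step(action)
        if i == skip - 2:
            buffer[0] = obs
        if i == skip - 1:
            buffer[1] = obs
        total_reward += reward
        if done:
            break
    max_frame = [max(a, b) for a, b in zip(*buffer)]
    return max_frame, total_reward, done, info


frame, reward, done, _ = max_and_skip_step(FlickerEnv(), action=0, skip=4)
print(frame, reward)
```

Either of the last two raw frames alone would show only one object; the pixel-wise maximum shows both, and the reward is the sum over the four repeated steps.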

Clip Reward

class ClipRewardEnv(gym.RewardWrapper):
    def __init__(self, env):
        gym.RewardWrapper.__init__(self, env)

    def reward(self, reward):
        """Bin reward to {+1, 0, -1} by its sign."""
        return np.sign(reward)

Score scales differ from game to game. To put them on a common scale and make learning easier, every reward is mapped to 1 (reward > 0), 0 (reward = 0), or -1 (reward < 0).
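As a quick illustration of the mapping, here is a pure-Python stand-in for `np.sign`. The `clip_reward` name and the sample reward values are made up for the example; note that it returns Python ints, whereas `np.sign` keeps the input's numeric type.

```python
def clip_reward(reward):
    """Return +1, 0, or -1 according to the sign of the reward."""
    return (reward > 0) - (reward < 0)


# Raw scores of very different magnitudes end up on the same scale.
raw_rewards = [200, 0.5, 0, -3]
clipped = [clip_reward(r) for r in raw_rewards]
print(clipped)
```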

Warp Frame

class WarpFrame(gym.ObservationWrapper):
    def __init__(self, env, width=84, height=84, grayscale=True, dict_space_key=None):
        """
        Warp frames to 84x84 as done in the Nature paper and later work.

        If the environment uses dictionary observations, `dict_space_key` can be specified which indicates which
        observation should be warped.
        """
        super().__init__(env)
        self._width = width
        self._height = height
        self._grayscale = grayscale
        self._key = dict_space_key
        if self._grayscale:
            num_colors = 1
        else:
            num_colors = 3

        new_space = gym.spaces.Box(
            low=0,
            high=255,
            shape=(self._height, self._width, num_colors),
            dtype=np.uint8,
        )
        if self._key is None:
            original_space = self.observation_space
            self.observation_space = new_space
        else:
            original_space = self.observation_space.spaces[self._key]
            self.observation_space.spaces[self._key] = new_space
        assert original_space.dtype == np.uint8 and len(original_space.shape) == 3

    def observation(self, obs):
        if self._key is None:
            frame = obs
        else:
            frame = obs[self._key]

        if self._grayscale:
            frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        frame = cv2.resize(
            frame[34:194], (self._width, self._height), interpolation=cv2.INTER_AREA
        )
        if self._grayscale:
            frame = np.expand_dims(frame, -1)

        if self._key is None:
            obs = frame
        else:
            obs = obs.copy()
            obs[self._key] = frame
        return obs

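This wrapper preprocesses the observed frame. It first converts the 3-channel RGB image to grayscale and then resizes it to an 84 × 84 grayscale image. The version shown here is the WarpFrame I use for Pong: so that the agent focuses on the playing field and is not misled by regions such as the score display, I crop the frame (frame[34:194]) before resizing. The appropriate crop may differ from game to game.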
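The shape changes in this pipeline can be traced with a dependency-free sketch. This is only an approximation of what the wrapper does: the grayscale weights are the standard luminance coefficients, and the resize uses nearest-neighbour sampling as a stand-in for `cv2.INTER_AREA`, so pixel values will not match OpenCV's output. The `warp` and `to_gray` names and the all-white 210 × 160 test frame are assumptions for illustration.

```python
def to_gray(pixel):
    """Luminance-weighted grayscale value of one (R, G, B) pixel."""
    r, g, b = pixel
    return int(round(0.299 * r + 0.587 * g + 0.114 * b))


def warp(frame, top=34, bottom=194, width=84, height=84):
    """Crop rows [top, bottom), convert to grayscale, resize by nearest neighbour."""
    gray = [[to_gray(p) for p in row] for row in frame[top:bottom]]
    src_h, src_w = len(gray), len(gray[0])
    return [
        [gray[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


# A raw Atari frame is 210 rows by 160 columns of RGB pixels.
raw = [[(255, 255, 255)] * 160 for _ in range(210)]
small = warp(raw)
print(len(small), len(small[0]))
```

The crop keeps 160 of the 210 rows, so the image is square (160 × 160) before it is reduced to 84 × 84.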

Frame Stack

class FrameStack(gym.Wrapper):
    def __init__(self, env, k):
        """Stack k last frames.

        Returns lazy array, which is much more memory efficient.

        See Also
        --------
        baselines.common.atari_wrappers.LazyFrames
        """
        gym.Wrapper.__init__(self, env)
        self.k = k
        self.frames = deque([], maxlen=k)
        shp = env.observation_space.shape
        self.observation_space = spaces.Box(low=0, high=255, shape=(shp[:-1] + (shp[-1] * k,)), dtype=env.observation_space.dtype)

    def reset(self):
        ob = self.env.reset()
        for _ in range(self.k):
            self.frames.append(ob)
        return self._get_ob()

    def step(self, action):
        ob, reward, done, info = self.env.step(action)
        self.frames.append(ob)
        return self._get_ob(), reward, done, info

    def _get_ob(self):
        assert len(self.frames) == self.k
        return LazyFrames(list(self.frames))

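This wrapper merges the last k grayscale frames into a single observation, giving the CNN some temporal information (Human-level control through deep reinforcement learning). It maintains a deque of size k and, on every step, replaces the oldest ob with the newest one, so that states from different time steps are stacked together. The result is returned as a LazyFrames object; to use it, call np.array(lazy_frames_instance) to convert it to an ndarray.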
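The rolling-window behaviour comes entirely from `deque(maxlen=k)`. The short sketch below uses strings as placeholder frames (an assumption for readability) to show what the stack holds after a reset and after two steps.

```python
from collections import deque

k = 4
frames = deque([], maxlen=k)

# On reset, the first observation is repeated k times to fill the stack.
first = "f0"
for _ in range(k):
    frames.append(first)
after_reset = list(frames)

# On each step, appending the newest frame silently drops the oldest one.
for new_frame in ["f1", "f2"]:
    frames.append(new_frame)
after_two_steps = list(frames)

print(after_reset)
print(after_two_steps)
```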

Scaled Float Frame

class ScaledFloatFrame(gym.ObservationWrapper):
    def __init__(self, env):
        gym.ObservationWrapper.__init__(self, env)
        self.observation_space = gym.spaces.Box(low=0, high=1, shape=env.observation_space.shape, dtype=np.float32)

    def observation(self, observation):
        # careful! This undoes the memory optimization, use
        # with smaller replay buffers only.
        return np.array(observation).astype(np.float32) / 255.0

This wrapper normalizes image values from the 0 ~ 255 range to [0, 1].
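The warning in the code comment about memory is simple arithmetic: uint8 takes 1 byte per pixel and float32 takes 4. The sketch below works this out for one observation, assuming the 84 × 84 frame size and a stack of 4 frames used elsewhere in this post, and shows the scaling itself on a few sample pixel values.

```python
HEIGHT, WIDTH, STACK = 84, 84, 4

pixels = HEIGHT * WIDTH * STACK
bytes_uint8 = pixels * 1    # np.uint8: 1 byte per value
bytes_float32 = pixels * 4  # np.float32: 4 bytes per value

# The scaling itself is a division by 255.
raw = [0, 51, 255]
scaled = [p / 255.0 for p in raw]

print(bytes_uint8, bytes_float32, scaled)
```

Each stored observation becomes four times larger once it is converted to float32, which is why the comment recommends this wrapper only with smaller replay buffers.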

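Afterword

In reinforcement learning, the environment and its preprocessing are a very important step and can directly determine whether an agent trains successfully. Beyond the wrappers above, readers can write their own wrappers to fit their needs. Finally, here is the environment wrapper for Nature DQN: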

def wrap_deepmind(env, episode_life=True, clip_rewards=True, frame_stack=False, scale=False):
    """Configure environment for DeepMind-style Atari.
    """
    if episode_life:
        env = EpisodicLifeEnv(env)
    if 'FIRE' in env.unwrapped.get_action_meanings():
        env = FireResetEnv(env)
    env = WarpFrame(env)
    if scale:
        env = ScaledFloatFrame(env)
    if clip_rewards:
        env = ClipRewardEnv(env)
    if frame_stack:
        env = FrameStack(env, 4)
    return env

PS

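I looked around at several blogging sites. CSDN was off-putting enough that I ruled it out straight away, and cnblogs (博客园) requires real-name information, which never sat well with me, so in the end I chose SegmentFault. Ultimately I would still like to run a self-hosted blog, but for now I have no server, and lately I have had no time to tinker with GitHub Pages, so I will keep my study notes here for the time being. That's all.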
