Machine Learning: A PyTorch Implementation and Step-by-Step Walkthrough of DDPG Reinforcement Learning

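Deep Deterministic Policy Gradient (DDPG) is a model-free, off-policy deep reinforcement learning algorithm inspired by Deep Q-Network. It is an Actor-Critic method built on policy gradients. This article implements it in full with PyTorch and explains each part.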

The key components of DDPG are:

Replay Buffer
Actor-Critic neural network
Exploration Noise
Target network
Soft Target Updates for Target Network

Below we implement them one by one:

Replay Buffer

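DDPG uses a Replay Buffer to store the transitions and rewards (S, a, R, S+) sampled while exploring the environment. The Replay Buffer plays a crucial role in helping the agent learn faster and in keeping DDPG stable:

Minimizing correlation between samples: storing past experience in the Replay Buffer lets the agent learn from a wide variety of experiences.
Enabling off-policy learning: the agent samples transitions from the replay buffer rather than from the current policy.
Efficient sampling: storing past experience in the buffer lets the agent learn from different experiences multiple times.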
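The storage scheme behind these points can be sketched with a toy ring buffer. This is only an illustration: `TinyRingBuffer` is a hypothetical, stripped-down class (not the article's implementation), and plain integers stand in for real transitions so the overwrite behaviour is easy to see.

```python
import random


class TinyRingBuffer:
    """Hypothetical, stripped-down ring buffer used only for illustration."""

    def __init__(self, max_size):
        self.storage = []
        self.max_size = max_size
        self.ptr = 0  # index of the oldest stored item once the buffer is full

    def push(self, data):
        if len(self.storage) == self.max_size:
            # Buffer is full: overwrite the oldest entry and advance the pointer.
            self.storage[self.ptr] = data
            self.ptr = (self.ptr + 1) % self.max_size
        else:
            self.storage.append(data)

    def sample(self, batch_size):
        # Uniform sampling with replacement.
        return [random.choice(self.storage) for _ in range(batch_size)]


buffer = TinyRingBuffer(max_size=3)
for step in range(5):
    buffer.push(step)  # integers stand in for (state, next_state, action, reward, done)

print(buffer.storage)  # [3, 4, 2]: items 0 and 1 were overwritten
batch = buffer.sample(batch_size=4)
print(batch)
```

Because sampling is uniform over everything stored, a batch mixes old and recent transitions, which is what breaks the correlation between consecutive steps.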

    import numpy as np


    class Replay_buffer():
        '''
        Code based on:
        https://github.com/openai/baselines/blob/master/baselines/deepq/replay_buffer.py
        Expects tuples of (state, next_state, action, reward, done)
        '''
        def __init__(self, max_size=1000000):
            """Create Replay buffer.

            Parameters
            ----------
            max_size: int
                Max number of transitions to store in the buffer. When the buffer
                overflows the old memories are dropped.
            """
            self.storage = []
            self.max_size = max_size
            self.ptr = 0  # position of the oldest transition once the buffer is full

        def push(self, data):
            if len(self.storage) == self.max_size:
                # Buffer is full: overwrite the oldest transition.
                self.storage[int(self.ptr)] = data
                self.ptr = (self.ptr + 1) % self.max_size
            else:
                self.storage.append(data)

        def sample(self, batch_size):
            """Sample a batch of experiences.

            Parameters
            ----------
            batch_size: int
                How many transitions to sample.

            Returns
            -------
            state: np.array
                batch of states or observations
            next_state: np.array
                next states or observations seen after executing action
            action: np.array
                batch of actions executed given a state
            reward: np.array
                rewards received as results of executing action
            done: np.array
                done[i] = 1 if executing action[i] resulted in
                the end of an episode and 0 otherwise.
            """
            # Uniform sampling with replacement.
            ind = np.random.randint(0, len(self.storage), size=batch_size)
            state, next_state, action, reward, done = [], [], [], [], []

            for i in ind:
                st, n_st, act, rew, dn = self.storage[i]
                state.append(np.asarray(st))
                next_state.append(np.asarray(n_st))
                action.append(np.asarray(act))
                reward.append(np.asarray(rew))
                done.append(np.asarray(dn))

            return (np.array(state), np.array(next_state), np.array(action),
                    np.array(reward).reshape(-1, 1), np.array(done).reshape(-1, 1))

Actor-Critic Neural Network

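This is a PyTorch implementation of the Actor-Critic reinforcement learning algorithm. The code defines two neural network models, an Actor and a Critic.

Actor model input: the environment state. Actor model output: an action with a continuous value.

Critic model input: the environment state and the action. Critic model output: the Q-value, that is, the expected total reward for the current state-action pair.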

    import torch
    import torch.nn as nn


    class Actor(nn.Module):
        """
        The Actor model takes in a state observation as input and
        outputs an action, which is a continuous value.

        It consists of four fully connected linear layers with ReLU activation
        functions between them. The final layer outputs one value per action dimension.
        """
        def __init__(self, n_states, action_dim, hidden1):
            super(Actor, self).__init__()
            self.net = nn.Sequential(
                nn.Linear(n_states, hidden1),
                nn.ReLU(),
                nn.Linear(hidden1, hidden1),
                nn.ReLU(),
                nn.Linear(hidden1, hidden1),
                nn.ReLU(),
                nn.Linear(hidden1, action_dim)  # one output per action dimension
            )

        def forward(self, state):
            return self.net(state)


    class Critic(nn.Module):
        """
        The Critic model takes in both a state observation and an action as input and
        outputs a Q-value, which estimates the expected total reward for the current
        state-action pair.

        It consists of four linear layers with ReLU activation functions.
        State and action inputs are concatenated before being fed into the first
        linear layer.

        The output layer has a single output, representing the Q-value.
        """
        def __init__(self, n_states, action_dim, hidden2):
            super(Critic, self).__init__()
            self.net = nn.Sequential(
                nn.Linear(n_states + action_dim, hidden2),
                nn.ReLU(),
                nn.Linear(hidden2, hidden2),
                nn.ReLU(),
                nn.Linear(hidden2, hidden2),
                nn.ReLU(),
                nn.Linear(hidden2, 1)  # single Q-value
            )

        def forward(self, state, action):
            # Concatenate along the feature dimension: (batch, n_states + action_dim)
            return self.net(torch.cat((state, action), 1))

Exploration Noise

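Adding noise to the action chosen by the Actor is a technique DDPG uses to encourage exploration and improve learning.

Either Gaussian noise or Ornstein-Uhlenbeck noise can be used. Gaussian noise is simple and easy to implement. Ornstein-Uhlenbeck noise is temporally correlated, which can help the agent explore the action space more effectively. Compared with Gaussian noise, however, Ornstein-Uhlenbeck noise fluctuates more smoothly and is less random.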

    import numpy as np
    import copy


    class OU_Noise(object):
        """Ornstein-Uhlenbeck process.
        code from :
        https://math.stackexchange.com/questions/1287634/implementing-ornstein-uhlenbeck-in-matlab
        The OU_Noise class has four attributes

            size: the size of the noise vector to be generated
            mu: the mean of the noise, set to 0 by default
            theta: the rate of mean reversion, controlling how quickly the noise returns to the mean
            sigma: the volatility of the noise, controlling the magnitude of fluctuations
        """
        def __init__(self, size, seed, mu=0., theta=0.15, sigma=0.2):
            self.mu = mu * np.ones(size)
            self.theta = theta
            self.sigma = sigma
            # Seed the generator that actually draws the samples.
            self.rng = np.random.default_rng(seed)
            self.reset()

        def reset(self):
            """Reset the internal state (= noise) to mean (mu)."""
            self.state = copy.copy(self.mu)

        def sample(self):
            """Update internal state and return it as a noise sample.
            This method uses the current state of the noise and generates the next sample
            """
            dx = self.theta * (self.mu - self.state) + self.sigma * self.rng.standard_normal(len(self.state))
            self.state += dx
            # Return a copy so callers cannot modify the internal state by accident.
            return self.state.copy()
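To see what "temporally correlated" means in practice, the following sketch compares the lag-1 autocorrelation of an Ornstein-Uhlenbeck sequence with that of independent Gaussian noise. It is only an illustration: the helper `lag1_autocorr` and the step count of 5000 are assumptions made here, and `theta` and `sigma` reuse the default values of the class above.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, mu = 0.15, 0.2, 0.0
n_steps = 5000  # arbitrary length chosen for this illustration

# Ornstein-Uhlenbeck: each sample is the previous one pulled toward mu plus fresh noise.
ou = np.zeros(n_steps)
for t in range(1, n_steps):
    ou[t] = ou[t - 1] + theta * (mu - ou[t - 1]) + sigma * rng.standard_normal()

# Independent Gaussian noise with the same per-step scale.
gauss = sigma * rng.standard_normal(n_steps)


def lag1_autocorr(x):
    """Correlation between the series and itself shifted by one step."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]


print("OU lag-1 autocorrelation:      ", lag1_autocorr(ou))
print("Gaussian lag-1 autocorrelation:", lag1_autocorr(gauss))
```

With this discretization each OU sample keeps a fraction `1 - theta` of the previous one, so its lag-1 autocorrelation should come out close to that value, while the independent Gaussian samples should be close to zero.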

To use Gaussian noise in DDPG, add Gaussian noise directly to the action the agent selects.
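A minimal sketch of that idea, assuming the action is a NumPy array and the environment's actions are bounded by `[-max_action, max_action]`. The helper name `add_gaussian_noise` and the parameter `noise_std` are hypothetical and used only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)


def add_gaussian_noise(action, noise_std, max_action):
    """Return the action perturbed by zero-mean Gaussian noise, clipped to the valid range."""
    noise = rng.normal(0.0, noise_std, size=action.shape)
    return np.clip(action + noise, -max_action, max_action)


action = np.array([0.9])  # stand-in for the Actor's output
noisy_action = add_gaussian_noise(action, noise_std=0.1, max_action=1.0)
print(noisy_action)
```

Clipping matters because the noise can push an action outside the range the environment accepts.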

DDPG

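DDPG (Deep Deterministic Policy Gradient) uses two sets of Actor-Critic neural networks for function approximation. The second set is the target networks, which have the same structure and parameterization as the Actor-Critic networks.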

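During training, the agent uses its Actor-Critic networks to interact with the environment and stores experience tuples (S, A, R, S+) in the Replay Buffer. The agent then samples from the Replay Buffer and uses that data to update the Actor-Critic networks. Instead of updating the target network weights by copying them directly from the Actor-Critic networks, DDPG updates them slowly through a process called soft target updates.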
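The update step regresses the Critic toward a bootstrapped target computed with the target networks. The sketch below works through that target numerically. All numbers are made up for illustration and do not come from a real run, and NumPy arrays stand in for the network outputs.

```python
import numpy as np

gamma = 0.99  # discount factor

reward = np.array([[1.0], [0.5], [100.0]])
done = np.array([[0.0], [0.0], [1.0]])          # 1 marks a terminal transition
next_q = np.array([[10.0], [20.0], [30.0]])     # stand-in for the target Critic's value of the next state
current_q = np.array([[10.0], [20.0], [90.0]])  # stand-in for the Critic's value of (state, action)

# Terminal transitions get no bootstrapped value: the (1 - done) mask zeroes it out.
target_q = reward + (1.0 - done) * gamma * next_q

# The Critic is trained to reduce the mean squared error against this target.
critic_loss = np.mean((current_q - target_q) ** 2)

print(target_q.ravel())  # approximately 10.9, 20.3 and 100.0
print(critic_loss)
```

The third transition is terminal, so its target is just the reward of 100.0, regardless of what the target Critic predicts for the next state.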

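A soft target update transfers a small fraction of the weights, set by the target update rate (τ), from the Actor-Critic networks to the target networks.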

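The soft target update formula is:

$$\theta_{\text{target}} \leftarrow \tau\,\theta + (1 - \tau)\,\theta_{\text{target}}$$

Here θ are the weights of the Actor-Critic networks and θ_target are the weights of the corresponding target networks.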

Using soft target updates greatly improves the stability of learning.
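To get a feel for how slowly the target networks move, here is a small numeric sketch. NumPy arrays stand in for network weights, the helper `soft_update` is a hypothetical name, and the online weights are held fixed for simplicity (in real training they change at every step).

```python
import numpy as np

tau = 0.001  # target update rate

online_weights = np.array([1.0, -2.0, 0.5])  # stand-in for the learned network's parameters
target_weights = np.zeros(3)                 # stand-in for the target network's parameters


def soft_update(target, online, tau):
    """Move the target a fraction tau of the way toward the online weights."""
    return tau * online + (1.0 - tau) * target


for _ in range(1000):
    target_weights = soft_update(target_weights, online_weights, tau)

# After n updates toward fixed online weights, the remaining gap shrinks by (1 - tau) ** n.
remaining_gap = (1.0 - tau) ** 1000
print(target_weights)
print(remaining_gap)
```

Even after 1000 updates with `tau = 0.001`, roughly a third of the initial gap remains, so the regression targets used by the Critic change only gradually.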

    import torch
    import torch.nn.functional as F
    import torch.optim as optim

    # Set Hyperparameters
    # Hyperparameters adapted for performance from
    capacity = 1000000
    batch_size = 64
    update_iteration = 200
    tau = 0.001       # tau for soft updating
    gamma = 0.99      # discount factor
    directory = './'
    hidden1 = 20      # hidden layer size for actor
    hidden2 = 64      # hidden layer size for critic (must be an int for nn.Linear)


    class DDPG(object):
        def __init__(self, state_dim, action_dim):
            """
            Initializes the DDPG agent.
            Takes two arguments:
                state_dim, the dimensionality of the state space, and
                action_dim, the dimensionality of the action space.

            Creates a replay buffer, the actor and critic networks and their
            corresponding target networks. It also initializes the optimizers for
            both networks along with counters that track the number of training
            iterations. Relies on the global `device` defined in the training setup.
            """
            self.replay_buffer = Replay_buffer(max_size=capacity)

            self.actor = Actor(state_dim, action_dim, hidden1).to(device)
            self.actor_target = Actor(state_dim, action_dim, hidden1).to(device)
            self.actor_target.load_state_dict(self.actor.state_dict())
            self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=3e-3)

            self.critic = Critic(state_dim, action_dim, hidden2).to(device)
            self.critic_target = Critic(state_dim, action_dim, hidden2).to(device)
            self.critic_target.load_state_dict(self.critic.state_dict())
            self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=2e-2)

            self.num_critic_update_iteration = 0
            self.num_actor_update_iteration = 0
            self.num_training = 0

        def select_action(self, state):
            """
            Takes the current state as input and returns an action to take in that state.
            It uses the actor network to map the state to an action.
            """
            state = torch.FloatTensor(state.reshape(1, -1)).to(device)
            return self.actor(state).cpu().data.numpy().flatten()

        def update(self):
            """
            Updates the actor and critic networks using batches sampled from the replay buffer.

            For each batch, it computes the target Q value using the target critic
            network and the target actor network, then computes the current Q value
            using the critic network and the actions stored in the batch.

            The critic loss is the mean squared error between the target Q value and
            the current Q value, and the critic is updated by gradient descent.

            The actor loss is the negative mean Q value that the critic assigns to the
            actor's actions, so minimizing it performs gradient ascent on Q.

            Finally, the target networks are updated with soft updates, where a small
            fraction of the actor and critic weights is transferred to their target
            counterparts. This is repeated for a fixed number of iterations.
            """
            for it in range(update_iteration):
                # Sample a batch from the replay buffer
                state, next_state, action, reward, done = self.replay_buffer.sample(batch_size)
                state = torch.FloatTensor(state).to(device)
                action = torch.FloatTensor(action).to(device)
                next_state = torch.FloatTensor(next_state).to(device)
                # After this line `done` holds (1 - done): 0 for terminal transitions
                done = torch.FloatTensor(1 - done).to(device)
                reward = torch.FloatTensor(reward).to(device)

                # Compute the target Q value
                target_Q = self.critic_target(next_state, self.actor_target(next_state))
                target_Q = reward + (done * gamma * target_Q).detach()

                # Get current Q estimate
                current_Q = self.critic(state, action)

                # Compute critic loss
                critic_loss = F.mse_loss(current_Q, target_Q)

                # Optimize the critic
                self.critic_optimizer.zero_grad()
                critic_loss.backward()
                self.critic_optimizer.step()

                # Compute actor loss as the negative mean Q value of the actor's actions
                actor_loss = -self.critic(state, self.actor(state)).mean()

                # Optimize the actor
                self.actor_optimizer.zero_grad()
                actor_loss.backward()
                self.actor_optimizer.step()

                # Soft-update the target models: a fraction tau of the actor and
                # critic weights is transferred to their target counterparts.
                for param, target_param in zip(self.critic.parameters(), self.critic_target.parameters()):
                    target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)

                for param, target_param in zip(self.actor.parameters(), self.actor_target.parameters()):
                    target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)

                self.num_actor_update_iteration += 1
                self.num_critic_update_iteration += 1

        def save(self):
            """
            Saves the state dictionaries of the actor and critic networks to files
            """
            torch.save(self.actor.state_dict(), directory + 'actor.pth')
            torch.save(self.critic.state_dict(), directory + 'critic.pth')

        def load(self):
            """
            Loads the state dictionaries of the actor and critic networks from files
            """
            self.actor.load_state_dict(torch.load(directory + 'actor.pth'))
            self.critic.load_state_dict(torch.load(directory + 'critic.pth'))

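Training DDPG

Here we use OpenAI Gym's "MountainCarContinuous-v0" to train our DDPG RL model. The environment provides continuous action and observation spaces, and the goal is to get the car to the top of the hill as quickly as possible.

Below we define the algorithm's parameters, such as the maximum number of training episodes, the exploration noise, and the logging interval. Using a fixed random seed makes the run reproducible.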

    import gym
    import numpy as np
    import torch

    # Create the environment.
    # Note: this code is written against the classic Gym API (env.seed, reset
    # returning only the observation, 4-tuple env.step). Newer Gym and Gymnasium
    # releases changed these signatures.
    env_name = 'MountainCarContinuous-v0'
    env = gym.make(env_name)
    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    # Define different parameters for training the agent
    max_episode = 100
    max_time_steps = 5000
    ep_r = 0
    total_step = 0
    score_hist = []
    # for rendering the environment
    render = True
    render_interval = 10
    # for reproducibility
    env.seed(0)
    torch.manual_seed(0)
    np.random.seed(0)
    # Environment action and states
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.shape[0]
    max_action = float(env.action_space.high[0])
    min_Val = torch.tensor(1e-7).float().to(device)

    # Exploration noise
    exploration_noise = 0.1 * max_action

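Create an instance of the DDPG agent class and train the agent for the specified number of episodes. At the end of each episode, the agent's update() method is called to update the parameters, and every ten episodes the save() method writes the agent's parameters to a file.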

    # Create a DDPG instance
    agent = DDPG(state_dim, action_dim)

    # Train the agent for max_episode episodes
    for i in range(max_episode):
        total_reward = 0
        step = 0
        state = env.reset()
        for t in range(max_time_steps):
            action = agent.select_action(state)
            # Add Gaussian noise to actions for exploration
            action = (action + np.random.normal(0, 1, size=action_dim)).clip(-max_action, max_action)
            # action += ou_noise.sample()
            next_state, reward, done, info = env.step(action)
            total_reward += reward
            if render and i >= render_interval:
                env.render()
            # np.float was removed from recent NumPy releases, so use the built-in float
            agent.replay_buffer.push((state, next_state, action, reward, float(done)))
            state = next_state
            if done:
                break
            step += 1

        score_hist.append(total_reward)
        total_step += step + 1
        print("Episode: \t{}  Total Reward: \t{:0.2f}".format(i, total_reward))
        agent.update()
        if i % 10 == 0:
            agent.save()
    env.close()
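The training loop collects per-episode rewards in `score_hist` but only prints them. Since single-episode rewards are noisy, one way to judge progress is to smooth them. The sketch below is one possible helper: `moving_average` is a hypothetical function, and the example scores are made up purely for illustration.

```python
def moving_average(values, window):
    """Average of the last `window` entries at each position (shorter at the start)."""
    averages = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        averages.append(sum(chunk) / len(chunk))
    return averages


# Made-up rewards purely for illustration; in the training loop this would be score_hist.
example_scores = [-10.0, -8.0, -12.0, 50.0, 90.0]
smoothed = moving_average(example_scores, window=3)
print(smoothed)
```

A rising smoothed curve is a more reliable sign of learning than any single high-reward episode.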

Testing DDPG

    from itertools import count

    test_iteration = 100

    for i in range(test_iteration):
        state = env.reset()
        for t in count():
            # No exploration noise at test time: use the Actor's action directly
            action = agent.select_action(state)
            next_state, reward, done, info = env.step(np.float32(action))
            ep_r += reward
            print(reward)
            env.render()
            if done:
                print("reward{}".format(reward))
                print("Episode \t{}, the episode reward is \t{:0.2f}".format(i, ep_r))
                ep_r = 0
                env.render()
                break
            state = next_state

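We used the following parameters to make the model converge:

Sampled the noise from a standard normal distribution instead of sampling at random.
Changed the polyak constant (tau) from 0.99 to 0.001.
Changed the Critic network's hidden layer sizes to [64,64]. Removed the ReLU activation after the second layer of the Critic network, changing it to (Linear, ReLU, Linear, Linear).
Changed the maximum buffer size to 1000000.
Changed batch_size from 128 to 64.

The result after 75 training episodes is as follows: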

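Summary

DDPG is a model-free, off-policy Actor-Critic algorithm inspired by the deep Q-Network (DQN) algorithm. It combines the strengths of policy gradient methods and Q-learning to learn a deterministic policy over a continuous action space.

Like DQN, it uses a replay buffer to store past experience and target networks to train the networks, which improves the stability of training.

DDPG needs careful hyperparameter tuning to reach its best performance. The hyperparameters include the learning rate, batch size, target network update rate, and exploration noise parameters. Small changes to them can have a large effect on the algorithm's performance.

The parameters above come from: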

https://avoid.overfit.cn/post/9951ac196ec84629968ce7168215e461

Author: Renu Khandelwal