On machine learning: a PyTorch implementation and step-by-step walkthrough of DDPG reinforcement learning

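Deep Deterministic Policy Gradient (DDPG) is a model-free, off-policy deep reinforcement learning algorithm inspired by Deep Q-Network. It is an Actor-Critic method built on policy gradients. This article implements it in full with PyTorch and explains each part.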

The key components of DDPG are:

Replay Buffer
Actor-Critic neural network
Exploration Noise
Target network
Soft Target Updates for Target Network

Below we implement them one by one:

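DDPG uses a Replay Buffer to store the transitions and rewards (Sₜ, aₜ, Rₜ, Sₜ₊₁) sampled while exploring the environment. The Replay Buffer plays a crucial role in helping the agent learn faster and in keeping DDPG stable:

Minimizing correlation between samples: storing past experience in the Replay Buffer lets the agent learn from a wide variety of experiences.
Enabling off-policy learning: the agent samples transitions from the replay buffer rather than from the current policy.
Efficient sampling: because past experience is kept in the buffer, the agent can learn from the same experiences multiple times.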

    import numpy as np


    class Replay_buffer():
        '''
        Code based on:
        https://github.com/openai/baselines/blob/master/baselines/deepq/replay_buffer.py
        Expects tuples of (state, next_state, action, reward, done)
        '''
        def __init__(self, max_size=1000000):
            """Create Replay buffer.
            Parameters
            ----------
            max_size: int
                Max number of transitions to store in the buffer. When the buffer
                overflows the old memories are dropped.
                The default matches `capacity` in the hyperparameters.
            """
            self.storage = []
            self.max_size = max_size
            self.ptr = 0  # index of the oldest transition, overwritten next

        def push(self, data):
            if len(self.storage) == self.max_size:
                # Buffer is full: overwrite the oldest transition (ring buffer)
                self.storage[int(self.ptr)] = data
                self.ptr = (self.ptr + 1) % self.max_size
            else:
                self.storage.append(data)

        def sample(self, batch_size):
            """Sample a batch of experiences.
            Parameters
            ----------
            batch_size: int
                How many transitions to sample.
            Returns
            -------
            state: np.array
                batch of states or observations
            next_state: np.array
                next states or observations seen after executing action
            action: np.array
                batch of actions executed given a state
            reward: np.array
                rewards received as results of executing action
            done: np.array
                done[i] = 1 if executing action[i] resulted in
                the end of an episode and 0 otherwise.
            """
            # Uniform sampling with replacement over the stored transitions
            ind = np.random.randint(0, len(self.storage), size=batch_size)
            state, next_state, action, reward, done = [], [], [], [], []

            for i in ind:
                st, n_st, act, rew, dn = self.storage[i]
                # np.asarray avoids a copy when the item is already an array
                state.append(np.asarray(st))
                next_state.append(np.asarray(n_st))
                action.append(np.asarray(act))
                reward.append(np.asarray(rew))
                done.append(np.asarray(dn))

            # reward and done are returned as column vectors of shape (batch_size, 1)
            return (np.array(state), np.array(next_state), np.array(action),
                    np.array(reward).reshape(-1, 1), np.array(done).reshape(-1, 1))
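
The overwrite-when-full behaviour is easier to see in isolation. The following is a minimal sketch, not the article's class: it swaps in `collections.deque` with a hypothetical capacity of 3 and dummy transitions, purely to illustrate how old experience is dropped and how a batch is drawn uniformly with replacement.

```python
import random
from collections import deque

# Hypothetical tiny capacity so the overwrite is easy to see
buffer = deque(maxlen=3)

# Push five dummy transitions: (state, next_state, action, reward, done)
for t in range(5):
    buffer.append((t, t + 1, 0.0, -1.0, False))

# Only the three most recent transitions remain in the buffer
stored_states = [transition[0] for transition in buffer]
print(stored_states)

# Uniform sampling with replacement, analogous to drawing random indices
random.seed(0)
batch = [random.choice(buffer) for _ in range(4)]
print(len(batch))
```

Because sampling is with replacement, a batch can be larger than the number of stored transitions, which is also true of the `np.random.randint` approach used above.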

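This is a PyTorch implementation of the Actor-Critic part of the algorithm. The code defines two neural network models, an Actor and a Critic.

Actor model input: the environment state; Actor model output: an action with continuous values.

Critic model input: the environment state and the action; Critic model output: the Q-value, i.e. the expected total reward of the current state-action pair.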

    import torch
    import torch.nn as nn


    class Actor(nn.Module):
        """
        The Actor model takes in a state observation as input and
        outputs an action, which is a continuous value.

        It consists of four fully connected linear layers with ReLU activation
        functions between them; the final layer outputs one value per action dimension.
        """
        def __init__(self, n_states, action_dim, hidden1):
            super(Actor, self).__init__()
            self.net = nn.Sequential(
                nn.Linear(n_states, hidden1),
                nn.ReLU(),
                nn.Linear(hidden1, hidden1),
                nn.ReLU(),
                nn.Linear(hidden1, hidden1),
                nn.ReLU(),
                # One output per action dimension (1 for MountainCarContinuous-v0)
                nn.Linear(hidden1, action_dim)
            )

        def forward(self, state):
            return self.net(state)


    class Critic(nn.Module):
        """
        The Critic model takes in both a state observation and an action as input and
        outputs a Q-value, which estimates the expected total reward for the current state-action pair.

        It consists of four linear layers with ReLU activation functions.
        State and action inputs are concatenated before being fed into the first linear layer.

        The output layer has a single output, representing the Q-value.
        """
        def __init__(self, n_states, action_dim, hidden2):
            super(Critic, self).__init__()
            self.net = nn.Sequential(
                nn.Linear(n_states + action_dim, hidden2),
                nn.ReLU(),
                nn.Linear(hidden2, hidden2),
                nn.ReLU(),
                nn.Linear(hidden2, hidden2),
                nn.ReLU(),
                # A single scalar Q-value per (state, action) pair
                nn.Linear(hidden2, 1)
            )

        def forward(self, state, action):
            # Concatenate along the feature dimension: (batch, n_states + action_dim)
            return self.net(torch.cat((state, action), 1))

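Adding noise to the action chosen by the Actor is a technique used in DDPG to encourage exploration and improve learning.

Either Gaussian noise or Ornstein-Uhlenbeck noise can be used. Gaussian noise is simple and easy to implement, while Ornstein-Uhlenbeck noise is temporally correlated, which can help the agent explore the action space more effectively. Compared with Gaussian noise, however, Ornstein-Uhlenbeck noise fluctuates more smoothly and is less random.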

    import copy

    import numpy as np


    class OU_Noise(object):
        """Ornstein-Uhlenbeck process.
        code from :
        https://math.stackexchange.com/questions/1287634/implementing-ornstein-uhlenbeck-in-matlab
        The OU_Noise class has four attributes

            size: the size of the noise vector to be generated
            mu: the mean of the noise, set to 0 by default
            theta: the rate of mean reversion, controlling how quickly the noise returns to the mean
            sigma: the volatility of the noise, controlling the magnitude of fluctuations
        """
        def __init__(self, size, seed, mu=0., theta=0.15, sigma=0.2):
            self.mu = mu * np.ones(size)
            self.theta = theta
            self.sigma = sigma
            # Use a dedicated NumPy generator so that `seed` actually controls the noise
            # (random.seed() returns None and does not affect NumPy's random numbers)
            self.rng = np.random.default_rng(seed)
            self.reset()

        def reset(self):
            """Reset the internal state (= noise) to mean (mu)."""
            self.state = copy.copy(self.mu)

        def sample(self):
            """Update internal state and return it as a noise sample.
            This method uses the current state of the noise and generates the next sample
            """
            dx = self.theta * (self.mu - self.state) + self.sigma * self.rng.standard_normal(len(self.state))
            # Build a new array so that previously returned samples are not modified
            self.state = self.state + dx
            return self.state
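
To make "temporally correlated" concrete, here is a small sketch that compares the lag-1 autocorrelation of an Ornstein-Uhlenbeck sequence with that of independent Gaussian noise. It uses only the standard library, a scalar state, and the same `theta` and `sigma` defaults as above; the helper `lag1_autocorr` and the sample count are illustrative assumptions, not part of the original code.

```python
import random


def lag1_autocorr(xs):
    """Sample lag-1 autocorrelation of a sequence."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return cov / var


rng = random.Random(0)
theta, sigma, mu = 0.15, 0.2, 0.0

# Ornstein-Uhlenbeck: each sample starts from the previous one and is pulled toward mu
ou, x = [], mu
for _ in range(5000):
    x += theta * (mu - x) + sigma * rng.gauss(0.0, 1.0)
    ou.append(x)

# Independent Gaussian noise with the same sigma
white = [sigma * rng.gauss(0.0, 1.0) for _ in range(5000)]

ou_corr = lag1_autocorr(ou)
white_corr = lag1_autocorr(white)
print(f"OU lag-1 autocorrelation:       {ou_corr:.2f}")
print(f"Gaussian lag-1 autocorrelation: {white_corr:.2f}")
```

With this update rule each OU sample keeps a fraction `1 - theta` of the previous one, so its autocorrelation should come out well above zero, while the independent Gaussian samples should be close to zero.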

To use Gaussian noise in DDPG, simply add it directly to the action selected by the agent.
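
As a rough sketch of that idea, the function below adds zero-mean Gaussian noise to each action component and clips the result to the valid range. The name `noisy_action`, the noise scale `sigma=0.5`, and the example action are hypothetical choices for illustration only.

```python
import random


def noisy_action(action, max_action, sigma, rng):
    """Add zero-mean Gaussian noise to each action component, then clip to the valid range."""
    noisy = [a + rng.gauss(0.0, sigma) for a in action]
    return [max(-max_action, min(max_action, a)) for a in noisy]


rng = random.Random(0)

# Perturb the same deterministic action many times
actions = [noisy_action([0.9], max_action=1.0, sigma=0.5, rng=rng) for _ in range(1000)]

print(min(a[0] for a in actions), max(a[0] for a in actions))
```

Clipping matters because the environment only accepts actions within its bounds; without it, a large noise draw could push the action outside the action space.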

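DDPG (Deep Deterministic Policy Gradient) uses two sets of Actor-Critic neural networks for function approximation. The target networks are themselves an Actor-Critic pair, with the same structure and parameterization as the main Actor-Critic networks.

During training, the agent interacts with the environment using its Actor-Critic networks and stores experience tuples (Sₜ, Aₜ, Rₜ, Sₜ₊₁) in the Replay Buffer. It then samples from the Replay Buffer and uses that data to update the Actor-Critic networks. Rather than updating the target network weights by copying them directly from the Actor-Critic networks, DDPG updates them slowly through a process called soft target updates.

In a soft target update, only a small fraction of the weights, set by the target update rate (τ), is transferred from the Actor-Critic networks to the target networks.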

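The soft target update formula is:

θ_target ← τ·θ + (1 − τ)·θ_target

where θ are the weights of the Actor-Critic networks and θ_target are the weights of the corresponding target networks.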

Using soft target updates greatly improves the stability of learning.
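
A soft update can be sketched on plain lists of numbers, without any neural network. The snippet below is only an illustration under that simplification: `soft_update` is a hypothetical helper, and the weight values are made up. It blends the target toward the source by a fraction `tau`, the same rule the DDPG code applies to each parameter tensor.

```python
def soft_update(source, target, tau):
    """Return target weights moved a fraction tau toward the source weights."""
    return [tau * s + (1.0 - tau) * t for s, t in zip(source, target)]


source = [1.0, -2.0]   # stand-in for the main network's weights
target = [0.0, 0.0]    # stand-in for the target network's weights

# A single update barely moves the target
one_step = soft_update(source, target, tau=0.001)
print(one_step)

# Many updates with a fixed source: the target slowly approaches the source
slow = target
for _ in range(1000):
    slow = soft_update(source, slow, tau=0.001)
print(slow)
```

The slow drift is the point: the targets used to train the critic change gradually instead of jumping every time the main networks are updated.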

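    import torch
    import torch.nn.functional as F
    import torch.optim as optim

    # Set Hyperparameters
    # Hyperparameters adapted for performance from
    capacity = 1000000        # maximum size of the replay buffer
    batch_size = 64
    update_iteration = 200    # gradient steps per call to update()
    tau = 0.001               # tau for soft updating
    gamma = 0.99              # discount factor
    directory = './'
    hidden1 = 20              # hidden layer size for actor
    hidden2 = 64              # hidden layer size for critic (must be an int, not 64.)
    device = 'cuda' if torch.cuda.is_available() else 'cpu'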
 
    class DDPG(object):
        def __init__(self, state_dim, action_dim):
            """
            Initializes the DDPG agent.
            Takes two arguments:
                state_dim, the dimensionality of the state space, and
                action_dim, the dimensionality of the action space.

            Creates a replay buffer, the actor and critic networks and their corresponding target networks.
            It also initializes the optimizers for both the actor and critic networks along with
            counters to track the number of training iterations.
            """
            self.replay_buffer = Replay_buffer(max_size=capacity)

            self.actor = Actor(state_dim, action_dim, hidden1).to(device)
            self.actor_target = Actor(state_dim, action_dim, hidden1).to(device)
            # Target network starts as an exact copy of the main network
            self.actor_target.load_state_dict(self.actor.state_dict())
            self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=3e-3)

            self.critic = Critic(state_dim, action_dim, hidden2).to(device)
            self.critic_target = Critic(state_dim, action_dim, hidden2).to(device)
            self.critic_target.load_state_dict(self.critic.state_dict())
            self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=2e-2)

            self.num_critic_update_iteration = 0
            self.num_actor_update_iteration = 0
            self.num_training = 0

        def select_action(self, state):
            """
            Takes the current state as input and returns an action to take in that state.
            It uses the actor network to map the state to an action.
            """
            state = torch.FloatTensor(state.reshape(1, -1)).to(device)
            # No gradients are needed when acting in the environment
            with torch.no_grad():
                action = self.actor(state)
            return action.cpu().numpy().flatten()

        def update(self):
            """
            Updates the actor and critic networks using batches of samples from the replay buffer.
            For each batch, it computes the target Q value using the target critic network and the target actor network.
            It then computes the current Q value using the critic network and the action stored in the replay buffer.

            It computes the critic loss as the mean squared error between the target Q value and the current Q value, and
            updates the critic network using gradient descent.

            It then computes the actor loss as the negative mean Q value using the critic network and the actor network, and
            minimizes it, which is gradient ascent on the Q value.

            Finally, it updates the target networks using
            soft updates, where a small fraction of the actor and critic network weights are transferred to their target counterparts.
            This process is repeated for a fixed number of iterations.
            """

            for it in range(update_iteration):
                # Sample a mini-batch of transitions from the replay buffer
                state, next_state, action, reward, done = self.replay_buffer.sample(batch_size)
                state = torch.FloatTensor(state).to(device)
                action = torch.FloatTensor(action).to(device)
                next_state = torch.FloatTensor(next_state).to(device)
                # 1 for non-terminal transitions, 0 where the episode ended
                not_done = torch.FloatTensor(1 - done).to(device)
                reward = torch.FloatTensor(reward).to(device)

                # Compute the target Q value: r + gamma * (1 - done) * Q'(s', mu'(s'))
                target_Q = self.critic_target(next_state, self.actor_target(next_state))
                target_Q = reward + (not_done * gamma * target_Q).detach()

                # Get current Q estimate
                current_Q = self.critic(state, action)

                # Compute critic loss
                critic_loss = F.mse_loss(current_Q, target_Q)

                # Optimize the critic
                self.critic_optimizer.zero_grad()
                critic_loss.backward()
                self.critic_optimizer.step()

                # Compute actor loss as the negative mean Q value using the critic network and the actor network
                actor_loss = -self.critic(state, self.actor(state)).mean()

                # Optimize the actor
                self.actor_optimizer.zero_grad()
                actor_loss.backward()
                self.actor_optimizer.step()

                # Update the frozen target models using soft updates, where tau,
                # a small fraction of the actor and critic network weights,
                # is transferred to their target counterparts.
                for param, target_param in zip(self.critic.parameters(), self.critic_target.parameters()):
                    target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)

                for param, target_param in zip(self.actor.parameters(), self.actor_target.parameters()):
                    target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)

                self.num_actor_update_iteration += 1
                self.num_critic_update_iteration += 1

        def save(self):
            """Saves the state dictionaries of the actor and critic networks to files"""
            torch.save(self.actor.state_dict(), directory + 'actor.pth')
            torch.save(self.critic.state_dict(), directory + 'critic.pth')

        def load(self):
            """Loads the state dictionaries of the actor and critic networks from files"""
            self.actor.load_state_dict(torch.load(directory + 'actor.pth', map_location=device))
            self.critic.load_state_dict(torch.load(directory + 'critic.pth', map_location=device))
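
The critic target used in `update()` can be checked by hand on scalar values. The sketch below is a plain-Python illustration with made-up numbers; `td_target` and `mse` are hypothetical helpers standing in for the tensor operations in the class above.

```python
def td_target(reward, done, next_q, gamma=0.99):
    """Critic target: r + gamma * (1 - done) * Q_target(s', actor_target(s'))."""
    return reward + gamma * (1.0 - done) * next_q


def mse(predictions, targets):
    """Mean squared error between two equally long lists of numbers."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)


# Non-terminal transition: bootstrap from the target critic's estimate of the next state
y_continue = td_target(reward=1.0, done=0.0, next_q=10.0)

# Terminal transition: no bootstrapping, the target is just the reward
y_terminal = td_target(reward=1.0, done=1.0, next_q=10.0)

# Critic loss for two made-up current Q estimates
critic_loss = mse([10.0, 2.0], [y_continue, y_terminal])

print(y_continue, y_terminal, critic_loss)
```

Multiplying by `1 - done` is what stops the target from including value beyond the end of an episode.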

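Here we use OpenAI Gym's "MountainCarContinuous-v0" to train our DDPG RL model. The environment provides continuous action and observation spaces, and the goal is to get the car to the top of the mountain as quickly as possible.

Below we define the various parameters of the algorithm, such as the maximum number of training episodes, the exploration noise, and the rendering interval. Using a fixed random seed makes the run reproducible.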

    import gym
    import numpy as np
    import torch

    # Create the environment.
    # Note: this code uses the classic Gym API (gym < 0.26), where env.seed() exists,
    # reset() returns only the observation and step() returns four values.
    env_name = 'MountainCarContinuous-v0'
    env = gym.make(env_name)
    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    # Define different parameters for training the agent
    max_episode = 100
    max_time_steps = 5000
    ep_r = 0
    total_step = 0
    score_hist = []
    # for rendering the environment
    render = True
    render_interval = 10
    # for reproducibility
    env.seed(0)
    torch.manual_seed(0)
    np.random.seed(0)
    # Environment action and state dimensions
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.shape[0]
    max_action = float(env.action_space.high[0])
    min_Val = torch.tensor(1e-7).float().to(device)

    # Exploration Noise
    exploration_noise = 0.1 * max_action

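We create an instance of the DDPG agent class and train it for the specified number of episodes. At the end of each episode the agent's update() method is called to update the parameters, and every ten episodes the save() method writes the agent's parameters to a file.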

    # Create a DDPG instance
    agent = DDPG(state_dim, action_dim)

    # Train the agent for max_episode episodes
    for i in range(max_episode):
        total_reward = 0
        step = 0
        state = env.reset()
        for t in range(max_time_steps):
            action = agent.select_action(state)
            # Add Gaussian noise to actions for exploration, then clip to the valid range
            action = (action + np.random.normal(0, 1, size=action_dim)).clip(-max_action, max_action)
            # action += ou_noise.sample()
            next_state, reward, done, info = env.step(action)
            total_reward += reward
            if render and i >= render_interval:
                env.render()
            # np.float was removed from NumPy; the built-in float does the same job
            agent.replay_buffer.push((state, next_state, action, reward, float(done)))
            state = next_state
            if done:
                break
            step += 1

        score_hist.append(total_reward)
        total_step += step + 1
        print("Episode: \t{}  Total Reward: \t{:0.2f}".format(i, total_reward))
        # Update the networks once the episode has finished
        agent.update()
        if i % 10 == 0:
            agent.save()
    env.close()

    from itertools import count

    test_iteration = 100

    for i in range(test_iteration):
        state = env.reset()
        for t in count():
            # No exploration noise at test time: use the actor's action directly
            action = agent.select_action(state)
            next_state, reward, done, info = env.step(np.float32(action))
            ep_r += reward
            print(reward)
            env.render()
            if done:
                print("reward{}".format(reward))
                print("Episode \t{}, the episode reward is \t{:0.2f}".format(i, ep_r))
                ep_r = 0
                break
            state = next_state

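We used the following settings to get the model to converge:

Noise is sampled from a standard normal distribution rather than sampled at random.
Changed the polyak constant (tau) from 0.99 to 0.001.
Changed the hidden layer sizes of the Critic network to [64, 64]. Removed the ReLU activation after the second layer of the Critic network, giving (Linear, ReLU, Linear, Linear).
Changed the maximum buffer size to 1000000.
Changed batch_size from 128 to 64.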
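
To get a feel for what moving tau from 0.99 to 0.001 means, one can ask how many soft updates it takes until half of the old target weights have been replaced. This is a back-of-the-envelope sketch that assumes the main network's weights stay fixed and uses the update rule target ← tau·source + (1 − tau)·target from the code; `target_half_life` is a hypothetical helper name.

```python
import math


def target_half_life(tau):
    """Number of soft updates until half of the old target weight has been replaced.

    After n updates the old target contributes (1 - tau) ** n, so we solve
    (1 - tau) ** n = 0.5 for n.
    """
    return math.log(0.5) / math.log(1.0 - tau)


for tau in (0.99, 0.001):
    print(f"tau={tau}: half-life is about {target_half_life(tau):.1f} updates")
```

With tau = 0.99 the target network is almost a direct copy after a single update, whereas tau = 0.001 spreads the change over several hundred updates.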

The result after 75 episodes of training is as follows:

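DDPG is a model-free, off-policy Actor-Critic algorithm inspired by the deep Q-Network (DQN) algorithm. It combines the strengths of policy gradient methods and Q-learning to learn a deterministic policy over continuous action spaces.

Like DQN, it uses a replay buffer to store past experience and target networks to train the networks, which improves the stability of training.

DDPG needs careful hyperparameter tuning to reach its best performance. The hyperparameters include the learning rates, the batch size, the target network update rate, and the exploration noise parameters. Small changes to them can have a large effect on the algorithm's performance.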
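
The "deterministic policy" part can be illustrated with a deliberately tiny example. The sketch below assumes a made-up critic whose Q-value peaks at action 2.0 and a policy that outputs one fixed action; both are hypothetical and only meant to show how following the critic's gradient with respect to the action improves the policy, which is what minimizing the negative Q value does in DDPG.

```python
def q_value(action):
    """Hypothetical critic: the best action is 2.0."""
    return -(action - 2.0) ** 2


def dq_da(action):
    """Derivative of the toy critic with respect to the action."""
    return -2.0 * (action - 2.0)


# A 'policy' with one parameter that outputs the same action in every state
policy_action = -1.0
learning_rate = 0.1

for _ in range(100):
    # Gradient ascent on Q: move the action in the direction that raises Q
    policy_action += learning_rate * dq_da(policy_action)

print(policy_action, q_value(policy_action))
```

In the real algorithm the critic is learned rather than known, and the gradient flows through the critic into the actor's weights via backpropagation.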

The parameters above come from:

https://avoid.overfit.cn/post/9951ac196ec84629968ce7168215e461

Author: Renu Khandelwal
