关于python:利用强化学习QLearning实现最短路径算法

如果你是一名计算机专业的学生，有对图论有根本的理解，那么你肯定晓得一些驰名的最优门路解，如Dijkstra算法、Bellman-Ford算法和a*算法(A-Star)等。

这些算法都是大佬们通过有数小时的致力才发现的，然而当初曾经是人工智能的时代，强化学习算法可能为咱们提出和前辈一样好的解决方案吗?

本文中咱们将尝试找出一种办法，在从目的地a挪动到目的地B时尽可能减少遍历门路。咱们应用本人的创立虚构数据来提供演示，上面代码将创立虚构的交通网格：

 importnetworkxasnx  # Create the graph object G=nx.Graph()  # Define the nodes nodes= ['New York, NY', 'Los Angeles, CA', 'Chicago, IL', 'Houston, TX', 'Phoenix, AZ', 'Dallas, TX', 'Miami, FL']  # Add the nodes to the graph G.add_nodes_from(nodes)  # Define the edges and their distances edges= [('New York, NY', 'Chicago, IL', {'distance': 790}),           ('New York, NY', 'Miami, FL', {'distance': 1300}),          ('Chicago, IL', 'Dallas, TX', {'distance': 960}),          ('Dallas, TX', 'Houston, TX', {'distance': 240}),          ('Houston, TX', 'Phoenix, AZ', {'distance': 1170}),          ('Phoenix, AZ', 'Los Angeles, CA', {'distance': 380}),          ('Los Angeles, CA', 'Dallas, TX', {'distance': 1240}),          ('Los Angeles, CA', 'Chicago, IL', {'distance': 2010})]  # Add the edges to the graph G.add_edges_from(edges)

运行起来没有报错，然而咱们不晓得数据是什么样子的，所以让咱们先进行可视化，理解数据：

 importmatplotlib.pyplotasplt  # set positions for the nodes (optional) pos=nx.spring_layout(G)  # draw the nodes and edges nx.draw_networkx_nodes(G, pos, node_color='lightblue', node_size=500) nx.draw_networkx_edges(G, pos, edge_color='gray', width=2)  # draw edge labels edge_labels=nx.get_edge_attributes(G, 'weight') nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels)  # draw node labels node_labels= {node: node.split(',')[0] fornodeinG.nodes()} nx.draw_networkx_labels(G, pos, labels=node_labels)  # show the plot plt.axis('off') plt.show()

咱们有了一个根本的节点网络。然而这感觉太简略了。对于一个强化学习代理来说，这基本上没有难度，所以咱们减少更多的节点:

这样就简单多了，然而它看起来很凌乱，比方从New York 到 Arizona就可能是一个挑战。

咱们这里应用最常见且通用的Q-Learning来解决这个问题，因为它有动作-状态对矩阵，能够帮忙确定最佳的动作。在寻找图中最短门路的状况下，Q-Learning能够通过迭代更新每个状态-动作对的q值来确定两个节点之间的最优门路。

上图为q值的演示。

上面咱们开始实现本人的Q-Learning

 importnetworkxasnx importnumpyasnp  defq_learning_shortest_path(G, start_node, end_node, learning_rate=0.8, discount_factor=0.95, epsilon=0.2, num_episodes=1000):     """     Calculates the shortest path in a graph G using Q-learning algorithm.      Parameters:         G (networkx.Graph): the graph         start_node: the starting node         end_node: the destination node         learning_rate (float): the learning rate (default=0.8)         discount_factor (float): the discount factor (default=0.95)         epsilon (float): the exploration factor (default=0.2)         num_episodes (int): the number of episodes (default=1000)      Returns:         A list with the shortest path from start_node to end_node.     """

咱们的输出是整个的图，还有开始和完结的节点，首先就须要提取每个节点之间的间隔，将其提供给Q-learning算法。

 # Extract nodes and edges data     nodes=list(G.nodes())     num_nodes=len(nodes)     edges=list(G.edges(data=True))     num_edges=len(edges)     edge_distances=np.zeros((num_nodes, num_nodes))     fori, j, datainedges:         edge_distances[nodes.index(i), nodes.index(j)] =data['weight']         edge_distances[nodes.index(j), nodes.index(i)] =data['weight']

创立一个Q-table ，这样咱们就能够在不断更新模型的同时更新值。

 # Initialize Q-values table     q_table=np.zeros((num_nodes, num_nodes))          # Convert start and end node to node indices     start_node_index=nodes.index(start_node)     end_node_index=nodes.index(end_node)

上面就是强化学习算法的外围!

 # Q-learning algorithm     forepisodeinrange(num_episodes):         current_node=start_node_index         print(episode)         whilecurrent_node!=end_node_index:             # Choose action based on epsilon-greedy policy             ifnp.random.uniform(0, 1) <epsilon:                 # Explore                 possible_actions=np.where(edge_distances[current_node,:] >0)[0]                 iflen(possible_actions) ==0:                     break                 action=np.random.choice(possible_actions)             else:                 # Exploit                 possible_actions=np.where(q_table[current_node,:] ==np.max(q_table[current_node,:]))[0]                 iflen(possible_actions) ==0:                     break                 action=np.random.choice(possible_actions)              # Calculate reward and update Q-value             next_node=action             reward=-edge_distances[current_node, next_node]             q_table[current_node, next_node] = (1-learning_rate) *q_table[current_node, next_node] +learning_rate* (reward+discount_factor*np.max(q_table[next_node, :]))             # Move to next node             current_node=next_node             ifcurrent_node==end_node_index:                 break     print(q_table)

这里须要留神的事件是，咱们激励模型摸索还是利用一个特定的门路。

大多数强化算法都是基于这种简略的衡量制订的。过多的摸索的问题在于它可能导致代理破费太多工夫摸索环境，而没有足够的工夫利用它曾经学到的常识，可能导致代理采取次优口头并最终无奈实现其指标。如果摸索率设置得太高，代理可能永远不会收敛到最优策略。然而如果摸索率设置得太低，代理可能会陷入次优策略。所以，须要在摸索和利用之间获得均衡，确保代理进行足够的摸索以理解环境，同时利用其常识来最大化回报。

而强化学习中过多利用的问题会使代理陷入次优策略，无奈发现可能更好的动作或状态。即便有更好的抉择，代理也可能对其以后的政策过于自信。这被称为“破绽利用陷阱”或“部分最优”问题，代理无奈从次优解决方案中逃脱。在这种状况下，摸索有助于发现更好的策略和防止“部分最优”。

回到咱们的代码，咱们须要查看Q-table ，并确保能够从中提取出最短门路。

 # Extract shortest path from Q-values table     shortest_path= [start_node]     current_node=start_node_index     whilecurrent_node!=end_node_index:         next_node=np.argmax(q_table[current_node, :])         shortest_path.append(nodes[next_node])         current_node=next_node     shortest_path.append(end_node)     returnshortest_path

最初，应用函数来查看否可能失去所需的输入。

 shortest_path=q_learning_shortest_path(G, 'New York, NY', 'Phoenix, AZ') print(shortest_path)

输入后果如下：

这就是咱们数据中从New York, NY到Phoenix, AZ的最短门路!

如果你感兴趣或者想理解更多，能够在这个链接中查看残缺的代码。

https://avoid.overfit.cn/post/a4d722175b984e39a8317a7fc44e8cd6

作者：Amos Eda