Assignment 2
ECS7002P - Artificial Intelligence in Games
November 9, 2020

In this assignment, you will implement a variety of reinforcement learning algorithms to find policies for the frozen lake environment. Please read this entire document before you start working on the assignment.

1 Environment

The frozen lake environment has two main variants: the small frozen lake (Fig. 1) and the big frozen lake (Fig. 2). In both cases, each tile in a square grid corresponds to a state. There is also an additional absorbing state, which will be introduced soon. There are four types of tiles: start (grey), frozen lake (light blue), hole (dark blue), and goal (white). The agent has four actions, which correspond to moving one tile up, left, down, or right. However, with probability 0.1, the environment ignores the desired direction and the agent slips (moves one tile in a random direction). An action that would cause the agent to move outside the grid leaves the state unchanged.

Figure 1: Small frozen lake
Figure 2: Big frozen lake

The agent receives reward 1 upon taking an action at the goal. In every other case, the agent receives zero reward. Note that the agent does not receive a reward upon moving into the goal (nor a negative reward upon moving into a hole). Upon taking an action at the goal or in a hole, the agent moves into the absorbing state. Every action taken at the absorbing state leads to the absorbing state, which also does not provide rewards. Assume a discount factor of γ = 0.9.

For the purposes of model-free reinforcement learning (or interactive testing), the agent is able to interact with the frozen lake for a number of time steps equal to the number of tiles.

Your first task is to implement the frozen lake environment. Using either Python or Java, try to mimic the Python interface presented in Listing 1.

Listing 1: Frozen lake environment.

The class EnvironmentModel represents a model of an environment. The constructor of this class receives a number of states, a number of actions, and a seed that controls the pseudorandom number generator. Its subclasses must implement two methods: p and r. The method p returns the probability of transitioning from state to next state given action. The method r returns the expected reward for having transitioned from state to next state given action. The method draw receives a pair of state and action and returns a state drawn according to p together with the corresponding expected reward. Note that states and actions are represented by integers starting at zero. We highly recommend that you follow the same convention, since this will greatly facilitate the implementation of reinforcement learning algorithms. You can use a Python dictionary (or equivalent data structure) to map (from and to) integers to a more convenient representation when necessary. Note that, in general, agents may receive rewards drawn probabilistically by an environment, which is not supported in this simplified implementation.
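For illustration, a minimal sketch of how EnvironmentModel might be organised is given below. It follows the conventions described above, but it is not a reproduction of Listing 1: the use of NumPy, the attribute names, and the argument order of p and r are assumptions.

import numpy as np


class EnvironmentModel:
    """Model of an environment with n_states states and n_actions actions.

    Subclasses must implement p(next_state, state, action) and
    r(next_state, state, action).
    """

    def __init__(self, n_states, n_actions, seed=None):
        self.n_states = n_states
        self.n_actions = n_actions
        self.random_state = np.random.RandomState(seed)

    def p(self, next_state, state, action):
        # Probability of transitioning from state to next_state given action.
        raise NotImplementedError()

    def r(self, next_state, state, action):
        # Expected reward for having transitioned from state to next_state given action.
        raise NotImplementedError()

    def draw(self, state, action):
        # Draw a next state according to p and return it with its expected reward.
        p = [self.p(ns, state, action) for ns in range(self.n_states)]
        next_state = self.random_state.choice(self.n_states, p=p)
        reward = self.r(next_state, state, action)
        return next_state, reward

Keeping the pseudorandom number generator inside the model (seeded in the constructor) makes interactions reproducible, which is convenient when comparing algorithms later.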
The class Environment represents an interactive environment and inherits from EnvironmentModel. The constructor of this class receives a number of states, a number of actions, a maximum number of steps for interaction, a probability distribution over initial states, and a seed that controls the pseudorandom number generator. Its subclasses must implement the methods p and r, which were already explained above. This class has two new methods: reset and step. The method reset restarts the interaction between the agent and the environment by setting the number of time steps to zero and drawing a state according to the probability distribution over initial states. This state is stored by the class. The method step receives an action and returns a next state drawn according to p, the corresponding expected reward, and a flag variable. The new state is stored by the class. This method also keeps track of how many steps have been taken. Once the number of steps matches or exceeds the pre-defined maximum number of steps, the flag variable indicates that the interaction should end.

The class FrozenLake represents the frozen lake environment. Your task is to implement the methods p and r for this class. The constructor of this class receives a matrix that represents a lake, a probability that the agent will slip at any given time step, a maximum number of steps for interaction, and a seed that controls the pseudorandom number generator. This class overrides the method step to indicate that the interaction should also end when the absorbing state is reached. The method render is capable of rendering the state of the environment or a pair of policy and value function.

The function play can be used to test your implementation of the environment before you try the next tasks.

2 Tabular model-based reinforcement learning

Your next task is to implement policy evaluation, policy improvement, policy iteration, and value iteration. You may follow the interface suggested in Listing 2.

Listing 2: Tabular model-based algorithms.

The function policy_evaluation receives an environment model, a deterministic policy, a discount factor, a tolerance parameter, and a maximum number of iterations. A deterministic policy may be represented by an array that contains the action prescribed for each state.

The function policy_improvement receives an environment model, the value function for a policy to be improved, and a discount factor.

The function policy_iteration receives an environment model, a discount factor, a tolerance parameter, a maximum number of iterations, and (optionally) the initial policy.

The function value_iteration receives an environment model, a discount factor, a tolerance parameter, a maximum number of iterations, and (optionally) the initial value function.
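As an illustration of the kind of function this section asks for, the sketch below shows one way value iteration could be written against the model interface from Section 1. The parameter names (gamma for the discount factor, theta for the tolerance), the attributes n_states and n_actions on the model, and the decision to return a greedy policy together with the value function are assumptions; Listing 2 defines the actual interface.

import numpy as np


def value_iteration(env, gamma, theta, max_iterations, value=None):
    # Start from the supplied value function, or from zeros if none is given.
    if value is None:
        value = np.zeros(env.n_states)
    else:
        value = np.array(value, dtype=float)

    for _ in range(max_iterations):
        delta = 0.0
        for s in range(env.n_states):
            v = value[s]
            # Bellman optimality backup: best expected one-step return over actions.
            value[s] = max(
                sum(env.p(ns, s, a) * (env.r(ns, s, a) + gamma * value[ns])
                    for ns in range(env.n_states))
                for a in range(env.n_actions)
            )
            delta = max(delta, abs(v - value[s]))
        if delta < theta:
            break

    # Greedy policy with respect to the resulting value function.
    policy = np.zeros(env.n_states, dtype=int)
    for s in range(env.n_states):
        returns = [sum(env.p(ns, s, a) * (env.r(ns, s, a) + gamma * value[ns])
                       for ns in range(env.n_states))
                   for a in range(env.n_actions)]
        policy[s] = int(np.argmax(returns))
    return policy, value

Policy evaluation and policy improvement follow the same pattern, with the backup taken under a fixed policy instead of the maximising action.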
3 Tabular model-free reinforcement learning

Your next task is to implement Sarsa control and Q-learning control. You may follow the interface suggested in Listing 3. We recommend that you use the small frozen lake to test your implementation, since these algorithms may need many episodes to find an optimal policy for the big frozen lake.

Listing 3: Tabular model-free algorithms.

The function sarsa receives an environment, a maximum number of episodes, an initial learning rate, a discount factor, an initial exploration factor, and an (optional) seed that controls the pseudorandom number generator. Note that the learning rate and exploration factor decrease linearly as the number of episodes increases (for instance, eta[i] contains the learning rate for episode i).

The function q_learning receives an environment, a maximum number of episodes, an initial learning rate, a discount factor, an initial exploration factor, and an (optional) seed that controls the pseudorandom number generator. As in Sarsa, the learning rate and exploration factor decrease linearly as the number of episodes increases.

Important: the ε-greedy policy based on Q should break ties randomly between actions that maximize Q for a given state. This plays a large role in encouraging exploration.

4 Non-tabular model-free reinforcement learning

In this task, you will treat the frozen lake environment as if it required linear action-value function approximation. Your task is to implement Sarsa control and Q-learning control using linear function approximation. In the process, you will learn that tabular model-free reinforcement learning is a special case of non-tabular model-free reinforcement learning. You may follow the interface suggested in Listing 4.

Listing 4: Non-tabular model-free algorithms.

The class LinearWrapper implements a wrapper that behaves similarly to an environment that is given to its constructor. However, the methods reset and step return a feature matrix when they would typically return a state s. Each row a of this feature matrix contains the feature vector φ(s, a) that represents the state-action pair (s, a). The method encode_state is responsible for representing a state by such a feature matrix. More concretely, each possible state-action pair is represented by a different vector in which all elements except one are zero. Therefore, the feature matrix has |S||A| columns. The method decode_policy receives a parameter vector obtained by a non-tabular reinforcement learning algorithm and returns the corresponding greedy policy together with its value function estimate.

The function linear_sarsa receives an environment (wrapped by LinearWrapper), a maximum number of episodes, an initial learning rate, a discount factor, an initial exploration factor, and an (optional) seed that controls the pseudorandom number generator. Note that the learning rate and exploration factor decay linearly as the number of episodes grows (for instance, eta[i] contains the learning rate for episode i).

The function linear_q_learning receives an environment (wrapped by LinearWrapper), a maximum number of episodes, an initial learning rate, a discount factor, an initial exploration factor, and an (optional) seed that controls the pseudorandom number generator. As with linear Sarsa, the learning rate and exploration factor decay linearly as the number of episodes grows.

The Q-learning control algorithm for linear function approximation is presented in Algorithm 1. Note that this algorithm uses a slightly different convention for naming variables and omits some details for the sake of simplicity (such as learning rate/exploration factor decay).

Algorithm 1: Q-learning control algorithm for linear function approximation.
Input: feature vector φ(s, a) for all state-action pairs (s, a), number of episodes N, learning rate α, probability ε of choosing a random action.
[Algorithm body omitted.]

Important: the ε-greedy policy based on Q should break ties randomly between actions that maximize Q (Algorithm 1, Line 9). This plays a large role in encouraging exploration.
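To show how the pieces described in this section fit together, the following sketch outlines one way to organise Q-learning with linear function approximation, including the random tie-breaking required by the ε-greedy policy. It is not a reproduction of Algorithm 1: the attributes n_features and n_actions on the wrapper, the exact return values of reset and step, and the linear decay schedule are assumptions based on the descriptions above.

import numpy as np


def linear_q_learning(env, max_episodes, eta, gamma, epsilon, seed=None):
    # env is assumed to be wrapped by LinearWrapper, so reset/step return a
    # feature matrix with one row per action (assumed attributes: n_features, n_actions).
    random_state = np.random.RandomState(seed)

    # Assumed linear decay of the learning rate and exploration factor.
    eta = np.linspace(eta, 0, max_episodes)
    epsilon = np.linspace(epsilon, 0, max_episodes)

    theta = np.zeros(env.n_features)

    for i in range(max_episodes):
        features = env.reset()       # feature matrix for the initial state
        q = features.dot(theta)      # action values for the current state

        done = False
        while not done:
            # Epsilon-greedy action selection with random tie-breaking.
            if random_state.rand() < epsilon[i]:
                a = random_state.choice(env.n_actions)
            else:
                best = np.flatnonzero(q == q.max())
                a = random_state.choice(best)

            next_features, reward, done = env.step(a)
            next_q = next_features.dot(theta)

            # Temporal-difference error for the greedy (Q-learning) target,
            # followed by a gradient step on the linear parameters.
            delta = reward + gamma * next_q.max() - q[a]
            theta += eta[i] * delta * features[a]

            features, q = next_features, next_q

    return theta

With one-hot features over state-action pairs, each parameter in theta corresponds to a single tabular Q-value, which is one way to see that the tabular algorithms of Section 3 are a special case of this one.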
5 Main function

Your final implementation task is to write a program that uses all the algorithms that you have implemented for this assignment. Your main function should behave analogously to the function presented in Listing 5. Using the small frozen lake as a benchmark, find and render an optimal policy using policy iteration, value iteration, Sarsa control, Q-learning control, linear Sarsa control, and linear Q-learning control. For grading purposes, if your main function does not call one of these algorithms, we will assume that it is not implemented correctly.

Listing 5: Main function.
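For orientation only, the sketch below shows the overall shape such a main function might take, assuming the classes and functions described in the previous sections are defined in the same module. The lake encoding (start '&', frozen '.', hole '#', goal '$'), the keyword argument names, the hyperparameter values, and the assumption that the tabular control functions return a greedy policy and value estimate are all illustrative; Listing 5 defines the expected behaviour.

def main():
    seed = 0

    # Small frozen lake; the tile symbols used here are an assumption.
    lake = [['&', '.', '.', '.'],
            ['.', '#', '.', '#'],
            ['.', '.', '.', '#'],
            ['#', '.', '.', '$']]

    env = FrozenLake(lake, slip=0.1, max_steps=16, seed=seed)
    gamma = 0.9

    print('## Policy iteration')
    policy, value = policy_iteration(env, gamma, theta=0.001, max_iterations=128)
    env.render(policy, value)

    print('## Value iteration')
    policy, value = value_iteration(env, gamma, theta=0.001, max_iterations=128)
    env.render(policy, value)

    print('## Sarsa')
    policy, value = sarsa(env, max_episodes=4000, eta=0.5, gamma=gamma, epsilon=0.5, seed=seed)
    env.render(policy, value)

    print('## Q-learning')
    policy, value = q_learning(env, max_episodes=4000, eta=0.5, gamma=gamma, epsilon=0.5, seed=seed)
    env.render(policy, value)

    linear_env = LinearWrapper(env)

    print('## Linear Sarsa')
    parameters = linear_sarsa(linear_env, max_episodes=4000, eta=0.5, gamma=gamma, epsilon=0.5, seed=seed)
    policy, value = linear_env.decode_policy(parameters)
    env.render(policy, value)

    print('## Linear Q-learning')
    parameters = linear_q_learning(linear_env, max_episodes=4000, eta=0.5, gamma=gamma, epsilon=0.5, seed=seed)
    policy, value = linear_env.decode_policy(parameters)
    env.render(policy, value)


if __name__ == '__main__':
    main()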
6 Submission instructions

This assignment corresponds to 40% of the final grade for this module. You will work in groups of 3 students. The deadline for submitting this assignment is December 11th, 2020. Penalties for late submissions will be applied in accordance with the School policy. The submission cut-off date is 7 days after the deadline. Submissions should be made through QM+. Submissions by e-mail will be ignored. Please always check whether the files were uploaded correctly to QM+. Cases of extenuating circumstances have to go through the proper procedure in accordance with the School policy. Only cases approved by the School in due time will be considered.

You will find the group selection page in QM+. You must be part of a group in QM+ before submitting your assignment. Plagiarism leads to irreversible, non-negotiable failure in the module. If you are unsure about what constitutes plagiarism, please ask.

This assignment requires a group submission and an individual submission, which are detailed in the next sections.

6.1 Group submission

The group submission must be a single zip file. This file must contain a single folder named group[group id]. This folder must contain a report and a folder named code.

The code folder must contain a file named README.txt, which explains how to run your main function (see Section 5). Based solely on the correctness and clarity of your code, you will receive the following number of points for accomplishing each of the following tasks:

...