CSE 525 Programming Assignment 1
Due March 20th 11:59:59
The goal of this assignment is to implement the following three RL algorithms:
● Monte Carlo (with function approximation)
● Fitted Q iteration
● DQN
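As a reminder of how the regression targets differ, the sketch below contrasts the full-episode return used by Monte Carlo with the one-step bootstrapped target used by fitted Q iteration and DQN. It is a minimal illustration, not part of the starter code; the function names and the plain-list representation of Q-values are assumptions made for this example.

```python
def discounted_returns(rewards, gamma):
    """Monte Carlo targets: G_t = r_t + gamma * G_{t+1}, computed backwards over one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def bootstrapped_target(reward, next_q_values, done, gamma):
    """One-step target r + gamma * max_a' Q(s', a'), with no bootstrap at terminal states.

    next_q_values is a list of Q(s', a') estimates, one per discrete action
    (from the previous iterate in fitted Q iteration, or a target network in DQN).
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)
```

In both cases the network is then fit to these targets by regression; only the way the target is built changes.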
You will use one MuJoCo environment (InvertedPendulumMuJoCoEnv-v0) and one
Atari environment (Pong-v0) to compare the RL algorithms. Feel free to use any of the
extensions and tricks we discussed in class to make learning more reliable. Use
epsilon-greedy as the behavior policy for the off-policy RL methods.
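An epsilon-greedy behavior policy can be sketched as follows. This is one possible shape, assuming the Q-values for the current state are available as a plain list; the linear annealing schedule and its default values are hypothetical and are not prescribed by the assignment.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of Q-value estimates.

    With probability epsilon choose uniformly at random (explore);
    otherwise choose the action with the largest Q-value (exploit).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice; ties resolve to the lowest index.
    return max(range(len(q_values)), key=lambda a: q_values[a])


def linear_epsilon(step, start=1.0, end=0.1, decay_steps=10_000):
    """Hypothetical schedule: anneal epsilon linearly from start to end, then hold."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)
```

Setting epsilon to 1.0 turns the same function into a purely random behavior policy, which is convenient for comparing the two.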
What you need to submit:
(1) A notebook file that contains your network definitions, training processes,
evaluation results, and the necessary comments on your code.
(2) A report that contains the core code of the algorithms and network designs, an analysis
of your results, and a comparison between the algorithms.
Prerequisites
For this assignment, we recommend using Colab, OpenAI Gym, OpenAI Gym[Atari],
PyBullet, and PyBulletGym (an implementation of the OpenAI Gym[MuJoCo] environments
based on PyBullet). Before getting started, please make sure that the required
dependencies install and run smoothly.
The aforementioned packages provide simulated environments that interact with our
agents and return observations, rewards, and other important information at every
step. This time, we picked one discrete environment in Atari called “Pong-v0”
and one continuous environment in MuJoCo called “InvertedPendulumMuJoCoEnv-v0”.
Note that the actions in “Pong” are discrete while the actions in “InvertedPendulum” are
continuous. As you know, the three algorithms cannot handle continuous
actions, so you must first discretize the action space of the
“InvertedPendulum” environment.
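One simple way to discretize is to replace the continuous action with a small table of evenly spaced values and let the network output one Q-value per table entry. The sketch below assumes a one-dimensional action; the bounds of -1.0 and 1.0 and the choice of 5 bins are placeholders, and the helper names are made up for this example.

```python
def make_action_table(low, high, num_bins):
    """Return num_bins evenly spaced action values covering [low, high]."""
    if num_bins < 2:
        raise ValueError("num_bins must be at least 2")
    step = (high - low) / (num_bins - 1)
    return [low + i * step for i in range(num_bins)]


def to_continuous(action_index, action_table):
    """Map a discrete action index to the 1-element action passed to env.step."""
    return [action_table[action_index]]


# Assumed bounds; read the real ones from env.action_space.low and env.action_space.high.
ACTIONS = make_action_table(-1.0, 1.0, num_bins=5)
```

More bins give finer control but also more outputs for the network to learn, so the bin count is worth treating as a hyperparameter.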
For the Atari game “Pong” environment, we encourage you to preprocess the image
input to make it easier for the network to learn from.
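A common preprocessing recipe is to crop, convert to grayscale, downsample, and rescale each frame, then stack a few consecutive frames so the network can infer motion. The sketch below assumes the standard 210x160x3 uint8 Atari frame and requires NumPy; the crop rows (35 to 195) and the 80x80 output size are assumptions to tune, not requirements of the assignment.

```python
import numpy as np


def preprocess_frame(frame):
    """Turn a 210x160x3 uint8 Pong frame into an 80x80 float32 image in [0, 1]."""
    cropped = frame[35:195]        # drop the scoreboard and bottom border (assumed rows)
    gray = cropped.mean(axis=2)    # average the RGB channels -> 160x160
    small = gray[::2, ::2]         # keep every second pixel -> 80x80
    return (small / 255.0).astype(np.float32)


def stack_frames(frames):
    """Stack recent preprocessed frames along a new leading axis, e.g. 4 frames -> (4, 80, 80)."""
    return np.stack(frames, axis=0)
```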
Rubrics
1) Network design for the two environments. (20 points in total, 10 points each)
2) Training process for the three algorithms. There should be 6 training processes in
total (2 environments × 3 algorithms). (30 points in total, 5 points each. You
should provide a decent amount of comments to explain your code.)
3) Evaluation results of your 6 training programs. These should include plots of
cumulative reward by training episode, the average return over ten runs of your final
policy, and any other plots that you find helpful for explaining your design’s
performance. (30 points in total, 5 points each.)
4) Analysis of the performance of the three algorithms in each environment. Analyze your
plots and numbers for each algorithm and compare the three algorithms within each
environment. (15 points in total)
5) Comparison between epsilon-greedy and a random behavior policy. For
this experiment, use “InvertedPendulum” as your environment and fitted Q
iteration as your RL algorithm. Give plots of your cumulative reward by episode
and the average return on test runs of your learned policy, and analyze the performance
of the different behavior policies. (5 points in total)
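The "average return over ten runs" asked for in the rubric can be computed with a small evaluation loop like the one sketched here. It assumes the classic Gym interface, where reset() returns an observation and step(action) returns (obs, reward, done, info); newer Gym releases differ, so adapt as needed. CountdownEnv is a toy stand-in included only so the example runs on its own.

```python
def average_return(env, policy, num_episodes=10):
    """Run policy for num_episodes and return the mean undiscounted episode return.

    policy is any callable mapping an observation to an action; for the final
    evaluation this would be the greedy policy of the learned Q-network.
    """
    totals = []
    for _ in range(num_episodes):
        obs = env.reset()
        done = False
        total = 0.0
        while not done:
            obs, reward, done, _info = env.step(policy(obs))
            total += reward
        totals.append(total)
    return sum(totals) / len(totals)


class CountdownEnv:
    """Toy stand-in for a real environment: pays reward 1 per step for `horizon` steps."""

    def __init__(self, horizon=3):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= self.horizon, {}
```

Calling the same function on policies trained with epsilon-greedy data and with random data gives directly comparable numbers for item 5.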
To start with:
We prepared simple starter code to help you understand what you should code and where
to put your analysis. You don’t have to follow the format strictly; write your code in
whatever way you are comfortable with.
Before turning in:
- Check your notebook file and make sure that no errors occur when the instructors
“restart and run all”. Also, make sure the format of your report is correct.
- Rename your notebook file like firstname_lastname_SBUID.ipynb and your
report like firstname_lastname_SBUID_report.pdf. Zip these two files under a name
like firstname_lastname_SBUID.zip and upload it to Blackboard.
After turning in:
- Any format errors and fail-to-run errors might result in a penalty.
- Late submissions might result in a penalty: 10% per day, 50% max.