OpenAI Gym (II)

It occurred to me: why not let the people who wrote the story

Dynamics with Python balancing the five link pendulum

Dynamics with Python

We’ve been working on a conference paper to demonstrate the ability to do multibody dynamics with Python. We’ve been calling this work flow PyDy, short for Python Dynamics. Several pieces of the puzzle have come together lately to really demonstrate the power of the scientific python software packages to handle complex dynamic and controls problems (i.e. IPython notebooks, matplotlib animations, python-control, and our software package mechanics, which is a part of SymPy). After writing the draft of our paper, which uses a general n-link pendulum as its main example, I came across this blog post by Wolfram demonstrating their ability to symbolically derive the equations of motion for the n-link pendulum and stabilize it with an LQR controller. It inspired me to replicate the example as I realized that it was relatively easy to do with all free and open source software!

In this example problem we will derive the equations of motion of an n-link pendulum on a laterally sliding cart and then develop a controller to stabilize it. Balancing a single inverted pendulum is a classic problem that is often a student’s first experience with non-linear dynamics and control. The problem here is extended to a general n-link pendulum, and as we will see, the equations of motion quickly get messy with more than two links.
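(A minimal, single-link sketch of that workflow follows, to give the flavor of what the paper does for general n. It assumes a reasonably recent SymPy and its sympy.physics.mechanics module; the variable names and the frame/force choices are illustrative, not taken from the PyDy paper itself.)

import sympy as sm
import sympy.physics.mechanics as me

# Generalized coordinates/speeds: cart position q0 along the track, link angle q1.
q0, q1 = me.dynamicsymbols('q0 q1')
u0, u1 = me.dynamicsymbols('u0 u1')
f = me.dynamicsymbols('f')                     # lateral force applied to the cart
m0, m1, l, g, t = sm.symbols('m0 m1 l g t')

N = me.ReferenceFrame('N')                     # inertial frame; N.x along the track
O = me.Point('O')
O.set_vel(N, 0)

P0 = O.locatenew('P0', q0 * N.x)               # the cart
P0.set_vel(N, u0 * N.x)
cart = me.Particle('cart', P0, m0)

A = N.orientnew('A', 'Axis', [q1, N.z])        # link frame, rotated about N.z
A.set_ang_vel(N, u1 * N.z)
P1 = P0.locatenew('P1', l * A.x)               # bob at the end of the link
P1.v2pt_theory(P0, N, A)
bob = me.Particle('bob', P1, m1)

kde = [q0.diff(t) - u0, q1.diff(t) - u1]       # kinematic differential equations
loads = [(P0, f * N.x - m0 * g * N.y),         # pushing force and gravity on the cart
         (P1, -m1 * g * N.y)]                  # gravity on the bob

kane = me.KanesMethod(N, q_ind=[q0, q1], u_ind=[u0, u1], kd_eqs=kde)
fr, frstar = kane.kanes_equations([cart, bob], loads)   # (bodies, loads) in SymPy >= 1.1
print(sm.simplify(fr + frstar))                # Fr + Fr* = 0 are the equations of motion

The resulting mass matrix and forcing vector (kane.mass_matrix_full, kane.forcing_full) are what one would then linearize and hand to an LQR design, for example with python-control, which is the route the PyDy example takes.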

 

tell the story themselves?

 

For one thing, we learn the whole story behind PyDy, “Python Dynamics”; for another, we catch the creative spirit behind it!

─── “STEM Essays: Classical Mechanics: Dynamics (I)”

 

If one is well versed in PyDy, “Python Dynamics”, and can design an LQR controller to stabilize an n-link pendulum,

then it should be easy to read the Gym documentation

Getting Started with Gym

Gym is a toolkit for developing and comparing reinforcement learning algorithms. It makes no assumptions about the structure of your agent, and is compatible with any numerical computation library, such as TensorFlow or Theano.

The gym library is a collection of test problems — environments — that you can use to work out your reinforcement learning algorithms. These environments have a shared interface, allowing you to write general algorithms.

 

and the CartPole example it gives:

Environments

Here’s a bare minimum example of getting something running. This will run an instance of the CartPole-v0 environment for 1000 timesteps, rendering the environment at each step. You should see a window pop up rendering the classic cart-pole problem:

import gym
env = gym.make('CartPole-v0')
env.reset()
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample()) # take a random action

It should look something like this:

Normally, we’ll end the simulation before the cart-pole is allowed to go off-screen. More on that later.

If you’d like to see some other environments in action, try replacing CartPole-v0 above with something like MountainCar-v0, MsPacman-v0 (requires the Atari dependency), or Hopper-v1 (requires the MuJoCo dependencies). Environments all descend from the Env base class.

Note that if you’re missing any dependencies, you should get a helpful error message telling you what you’re missing. (Let us know if a dependency gives you trouble without a clear instruction to fix it.) Installing a missing dependency is generally pretty simple. You’ll also need a MuJoCo license for Hopper-v1.
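(Not part of the quoted docs: one way to see which of these environments a given installation actually provides is to enumerate the registry that gym.make() consults. The snippet below assumes a classic, pre-0.26 gym release, where envs.registry.all() exists.)

from gym import envs

# Every registered environment id; any of them can be passed to gym.make().
all_ids = sorted(spec.id for spec in envs.registry.all())
print(len(all_ids), 'environments registered')
print(all_ids[:10])   # a small sample; Atari/MuJoCo ids appear only if their extras are installed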


Observations

If we ever want to do better than take random actions at each step, it’d probably be good to actually know what our actions are doing to the environment.

The environment’s step function returns exactly what we need. In fact, step returns four values. These are:

  • observation (object): an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.
  • reward (float): amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.
  • done (boolean): whether it’s time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)
  • info (dict): diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment’s last state change). However, official evaluations of your agent are not allowed to use this for learning.

This is just an implementation of the classic “agent-environment loop”. Each timestep, the agent chooses an action, and the environment returns an observation and a reward.

The process gets started by calling reset(), which returns an initial observation. So a more proper way of writing the previous code would be to respect the done flag:

import gym
env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
[-0.061586   -0.75893141  0.05793238  1.15547541]
[-0.07676463 -0.95475889  0.08104189  1.46574644]
[-0.0958598  -1.15077434  0.11035682  1.78260485]
[-0.11887529 -0.95705275  0.14600892  1.5261692 ]
[-0.13801635 -0.7639636   0.1765323   1.28239155]
[-0.15329562 -0.57147373  0.20218013  1.04977545]
Episode finished after 14 timesteps
[-0.02786724  0.00361763 -0.03938967 -0.01611184]
[-0.02779488 -0.19091794 -0.03971191  0.26388759]
[-0.03161324  0.00474768 -0.03443415 -0.04105167]

 

Yet once the way of thinking changes, the problem itself seems to blur. At times like this it pays to read the source code:

gym/gym/envs/classic_control/cartpole.py

"""
Classic cart-pole system implemented by Rich Sutton et al.
Copied from http://incompleteideas.net/sutton/book/code/pole.c
permalink: https://perma.cc/C9ZM-652R
"""

import math
import gym
from gym import spaces, logger
from gym.utils import seeding
import numpy as np

class CartPoleEnv(gym.Env):
    """
    Description:
        A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The pendulum starts upright, and the goal is to prevent it from falling over by increasing and reducing the cart's velocity.
    Source:
        This environment corresponds to the version of the cart-pole problem described by Barto, Sutton, and Anderson
    Observation: 
        Type: Box(4)
        Num	Observation                 Min         Max
        0	Cart Position             -4.8            4.8
        1	Cart Velocity             -Inf            Inf
        2	Pole Angle                 -24°           24°
        3	Pole Velocity At Tip      -Inf            Inf
        
    Actions:
        Type: Discrete(2)
        Num	Action
        0	Push cart to the left
        1	Push cart to the right
        
        Note: The amount the velocity is reduced or increased is not fixed as it depends on the angle the pole is pointing. This is because the center of gravity of the pole increases the amount of energy needed to move the cart underneath it
    Reward:
        Reward is 1 for every step taken, including the termination step
    Starting State:
        All observations are assigned a uniform random value between ±0.05
    Episode Termination:
        Pole Angle is more than ±12°
        Cart Position is more than ±2.4 (center of the cart reaches the edge of the display)
        Episode length is greater than 200
        Solved Requirements
        Considered solved when the average reward is greater than or equal to 195.0 over 100 consecutive trials.
    """

 

Let us see how others have solved it:

karpathy’s algorithm: took 211 episodes to solve the environment (195.27 ± 1.57).

""" Quick script for an "Episodic Controller" Agent, i.e. nearest neighbor """

import logging
import os
import tempfile
import numpy as np

import gym

class EpisodicAgent(object):
    """
    Episodic agent is a simple nearest-neighbor based agent:
    - At training time it remembers all tuples of (state, action, reward).
    - After each episode it computes the empirical value function based 
        on the recorded rewards in the episode.
    - At test time it looks up k-nearest neighbors in the state space 
        and takes the action that most often leads to highest average value.
    """

 

That is effort which probably cannot be skipped!

One cannot help but think how nice it would be to have a clear, well written tutorial ☆

Q-Learning Data Science Tutorial
Authors: Satwik Kansal (Software Developer) and Brendan Martin (Founder of LearnDataSci)

Reinforcement Q-Learning from Scratch in Python with OpenAI Gym

Teach a Taxi to pick up and drop off passengers at the right locations with Reinforcement Learning

Most of you have probably heard of AI learning to play computer games on their own, a very popular example being DeepMind. DeepMind hit the news when their AlphaGo program defeated the South Korean Go world champion in 2016. There had been many successful attempts in the past to develop agents with the intent of playing Atari games like Breakout, Pong, and Space Invaders.

Each of these programs follows a paradigm of Machine Learning known as Reinforcement Learning. If you’ve never been exposed to reinforcement learning before, the following is a very straightforward analogy for how it works.