OpenAI Gym (Part 1)


Gym is a toolkit for developing and comparing reinforcement learning algorithms. It supports teaching agents everything from walking to playing games like Pong or Pinball.



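Ivan Petrovich Pavlov[1] (Russian: Иван Петрович Павлов, 26 September 1849 to 27 February 1936) was a Russian physiologist, psychologist, and physician. He is best known for being the first to describe classical conditioning, based on his research with dogs, and he received the Nobel Prize in Physiology or Medicine in 1904 for his research on the digestive system.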










Food (US) => salivation (UR)

Food (US) + sound (NS) => salivation (UR)

Sound (CS) => salivation (CR)





Reinforcement learning

Reinforcement learning (RL) is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. The problem, due to its generality, is studied in many other disciplines, such as game theory, control theory, operations research, information theory, simulation-based optimization, multi-agent systems, swarm intelligence, statistics and genetic algorithms. In the operations research and control literature, reinforcement learning is called approximate dynamic programming, or neuro-dynamic programming.[1][2] The problems of interest in reinforcement learning have also been studied in the theory of optimal control, which is concerned mostly with the existence and characterization of optimal solutions, and algorithms for their exact computation, and less with learning or approximation, particularly in the absence of a mathematical model of the environment. In economics and game theory, reinforcement learning may be used to explain how equilibrium may arise under bounded rationality. In machine learning, the environment is typically formulated as a Markov Decision Process (MDP), as many reinforcement learning algorithms for this context utilize dynamic programming techniques.[2][1][3] The main difference between the classical dynamic programming methods and reinforcement learning algorithms is that the latter do not assume knowledge of an exact mathematical model of the MDP and they target large MDPs where exact methods become infeasible.[2][1]

Reinforcement learning is considered one of three machine learning paradigms, alongside supervised learning and unsupervised learning. It differs from supervised learning in that correct input/output pairs need not be presented, and sub-optimal actions need not be explicitly corrected. Instead the focus is on performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).[4] The exploration vs. exploitation trade-off has been most thoroughly studied through the multi-armed bandit problem and in finite MDPs.


The typical framing of a Reinforcement Learning (RL) scenario: an agent takes actions in an environment, which is interpreted into a reward and a representation of the state, which are fed back into the agent.

Basic reinforcement learning is modeled as a Markov decision process (MDP), which consists of:

  • a set of environment and agent states, $S$;
  • a set of actions, $A$, of the agent;
  • $P_a(s,s') = \Pr(s_{t+1}=s' \mid s_t=s, a_t=a)$, the probability of transition from state $s$ to state $s'$ under action $a$;
  • $R_a(s,s')$, the immediate reward after the transition from $s$ to $s'$ with action $a$;
  • rules that describe what the agent observes.

Rules are often stochastic. The observation typically involves the scalar, immediate reward associated with the last transition. In many works, the agent is assumed to observe the current environmental state (full observability). If not, the agent has partial observability. Sometimes the set of actions available to the agent is restricted (for example, a zero balance cannot be reduced).
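
As a rough illustration of the ingredients listed above, the following sketch writes down a tiny MDP as plain Python dictionaries. The two states, two actions, probabilities and rewards are all made up for this example (they are assumptions, not taken from any Gym environment), and `sample_transition` is a hypothetical helper name.

```python
import random

# Hypothetical two-state MDP used only for illustration.
STATES = ["low", "high"]
ACTIONS = ["wait", "work"]

# P[(s, a)] maps each next state s' to Pr(s_{t+1} = s' | s_t = s, a_t = a).
P = {
    ("low", "wait"): {"low": 1.0},
    ("low", "work"): {"low": 0.3, "high": 0.7},
    ("high", "wait"): {"low": 0.4, "high": 0.6},
    ("high", "work"): {"high": 1.0},
}

# R[(s, a, s')] is the immediate reward R_a(s, s'); missing entries mean 0.
R = {
    ("low", "work", "high"): 1.0,
    ("high", "work", "high"): 2.0,
    ("high", "wait", "low"): -1.0,
}


def sample_transition(state, action, rng=random):
    """Draw s' from P_a(s, .) and return (s', R_a(s, s'))."""
    next_states = list(P[(state, action)].keys())
    probs = list(P[(state, action)].values())
    next_state = rng.choices(next_states, weights=probs, k=1)[0]
    reward = R.get((state, action, next_state), 0.0)
    return next_state, reward
```

Here the agent is assumed to observe the state directly (full observability); a partially observable variant would return some function of the state instead.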

A reinforcement learning agent interacts with its environment in discrete time steps. At each time $t$, the agent receives an observation $o_t$, which typically includes the reward $r_t$. It then chooses an action $a_t$ from the set of available actions, which is subsequently sent to the environment. The environment moves to a new state $s_{t+1}$, and the reward $r_{t+1}$ associated with the transition $(s_t, a_t, s_{t+1})$ is determined. The goal of a reinforcement learning agent is to collect as much reward as possible. The agent can (possibly randomly) choose any action as a function of the history.
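
The interaction loop described above can be sketched without any library. `CoinFlipEnv` and `run_episode` are hypothetical names invented for this example; the environment is a deliberately trivial one (guess a fair coin flip) and its `reset`/`step` methods only loosely imitate the style of Gym's interface.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess the outcome of a fair coin flip."""

    def __init__(self, horizon=10, seed=0):
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # initial observation (this toy task has a single state)

    def step(self, action):
        # The environment determines r_{t+1} for the transition (s_t, a_t, s_{t+1}).
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0
        self.t += 1
        done = self.t >= self.horizon
        return 0, reward, done


def run_episode(env, policy):
    """Run one episode; return the total reward and the interaction history."""
    history = []
    observation = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(observation, history)  # a_t may depend on the whole history
        next_observation, reward, done = env.step(action)
        history.append((observation, action, reward))
        total_reward += reward
        observation = next_observation
    return total_reward, history


agent_rng = random.Random(1)
total, history = run_episode(CoinFlipEnv(), lambda obs, hist: agent_rng.randint(0, 1))
```

The policy here guesses at random, so it ignores both the observation and the history; a learning agent would use them to improve its choices over time.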

When the agent’s performance is compared to that of an agent that acts optimally, the difference in performance gives rise to the notion of regret. In order to act near optimally, the agent must reason about the long term consequences of its actions (i.e., maximize future income), although the immediate reward associated with this might be negative.

Thus, reinforcement learning is particularly well-suited to problems that include a long-term versus short-term reward trade-off. It has been applied successfully to various problems, including robot control, elevator scheduling, telecommunications, backgammon, checkers[5] and Go (AlphaGo).

Two elements make reinforcement learning powerful: the use of samples to optimize performance and the use of function approximation to deal with large environments. Thanks to these two key components, reinforcement learning can be used in large environments in the following situations:

  • A model of the environment is known, but an analytic solution is not available;
  • Only a simulation model of the environment is given (the subject of simulation-based optimization);[6]
  • The only way to collect information about the environment is to interact with it.

The first two of these problems could be considered planning problems (since some form of model is available), while the last one could be considered to be a genuine learning problem. However, reinforcement learning converts both planning problems to machine learning problems.


Reinforcement learning requires clever exploration mechanisms. Randomly selecting actions, without reference to an estimated probability distribution, shows poor performance. The case of (small) finite Markov decision processes is relatively well understood. However, due to the lack of algorithms that scale well with the number of states (or scale to problems with infinite state spaces), simple exploration methods are the most practical.

One such method is $\epsilon$-greedy: with probability $1-\epsilon$ the agent chooses the action that it believes has the best long-term effect (exploitation), and with probability $\epsilon$ it chooses an action uniformly at random (exploration). Here, $0<\epsilon<1$ is a tuning parameter, which is sometimes changed, either according to a fixed schedule (making the agent explore progressively less), or adaptively based on heuristics.[7]
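
A minimal sketch of $\epsilon$-greedy action selection follows, assuming the agent keeps a list of estimated action values (`q_values`, a name chosen for this example). The linear decay in `decayed_epsilon` is just one possible fixed schedule; its default constants are arbitrary assumptions.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of estimated action values."""
    if rng.random() < epsilon:
        # Explore: choose uniformly at random among all actions.
        return rng.randrange(len(q_values))
    # Exploit: choose the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decayed_epsilon(step, start=1.0, end=0.05, decay_steps=1000):
    """Linear schedule: explore progressively less as `step` grows."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)
```

For example, `epsilon_greedy(q, decayed_epsilon(t))` would explore almost always at the start of training and settle at 5% random actions after 1000 steps.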



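Whether or not "simple symbols" can explain the "vast and intricate universe", those pressing ahead would do well to read the OpenAI Gym whitepaper: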

A whitepaper for OpenAI Gym is available on arXiv (arXiv:1606.01540), and here’s a BibTeX entry that you can use to cite it in a publication:

@misc{1606.01540,
  Author = {Greg Brockman and Vicki Cheung and Ludwig Pettersson and Jonas Schneider and John Schulman and Jie Tang and Wojciech Zaremba},
  Title = {OpenAI Gym},
  Year = {2016},
  Eprint = {arXiv:1606.01540},
}

OpenAI Gym

OpenAI Gym is a toolkit for reinforcement learning research. It includes a growing collection of benchmark problems that expose a common interface, and a website where people can share their results and compare the performance of algorithms. This whitepaper discusses the components of OpenAI Gym and the design decisions that went into the software.

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as: arXiv:1606.01540 [cs.LG]
  (or arXiv:1606.01540v1 [cs.LG] for this version)

Submission history

From: John Schulman
[v1] Sun, 5 Jun 2016 17:54:48 UTC (546 KB)




Open source interface to reinforcement learning tasks.

The gym library provides an easy-to-use suite of reinforcement learning tasks.

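import gym

# Create the Taxi environment (later Gym releases replaced this ID with "Taxi-v3").
env = gym.make("Taxi-v2")
observation = env.reset()
for _ in range(1000):
  action = env.action_space.sample()  # your agent here (this takes random actions)
  # Classic Gym API: step() returns (observation, reward, done, info).
  observation, reward, done, info = env.step(action)
  if done:
    observation = env.reset()  # start a new episode once the current one ends
env.close()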

We provide the environment; you provide the algorithm.

You can write your agent using your existing numerical computation library, such as TensorFlow or Theano.




sudo pip3 install gym