Hi everyone! Today I want to show how, in 50 lines of Python, we can teach a machine to balance a pole! We'll be using the standard OpenAI Gym as our testing environment, and creating our agent with nothing but numpy. I'll also be going through a crash course on reinforcement learning, so don't worry if you don't have prior experience!

The cart pole problem is one where we have to push a cart left and right to balance a pole standing on top of it. It's similar to balancing a pencil vertically on your fingertip, except constrained to one dimension (still quite challenging!).
You can check out the final repl here: https://repl.it/@MikeShi42/CartPole.

## RL Crash Course

If this is your first time in machine learning or reinforcement learning, I'll cover some basics here so you'll have a grounding in the terms we'll be using :). If this isn't your first time, feel free to hop down to developing our policy!

## Reinforcement Learning

Reinforcement learning (RL) is the field of study concerned with teaching agents (our algorithm/machine) to perform certain tasks/actions without explicitly telling them how to do so. Think of a baby moving its legs at random; if, by luck, the baby stands upright, we hand it a candy/reward. Similarly, the agent's goal is to maximize its total reward over its lifetime, and we decide on rewards which align with the task we want to accomplish. For the standing-up example: a reward of 1 when the baby is standing upright, and 0 otherwise.

An example of an RL agent is AlphaGo, where the agent learned how to play the game of Go to maximize its reward (winning games). In this tutorial, we'll create an agent that can solve the problem of balancing a pole on a cart by pushing the cart left or right.

## State

A state is what the game looks like at a given moment. We typically deal with a numerical representation of the game. In Pong, it might be the vertical position of each paddle and the x, y coordinates of the ball. In cart pole, our state is composed of 4 numbers: the position of the cart, the speed of the cart, the angle of the pole, and the angular velocity of the pole. These 4 numbers are given to us as an array (or vector). This is important: understanding that the state is an array of numbers means we can perform mathematical operations on it to decide what action to take.
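For example, with illustrative values (not taken from a real simulation), the state array can be unpacked into named components:

```python
# A hypothetical cart pole state:
# [cart position, cart velocity, pole angle, pole angular velocity]
state = [0.0, 0.0, 0.2, 0.05]

cart_position, cart_velocity, pole_angle, pole_velocity = state
print(pole_angle)  # 0.2, the pole is tilted to the right
```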

## Policy

A policy is a function that takes the state of the game (e.g. the position of board pieces, or where the cart and pole are) and outputs the action the agent should take in that position (e.g. move the knight, or push the cart to the left). After the agent takes the chosen action, the game updates to the next state, which we feed into the policy again to make the next decision. This continues until the game ends in some way. The policy is what we're really looking for, as it is the decision-making ability behind an agent.

## Dot Products

A dot product between two arrays (vectors) simply multiplies each element of the first array by the corresponding element of the second array and sums everything together. If we want the dot product of arrays A and B, it's simply A[0]*B[0] + A[1]*B[1] + ... We'll use this operation to multiply the state (an array) by another array (our policy). We'll see this in action in the next section.
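As a quick sanity check, here's the same operation written out in plain Python and with numpy:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Multiply element-wise, then sum: 1*4 + 2*5 + 3*6 = 32
manual = sum(x * y for x, y in zip(a, b))

print(manual)        # 32.0
print(np.dot(a, b))  # 32.0
```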

## Developing our Policy

To solve our game of cart pole, we'll want to let our machine learn a strategy, or policy, to win the game (i.e. maximize its rewards).

For the agent we'll develop today, we'll represent our policy as an array of 4 numbers describing how "important" each component of the state is (the cart position, pole angle, etc.). We'll then take the dot product of the policy array with the state to output a single number. Depending on whether that number is positive or negative, we'll push the cart left or right.

If this sounds a bit abstract, let's pick a concrete example and see what happens.

Let's say the cart is centered in the game and stationary, and the pole is tilted to the right and falling further to the right. The associated state array would then be [0, 0, 0.2, 0.05]: zero cart position and velocity, a pole angle of 0.2, and an angular velocity of 0.05.

Now intuitively, we'll want to straighten the pole back up by pushing the cart to the right. I've taken a good policy from one of my training runs; its policy array reads [-0.116, 0.332, 0.207, 0.352]. Let's quickly do the math by hand and see what this policy outputs as an action for this state.

Here we'll take the dot product of the state array [0, 0, 0.2, 0.05] and the policy array above. If the number is positive, we push the cart to the right; if the number is negative, we push left.
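Written out as code, the computation looks like this:

```python
import numpy as np

state = np.array([0.0, 0.0, 0.2, 0.05])
policy = np.array([-0.116, 0.332, 0.207, 0.352])

# 0*(-0.116) + 0*0.332 + 0.2*0.207 + 0.05*0.352
#            = 0.0414 + 0.0176 = 0.059
outcome = np.dot(policy, state)
print(outcome)  # roughly 0.059, a positive number
```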

The result is positive, which means the policy would also have pushed the cart to the right in this situation, exactly how we'd want it to behave.

Now this is all fine and dandy, and clearly all we need are 4 magic numbers like the ones above to solve this problem. So how do we get those numbers? What if we just picked them at random? How well would that work? Let's find out and start digging into the code!

## Start Your Editor!

Let's pop open a Python instance on repl.it. Repl.it lets you quickly bring up cloud instances of a ton of different programming environments, and edit code within a powerful cloud IDE that is accessible anywhere!

## Install the Packages

We'll start off by installing the two packages we need for this project: numpy to help with numerical calculations, and OpenAI Gym to serve as the simulator for our agent.

Simply type `gym` and `numpy` into the package search tool on the left-hand side of the editor and click the plus button to install the packages.

## Laying Down the Foundations

Let's first import the two dependencies we just installed into our main.py script and set up a new gym environment:

```python
import gym
import numpy as np

env = gym.make('CartPole-v1')
```

Next we'll define a function called `play` that will be given an environment and a policy array. It will run the policy in the environment and return the score, along with a snapshot (observation) of the game at each timestep. We'll use the score to tell us how well the policy played, and the snapshots to watch how the policy did in a single game. This way we can test different policies and see how well they do!

Let's start off with the function definition, and resetting the game to a starting state:

```python
def play(env, policy):
    observation = env.reset()
```

Next we'll initialize some variables to keep track of whether the game is over yet, the total score of the policy, and the snapshots (observations) of each step during the game:

```python
    done = False
    score = 0
    observations = []
```

Now we'll simply play the game for a lot of time steps, until gym tells us the game is done:

```python
    for _ in range(5000):
        observations += [observation.tolist()]  # Record the observations for normalization and replay

        if done:  # If the simulation was over last iteration, exit loop
            break

        # Pick an action according to the policy matrix
        outcome = np.dot(policy, observation)
        action = 1 if outcome > 0 else 0

        # Make the action, record reward
        observation, reward, done, info = env.step(action)
        score += reward

    return score, observations
```

The bulk of the code above is mainly just playing the game and recording the outcome. The actual code that is our policy is simply these two lines: `outcome = np.dot(policy, observation)` and `action = 1 if outcome > 0 else 0`.

All we're doing is taking the dot product between the policy array and the state (observation) array, like in the concrete example earlier. Then we choose an action of 1 or 0 (right or left) depending on whether the outcome is positive or negative.

Now we'll want to start playing some games and find our optimal policy!

## Playing the First Game

Now that we have a function to play the game and tell us how good our policy is, we'll want to start generating some policies and see how well they do.

What if we just plugged in some random policies at first? How far can we go? Let's use numpy to generate our policy: a 4-element array, or a 1x4 matrix. It'll pick 4 numbers between 0 and 1 to use as our policy.

```python
policy = np.random.rand(1,4)
```

With that policy in place, and the environment we created above, we can plug them into `play` and get a score: `score, observations = play(env, policy)`, then `print('Policy Score', score)` to see it.

Simply hit run to run our script. It should output the score our policy got.

The max score for this game is 500, and chances are your policy didn't fare so well. If it did, congrats! It must be your lucky day! Just seeing a number isn't very rewarding though. It'd be great if we could visualize how our agent plays the game, and in the next step we'll set that up!

## Watching our Agent

To watch our agent, we'll use flask to set up a lightweight server so we can see our agent's performance in the browser. Flask is a light Python HTTP server framework that can serve our HTML UI and data. I'll keep this part brief, as the details behind rendering and HTTP servers aren't critical to training our agent.

We'll first want to install `flask` as a Python package, just like how we installed gym and numpy in the previous sections.

Next, at the bottom of our script, we'll create a flask server. It'll expose the recording of each frame of the game on the `/data` endpoint and host the UI on `/`.
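The article keeps this part brief, so here's only a minimal sketch of what such a server might look like. Names like `best_observations`, the static-folder setup, and the port are assumptions for illustration, not the article's exact code:

```python
from flask import Flask
import json

app = Flask(__name__, static_folder=".")

# Assumption for illustration: this would hold the recorded frames
# of the best run (the observations list returned by play())
best_observations = [[0.0, 0.0, 0.2, 0.05]]

@app.route("/data")
def data():
    # Expose the recorded frames as JSON for the UI to replay
    return json.dumps(best_observations)

@app.route("/")
def root():
    # Serve the rendering UI
    return app.send_static_file("index.html")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```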

Additionally, we'll need to add two files. One will be a blank Python file; this is a technicality of how repl.it detects whether the repl is in eval mode or project mode. Simply use the new file button to add a blank Python script.

After that, we'll also want to create an index.html that will host the rendering UI. I won't dive into details here; simply upload this index.html to your repl.it project.

You now should have a project directory that looks like this:

Now, with these two files in place, when we run the repl it should also play back how our policy did. With this set up, let's try to find an optimal policy!

## Policy Search

In our first pass, we simply randomly picked one policy, but what if we picked a handful of policies, and only kept the one that did the best?

Let's go back to the part where we play the policy. Instead of generating just one, let's write a loop to generate a few, keep track of how well each policy did, and save only the best one.

We'll first create a tuple called `max` that will store the score, observations, and policy array of the best policy we've seen so far:

```python
max = (0, [], [])  # (score, observations, policy); note this shadows Python's built-in max()
```

Next we'll generate and evaluate 10 policies, and save the best policy in `max`:

```python
for _ in range(10):
    policy = np.random.rand(1,4)
    score, observations = play(env, policy)

    if score > max[0]:
        max = (score, observations, policy)

print('Max Score', max[0])
```

We'll also have to tell our `/data` endpoint to return the replay of the best policy.

If we run the repl now, we should get a max score of 500; if not, try running the repl one more time! We can also watch the policy balance the pole perfectly fine. Wow, that was easy!

## Not So Fast

Or maybe it isn't. We cheated a bit in the first part, in a couple of ways. First of all, we only randomly created policy arrays in the range 0 to 1. This just happens to work, but if we flipped the greater-than operator around, we'd see that the agent fails pretty catastrophically. To try it yourself, change `action = 1 if outcome > 0 else 0` to `action = 1 if outcome < 0 else 0`.

This doesn't seem very robust: if we had just happened to pick less-than instead of greater-than, we could never find a policy that solves the game. To alleviate this, we should actually generate policies with negative numbers as well. This makes it more difficult to find a good policy (as many of the negative ones aren't good), but we're no longer "cheating" by fitting our specific algorithm to this specific game. If we tried this on other environments in the OpenAI gym, our algorithm would definitely fail.

To do this, instead of `policy = np.random.rand(1,4)`, we'll change to `policy = np.random.rand(1,4) - 0.5`. This way each number in our policy will be between -0.5 and 0.5 instead of 0 and 1. But because this is more difficult, we'll also want to search through more policies. In the for loop above, instead of iterating through 10 policies, let's try 100 by changing the code to read `for _ in range(100):`. I also encourage you to try iterating through just 10 policies first, to see how hard it is to find good policies now that negative numbers are included.
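To see the effect of the shift, here's a quick check (the fixed seed is only for reproducibility in this sketch):

```python
import numpy as np

np.random.seed(0)
samples = np.random.rand(10000) - 0.5

# Subtracting 0.5 moves the [0, 1) range to [-0.5, 0.5)
print(samples.min() >= -0.5, samples.max() < 0.5)  # True True
```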

If you run the repl now, whether we're using greater-than or less-than, we can still find a good policy for the game.

## Not So Fast Pt. 2

But wait, there's more! Even though our policy might achieve the max score of 500 on a single run, can it do so every time? When we generate 100 policies and pick the one that did best on its single run, that policy might've just gotten very lucky; it could be a very bad policy that happened to have one very good run. This is because the game itself has an element of randomness (the starting position is different every time), so a policy could be good at just one starting position but not others.

So to fix this, we'd want to evaluate how well a policy does over multiple trials. For now, let's take the best policy we found before and see how well it does over 100 trials:

```python
scores = []
for _ in range(100):
    score, _ = play(env, max[2])
    scores += [score]

print('Average Score (100 trials)', np.mean(scores))
```

Here we're playing the best policy (index 2 of `max`) 100 times and recording the score each time. We then use numpy to calculate the average score and print it to the terminal. There's no hard published definition of "solved", but it should be only a few points shy of 500. You might notice that the "best" policy is actually subpar sometimes. However, I'll leave the fix up to you!
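One possible direction, as a sketch only (`evaluate` is a hypothetical helper, not part of the article's code): rank each candidate policy by its average score over several trials rather than a single run, so a lucky episode can't win the search.

```python
import numpy as np

def evaluate(play_fn, policy, trials=10):
    # Average the score over several independent runs, so one
    # lucky episode can't make a bad policy look good
    scores = [play_fn(policy) for _ in range(trials)]
    return float(np.mean(scores))
```

With the article's `play` function you might then rank candidates with `evaluate(lambda p: play(env, p)[0], policy)` and keep the policy with the highest average.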

## done=True

Congrats! 🎉 We've successfully created an AI that can solve cart pole very effectively, and rather efficiently. There's a lot of room for improvement, which will be part of a later article in this series. Some things we can investigate further:

- Finding a "real" optimal policy (one that does well in 100 separate plays)
- Optimizing the number of searches needed to find an optimal policy ("sample efficiency")
- Doing a proper search of the policy space instead of randomly picking policies
