
Q-learning Agent in Python

In this tutorial, I create a Q-learning agent that solves the CartPole problem. Q-learning is a form of active reinforcement learning: it does not need a map (model) of the environment, and it learns an action-utility representation from temporal differences (TD).

Q-learning is an off-policy algorithm: its update uses the best Q-value (quality of an action) in the next state, regardless of which action the agent's current policy actually takes next. This makes Q-learning more flexible than SARSA, because it can learn how to behave well regardless of which policy guides its exploration.
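To make the difference from SARSA concrete, here is a minimal sketch of the two update targets. The function names and the tiny Q-table are hypothetical and only for illustration; they are not part of the agent further down.

```python
import numpy as np

def q_learning_target(q_table, next_state, reward, gamma):
    # Off-policy: bootstrap from the best action available in the next state
    return reward + gamma * np.max(q_table[next_state])

def sarsa_target(q_table, next_state, next_action, reward, gamma):
    # On-policy: bootstrap from the action the agent actually takes next
    return reward + gamma * q_table[next_state][next_action]

# Hypothetical Q-table with 2 states and 2 actions
q_table = np.array([[0.0, 0.0],
                    [1.0, 5.0]])

# Suppose the agent explores and picks action 0 in the next state
ql = q_learning_target(q_table, next_state=1, reward=1.0, gamma=0.9)
sa = sarsa_target(q_table, next_state=1, next_action=0, reward=1.0, gamma=0.9)
print(ql, sa)
```

The Q-learning target ignores the exploratory action and uses the larger value of the next state, while the SARSA target depends on the action that was actually chosen.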

Q-learning agents can be applied in partially observable environments, although the optimality guarantee holds for finite Markov decision processes (FMDP): given infinite exploration time, the algorithm can find an optimal policy for any FMDP. The algorithm needs to explore the environment in order to maximize the total reward.

Markov Decision Process (MDP)

A decreasing exploration rate (epsilon) makes the agent more likely to explore in the beginning and more likely to follow the policy in the end. The agent updates its policy/model by applying a learning rate (alpha) and a discount factor (gamma). The learning rate determines how much new knowledge replaces old knowledge: a learning rate of 0 means no learning, and a learning rate of 1 means that only the newest information counts. The discount factor determines how important future rewards are: a discount factor of 0 means that only the immediate reward matters, while a discount factor of 1 means that long-term rewards count as much as immediate ones.
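The effect of alpha and gamma can be seen in a single update step. The sketch below applies the standard temporal-difference update to one Q-value; the helper name td_update and the numbers are made up for illustration.

```python
def td_update(q_old, reward, best_next_q, alpha, gamma):
    # Target: immediate reward plus the discounted best value of the next state
    td_target = reward + gamma * best_next_q
    # Move the old estimate towards the target by a fraction alpha
    return q_old + alpha * (td_target - q_old)

# Hypothetical values: old estimate 2.0, reward 1.0, best next Q-value 4.0
no_learning = td_update(2.0, 1.0, 4.0, alpha=0.0, gamma=0.5)   # estimate stays the same
full_replace = td_update(2.0, 1.0, 4.0, alpha=1.0, gamma=0.5)  # estimate becomes the target
halfway = td_update(2.0, 1.0, 4.0, alpha=0.5, gamma=0.5)       # estimate moves halfway
myopic = td_update(2.0, 1.0, 4.0, alpha=1.0, gamma=0.0)        # only the reward counts
print(no_learning, full_replace, halfway, myopic)
```

With alpha at 0 nothing changes, with alpha at 1 the old value is overwritten by the target, and with gamma at 0 the target is just the immediate reward.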

Problem and Libraries

A pole is attached to a cart that moves along a frictionless track. The pendulum starts upright, and the goal is to prevent it from falling over. The agent can push the cart to the left or to the right, and a reward of 1 is provided for every timestep that the pole remains upright. An episode ends when the pole is more than 12 degrees from vertical, when the cart moves more than 2.4 units from the center, or after 200 timesteps. The CartPole problem is part of the gym library; I am also using the following libraries: os, math, random, pickle and numpy.

Code

The CartPole-v0 (version 0) problem is considered solved if the average reward is 195 or more over 100 consecutive trials. The agent can be trained and evaluated; the model is saved to disk after each training session. The full code for the Q-learning agent is shown below.
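The "solved" criterion can be checked with a few lines of code. This is a small sketch, assuming the per-episode scores have been collected in a list; the helper is_solved is hypothetical and is not part of the agent code.

```python
def is_solved(scores, window=100, threshold=195.0):
    """Return True if any `window` consecutive scores average at least `threshold`."""
    for start in range(len(scores) - window + 1):
        if sum(scores[start:start + window]) / window >= threshold:
            return True
    return False

# 100 perfect episodes in a row meet the criterion
print(is_solved([200.0] * 100))

# Fewer than 100 episodes can never meet it
print(is_solved([200.0] * 99))
```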

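# Import libraries
# Note: this code uses the classic gym API (before gym 0.26), where
# env.reset() returns only the observation and env.step() returns 4 values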
import os
import math
import random
import pickle
import gym
import numpy as np

# Discretize observation (continuous to discrete)
def discretize(env, buckets, state):
    upper_bounds = [env.observation_space.high[0], 0.5, env.observation_space.high[2], math.radians(50)]
    lower_bounds = [env.observation_space.low[0], -0.5, env.observation_space.low[2], -math.radians(50)]
    ratios = [(state[i] + abs(lower_bounds[i])) / (upper_bounds[i] - lower_bounds[i]) for i in range(len(state))]
    next_state = [int(round((buckets[i] - 1) * ratios[i])) for i in range(len(state))]
    next_state = [min(buckets[i] - 1, max(0, next_state[i])) for i in range(len(state))]
    return tuple(next_state)

# Get an action (0:Left, 1:Right)
def get_action(model, state, epsilon):
    return random.randint(0, 1) if (random.random() <= epsilon) else np.argmax(model[state])

# Update model
def update(model, current_state, next_state, action, reward, alpha=0.85, gamma=0.95):
    model[current_state][action] += alpha * (reward + gamma * np.max(model[next_state]) - model[current_state][action])

# Exploration rate
def get_epsilon(t, min_epsilon, divisor=25):
    return max(min_epsilon, min(1, 1.0 - math.log10((t + 1) / divisor)))

# Learning rate
def get_alpha(t, min_alpha, divisor=25):
    return max(min_alpha, min(1.0, 1.0 - math.log10((t + 1) / divisor)))

# Get a model
def get_model(env) -> tuple:

    # Load a model if we have saved one
    if os.path.isfile('models\\cartpole.q'):
        with open('models\\cartpole.q', 'rb') as fp:
            return pickle.load(fp)

    # Set buckets
    buckets = (1, 1, 6, 12)

    # Return an empty model (Q-table) and buckets
    return (np.zeros(buckets + (env.action_space.n, )), buckets)

# Train a model
def train():

    # Variables
    episodes = 300
    timesteps = 200
    total_score = 0

    # Create an environment
    env = gym.make('CartPole-v0')

    # Get a model (Q table) and buckets
    model, buckets = get_model(env)

    # Loop episodes
    for episode in range(episodes):

        # Start episode and get initial observation (discretized in a tuple)
        current_state = discretize(env, buckets, env.reset())

        # Get learning rate and exploration rate
        alpha = get_alpha(episode, 0.1)
        epsilon = get_epsilon(episode, 0.1)

        # Reset score
        score = 0

        # Loop timesteps
        for t in range(timesteps):

            # Get an action (0:Left, 1:Right)
            action = get_action(model, current_state, epsilon)

            # Perform a step
            # next_state (position, velocity, angle and angular velocity)
            next_state, reward, done, info = env.step(action)

            # Discretize the state to buckets
            next_state = discretize(env, buckets, next_state)

            # Update the model
            update(model, current_state, next_state, action, reward, alpha, 1.0)
  
            # Update the state
            current_state = next_state

            # Update score
            score += reward
            total_score += reward

            # Check if we are done (game over)
            if done:
                print('Episode {0}, Score: {1}, Timesteps: {2}, Epsilon: {3}'.format(episode+1, score, t+1, epsilon))
                break
       
    # Close the environment
    env.close()

    # Save the model (create the models folder first if it is missing)
    os.makedirs('models', exist_ok=True)
    with open('models\\cartpole.q', 'wb') as fp:
        pickle.dump((model, buckets), fp)

    # Print final score
    print()
    print('--- Evaluation ---')
    print('Average score: {0}'.format(total_score / episodes))
    print('Episodes: {0}'.format(episodes))
    print()

    # Print model
    print('--- Model (Q-table) ---')
    print(model)
    print()

# Evaluate a model
def evaluate():

    # Variables
    episodes = 100
    timesteps = 200
    total_score = 0

    # Create an environment
    env = gym.make('CartPole-v0')

    # Get a model (Q table) and buckets
    model, buckets = get_model(env)

    # Loop episodes
    for episode in range(episodes):

        # Start episode and get initial observation (discretized in a tuple)
        state = discretize(env, buckets, env.reset())

        # Reset score
        score = 0

        # Loop timesteps
        for t in range(timesteps):

            # Render the environment
            env.render(mode='human')

            # Get an action (0:Left, 1:Right)
            action = np.argmax(model[state])

            # Perform a step
            # next_state (position, velocity, angle and angular velocity)
            state, reward, done, info = env.step(action)

            # Discretize the state to buckets
            state = discretize(env, buckets, state)

            # Update score
            score += reward
            total_score += reward

            # Check if we are done (game over)
            if done:
                print('Episode {0}, Score: {1}, Timesteps: {2}'.format(episode+1, score, t+1))
                break
       
    # Close the environment
    env.close()

    # Print final score
    print()
    print('--- Evaluation ---')
    print('Average score: {0}'.format(total_score / episodes))
    print('Episodes: {0}'.format(episodes))
    print()

# The main entry point for this module
def main():

    # Train the model
    train()

    # Evaluate the model
    #evaluate()

# Tell python to run main method
if __name__ == "__main__": main()
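To see what the discretization does without starting gym, the same arithmetic can be run on explicit bounds. This is a sketch under assumptions: the helper discretize_with_bounds is hypothetical, and the values 4.8 and 0.418 are my approximation of env.observation_space.high[0] and env.observation_space.high[2] for CartPole-v0.

```python
import math

def discretize_with_bounds(state, buckets, lower_bounds, upper_bounds):
    # Same arithmetic as discretize() above, with the bounds passed in directly
    ratios = [(state[i] + abs(lower_bounds[i])) / (upper_bounds[i] - lower_bounds[i])
              for i in range(len(state))]
    indices = [int(round((buckets[i] - 1) * ratios[i])) for i in range(len(state))]
    # Clamp every index to the valid range 0 .. buckets[i] - 1
    return tuple(min(buckets[i] - 1, max(0, indices[i])) for i in range(len(state)))

buckets = (1, 1, 6, 12)

# Assumed bounds (position, velocity, angle, angular velocity)
upper = [4.8, 0.5, 0.418, math.radians(50)]
lower = [-4.8, -0.5, -0.418, -math.radians(50)]

# Pole upright and at rest
upright = discretize_with_bounds([0.0, 0.0, 0.0, 0.0], buckets, lower, upper)

# Pole leaning to the right and rotating quickly to the right
leaning = discretize_with_bounds([0.0, 0.0, 0.1, 1.0], buckets, lower, upper)
print(upright, leaning)
```

With one bucket each for position and velocity, those two values are effectively ignored, so the Q-table only distinguishes states by angle (6 buckets) and angular velocity (12 buckets).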

Training

Episode 86, Score: 28.0, Timesteps: 28, Epsilon: 0.46344155742846993
Episode 87, Score: 65.0, Timesteps: 65, Epsilon: 0.45842075605341903
Episode 88, Score: 13.0, Timesteps: 13, Epsilon: 0.4534573365218689
Episode 89, Score: 118.0, Timesteps: 118, Epsilon: 0.4485500020271248
Episode 90, Score: 200.0, Timesteps: 200, Epsilon: 0.44369749923271273
Episode 91, Score: 91.0, Timesteps: 91, Epsilon: 0.43889861635094396
Episode 92, Score: 35.0, Timesteps: 35, Epsilon: 0.43415218132648237
Episode 93, Score: 38.0, Timesteps: 38, Epsilon: 0.4294570601181025
Episode 94, Score: 29.0, Timesteps: 29, Epsilon: 0.42481215507233894
Episode 95, Score: 152.0, Timesteps: 152, Epsilon: 0.4202164033831899
Episode 96, Score: 200.0, Timesteps: 200, Epsilon: 0.4156687756324692
Episode 97, Score: 49.0, Timesteps: 49, Epsilon: 0.41116827440579273
Episode 98, Score: 129.0, Timesteps: 129, Epsilon: 0.4067139329795427
Episode 99, Score: 57.0, Timesteps: 57, Epsilon: 0.4023048140744877
Episode 100, Score: 140.0, Timesteps: 140, Epsilon: 0.3979400086720376
Episode 101, Score: 90.0, Timesteps: 90, Epsilon: 0.3936186348893951
Episode 102, Score: 54.0, Timesteps: 54, Epsilon: 0.3893398369101201
Episode 103, Score: 103.0, Timesteps: 103, Epsilon: 0.38510278396686537
Episode 104, Score: 51.0, Timesteps: 51, Epsilon: 0.38090666937325723
Episode 105, Score: 140.0, Timesteps: 140, Epsilon: 0.37675070960209955
Episode 106, Score: 35.0, Timesteps: 35, Epsilon: 0.3726341434072673
Episode 107, Score: 96.0, Timesteps: 96, Epsilon: 0.368556230986828
Episode 108, Score: 200.0, Timesteps: 200, Epsilon: 0.36451625318508785
Episode 109, Score: 80.0, Timesteps: 80, Epsilon: 0.3605135107314139
Episode 110, Score: 70.0, Timesteps: 70, Epsilon: 0.3565473235138126
Episode 111, Score: 39.0, Timesteps: 39, Epsilon: 0.35261702988538013
Episode 112, Score: 36.0, Timesteps: 36, Epsilon: 0.348721986001856
Episode 113, Score: 22.0, Timesteps: 22, Epsilon: 0.3448615651886179
Episode 114, Score: 51.0, Timesteps: 51, Epsilon: 0.341035157335565
Episode 115, Score: 27.0, Timesteps: 27, Epsilon: 0.3372421683184259
Episode 116, Score: 200.0, Timesteps: 200, Epsilon: 0.3334820194451191
Episode 117, Score: 84.0, Timesteps: 84, Epsilon: 0.329754146925876
Episode 118, Score: 128.0, Timesteps: 128, Epsilon: 0.32605800136591223
Episode 119, Score: 26.0, Timesteps: 26, Epsilon: 0.3223930472795069
Episode 120, Score: 48.0, Timesteps: 48, Epsilon: 0.31875876262441283
Episode 121, Score: 16.0, Timesteps: 16, Epsilon: 0.31515463835558755
Episode 122, Score: 53.0, Timesteps: 53, Epsilon: 0.31158017799728943
Episode 123, Score: 34.0, Timesteps: 34, Epsilon: 0.30803489723263966
Episode 124, Score: 74.0, Timesteps: 74, Epsilon: 0.30451832350980257
Episode 125, Score: 59.0, Timesteps: 59, Epsilon: 0.30102999566398114
Episode 126, Score: 57.0, Timesteps: 57, Epsilon: 0.29756946355447467
Episode 127, Score: 61.0, Timesteps: 61, Epsilon: 0.2941362877160807
Episode 128, Score: 45.0, Timesteps: 45, Epsilon: 0.2907300390241693
Episode 129, Score: 175.0, Timesteps: 175, Epsilon: 0.2873502983727886
Episode 130, Score: 140.0, Timesteps: 140, Epsilon: 0.2839966563652008
Episode 131, Score: 88.0, Timesteps: 88, Epsilon: 0.2806687130162734
Episode 132, Score: 80.0, Timesteps: 80, Epsilon: 0.2773660774661877
Episode 133, Score: 108.0, Timesteps: 108, Epsilon: 0.2740883677049518
Episode 134, Score: 82.0, Timesteps: 82, Epsilon: 0.2708352103072299
Episode 135, Score: 90.0, Timesteps: 90, Epsilon: 0.2676062401770315
Episode 136, Score: 45.0, Timesteps: 45, Epsilon: 0.26440110030182007
Episode 137, Score: 172.0, Timesteps: 172, Epsilon: 0.2612194415156308
Episode 138, Score: 42.0, Timesteps: 42, Epsilon: 0.25806092227080113
Episode 139, Score: 75.0, Timesteps: 75, Epsilon: 0.25492520841794253
Episode 140, Score: 200.0, Timesteps: 200, Epsilon: 0.25181197299379965
Episode 141, Score: 113.0, Timesteps: 113, Epsilon: 0.2487208960166577
Episode 142, Score: 74.0, Timesteps: 74, Epsilon: 0.2456516642889811
Episode 143, Score: 31.0, Timesteps: 31, Epsilon: 0.24260397120697585
Episode 144, Score: 200.0, Timesteps: 200, Epsilon: 0.23957751657678794
Episode 145, Score: 127.0, Timesteps: 127, Epsilon: 0.23657200643706278
Episode 146, Score: 103.0, Timesteps: 103, Epsilon: 0.23358715288760057
Episode 147, Score: 172.0, Timesteps: 172, Epsilon: 0.2306226739238615
Episode 148, Score: 134.0, Timesteps: 134, Epsilon: 0.22767829327708022
Episode 149, Score: 40.0, Timesteps: 40, Epsilon: 0.22475374025976358
Episode 150, Score: 107.0, Timesteps: 107, Epsilon: 0.22184874961635637
Episode 151, Score: 50.0, Timesteps: 50, Epsilon: 0.21896306137886812
Episode 152, Score: 200.0, Timesteps: 200, Epsilon: 0.2160964207272651
Episode 153, Score: 200.0, Timesteps: 200, Epsilon: 0.21324857785443885
Episode 154, Score: 200.0, Timesteps: 200, Epsilon: 0.2104192878355745
Episode 155, Score: 200.0, Timesteps: 200, Epsilon: 0.2076083105017461
Episode 156, Score: 200.0, Timesteps: 200, Epsilon: 0.204815410317576
Episode 157, Score: 200.0, Timesteps: 200, Epsilon: 0.2020403562628038
Episode 158, Score: 200.0, Timesteps: 200, Epsilon: 0.19928292171761497
Episode 159, Score: 200.0, Timesteps: 200, Epsilon: 0.19654288435158607
Episode 160, Score: 200.0, Timesteps: 200, Epsilon: 0.1938200260161128
Episode 161, Score: 200.0, Timesteps: 200, Epsilon: 0.19111413264018784
Episode 162, Score: 200.0, Timesteps: 200, Epsilon: 0.1884249941294066
Episode 163, Score: 200.0, Timesteps: 200, Epsilon: 0.18575240426807982
Episode 164, Score: 200.0, Timesteps: 200, Epsilon: 0.18309616062433975
Episode 165, Score: 200.0, Timesteps: 200, Epsilon: 0.18045606445813134
Episode 166, Score: 200.0, Timesteps: 200, Epsilon: 0.1778319206319825
Episode 167, Score: 200.0, Timesteps: 200, Epsilon: 0.1752235375244543
Episode 168, Score: 200.0, Timesteps: 200, Epsilon: 0.17263072694617476
Episode 169, Score: 200.0, Timesteps: 200, Epsilon: 0.17005330405836405
Episode 170, Score: 200.0, Timesteps: 200, Epsilon: 0.16749108729376372
Episode 171, Score: 200.0, Timesteps: 200, Epsilon: 0.16494389827988376
Episode 172, Score: 200.0, Timesteps: 200, Epsilon: 0.16241156176448868
Episode 173, Score: 200.0, Timesteps: 200, Epsilon: 0.15989390554324223
Episode 174, Score: 200.0, Timesteps: 200, Epsilon: 0.1573907603894379
Episode 175, Score: 200.0, Timesteps: 200, Epsilon: 0.1549019599857432
Episode 176, Score: 200.0, Timesteps: 200, Epsilon: 0.15242734085788778
Episode 177, Score: 200.0, Timesteps: 200, Epsilon: 0.149966742310231
Episode 178, Score: 200.0, Timesteps: 200, Epsilon: 0.14752000636314366
Episode 179, Score: 200.0, Timesteps: 200, Epsilon: 0.1450869776921444
Episode 180, Score: 200.0, Timesteps: 200, Epsilon: 0.14266750356873148
Episode 181, Score: 200.0, Timesteps: 200, Epsilon: 0.14026143380285305
Episode 182, Score: 200.0, Timesteps: 200, Epsilon: 0.13786862068696282
Episode 183, Score: 200.0, Timesteps: 200, Epsilon: 0.13548891894160808
Episode 184, Score: 200.0, Timesteps: 200, Epsilon: 0.1331221856625011
Episode 185, Score: 200.0, Timesteps: 200, Epsilon: 0.13076828026902376
Episode 186, Score: 200.0, Timesteps: 200, Epsilon: 0.12842706445412122
Episode 187, Score: 200.0, Timesteps: 200, Epsilon: 0.12609840213553858
Episode 188, Score: 200.0, Timesteps: 200, Epsilon: 0.1237821594083578
Episode 189, Score: 200.0, Timesteps: 200, Epsilon: 0.12147820449879354
Episode 190, Score: 200.0, Timesteps: 200, Epsilon: 0.11918640771920863
Episode 191, Score: 200.0, Timesteps: 200, Epsilon: 0.11690664142431006
Episode 192, Score: 200.0, Timesteps: 200, Epsilon: 0.11463877996848804
Episode 193, Score: 200.0, Timesteps: 200, Epsilon: 0.11238269966426384
Episode 194, Score: 200.0, Timesteps: 200, Epsilon: 0.11013827874181159
Episode 195, Score: 200.0, Timesteps: 200, Epsilon: 0.10790539730951965
Episode 196, Score: 200.0, Timesteps: 200, Epsilon: 0.10568393731556158
Episode 197, Score: 200.0, Timesteps: 200, Epsilon: 0.1034737825104447
Episode 198, Score: 200.0, Timesteps: 200, Epsilon: 0.10127481841050645
Episode 199, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 200, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 201, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 202, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 203, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 204, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 205, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 206, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 207, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 208, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 209, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 210, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 211, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 212, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 213, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 214, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 215, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 216, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 217, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 218, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 219, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 220, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 221, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 222, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 223, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 224, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 225, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 226, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 227, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 228, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 229, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 230, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 231, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 232, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 233, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 234, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 235, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 236, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 237, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 238, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 239, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 240, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 241, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 242, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 243, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 244, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 245, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 246, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 247, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 248, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 249, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 250, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 251, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 252, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 253, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 254, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 255, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 256, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 257, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 258, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 259, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 260, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 261, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 262, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 263, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 264, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 265, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 266, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 267, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 268, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 269, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 270, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 271, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 272, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 273, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 274, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 275, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 276, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 277, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 278, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 279, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 280, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 281, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 282, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 283, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 284, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 285, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 286, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 287, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 288, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 289, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 290, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 291, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 292, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 293, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 294, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 295, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 296, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 297, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 298, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 299, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 300, Score: 200.0, Timesteps: 200, Epsilon: 0.1

--- Evaluation ---
Average score: 125.72666666666667
Episodes: 300
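
The epsilon values in the training log follow directly from the get_epsilon function in the code. The sketch below repeats that function and prints the schedule for a few episodes; the selection of episodes is arbitrary.

```python
import math

def get_epsilon(t, min_epsilon, divisor=25):
    # Same schedule as in the agent: logarithmic decay, clipped to [min_epsilon, 1]
    return max(min_epsilon, min(1, 1.0 - math.log10((t + 1) / divisor)))

# The training loop passes the zero-based episode index,
# while the log prints episode + 1
for episode in (1, 25, 90, 198, 199, 300):
    print(episode, get_epsilon(episode - 1, 0.1))
```

Epsilon stays at 1 for the first 25 episodes, decays logarithmically after that, and is clipped to the minimum of 0.1 from episode 199 onwards, which matches the log above.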

--- Model (Q-table) ---
[[[[[  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]]

   [[ 26.63994576  28.49148154]
    [ 23.20222149  26.46118573]
    [ 21.13854075  26.53317167]
    [ 17.30265287  26.27408546]
    [ 16.72772951  23.51441257]
    [  0.          23.83321618]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]]

   [[328.05320439  95.25088858]
    [553.20592401 162.97711459]
    [537.45198849 255.21496651]
    [577.1094048  502.60218571]
    [566.48400495 383.81600409]
    [578.25566476 563.78102189]
    [577.65832768 554.68481   ]
    [528.49528721 578.19611935]
    [508.65127267 578.01982338]
    [241.71554661 530.27294801]
    [165.98990398 507.88389605]
    [ 79.41272507 169.88054431]]

   [[230.61126595  71.25934918]
    [519.82742333 129.4517015 ]
    [531.26507526 192.56401542]
    [578.50717041 516.87577006]
    [577.38442447 532.41981888]
    [558.66474821 578.55925508]
    [563.71522864 578.50416538]
    [363.28307807 565.98936464]
    [496.11067625 576.50651204]
    [264.02898773 527.26034577]
    [238.63787799 554.40185893]
    [202.70950775 371.15281394]]

   [[  0.           0.        ]
    [  0.           0.        ]
    [  0.          38.52296094]
    [  0.           0.        ]
    [ 22.9402602   63.74988989]
    [ 52.03343446  23.82415025]
    [ 45.87911801  23.97587393]
    [ 23.58355396  49.89961519]
    [ 47.69941954  40.08997247]
    [ 46.96344845  33.3477977 ]
    [ 49.27767993  40.11629936]
    [ 50.15477103  46.12943125]]

   [[  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]]]]]
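
The printed Q-table can be turned into a policy by taking the best action in each row. As a small sketch, the values below are copied from the third block of the table above (angle bucket with index 2), with one row per angular-velocity bucket and the columns ordered as (left, right).

```python
import numpy as np

# Q-values for angle bucket 2, copied from the printed Q-table
angle_bucket_2 = np.array([
    [328.05320439,  95.25088858],
    [553.20592401, 162.97711459],
    [537.45198849, 255.21496651],
    [577.1094048,  502.60218571],
    [566.48400495, 383.81600409],
    [578.25566476, 563.78102189],
    [577.65832768, 554.68481],
    [528.49528721, 578.19611935],
    [508.65127267, 578.01982338],
    [241.71554661, 530.27294801],
    [165.98990398, 507.88389605],
    [ 79.41272507, 169.88054431],
])

# Greedy action per angular-velocity bucket (0: Left, 1: Right)
greedy_actions = np.argmax(angle_bucket_2, axis=1)
print(greedy_actions)
```

In this block the greedy action is to push left for the lower angular-velocity buckets and to push right for the higher ones, so the learned policy appears to push the cart in the direction the pole is rotating. The blocks with only zeros belong to angle buckets that the agent never updated during training.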

Evaluation

Episode 1, Score: 200.0, Timesteps: 200
Episode 2, Score: 200.0, Timesteps: 200
Episode 3, Score: 200.0, Timesteps: 200
Episode 4, Score: 200.0, Timesteps: 200
Episode 5, Score: 200.0, Timesteps: 200
Episode 6, Score: 200.0, Timesteps: 200
Episode 7, Score: 200.0, Timesteps: 200
Episode 8, Score: 200.0, Timesteps: 200
Episode 9, Score: 200.0, Timesteps: 200
Episode 10, Score: 200.0, Timesteps: 200
Episode 11, Score: 200.0, Timesteps: 200
Episode 12, Score: 200.0, Timesteps: 200
Episode 13, Score: 200.0, Timesteps: 200
Episode 14, Score: 200.0, Timesteps: 200
Episode 15, Score: 200.0, Timesteps: 200
Episode 16, Score: 200.0, Timesteps: 200
Episode 17, Score: 200.0, Timesteps: 200
Episode 18, Score: 200.0, Timesteps: 200
Episode 19, Score: 200.0, Timesteps: 200
Episode 20, Score: 200.0, Timesteps: 200
Episode 21, Score: 200.0, Timesteps: 200
Episode 22, Score: 200.0, Timesteps: 200
Episode 23, Score: 200.0, Timesteps: 200
Episode 24, Score: 200.0, Timesteps: 200
Episode 25, Score: 200.0, Timesteps: 200
Episode 26, Score: 200.0, Timesteps: 200
Episode 27, Score: 200.0, Timesteps: 200
Episode 28, Score: 200.0, Timesteps: 200
Episode 29, Score: 200.0, Timesteps: 200
Episode 30, Score: 200.0, Timesteps: 200
Episode 31, Score: 200.0, Timesteps: 200
Episode 32, Score: 200.0, Timesteps: 200
Episode 33, Score: 200.0, Timesteps: 200
Episode 34, Score: 200.0, Timesteps: 200
Episode 35, Score: 200.0, Timesteps: 200
Episode 36, Score: 200.0, Timesteps: 200
Episode 37, Score: 200.0, Timesteps: 200
Episode 38, Score: 200.0, Timesteps: 200
Episode 39, Score: 200.0, Timesteps: 200
Episode 40, Score: 200.0, Timesteps: 200
Episode 41, Score: 200.0, Timesteps: 200
Episode 42, Score: 200.0, Timesteps: 200
Episode 43, Score: 200.0, Timesteps: 200
Episode 44, Score: 200.0, Timesteps: 200
Episode 45, Score: 200.0, Timesteps: 200
Episode 46, Score: 200.0, Timesteps: 200
Episode 47, Score: 200.0, Timesteps: 200
Episode 48, Score: 200.0, Timesteps: 200
Episode 49, Score: 200.0, Timesteps: 200
Episode 50, Score: 200.0, Timesteps: 200
Episode 51, Score: 200.0, Timesteps: 200
Episode 52, Score: 200.0, Timesteps: 200
Episode 53, Score: 200.0, Timesteps: 200
Episode 54, Score: 200.0, Timesteps: 200
Episode 55, Score: 200.0, Timesteps: 200
Episode 56, Score: 200.0, Timesteps: 200
Episode 57, Score: 200.0, Timesteps: 200
Episode 58, Score: 200.0, Timesteps: 200
Episode 59, Score: 200.0, Timesteps: 200
Episode 60, Score: 200.0, Timesteps: 200
Episode 61, Score: 200.0, Timesteps: 200
Episode 62, Score: 200.0, Timesteps: 200
Episode 63, Score: 200.0, Timesteps: 200
Episode 64, Score: 200.0, Timesteps: 200
Episode 65, Score: 200.0, Timesteps: 200
Episode 66, Score: 200.0, Timesteps: 200
Episode 67, Score: 200.0, Timesteps: 200
Episode 68, Score: 200.0, Timesteps: 200
Episode 69, Score: 200.0, Timesteps: 200
Episode 70, Score: 200.0, Timesteps: 200
Episode 71, Score: 200.0, Timesteps: 200
Episode 72, Score: 200.0, Timesteps: 200
Episode 73, Score: 200.0, Timesteps: 200
Episode 74, Score: 200.0, Timesteps: 200
Episode 75, Score: 200.0, Timesteps: 200
Episode 76, Score: 200.0, Timesteps: 200
Episode 77, Score: 200.0, Timesteps: 200
Episode 78, Score: 200.0, Timesteps: 200
Episode 79, Score: 200.0, Timesteps: 200
Episode 80, Score: 200.0, Timesteps: 200
Episode 81, Score: 200.0, Timesteps: 200
Episode 82, Score: 200.0, Timesteps: 200
Episode 83, Score: 200.0, Timesteps: 200
Episode 84, Score: 200.0, Timesteps: 200
Episode 85, Score: 200.0, Timesteps: 200
Episode 86, Score: 200.0, Timesteps: 200
Episode 87, Score: 200.0, Timesteps: 200
Episode 88, Score: 200.0, Timesteps: 200
Episode 89, Score: 200.0, Timesteps: 200
Episode 90, Score: 200.0, Timesteps: 200
Episode 91, Score: 200.0, Timesteps: 200
Episode 92, Score: 200.0, Timesteps: 200
Episode 93, Score: 200.0, Timesteps: 200
Episode 94, Score: 200.0, Timesteps: 200
Episode 95, Score: 200.0, Timesteps: 200
Episode 96, Score: 200.0, Timesteps: 200
Episode 97, Score: 200.0, Timesteps: 200
Episode 98, Score: 200.0, Timesteps: 200
Episode 99, Score: 200.0, Timesteps: 200
Episode 100, Score: 200.0, Timesteps: 200

--- Evaluation ---
Average score: 200.0
Episodes: 100
