import numpy as np
from rl_opts.rl_framework.legacy import TargetEnv, Forager
from tqdm.notebook import tqdm
Reinforcement Learning
In order to train an RL agent, we need (i) an environment and (ii) a learning method. In this work, we define a foraging environment where the goal of the agent is to find as many targets as possible in a given time. We consider environments with non-destructive (or replenishable) targets, which we implement by displacing the agent a distance \(l_\textrm{c}\) from the center of the found target.
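As intuition for this displacement, here is a minimal sketch, not the library's implementation (the actual logic lives inside TargetEnv): after a detection, the agent is placed a distance \(l_\textrm{c}\) away from the target's center, here in a uniformly random direction.

```python
import numpy as np

def kick_from_target(target_center, lc, rng=None):
    #Sketch of the post-detection "kick": move the agent to a point at distance
    #lc from the target center, choosing a uniformly random direction.
    if rng is None:
        rng = np.random.default_rng()
    angle = rng.uniform(0, 2 * np.pi)
    return np.asarray(target_center) + lc * np.array([np.cos(angle), np.sin(angle)])
```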
As for the agent, we use Projective Simulation (PS) to model its decision making process and learning method. However, other algorithms that work with stochastic policies can also be used.
First, we import the classes that define the environment (`TargetEnv`), the forager dynamics (`Forager`), and its learning method.
Note: the class `Forager`, as it currently stands, inherits the methods of a PS agent for decision making and learning. However, other learning algorithms can be implemented directly by changing this inheritance. The learning algorithm should provide a method for decision making, called `deliberate`, which takes a state as input, and another one for updating the policy, called `learn`, which takes a reward as input.
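As a rough sketch of that interface (the class name and internals below are illustrative, not part of the library), a drop-in learner only needs to expose these two methods:

```python
import numpy as np

class UniformLearner:
    #Illustrative skeleton of a learner with the interface described above;
    #only deliberate() and learn() are required by the training loop.
    def __init__(self, num_states, num_actions):
        #one probability distribution over actions per state
        self.policy = np.full((num_states, num_actions), 1.0 / num_actions)

    def deliberate(self, state):
        #decision making: sample an action from the current stochastic policy
        return np.random.choice(len(self.policy[state]), p=self.policy[state])

    def learn(self, reward):
        #policy update from the received reward (left as a no-op in this sketch)
        pass
```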
We set up the parameters defining the length of the episodes (number of RL steps) and the number of episodes.
TIME_EP = 200    #time steps per episode
EPISODES = 1200  #number of episodes
We initialize the environment.
#Environment parameters
Nt = 100  #number of targets
L = 100   #world size
r = 0.5   #target detection radius
lc = 1    #cutoff length

#Initialize environment
env = TargetEnv(Nt, L, r, lc)
We initialize the agent. As states, the agent perceives the value of an internal counter that keeps track of the number of small steps it has performed without turning. The possible actions are to continue walking in the same direction or to turn. In either case, the agent performs a small step of length \(d=1\) after making a decision. Let's define the parameters of the PS forager agent and initialize it:
NUM_ACTIONS = 2  #continue in the same direction, turn
STATE_SPACE = [np.linspace(0, TIME_EP-1, TIME_EP), np.arange(1), np.arange(1)]  #one state per value that the counter may possibly have within an episode.
#--the last two entries are just placeholders here, but the code is general enough to implement ensembles of interacting agents that forage together.--
GAMMA = 0.00001  #forgetting parameter in PS
ETA_GLOW = 0.1   #glow damping parameter in PS

INITIAL_DISTR = []  #set a different initialization policy
for percept in range(TIME_EP):
    INITIAL_DISTR.append([0.99, 0.01])

#Initialize agent
agent = Forager(num_actions=NUM_ACTIONS,
                state_space=STATE_SPACE,
                gamma_damping=GAMMA,
                eta_glow_damping=ETA_GLOW,
                initial_prob_distr=INITIAL_DISTR)
We run the learning process.
for e in tqdm(range(EPISODES)):

    #restart environment and agent's counter and g matrix
    env.init_env()
    agent.agent_state = 0
    agent.reset_g()

    for t in range(TIME_EP):

        #step to set counter to its min. value n=1
        if t == 0 or env.kicked[0]:

            #do one step with random direction (no learning in this step)
            env.update_pos(1)

            #check boundary conditions
            env.check_bc()

            #reset counter
            agent.agent_state = 0

            #set kicked value to false again
            env.kicked[0] = 0

        else:
            #get perception
            state = agent.get_state()

            #decide
            action = agent.deliberate(state)

            #act (update counter)
            agent.act(action)

            #update positions
            env.update_pos(action)

            #check if target was found + kick if it is
            reward = env.check_encounter()

            #check boundary conditions
            env.check_bc()

            #learn
            agent.learn(reward)
Note: the code can directly accommodate environments with several agents that interact. For this reason, you will find methods in both the environment class `TargetEnv` and the forager class `Forager` that deal with agents that have visual cones and can perceive the presence of other agents in their surroundings. However, these features are not used in this work.
Reproduction of results
Here, we explain how to reproduce the results of the paper that concern the training of RL agents in the foraging environment.
Training
You can run the training that is detailed above by means of the method [`learning`](https://gorkamunoz.github.io/rl_opts/lib_nbs/learning_and_benchmark.html#learning), which also saves the agent's memory periodically.
First, import [`learning`](https://gorkamunoz.github.io/rl_opts/lib_nbs/learning_and_benchmark.html#learning):
from rl_opts.learn_and_bench import learning
`learning` takes as inputs a configuration dictionary (config), a path to the folder where the results are saved (results_path), and the agent's identifier (run). Let us detail each input separately.
- Configuration dictionary (config): it contains the parameters to initialize both the environment and the agent. For each set of parameters we ran, there is an identifier of the form “exp_numconfig” (e.g. exp_0) that uniquely identifies the config file. The config files for the experiments that give the results of the paper can be found in the directory ‘configurations/learning/’.
These are the parameters that you can find in the config files:
- `NUM_TARGETS`: number of targets
- `WORLD_SIZE`: side of the square that defines the world (with periodic boundary conditions)
- `r`: target detection radius
- `lc`: cutoff length
- `MAX_STEP_L`: maximum value of the step counter (which coincides with the number of RL steps per episode)
- `NUM_BINS`: number of bins into which the state space is split. This is set to have one state per value of the counter
- `NUM_ACTIONS`: number of actions
- `GAMMA`: forgetting parameter \(\gamma\) in PS
- `ETA_GLOW`: glow damping parameter \(\eta_g\) in PS
- `PI_INIT`: policy initialization \(\pi_0\) (\(\forall n\)). Note that it is given as \(\pi_0(\uparrow|n)\) (see the sketch after this list)
- `NUM_EPISODES`: number of episodes
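As a quick sanity check, PI_INIT relates to the INITIAL_DISTR built in the training code above roughly as follows. This is a sketch assuming action 0 is "continue"; the helper name is ours, not part of rl_opts:

```python
def initial_distr_from_pi(pi_init, num_percepts):
    #Hypothetical helper: expand the scalar PI_INIT into the per-percept
    #[continue, turn] distribution that was passed to Forager above.
    return [[pi_init, 1.0 - pi_init] for _ in range(num_percepts)]

#PI_INIT = 0.99 reproduces the INITIAL_DISTR used earlier (one entry per counter value)
initial_distr = initial_distr_from_pi(0.99, 200)
```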
We study foraging in environments with different cutoff lengths \(l_\textrm{c}\). Exp_0 corresponds to \(l_\textrm{c}=0.6\). Exp_1..10 correspond to \(l_\textrm{c}=1..10\), respectively. In experiments exp_0..10, the initialization policy is \(\pi_0(\Rsh|n)=0.01\) \(\forall n\). Exp_11 and exp_12 correspond to experiments where the initialization policy is \(\pi_0(\Rsh|n)=0.5\) \(\forall n\). Each experiment is run with 10 independent, different agents (run \(\in [0,9]\)).
As an example, you can import the configuration from experiment exp_8 by running:
from rl_opts.utils import get_config
config = get_config('exp_8.cfg')
Alternatively, you can also define your own config dictionary with the parameters detailed above:
my_config = {'NUM_TARGETS' : 100,
             'WORLD_SIZE' : 100,
             'r' : 0.5,
             'lc' : 2,
             'MAX_STEP_L' : 100,
             'NUM_BINS' : 100,
             'NUM_ACTIONS' : 2,
             'GAMMA' : 0.00001,
             'ETA_GLOW' : 0.1,
             'PI_INIT' : 0.99,
             'NUM_EPISODES' : 500}
- Results path (results_path): Path where you want to save the results. The agent's memory (h matrix) is saved every 500 episodes to a file named 'memory_agent…' (e.g. 'memory_agent_0_episode_500.npy').
results_path = 'results/learning/test/'
- Agent’s identifier (run): integer that identifies the agent. With this identifier, you can later retrieve the agent’s memory or its performance (see the following section on Postlearning analysis).
After defining the inputs, you can run the learning:
learning(my_config, results_path, run=0)
Once the training is finished, you can get the policy of the agent (as \(\pi(\uparrow|n)\)) at any of the episodes in which the memory was saved by running:
from rl_opts.utils import get_policy
saved_policy = get_policy(results_path, run=0, training_episode=500)
Note: in the code, the policies are always given as \(\pi(\uparrow|n)\).
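For a quick look at the retrieved policy, you can, for instance, plot the probability of turning as a function of the counter \(n\). This is a sketch that assumes saved_policy holds one \(\pi(\uparrow|n)\) value per counter value:

```python
import numpy as np
import matplotlib.pyplot as plt

pi_continue = np.asarray(saved_policy)  #assumed: pi(continue|n), indexed by the counter n
plt.plot(1 - pi_continue)               #probability of turning at each counter value
plt.xlabel('counter n')
plt.ylabel('probability of turning')
plt.show()
```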
Fig. 3 and Fig. 4 show the policies of the agents at the end of a training consisting of 12000 episodes of 20000 RL steps each. The policies can be retrieved with [`get_policy`](https://gorkamunoz.github.io/rl_opts/lib_nbs/utils.html#get_policy) as detailed above, by setting `training_episode = 12000` and the corresponding agent identifier.
Postlearning analysis
In order to fairly compare the performance of the RL agents throughout the training with that of the benchmark models (Fig. 2), we need to run the same number of walks. During training, the agent's policy changes from one episode to the next, and taking the efficiency of just one episode (i.e. one walk) is not enough, since we consider \(10^4\) walks for the benchmark policies. Thus, we save the agent's policy at different stages of the training and then, in a postlearning analysis, we run \(10^4\) walks with that frozen policy to get a more accurate evaluation of its performance.
This performance analysis is done with the method [`agent_efficiency`](https://gorkamunoz.github.io/rl_opts/lib_nbs/learning_and_benchmark.html#agent_efficiency), which is imported by running:
from rl_opts.learn_and_bench import agent_efficiency
To run it, you first need to define:
- The results path from where it retrieves the agent's memory at different stages of the training. Thus, it needs to be the same path where you saved the results of the training. The results of this analysis are also saved there.
- The configuration file you used to train the agent. To reproduce the results from Fig. 2, first get the corresponding config file as detailed in the previous section.
- The agent's identifier.
- The number of walks. To reproduce the results from Fig. 2, set this parameter to 10000.
- An episode interval. This function analyzes the performance of the agent at different stages of the training. To reproduce our results from Fig. 2, you should set this parameter to 2000, which means the performance is analyzed every 2000 episodes, until the end of the training.
To do the postlearning analysis on the example of the previous section, you run:
agent_efficiency(results_path, my_config, run=0, num_walks=100, episode_interval=500)
Essentially, this analysis is carried out by the method `walk_from_policy`, which takes a fixed policy as input and runs the walks in parallel. It outputs a list with the efficiency achieved in each walk.
You can find the results of, for example, the last episode, in the file ‘performance_post_training_agent_0_episode_500.npy’.
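You can also inspect such a file directly; this is a sketch assuming the .npy file stores the per-walk efficiencies:

```python
import numpy as np

#Load the per-walk efficiencies of agent 0 at episode 500 and average them
effs = np.load(results_path + 'performance_post_training_agent_0_episode_500.npy')
print(effs.mean())
```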
To get an array with the average performances (over the number of walks) of several agents throughout the training, you can run:
from rl_opts.utils import get_performance
ag_list = [0]   #in this example, we only ran one agent, but you can input here the identifiers of all the agents you ran.
ep_list = [500] #get the performance at episode 500 of the agents in ag_list.

av_performance, sem = get_performance(results_path, agent_list=ag_list, episode_list=ep_list)