## February 27, 2019

### Solving Exploration-Exploitation Dilemma in Reinforcement Learning [Source]

In this blog we will be discussing about methods that can solve the explore exploit dilemma. It’s going to be little bit more technical compared to last two blogs. Read about foundational blogs here and here.

Epsilon-Greedy Algorithm
This is one of the simplest possible algorithms for trading off exploration and exploitation, where you make the decision at each step of the way; you can either choose to take a random action or choose the agent’s trusted action. This algorithm more or less mimics greedy algorithm as it generally exploits the best available option, but every once in a while the Epsilon-Greedy algorithm explores the other available options. The probability of taking random action is known as Epsilon in this algorithm and 1-Epsilon as the probability of exploiting the recommended action. [Source]

In the above figure, it is clear that algorithm at any moment exploits/chooses the best with the probability of 1-Epsilon whereas, explores the best with the probability of Epsilon/n.
There is one drawback to this algorithm, i.e when it explores it chooses equally among the all possible actions which mean it is equally likely to choose the worst action and best action. There are better approaches than this like Softmax-Action selection, which we will be discussing in the future blogs.

That’s it. That was some high-level details on Epsilon-Greedy Algorithm. We will now discuss it’s implementation in Python. If we see carefully, then we would like to initialize and update certain parameters as the code executes.

If we notice carefully, then we would like to initialize and update certain parameters as the code executes.
1. epsilon: Setting this value to 0.1 tells we’ll explore with 10% probability and exploit with 90% probability.
2. counts: Stores the count for each action that tells us how many times we’ve played each of the available actions.
3. values: Stores the average amount of reward we’ve gotten when playing each of the actions available to us.
We initialize counts and values to 0 and 0.0 respectively for available actions.
`````` def explore(values):
'''
Returns the index of maximum average value
for an action
'''
return values.index(max(values))
def exploit(values):
'''
Returns random average value for an action
'''
return random.randrange(len(values))
def select_action(epsilon, values):
'''
If the random number is greater than epsilon
then we exploit else we explore.
'''
if random.random() > epsilon:
return exploit(values)
else:
return explore(values) ``````
We now need a function that does an update to counts and values array that we had initialized.
`````` def update_counts(action):
'''
Update the value of count corresponding
to the action that is taken.
'''
counts[action] = counts[action] + 1
return counts
def update_values(action, reward, counts):
'''
Calculates the estimated value for the chosen
action. If this is the first experience ever with
the chosen arm, we set the estimated value directly
to the reward we just received from playing that arm.
If we had played the arm in the past, i.e n != 0,
we update the estimated value of the chosen arm to
be a weighted average of the previously estimated
value and the reward we just received.
'''
current_value = values[action]
n = counts[action]
values[action] = ((n-1)/float(n))*value + (1/float(n))*reward
return values
def update_all(action, reward):
counts = update_counts(action)
values = update_values(action, reward, counts)
return counts, values  ``````
This completes all the important functions required to head start a cool project.

P.S. I might not be optimal in my implementation, suggestions are always appreciated.

Feel free to share your thoughts - Thanks