February 27, 2019

Solving Exploration-Exploitation Dilemma in Reinforcement Learning

Explore-Exploit Reinforcement Learning

In this blog we will be discussing about methods that can solve the explore exploit dilemma. It’s going to be little bit more technical compared to last two blogs. Read about foundational blogs here and here.

Epsilon-Greedy Algorithm 
This is one of the simplest possible algorithms for trading off exploration and exploitation, where you make the decision at each step of the way; you can either choose to take a random action or choose the agent’s trusted action. This algorithm more or less mimics greedy algorithm as it generally exploits the best available option, but every once in a while the Epsilon-Greedy algorithm explores the other available options. The probability of taking random action is known as Epsilon in this algorithm and 1-Epsilon as the probability of exploiting the recommended action.

Epsilon-Greedy Solution to RL

In the above figure, it is clear that algorithm at any moment exploits/chooses the best with the probability of 1-Epsilon whereas, explores the best with the probability of Epsilon/n.
There is one drawback to this algorithm, i.e when it explores it chooses equally among the all possible actions which mean it is equally likely to choose the worst action and best action. There are better approaches than this like Softmax-Action selection, which we will be discussing in the future blogs.

That’s it. That was some high-level details on Epsilon-Greedy Algorithm. We will now discuss it’s implementation in Python. If we see carefully, then we would like to initialize and update certain parameters as the code executes.

If we notice carefully, then we would like to initialize and update certain parameters as the code executes.
  1. epsilon: Setting this value to 0.1 tells we’ll explore with 10% probability and exploit with 90% probability.
  2. counts: Stores the count for each action that tells us how many times we’ve played each of the available actions.
  3. values: Stores the average amount of reward we’ve gotten when playing each of the actions available to us.
We initialize counts and values to 0 and 0.0 respectively for available actions.
 def explore(values):  
   Returns the index of maximum average value   
   for an action  
   return values.index(max(values))  
 def exploit(values):  
   Returns random average value for an action  
   return random.randrange(len(values))  
 def select_action(epsilon, values):  
   If the random number is greater than epsilon  
   then we exploit else we explore.  
   if random.random() > epsilon:  
     return exploit(values)  
     return explore(values) 
We now need a function that does an update to counts and values array that we had initialized.
 def update_counts(action):  
   Update the value of count corresponding   
   to the action that is taken.  
   counts[action] = counts[action] + 1  
   return counts  
 def update_values(action, reward, counts):  
   Calculates the estimated value for the chosen  
   action. If this is the first experience ever with   
   the chosen arm, we set the estimated value directly   
   to the reward we just received from playing that arm.  
   If we had played the arm in the past, i.e n != 0,   
   we update the estimated value of the chosen arm to   
   be a weighted average of the previously estimated   
   value and the reward we just received.  
   current_value = values[action]  
   n = counts[action]  
   values[action] = ((n-1)/float(n))*value + (1/float(n))*reward  
   return values  
 def update_all(action, reward):  
   counts = update_counts(action)  
   values = update_values(action, reward, counts)  
   return counts, values  
This completes all the important functions required to head start a cool project. 

P.S. I might not be optimal in my implementation, suggestions are always appreciated.

Feel free to share your thoughts - Thanks 

No comments:

Post a Comment

Please share your valuable feedback. It will help me and community grow.