function [As,Q,R,p]=banditP(N,Aq,beta)
%FUNCTION [As,Q,R,p]=banditP(N,Aq,beta)
% Performs the N-armed bandit example using the pursuit strategy, fixed beta
% Inputs:
%    N=number of trials total
%    Aq=Actual rewards for each bandit (these are the mean rewards)
%    beta=learning parameter
% Outputs:
%    As=Action selected on trial j, j=1:N
%    Q= Current reward estimates
%    R= Reward on action j, j=1:N
%    p= probability estimates (Use P for cumulative)

numbandits=length(Aq);        %Number of bandits
ActNum=zeros(numbandits,1);   %Running count of the number of times
                              %  each action is selected.
ActVal=zeros(numbandits,1);   %Running sum of the total reward
                              %  obtained for each action.
Q=zeros(1,numbandits);        %Current reward estimates
As=zeros(N,1);                %Storage for the action taken on each trial
R=zeros(N,1);                 %Storage for the reward on each trial
                              %  (for averaging, as in Figure 2.1, p 28)
p=(1/numbandits)*ones(1,numbandits);  %Set initial probabilities all equal

%********************************************************************
%
%  Now we're ready for the main loop
%********************************************************************
for j=1:N
    %STEP ONE: SELECT AN ACTION (cQ) by sampling from the probabilities p
    P=cumsum([0 p]);
    P(end)=1;         %Guard against floating-point roundoff in cumsum
    t=rand;           %Random number between 0 and 1
    n1=histc(t,P);    %Locate t among the bin edges P
    cQ=find(n1==1);   %This is the action we select.
    cR=randn+Aq(cQ);  %This is our reward: true mean plus unit-variance noise.

    %STEP TWO: STORAGE AND UPDATES:
    R(j)=cR;
    As(j)=cQ;
    ActNum(cQ)=ActNum(cQ)+1;
    ActVal(cQ)=ActVal(cQ)+cR;
    Q(cQ)=ActVal(cQ)/ActNum(cQ);  %Sample-average reward estimate

    %Increase the probability of the greedy action(s), i.e. those with
    %  the best reward estimate, and decrease the others, step size beta:
    idx=find(Q==max(Q));
    m=length(idx);    %Count ties for the maximum
    p=(1-beta)*p;
    p(idx)=p(idx)+(beta/m);
end
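
%********************************************************************
% Usage sketch (not part of the original routine): one hedged way to
%  drive banditP on a 10-armed testbed. The arm count, trial count,
%  beta value, and plotting calls below are illustrative assumptions,
%  not prescribed by the routine itself.
%
%    Aq = randn(1,10);                 %true mean rewards, one per arm
%    [As,Q,R,p] = banditP(2000,Aq,0.01);
%    plot(cumsum(R)./(1:2000)');       %running average reward per trial
%    xlabel('Trial'); ylabel('Average reward');
%********************************************************************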