function [As,Q,R]=banditE(N,Aq,E)
%FUNCTION [As,Q,R]=banditE(N,Aq,E)
% Performs the N-armed bandit example
% Inputs:
%    N  = total number of trials
%    Aq = actual rewards for each bandit (these are the mean rewards)
%    E  = epsilon for the epsilon-greedy algorithm, 0 <= E < 1
% Outputs:
%    As = action selected on trial j, j=1:N
%    Q  = final estimates of the rewards, 1 x numbandits
%    R  = reward received on trial j, j=1:N

if E>=1 || E<0
   error('The epsilon should be between 0 and 1');
end

numbandits=length(Aq);        % Number of bandits
ActNum=zeros(numbandits,1);   % Running count of the number of times
                              % each action is selected
ActVal=zeros(numbandits,1);   % Running sum of the total reward
                              % obtained for each action
Q=zeros(1,numbandits);        % Current reward estimates
As=zeros(N,1);                % Storage for the action chosen on each trial
%Eq=zeros(N,numbandits);      % Keeps track of reward estimates; debugging
R=zeros(N,1);                 % Storage for the reward on each trial
                              % (for averaging; see Figure 2.1, p. 28)

%*********************************************************************
% Set up a flag so we know when to choose at random (using epsilon)
%*********************************************************************
greedy=zeros(1,N);
if E>0
   m=round(E*N);              % Total number of times we choose at random
   greedy(1:m)=ones(1,m);
   m=randperm(N);
   greedy=greedy(m);          % Randomly scatter the exploration trials
   clear m
end

%********************************************************************
% Main loop
%********************************************************************
for j=1:N
   % STEP ONE: Select an action (cQ), then get the reward (cR)
   if greedy(j)>0
      % Explore: choose a bandit uniformly at random
      cQ=ceil(rand*numbandits);
   else
      % Exploit: choose (at random, in case of ties) among the bandits
      % with the current maximal estimate
      idx=find(Q==max(Q));
      m=ceil(rand*length(idx));
      cQ=idx(m);
   end
   cR=randn+Aq(cQ);           % Reward is normal with mean Aq(cQ), variance 1
   R(j)=cR;

   % STEP TWO: Update the estimates for the next pass
   As(j)=cQ;
   ActNum(cQ)=ActNum(cQ)+1;
   ActVal(cQ)=ActVal(cQ)+cR;
   Q(cQ)=ActVal(cQ)/ActNum(cQ);   % Sample-average estimate of the reward
   % Eq(j,:)=Q;  Only for debugging (set output to Eq instead of Q)
end
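
%=====================================================================
% Example usage (a sketch, not part of the original function): run a
% 10-armed testbed and plot the running average reward, as in the
% experiment the R output is designed for. Save this file as
% banditE.m, then from the command line or a script:
%
%   Aq = randn(1,10);                % mean reward of each of 10 bandits
%   N  = 1000;                       % number of trials
%   [As,Q,R] = banditE(N, Aq, 0.1);  % epsilon = 0.1
%   plot(cumsum(R)./(1:N)');         % running average of the reward
%
% With E = 0 the function is purely greedy; comparing E = 0, 0.01,
% and 0.1 on the same Aq reproduces the usual exploration tradeoff.
%=====================================================================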