function [As,Q,R,p]=banditS(N,Aq,tau)
%FUNCTION [As,Q,R,p]=banditS(N,Aq,tau)
% Performs the N-armed bandit example using the softmax strategy with fixed tau
% Inputs:
%    N   = total number of trials
%    Aq  = actual rewards for each bandit (these are the mean rewards)
%    tau = temperature for the softmax update
% Outputs:
%    As  = action selected on trial j, j=1:N
%    Q   = current reward estimates
%    R   = reward on action j, j=1:N
%    p   = probability estimates (use P for cumulative)

numbandits=length(Aq);           %Number of bandits
ActNum=zeros(numbandits,1);      %Running count of the number of times
                                 %each action is selected
ActVal=zeros(numbandits,1);      %Running sum of the total reward
                                 %obtained for each action
Q=zeros(1,numbandits);           %Current reward estimates
As=zeros(N,1);                   %Storage for actions
R=zeros(N,1);                    %Storage for rewards (for averaging; see Figure 2.1, p 28)
p=(1/numbandits)*ones(1,numbandits);  %Set initial probabilities all equal

%********************************************************************
%
% Now we're ready for the main loop
%********************************************************************
for j=1:N
    %STEP ONE: SELECT AN ACTION (cQ) using the probabilities p
    P=cumsum([0 p]);     %Cumulative distribution over actions
    t=rand;              %Uniform random number in [0,1)
    n1=histc(t,P);       %Find which bin of P the draw falls in
    cQ=find(n1==1);      %This is the action we select
    cR=randn+Aq(cQ);     %Reward: mean Aq(cQ) plus unit-variance Gaussian noise

    %STEP TWO: STORAGE AND UPDATES
    R(j)=cR;
    As(j)=cQ;
    ActNum(cQ)=ActNum(cQ)+1;
    ActVal(cQ)=ActVal(cQ)+cR;
    Q(cQ)=ActVal(cQ)/ActNum(cQ);  %Sample-average estimate of this action's value
    p=exp(Q./tau);                %This is the softmax (Gibbs) update
    p=p./sum(p);                  %Normalize to a probability distribution
end
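%--------------------------------------------------------------------
%Example usage (a sketch; the testbed values below are illustrative
%and not part of the original function):
%   Aq = randn(1,10);                  %True mean rewards, 10-armed testbed
%   [As,Q,R,p] = banditS(1000,Aq,0.1); %1000 trials, temperature tau=0.1
%   plot(cumsum(R)./(1:1000)');        %Average reward vs. trial number
%--------------------------------------------------------------------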