Soft Actor-Critic

$$$$Soft Actor-Critic

$$$$def expectation(l):

N=10000

return mean(l() for _ in range(N))

V(s_t) = expectation(lambda: a_t = sample(pi); (Q(s_t, a_t) -

alpha * log(pi(a_t / s_t))))

Soft Actor-Critic

$$$$def expectation(l):

N=10000

return mean(l() for _ in range(N))

T**pi * Q(s_t, a_t) = r(s_t, a_t) + gamma * expectation(lambda:

s_(t + 1) = sample(p(s_t, a_t)); V(s_(t + 1)))

Soft Actor-Critic

$$$$Soft Actor-Critic

$$$$def expectation(l):

N=10000

return mean(l() for _ in range(N))

V(s_t) = expectation(lambda: a_t = sample(pi); (Q(s_t, a_t) -

alpha * log(pi(a_t / s_t))))

Soft Actor-Critic

$$$$def expectation(l):

N=10000

return mean(l() for _ in range(N))

T**pi * Q(s_t, a_t) = r(s_t, a_t) + gamma * expectation(lambda:

s_(t + 1) = sample(p(s_t, a_t)); V(s_(t + 1)))