You do not tell the agent how to do the task; instead you specify a reward function that tells it when it is doing well.
Return
Add a discount factor
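
As a rough illustration of the two notes above: with a discount factor, the return is commonly written as $R_1 + \gamma R_2 + \gamma^2 R_3 + \dots$, so later rewards count for less. The symbol $\gamma$, the function name `discounted_return`, and the reward sequence below are assumptions made for this sketch, not part of the original notes.

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, weighting the reward at step t by gamma**t."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


# Hypothetical reward sequence: nothing for three steps, then 100.
rewards = [0, 0, 0, 100]
print(discounted_return(rewards, gamma=0.5))  # 12.5
```

With `gamma=1.0` the function reduces to a plain sum, which is one way to see what the discount factor adds: it makes a reward that arrives sooner worth more than the same reward arriving later.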


Formalizing reinforcement learning
A policy is a function $\pi(s) = a$ that maps states to actions: it tells you what action $a$ to take in a given state $s$.
Goal: find a policy $\pi$ that tells you what action $a = \pi(s)$ to take in every state $s$ so as to maximize the return.
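
A minimal sketch of what "a policy that maximizes the return" can look like in code. Everything here is a made-up toy: a 6-state line world with terminal states at both ends, the reward values, the discount factor of 0.5, and the names `step` and `rollout_return` are all assumptions for illustration. A policy is just a dictionary from state to action, and two candidate policies are compared by the discounted return they collect.

```python
LEFT, RIGHT = "left", "right"

# Hypothetical line world: states 0..5, with 0 and 5 terminal.
REWARDS = {0: 100, 1: 0, 2: 0, 3: 0, 4: 0, 5: 40}
TERMINAL = {0, 5}


def step(state, action):
    """Move one cell left or right."""
    return state - 1 if action == LEFT else state + 1


def rollout_return(policy, start, gamma, max_steps=100):
    """Follow policy from start and return the discounted sum of rewards."""
    state, total, discount = start, 0.0, 1.0
    for _ in range(max_steps):  # guard against policies that never terminate
        total += discount * REWARDS[state]
        if state in TERMINAL:
            break
        state = step(state, policy[state])  # a = pi(s)
        discount *= gamma
    return total


# Two candidate policies, each mapping every non-terminal state to an action.
always_left = {s: LEFT for s in range(1, 5)}
always_right = {s: RIGHT for s in range(1, 5)}

print(rollout_return(always_left, start=4, gamma=0.5))   # 6.25
print(rollout_return(always_right, start=4, gamma=0.5))  # 20.0
```

From state 4, going right collects the smaller reward sooner and ends up with the higher discounted return, so a good policy in this toy world would choose `RIGHT` there even though the reward on the left is larger.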
