Model-free (reinforcement learning)

Class of reinforcement learning algorithm

In reinforcement learning (RL), a model-free algorithm is an algorithm which does not estimate the transition probability distribution (and the reward function) associated with the Markov decision process (MDP), which, in RL, represents the problem to be solved. The transition probability distribution (or transition model) and the reward function are often collectively called the "model" of the environment (or MDP), hence the name "model-free". A model-free RL algorithm can be thought of as an "explicit" trial-and-error algorithm. Typical examples of model-free algorithms include Monte Carlo (MC) RL, SARSA, and Q-learning.
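
As a minimal illustration of this idea, the sketch below implements tabular Q-learning using only sampled transitions; it never constructs or queries transition probabilities or a reward function. The Gym-style environment interface (reset, step, actions) and the hyperparameter values are assumptions made for the example, not part of any particular library.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: updates Q(s, a) from sampled transitions only.

    `env` is assumed to expose reset() -> state,
    step(action) -> (next_state, reward, done), and
    actions(state) -> list of legal actions (a Gym-like interface).
    """
    Q = defaultdict(float)  # Q[(state, action)], defaults to 0.0

    def greedy(state, actions):
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            actions = env.actions(state)
            # Epsilon-greedy behaviour policy: explore with probability epsilon.
            a = random.choice(actions) if random.random() < epsilon else greedy(state, actions)
            next_state, reward, done = env.step(a)
            # Model-free update: the target is built from one sampled transition,
            # never from p(s'|s, a) or a known reward function.
            target = reward + (0.0 if done else
                               gamma * max(Q[(next_state, b)] for b in env.actions(next_state)))
            Q[(state, a)] += alpha * (target - Q[(state, a)])
            state = next_state
    return Q
```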

Monte Carlo estimation is a central component of many model-free RL algorithms. MC learning is an instance of generalized policy iteration, which alternates between two steps: policy evaluation (PEV) and policy improvement (PIM). In this framework, each policy is first evaluated by estimating its corresponding value function; then, based on that evaluation, a greedy search is performed to produce a better policy. MC estimation is mainly applied to the first step, policy evaluation. The simplest way to judge the effectiveness of the current policy is to average the returns of all collected samples. As more experience is accumulated, the estimate converges to the true value by the law of large numbers. Hence, MC policy evaluation does not require any prior knowledge of the environment dynamics; only experience is needed (i.e., samples of states, actions, and rewards), generated by interacting with the environment, which may be real or simulated.
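
A minimal sketch of first-visit MC policy evaluation under these assumptions: each episode is a list of (state, reward) pairs obtained by running the policy being evaluated, and the value of a state is estimated as the average return observed after its first visit in each episode.

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=0.99):
    """First-visit Monte Carlo prediction.

    `episodes` is assumed to be an iterable of trajectories, each a list of
    (state, reward) pairs collected by running the policy being evaluated.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)

    for episode in episodes:
        # Discounted return following each time step, computed backwards.
        G = 0.0
        returns_after = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            _, reward = episode[t]
            G = reward + gamma * G
            returns_after[t] = G
        # First-visit rule: only the first occurrence of a state contributes.
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state not in seen:
                seen.add(state)
                returns_sum[state] += returns_after[t]
                returns_count[state] += 1

    # The sample average converges to v_pi(s) by the law of large numbers.
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```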

Value function estimation is crucial for model-free RL algorithms. Unlike MC methods, temporal-difference (TD) methods learn the value function by bootstrapping from existing value estimates. TD learning can learn from an incomplete sequence of events without waiting for the final outcome, and it approximates the future return as a function of the current state. Like MC, TD uses only experience to estimate the value function, without requiring any prior knowledge of the environment dynamics. The advantage of TD is that it can update the value function based on its current estimate: TD learning algorithms can therefore learn from incomplete episodes or continuing tasks in a step-by-step manner, while MC must proceed episode by episode.
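
For comparison, a minimal TD(0) prediction sketch, assuming the same Gym-style interface as above: the value table is updated after every single transition by bootstrapping from the current estimate of the next state, rather than waiting for the episode to end.

```python
from collections import defaultdict

def td0_policy_evaluation(env, policy, episodes=500, alpha=0.05, gamma=0.99):
    """TD(0) prediction: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).

    `env` is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done); `policy(state)` returns an action.
    """
    V = defaultdict(float)
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            next_state, reward, done = env.step(policy(state))
            # Bootstrapped target: uses the current estimate V(next_state),
            # so learning happens step by step, not episode by episode.
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```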

Model-free reinforcement learning algorithms

Model-free RL algorithms can start from a blank (e.g., randomly initialized) policy and achieve superhuman performance in many complex tasks, including Atari games, StarCraft, and Go. Deep neural networks have driven many recent breakthroughs in artificial intelligence, and when combined with RL they yield superhuman agents such as Google DeepMind's AlphaGo. Mainstream model-free RL algorithms include Deep Q-Network (DQN), Dueling DQN, Double DQN (DDQN), Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), Asynchronous Advantage Actor-Critic (A3C), Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), Soft Actor-Critic (SAC), and Distributional Soft Actor-Critic (DSAC). Some model-free (deep) RL algorithms are listed in the table below:

Algorithm | Description | Policy | Action space | State space | Operator
DQN | Deep Q-Network | Off-policy | Discrete | Typically discrete or continuous | Q-value
DDPG | Deep Deterministic Policy Gradient | Off-policy | Continuous | Discrete or continuous | Q-value
A3C | Asynchronous Advantage Actor-Critic | On-policy | Continuous | Discrete or continuous | Advantage
TRPO | Trust Region Policy Optimization | On-policy | Continuous or discrete | Discrete or continuous | Advantage
PPO | Proximal Policy Optimization | On-policy | Continuous or discrete | Discrete or continuous | Advantage
TD3 | Twin Delayed Deep Deterministic Policy Gradient | Off-policy | Continuous | Continuous | Q-value
SAC | Soft Actor-Critic | Off-policy | Continuous | Discrete or continuous | Advantage
DSAC | Distributional Soft Actor-Critic | Off-policy | Continuous | Continuous | Value distribution
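
As an illustration of how the deep variants apply the same model-free idea, the sketch below computes a DQN-style temporal-difference loss for a minibatch of sampled transitions in PyTorch. The network architecture, batch format, and use of a mean-squared-error loss are simplifying assumptions for the example, not the exact DQN recipe.

```python
import torch
import torch.nn as nn

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """One DQN-style TD loss on a minibatch of sampled transitions.

    `batch` is assumed to be a dict of tensors with keys
    'states', 'actions', 'rewards', 'next_states', 'dones'.
    """
    states, actions = batch["states"], batch["actions"]    # (B, obs_dim), (B,)
    rewards, dones = batch["rewards"], batch["dones"]      # (B,), (B,)
    next_states = batch["next_states"]                     # (B, obs_dim)

    # Q(s, a) for the actions actually taken.
    q_sa = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)

    # Bootstrapped target from a frozen target network (no gradient flows back).
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * next_q

    return nn.functional.mse_loss(q_sa, target)


# Illustrative setup with a small MLP Q-network (assumed problem sizes).
obs_dim, n_actions = 4, 2
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())
```

In a full DQN agent this loss is paired with an experience replay buffer, an epsilon-greedy behaviour policy, and periodic copying of the online network's weights into the target network.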

References

  1. Sutton, Richard S.; Barto, Andrew G. (2018). Reinforcement Learning: An Introduction (2nd ed.). A Bradford Book. p. 552. ISBN 0262039249. Retrieved 18 February 2019.
  2. Li, Shengbo Eben (2023). Reinforcement Learning for Sequential Decision and Optimal Control (1st ed.). Singapore: Springer Verlag. pp. 1–460. doi:10.1007/978-981-19-7784-8. ISBN 978-9-811-97783-1. S2CID 257928563.
  3. Duan, J.; Guan, Y.; Li, S. (2021). "Distributional Soft Actor-Critic: Off-policy reinforcement learning for addressing value estimation errors". IEEE Transactions on Neural Networks and Learning Systems. 33 (11): 6584–6598. arXiv:2001.02811. doi:10.1109/TNNLS.2021.3082568. PMID 34101599. S2CID 211259373.