| Markov decision processes | |
|---|---|
| Name | Markov decision processes |
| Key figures | John von Neumann; Andrey Markov; Richard Bellman |
| Introduced | 1950s |
| Related concepts | Bellman equation; Dynamic programming; Reinforcement learning |
Markov decision processes are mathematical models for sequential decision making in which outcomes are partly random and partly under the control of a decision maker. Originating in early 20th-century work on stochastic processes by Andrey Markov and formalized as an optimization framework by Richard Bellman in the 1950s, these processes underpin methods in computational decision theory, control theory, and artificial intelligence. They provide the foundation for algorithmic frameworks such as Value iteration, Policy iteration, and modern reinforcement learning algorithms used in domains from operations research to robotics.
A typical formalization specifies a tuple (S, A, P, R, γ) with a state space S, an action set A, transition probabilities P, a reward function R, and a discount factor γ; the formulation builds on the chain models of Andrey Markov and the optimization principles of Richard Bellman. The state space S may be finite, countable, or continuous, and actions A can be discrete, as in the matrix games studied by John von Neumann, or continuous, as in the control problems associated with Norbert Wiener. Transition dynamics P(s'|s,a) express stochastic evolution, while R(s,a) or R(s,a,s') encodes immediate returns; the discounted-sum criterion with γ∈[0,1) links to convergence analyses developed in Richard Bellman's dynamic programming and in subsequent work by David Blackwell and Lloyd Shapley. Formal definitions accommodate finite S and A, as in classical treatments published in venues such as Operations Research, and extend to the measurable (Borel) state and action spaces treated in the measure-theoretic dynamic programming literature.
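To make the tuple concrete, here is a minimal sketch in Python of a hypothetical two-state, two-action MDP; the state names, probabilities, and rewards are invented for illustration and do not come from any source in this article.

```python
# A tiny finite MDP (S, A, P, R, gamma) as plain Python data structures.
# All states, actions, probabilities, and rewards below are illustrative.

S = ["low", "high"]          # state space
A = ["wait", "work"]         # action set
gamma = 0.9                  # discount factor in [0, 1)

# Transition probabilities P(s' | s, a), stored as {(s, a): {s': prob}}.
P = {
    ("low", "wait"):  {"low": 1.0},
    ("low", "work"):  {"low": 0.3, "high": 0.7},
    ("high", "wait"): {"high": 0.6, "low": 0.4},
    ("high", "work"): {"high": 1.0},
}

# Immediate rewards R(s, a).
R = {
    ("low", "wait"): 0.0,  ("low", "work"): -1.0,
    ("high", "wait"): 2.0, ("high", "work"): 1.0,
}

def bellman_backup(V, s, a):
    """One-step lookahead: R(s,a) + gamma * sum_s' P(s'|s,a) * V(s')."""
    return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
```

The `bellman_backup` helper implements the one-step lookahead that the algorithmic sketches below reuse.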
Optimality criteria center on policies π mapping states to actions or to distributions over actions, evaluated through value functions V^π and action-value functions Q^π; these concepts descend from the stochastic-game solution ideas of Lloyd Shapley and the optimization techniques advanced by Richard Bellman. Key dynamic programming algorithms include Value iteration, Policy iteration, and linear programming formulations of the kind long presented at INFORMS venues; value-based iterative schemes trace their theoretical roots to Richard Bellman, with complexity analyses by Christos Papadimitriou and John Tsitsiklis. Model-free and model-based approaches diverge: model-based planning uses explicit transition models, as in early Operations Research implementations, whereas model-free methods such as temporal-difference learning and Q-learning underpin contemporary Reinforcement learning, including work at institutions such as DeepMind and OpenAI. Convergence guarantees for these algorithms typically invoke contraction mappings and the fixed-point theorem of Stefan Banach, applied to discounted dynamic programming in analyses by David Blackwell.
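As a sketch of the value-based schemes just described, the routine below runs value iteration on the illustrative MDP defined above; because each backup is a γ-contraction in the max norm, the loop converges geometrically, which is the Banach fixed-point argument mentioned in the text.

```python
def value_iteration(tol=1e-8):
    """Repeat V(s) <- max_a [R(s,a) + gamma * sum P(s'|s,a) V(s')] until the
    max-norm change falls below tol; discounting makes this a contraction."""
    V = {s: 0.0 for s in S}
    while True:
        V_new = {s: max(bellman_backup(V, s, a) for a in A) for s in S}
        delta = max(abs(V_new[s] - V[s]) for s in S)
        V = V_new
        if delta < tol:
            break
    # Greedy policy extraction: pick the action maximizing the lookahead.
    pi = {s: max(A, key=lambda a: bellman_backup(V, s, a)) for s in S}
    return V, pi

V_star, pi_star = value_iteration()
```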
Fundamental properties include the existence of optimal stationary policies and the uniqueness of the optimal value function under standard assumptions, together with the Bellman optimality principle originally articulated by Richard Bellman. Structural results identify conditions under which deterministic optimal policies exist in finite-horizon and discounted settings, with ties to the minimax theorems of John von Neumann and equilibrium concepts associated with John Nash. Complexity results classify exact solution of finite-state discounted problems as polynomial-time via linear programming and dynamic programming, while partially observable and other extended formulations exhibit PSPACE- or EXPTIME-hardness in work published in the SIAM Journal on Computing and presented at the ACM Symposium on Theory of Computing. Probabilistic analyses draw on the martingale theory developed by Joseph Doob and ergodic theorems connected to Andrey Kolmogorov, yielding long-run average-reward characterizations and recurrence classifications studied in the stochastic-process literature.
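The linear-programming formulation can likewise be sketched, here with SciPy's `linprog` (an assumed tooling choice, not one named in the text): the primal LP minimizes Σ_s V(s) subject to V(s) ≥ R(s,a) + γ Σ_s' P(s'|s,a) V(s') for every state-action pair, and its optimum is the optimal value function of the illustrative MDP above.

```python
import numpy as np
from scipy.optimize import linprog

# Primal LP for a finite discounted MDP, reusing the illustrative
# S, A, P, R, gamma defined earlier.
idx = {s: i for i, s in enumerate(S)}
n = len(S)
A_ub, b_ub = [], []
for s in S:
    for a in A:
        # Constraint V(s) >= R(s,a) + gamma * sum P(s'|s,a) V(s'),
        # rewritten for linprog as: gamma * P(.|s,a) @ V - V(s) <= -R(s,a).
        row = np.zeros(n)
        for s2, p in P[(s, a)].items():
            row[idx[s2]] += gamma * p
        row[idx[s]] -= 1.0
        A_ub.append(row)
        b_ub.append(-R[(s, a)])

res = linprog(c=np.ones(n), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n, method="highs")
V_lp = dict(zip(S, res.x))  # agrees with value iteration up to tolerance
```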
Numerous generalizations adapt the core formalism: partially observable variants incorporate observation models, as in frameworks studied by Karl Johan Åström and by Richard Smallwood and Edward Sondik and applied in aerospace research sponsored by organizations such as NASA; constrained formulations add risk or resource limits, akin to approaches in Operations Research presented at INFORMS venues; continuous-time and semi-Markov analogues, treated by Ronald A. Howard, connect to stochastic control problems and to the stochastic calculus advanced by Kiyoshi Itô. Multi-agent extensions give rise to stochastic games and decentralized decision processes, building on the game-theoretic formulations of Lloyd Shapley and subsequent work by Michael Littman; hierarchical and options-based formulations draw on the options framework of Richard Sutton, Doina Precup, and Satinder Singh and on hierarchical planning research at MIT and Carnegie Mellon University. Bayesian and robust variants incorporate uncertainty over model parameters, as in research disseminated at NeurIPS and ICML, and risk-sensitive criteria align with the utility-theoretic analyses of John von Neumann and Oskar Morgenstern.
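For the partially observable variants, the central computational step is the Bayes-filter belief update b'(s') ∝ O(o|s',a) Σ_s P(s'|s,a) b(s); the sketch below assumes a hypothetical observation model `O`, keyed analogously to the transition model above, and reuses `P` and `S`.

```python
def belief_update(b, a, o, O):
    """Bayes-filter belief update for a POMDP:
    b'(s') is proportional to O(o | s', a) * sum_s P(s' | s, a) * b(s).
    `b` maps states to probabilities; `O` is a hypothetical observation
    model of the form {(s', a): {o: prob}}."""
    b_new = {}
    for s2 in S:
        predicted = sum(P[(s, a)].get(s2, 0.0) * b[s] for s in S)
        b_new[s2] = O[(s2, a)].get(o, 0.0) * predicted
    norm = sum(b_new.values())
    if norm == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s2: v / norm for s2, v in b_new.items()}
```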
Applications span inventory and queueing control studied in Operations Research, asset allocation in finance examined at institutions such as the London School of Economics and Harvard University, control of robotic systems explored by labs at MIT and Stanford University, and sequential clinical decision strategies evaluated in trials overseen by bodies such as the World Health Organization. Examples include elevator scheduling addressed in industrial projects at General Electric, autonomous-vehicle planning explored by research teams at Toyota Research Institute and Waymo, and game-playing agents developed by DeepMind and OpenAI. The subject is taught in courses at the Massachusetts Institute of Technology and Carnegie Mellon University, and software ecosystems implementing these algorithms include community-maintained GitHub libraries and tools from groups such as Google Research.