CS 159: Advanced Topics in Machine Learning (Spring 2016)

Course Description

This course will cover a mixture of the following topics:

Online Learning
Multi-Armed Bandits
Active Learning
Human-in-the-Loop Learning
Reinforcement Learning

Course Details

Lectures on Tu/Th at 2:30pm-4pm in Steele 102
This is a paper reading course, where we read and discuss research papers in class
Student participation is required, including presenting papers in class (20% of total grade)
Mini-quiz on papers for every lecture, given after lecture/discussion (10% of total grade)
Final project that explores some topic covered in class (70% of total grade)
Piazza Forum: link

Instructor

Yisong Yue yyue@caltech.edu

Teaching Assistants

Stephan Zheng stephan@caltech.edu
Hoang Le hmle@caltech.edu

Office Hours

Datasets & Testbeds

May be interesting for final project

Helicopter Reinforcement Learning Simulator
Contextual Bandit Simulator
RoboCup Simulator
DeepMind Atari simulator
- GitHub Version
- Atari ROMs (easy to search for more)
Go-Playing Programs
StarCraft Broodwar API
OpenAI Gym

Presentation Schedule

Note: schedule is subject to change.

Date	Papers	Presenters		Materials
3/29/2016	Introduction & Administrivia Follow the Leader Algorithm & Perceptron	Yisong Yue	[slides]	Perceptron Mistake Bounds (Sections 1 & 2) Online Learning (Chapter 1)
3/31/2016	Online Learning with Experts & Multiplicative Weights Algorithm	Stephan Zheng	[slides]	The Multiplicative Weights Update Method: A Meta-Algorithm and Applications (Section 1.1, Section 2.0 & Section 2.1)
4/5/2016	Online Convex Optimization	Ellen Feldman, Gautam Goel, Milan Cvitkovic Mentor: Yisong	[slides]	Online Learning and Online Convex Optimization (primarily Section 2.4, although you may need to read beginning of Section 2 for notation)
4/7/2016	Multi-armed Bandits & UCB1 Algorithm	Connor Lee, Ritvik Mishra, Hoang Le Mentor: Hoang	[slides]	Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems (Section 1) Finite-time Analysis of the Multiarmed Bandit Problem (Primarily UCB1 Algorithm & Theorem 1)
4/12/2016	Linear Bandits & Applications	Feng Bi, Joon Sik Kim, Leiya Ma, Pengchuan Zhang Mentor: Yisong	[slides]	A Contextual-Bandit Approach to Personalized News Article Recommendation course notes Improved Algorithms for Linear Stochastic Bandits (Theorem 2 & Theorem 3)
4/14/2016	Monte Carlo Tree Search & Applications	Suraj Nair, Peter Kundzicz, Vansh Kumar, Kevin An Mentor: Stephan	[slides]	A Survey of Monte Carlo Tree Search Methods (Chapter 3, although may need parts of Chapter 2 for background) Mastering the game of Go with deep neural networks and tree search (focus on the application of tree search, not the details of deep learning)
4/19/2016	Q-Learning for Reinforcement Learning & Applications	Timothy Chou, Charlie Tong, Vincent Zhuang Mentor: Stephan	[slides]	course notes Playing Atari with Deep Reinforcement Learning (focus on the application of Q-learning and epsilon-greedy exploration, not the details of deep learning) Convergence of Stochastic Iterative Dynamic Programming Algorithms
4/21/2016	Apprenticeship Learning for Reinforcement Learning & Applications	Nick Haliday, Audrey Huang, Ritwik Anand, Dryden Bouamalay Mentor: Hoang	[slides]	course notes An Application of Reinforcement Learning to Aerobatic Helicopter Flight Exploration and Apprenticeship Learning in Reinforcement Learning (theory reference) Apprenticeship Learning via Inverse Reinforcement Learning (theory reference)
4/26/2016	Imitation Learning	Richard Zhu, Andrew Kang Mentor: Hoang	[slides]	A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning
4/28/2016	Active Learning for Supervised Learning	Daniel Gu, Matthew Morgan, Keegan Ryan, Matthew Clark Mentor: Hoang	[slides]	course notes overviewing active learning Importance Weighted Active Learning
5/3/2016	Active Learning for Decision Making	Joe Marino, Grant Van Horn, Alvita Tran, Remy Yang Mentor: Yisong	[slides]	Near Optimal Bayesian Active Learning for Decision Making Jupyter Python Demo
5/5/2016	Crowdsourcing	Madhav Mohandas, Vincent Zhuang, Richard Zhu Mentor: Yisong	[slides]	Optimistic Knowledge Gradient Policy for Optimal Budget Allocation in Crowdsourcing [appendix][journal version]
5/10/2016	Machine Teaching	Justin Leong, Kevin Tang, Zilong Chen, Kaikai Sheng Mentor: Yisong	[slides]	How Do Humans Teach: On Curriculum Learning and Teaching Dimension Machine Teaching: An Inverse Problem to Machine Learning and an Approach Toward Optimal Education (supplemental application paper) Using Machine Teaching to Identify Optimal Training-Set Attacks on Machine Learners (supplemental application paper) Becoming the Expert - Interactive Multi-Class Machine Teaching
5/12/2016	Machine Teaching for Crowdsourcing	Nancy Cao, Andrew Chico, Betsy Fu, Daniel Wang Mentor: Yisong	[slides]	Near-Optimally Teaching the Crowd to Classify
5/17/2016	Modeling Human Decision Making	Zachary Fein, Eric Gorlin, Emily Mazo, Kc Emezie Mentor: Hoang	[slides]	Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting
5/19/2016	Combinatorial Action Spaces & Adaptive Routing	Luciana Cendon, Tobias Bischoff, Jiyun Ivy Xiao, Brennan Young Mentor: Yisong	[slides]	Adaptive Collective Routing Using Gaussian Process Dynamic Congestion Models [journal version]
5/24/2016	Dueling Bandits	Fabian Boemer, Kushal Agarwal, Jialin Song, Aman Agarwal Mentor: Yisong	[slides]	The K-armed Dueling Bandits Problem (no need to read the theoretical analysis in detail) How Does Clickthrough Data Reflect Retrieval Quality? (primarily Section 5) [journal version]
5/26/2016	Coactive Learning	Rohan Batra, Avishek Dutta, Nand Kishore, Siddarth Murching Mentor: Hoang	[slides]	Online Structured Prediction via Coactive Learning [journal version] Learning Trajectory Preferences for Manipulators via Iterative Improvement
5/31/2016	Bayesian Optimization	Dimitar Ho, Danni Ma Mentor: Stephan	[slides]	Practical Bayesian Optimization of Machine Learning Algorithms
6/2/2016	Off-Policy Evaluation	Miguel Aroca-Ouellete, Akshta Athawale, Mannat Singh Mentor: Hoang	[slides]	Exploration Scavenging

Reading List

Presentation Signup Sheet

(Course Notes on Online Learning) Online Learning, by Gabor Bartok, David Pal, Csaba Szepesvari, and Istvan Szita.
(Perceptron mistake bound) Perceptron Mistake Bounds, by Mehryar Mohri and Afshin Rostamizadeh. CoRR abs/1305.0208, 2013.
(Survey Paper on Online Learning) Online Learning and Online Convex Optimization, by Shai Shalev-Shwartz. Foundations and Trends in Machine Learning, 4(2), 107-194, 2011.
(Multi-armed Bandits, UCB1 algorithm) Finite-time Analysis of the Multiarmed Bandit Problem, by Peter Auer, Nicolo Cesa-Bianchi, Paul Fischer. Machine Learning, 47, 235-256, 2002.

(Survey Paper on Multi-armed Bandits) Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, by Sebastien Bubeck and Nicolo Cesa-Bianchi.

(Multi-armed Bandits, Personalized Recommender Systems) A Contextual-Bandit Approach to Personalized News Article Recommendation, by Lihong Li, Wei Chu, John Langford, and Robert Schapire. International World Wide Web Conference, 2010.

(LinUCB algorithm) Improved Algorithms for Linear Stochastic Bandits, by Yasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari. Neural Information Processing Systems, 2011.

(Dueling Bandits) The K-armed Dueling Bandits Problem, by Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. Journal of Computer and System Sciences, DOI:10.1016/j.jcss.2011.12.028, 2012.
(Coactive Learning) Online Structured Prediction via Coactive Learning, by Pannaga Shivaswamy and Thorsten Joachims. International Conference on Machine Learning, 2012. [journal version]
(Active Learning) Near Optimal Bayesian Active Learning for Decision Making, by Shervin Javdani, Yuxin Chen, Amin Karbasi, Andreas Krause, Drew Bagnell, Siddhartha Srinivasa. International Conference on Artificial Intelligence and Statistics, 2014.
(Bayesian Optimization) Practical Bayesian Optimization of Machine Learning Algorithms, by Jasper Snoek, Hugo Larochelle, and Ryan Adams. Neural Information Processing Systems, 2012.
(Off-Policy Evaluation) Exploration Scavenging, by John Langford, Alexander Strehl, and Jenn Wortman Vaughan. International Conference on Machine Learning, 2008.
(Imitation Learning) A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning, by Stephane Ross, Geoff Gordon, and Drew Bagnell. International Conference on Artificial Intelligence and Statistics, 2011.
(Apprenticeship Learning, Applications to Helicopter Control) An Application of Reinforcement Learning to Aerobatic Helicopter Flight, by Pieter Abbeel, Adam Coates, Morgan Quigley, Andrew Ng. Neural Information Processing Systems, 2007.

(Theoretical Results) Exploration and Apprenticeship Learning in Reinforcement Learning, by Pieter Abbeel and Andrew Ng. International Conference on Machine Learning, 2005.

(Monte Carlo Tree Search) A Survey of Monte Carlo Tree Search Methods by Cameron Browne, Edward Powley, Daniel Whitehouse, Simon Lucas, Peter I. Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis and Simon Colton. IEEE Transactions on Computational Intelligence and AI in Games, 4(1), 2012.

(Applying Monte Carlo Tree Search to Go) Mastering the game of Go with deep neural networks and tree search, by David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Nature, 529, 484–489, doi:10.1038/nature16961, 2016.

(Machine Teaching) How Do Humans Teach: On Curriculum Learning and Teaching Dimension, by Faisal Khan, Xiaojin Zhu, and Bilge Mutlu. Neural Information Processing Systems, 2011.
(Crowdsourcing) Optimistic Knowledge Gradient Policy for Optimal Budget Allocation in Crowdsourcing, by Xi Chen, Qihang Lin, and Denny Zhou. International Conference on Machine Learning, 2013. [appendix][journal version]
(Teaching the Crowd) Near-Optimally Teaching the Crowd to Classify, by Adish Singla, Ilija Bogunovic, Gabor Bartok, Amin Karbasi, and Andreas Krause. International Conference on Machine Learning, 2014.
(Modeling Human Decision Making) Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting, by Shunan Zhang and Angela Yu. Neural Information Processing Systems, 2013.
(Combinatorial Action Spaces, Adaptive Routing) Adaptive Collective Routing Using Gaussian Process Dynamic Congestion Models, by Siyuan Liu, Yisong Yue, and Ramayya Krishnan. ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2013. [journal version]

Extended Reference Material (could be useful for picking final project)

Note: some papers belong to multiple categories.

Basic Online Learning

(survey paper) Online Learning and Online Convex Optimization, by Shai Shalev-Shwartz. Foundations and Trends in Machine Learning, 4(2), 107-194, 2011.
(course notes) Online Learning, by Gabor Bartok, David Pal, Csaba Szepesvari, and Istvan Szita.
(survey paper on Multiplicative Weights algorithm) The Multiplicative Weights Update Method: A Meta-Algorithm and Applications, by Sanjeev Arora, Elad Hazan, and Satyen Kale. Theory of Computing, 8, 121-164, 2012.
(Follow the Leader algorithm) Efficient algorithms for online decision problems, by Adam Kalai and Santosh Vempala. Journal of Computer and System Sciences, 71, 291-307, 2005.
(Online Convex Optimization) Online Convex Programming and Generalized Infinitesimal Gradient Ascent, by Martin Zinkevich. International Conference on Machine Learning, 2003.
(Perceptron mistake bound) Perceptron Mistake Bounds, by Mehryar Mohri and Afshin Rostamizadeh. CoRR abs/1305.0208, 2013.

Online Learning with Experts

The Weighted Majority Algorithm, by Nick Littlestone and Manfred Warmuth. Information and Computation, 108, 212-261, 1994.
A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting, by Yoav Freund and Robert Schapire. Journal of Computer and System Sciences, 55(1), 119-139, 1997.
A Parameter-free Hedging Algorithm, by Kamalika Chaudhuri, Yoav Freund, and Daniel Hsu. Neural Information Processing Systems, 2009.

More Papers on Full Information Online Learning

Logarithmic Regret Algorithms for Online Convex Optimization, by Elad Hazan, Adam Kalai, Satyen Kale, and Amit Agarwal. Machine Learning, 69(2-3), 169-192, 2007.
Mind the Duality Gap: Logarithmic regret algorithms for online optimization, by Sham Kakade and Shai Shalev-Shwartz. Neural Information Processing Systems, 2009.
Follow the Leader If You Can, Hedge If You Must, by Steven de Rooij, Tim van Erven, Peter Grunwald, and Wouter Koolen. Journal of Machine Learning Research, 15, 1281-1316, 2014.
Adaptive Online Gradient Descent, by Peter Bartlett, Elad Hazan, and Sasha Rakhlin. Neural Information Processing Systems, 2008.
(AdaGrad algorithm) Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, by John Duchi, Elad Hazan, and Yoram Singer. Journal of Machine Learning Research, 12, 2121-2159, 2011.
Online Convex Optimization using Predictions, by Niangjun Chen, Anish Agarwal, Adam Wierman, Siddharth Barman, and Lachlan Andrew. ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, 2015.

Basic Multi-Armed Bandits (Partial Information Online Learning)

(survey paper) Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, by Sebastien Bubeck and Nicolo Cesa-Bianchi.
(survey paper) Algorithms for the multi-armed bandit problem, by Volodymyr Kuleshov and Doina Precup.
(UCB1 algorithm) Finite-time Analysis of the Multiarmed Bandit Problem, by Peter Auer, Nicolo Cesa-Bianchi, Paul Fischer. Machine Learning, 47, 235-256, 2002.
(EXP3 algorithm) The non-stochastic multi-armed bandit problem, by Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert Schapire. SIAM Journal on Computing, 32(1), 48-77, 2002.
(Thompson Sampling algorithm) Analysis of Thompson Sampling for the Multi-armed Bandit Problem, by Shipra Agrawal and Navin Goyal. Conference on Learning Theory, 2012.
An Empirical Evaluation of Thompson Sampling, by Olivier Chapelle and Lihong Li. Neural Information Processing Systems, 2012.

Bandit Convex Optimization

Online convex optimization in the bandit setting: gradient descent without a gradient, by Abraham Flaxman, Adam Kalai, and Brendan McMahan. ACM-SIAM Symposium on Discrete Algorithms, 2005.
Optimal Algorithms for Online Convex Optimization with Multi-Point Bandit Feedback, by Alekh Agarwal, Ofer Dekel, and Lin Xiao. Conference on Learning Theory, 2010.
Interactively Optimizing Information Retrieval Systems as a Dueling Bandits Problem, by Yisong Yue and Thorsten Joachims. International Conference on Machine Learning, 2009.

Bandits with Dependent Arms

(LinUCB algorithm) Improved Algorithms for Linear Stochastic Bandits, by Yasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari. Neural Information Processing Systems, 2011.
(GP-UCB algorithm) Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design, by Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. International Conference on Machine Learning, 2010.
Parallelizing Exploration–Exploitation Tradeoffs with Gaussian Process Bandit Optimization, by Thomas Desautels, Andreas Krause, and Joel Burdick. International Conference on Machine Learning, 2012.

Pure Exploration in Multi-Armed Bandits

(Action Elimination Algorithm) Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems, by Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Journal of Machine Learning Research, 7, 1079-1105, 2006.
Pure Exploration for Multi-Armed Bandit Problems, by Sebastien Bubeck, Remi Munos, and Gilles Stoltz. Algorithmic Learning Theory, 2009.

Contextual Bandits

A Contextual-Bandit Approach to Personalized News Article Recommendation, by Lihong Li, Wei Chu, John Langford, and Robert Schapire. International World Wide Web Conference, 2010.
The Epoch-Greedy Algorithm for Contextual Multi-armed Bandits, by John Langford and Tong Zhang. Neural Information Processing Systems, 2007.
Thompson Sampling for Contextual Bandits with Linear Payoffs, by Shipra Agrawal and Navin Goyal. International Conference on Machine Learning, 2013.
Finite-Time Analysis of Kernelised Contextual Bandits, by Michal Valko, Nathan Korda, Remi Munos, Ilias Flaounas, and Nello Cristianini. Conference on Uncertainty in Artificial Intelligence, 2013.
Efficient Optimal Learning for Contextual Bandits, by Miro Dudik, Daniel Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang. Conference on Uncertainty in Artificial Intelligence, 2011.
Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits, by Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. International Conference on Machine Learning, 2014.

Bayesian Optimization

(survey paper) Taking the Human Out of the Loop: A Review of Bayesian Optimization, by Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan Adams, and Nando de Freitas. Proceedings of the IEEE, 104(1), 2016.
Practical Bayesian Optimization of Machine Learning Algorithms, by Jasper Snoek, Hugo Larochelle, and Ryan Adams. Neural Information Processing Systems, 2012.
Scalable Bayesian Optimization Using Deep Neural Networks, by Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Md. Mostafa Ali Patwary, Prabhat, Ryan Adams. International Conference on Machine Learning, 2015.
Bayesian Multi-Scale Optimistic Optimization, by Ziyu Wang, Babak Shakibi, Lin Jin, Nando de Freitas. International Conference on Artificial Intelligence and Statistics, 2014.

Online Learning in Combinatorial Action Spaces

Learning Diverse Rankings with Multi-Armed Bandits, by Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. International Conference on Machine Learning, 2008.
Linear Submodular Bandits and their Application to Diversified Retrieval, by Yisong Yue and Carlos Guestrin. Neural Information Processing Systems, 2011.
Non-Myopic Adaptive Route Planning in Uncertain Congestion Environments, by Siyuan Liu, Yisong Yue, and Ramayya Krishnan. IEEE Transactions on Knowledge and Data Engineering, DOI 10.1109/TKDE.2015.2411278, 2015.
Learning to Diversify from Implicit Feedback, by Karthik Raman, Pannaga Shivaswamy, and Thorsten Joachims. ACM Conference on Web Search and Data Mining, 2012.

Active Learning

(survey) Active Learning Literature Survey, by Burr Settles.
Analysis of perceptron-based active learning, by Sanjoy Dasgupta, Adam Kalai, and Claire Monteleoni. Learning Theory, 249-263, 2005.
Importance Weighted Active Learning, by Alina Beygelzimer, Sanjoy Dasgupta, John Langford, and Daniel Hsu. International Conference on Machine Learning, 2009.
Agnostic Active Learning Without Constraints, by Alina Beygelzimer, Daniel Hsu, John Langford, and Tong Zhang. Neural Information Processing Systems, 2010.
Efficient and Parsimonious Agnostic Active Learning, by Tzu-Kuo Huang, Alekh Agarwal, Daniel Hsu, John Langford, Robert Schapire. Neural Information Processing Systems, 2015.
Adaptive Submodularity: Theory and Applications in Active Learning and Stochastic Optimization, by Daniel Golovin and Andreas Krause. Journal of Artificial Intelligence Research, 42, 427-486, 2011.
Near Optimal Bayesian Active Learning for Decision Making, by Shervin Javdani, Yuxin Chen, Amin Karbasi, Andreas Krause, Drew Bagnell, Siddhartha Srinivasa. International Conference on Artificial Intelligence and Statistics, 2014.
Active Imitation Learning: Formal and Practical Reductions to I.I.D. Learning, by Kshitij Judah, Alan Fern, Tom Dietterich, Prasad Tadepalli. Journal of Machine Learning Research, 15, 4105-4143, 2014.
Adaptively Learning the Crowd Kernel, by Omer Tamuz, Ce Liu, Serge Belongie, Ohad Shamir, and Adam Kalai. International Conference on Machine Learning, 2011.
Active Learning by Querying Informative and Representative Examples, by Sheng-Jun Huang, Rong Jin, and Zhi-Hua Zhou. Neural Information Processing Systems, 2011.

Online Learning from Preference Feedback

Interactively Optimizing Information Retrieval Systems as a Dueling Bandits Problem, by Yisong Yue and Thorsten Joachims. International Conference on Machine Learning, 2009.
The K-armed Dueling Bandits Problem, by Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. Journal of Computer and System Sciences, DOI:10.1016/j.jcss.2011.12.028, 2012.
Coactive Learning, by Pannaga Shivaswamy and Thorsten Joachims. Journal of Artificial Intelligence Research, 53, 1-40, 2015.
Stable Coactive Learning via Perturbation, by Karthik Raman, Thorsten Joachims, Pannaga Shivaswamy, and Tobias Schnabel. International Conference on Machine Learning, 2013.
Learning to Diversify from Implicit Feedback, by Karthik Raman, Pannaga Shivaswamy, and Thorsten Joachims. ACM Conference on Web Search and Data Mining, 2012.
Learning Trajectory Preferences for Manipulators via Iterative Improvement, by Ashesh Jain, Brian Wojcik, Thorsten Joachims, and Ashutosh Saxena. Neural Information Processing Systems, 2013.

Reinforcement Learning and Imitation Learning

(survey paper) Bayesian Reinforcement Learning: A Survey, by Mohammad Ghavamzadeh, Shie Mannor, Joelle Pineau, and Aviv Tamar. Foundations and Trends in Machine Learning, 8(5-6), 359-483, 2015.
Near-Optimal Reinforcement Learning in Polynomial Time, by Michael Kearns and Satinder Singh. Machine Learning 49, 209-232, 2002.
A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning, by Stephane Ross, Geoff Gordon, and Drew Bagnell. International Conference on Artificial Intelligence and Statistics, 2011.
Apprenticeship Learning via Inverse Reinforcement Learning, by Pieter Abbeel and Andrew Ng. International Conference on Machine Learning, 2004.
Exploration and Apprenticeship Learning in Reinforcement Learning, by Pieter Abbeel and Andrew Ng. International Conference on Machine Learning, 2005.
An Application of Reinforcement Learning to Aerobatic Helicopter Flight, by Pieter Abbeel, Adam Coates, Morgan Quigley, Andrew Ng. Neural Information Processing Systems, 2007.
A Natural Policy Gradient, by Sham Kakade. Neural Information Processing Systems, 2002.
Guided Policy Search, by Sergey Levine and Vladlen Koltun. International Conference on Machine Learning, 2013.
Active Imitation Learning: Formal and Practical Reductions to I.I.D. Learning, by Kshitij Judah, Alan Fern, Tom Dietterich, Prasad Tadepalli. Journal of Machine Learning Research, 15, 4105-4143, 2014.
Playing Atari with Deep Reinforcement Learning, by Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. NIPS Deep Learning Workshop, 2013.
Safe Exploration in Markov Decision Processes, by Teodor Mihai Moldovan and Pieter Abbeel. International Conference on Machine Learning, 2012.
(Monte Carlo tree search) A Survey of Monte Carlo Tree Search Methods by Cameron Browne, Edward Powley, Daniel Whitehouse, Simon Lucas, Peter I. Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis and Simon Colton. IEEE Transactions on Computational Intelligence and AI in Games, 4(1), 2012.
Mastering the game of Go with deep neural networks and tree search, by David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Nature, 529, 484–489, doi:10.1038/nature16961, 2016.
R-max – A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning, by Ronen Brafman and Moshe Tennenholtz, Journal of Machine Learning Research, 3, 213-231, 2002.
Self-Optimizing Memory Controllers: A Reinforcement Learning Approach, by Engin Ipek, Onur Mutlu, Jose Martinez, and Rich Caruana. International Symposium on Computer Architecture, 2008.
Learning Online Smooth Predictors for Realtime Camera Planning using Recurrent Decision Trees, by Jianhui Chen, Hoang Le, Peter Carr, Yisong Yue, and Jim Little. Computer Vision and Pattern Recognition, 2016.
Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning, by Christoph Dann and Emma Brunskill. Neural Information Processing Systems, 2015.

Off Policy Evaluation and Learning

Exploration Scavenging, by John Langford, Alexander Strehl, and Jenn Wortman Vaughan. International Conference on Machine Learning, 2008.
Doubly Robust Policy Evaluation and Learning, by Miro Dudik, John Langford, and Lihong Li. International Conference on Machine Learning, 2011.
Counterfactual Risk Minimization: Learning from Logged Bandit Feedback, by Adith Swaminathan and Thorsten Joachims. International Conference on Machine Learning, 2015.

Crowdsourcing

Optimistic Knowledge Gradient Policy for Optimal Budget Allocation in Crowdsourcing, by Xi Chen, Qihang Lin, and Denny Zhou. International Conference on Machine Learning, 2013. [appendix][journal version]
(position paper) Online Decision Making in Crowdsourcing Markets: Theoretical Challenges, by Alex Slivkins and Jenn Wortman Vaughan. ACM SIGecom Exchanges 12(2), 4-23, 2013.
Near-Optimally Teaching the Crowd to Classify, by Adish Singla, Ilija Bogunovic, Gabor Bartok, Amin Karbasi, and Andreas Krause. International Conference on Machine Learning, 2014.
Adaptively Learning the Crowd Kernel, by Omer Tamuz, Ce Liu, Serge Belongie, Ohad Shamir, and Adam Kalai. International Conference on Machine Learning, 2011.
Adaptive Crowdsourcing Algorithms for the Bandit Survey Problem, by Ittai Abraham, Omar Alonso, Vasilis Kandylas, and Alex Slivkins. Conference on Learning Theory, 2013.

Machine Teaching

(survey paper) Machine Teaching: An Inverse Problem to Machine Learning and an Approach Toward Optimal Education, by Xiaojin Zhu. AAAI Conference on Artificial Intelligence, 2015.
How Do Humans Teach: On Curriculum Learning and Teaching Dimension, by Faisal Khan, Xiaojin Zhu, and Bilge Mutlu. Neural Information Processing Systems, 2011.
Optimal Teaching for Limited-Capacity Human Learners, by Kaustubh Patil, Xiaojin Zhu, Lukasz Kopec, and Bradley Love. Neural Information Processing Systems, 2014.
Near-Optimally Teaching the Crowd to Classify, by Adish Singla, Ilija Bogunovic, Gabor Bartok, Amin Karbasi, and Andreas Krause. International Conference on Machine Learning, 2014.

Modeling Human Decision Making & Interpreting Human Feedback

Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting, by Shunan Zhang and Angela Yu. Neural Information Processing Systems, 2013.
Minimally Invasive Randomization for Collecting Unbiased Preferences from Clickthrough Logs, by Filip Radlinski and Thorsten Joachims. AAAI Conference on Artificial Intelligence, 2006.
Large-Scale Validation and Analysis of Interleaved Search Evaluation, by Olivier Chapelle, Thorsten Joachims, Filip Radlinski, and Yisong Yue. ACM Transactions on Information Systems, 30(1), 6:1-6:41, 2012.
How Do Humans Teach: On Curriculum Learning and Teaching Dimension, by Faisal Khan, Xiaojin Zhu, and Bilge Mutlu. Neural Information Processing Systems, 2011.
Optimal Teaching for Limited-Capacity Human Learners, by Kaustubh Patil, Xiaojin Zhu, Lukasz Kopec, and Bradley Love. Neural Information Processing Systems, 2014.

Safe Exploration

Safe Exploration in Markov Decision Processes, by Teodor Mihai Moldovan and Pieter Abbeel. International Conference on Machine Learning, 2012.
Safe Exploration for Optimization with Gaussian Processes, by Yanan Sui, Alkis Gotovos, Joel Burdick, and Andreas Krause. International Conference on Machine Learning, 2015.

Connections to Game Theory

The Price of Truthfulness for Pay-Per-Click Auctions, by Nikhil Devanur and Sham Kakade. ACM Conference on Economics and Computation, 2009.
Incentivizing Exploration, by Peter Frazier, David Kempe, Jon Kleinberg, and Robert Kleinberg. ACM Conference on Economics and Computation, 2014.
Composable and Efficient Mechanisms, by Vasilis Syrgkanis and Eva Tardos. Symposium on Theory of Computing, 2013.
Online Learning and Profit Maximization from Revealed Preferences, by Kareem Amin, Rachel Cummings, Lili Dworkin, Michael Kearns, and Aaron Roth. AAAI Conference on Artificial Intelligence, 2015.

Related Courses, Tutorials, and Textbooks

Reinforcement Learning: An Introduction, by Rich Sutton and Andrew Barto. MIT Press, 1998.
CSE599s: Online Learning, taught by Brendan McMahan and Ofer Dekel.
ICML Tutorial on bandits, taught by Jean-Yves Audibert and Rémi Munos.
CSE 291 Winter 2011: Topics in Online Learning and Bandit Problems, taught by Kamalika Chaudhuri.
Active Learning and Optimized Information Gathering, taught by Andreas Krause.
Active Learning Tutorial, ICML 2009, taught by Sanjoy Dasgupta and John Langford.
Advanced Topics in Machine Learning, taught by Thorsten Joachims.
CS 294: Deep Reinforcement Learning, Fall 2015, taught by John Schulman and Pieter Abbeel.
Advanced Topics: Reinforcement Learning, taught by David Silver.
Learning, Games, and Electronic Markets, taught by Bobby Kleinberg.
Real Life Reinforcement Learning, taught by Emma Brunskill.