Eesti Teaduste Akadeemia Kirjastus (Estonian Academy Publishers), since 1952
Proceedings of the Estonian Academy of Sciences
ISSN 1736-7530 (Electronic)
ISSN 1736-6046 (Print)
Impact Factor (2022): 0.9
Empirical explorations of strategic reinforcement learning: a case study in the sorting problem; pp. 186–196
DOI: 10.3176/proc.2020.3.02

Ching-Sheng Lin, Jung-Sing Jwo, Cheng-Hsiung Lee, Ya-Ching Lo

Recent advances in deep learning and reinforcement learning have made it possible to create agents capable of mimicking human behaviour. In this paper, we examine how a reinforcement learning agent behaves under different learning strategies and whether it can, in principle, complete a task at a level comparable to human performance. To study the effect of different reward types, we introduce two reward schemes: immediate reward and pure-delayed reward. To build a more human-like agent, we propose a goal-driven design that forces the agent to reach a level close to human ability, together with a training mechanism that learns only from good trajectories. We employ Q-learning, one of the most popular reinforcement learning algorithms, for our study. Because the sorting problem is a classical topic in theoretical computer science with widespread applications, we use it for the empirical evaluation and compare our results against algorithmic solutions.
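To make the contrast between the two reward schemes concrete, the following is a minimal sketch of Q-learning applied to sorting a short list. It is an illustration under stated assumptions, not the authors' implementation: the tabular Q-function, the adjacent-swap action set, the inversion-based immediate reward, the reward magnitudes, and all names and hyperparameters (`train`, `greedy_sort`, `alpha`, `gamma`, `epsilon`) are hypothetical choices, since the abstract does not specify them. The goal-driven design and the good-trajectory filter are not modelled here.

```python
import itertools
import random


def inversions(state):
    """Count out-of-order pairs; 0 means the tuple is sorted."""
    return sum(
        1
        for i in range(len(state))
        for j in range(i + 1, len(state))
        if state[i] > state[j]
    )


def step(state, action, scheme):
    """Swap positions action and action + 1; return (next_state, reward, done)."""
    items = list(state)
    items[action], items[action + 1] = items[action + 1], items[action]
    nxt = tuple(items)
    done = inversions(nxt) == 0
    if scheme == "immediate":
        # Assumed shaping: +1 if the swap removed an inversion, -1 otherwise.
        reward = float(inversions(state) - inversions(nxt))
    else:
        # Pure-delayed: no feedback until the list is fully sorted.
        reward = 1.0 if done else 0.0
    return nxt, reward, done


def greedy_action(q, state, n_actions, rng):
    """Pick a highest-valued action, breaking ties at random."""
    values = [q.get((state, a), 0.0) for a in range(n_actions)]
    best = max(values)
    return rng.choice([a for a, v in enumerate(values) if v == best])


def train(n=4, scheme="immediate", episodes=5000, max_steps=30,
          alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration over permutations."""
    rng = random.Random(seed)
    n_actions = n - 1
    q = {}  # maps (state, action) to an estimated return; missing means 0.0
    for _ in range(episodes):
        items = list(range(n))
        rng.shuffle(items)
        state = tuple(items)
        for _ in range(max_steps):
            if inversions(state) == 0:
                break
            if rng.random() < epsilon:
                action = rng.randrange(n_actions)
            else:
                action = greedy_action(q, state, n_actions, rng)
            nxt, reward, done = step(state, action, scheme)
            # Standard Q-learning target: reward plus discounted best next value.
            future = 0.0 if done else max(
                q.get((nxt, a), 0.0) for a in range(n_actions)
            )
            old = q.get((state, action), 0.0)
            q[(state, action)] = old + alpha * (reward + gamma * future - old)
            state = nxt
    return q


def greedy_sort(q, start, max_steps=30, seed=0):
    """Follow the learned greedy policy; return (final_state, swaps_used)."""
    rng = random.Random(seed)
    state, swaps = tuple(start), 0
    while inversions(state) > 0 and swaps < max_steps:
        action = greedy_action(q, state, len(state) - 1, rng)
        state, _, _ = step(state, action, "delayed")
        swaps += 1
    return state, swaps
```

Calling `train(scheme="immediate")` or `train(scheme="delayed")` and then `greedy_sort(q, (3, 1, 0, 2))` shows how many adjacent swaps the learned policy uses, which can be compared with the inversion count of the input (the minimum number of adjacent swaps any algorithm needs). On a problem this small both schemes can learn a working policy; the delayed scheme simply has to propagate its single terminal reward backwards through more updates.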

