A study of first-passage time minimization via Q-learning in heated gridworlds

Maria A. Larchenko, Pavel Osinenko, Grigory Yaremenko, Vladimir V. Palyulin

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)


Optimization of first-passage times is required in applications ranging from nanobot navigation to market trading. In such settings, noise levels are often distributed unevenly across the environment. We extensively study how a learning agent fares in 1- and 2-dimensional heated gridworlds with an uneven temperature distribution. The results show certain bias effects in agents trained via simple tabular Q-learning, SARSA, Expected SARSA and Double Q-learning. Namely, the state dependency of the noise triggers convergence to suboptimal solutions, and the resulting policies keep following these solutions over practically long learning times. A high learning rate prevents exploration of regions with higher temperature, while a sufficiently low rate increases the presence of agents in such regions. These biases of temporal-difference-based reinforcement learning methods may have implications for their application in real-world physical scenarios and for agent design.
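The abstract does not spell out the environment dynamics, so the following is only a minimal sketch, under assumed dynamics, of tabular Q-learning for first-passage time minimization in a 1-dimensional heated gridworld. The temperature model (the chosen move is replaced by a uniformly random move with probability T(s)), the reward of -1 per step, the reflecting walls, and all names and hyperparameters (`step`, `q_learning`, `greedy_policy`, `alpha`, `eps`, `temps`) are hypothetical illustrations, not the authors' setup.

```python
import random

# Hypothetical 1-D "heated" gridworld: cells 0..n-1, target at cell n-1.
# Assumption: in a cell with temperature temps[s] in [0, 1], the agent's chosen
# move is replaced by a uniformly random move with probability temps[s].
def step(state, action, temps, n, rng):
    if rng.random() < temps[state]:
        action = rng.choice((-1, +1))            # thermal noise overrides the agent
    next_state = min(max(state + action, 0), n - 1)  # walls keep the agent on the grid
    done = next_state == n - 1
    return next_state, -1.0, done                # -1 per step: return = -(first-passage time)

def q_learning(temps, episodes=2000, alpha=0.1, gamma=1.0, eps=0.1,
               start=0, seed=0, max_steps=10_000):
    n = len(temps)
    rng = random.Random(seed)
    actions = (-1, +1)
    # Zero initialization is optimistic here, since every true value is negative.
    q = {(s, a): 0.0 for s in range(n) for a in actions}
    for _ in range(episodes):
        s = start
        for _ in range(max_steps):
            # Epsilon-greedy behaviour policy (ties broken toward the first action).
            if rng.random() < eps:
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda act: q[(s, act)])
            s2, r, done = step(s, a, temps, n, rng)
            # Off-policy Q-learning target: bootstrap from the best next action.
            target = r if done else r + gamma * max(q[(s2, b)] for b in actions)
            q[(s, a)] += alpha * (target - q[(s, a)])
            s = s2
            if done:
                break
    return q

def greedy_policy(q, n):
    # Greedy move (-1 = left, +1 = right) in every non-target cell.
    return [max((-1, +1), key=lambda a: q[(s, a)]) for s in range(n - 1)]
```

With a uniformly cold grid, for example `greedy_policy(q_learning([0.0] * 6), 6)`, the learned policy should move right in every non-target cell. Raising `temps` in some cells and varying `alpha` is one way to probe the learning-rate effects the abstract describes; replacing the `max` in the bootstrap target with the value of the next action actually taken turns this update into SARSA.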

Original language: English
Pages (from-to): 159349-159363
Number of pages: 15
Journal: IEEE Access
Publication status: Published - 2021


  • First-passage times
  • Path planning
  • Reinforcement learning
  • Stochastic systems


