TY - GEN
T1 - Distributed Multi-Agent Gradient Based Q-Learning with Linear Function Approximation
AU - Stanković, Miloš S.
AU - Beko, Marko
AU - Stanković, Srdjan S.
N1 - DBLP License: DBLP's bibliographic metadata records provided through http://dblp.org/ are distributed under a Creative Commons CC0 1.0 Universal Public Domain Dedication. Although the bibliographic metadata records are provided consistent with CC0 1.0 Dedication, the content described by the metadata records is not. Content may be subject to copyright, rights of privacy, rights of publicity and other restrictions.
PY - 2024/6/25
Y1 - 2024/6/25
N2 - In this paper, we propose a novel distributed gradient-based two-time-scale algorithm for multi-agent off-policy learning of a linear approximation of the optimal action-value function (Q-function) in Markov decision processes (MDPs). The algorithm is composed of: 1) local parameter updates based on an off-policy gradient temporal difference learning algorithm with target policy belonging to either the greedy or the Gibbs distribution class and stationary behavior policies possibly different for each agent, and 2) a linear stochastic time-varying consensus scheme. It is proved, under general assumptions, that the parameter estimates generated by the proposed algorithm weakly converge to a bounded invariant set of the corresponding ordinary differential equation (ODE). Simulation results illustrate the effectiveness of the proposed algorithm.
UR - http://www.scopus.com/inward/record.url?scp=85200578554&partnerID=8YFLogxK
U2 - 10.23919/ecc64448.2024.10590764
DO - 10.23919/ecc64448.2024.10590764
M3 - Conference contribution
AN - SCOPUS:85200578554
T3 - 2024 European Control Conference (ECC)
SP - 2500
EP - 2505
BT - ECC
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 European Control Conference, ECC 2024
Y2 - 25 June 2024 through 28 June 2024
ER -