TY - GEN
T1 - Decentralized Multi-Agent Multi-Task Q-Learning with Function Approximation for POMDPs
AU - Stankovic, Milos S.
AU - Beko, Marko
AU - Stankovic, Srdjan S.
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - In this paper we propose a novel distributed gradient-based two-time-scale algorithm for decentralized multi-agent multi-task learning (MTL) using a linear approximation of the optimal action value function (Q-function) in POMDPs. The algorithm is based on the idea of using in a concurrent way recursive Bayesian state belief filters for estimation of the system model parameters, prediction of the hidden state and definition of the optimal approximation parameters of the local Q-functions. The main MTL algorithm is composed of: 1) local parameter updates based on an off-policy gradient-based learning algorithm with target policy belonging to the greedy or Gibbs classes, and 2) a linear stochastic time-varying consensus scheme for parameters shared between the agents in order to achieve the MTL goal. It is proved, under general assumptions, that the parameter estimates generated by the proposed algorithm weakly converge to a bounded invariant set of the corresponding ordinary differential equations (ODE). Simulation results illustrate the effectiveness of the algorithm.
AB - In this paper we propose a novel distributed gradient-based two-time-scale algorithm for decentralized multi-agent multi-task learning (MTL) using a linear approximation of the optimal action value function (Q-function) in POMDPs. The algorithm is based on the idea of using in a concurrent way recursive Bayesian state belief filters for estimation of the system model parameters, prediction of the hidden state and definition of the optimal approximation parameters of the local Q-functions. The main MTL algorithm is composed of: 1) local parameter updates based on an off-policy gradient-based learning algorithm with target policy belonging to the greedy or Gibbs classes, and 2) a linear stochastic time-varying consensus scheme for parameters shared between the agents in order to achieve the MTL goal. It is proved, under general assumptions, that the parameter estimates generated by the proposed algorithm weakly converge to a bounded invariant set of the corresponding ordinary differential equations (ODE). Simulation results illustrate the effectiveness of the algorithm.
UR - https://www.scopus.com/pages/publications/86000613748
U2 - 10.1109/CDC56724.2024.10886386
DO - 10.1109/CDC56724.2024.10886386
M3 - Conference contribution
AN - SCOPUS:86000613748
T3 - Proceedings of the IEEE Conference on Decision and Control
SP - 7680
EP - 7685
BT - 2024 IEEE 63rd Conference on Decision and Control, CDC 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 63rd IEEE Conference on Decision and Control, CDC 2024
Y2 - 16 December 2024 through 19 December 2024
ER -