Distributed consensus-based multi-agent temporal-difference learning

7 Citations (Scopus)

Abstract

In this paper, we propose two new distributed consensus-based algorithms for temporal-difference learning in multi-agent Markov decision processes. The algorithms are of off-policy type and are aimed at linear approximation of the value function. Restricting agents’ observations to local data and communications to their small neighborhoods, the algorithms consist of: (a) local updates of the parameter estimates based on either the standard TD(λ) or the emphatic ETD(λ) algorithm, and (b) a dynamic consensus scheme implemented over a time-varying lossy communication network. The algorithms are completely decentralized, allowing efficient parallelization and applications where the agents may have different behavior policies and different initial state distributions while evaluating a common target policy. It is proved under nonrestrictive assumptions that the proposed algorithms weakly converge to the solutions of the mean ordinary differential equation (ODE) common to all the agents. It is also proved that the whole system may be stabilized by a proper choice of the network and that the parameter estimates weakly converge to consensus. A discussion is given of the asymptotic bias and variance of the estimates, of the projected forms of the proposed algorithms, and of the restrictiveness of the adopted assumptions. Simulation results illustrate the main properties of the algorithms and provide comparisons with similar schemes.
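To make the two-step structure described in the abstract concrete, the following is a minimal sketch of one iteration: a local off-policy TD(λ) update with linear value-function approximation, followed by a consensus mixing step over a communication graph. All names (local_td_lambda_step, consensus_step, W, step sizes, the fully connected averaging matrix) are illustrative assumptions, not the paper's notation; the actual algorithms, including the ETD(λ) variant and the treatment of time-varying lossy networks, are more general.

```python
import numpy as np

def local_td_lambda_step(theta, z, phi_s, phi_s_next, reward, rho,
                         gamma=0.99, lam=0.9, alpha=0.01):
    """One off-policy TD(lambda) update for a single agent.

    theta : parameters of the linear approximation V(s) ~ phi(s)^T theta
    z     : eligibility trace vector
    rho   : importance-sampling ratio target_policy / behavior_policy
    """
    z = rho * (gamma * lam * z + phi_s)                           # trace update
    delta = reward + gamma * phi_s_next @ theta - phi_s @ theta   # TD error
    theta = theta + alpha * delta * z                             # local step
    return theta, z

def consensus_step(thetas, W):
    """Mix the agents' parameter estimates with a row-stochastic weight
    matrix W describing the current communication graph."""
    return W @ thetas  # each row of thetas is one agent's parameter vector

# Toy usage: 3 agents, 4 features, plain averaging over a complete graph.
rng = np.random.default_rng(0)
n_agents, n_feat = 3, 4
thetas = rng.normal(size=(n_agents, n_feat))
traces = np.zeros((n_agents, n_feat))
W = np.full((n_agents, n_agents), 1.0 / n_agents)

for i in range(n_agents):
    phi_s, phi_s_next = rng.normal(size=n_feat), rng.normal(size=n_feat)
    thetas[i], traces[i] = local_td_lambda_step(
        thetas[i], traces[i], phi_s, phi_s_next, reward=1.0, rho=1.0)
thetas = consensus_step(thetas, W)
```

In this sketch each agent uses only its own samples and the parameters of its neighbors (encoded by the nonzero entries of W), which reflects the decentralized, locally communicating setting the paper studies.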

Original language: English
Article number: 110922
Journal: Automatica
Volume: 151
DOIs
Publication status: Published - May 2023

Bibliographical note

Publisher Copyright:
© 2023 The Author(s)

Funding

Funders and funder numbers:
AI-DECIDE
Fundação para a Ciência e a Tecnologia: UIDB/04111/2020
Science Fund of the Republic of Serbia: 6524745
