Distributed gradient temporal difference off-policy learning with eligibility traces: Weak convergence

Miloš S. Stankovic, Marko Beko, Srdjan S. Stankovic

Research output: peer-reviewed

3 Citations (Scopus)

Abstract

In this paper we propose two novel distributed algorithms for multi-agent off-policy learning of a linear approximation of the value function in Markov decision processes. The algorithms differ in how distributed consensus iterations are incorporated into a basic, recently proposed single-agent scheme. The proposed completely decentralized off-policy learning schemes include local eligibility traces and allow applications in which the agents may all have different behavior policies while evaluating a single target policy. Under nonrestrictive assumptions on the time-varying network topology and the individual state-visiting distributions of the agents, we prove that the parameter estimates of the algorithms converge weakly to consensus. The variance reduction properties of the proposed algorithms are demonstrated. We also formulate specific guidelines on how to design the network weights and topology. The results are illustrated using simulations.
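As a rough illustration of the consensus ingredient the abstract refers to, the sketch below repeatedly averages per-agent parameter estimates over a small network. It is not the authors' algorithm: the local gradient temporal difference updates and eligibility traces are omitted entirely, the network is fixed rather than time-varying, and the names `consensus_step`, `disagreement`, and `WEIGHTS`, the three-agent line graph, and all numeric values are assumptions made for this example.

```python
def consensus_step(params, weights):
    """Replace each agent's vector by a weighted average of all agents' vectors.

    params:  list of per-agent parameter vectors (lists of floats)
    weights: row-stochastic matrix; weights[i][j] is the weight agent i
             gives to agent j (zero when i and j are not neighbors)
    """
    n_agents = len(params)
    dim = len(params[0])
    return [
        [sum(weights[i][j] * params[j][k] for j in range(n_agents))
         for k in range(dim)]
        for i in range(n_agents)
    ]


def disagreement(params):
    """Largest gap between any two agents in any single coordinate."""
    dim = len(params[0])
    return max(
        max(p[k] for p in params) - min(p[k] for p in params)
        for k in range(dim)
    )


# Hypothetical setup: three agents on a line graph 0 - 1 - 2.
# Each row sums to 1, so every update is a convex combination.
WEIGHTS = [
    [0.5, 0.5, 0.0],
    [0.25, 0.5, 0.25],
    [0.0, 0.5, 0.5],
]

# Made-up initial parameter estimates, one 2-dimensional vector per agent.
estimates = [[1.0, 0.0], [0.0, 2.0], [-1.0, 4.0]]

initial_gap = disagreement(estimates)
for _ in range(50):
    estimates = consensus_step(estimates, WEIGHTS)
final_gap = disagreement(estimates)

print(initial_gap, final_gap)
```

With a connected graph and row-stochastic weights like these, the gap between agents shrinks geometrically. In a full scheme of the kind the abstract describes, each agent would also apply its own local learning update between averaging steps.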

Original language: English
Pages (from-to): 1563-1568
Number of pages: 6
Journal: IFAC-PapersOnLine
Volume: 53
Publication status: Published - 2020
Event: 21st IFAC World Congress 2020 - Berlin
Duration: 12 Jul 2020 - 17 Jul 2020

Bibliographical note

Publisher Copyright:
Copyright © 2020 The Authors. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0)

Funding

Funder: Fundação para a Ciência e a Tecnologia
Funder numbers: PCIF/SSI/0102/2017, IF/00325/2015, UIDB/04111/2020
