Abstract
In this paper, we propose two novel distributed algorithms for multi-agent off-policy learning of a linear approximation of the value function in Markov decision processes. The algorithms differ in how distributed consensus iterations are incorporated into a basic, recently proposed single-agent scheme. The proposed completely decentralized off-policy learning schemes subsume local eligibility traces and allow applications in which all the agents may have different behavior policies while evaluating a single target policy. Under nonrestrictive assumptions on the time-varying network topology and the individual state-visiting distributions of the agents, we prove that the parameter estimates of the algorithms converge weakly to consensus. The variance reduction properties of the proposed algorithms are demonstrated. We also formulate specific guidelines on how to design the network weights and topology. The results are illustrated using simulations.
Original language | English |
---|---|
Pages (from-to) | 1563-1568 |
Number of pages | 6 |
Journal | IFAC-PapersOnLine |
Volume | 53 |
DOIs | |
Publication status | Published - 2020 |
Event | 21st IFAC World Congress 2020 - Berlin. Duration: 12 Jul 2020 → 17 Jul 2020 |
Bibliographical note
Publisher Copyright: Copyright © 2020 The Authors. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0)
Funding
Funders | Funder number |
---|---|
Fundação para a Ciência e a Tecnologia | PCIF/SSI/0102/2017, IF/00325/2015, UIDB/04111/2020 |