Multi-agent reinforcement learning for partially observable cooperative systems with acyclic dependence structure
Abstract
Single-agent reinforcement learning algorithms can be applied directly to multi-agent systems in an independent-learning approach, but they then lose their convergence guarantees because of non-stationarity. We prove that, in transition-independent Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs), non-stationarity can be mitigated by a multi-scale approach when the interdependence of the agents' dynamics can be represented by a directed acyclic graph (DAG). We propose a multi-scale Q-learning algorithm (MQL) in which agents update local Q-learning iterates at different timescales, without communication, and still converge. To this end, we first show that the loss of information about the global state can be modeled as state-dependent Markovian noise. We then show that results from stochastic approximation theory can be used to prove the convergence of MQL under partial state observability. Next, we give practical ways to exploit knowledge of agent interactions to assign learning rates that ensure convergence, and propose a NetworkMQL algorithm that achieves convergence in Networked Distributed POMDPs (ND-POMDPs). Finally, we validate both MQL and NetworkMQL on a wind-farm control problem from the energy industry.
Origin: files produced by the author(s)