Breaking the Curse of Multiagents in a Large State Space: RL in Markov Games with Independent Linear Function Approximation. (arXiv:2302.03673v3 [cs.LG] UPDATED)
By: <a href="http://arxiv.org/find/cs/1/au:+Cui_Q/0/1/0/all/0/1">Qiwen Cui</a>, <a href="http://arxiv.org/find/cs/1/au:+Zhang_K/0/1/0/all/0/1">Kaiqing Zhang</a>, <a href="http://arxiv.org/find/cs/1/au:+Du_S/0/1/0/all/0/1">Simon S. Du</a> Posted: June 23, 2023
We propose a new model, independent linear Markov game, for multi-agent
reinforcement learning with a large state space and a large number of agents.
This is a class of Markov games with independent linear function approximation,
where each agent has its own function approximation for the state-action value
functions that are marginalized by other players’ policies. We design new
algorithms for learning the Markov coarse correlated equilibria (CCE) and
Markov correlated equilibria (CE) with sample complexity bounds that only scale
polynomially with each agent’s own function class complexity, thus breaking the
curse of multiagents. In contrast, existing works for Markov games with
function approximation have sample complexity bounds scale with the size of the
emph{joint action space} when specialized to the canonical tabular Markov game
setting, which is exponentially large in the number of agents. Our algorithms
rely on two key technical innovations: (1) utilizing policy replay to tackle
non-stationarity incurred by multiple agents and the use of function
approximation; (2) separating learning Markov equilibria and exploration in the
Markov games, which allows us to use the full-information no-regret learning
oracle instead of the stronger bandit-feedback no-regret learning oracle used
in the tabular setting. Furthermore, we propose an iterative-best-response type
algorithm that can learn pure Markov Nash equilibria in independent linear
Markov potential games. In the tabular case, by adapting the policy replay
mechanism for independent linear Markov games, we propose an algorithm with
$widetilde{O}(epsilon^{-2})$ sample complexity to learn Markov CCE, which
improves the state-of-the-art result $widetilde{O}(epsilon^{-3})$ in
Daskalakis et al. 2022, where $epsilon$ is the desired accuracy, and also
significantly improves other problem parameters.
Provided by:
http://arxiv.org/icons/sfx.gif