Semi-Supervised Offline Reinforcement Learning with Action-Free Trajectories. (arXiv:2210.06518v3 [cs.LG] UPDATED)

By: Qinqing Zheng, Mikael Henaff, Brandon Amos, Aditya Grover. Posted: June 23, 2023

Natural agents can effectively learn from multiple data sources that differ
in size, quality, and types of measurements. We study this heterogeneity in the
context of offline reinforcement learning (RL) by introducing a new,
practically motivated semi-supervised setting. Here, an agent has access to two
sets of trajectories: labelled trajectories containing state, action and reward
triplets at every timestep, along with unlabelled trajectories that contain
only state and reward information. For this setting, we develop and study a
simple meta-algorithmic pipeline that learns an inverse dynamics model on the
labelled data to obtain proxy-labels for the unlabelled data, followed by the
use of any offline RL algorithm on the true and proxy-labelled trajectories.
Empirically, we find this simple pipeline to be highly successful: on several
D4RL benchmarks~\cite{fu2020d4rl}, certain offline RL algorithms can match the
performance of variants trained on a fully labelled dataset even when only 10%
of the trajectories are labelled and those trajectories are highly suboptimal. To strengthen our
understanding, we perform a large-scale controlled empirical study
investigating the interplay between data-centric properties of the labelled
and unlabelled datasets and algorithmic design choices (e.g., choice of inverse
dynamics model and offline RL algorithm), to identify general trends and best
practices for training RL agents on semi-supervised offline datasets.
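
The pipeline described above (fit an inverse dynamics model on labelled transitions, proxy-label the action-free transitions, then merge both sets for an offline RL algorithm) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the toy dynamics s' = s + a, the placeholder reward, the ridge-regression model standing in for a learned inverse dynamics network, and the function names `fit_inverse_dynamics` and `proxy_label`. The 50/450 split mirrors the 10% labelling ratio mentioned in the abstract.

```python
import numpy as np


def _features(states, next_states):
    # Inverse dynamics input: (s_t, s_{t+1}) plus a bias column.
    return np.hstack([states, next_states, np.ones((len(states), 1))])


def fit_inverse_dynamics(states, actions, next_states, ridge=1e-6):
    """Fit a linear map (s_t, s_{t+1}) -> a_t by ridge regression.

    Hypothetical stand-in for a learned inverse dynamics model.
    """
    X = _features(states, next_states)
    gram = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ actions)


def proxy_label(weights, states, next_states):
    """Predict proxy actions for action-free transitions."""
    return _features(states, next_states) @ weights


rng = np.random.default_rng(0)


def make_transitions(n, dim=3):
    # Toy system (an assumption): next state is state plus action.
    s = rng.normal(size=(n, dim))
    a = rng.normal(size=(n, dim))
    s_next = s + a
    r = -np.sum(s_next**2, axis=1)  # placeholder reward
    return s, a, r, s_next


# 10% labelled transitions, 90% unlabelled (actions withheld).
s_lab, a_lab, r_lab, s2_lab = make_transitions(50)
s_unl, a_hidden, r_unl, s2_unl = make_transitions(450)

# Step 1: learn the inverse dynamics model on labelled data only.
weights = fit_inverse_dynamics(s_lab, a_lab, s2_lab)

# Step 2: proxy-label the unlabelled transitions.
a_proxy = proxy_label(weights, s_unl, s2_unl)

# Step 3: merge true and proxy-labelled data; rewards are kept as observed.
dataset = {
    "states": np.vstack([s_lab, s_unl]),
    "actions": np.vstack([a_lab, a_proxy]),
    "rewards": np.concatenate([r_lab, r_unl]),
    "next_states": np.vstack([s2_lab, s2_unl]),
}

# Sanity check available only in this toy setting, where true actions are known.
max_err = float(np.abs(a_proxy - a_hidden).max())
```

The merged `dataset` is what would then be handed to an offline RL algorithm of choice; that step is omitted here because it depends on the specific algorithm and library used.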
