Precise 6D pose estimation of rigid objects from RGB images is a critical but
challenging task in robotics, augmented reality, and human-computer interaction.
To address this problem, we propose DeepRM, a novel recurrent network
architecture for 6D pose refinement. DeepRM leverages initial coarse pose
estimates to render synthetic images of target objects. The rendered images are
then matched with the observed images to predict a rigid transform for updating
the previous pose estimate. This process is repeated to incrementally refine
the estimate at each iteration. The DeepRM architecture incorporates LSTM units
to propagate information through each refinement step, significantly improving
overall performance. In contrast to current two-stage Perspective-n-Point-based
solutions, DeepRM is trained end-to-end and uses a scalable backbone that can be
tuned for accuracy or efficiency via a single parameter. During training, a
multi-scale optical flow head is added to predict the optical flow between the
observed and synthetic images. Optical flow prediction stabilizes the training
process and encourages the network to learn features relevant to the task of
pose estimation. Our results demonstrate that DeepRM achieves state-of-the-art
performance on two widely used, challenging datasets.
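The iterative render-and-match refinement loop described above can be sketched in simplified form. Everything below is a hypothetical placeholder for illustration only: the toy pose vector, `render`, and `match_and_predict_update` stand in for DeepRM's actual renderer, matching network, and LSTM state, which are not specified in this abstract.

```python
def render(pose):
    # Placeholder renderer: in the described method this would produce a
    # synthetic image of the target object under the current pose estimate.
    # Here the "image" is the pose itself so the loop stays runnable.
    return pose

def match_and_predict_update(observed, rendered, state):
    # Placeholder matcher: the described network compares the observed and
    # rendered images and predicts a rigid-transform update, with an LSTM
    # carrying information across iterations. Here we simply step a fixed
    # fraction toward the observation and pass a dummy recurrent state.
    delta = [0.5 * (o - r) for o, r in zip(observed, rendered)]
    return delta, state

def refine_pose(observed, initial_pose, num_iters=5):
    pose = list(initial_pose)
    state = None  # recurrent state propagated through each refinement step
    for _ in range(num_iters):
        rendered = render(pose)
        delta, state = match_and_predict_update(observed, rendered, state)
        pose = [p + d for p, d in zip(pose, delta)]  # apply the pose update
    return pose

# Toy usage: a 3-vector stands in for the 6D pose; the coarse estimate is
# incrementally pulled toward the (here, directly observed) true pose.
true_pose = [1.0, 2.0, 3.0]
coarse = [0.0, 0.0, 0.0]
refined = refine_pose(true_pose, coarse, num_iters=8)
```

The point of the sketch is the structure, not the arithmetic: each iteration renders from the current estimate, matches against the observation, and applies the predicted update, so errors shrink incrementally rather than being corrected in one shot.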