Hand hygiene follows a standard six-step hand-washing procedure proposed by the
World Health Organization (WHO). However, there is no effective way to monitor
whether medical staff perform it correctly, which creates a risk of disease spread.
Existing action assessment methods usually predict a single overall quality
score for an entire video. However, the internal structure of the hand hygiene
action is important for its assessment. Therefore, we propose a novel
fine-grained learning framework that jointly performs step segmentation and key
action scoring for accurate hand hygiene assessment. Existing
temporal segmentation methods usually employ multi-stage convolutional networks
to improve segmentation robustness, but they are prone to over-segmentation
because they fail to capture long-range dependencies. To address this issue, we design
a multi-stage convolution-transformer network for step segmentation. Based on
the observation that each hand-washing step involves several key actions which
determine the hand-washing quality, we design a set of key action scorers to
evaluate the quality of key actions in each step. In addition, there is no
unified dataset for hand hygiene assessment. Therefore, under the supervision of
medical staff, we contribute a video dataset that contains 300 video sequences
with fine-grained annotations. Extensive experiments on the dataset suggest
that our method assesses hand hygiene videos accurately and achieves
outstanding performance.
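
To make the over-segmentation problem mentioned above concrete, the following
minimal sketch collapses frame-wise step labels into contiguous segments and
compares a clean labeling with a noisy one. The label sequences and the helper
name `to_segments` are hypothetical and chosen only for illustration; this is
not the proposed network or its evaluation protocol.

```python
from itertools import groupby


def to_segments(frame_labels):
    """Collapse per-frame step labels into (label, start, end) runs.

    `end` is exclusive. This is an illustrative helper, not part of the
    framework described in the abstract.
    """
    segments = []
    start = 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        segments.append((label, start, start + length))
        start += length
    return segments


# Hypothetical frame-wise labels covering three hand-washing steps (1, 2, 3).
ground_truth = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
# A hypothetical prediction with short spurious label switches.
noisy_pred = [1, 1, 2, 1, 2, 2, 3, 2, 3, 3, 3, 3]

gt_segments = to_segments(ground_truth)
pred_segments = to_segments(noisy_pred)

# Fraction of frames whose predicted label matches the ground truth.
frame_accuracy = sum(
    p == g for p, g in zip(noisy_pred, ground_truth)
) / len(ground_truth)

print(len(gt_segments), len(pred_segments), round(frame_accuracy, 3))
```

In this toy case most frames are labeled correctly, yet the prediction breaks
into more than twice as many segments as the ground truth, which is the kind of
fragmentation that modeling long-range dependencies is intended to reduce.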