As deep convolutional neural networks (DNNs) are widely used in various
fields of computer vision, leveraging the overfitting capability of DNNs to
achieve video resolution upscaling has become a new trend in modern video
delivery systems. By dividing videos into chunks and overfitting each chunk with
a super-resolution model, the server encodes videos before transmitting them to
the clients, thus achieving better video quality and transmission efficiency.
However, a large number of chunks is usually needed to ensure good overfitting
quality, which substantially increases storage and consumes more bandwidth
for data transmission. On the other hand, decreasing the number of
chunks through training optimization techniques usually requires high model
capacity, which significantly slows down execution speed. To reconcile this
trade-off, we propose a novel method for high-quality and efficient video
resolution upscaling, which leverages spatial-temporal information to
accurately divide a video into chunks, thus keeping both the number of chunks
and the model size to a minimum. Additionally, we advance our method into a
single overfitting model via a data-aware joint training technique, which
further reduces the storage requirement with a negligible quality drop. We deploy our
models on an off-the-shelf mobile phone, and experimental results show that our
method achieves real-time video super-resolution with high video quality.
Compared with the state-of-the-art, our method achieves 28 fps streaming speed
with 41.6 dB PSNR, which is 14$\times$ faster and 2.29 dB better in live video
resolution upscaling tasks. Code is available at
https://github.com/coulsonlee/STDO-CVPR2023.git
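
As a rough illustration of the chunk-wise overfitting idea summarized above, the sketch below is a minimal, self-contained example assuming PyTorch; SRNet, split_into_patches, divide_into_chunks, and overfit_chunk are hypothetical names, and grouping patches by their bicubic reconstruction error is only a stand-in heuristic for the paper's spatial-temporal chunk division. It groups patches from all frames into chunks by reconstruction difficulty and overfits one small super-resolution model per chunk.

# Minimal sketch of chunk-wise overfitting for video super-resolution.
# Assumptions (not from the paper): PyTorch, a toy SRNet, and a simple
# per-patch bicubic-error heuristic standing in for the spatial-temporal
# chunk division described in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SRNet(nn.Module):
    """Tiny illustrative super-resolution network (x2 upscaling)."""
    def __init__(self, scale=2, channels=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
        )
    def forward(self, x):
        return self.body(x)

def split_into_patches(frames_hr, patch=96):
    """Cut every HR frame into non-overlapping patches (over space and time)."""
    patches = []
    for f in frames_hr:                       # f: (3, H, W)
        _, H, W = f.shape
        for y in range(0, H - patch + 1, patch):
            for x in range(0, W - patch + 1, patch):
                patches.append(f[:, y:y + patch, x:x + patch])
    return patches

def divide_into_chunks(patches, num_chunks=2, scale=2):
    """Group patches by how badly bicubic upscaling reconstructs them.

    Stand-in heuristic: patches with similar reconstruction difficulty
    land in the same chunk.
    """
    losses = []
    for p in patches:
        lr = F.interpolate(p[None], scale_factor=1 / scale, mode="bicubic")
        up = F.interpolate(lr, size=p.shape[-2:], mode="bicubic")
        losses.append(F.mse_loss(up[0], p).item())
    order = sorted(range(len(patches)), key=lambda i: losses[i])
    chunk_size = (len(patches) + num_chunks - 1) // num_chunks
    return [[patches[i] for i in order[k:k + chunk_size]]
            for k in range(0, len(patches), chunk_size)]

def overfit_chunk(chunk, scale=2, steps=200):
    """Overfit one small SR model to a single chunk of patches."""
    model = SRNet(scale)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    hr = torch.stack(chunk)                                    # (N, 3, p, p)
    lr = F.interpolate(hr, scale_factor=1 / scale, mode="bicubic")
    for _ in range(steps):
        opt.zero_grad()
        loss = F.l1_loss(model(lr), hr)
        loss.backward()
        opt.step()
    return model

if __name__ == "__main__":
    # Stand-in "video": 4 random HR frames; replace with decoded frames.
    frames = [torch.rand(3, 192, 192) for _ in range(4)]
    chunks = divide_into_chunks(split_into_patches(frames), num_chunks=2)
    models = [overfit_chunk(c) for c in chunks]   # one tiny model per chunk
    print(f"trained {len(models)} chunk-specific SR models")

In this sketch the server would transmit the low-resolution chunks together with their chunk-specific models, and the client runs the matching model on each chunk to upscale it; the single joint model and the exact chunk-division criterion from the paper are not reproduced here.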