Self-supervised ControlNet with Spatio-Temporal Mamba for Real-world Video Super-resolution

[Paper] (coming soon!)      [Code] (coming soon!)

Abstract

Existing diffusion-based video super-resolution (VSR) methods are prone to introducing complex degradations and noticeable artifacts into the restored high-resolution videos due to their inherent randomness. In this paper, we propose a noise-robust real-world VSR framework that incorporates self-supervised learning and Mamba into pre-trained latent diffusion models. To ensure content consistency across adjacent frames, we enhance the diffusion model with a global spatio-temporal attention mechanism, realized by a Video State-Space block with a 3D Selective Scan module, which reinforces coherence at an affordable computational cost. To further reduce artifacts in the generated details, we introduce a self-supervised ControlNet that uses HR features as guidance and employs contrastive learning to extract degradation-insensitive features from LR videos. Finally, we propose a three-stage training strategy based on a mixture of HR and LR videos to stabilize VSR training. The proposed Self-supervised ControlNet with Spatio-Temporal Continuous Mamba VSR algorithm achieves superior perceptual quality compared with state-of-the-art methods on real-world VSR benchmark datasets, validating the effectiveness of the proposed model design and training strategies. Our code and demonstrations will be made available on this page.

Approach

Our method consists of two key modules: the Spatio-Temporal Continuous Mamba (STCM) and the Self-supervised ControlNet (MoCoCtrl). STCM incorporates a 3D-Mamba block that, combined with a spatio-temporal continuous scan, provides comprehensive 3D attention for both inter-frame and intra-frame modeling (a sketch of the scan ordering is given below). The Self-supervised ControlNet adopts the MoCo architecture to apply contrastive learning between LR and HR features, aligning LR features with noise-free HR features and thereby reducing the impact of degradation.
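
The snippet below is a minimal sketch, not the released implementation, of one plausible spatio-temporal continuous ("snake") scan ordering: it flattens a (B, C, T, H, W) video into a 1D token sequence in which consecutive tokens are always spatial or temporal neighbors, which is the property a 1D selective-scan (Mamba) block needs in order to model 3D context. The function names are illustrative assumptions.

```python
import torch

def continuous_scan_indices(t: int, h: int, w: int) -> torch.Tensor:
    """Permutation of length t*h*w implementing a continuous 3D scan:
    rows snake left/right within each frame, and every other frame is
    traversed backwards, so the path never jumps across the volume."""
    idx = torch.arange(t * h * w).view(t, h, w)
    order = []
    for ti in range(t):
        frame = idx[ti].clone()
        frame[1::2] = frame[1::2].flip(-1)  # reverse every other row (snake)
        seq = frame.reshape(-1)
        if ti % 2 == 1:                     # walk odd frames backwards so the
            seq = seq.flip(0)               # scan re-enters where it left off
        order.append(seq)
    return torch.cat(order)

def flatten_video(x: torch.Tensor) -> torch.Tensor:
    """(B, C, T, H, W) -> (B, T*H*W, C) tokens in continuous-scan order,
    ready to be fed to a 1D state-space (Mamba) layer."""
    b, c, t, h, w = x.shape
    tokens = x.reshape(b, c, -1).transpose(1, 2)  # tokens in raster order
    return tokens[:, continuous_scan_indices(t, h, w).to(x.device)]
```

In practice, 3D selective-scan designs typically combine several such scan directions (e.g., forward and backward, height-first and width-first) and merge the results; only a single direction is shown here.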


Our method utilizes Patch-Level Momentum Contrast for degradation removal. LR and HR images are separately processed by the ControlNet to extract feature maps, and patch-level contrastive learning is then applied to these features, as sketched below. The ControlNet for LR is updated online, whereas the ControlNet for HR updates its weights as a momentum (exponential moving average) copy of the LR branch. Additionally, a multi-stage HR-LR hybrid training strategy is applied to stabilize the training process.
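
Below is a minimal sketch, under assumed names and shapes, of the two ingredients described above: the MoCo-style momentum (EMA) update of the HR ControlNet from the online LR ControlNet, and a patch-level InfoNCE loss in which the HR feature at the same spatial location is the positive and all other locations are negatives. This illustrates the idea and is not the authors' code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(online: torch.nn.Module, momentum: torch.nn.Module,
                    m: float = 0.999) -> None:
    """EMA update: the HR (key) ControlNet tracks the LR (query) ControlNet."""
    for p_o, p_m in zip(online.parameters(), momentum.parameters()):
        p_m.data.mul_(m).add_(p_o.data, alpha=1.0 - m)

def patch_infonce(feat_lr: torch.Tensor, feat_hr: torch.Tensor,
                  tau: float = 0.07) -> torch.Tensor:
    """Patch-level contrastive loss between (B, C, H, W) feature maps.
    Each spatial location is one 'patch'; the HR feature at the same
    location is its positive, every other location is a negative."""
    b, c, h, w = feat_lr.shape
    q = F.normalize(feat_lr.flatten(2).transpose(1, 2), dim=-1)  # (B, N, C)
    k = F.normalize(feat_hr.flatten(2).transpose(1, 2), dim=-1)  # (B, N, C)
    logits = torch.bmm(q, k.transpose(1, 2)) / tau               # (B, N, N)
    labels = torch.arange(h * w, device=q.device).expand(b, -1)  # diagonal
    return F.cross_entropy(logits.reshape(-1, h * w), labels.reshape(-1))
```

After each optimizer step on the LR branch, `momentum_update` would be called so that the HR branch provides slowly evolving, noise-free targets, mirroring MoCo's key encoder.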


Video demo

Comparison with other models:


We borrow the source code of this project page from DreamBooth.