Introduction

Reference-to-video (R2V) generation aims to synthesize videos that align with text prompts while preserving the subject identity from reference images. However, current methods rely on explicit R2V datasets containing triplets of reference images, videos, and text prompts. Constructing such datasets is costly and difficult to scale, and it restricts generalization to unseen subject categories.

We introduce Saber, a scalable zero-shot framework that bypasses this data bottleneck. Trained exclusively on video-text pairs, Saber employs a masked training strategy where randomly sampled and partially masked video frames serve as references, compelling the model to learn identity-consistent representations without explicit R2V data. A tailored attention mechanism guided by attention masks directs the model to focus on reference-aware features while suppressing background noise. Spatial mask augmentations further mitigate copy-paste artifacts common in reference-to-video generation.

Saber naturally supports a varying number of reference images and multiple views of the same subject, enabling richer multi-subject customization. On the OpenS2V-Eval benchmark, it consistently outperforms models explicitly trained on R2V data, demonstrating strong zero-shot generalization and scalability.

Method

Our goal is to train a diffusion model capable of generating videos that preserve the identity and appearance of subjects in given reference images while following the provided text prompt. Unlike previous methods that rely on costly reference image-video-text triplets, Saber achieves R2V capabilities using only video-text pairs—the same data paradigm used for text-to-video (T2V) and image-to-video (I2V) training.

Our core idea is to simulate the R2V task by replacing explicitly collected reference images with randomly masked frames during training. This masked training strategy is supported by two key components: (i) mask augmentations designed to mitigate copy-paste artifacts, and (ii) a tailored attention mechanism that guides the model to focus on relevant reference features.

Masked Frames as Reference

Instead of relying on pre-collected reference images with limited subject diversity, we use randomly masked frames as dynamic substitutes during training. This strategy naturally introduces highly diverse reference samples, allowing the model to learn more effective subject integration and achieve stronger generalization.

For each reference, we randomly sample a frame from the video and apply a mask generator to produce diverse binary masks with controllable foreground area ratios. To mitigate copy-paste artifacts, we apply mask augmentation—random affine transformations (rotation, scaling, shear, translation, flip) to both the image and mask—disrupting spatial correspondence between masked references and video frames.
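As an illustration of the sampling step described above, the following is a minimal sketch of a mask generator with a controllable foreground area ratio. It covers only the ellipse case and is not the Saber implementation: the function names `ellipse_mask` and `sample_masked_reference`, the aspect range, and the array layout `(T, H, W, C)` are all assumptions made for the example.

```python
import numpy as np

def ellipse_mask(height, width, area_ratio, rng):
    """Return an (H, W) uint8 mask whose ellipse covers about `area_ratio` of the frame."""
    # Hypothetical aspect range; the source does not specify one.
    aspect = rng.uniform(0.5, 2.0)                 # ratio of semi-axes a / b
    target_area = area_ratio * height * width      # ellipse area is pi * a * b
    b = np.sqrt(target_area / (np.pi * aspect))
    a = aspect * b
    # Clamp the semi-axes so the ellipse fits inside the frame.
    a, b = min(a, width / 2.0), min(b, height / 2.0)
    cx = rng.uniform(a, width - a)
    cy = rng.uniform(b, height - b)
    ys, xs = np.mgrid[0:height, 0:width]
    # A pixel is foreground when its centre lies inside the ellipse.
    inside = ((xs + 0.5 - cx) / a) ** 2 + ((ys + 0.5 - cy) / b) ** 2 <= 1.0
    return inside.astype(np.uint8)

def sample_masked_reference(video, area_ratio, rng):
    """video: (T, H, W, C). Pick a random frame and keep only the masked region."""
    t = int(rng.integers(0, video.shape[0]))
    mask = ellipse_mask(video.shape[1], video.shape[2], area_ratio, rng)
    return video[t] * mask[..., None], mask

rng = np.random.default_rng(0)
video = rng.random((8, 128, 128, 3))               # stand-in for a decoded clip
reference, mask = sample_masked_reference(video, area_ratio=0.3, rng=rng)
```

Solving for the semi-axes from the target area is what makes the foreground ratio controllable; other shape families (Fourier blobs, polygons) would need their own area normalization.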

Masked reference generation. Given a video, the mask generator produces diverse random masks (ellipse, Fourier blob, polygon, etc.), which are then applied to randomly sampled video frames with spatial augmentations (rotation, scaling, shear, translation, flip).
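The spatial augmentation can be sketched as one random affine transform applied identically to the frame and its mask, so that the two stay aligned with each other while losing their pixel-wise correspondence with the target video. This is a hedged example, not the authors' code: `random_affine`, `warp_nearest`, `augment_pair`, the parameter ranges, and the choice of nearest-neighbour sampling are assumptions.

```python
import numpy as np

def random_affine(rng, max_rot_deg=30.0, scale_range=(0.8, 1.2),
                  max_shear_deg=10.0, max_shift=0.1):
    """Sample a 2x2 linear part (rotation, scale, shear, flip) and a translation."""
    theta = np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg))
    shear = np.deg2rad(rng.uniform(-max_shear_deg, max_shear_deg))
    scale = rng.uniform(*scale_range)
    flip = -1.0 if rng.random() < 0.5 else 1.0
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    shr = np.array([[1.0, np.tan(shear)], [0.0, 1.0]])
    flp = np.array([[flip, 0.0], [0.0, 1.0]])
    linear = scale * (rot @ shr @ flp)
    shift = rng.uniform(-max_shift, max_shift, size=2)   # fraction of (width, height)
    return linear, shift

def warp_nearest(array, linear, shift):
    """Warp an (H, W) or (H, W, C) array about its centre; uncovered pixels become 0."""
    h, w = array.shape[:2]
    center = np.array([(w - 1) / 2.0, (h - 1) / 2.0])
    ys, xs = np.mgrid[0:h, 0:w]
    out_xy = np.stack([xs, ys], axis=-1).astype(np.float64)
    # Inverse mapping: for every output pixel, find the source pixel it came from.
    offset = shift * np.array([w, h])
    src = (out_xy - center - offset) @ np.linalg.inv(linear).T + center
    sx = np.rint(src[..., 0]).astype(int)
    sy = np.rint(src[..., 1]).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.zeros_like(array)
    out[valid] = array[sy[valid], sx[valid]]
    return out

def augment_pair(image, mask, rng):
    """Apply the same random affine transform to the image and its mask."""
    linear, shift = random_affine(rng)
    return warp_nearest(image, linear, shift), warp_nearest(mask, linear, shift)

rng = np.random.default_rng(0)
mask = np.zeros((64, 64), dtype=np.uint8)
mask[16:48, 20:44] = 1
image = rng.random((64, 64, 3)) * mask[..., None]
aug_image, aug_mask = augment_pair(image, mask, rng)
```

Nearest-neighbour sampling keeps the mask strictly binary after warping; an interpolating warp would require re-thresholding the mask.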

Model Design

We adopt a simple yet effective architecture that concatenates reference images along the temporal dimension at the end of target video frames in latent space. The model manages interaction between target video latents and reference latents through a tailored attention mechanism in each transformer block.

Input Format: The masked frames are encoded into latent space and concatenated with the noisy video latents along the temporal dimension. Reference latents remain noise-free to preserve accurate conditioning, while binary masks indicate reference regions in the attention mechanism.
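A minimal sketch of this input layout is given below, using NumPy arrays as stand-ins for latents. The function name `build_model_input`, the shapes `(T, C, H, W)` and `(K, C, H, W)`, and the linear-interpolation noising rule are assumptions for illustration; the text above does not specify the noise schedule.

```python
import numpy as np

def build_model_input(video_latents, ref_latents, ref_masks, sigma, rng):
    """Concatenate noisy video latents with clean reference latents along time.

    video_latents: (T, C, H, W), ref_latents: (K, C, H, W),
    ref_masks: (K, H, W) binary masks at latent resolution.
    """
    num_frames = video_latents.shape[0]
    noise = rng.standard_normal(video_latents.shape)
    # Assumed noising rule (linear interpolation); only the video part is noised.
    noisy_video = (1.0 - sigma) * video_latents + sigma * noise
    # Reference latents are appended after the video frames and stay noise-free.
    latents = np.concatenate([noisy_video, ref_latents], axis=0)
    # Validity map: every video position is valid, reference positions follow their masks.
    video_valid = np.ones((num_frames,) + ref_masks.shape[1:], dtype=bool)
    valid = np.concatenate([video_valid, ref_masks.astype(bool)], axis=0)
    return latents, valid

rng = np.random.default_rng(0)
video_latents = rng.standard_normal((4, 16, 8, 8))
ref_latents = rng.standard_normal((2, 16, 8, 8))
ref_masks = (rng.random((2, 8, 8)) > 0.5).astype(np.uint8)
latents, valid = build_model_input(video_latents, ref_latents, ref_masks, sigma=0.7, rng=rng)
```

Because the references occupy extra temporal positions, the same layout accepts any number of references `K` without changing the model input format.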

Attention Mechanism: In self-attention, video and reference parts interact with each other, where attention masks ensure that only valid reference regions are attended. Cross-attention then incorporates text guidance, allowing video tokens to follow the text prompt while reference tokens learn semantic alignment under textual constraints.
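To make the masking concrete, here is a single-head sketch in which reference tokens outside the mask are excluded as attention keys. It is one plausible reading of the description, not the exact mechanism used in Saber: the function name `masked_self_attention`, the key-only masking, and the token layout (video tokens first, then reference tokens) are assumptions.

```python
import numpy as np

def masked_self_attention(q, k, v, key_valid):
    """q, k, v: (N, D); key_valid: (N,) bool, False for masked-out reference tokens."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Invalid keys get -inf so they receive exactly zero weight after softmax.
    scores = np.where(key_valid[None, :], scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
num_video, num_ref, dim = 6, 4, 8
q = rng.standard_normal((num_video + num_ref, dim))
k = rng.standard_normal((num_video + num_ref, dim))
v = rng.standard_normal((num_video + num_ref, dim))
# Video tokens are always valid; two of the four reference tokens lie outside the mask.
key_valid = np.concatenate([np.ones(num_video, dtype=bool),
                            np.array([True, False, True, False])])
out, weights = masked_self_attention(q, k, v, key_valid)
```

Since video tokens are always valid keys, every query row has at least one finite score and the softmax stays well defined.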

Model design overview. Masked frames serve as reference images and are concatenated to the video tokens in latent space. Self-attention enables interaction between video and reference tokens under the attention mask, while cross-attention incorporates text guidance for semantic alignment.

Zero-Shot Inference

During inference, we use a pre-trained segmenter to extract foreground subject masks from reference images. The model handles both foreground subjects (with segmentation) and background scenes (without segmentation). Reference images are resized and padded to match the target video size while preserving aspect ratios, then fed into the model following our input format.
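The resize-and-pad step can be sketched as follows. This is an illustrative version under stated assumptions: the function name `resize_and_pad`, nearest-neighbour resizing, centred placement, and zero fill are choices made for the example and are not specified in the text above.

```python
import numpy as np

def resize_and_pad(image, target_h, target_w, fill=0):
    """Fit an (H, W, ...) image inside (target_h, target_w), preserving aspect ratio."""
    h, w = image.shape[:2]
    scale = min(target_h / h, target_w / w)        # largest scale that still fits
    new_h = max(1, int(round(h * scale)))
    new_w = max(1, int(round(w * scale)))
    # Nearest-neighbour resize by looking up source rows and columns.
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    resized = image[rows][:, cols]
    # Centre the resized image on a padded canvas (assumed placement).
    canvas = np.full((target_h, target_w) + image.shape[2:], fill, dtype=image.dtype)
    top = (target_h - new_h) // 2
    left = (target_w - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas, (top, left, new_h, new_w)

reference = np.ones((100, 200, 3), dtype=np.float32)    # wide stand-in reference image
padded, box = resize_and_pad(reference, target_h=64, target_w=64)
```

The same function can be applied to the segmentation mask so that the padded region is marked as invalid along with the removed background.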

Visualization Results

We conduct qualitative comparisons between Saber and other methods (Kling1.6, Phantom, VACE) across various visual scenarios, demonstrating superior subject preservation and video quality.

The video features a man with a rugged beard, wearing a leather jacket, riding a vintage motorcycle along a desert highway. His expression is focused, eyes narrowed slightly against the wind, as the setting sun casts a warm glow over the landscape.

[Videos: Kling1.6, Phantom, VACE, Saber (Ours)]

A woman with long, flowing black hair pulled back into a ponytail dances gracefully in a sunlit meadow. She is wearing a flowing dress that billows gently around her as she moves. The scene captures her mid-twirl, with the fabric of her dress swirling elegantly.

[Videos: Kling1.6, Phantom, VACE, Saber (Ours)]

This video shows a peaceful morning scene at a cozy farmhouse. The camera slowly pans across the exterior, revealing the rustic charm of the wooden structure. As the camera moves, a dog is seen lying near the entrance, its tail wagging gently. The soft morning light creates long shadows, adding to the tranquil atmosphere.

[Videos: Kling1.6, Phantom, VACE, Saber (Ours)]

The video shows a table setting with a floral cup, an ashtray and an open book, as a stream of tea is slowly poured into the cup.

[Videos: Kling1.6, Phantom, VACE, Saber (Ours)]

The video begins with a close-up of a ring resting delicately on a soft, red rose petal. The camera slowly zooms in to highlight the intricate details of the ring, its polished surface catching the light. A gentle breeze causes the rose petal to slightly flutter, moving the ring ever so slightly.

[Videos: Kling1.6, Phantom, VACE, Saber (Ours)]

The video captures two individuals engaged in cross-country skiing on a snowy landscape during what appears to be late afternoon or early evening, judging by the warm, golden light of the setting sun.

[Videos: Kling1.6, Phantom, VACE, Saber (Ours)]

The video features a woman standing in front of a large screen displaying the words "Tech Minute". She is wearing a purple top and appears to be presenting or speaking about technology-related topics.

[Videos: Kling1.6, Phantom, VACE, Saber (Ours)]

The video showcases a person engaged in work at a desk, with a focus on two computer monitors displaying various content. Initially, one monitor shows a social media interface with multiple posts and images, while the other monitor displays a grid of images, some of which are blurred.

[Videos: Kling1.6, Phantom, VACE, Saber (Ours)]

The video begins with a close-up of a beer can placed at the center of a circle made of international currencies, paper bills and coins from various countries. As the camera moves, the currencies subtly shimmer, reflecting light, while the beer can remains the focal point.

[Videos: Kling1.6, Phantom, VACE, Saber (Ours)]

A man is playing with an American football on the beach.

[Videos: Kling1.6, Phantom, VACE, Saber (Ours)]

The video features a young man who appears to be a content creator or streamer. He is wearing a green sleeveless top and red headphones. The background is illuminated with vibrant neon lights, predominantly in shades of purple and blue, creating a lively and energetic atmosphere.

[Videos: Kling1.6, Phantom, VACE, Saber (Ours)]

The video depicts an emotional scene set outdoors, likely in a park or wooded area, given the blurred greenery and trees in the background. A man and a woman are the central figures. The man has his arm around the woman's shoulder, offering comfort. The woman appears distressed, covering her face with her hands at one point. The man leans in closer to the woman, maintaining physical contact, which indicates he is trying to console her.

[Videos: Kling1.6, Phantom, VACE, Saber (Ours)]

The video begins with a close-up of a watch casually left on an old wooden table, its shiny surface reflecting the soft light in the room. A cup of coffee sits beside the watch, steam gently rising from it, creating a soft swirl in the air.

[Videos: Kling1.6, Phantom, VACE, Saber (Ours)]

The video depicts two men engaged in a discussion within an office setting. The man on the left, dressed in a white shirt and gray pants, is holding a tablet and appears to be explaining something to the other man. The man on the right, wearing a light blue shirt and jeans, listens attentively while occasionally gesturing with his hand.

[Videos: Kling1.6, Phantom, VACE, Saber (Ours)]

The video begins with a close-up of a vase sitting on a rustic wooden table, bathed in the soft, golden light of the morning sun. As the sunlight shifts, the light slowly moves across the vase, creating subtle reflections and shadows on the table. A gentle breeze flows through an open window, causing a few flowers in the vase to sway slightly.

[Videos: Kling1.6, Phantom, VACE, Saber (Ours)]

The video depicts an elderly couple sitting on a light gray sofa in a well-lit living room. Both individuals appear engaged with the laptop screen, occasionally gesturing towards it with their hands. Then the couple's expressions change from focused to surprised and then to joyful. The man raises his hand in a gesture of excitement, while the woman laughs heartily, her head tilted back slightly.

[Videos: Kling1.6, Phantom, VACE, Saber (Ours)]

The video features a young woman with long blonde hair standing in front of a lush, green bush adorned with white flowers. The woman is seen smiling and looking at the camera while gently touching the flowers on the bush. She then bends down slightly and smells one of the flowers, taking in its fragrance.

[Videos: Kling1.6, Phantom, VACE, Saber (Ours)]

The video features a woman dressed as a mermaid, swimming underwater. She is wearing a silver tail and a matching top, which is adorned with colorful patterns. The woman has long, wavy hair that flows freely underwater. The water around her is clear and blue, with sunlight filtering through the surface, creating a serene and magical atmosphere.

[Videos: Kling1.6, Phantom, VACE, Saber (Ours)]

The video features a man standing at an easel, focused intently as his brush dances across the canvas. His expression is one of deep concentration, with a hint of satisfaction as each brushstroke adds color and form. He wears a paint-splattered apron, and his hands move with confident precision. The setting, filled with scattered art supplies, open paint tubes, and unfinished sketches pinned to the wall, suggests an artist's studio.

[Videos: Kling1.6, Phantom, VACE, Saber (Ours)]

The video showcases an outdoor dining setup, set in a picturesque vineyard or garden. The scene is characterized by a round table covered with a light-colored tablecloth, adorned with a vibrant floral centerpiece. The chairs surrounding the table are made of light-colored wood.

[Videos: Kling1.6, Phantom, VACE, Saber (Ours)]

Emergent Abilities

Beyond the standard R2V task, Saber exhibits several capabilities that emerge from its training strategy rather than from task-specific supervision.

Single Subject Multiple Views

Saber can handle multiple reference images corresponding to different views of the same subject.

The video opens with a slow, cinematic rotation around a robot, its body gleaming with a fusion of transparent panels and brushed metal, circuits pulsing faintly beneath the surface like living veins of light. As the rotation completes, the scene flows into motion, following the robot as it strides along lush green country lanes.

Saber (Ours)

The video shows a charming Corgi dog, personified as a professional baker. The Corgi stands on a featureless gray background. The animation is a slow, steady rotation, starting from a side profile, turning to show the dog's back, and then completing the circle to face the viewer head-on.

Saber (Ours)

Cross-Modal Alignment

Saber demonstrates robust alignment between reference images and text prompts. By swapping subject descriptions in prompts (e.g., clothing color or subject positions), Saber accurately reflects the corresponding visual changes.

The video depicts two men in an office environment. The focus is on a man wearing a [Type 1: blue shirt] / [Type 2: black vest] seated at a desk, working intently on a laptop. Another man enters the scene, walking towards the seated individual. As he approaches, he leans over the desk, engaging in conversation with the seated man.

Type 1: Blue Shirt

Type 2: Black Vest

The video opens with a serene outdoor setting in a forest, featuring a green tent and two individuals seated on a log. The individuals are dressed casually, with [Type 1: left] / [Type 2: right] person wearing a white tank top and patterned shorts, and [Type 1: right] / [Type 2: left] person in a plaid shirt and dark pants. They appear to be engaged in a relaxed conversation.

Type 1: Left - White Tank Top

Type 2: Right - White Tank Top

Conclusion and Limitations

In this work, we present Saber, a scalable zero-shot framework for reference-to-video generation that eliminates the need for explicit R2V datasets. Trained solely on large-scale video-text pairs, Saber leverages a masked training strategy, a tailored attention mechanism, and mask augmentation to achieve identity-consistent, natural, and coherent video generation. It further scales to multiple references, supporting both multi-identity and multi-view inputs without additional data preparation or changes to the training pipeline. Extensive experiments on the OpenS2V-Eval benchmark demonstrate that Saber consistently outperforms methods trained on explicit R2V data. These results show that effective R2V models can be trained without dedicated datasets, paving the way for future research in scalable and generalizable reference-to-video generation.

While Saber achieves strong zero-shot performance and scalability, several limitations remain. First, R2V generation may collapse when the number of reference images increases significantly (e.g., 12), resulting in fragmented compositions where references are combined without coherent understanding. Second, Saber primarily focuses on identity preservation and visual coherence, while fine-grained motion control and temporal consistency under complex prompts remain challenging. Future work can explore more effective integration of numerous reference images into unified video generation, as well as adaptive guidance to further improve controllability and realism in reference-to-video generation.

Citation

If you find Saber useful for your research, please cite our paper:

@article{zhou2025scaling,
    title={Scaling Zero-Shot Reference-to-Video Generation},
    author={Zhou, Zijian and Liu, Shikun and Liu, Haozhe and Qiu, Haonan and An, Zhaochong and Ren, Weiming and Liu, Zhiheng and Huang, Xiaoke and Ng, Kam Woh and Xie, Tian and Han, Xiao and Cong, Yuren and Li, Hang and Zhu, Chuyan and Patel, Aditya and Xiang, Tao and He, Sen},
    journal={arXiv preprint},
    year={2025}
}