ComfyUI Wan 2.1 Image to Video Native LoRA Workflow

Download Workflow

Workflow Name

wan-2.1-i2v-native-lora.json

Workflow Description

A native ComfyUI workflow for Wan 2.1 image-to-video generation with LoRA support. It animates a single input image into a short video clip, guided by positive and negative text prompts.

Workflow Dependencies

Download Wan 2.1 Image-to-Video Model (save in: ComfyUI\models\diffusion_models):
https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/tree/main/split_files/diffusion_models

Download Wan 2.1 Text Encoder (save in: ComfyUI\models\text_encoders):
https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/tree/main/split_files/text_encoders

Download Wan 2.1 VAE (save in: ComfyUI\models\vae):
https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/tree/main/split_files/vae

Download Wan 2.1 Image Encoder (save in: ComfyUI\models\clip_vision):
https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/tree/main/split_files/clip_vision

Download a Wan 2.1-compatible LoRA (save in: ComfyUI\models\loras).

Install the “ComfyUI-VideoHelperSuite” custom node via ComfyUI Manager.
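
Once the models are in place, a quick way to sanity-check the setup is to queue the workflow programmatically. The sketch below assumes the workflow has been exported in API format (Save (API Format) in ComfyUI) and that the server is running at the default address 127.0.0.1:8188:

import json
import urllib.request

# Load the API-format workflow graph exported from ComfyUI.
with open("wan-2.1-i2v-native-lora.json") as f:
    workflow = json.load(f)

# POST the graph to the /prompt endpoint of a running ComfyUI server.
req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=json.dumps({"prompt": workflow}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))  # contains the prompt_id of the queued job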

Workflow Details

UNETLoader

Purpose:
This node loads the diffusion model (here, the Wan 2.1 image-to-video model), the core component responsible for the iterative denoising process during video generation.

Customizable Settings:
unet_name:
This setting allows the user to select the specific diffusion model file to load. Changing this selection determines the core model used for generation, impacting the style and characteristics of the output video.
weight_dtype: This setting selects the numeric precision used to load the model weights. The default option keeps the precision stored in the file, while the fp8 options reduce VRAM usage at a small cost in quality.
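
In an API-format export, this node is a single JSON entry. A minimal sketch in Python (the node id "1" and the exact filename are illustrative; use whichever variant you downloaded):

# UNETLoader entry in the API-format graph.
unet_loader = {
    "1": {
        "class_type": "UNETLoader",
        "inputs": {
            "unet_name": "wan2.1_i2v_480p_14B_fp8_e4m3fn.safetensors",
            "weight_dtype": "default",  # fp8 options trade quality for VRAM
        },
    },
}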

VAELoader

Purpose:
This node loads a VAE (Variational Autoencoder) model, which is used to encode and decode latent images.

Customizable Settings:
vae_name:
This setting allows the user to select the specific VAE model file to load. Changing this selection affects how the latent frames are encoded and decoded, influencing the final frame quality and details.
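
A sketch of the corresponding API-format entry, using the VAE filename published in the repackaged release (node id illustrative):

# VAELoader entry; Wan 2.1 ships a single dedicated VAE.
vae_loader = {
    "2": {
        "class_type": "VAELoader",
        "inputs": {"vae_name": "wan_2.1_vae.safetensors"},
    },
}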

CLIPVisionLoader

Purpose:
This node loads a CLIP Vision model, which is used to encode images into embeddings for image understanding.

Customizable Settings:
clip_name:
This setting allows the user to select the specific CLIP Vision model file to load. This selection impacts how images are interpreted by the model when used as conditioning.
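
As an API-format entry (node id illustrative, filename from the repackaged release):

# CLIPVisionLoader entry; clip_vision_h is the encoder published for Wan 2.1.
clip_vision_loader = {
    "3": {
        "class_type": "CLIPVisionLoader",
        "inputs": {"clip_name": "clip_vision_h.safetensors"},
    },
}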

CLIPLoader

Purpose:
This node loads the text encoder (for Wan 2.1, a UMT5-XXL model handled through ComfyUI’s CLIP loading interface), which converts text prompts into embeddings that the diffusion model can understand.

Customizable Settings:
clip_name: This setting allows the user to select the specific text encoder file to load. Changing this selection influences how text prompts are interpreted, affecting the generated video’s adherence to the prompt.
type: This setting selects which model family the text encoder serves; it must be set to wan for this workflow.
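
A sketch of the entry with type set to wan (filename from the repackaged release, node id illustrative):

# CLIPLoader entry; "type": "wan" selects the Wan text-encoder family.
clip_loader = {
    "4": {
        "class_type": "CLIPLoader",
        "inputs": {
            "clip_name": "umt5_xxl_fp8_e4m3fn_scaled.safetensors",
            "type": "wan",
        },
    },
}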

LoadImage

Purpose:
This node loads an image from a file into the workflow.

Customizable Settings:
image:
This setting allows the user to select the image file to load. The selected image will be used as the starting point for the image-to-video generation.
choose file to upload: This button uploads a new image into ComfyUI’s input directory so it can be selected. The node also outputs a mask derived from the image’s alpha channel, though the mask is not used in this workflow.
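
As an API-format entry (the filename is a placeholder for whatever start image you use):

# LoadImage entry; "image" names a file in ComfyUI's input/ directory.
load_image = {
    "5": {
        "class_type": "LoadImage",
        "inputs": {"image": "start_frame.png"},  # hypothetical filename
    },
}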

LoraLoaderModelOnly

Purpose:
This node loads a LoRA (Low-Rank Adaptation) model, which applies fine-tuning to the UNET model to modify the image generation process.

Customizable Settings:
lora_name:
This setting allows the user to select the specific LoRA file to load. Changing this selection alters the style and characteristics of the generated video based on the LoRA’s training.
strength_model: This setting controls the strength of the LoRA’s effect on the UNET model. Increasing this value amplifies the LoRA’s influence, while decreasing it reduces it.
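
In API format, node inputs that come from other nodes are [node_id, output_index] pairs. A sketch wiring the LoRA onto the UNETLoader from the earlier snippet (the LoRA filename is hypothetical):

# LoraLoaderModelOnly entry; "model" links to output 0 of node "1" (UNETLoader).
lora_loader = {
    "6": {
        "class_type": "LoraLoaderModelOnly",
        "inputs": {
            "model": ["1", 0],
            "lora_name": "my_wan_style_lora.safetensors",  # hypothetical
            "strength_model": 1.0,  # 0.0 disables the LoRA entirely
        },
    },
}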

CLIPTextEncode (Positive Prompt)

Purpose:
This node encodes the positive text prompt into embeddings, guiding the video generation towards the desired content.

Customizable Settings:
text:
This setting allows the user to enter the positive prompt describing the desired content of the generated video. Modifying this text directly influences the video’s content.
clip: This input accepts the text encoder from the CLIPLoader node, ensuring both prompts are encoded with the same model (a combined sketch of both prompt nodes follows the negative prompt section below).

CLIPTextEncode (Negative Prompt)

Purpose:
This node encodes the negative text prompt into embeddings, steering the video generation away from unwanted content.

Customizable Settings:
text:
This setting allows the user to enter the negative prompt specifying elements to avoid in the generated image. Modifying this text helps refine the output.
clip: This input accepts the text encoder from the CLIPLoader node, ensuring both prompts are encoded with the same model.
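
A combined sketch of both prompt encoders (example prompt text; both draw their clip input from node "4", the CLIPLoader):

# Positive and negative CLIPTextEncode entries sharing one text encoder.
prompt_nodes = {
    "7": {"class_type": "CLIPTextEncode",
          "inputs": {"clip": ["4", 0],
                     "text": "a cat slowly turning its head, soft window light"}},
    "8": {"class_type": "CLIPTextEncode",
          "inputs": {"clip": ["4", 0],
                     "text": "blurry, distorted, static, watermark"}},
}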

CLIPVisionEncode

Purpose:
This node encodes an image using the CLIP Vision model, producing embeddings used for image-to-video generation.

Customizable Settings:
clip_vision:
This input accepts the CLIP Vision model from the CLIPVisionLoader node.
image: This input accepts the image from the LoadImage node.
crop: This setting controls whether the image is center-cropped to a square before encoding (center) or resized without cropping (none).
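
A sketch of the entry (this assumes a ComfyUI build that exposes the crop option; older builds omit it):

# CLIPVisionEncode entry; vision model from node "3", image from node "5".
clip_vision_encode = {
    "9": {"class_type": "CLIPVisionEncode",
          "inputs": {"clip_vision": ["3", 0],
                     "image": ["5", 0],
                     "crop": "none"}},
}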

WanImageToVideo

Purpose:
This node prepares Wan image-to-video generation: it combines the input image, text conditioning, and CLIP Vision embeddings, and outputs adjusted positive and negative conditioning along with an initial latent video for the sampler.

Customizable Settings:
width:
This setting determines the width of the generated video frames in pixels.
height: This setting determines the height of the generated video frames in pixels.
length: This setting determines the number of frames in the generated video.
batch_size: This setting determines how many latent videos are generated in a single run.
positive: This input takes the positive conditioning from the positive CLIPTextEncode node.
negative: This input takes the negative conditioning from the negative CLIPTextEncode node.
vae: This input takes the VAE model from the VAELoader node.
clip_vision_output: This input takes the image embeddings from the CLIPVisionEncode node.
start_image: This input takes the image from the LoadImage node.
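
A sketch of the full entry at a 480p-class resolution (33 frames is roughly two seconds at Wan 2.1’s native 16 fps):

# WanImageToVideo entry; outputs are positive (0), negative (1), latent (2).
wan_i2v = {
    "10": {"class_type": "WanImageToVideo",
           "inputs": {"width": 832, "height": 480,
                      "length": 33, "batch_size": 1,
                      "positive": ["7", 0], "negative": ["8", 0],
                      "vae": ["2", 0],
                      "clip_vision_output": ["9", 0],
                      "start_image": ["5", 0]}},
}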

KSampler

Purpose:
This node performs the iterative denoising process to generate the latent video frames.

Customizable Settings:
seed:
This setting initializes the random number generator, influencing the specific details of the generated video. Using the same seed produces the same video given identical settings.
steps: This setting determines the number of denoising steps applied to the latent video. Higher step counts usually result in more detailed and refined videos but require more processing time.
cfg: This setting controls the classifier-free guidance scale, balancing prompt adherence and creative freedom.
sampler_name: This setting selects the sampling algorithm used for denoising.
scheduler: This setting chooses the scheduler used to control the noise schedule during sampling.
denoise: This setting controls how much of the noise schedule is applied. A value of 1.0 denoises from pure noise, which is appropriate here because the latent video starts empty; lower values preserve more of the input latent.
model: This input takes the UNET model from the LoraLoaderModelOnly node.
positive: This input takes the positive conditioning from the WanImageToVideo node.
negative: This input takes the negative conditioning from the WanImageToVideo node.
latent_image: This input takes the latent video from the WanImageToVideo node.
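
A sketch with settings commonly used for Wan 2.1 (uni_pc sampler, simple scheduler); treat the numbers as starting points rather than fixed values:

# KSampler entry; conditioning and latent all flow in from WanImageToVideo.
ksampler = {
    "11": {"class_type": "KSampler",
           "inputs": {"model": ["6", 0],  # LoRA-patched model
                      "positive": ["10", 0],
                      "negative": ["10", 1],
                      "latent_image": ["10", 2],
                      "seed": 42, "steps": 20, "cfg": 6.0,
                      "sampler_name": "uni_pc", "scheduler": "simple",
                      "denoise": 1.0}},  # full denoise: latent starts as noise
}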

VAEDecode

Purpose:
This node decodes the latent video frames back into pixel-based images.

Customizable Settings:
samples:
This input takes the latent video frames from the KSampler node.
vae: This input takes the VAE model from the VAELoader node.
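
As an API-format entry:

# VAEDecode entry; turns the sampled latent frames into images.
vae_decode = {
    "12": {"class_type": "VAEDecode",
           "inputs": {"samples": ["11", 0], "vae": ["2", 0]}},
}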

VHS_VideoCombine

Purpose:
This node combines the decoded image frames into a video file.

Customizable Settings:
frame_rate:
This setting determines the frame rate of the output video.
loop_count: This setting controls how many times the video loops.
filename_prefix: This setting allows the user to specify a prefix for the saved video file name.
format: This setting selects the video file format.
pix_fmt: This setting selects the pixel format of the video.
crf: This setting controls the Constant Rate Factor, affecting video quality and file size.
save_metadata: This setting determines whether metadata is saved with the video.
trim_to_audio: This setting trims the video to the length of the audio.
pingpong: This setting plays the video forwards and then backwards.
save_output: This setting enables or disables saving the video output.
videopreview: This setting controls the in-node preview of the rendered video.
images: This input takes the decoded image frames from the VAEDecode node.
audio: This optional input accepts an audio track to combine with the video.
meta_batch: This optional input accepts meta batch information for processing long videos in chunks.
vae: This optional input accepts a VAE model.
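
A sketch of the entry with an H.264 MP4 preset; VideoHelperSuite widget names can vary between versions, so treat these keys as illustrative. The 16 fps frame rate matches Wan 2.1’s native output:

# VHS_VideoCombine entry writing an H.264 MP4 from the decoded frames.
video_combine = {
    "13": {"class_type": "VHS_VideoCombine",
           "inputs": {"images": ["12", 0],
                      "frame_rate": 16, "loop_count": 0,
                      "filename_prefix": "wan_i2v",
                      "format": "video/h264-mp4", "pix_fmt": "yuv420p",
                      "crf": 19, "save_metadata": True,
                      "pingpong": False, "save_output": True}},
}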
