ComfyUI Wan 2.1 Image to Video GGUF LoRA Workflow

Workflow Name
wan-2.1-i2v-gguf-lora.json
Workflow Description
Generates a video from a single input image using the Wan 2.1 image-to-video model in quantized GGUF form, with a LoRA applied to steer style or subject.
Workflow Dependencies
Download Wan 2.1 GGUF 480p Image-to-Video Model (save in: ComfyUI\models\diffusion_models):
https://huggingface.co/city96/Wan2.1-I2V-14B-480P-gguf/tree/main
Download Wan 2.1 GGUF 720p Image-to-Video Model (save in: ComfyUI\models\diffusion_models):
https://huggingface.co/city96/Wan2.1-I2V-14B-720P-gguf/tree/main
Download Wan 2.1 Text Encoder (save in: ComfyUI\models\text_encoders):
https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/tree/main/split_files/text_encoders
Download Wan 2.1 VAE (save in: ComfyUI\models\vae):
https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/tree/main/split_files/vae
Download Wan 2.1 CLIP Vision Image Encoder (save in: ComfyUI\models\clip_vision):
https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/tree/main/split_files/clip_vision
Download LoRA (save in: ComfyUI\models\loras).
Install “ComfyUI-VideoHelperSuite” custom node via ComfyUI Manager.
Install “ComfyUI-GGUF” custom node via ComfyUI Manager.
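The model downloads can also be scripted. Below is a minimal sketch using the huggingface_hub package; the filename shown is only one of the available quantizations, so browse the repository and pick the one that fits your VRAM.

```python
from huggingface_hub import hf_hub_download

# Fetch one quantization of the 480p model straight into the ComfyUI
# models tree. The filename is an example, not the only option.
hf_hub_download(
    repo_id="city96/Wan2.1-I2V-14B-480P-gguf",
    filename="wan2.1-i2v-14b-480p-Q4_K_M.gguf",
    local_dir=r"ComfyUI\models\diffusion_models",
)
```

The same call works for the text encoder, VAE, and CLIP Vision repositories by swapping repo_id, filename, and local_dir.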
Workflow Details
UnetLoaderGGUF
Purpose:
Loads the Wan 2.1 diffusion model in GGUF format. GGUF quantization trades a small amount of fidelity for a large reduction in VRAM use; this model performs the denoising that drives video generation.
Customizable Settings:
Model File (dropdown): Allows the user to select the specific GGUF model file. This choice sets both the resolution variant (480p or 720p) and the quantization level, which together determine the video’s quality and the VRAM required.
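To make the wiring concrete, the node descriptions below are accompanied by fragments of a hypothetical API-format prompt (the JSON graph ComfyUI accepts over its HTTP API), written as Python dictionaries. Node ids, filenames, and example values are illustrative, not contents of the shipped workflow file. The GGUF loader fragment:

```python
# Start of a hypothetical API-format graph; later sketches extend this dict.
workflow = {
    "1": {
        "class_type": "UnetLoaderGGUF",
        "inputs": {
            # Example filename: pick the quantization that fits your VRAM
            # (Q4_K_M is smaller and faster, Q8_0 is closer to full precision).
            "unet_name": "wan2.1-i2v-14b-480p-Q4_K_M.gguf",
        },
    },
}
```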
VAELoader
Purpose:
Loads a Variational Autoencoder (VAE) model. The VAE is used to encode images into latent space and decode latent space back into images.
Customizable Settings:
VAE Name (dropdown): Allows the user to select the specific VAE model file. The selected VAE can affect the color, detail, and overall quality of the generated video frames.
CLIPVisionLoader
Purpose:
Loads a CLIP Vision model, which is used to encode images into CLIP vision embeddings. These embeddings help the model understand the visual content of the input image.
Customizable Settings:
CLIP Vision Name (dropdown): Allows the user to select the specific CLIP Vision model file. This choice determines how the model interprets the visual input.
CLIPLoader
Purpose:
Loads a CLIP model, which is used to encode text prompts into CLIP embeddings. These embeddings guide the video generation process.
Customizable Settings:
CLIP Name (dropdown): Allows the user to select the specific CLIP model file. This choice affects how text prompts are interpreted by the model.
Type (dropdown): Selects the model family the text encoder serves; for Wan 2.1 workflows this should be set to wan.
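Continuing the sketch, the three supporting loaders look like this. The filenames follow the Comfy-Org repackaged releases linked above; double-check them against what you actually downloaded.

```python
workflow.update({
    "2": {"class_type": "VAELoader",
          "inputs": {"vae_name": "wan_2.1_vae.safetensors"}},
    "3": {"class_type": "CLIPVisionLoader",
          "inputs": {"clip_name": "clip_vision_h.safetensors"}},
    "4": {"class_type": "CLIPLoader",
          "inputs": {"clip_name": "umt5_xxl_fp8_e4m3fn_scaled.safetensors",
                     "type": "wan"}},  # treat it as Wan's UMT5 text encoder
})
```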
LoadImage
Purpose:
Loads an image file into the workflow, serving as the starting point for video generation.
Customizable Settings:
Image (dropdown): Allows the user to select the image file to load. Changing this alters the initial frame of the generated video.
Upload (button): Lets the user upload a new image file into ComfyUI’s input folder.
Mask: The node also outputs a mask, taken from the image’s alpha channel or painted via the right-click MaskEditor.
LoraLoaderModelOnly
Purpose:
Loads a LoRA (Low-Rank Adaptation) model and applies it to the UNet model. LoRAs fine-tune the model for specific styles or subjects.
Customizable Settings:
LoRA Name (dropdown): Allows the user to select the LoRA model file. This choice influences the style and details of the generated video.
Strength Model (float): Controls how strongly the LoRA modifies the UNet; 0.0 disables it and 1.0 applies it at full strength.
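In the sketch, the LoRA loader is chained onto the GGUF UNet; the LoRA filename here is hypothetical.

```python
workflow["5"] = {
    "class_type": "LoraLoaderModelOnly",
    "inputs": {
        "lora_name": "my_style_lora.safetensors",  # hypothetical file in models\loras
        "strength_model": 1.0,                     # 0.0 disables the LoRA entirely
        "model": ["1", 0],                         # patches the GGUF UNet loaded above
    },
}
```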
CLIPTextEncode (Positive Prompt)
Purpose:
Encodes the positive text prompt into CLIP embeddings, guiding the video generation towards the desired content.
Customizable Settings:
text (text box): Allows the user to enter the positive prompt. Changing this alters the content and style of the generated video.
clip (input connection): Receives the CLIP model from the CLIPLoader node.
CLIPTextEncode (Negative Prompt)
Purpose:
Encodes the negative text prompt into CLIP embeddings, guiding the video generation away from unwanted content.
Customizable Settings:
text (text box): Allows the user to enter the negative prompt. Changing this refines the generated video by removing unwanted elements.
clip (input connection): Receives the CLIP model from the CLIPLoader node.
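Both prompt nodes share the CLIP model loaded earlier; only the text differs. A sketch with example prompts:

```python
workflow.update({
    "6": {"class_type": "CLIPTextEncode",   # positive prompt
          "inputs": {"text": "a cat slowly turning its head, soft cinematic lighting",
                     "clip": ["4", 0]}},
    "7": {"class_type": "CLIPTextEncode",   # negative prompt
          "inputs": {"text": "blurry, distorted, low quality, static frame",
                     "clip": ["4", 0]}},
})
```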
CLIPVisionEncode
Purpose:
Encodes the loaded input image into CLIP vision embeddings, providing visual context for the video generation.
Customizable Settings:
clip_vision (input connection): Receives the CLIP Vision model from the CLIPVisionLoader node.
image (input connection): Receives the image from the LoadImage node.
crop (dropdown): Selects how the image is cropped and resized to the encoder’s expected resolution before encoding (e.g., center or none).
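The image path of the sketch: LoadImage feeds the CLIP Vision encoder here and, later, WanImageToVideo as the start frame. The image filename is a hypothetical file in ComfyUI’s input folder.

```python
workflow.update({
    "8": {"class_type": "LoadImage",
          "inputs": {"image": "start_frame.png"}},  # hypothetical file in ComfyUI\input
    "9": {"class_type": "CLIPVisionEncode",
          "inputs": {"clip_vision": ["3", 0],
                     "image": ["8", 0],
                     "crop": "center"}},            # crop mode before encoding
})
```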
WanImageToVideo
Purpose:
Prepares the image-to-video generation: it combines the input image, positive and negative conditioning, VAE, and CLIP vision embeddings into the latent frames and conditioning that the KSampler then denoises. This is the core image-to-video node.
Customizable Settings:
Width (integer): Sets the width of the generated video frames in pixels.
Height (integer): Sets the height of the generated video frames in pixels.
Length (integer): Sets the number of frames to generate; Wan 2.1 expects a multiple of 4 plus 1 (e.g., 33, 49, 81).
Batch Size (integer): Sets how many videos are generated in one run.
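In the sketch, WanImageToVideo ties the conditioning, VAE, vision embeddings, and start image together; 832x480 and 81 frames are typical 480p example values, not shipped defaults.

```python
workflow["10"] = {
    "class_type": "WanImageToVideo",
    "inputs": {
        "width": 832, "height": 480,
        "length": 81,                    # 4*20 + 1 frames, about 5 s at 16 fps
        "batch_size": 1,
        "positive": ["6", 0],
        "negative": ["7", 0],
        "vae": ["2", 0],
        "clip_vision_output": ["9", 0],  # visual context from CLIPVisionEncode
        "start_image": ["8", 0],         # first frame of the video
    },
}
```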
KSampler
Purpose:
Performs the iterative denoising process in latent space, guided by the encoded prompts, to generate the final latent frames.
Customizable Settings:
seed (integer): Sets the random seed for the denoising process.
sampler_name (dropdown): Selects the sampling algorithm used for denoising.
scheduler (dropdown): Chooses the scheduler for the denoising process.
steps (integer): Determines the number of denoising steps.
cfg (float): Controls the strength of the prompt guidance.
denoise (float): Controls the amount of denoising applied.
model (input connection): Receives the UNet model from the LoraLoaderModelOnly node.
positive (input connection): Receives the positive conditioning, routed through the WanImageToVideo node.
negative (input connection): Receives the negative conditioning, routed through the WanImageToVideo node.
latent_image (input connection): Receives the latent frames from the WanImageToVideo node.
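The KSampler fragment below uses example values in the range commonly suggested for Wan 2.1 (around 20 steps, cfg near 6, the uni_pc sampler); treat them as starting points rather than this workflow’s shipped settings.

```python
workflow["11"] = {
    "class_type": "KSampler",
    "inputs": {
        "model": ["5", 0],          # the LoRA-patched GGUF UNet
        "seed": 42,
        "steps": 20,
        "cfg": 6.0,
        "sampler_name": "uni_pc",
        "scheduler": "simple",
        "denoise": 1.0,             # full denoise: generate from scratch
        "positive": ["10", 0],      # conditioning routed through WanImageToVideo
        "negative": ["10", 1],
        "latent_image": ["10", 2],  # the empty latent frame batch
    },
}
```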
VAEDecode
Purpose:
Decodes the latent frames generated by the KSampler into pixel-based image frames.
Customizable Settings:
samples (input connection): Receives the latent frames from the KSampler node.
vae (input connection): Receives the VAE model from the VAELoader node.
VHS_VideoCombine
Purpose:
Combines the generated image frames into a video file.
Customizable Settings:
frame_rate (integer): Sets the frame rate of the output video.
loop_count (integer): Sets the number of times the video loops.
filename_prefix (text box): Sets the prefix for the output video file name.
format (dropdown): Selects the video file format.
pix_fmt (dropdown): Selects the pixel format of the video.
crf (integer): Controls video compression quality; lower values yield higher quality and larger files.
save_metadata (boolean): Determines if metadata is saved with the video.
trim_to_audio (boolean): Trims the video to the length of the audio.
pingpong (boolean): Plays the video forward and then backward.
save_output (boolean): Determines if the video is saved to a file.
videopreview (widget): Provides an in-node preview of the combined video.
images (input connection): Receives the image frames from the VAEDecode node.
audio (input connection): Receives audio input, if any.
meta_batch (input connection): Receives batch metadata.
vae (input connection): Optionally receives a VAE so the node can decode latents passed to it directly.
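To close out the sketch, the latents are decoded and combined into an mp4, and the finished graph is queued on a locally running ComfyUI instance over its HTTP API. The VHS widget names shown follow the node’s h264-mp4 settings, and the server address is the ComfyUI default; both are assumptions to adjust for your setup.

```python
import json
import urllib.request

# Decode the latent frames and hand them to VHS_VideoCombine.
workflow["12"] = {
    "class_type": "VAEDecode",
    "inputs": {"samples": ["11", 0], "vae": ["2", 0]},
}
workflow["13"] = {
    "class_type": "VHS_VideoCombine",
    "inputs": {
        "images": ["12", 0],
        "frame_rate": 16,              # Wan 2.1 is trained around 16 fps
        "loop_count": 0,
        "filename_prefix": "wan_i2v",
        "format": "video/h264-mp4",
        "pix_fmt": "yuv420p",
        "crf": 19,                     # lower = higher quality, bigger file
        "save_metadata": True,
        "pingpong": False,
        "save_output": True,
    },
}

# Queue the graph on a local ComfyUI server (default address shown).
req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=json.dumps({"prompt": workflow}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))     # response includes the prompt_id
```

Alternatively, enable dev mode in ComfyUI’s settings and use Save (API Format) to export the real graph, then substitute that JSON for the hand-built dict above.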