ComfyUI Wan 2.1 Image to Video GGUF Workflow

Workflow Name

wan-2.1-i2v-gguf

Workflow Description

A basic Wan 2.1 image-to-video workflow that uses GGUF-quantized models, which lowers VRAM requirements compared to the full-precision checkpoints.

Workflow Dependencies

Download a Wan 2.1 GGUF 480p Image-to-Video model; you only need one quantization level (save in: ComfyUI\models\diffusion_models):
https://huggingface.co/city96/Wan2.1-I2V-14B-480P-gguf/tree/main

Or download a Wan 2.1 GGUF 720p Image-to-Video model for higher-resolution output (save in: ComfyUI\models\diffusion_models):
https://huggingface.co/city96/Wan2.1-I2V-14B-720P-gguf/tree/main

Download Wan 2.1 Text Encoders (save in: ComfyUI\models\text_encoders):
https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/tree/main/split_files/text_encoders

Download Wan 2.1 VAE (save in: ComfyUI\models\vae):
https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/tree/main/split_files/vae

Download Wan 2.1 CLIP Vision Image Encoder (save in: ComfyUI\models\clip_vision):
https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/tree/main/split_files/clip_vision

Install “ComfyUI-VideoHelperSuite” custom node via ComfyUI Manager.

Install “ComfyUI-GGUF” custom node via ComfyUI Manager.
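
If you prefer to script the downloads, here is a minimal sketch using the huggingface_hub package, run from the folder that contains ComfyUI. The GGUF filename is illustrative; check the repository pages for the exact filenames and pick the quantization that fits your VRAM.

# pip install huggingface_hub
import shutil
from huggingface_hub import hf_hub_download

# GGUF diffusion model (filename is an example; see the repo for options).
path = hf_hub_download(
    repo_id="city96/Wan2.1-I2V-14B-480P-gguf",
    filename="wan2.1-i2v-14b-480p-Q4_K_M.gguf",  # illustrative
)
shutil.copy(path, "ComfyUI/models/diffusion_models/")

# Support models from the Comfy-Org repackaged repository.
for repo_file, target_dir in [
    ("split_files/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors",
     "ComfyUI/models/text_encoders/"),
    ("split_files/vae/wan_2.1_vae.safetensors",
     "ComfyUI/models/vae/"),
    ("split_files/clip_vision/clip_vision_h.safetensors",
     "ComfyUI/models/clip_vision/"),
]:
    path = hf_hub_download(repo_id="Comfy-Org/Wan_2.1_ComfyUI_repackaged",
                           filename=repo_file)
    shutil.copy(path, target_dir)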

Workflow Details

CLIPLoader

Purpose:

This node loads a text encoder through ComfyUI's generic CLIP (Contrastive Language–Image Pre-training) interface. For Wan 2.1 the file loaded here is the UMT5-XXL text encoder downloaded above; it converts text prompts into embeddings that guide image and video generation.

Customizable Settings:

clip_name: This setting allows the user to select the specific text encoder file. Changing this setting affects how the model interprets text prompts and how closely the generated video follows them.
embedder_name: This setting allows the user to select the specific embedder for the loaded CLIP model.
tokenizer_name: This setting allows the user to select the specific tokenizer for the loaded CLIP model.
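
To make these settings concrete, the sections below each include a sketch of the node as it appears in a workflow exported with ComfyUI's "Save (API Format)" option, written as a Python dict. Widget names can differ between ComfyUI versions; current builds expose this loader as clip_name plus a type selector rather than separate embedder and tokenizer fields. The node ids ("1", "2", ...) are arbitrary labels used to wire the sketches together.

clip_loader = {  # node "1"
    "class_type": "CLIPLoader",
    "inputs": {
        "clip_name": "umt5_xxl_fp8_e4m3fn_scaled.safetensors",
        "type": "wan",  # treat the file as Wan's UMT5 text encoder
    },
}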

VAELoader

Purpose:

This node loads a VAE (Variational Autoencoder) model. The VAE is responsible for encoding images into latent space and decoding latent representations back into images.

Customizable Settings:

vae_name: This setting allows the user to select the specific VAE model file to load. Changing this setting can affect the image and video’s overall color, contrast, and detail because different VAEs are trained with different properties.
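
In the same exported form, this loader reduces to a single widget; the filename is the Wan VAE from the repackaged repository:

vae_loader = {  # node "2"
    "class_type": "VAELoader",
    "inputs": {"vae_name": "wan_2.1_vae.safetensors"},
}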

CLIPVisionLoader

Purpose:

This node loads a CLIP Vision model, which is specifically designed to encode image features into a format that can be used in conjunction with text embeddings for tasks like image-to-video generation.

Customizable Settings:

clip_vision_name: This setting allows the user to select the specific CLIP Vision model file to load. Changing this setting will alter the image feature extraction and affect the image to video generation.
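
A matching sketch for the vision encoder loader, using the CLIP Vision file from the repackaged repository:

clip_vision_loader = {  # node "3"
    "class_type": "CLIPVisionLoader",
    "inputs": {"clip_vision_name": "clip_vision_h.safetensors"},
}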

UnetLoaderGGUF

Purpose:

This node loads the diffusion model from a GGUF file. Wan 2.1 is a diffusion transformer rather than a UNet, but ComfyUI loads it through the same slot; this model performs the iterative denoising in latent space that is the core of video generation.

Customizable Settings:

unet_name: This setting allows the user to select the specific GGUF model file to load. In this workflow it chooses between the 480p and 720p Wan variants and between quantization levels, which trade output quality against VRAM usage.
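
The GGUF loader comes from the ComfyUI-GGUF node pack; the filename below is illustrative, with Q4, Q5, Q8, and other quantizations available on the repository page:

unet_loader = {  # node "4"
    "class_type": "UnetLoaderGGUF",
    "inputs": {"unet_name": "wan2.1-i2v-14b-480p-Q4_K_M.gguf"},  # illustrative
}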

LoadImage

Purpose:

This node loads an image from a file, which will be used as the starting point for the image-to-video generation.

Customizable Settings:

image: This setting allows the user to select the image file to load. Changing this setting will determine the initial frame of the generated video.
image_type: This setting allows the user to select the type of image being loaded.
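
In the exported JSON this node stores only a filename, resolved against ComfyUI's input directory; the name below is a placeholder:

load_image = {  # node "5"
    "class_type": "LoadImage",
    "inputs": {"image": "start_frame.png"},  # placeholder file in ComfyUI/input
}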

CLIPTextEncode (Positive Prompt)

Purpose:

This node takes text input (the positive prompt) and encodes it into a format that the diffusion model can understand. This encoded representation guides the video generation process.

Customizable Settings:

text: This setting allows the user to input the positive prompt, which describes the desired content of the video. Modifying this text will directly change the subject, style, and details of the generated video.
clip: This setting takes the CLIP model output from the CLIPLoader node. Changing this setting would require loading a different CLIP model.
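
Link inputs are expressed as [node_id, output_slot] pairs in the exported form. A sketch with an example prompt, wired to the CLIPLoader above:

positive_prompt = {  # node "6"
    "class_type": "CLIPTextEncode",
    "inputs": {
        "text": "a cat slowly turns its head toward the camera",  # example
        "clip": ["1", 0],  # output slot 0 of the CLIPLoader
    },
}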

CLIPTextEncode (Negative Prompt)

Purpose:

This node takes text input (the negative prompt) and encodes it into a format that the diffusion model can understand. This encoded representation specifies what elements to avoid in the generated video.

Customizable Settings:

text: This setting allows the user to input the negative prompt, which describes the elements to exclude from the generated video. Modifying this text will prevent unwanted artifacts or styles from appearing in the output.
clip: This setting takes the CLIP model output from the CLIPLoader node. Changing this setting would require loading a different CLIP model.
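
The negative prompt node is structurally identical to the positive one; only the text differs. The prompt below is a generic example, not a Wan-specific recommendation:

negative_prompt = {  # node "7"
    "class_type": "CLIPTextEncode",
    "inputs": {
        "text": "blurry, distorted, low quality, static frame",  # example
        "clip": ["1", 0],
    },
}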

CLIPVisionEncode

Purpose:

This node encodes the loaded image using the CLIP Vision model, creating a visual embedding that can be used to guide the video generation.

Customizable Settings:

clip_vision: This setting takes the CLIP Vision model from the CLIPVisionLoader node.
image: This setting takes the image from the LoadImage node.
output_type: This setting allows the user to select the output type.
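
Wired up, the encoder takes the model and the image as links (some ComfyUI versions expose an extra crop/output widget here):

clip_vision_encode = {  # node "8"
    "class_type": "CLIPVisionEncode",
    "inputs": {
        "clip_vision": ["3", 0],  # CLIPVisionLoader
        "image": ["5", 0],        # LoadImage
    },
}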

WanImageToVideo

Purpose:

This node takes an input image, along with text prompts and model information, and generates a latent representation of a video.

Customizable Settings:

width: This setting determines the width of each frame in the generated video in pixels. Increasing or decreasing this value will change the horizontal resolution of the output video.
height: This setting determines the height of each frame in the generated video in pixels. Increasing or decreasing this value will change the vertical resolution of the output video.
frame_count: This setting determines the number of frames in the generated video. Increasing this value produces a longer video (see the duration calculation after this list) at the cost of memory and generation time.
batch_size: This setting determines the number of videos generated in a single run. Increasing it produces multiple videos from the same prompt at the cost of additional memory.
positive: This setting takes the encoded positive prompt from the CLIPTextEncode (Positive Prompt) node.
negative: This setting takes the encoded negative prompt from the CLIPTextEncode (Negative Prompt) node.
vae: This setting takes the VAE model output from the VAELoader node.
clip_vision_output: This setting takes the encoded image from the CLIPVisionEncode node.
start_image: This setting takes the loaded image from the LoadImage node.
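
Wan 2.1 generates video at 16 frames per second, so the common default of 81 frames works out to roughly 81 / 16, or about 5 seconds. A sketch of the node with the connections described above; note that the frame-count widget is named length in current ComfyUI builds, and 832x480 is a typical resolution for the 480p model:

wan_i2v = {  # node "9"
    "class_type": "WanImageToVideo",
    "inputs": {
        "width": 832,
        "height": 480,
        "length": 81,                    # 81 frames at 16 fps is about 5 s
        "batch_size": 1,
        "positive": ["6", 0],            # CLIPTextEncode (positive)
        "negative": ["7", 0],            # CLIPTextEncode (negative)
        "vae": ["2", 0],                 # VAELoader
        "clip_vision_output": ["8", 0],  # CLIPVisionEncode
        "start_image": ["5", 0],         # LoadImage
    },
}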

KSampler

Purpose:

This node performs the iterative denoising process, which transforms the latent video into a realistic video based on the prompts and image information.

Customizable Settings:

seed: This setting initializes the random number generator, which controls the noise pattern used in the denoising process. Changing the seed will produce different variations of the same prompt.
steps: This setting determines the number of denoising steps. Increasing this value will generally improve video quality but increase processing time.
cfg: This setting controls the classifier-free guidance scale, which influences how closely the generated video adheres to the positive prompt. Higher values result in stronger adherence but may reduce diversity.
sampler_name: This setting selects the sampling algorithm used for denoising. Different samplers have varying speed and quality characteristics.
scheduler: This setting selects the scheduler used to control the noise schedule during denoising. Different schedulers affect the denoising trajectory and final video quality.
denoise: This setting controls how much of the denoising schedule is applied. A value of 1.0 denoises from pure noise, which is standard for this workflow; lower values preserve more of the input latent.
model: This setting takes the UNet model output from the UnetLoaderGGUF node. Changing this setting would require loading a different UNet model.
positive: This setting takes the encoded positive prompt from the WanImageToVideo node.
negative: This setting takes the encoded negative prompt from the WanImageToVideo node.
latent_image: This setting takes the latent video from the WanImageToVideo node.
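
Typical starting values for Wan 2.1, taken as a baseline to tune rather than fixed requirements: around 20 steps, cfg near 6, the uni_pc sampler with the simple scheduler, and denoise at 1.0. As a sketch, with the conditioning and latent wired from the WanImageToVideo node's three outputs:

ksampler = {  # node "10"
    "class_type": "KSampler",
    "inputs": {
        "seed": 42,
        "steps": 20,
        "cfg": 6.0,
        "sampler_name": "uni_pc",
        "scheduler": "simple",
        "denoise": 1.0,
        "model": ["4", 0],         # UnetLoaderGGUF
        "positive": ["9", 0],      # WanImageToVideo output 0
        "negative": ["9", 1],      # WanImageToVideo output 1
        "latent_image": ["9", 2],  # WanImageToVideo output 2 (latent video)
    },
}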

VAEDecode

Purpose:

This node decodes the latent video from the KSampler into a pixel-based video.

Customizable Settings:

samples: This setting takes the latent video output from the KSampler node.
vae: This setting takes the VAE model output from the VAELoader node. Changing this setting would require loading a different VAE model.
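
The decode step only wires two links:

vae_decode = {  # node "11"
    "class_type": "VAEDecode",
    "inputs": {"samples": ["10", 0], "vae": ["2", 0]},
}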

VHS_VideoCombine

Purpose:

This node combines the decoded video frames into a video file.

Customizable Settings:

frame_rate: This setting controls the video’s frame rate, affecting its smoothness.
loop_count: This setting determines how many times the video will loop.
filename_prefix: This setting allows the user to specify a prefix for the saved video file name.
format: This setting selects the video file format.
pix_fmt: This setting selects the pixel format.
crf: This setting controls the Constant Rate Factor, which sets the compression level; lower values give higher quality and larger files.
save_metadata: This setting determines whether to save metadata.
trim_to_audio: This setting determines whether to trim the video to the audio length.
pingpong: This setting enables ping-pong looping.
save_output: This setting determines whether to save the output video.
videopreview: This widget displays a preview of the rendered video inside the node; it is not a generation setting.
images: This setting takes the decoded frames from the VAEDecode node.
audio: This optional input takes an audio track to include in the video file.
meta_batch: This optional input is used by VideoHelperSuite for batched processing of long videos.
vae: This optional input takes a VAE so the node can accept latents directly instead of decoded images.
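
To close the loop, here is a sketch of the combine node (widget names follow VideoHelperSuite and may shift between versions; the values are typical, not mandatory), followed by a minimal way to queue the whole graph against a locally running ComfyUI instance through its HTTP API on the default port. It assumes the node dicts sketched in the earlier sections.

video_combine = {  # node "12"
    "class_type": "VHS_VideoCombine",
    "inputs": {
        "frame_rate": 16,              # Wan's native frame rate
        "loop_count": 0,               # 0 = play once, no looping
        "filename_prefix": "wan_i2v",
        "format": "video/h264-mp4",
        "pix_fmt": "yuv420p",
        "crf": 19,                     # lower = higher quality, larger file
        "save_metadata": True,
        "trim_to_audio": False,
        "pingpong": False,
        "save_output": True,
        "images": ["11", 0],           # decoded frames from VAEDecode
    },
}

import json
import urllib.request

# Collect every node sketched above, keyed by its id.
prompt = {"1": clip_loader, "2": vae_loader, "3": clip_vision_loader,
          "4": unet_loader, "5": load_image, "6": positive_prompt,
          "7": negative_prompt, "8": clip_vision_encode, "9": wan_i2v,
          "10": ksampler, "11": vae_decode, "12": video_combine}

req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=json.dumps({"prompt": prompt}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)  # ComfyUI queues the job and renders the video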
