ComfyUI Wan 2.1 Text to Video GGUF Workflow

Workflow Name
wan-2.1-t2v-gguf
Workflow Description
A basic text-to-video workflow for Wan 2.1 that uses GGUF-quantized model weights, which trade a small amount of quality for a much smaller VRAM footprint.
Workflow Dependencies
Download Wan 2.1 GGUF Text-to-Video Models (save in: ComfyUI\models\diffusion_models):
https://huggingface.co/city96/Wan2.1-T2V-14B-gguf/tree/main
Download Wan 2.1 Text Encoders (save in: ComfyUI\models\text_encoders):
https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/tree/main/split_files/text_encoders
Download Wan 2.1 VAE (save in: ComfyUI\models\vae):
https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/tree/main/split_files/vae
Install “ComfyUI-VideoHelperSuite” custom node via ComfyUI Manager.
Install “ComfyUI-GGUF” custom node via ComfyUI Manager.
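Once downloaded, the files should sit in a layout like the following (the file names are examples of what the linked repositories provide; pick the quantization and precision variants that fit your GPU):

    ComfyUI/
    └── models/
        ├── diffusion_models/
        │   └── wan2.1-t2v-14b-Q4_K_M.gguf
        ├── text_encoders/
        │   └── umt5_xxl_fp8_e4m3fn_scaled.safetensors
        └── vae/
            └── wan_2.1_vae.safetensors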
Workflow Details
VAELoader
Purpose: This node loads a VAE (Variational Autoencoder) model. The VAE is responsible for encoding video frames into latent space and decoding latent representations back into frames.
Customizable Settings:
vae_name: This setting allows the user to select the specific VAE model file to load. Changing it can affect the decoded frames’ overall color, contrast, and detail, because different VAEs are trained with different properties. For this workflow, select the Wan 2.1 VAE downloaded above.
CLIPLoader
Purpose: This node loads the text encoder. Despite the node’s name, CLIP (Contrastive Language–Image Pre-training) is only one family it can load; for Wan 2.1 it loads the UMT5-XXL text encoder, which turns text prompts into embeddings that condition the video generation.
Customizable Settings:
clip_name: This setting allows the user to select the specific text encoder file. Changing it affects how well the model understands and interprets the text prompts, influencing how closely the video matches the prompt. For this workflow, select the UMT5-XXL encoder downloaded above.
type: This setting tells ComfyUI which model family the text encoder belongs to, so the matching tokenizer and embedding pipeline are used. For this workflow it should be set to wan.
device: In recent ComfyUI versions, this setting selects whether the text encoder runs on the GPU or the CPU; offloading it to the CPU frees VRAM at the cost of slower prompt encoding.
UnetLoaderGGUF
Purpose: This node, provided by the ComfyUI-GGUF custom node pack, loads a diffusion model from a GGUF file. For Wan 2.1 this is the diffusion transformer that performs the iterative denoising at the core of video generation.
Customizable Settings:
unet_name: This setting allows the user to select the specific Wan 2.1 GGUF model file. The quantization level in the file name (for example Q4_K_M versus Q8_0) trades visual quality against VRAM use: lower-bit quantizations fit on smaller GPUs but may lose some detail.
EmptyHunyuanLatentVideo
Purpose: This node creates an empty latent video, which serves as the starting point for the video generation process. The latent space is a compressed representation of the video frames; Wan 2.1 shares its latent format with Hunyuan Video, which is why this node is reused here.
Customizable Settings:
width: This setting determines the width of each frame in the generated video in pixels. Increasing or decreasing this value will change the horizontal resolution of the output video.
height: This setting determines the height of each frame in the generated video in pixels. Increasing or decreasing this value will change the vertical resolution of the output video.
length: This setting determines the number of frames in the generated video. Because the VAE compresses time by a factor of 4, it should be of the form 4n + 1 (for example 33 or 81); larger values produce longer videos at a higher cost in VRAM and time.
batch_size: This setting determines the number of videos generated in a single run. Increasing it produces multiple variations from the same prompt at a proportional cost in VRAM and time.
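As a rough sketch of what this node allocates (assuming the Wan/Hunyuan latent format: 16 channels, 8× spatial and 4× temporal compression), the latent tensor shape can be computed like this:

    # Sketch: the latent shape EmptyHunyuanLatentVideo allocates, assuming
    # Wan/Hunyuan VAE compression (16 channels, 8x spatial, 4x temporal).
    def latent_shape(width: int, height: int, length: int, batch_size: int = 1):
        assert (length - 1) % 4 == 0, "frame count should be 4n + 1 (e.g. 33, 81)"
        frames = (length - 1) // 4 + 1     # temporal compression by 4
        return (batch_size, 16, frames, height // 8, width // 8)

    print(latent_shape(832, 480, 81))      # -> (1, 16, 21, 60, 104)

This is why resolution and frame count dominate VRAM use: the latent grows linearly with each of them.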
CLIPTextEncode (Positive Prompt)
Purpose: This node takes the positive prompt text and encodes it into conditioning that the diffusion model can understand. This encoded representation guides the video generation process.
Customizable Settings:
text: This setting allows the user to input the positive prompt, which describes the desired content of the video. Modifying this text will directly change the subject, style, and details of the generated video.
clip: This setting takes the CLIP model output from the CLIPLoader node. Changing this setting would require loading a different CLIP model.
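For example, a positive prompt for this workflow might read: “A white cat wearing sunglasses skateboards along a sunny boardwalk, cinematic lighting, smooth camera motion.” Concrete, descriptive prompts tend to produce more coherent motion than single keywords.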
CLIPTextEncode (Negative Prompt)
Purpose: This node takes the negative prompt text and encodes it into conditioning that the diffusion model can understand. This encoded representation specifies what to steer away from in the generated video.
Customizable Settings:
text: This setting allows the user to input the negative prompt, which describes elements to avoid in the generated video. Listing unwanted artifacts or styles here steers the model away from them in the output.
clip: This setting takes the CLIP model output from the CLIPLoader node. Changing this setting would require loading a different CLIP model.
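A typical negative prompt lists failure modes rather than subjects, for example: “blurry, low quality, watermark, text, deformed hands, static image, flickering.”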
KSampler
Purpose: This node performs the iterative denoising process, progressively removing noise from the latent video under the guidance of the prompts until a coherent video remains.
Customizable Settings:
seed: This setting initializes the random number generator, which controls the noise pattern used in the denoising process. Changing the seed will produce different variations of the same prompt.
steps: This setting determines the number of denoising steps. More steps generally improve video quality at a proportional cost in processing time; values around 20–30 are a common starting point for Wan 2.1.
cfg: This setting controls the classifier-free guidance scale, which determines how strongly the generated video adheres to the positive prompt. Higher values give stronger adherence but can reduce diversity and introduce artifacts; a value around 6 is a common starting point for Wan 2.1.
sampler_name: This setting selects the sampling algorithm used for denoising. Different samplers trade speed against quality; uni_pc is a commonly recommended choice for Wan 2.1.
scheduler: This setting selects the scheduler used to control the noise schedule during denoising. Different schedulers affect the denoising trajectory and final video quality.
denoise: This setting controls how much of the denoising schedule is applied. A value of 1.0 denoises from pure noise, which is what text-to-video from an empty latent requires; lower values preserve part of the input latent and are mainly useful in video-to-video workflows.
model: This setting takes the UNet model output from the UnetLoaderGGUF node. Changing this setting would require loading a different UNet model.
positive: This setting takes the encoded positive prompt from the CLIPTextEncode (Positive Prompt) node.
negative: This setting takes the encoded negative prompt from the CLIPTextEncode (Negative Prompt) node.
latent_image: This setting takes the empty latent video from the EmptyHunyuanLatentVideo node.
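For intuition, each denoising step combines the two conditionings roughly as follows (a sketch of the standard classifier-free guidance formula, not ComfyUI’s exact code):

    # Sketch of classifier-free guidance at one denoising step.
    # cond / uncond are the model's noise predictions with the positive
    # and the negative conditioning, respectively.
    def cfg_combine(cond, uncond, cfg: float):
        # cfg = 1.0 cancels the negative term entirely; higher values
        # push the prediction harder toward the positive prompt.
        return uncond + cfg * (cond - uncond)

This is why cfg = 1.0 effectively disables the negative prompt, and why very high values tend to over-saturate or distort the output.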
VAEDecode
Purpose: This node decodes the latent video from the KSampler into a pixel-based video.
Customizable Settings:
samples: This setting takes the latent video output from the KSampler node.
vae: This setting takes the VAE model output from the VAELoader node. Changing this setting would require loading a different VAE model.
VHS_VideoCombine
Purpose: This node, provided by ComfyUI-VideoHelperSuite, assembles the decoded frames into a video file and saves it to the output directory.
Customizable Settings:
frame_rate: This setting controls the video’s frame rate, which affects both smoothness and duration. Wan 2.1 is trained at 16 fps, so 16 is the natural choice here.
loop_count: This setting determines how many times the video loops (0 means loop indefinitely in formats that support it, such as GIF).
filename_prefix: This setting allows the user to specify a prefix for the saved video file name.
format: This setting selects the video file format.
pix_fmt: This setting selects the pixel format (yuv420p is the safest choice for playback compatibility).
crf: This setting controls the Constant Rate Factor used by the encoder; lower values mean higher quality and larger files.
save_metadata: This setting determines whether the workflow metadata is embedded in the video file, which lets the workflow be restored by dragging the file back into ComfyUI.
trim_to_audio: This setting determines whether the video is trimmed to the length of the connected audio track.
pingpong: This setting plays the frames forward and then in reverse, which makes short clips loop smoothly.
save_output: This setting determines whether the video is saved to the output directory or kept only as a temporary preview.
videopreview: This is the in-node preview widget rather than a true setting; it displays the rendered video inside the ComfyUI interface.
images: This setting takes the images from the VAEDecode node.
audio: This optional input muxes an audio track into the output video.
meta_batch: This optional input connects a VHS meta-batch manager node for assembling very long videos in chunks.
vae: This optional input supplies a VAE so the node can decode latents directly instead of receiving already-decoded images.
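As a quick sanity check on clip length (a sketch; the frame count comes from the latent node, and ping-pong roughly doubles it):

    # Sketch: estimated duration of the file VHS_VideoCombine writes.
    def video_duration(frame_count: int, frame_rate: float, pingpong: bool = False) -> float:
        if pingpong:
            # forward then backward playback, skipping duplicate endpoint frames
            frame_count = 2 * frame_count - 2
        return frame_count / frame_rate

    print(video_duration(81, 16))          # -> 5.0625 seconds
    print(video_duration(81, 16, True))    # -> 10.0 seconds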