Surpassing Sora: Hands-On Deployment of CogVideo, the Strongest Text-to-Video Model

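CogVideo is a video generation model open-sourced by Zhipu AI, and it shares its origins with the commercial video generation product "清影" (Qingying). CogVideoX-2B is the first model in the CogVideoX series. It has 2 billion parameters and supports inference on a single 4090 GPU, using 18GB of VRAM for inference and 40GB for fine-tuning. CogVideoX-2B uses a 3D VAE to compress video data along both the spatial and temporal dimensions, achieving a high compression ratio with good reconstruction quality. The model also includes an encoder, a decoder, and a latent...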

杰说新技术 · Published 2024-08-12 06:00:00

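CogVideo is a deep-learning-based text-to-video generation model developed by Zhipu AI. Given a text description, it automatically generates matching video content.

CogVideoX-2B is the first model in the CogVideoX series. It has 2 billion parameters and shares its origins with Zhipu AI's video generation product "清影" (Qingying).

CogVideoX-2B combines several recent techniques, including a three-dimensional variational autoencoder (3D VAE), an end-to-end video understanding model, and an Expert Transformer. Together these put the model among the leaders in video generation.

The model accepts English prompts. Single-GPU inference uses about 18GB of VRAM with SAT (the SwissArmyTransformer framework) or 23.9GB with diffusers.

Fine-tuning uses 42GB of VRAM. Prompts are capped at 226 tokens, and the model generates 6-second videos at 8 frames per second with a resolution of 720×480.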
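
As a quick way to read the memory figures above, here is a minimal sketch that reports which workloads fit a given VRAM budget. The thresholds are the numbers quoted in this article; the helper name `feasible_modes` and the sample card sizes are illustrative assumptions, not part of CogVideo.

```python
# VRAM figures quoted in this article for CogVideoX-2B (in GB).
# The helper below is an illustrative sketch, not part of the CogVideo code base.
REQUIRED_VRAM_GB = {
    "inference (SAT)": 18.0,
    "inference (diffusers)": 23.9,
    "fine-tuning": 42.0,
}


def feasible_modes(available_vram_gb: float) -> list[str]:
    """Return the workloads whose quoted VRAM requirement fits the budget."""
    return [mode for mode, need in REQUIRED_VRAM_GB.items() if available_vram_gb >= need]


if __name__ == "__main__":
    for card_gb in (16, 24, 48):
        print(f"{card_gb} GB -> {feasible_modes(card_gb) or 'none of the listed modes'}")
```

Note that a card's nominal capacity is an upper bound: the driver and other processes also use memory, so a 24GB card running the 23.9GB diffusers path has almost no headroom.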

GitHub repository: https://github.com/THUDM/CogVideo

I. Environment Setup

1. Python environment

Python 3.10 or later is recommended.
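
The version recommendation can be checked before installing anything else. The following is a small sketch using only the standard library; the function name `python_is_supported` is an assumption made for illustration.

```python
import sys

MIN_PYTHON = (3, 10)  # minimum version recommended in this article


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON) -> bool:
    """Compare the (major, minor) pair against the recommended minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    status = "OK" if python_is_supported() else "older than recommended"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```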

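2. Installing pip packages (run the second command from the root of the cloned CogVideo repository, which provides requirements.txt)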

pip install torch==2.4.0+cu118 torchvision==0.19.0+cu118 torchaudio==2.4.0 --extra-index-url https://download.pytorch.org/whl/cu118

pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
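
After the install finishes, it can help to confirm which versions actually landed in the environment. This sketch reads package metadata with the standard library rather than importing the packages, so it also runs where they are missing. The package list (the three torch packages from the pip command, plus diffusers, which the test script imports) and the function name `installed_versions` are assumptions of this example.

```python
from importlib import metadata

# The torch packages from the pip command, plus diffusers, which the test script imports.
PACKAGES = ("torch", "torchvision", "torchaudio", "diffusers")


def installed_versions(names=PACKAGES) -> dict:
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```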

3. Downloading the CogVideoX-2b Diffusers model:

git lfs install

git clone https://huggingface.co/THUDM/CogVideoX-2b
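
If Git LFS is not active when the repository is cloned, the large weight files arrive as small text stubs (LFS pointers) instead of the real data, and model loading fails later. The sketch below scans a clone for such stubs by looking for the header line that LFS pointer files start with. The function name `find_lfs_pointers` and the directory name in the usage line are assumptions for illustration.

```python
from pathlib import Path

# Git LFS pointer files begin with this line.
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(repo_dir: str) -> list[Path]:
    """Return files under repo_dir that are still Git LFS pointer stubs."""
    root = Path(repo_dir)
    pointers = []
    for path in root.rglob("*"):
        # Skip Git's own metadata and anything that is not a regular file.
        if ".git" in path.relative_to(root).parts or not path.is_file():
            continue
        with path.open("rb") as fh:
            if fh.read(len(LFS_HEADER)) == LFS_HEADER:
                pointers.append(path)
    return sorted(pointers)


if __name__ == "__main__":
    # "CogVideoX-2b" is the directory created by the git clone command above.
    for stub in find_lfs_pointers("CogVideoX-2b"):
        print(f"still an LFS pointer: {stub}")
```

An empty result suggests the weights were fetched; any listed file can usually be fixed by running `git lfs pull` inside the clone.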

4. Downloading the CogVideoX-2b SAT model:

wget https://cloud.tsinghua.edu.cn/f/fdba7608a49c463ba754/?dl=1

mv 'index.html?dl=1' vae.zip

unzip vae.zip

wget https://cloud.tsinghua.edu.cn/f/556a3e1329e74f1bac45/?dl=1

mv 'index.html?dl=1' transformer.zip

unzip transformer.zip
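
The wget, mv, and unzip steps above can also be sketched in Python, which saves each archive under its final name directly and so skips the `index.html?dl=1` rename. This is a minimal standard-library sketch, not the project's own tooling; the function name `fetch_and_unzip` is an assumption, and the URLs in the trailing comments are copied from the commands above.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path


def fetch_and_unzip(url: str, archive_name: str, dest_dir: str = ".") -> list[str]:
    """Download url to dest_dir/archive_name, extract it there, and list its members."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive_path = dest / archive_name
    # Stream the response straight into a file that already has the desired name.
    with urllib.request.urlopen(url) as response, archive_path.open("wb") as out:
        shutil.copyfileobj(response, out)
    # Extract next to the archive and report what it contained.
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Usage, mirroring the shell commands above (each call downloads a large file):
# fetch_and_unzip("https://cloud.tsinghua.edu.cn/f/fdba7608a49c463ba754/?dl=1", "vae.zip")
# fetch_and_unzip("https://cloud.tsinghua.edu.cn/f/556a3e1329e74f1bac45/?dl=1", "transformer.zip")
```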

II. Functional Testing

1. Command-line test:

(1) Calling the model from Python

import argparse
import tempfile
from typing import Union, List

import PIL.Image
import imageio
import numpy as np
import torch
from diffusers import CogVideoXPipeline

def export_to_video_imageio(
    video_frames: Union[List[np.ndarray], List[PIL.Image.Image]], output_video_path: str = None, fps: int = 8
) -> str:
    """
    Export the video frames to a video file using imageio library to avoid the "green screen" issue.
    """
    if output_video_path is None:
        output_video_path = tempfile.NamedTemporaryFile(suffix=".mp4", delete=False).name
    
    if isinstance(video_frames[0], PIL.Image.Image):
        video_frames = [np.array(frame) for frame in video_frames]
        
    with imageio.get_writer(output_video_path, fps=fps) as writer:
        for frame in video_frames:
            writer.append_data(frame)
    
    return output_video_path

def generate_video(
    prompt: str,
    model_path: str,
    output_path: str = "./output.mp4",
    num_inference_steps: int = 50,
    guidance_scale: float = 6.0,
    num_videos_per_prompt: int = 1,
    device: str = "cuda",
    dtype: torch.dtype = torch.float16,
):
    """
    Generates a video based on the given prompt and saves it to the specified path.

    Parameters:
    - prompt (str): The description of the video to be generated.
    - model_path (str): The path of the pre-trained model to be used.
    - output_path (str): The path where the generated video will be saved.
    - num_inference_steps (int): Number of steps for the inference process. More steps can result in better quality.
    - guidance_scale (float): The scale for classifier-free guidance. Higher values can lead to better alignment with the prompt.
    - num_videos_per_prompt (int): Number of videos to generate per prompt.
    - device (str): The device to use for computation (e.g., "cuda" or "cpu").
    - dtype (torch.dtype): The data type for computation (default is torch.float16).
    """
    try:
        # Load pre-trained model
        pipe = CogVideoXPipeline.from_pretrained(model_path, torch_dtype=dtype).to(device)
    except Exception as e:
        raise RuntimeError(f"Failed to load model from {model_path}: {e}") from e
    
    print(f"Model loaded successfully from {model_path}")

    try:
        # Encode the prompt to get embeddings
        prompt_embeds, _ = pipe.encode_prompt(
            prompt=prompt,
            num_videos_per_prompt=num_videos_per_prompt,
            device=device,
            dtype=dtype,
        )
    except Exception as e:
        raise RuntimeError(f"Failed to encode prompt: {e}") from e
    
    print(f"Prompt encoded successfully: {prompt}")

    try:
        # Generate video frames
        video = pipe(
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
            prompt_embeds=prompt_embeds,
            negative_prompt_embeds=torch.zeros_like(prompt_embeds),  # Negative prompts are not supported, so pass zeros
        ).frames[0]
    except Exception as e:
        raise RuntimeError(f"Failed to generate video: {e}") from e
    
    print("Video generated successfully")

    try:
        # Export frames to video file
        export_to_video_imageio(video, output_path, fps=8)
    except Exception as e:
        raise RuntimeError(f"Failed to export video: {e}") from e

    print(f"Video saved successfully at {output_path}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Generate a video from a text prompt using CogVideoX")
    
    parser.add_argument("--prompt", type=str, required=True, help="The description of the video to be generated")
    parser.add_argument(
        "--model_path", type=str, default="THUDM/CogVideoX-2b", help="The path of the pre-trained model to be used"
    )
    parser.add_argument(
        "--output_path", type=str, default="./output.mp4", help="The path where the generated video will be saved"
    )
    parser.add_argument(
        "--num_inference_steps", type=int, default=50, help="Number of steps for the inference process"
    )
    parser.add_argument("--guidance_scale", type=float, default=6.0, help="The scale for classifier-free guidance")
    parser.add_argument("--num_videos_per_prompt", type=int, default=1, help="Number of videos to generate per prompt")
    parser.add_argument(
        "--device", type=str, default="cuda", help="The device to use for computation (e.g., 'cuda' or 'cpu')"
    )
    parser.add_argument(
        "--dtype", type=str, default="float16", help="The data type for computation (e.g., 'float16' or 'float32')"
    )

    args = parser.parse_args()

    # Convert dtype argument to torch.dtype.
    dtype = torch.float16 if args.dtype == "float16" else torch.float32

    generate_video(
        prompt=args.prompt,
        model_path=args.model_path,
        output_path=args.output_path,
        num_inference_steps=args.num_inference_steps,
        guidance_scale=args.guidance_scale,
        num_videos_per_prompt=args.num_videos_per_prompt,
        device=args.device,
        dtype=dtype,
    )
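
To see how the command-line flags resolve without loading the model, the argument parser can be exercised on its own. The sketch below recreates the flags and defaults from the script above. It adds one thing the original does not have: `choices` on `--dtype`, because the original conversion silently maps any value other than "float16" to `torch.float32`. The example prompt and the local model directory are placeholders.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Recreate the script's flags; `choices` on --dtype is an addition of this sketch."""
    parser = argparse.ArgumentParser(description="Generate a video from a text prompt using CogVideoX")
    parser.add_argument("--prompt", type=str, required=True)
    parser.add_argument("--model_path", type=str, default="THUDM/CogVideoX-2b")
    parser.add_argument("--output_path", type=str, default="./output.mp4")
    parser.add_argument("--num_inference_steps", type=int, default=50)
    parser.add_argument("--guidance_scale", type=float, default=6.0)
    parser.add_argument("--num_videos_per_prompt", type=int, default=1)
    parser.add_argument("--device", type=str, default="cuda")
    # Reject unknown dtype strings instead of silently falling back to float32.
    parser.add_argument("--dtype", type=str, default="float16", choices=["float16", "float32"])
    return parser


if __name__ == "__main__":
    # Placeholder prompt and model directory, passed as an explicit argument list.
    args = build_parser().parse_args(
        ["--prompt", "A panda playing guitar in a bamboo forest", "--model_path", "./CogVideoX-2b"]
    )
    print(vars(args))
```

Only `--prompt` is required. Passing `--model_path` with the directory created by the git clone step loads the local weights instead of fetching `THUDM/CogVideoX-2b` from the Hugging Face Hub.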

To be continued...

For more details, follow 杰哥新技术.
