01 Model Introduction

Emu3.5: Native Multimodal Models are World Learners

Emu3.5 Team, BAAI

Project Page | 🤗HF Models | Paper

🔹 Core Concepts

🧠 Unified World Modeling: Predicts the next state jointly across vision and language, enabling coherent world modeling and generation.
🧩 End-to-End Pretraining: Trained with a unified next-token prediction objective over interleaved vision–language sequences.
📚 10T+ Multimodal Tokens: Pre-trained on over 10 trillion interleaved tokens from video frames and transcripts, capturing spatiotemporal structure.
🔄 Native Multimodal I/O: Processes and generates interleaved visual–text sequences without modality adapters or task-specific heads.
🎯 RL Post-Training: Large-scale reinforcement learning enhances reasoning, compositionality, and generation quality.
⚡ Discrete Diffusion Adaptation (DiDA): Converts sequential decoding into bidirectional parallel prediction, achieving ≈20× faster inference without performance loss.
🖼️ Versatile Generation: Excels in long-horizon vision–language generation, any-to-image (X2I) synthesis, and text-rich image creation.
🌐 Generalizable World Modeling: Enables spatiotemporally consistent world exploration and open-world embodied manipulation across diverse scenarios.
🏆 Performance Benchmark: Matches Gemini 2.5 Flash Image (Nano Banana) on image generation and editing, and outperforms it on interleaved generation tasks.

02 Quick Start

1 Prepare Resources

Hardware: Atlas 800I/800T A2 (64 GB) or Atlas 800I/800T A3

Run the following shell command to pull the vllm-ascend inference container image:
A2

docker pull quay.nju.edu.cn/ascend/vllm-ascend:v0.11.0rc3

A3

docker pull quay.nju.edu.cn/ascend/vllm-ascend:v0.11.0rc3-a3
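
To confirm the pull succeeded, you can list the local image (shown here for the shared repository; both the A2 and A3 tags will appear):

docker images quay.nju.edu.cn/ascend/vllm-ascend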

2 Download Weights

Weight name              Modelers community download link
Emu3.5                   link
Emu3.5-Image             link
Emu3.5-VisionTokenizer   link
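
Alternatively, the weights can be fetched programmatically with the openMind Hub client used by the Modelers community. The snippet below is a minimal sketch; it assumes the openmind_hub package is installed and that the repository id on Modelers is BAAI/Emu3.5-Image:

from openmind_hub import snapshot_download

# Download the image-generation weights into ./weights/Emu3.5-Image (repo id assumed)
snapshot_download(repo_id="BAAI/Emu3.5-Image", local_dir="./weights/Emu3.5-Image")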

3 Inference with transformers

(1) Create and enter the container

A2

docker run -it --net=host --shm-size=500g \
    --privileged \
    --name emu3.5 \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver:ro \
    -v /usr/local/sbin:/usr/local/sbin:ro \
    -v /data:/data \
    quay.nju.edu.cn/ascend/vllm-ascend:v0.11.0rc3 bash

A3

docker run -it --net=host --shm-size=500g \
    --privileged \
    --name emu3.5 \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver:ro \
    -v /usr/local/sbin:/usr/local/sbin:ro \
    -v /data:/data \
    quay.nju.edu.cn/ascend/vllm-ascend:v0.11.0rc3-a3 bash

(2) Set up the environment

cd /vllm-workspace/
git clone https://github.com/baaivision/Emu3.5.git
cd Emu3.5/
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install -r requirements/common.txt
pip install accelerate
pip install transformers==4.48.2

Modify line 38 of src/utils/model_utils.py (the flash-attn kernels are not available on NPU), changing

attn_implementation="flash_attention_2",

to

attn_implementation="eager",
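
Equivalently, the edit can be applied with a one-liner (assuming line 38 still matches the upstream source):

sed -i 's/attn_implementation="flash_attention_2"/attn_implementation="eager"/' src/utils/model_utils.py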

(3) Run inference

The example_config_*.py files are reference configurations for the different tasks. Before use, edit the following entries in the chosen configuration file:

model_path = "./weights/Emu3.5-Image"        # change to the actual weight path
vq_path = "./weights/Emu3.5-VisionTokenizer" # change to the actual tokenizer path
vq_device = "cuda:0"                         # change to vq_device = "npu:0"
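
For example, after editing, the relevant lines in configs/example_config_t2i.py might read (the /data paths are illustrative):

model_path = "/data/weights/Emu3.5-Image"
vq_path = "/data/weights/Emu3.5-VisionTokenizer"
vq_device = "npu:0"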

Inference commands:

# 🖼️ Text-to-Image (T2I) task
ASCEND_RT_VISIBLE_DEVICES=0,1 python inference.py --cfg configs/example_config_t2i.py

# 🔄 Any-to-Image (X2I) task
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 python inference.py --cfg configs/example_config_x2i.py

# 🎯 Visual Guidance task
ASCEND_RT_VISIBLE_DEVICES=0,1 python inference.py --cfg configs/example_config_visual_guidance.py

# 📖 Visual Narrative task
ASCEND_RT_VISIBLE_DEVICES=0,1 python inference.py --cfg configs/example_config_visual_narrative.py
# After running inference, the model will generate results in protobuf format (.pb files) for each input prompt.

4 Inference with vLLM

(1) Create and enter the container

A2

docker run -it --net=host --shm-size=500g \
    --privileged \
    --name emu3.5-vllm \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver:ro \
    -v /usr/local/sbin:/usr/local/sbin:ro \
    -v /data:/data \
    quay.nju.edu.cn/ascend/vllm-ascend:v0.11.0rc3 bash

A3

docker run -it --net=host --shm-size=500g \
    --privileged \
    --name emu3.5-vllm \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver:ro \
    -v /usr/local/sbin:/usr/local/sbin:ro \
    -v /data:/data \
    quay.nju.edu.cn/ascend/vllm-ascend:v0.11.0rc3-a3 bash

(2) Set up the environment

cd /vllm-workspace/
git clone https://github.com/baaivision/Emu3.5.git
cd Emu3.5/
python src/patch/apply.py
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install -r requirements/common.txt

Then apply the following code modifications:

  • (1) At line 701 of /vllm-workspace/vllm/vllm/v1/core/sched/scheduler.py, change
    return CachedRequestData(
              req_ids=req_ids,
              resumed_from_preemption=resumed_from_preemption,
              new_token_ids=new_token_ids,
              new_block_ids=new_block_ids,
              num_computed_tokens=num_computed_tokens,
          )
    
    to
    return CachedRequestData(
              req_ids=req_ids,
              resumed_from_preemption=resumed_from_preemption,
              new_token_ids=new_token_ids,
              new_block_ids=new_block_ids,
              num_computed_tokens=num_computed_tokens,
              sampling_params=None,
              hybrid_metadata=None
          )
    
  • (2) At line 58 of /vllm-workspace/vllm/vllm/v1/sample/logits_processor/builtin.py, change
    for index, params, _, _, _ in batch_update.added:
    to
    for index, params, _, _ in batch_update.added:
    
  • (3) At line 249 of /vllm-workspace/vllm/vllm/v1/sample/logits_processor/builtin.py, change
    for index, params, prompt_tok_ids, output_tok_ids, _ in batch_update.added
    to
    for index, params, prompt_tok_ids, output_tok_ids in batch_update.added
    
  • (4) In src/utils/model_utils.py at line 131, comment out "full_cuda_graph": True, and add "cudagraph_mode": "FULL_DECODE_ONLY", after it, as follows:
compilation_config={
            "full_cuda_graph": True,
            "backend": "cudagraph",
            "cudagraph_capture_sizes": [1, 2],
        },
to
compilation_config={
            #"full_cuda_graph": True,
            "cudagraph_mode":"FULL_DECODE_ONLY",
            "backend": "cudagraph",
            "cudagraph_capture_sizes": [1, 2],
        },

(3) Run inference

  • Set environment variables
    export VLLM_WORKER_MULTIPROC_METHOD="spawn"
    export TASK_QUEUE_ENABLE=1
    export CPU_AFFINITY_CONF=2
    # combine both allocator options in one export; a later export would overwrite the earlier one
    export PYTORCH_NPU_ALLOC_CONF="max_split_size_mb:250,expandable_segments:True"
    export HCCL_OP_EXPANSION_MODE="AIV"
    
  • Modify the configuration file

The example_config_*.py files are reference configurations for the different tasks. Before use, edit the same entries as in Section 3:

model_path = "./weights/Emu3.5-Image"        # change to the actual weight path
vq_path = "./weights/Emu3.5-VisionTokenizer" # change to the actual tokenizer path
vq_device = "cuda:0"                         # change to vq_device = "npu:0"
  • Inference commands
# 🖼️ Text-to-Image (T2I) task
ASCEND_RT_VISIBLE_DEVICES=0,1 python inference_vllm.py --cfg configs/example_config_t2i.py --tensor-parallel-size 2 --gpu-memory-utilization 0.8

# 🔄 Any-to-Image (X2I) task
ASCEND_RT_VISIBLE_DEVICES=0,1 python inference_vllm.py --cfg configs/example_config_x2i.py --tensor-parallel-size 2 --gpu-memory-utilization 0.7

# 🎯 Visual Guidance task
ASCEND_RT_VISIBLE_DEVICES=0,1 python inference_vllm.py --cfg configs/example_config_visual_guidance.py --tensor-parallel-size 2 --gpu-memory-utilization 0.8

# 📖 Visual Narrative task
ASCEND_RT_VISIBLE_DEVICES=0,1 python inference_vllm.py --cfg configs/example_config_visual_narrative.py --tensor-parallel-size 2 --gpu-memory-utilization 0.8
# After running inference, the model will generate results in protobuf format (.pb files) for each input prompt.

5 Convert Inference Results

Run the following command to convert the inference results into a visualizable form:

python src/utils/vis_proto.py --input <input_proto_path> [--output <output_dir>] [--video]
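
For example, to render the protobuf output of a T2I run (the paths are illustrative):

python src/utils/vis_proto.py --input outputs/t2i/result.pb --output outputs/t2i_vis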

6 Gradio Demo

We provide two Gradio Demos for different application scenarios:

Emu3.5-Image Demo, an interactive interface optimized for Text-to-Image (T2I) and Any-to-Image (X2I) tasks:

ASCEND_RT_VISIBLE_DEVICES=0,1 python gradio_demo_image.py --host 0.0.0.0 --port 7860

Emu3.5-Interleave Demo, an interactive interface for the interleaved tasks (Visual Guidance and Visual Narrative):

ASCEND_RT_VISIBLE_DEVICES=0,1 python gradio_demo_interleave.py --host 0.0.0.0 --port 7860
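
On a headless server, you can verify that a demo is up by probing the port passed via --port:

curl -I http://localhost:7860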

03 Quantization

Quantize the model using the modelslim quantization tool.

1 Get the modelslim Source Code

git clone https://gitcode.com/Ascend/msit.git

2 Integrate the Emu3.5 Model

(1) Implement the model quantization code

cd msit/msmodelslim/msmodelslim/model
mkdir emu3_5
cd emu3_5
cp -r <Emu3.5-source-path>/src/tokenizer_emu3_ibq/ ./
cp -r <Emu3.5-source-path>/src/emu3p5/modeling_emu3.py ./
cp -r <Emu3.5-source-path>/src/emu3p5/configuration_emu3.py ./

Then copy __init__.py and model_adapter.py into the emu3_5 directory.

(2) Add the quantization configuration file

Create quant.yaml and add the following content:

apiversion: modelslim_v1
spec:
  process:
    - type: "iter_smooth"                   
      alpha: 0.9                           
      scale_min: 1e-5                      
      symmetric: True                        
      enable_subgraph_type: 
        - 'norm-linear'
        - 'linear-linear'
        - 'ov'
        - 'up-down'
      include: 
        - "*"
      exclude:                               
        - "*self_attn*"
    - type: "quarot"
      online: False
      block_size: -1
      max_tp_size: 4
      down_proj_online_layers: [ ]
    - type: "linear_quant" # linear-layer quantization
      qconfig:
        act: # activation quantization
          scope: "per_token" # dynamic per-token quantization
          dtype: "int8" # 8-bit integer quantization
          symmetric: True # symmetric quantization
          method: "minmax" # minmax algorithm
        weight: # weight quantization
          scope: "per_channel" # per-channel quantization
          dtype: "int8" # 8-bit integer quantization
          symmetric: True # symmetric quantization
          method: "minmax" # minmax algorithm
      include: [ "*" ] # global w8a8 dynamic quantization
      exclude: [ "*down_proj*" ] # fall back (skip quantization for) the down_proj layers
  save:
    - type: "ascendv1_saver"
      part_file_size: 4 # at most 4 GB per safetensors weight file

(3) Register the model

Register the model name in the config.ini configuration file: add an emu3.5 entry in the ModelAdapter section, in the same way existing entries (e.g. qwen3 and its Qwen3ModelAdapter) map a model name to its adapter:

emu3_5 = EMU3.5, Emu3.5-Image

Then add the model adapter in the ModelAdapterEntryPoints section:

emu3_5 = msmodelslim.model.emu3_5.model_adapter:Emu35ModelAdapter

3 Install modelslim

cd msit/msmodelslim
bash install.sh
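
After installation, the msmodelslim command used in the next step should be on your PATH; assuming the CLI supports the conventional help flag, a quick sanity check is:

msmodelslim --help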

4 Run Quantization

# path to tokenizer_emu3_ibq
export TOKENIZER_PATH=~/msit/msmodelslim/msmodelslim/model/emu3_5/tokenizer_emu3_ibq
# path to the one-click quantization configuration file
yaml_config=msit/msmodelslim/msmodelslim/model/emu3_5/quant.yaml
# path to the original Emu3.5-Image weights
model_path=BAAI/Emu3.5-Image/
# path to save the quantized weights
save_path=BAAI/Emu3.5-Image-w8a8
msmodelslim quant --model_path $model_path \
                  --save_path $save_path \
                  --device npu \
                  --model_type EMU3.5 \
                  --config_path $yaml_config \
                  --trust_remote_code False

5 Inference After Quantization

Use vLLM for inference; set up the environment as described in Section 4, "Inference with vLLM".
After line 127 of src/utils/model_utils.py, add the parameter quantization="ascend", as sketched below.
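
A minimal sketch of the resulting call, assuming the surrounding code constructs a vllm.LLM instance and using an illustrative quantized-weight path:

from vllm import LLM

# Load the w8a8 weights produced in step 4; quantization="ascend" is the added parameter
llm = LLM(
    model="BAAI/Emu3.5-Image-w8a8",
    quantization="ascend",
    tensor_parallel_size=2,
)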


The Modelers community (Modelers.cn) is a neutral, public-interest AI community that provides hosting, showcasing, and collaboration services for AI tools, models, and data, offering an open platform for learning and exchange to AI developers and enthusiasts. The community operates under a council model, jointly built, operated, and owned by the whole industry chain, promoting a thriving domestic AI ecosystem.
