魔乐社区上线Qwen3.5全家桶！基于vLLM Ascend的昇腾部署教程已就位

继除夕Qwen3.5开源397B的模型之后，2月25日，Qwen3.5开源家族又添新成员，开源122B、35B、27B全尺寸模型。魔乐社区不仅同步上线Qwen3.5全尺寸模型，从环境搭建到单/多节点部署的全套实操教程也已同步发布，让开发者无需再为适配开发耗时，一键解锁高性能模型的昇腾部署，欢迎广大开发者下载体验！

魔乐社区

5524人浏览 · 2026-02-27 18:08:06

魔乐社区 · 2026-02-27 18:08:06 发布

继除夕Qwen3.5开源397B的模型之后，2月25日，Qwen3.5开源家族又添新成员，开源122B、35B、27B全尺寸模型。

魔乐社区不仅同步上线Qwen3.5全尺寸模型，从环境搭建到单/多节点部署的全套实操教程也已同步发布，让开发者无需再为适配开发耗时，一键解锁高性能模型的昇腾部署，欢迎广大开发者下载体验！

🔗 BF16版本模型权重：

🔗 Qwen3.5-397B-w8a8（无 mtp 的量化版本）：

https://modelers.cn/models/Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp

🔗 vLLM Ascend部署教程：

01 模型亮点

35B性能全面超越235B

Qwen3.5-122B-A10B：与 27B（Dense）持续拉近中小模型与顶尖闭源模型的差距，在复杂 Agent 任务（如多步推理、工具调用、跨应用操作）中表现尤为亮眼。
Qwen3.5-35B-A3B：已全面超越Qwen3-235B-A22B与Qwen3-VL-235B-A22B —— 智能进步的关键，从来不是堆参数，而是更优架构 + 更高质量数据 + 强化学习（RL）的有效协同。
Qwen3.5-27B（Dense）：小尺寸高能效，显著降低大模型使用门槛。

以下以Qwen3.5-35B为例，手把手教你基于vLLM Ascend部署模型。

02 环境准备

1、模型权重

Qwen3.5-35B（BF16 版本）：
https://modelers.cn/models/Qwen-AI/Qwen3.5-35B-A3B

注：建议将模型权重下载至多节点共享目录，例如 /root/.cache/。

2、安装

官方 Docker 镜像

您可以通过镜像链接下载镜像压缩包来进行部署。链接：https://modelers.cn/models/Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp/tree/main/vllm-image

具体流程如下：

# 使用docker加载下载的镜像压缩包
# 根据您的环境更新要加载的vllm-ascend镜像压缩包名称,以下以A3 arm为例：
docker load -i Vllm-ascend-Qwen3_5-A3-Ubuntu-v0.tar 
# 根据您的设备更新 --device（Atlas A3：/dev/davinci[0-15]）。

# 注意：您需要提前将权重下载至 /root/.cache。
# 更新 vllm-ascend 镜像，并配置对应的Image名
export IMAGE=vllm-ascend:qwen3_5-v0-a3 
export NAME=vllm-ascend

# 使用定义的变量运行容器
# 注意：若使用 Docker 桥接网络，请提前开放可供多节点通信的端口
docker run --rm \
--name $NAME \
--net=host \
--shm-size=100g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash

源码构建

如果您不希望使用上述 Docker 镜像，也可通过源码完整构建：保证你的环境成功安装了CANN 8.5.0从源码安装 vllm-ascend，请参考安装指南 https://docs.vllm.ai/projects/ascend/en/latest/installation.html。

从源码安装vllm-ascend后，您需要将vllm、vllm-ascend、transformers 升级至主分支：

# 升级 vllm
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout a75a5b54c7f76bc2e15d3025d6
git fetch origin pull/34521/head:pr-34521
git merge pr-34521
VLLM_TARGET_DEVICE=empty pip install -v .

# 升级 vllm-ascend
pip uninstall vllm-ascend -y
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
git checkout c63b7a11888e9e1caeeff8
git fetch origin pull/6742/head:pr-6742
git merge pr-6742
pip install -v .

# 重新安装 transformers
git clone https://github.com/huggingface/transformers.git
cd transformers
git reset --hard fc9137225880a9d03f130634c20f9dbe36a7b8bf
pip install .

如需部署多节点环境，您需要在每个节点上分别完成环境配置。

03 部署模型

单节点部署

A3 系列
执行以下脚本进行在线推理。

export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_BUFFSIZE=1024
export OMP_NUM_THREADS=1
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
export TASK_QUEUE_ENABLE=1

vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/Qwen3.5-35B-A3B/ \
    --served-model-name "qwen3.5" \
    --host 0.0.0.0 \
    --port 8010 \
    --data-parallel-size 1 \
    --tensor-parallel-size 4 \
    --max-model-len 5000 \
    --max-num-batched-tokens 16384 \
    --max-num-seqs 128 \
    --gpu-memory-utilization 0.94 \
    --trust-remote-code \
    --async-scheduling \
    --allowed-local-media-path / \
    --mm-processor-cache-gb 0 \
	--enforce-eager \
    --additional-config '{"enable_cpu_binding":true, "multistream_overlap_shared_expert": true}'

执行以下脚本向模型发送一条请求：

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "The future of AI is",
        "path": "/path/to/model/Qwen3.5-35B-A3B/",
        "max_tokens": 100,
        "temperature": 0
        }'

执行结束后，您可以看到模型回答如下：

Prompt: 'The future of AI is', Generated text: ' not just about building smarter machines, but about creating systems that can collaborate with humans in meaningful, ethical, and sustainable ways. As AI continues to evolve, it will increasingly shape how we live, work, and interact — and the decisions we make today will determine whether this future is one of shared prosperity or deepening inequality.\n\nThe rise of generative AI, for example, has already begun to transform creative industries, education, and scientific research. Tools like ChatGPT, Midjourney, and'

也可执行以下脚本向模型发送一条多模态请求：

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
    -d '{
        "model": "qwen3.5",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
                {"type": "text", "text": "What is the text in the illustrate?"}
            ]}
        ]
    }'

执行结束后，您可以看到模型回答如下：

{"id":"chatcmpl-9dab99d55addd8c0","object":"chat.completion","created":1771060145,"model":"qwen3.5","choices":[{"index":0,"message":{"role":"assistant","content":"TONGYI Qwen","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":112,"total_tokens":119,"completion_tokens":7,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}