[vLLM latest release v0.10.2] Running the OpenAI-compatible server with Docker and using GGUF quantization

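This article shows how to run the vLLM v0.10.2 OpenAI-compatible server with Docker and how to serve GGUF-quantized models. It carries a note, struck through in the body, that this quantization method does not support multimodal models, and states that quantization in vLLM only saves GPU memory and does not improve speed. GGUF quantization requires a llama.cpp build: the Hugging Face model is first converted to an FP16 GGUF file, then quantized to Q4_0 or another type (the supported types are listed), and finally the quantized model is started with Docker and checked with the same test request.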

莽夫搞战术 · Published 2025-09-22 23:27:06

Running with Docker
- Start the service
- Test the service
GGUF quantization

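~~Note: this quantization method does not support multimodal models~~
Note: quantization in vLLM only saves GPU memory; it does not improve speed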

Running with Docker

Official deployment guide: https://docs.vllm.ai/en/latest/deployment/docker.html
Official images: https://hub.docker.com/r/vllm/vllm-openai/tags

Start the service

# Pull the image (same tag as the run command)
docker pull vllm/vllm-openai:v0.10.2

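# Start the container
# Replace /path/to/Qwen2.5-7B-Instruct with the absolute host path of the model directory; Docker treats a bare name before the colon as a named volume, not a host directory
docker run -idt --restart=always -e TZ="Asia/Shanghai" --gpus device=0 --name qwen2.5-test -v /path/to/Qwen2.5-7B-Instruct:/Qwen2.5-7B-Instruct -p 8192:8000 --ipc=host vllm/vllm-openai:v0.10.2 --model /Qwen2.5-7B-Instruct --gpu-memory-utilization 0.90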
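
Before sending a chat request it helps to check that the server has finished loading and to see which model name it expects in the "model" field. The sketch below is one way to do that with the Python standard library. It assumes the server exposes the OpenAI-style GET /v1/models listing and that the container is reachable on the host port mapped above; the function names are made up for illustration.

```python
import json
import urllib.request


def models_url(host: str, port: int) -> str:
    """Build the URL of the OpenAI-style model listing endpoint."""
    return f"http://{host}:{port}/v1/models"


def served_model_ids(listing: dict) -> list:
    """Extract the model ids from a /v1/models response body."""
    return [entry["id"] for entry in listing.get("data", [])]


def fetch_served_model_ids(host: str, port: int, timeout: float = 5.0) -> list:
    """Query the running server.

    urllib raises URLError while the container is still starting,
    so callers can retry until this returns.
    """
    with urllib.request.urlopen(models_url(host, port), timeout=timeout) as resp:
        return served_model_ids(json.load(resp))
```

Calling fetch_served_model_ids("127.0.0.1", 8192) against the container started above should return the name to use as "model" in the request body; the host and port here are assumptions that match the -p 8192:8000 mapping.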

Test the service

Endpoint: http://{ip}:{port}/v1/chat/completions (the host port is 8192 in the command above)
POST request body:

{
    "model": "/Qwen2.5-7B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": "介绍一下你自己"
        }
    ],
    "max_tokens": 200,
    "stream": false
}
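
The request body above can be sent with any HTTP client. As a minimal sketch using only the Python standard library, the helper below builds the same JSON payload and posts it to /v1/chat/completions. The base URL passed in by the caller and the function names are assumptions for illustration, not part of the original setup.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 200) -> urllib.request.Request:
    """Build a POST request carrying the same fields as the JSON body above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": False,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request: urllib.request.Request,
                      timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

For the container above, a call might look like send_chat_request(build_chat_request("http://127.0.0.1:8192", "/Qwen2.5-7B-Instruct", "介绍一下你自己")), where the prompt means "introduce yourself".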

Response:

{
    "id": "chatcmpl-f8f9d2835b84420db9ec956d4f884320",
    "object": "chat.completion",
    "created": 1758552650,
    "model": "/Qwen2.5-7B-Instruct/snapshots/a09a35458c702b33eeacc393d103063234e8bc28/",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "您好！我叫Qwen，是阿里云推出的一种超大规模语言模型。我能够回答问题、创作文字，还能表达观点、撰写代码，是您的工作和生活中的得力助手。如果您有任何问题或需要帮助，请随时告诉我，我会尽力提供支持。",
                "refusal": null,
                "annotations": null,
                "audio": null,
                "function_call": null,
                "tool_calls": [],
                "reasoning_content": null
            },
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": null,
            "token_ids": null
        }
    ],
    "service_tier": null,
    "system_fingerprint": null,
    "usage": {
        "prompt_tokens": 31,
        "total_tokens": 89,
        "completion_tokens": 58,
        "prompt_tokens_details": null
    },
    "prompt_logprobs": null,
    "prompt_token_ids": null,
    "kv_transfer_params": null
}
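
Most of the fields in the response are null for a plain chat call. A small sketch of pulling out the parts that usually matter (the reply text, the finish reason and the token counts) could look like this; the function name and the shape of the returned dict are illustrative choices.

```python
def summarize_chat_response(response: dict) -> dict:
    """Keep only the reply text, finish reason and token usage."""
    choice = response["choices"][0]
    usage = response.get("usage") or {}
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice.get("finish_reason"),
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
        "total_tokens": usage.get("total_tokens"),
    }
```

Applied to the response above, this would report 31 prompt tokens, 58 completion tokens and a finish reason of "stop".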

GGUF quantization

Set up the llama.cpp environment

GitHub repository: https://github.com/ggml-org/llama.cpp

# Clone the repository at tag b6545
git clone -b b6545 https://github.com/ggml-org/llama.cpp.git

# Build llama.cpp with CUDA support
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=OFF
cmake --build build --config Release -j20

# Verify the build
cd build/bin
./llama-quantize --help

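Quantize the model

# Run both steps from the llama.cpp repository root (cd ../.. if you are still in build/bin)
# The conversion script needs the Python packages listed in requirements.txt (pip install -r requirements.txt)
# Convert the Hugging Face model to an FP16 GGUF file
python3 convert_hf_to_gguf.py Qwen2.5-7B-Instruct --outtype f16 --outfile Qwen2.5-7B-Instruct-FP16.gguf

# Quantize the FP16 GGUF model to Q4_0
./build/bin/llama-quantize ./Qwen2.5-7B-Instruct-FP16.gguf ./Qwen2.5-7B-Instruct-Q4_0.gguf Q4_0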
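
If the conversion has to be repeated for several quantization types, the two commands can be generated from a script. The following is only a sketch: it assumes llama.cpp was built into build/bin as shown earlier, and the helper names and directory arguments are placeholders for illustration.

```python
import subprocess
from pathlib import Path


def conversion_commands(llama_cpp_dir: str, hf_model_dir: str, out_dir: str,
                        quant_type: str = "Q4_0") -> list:
    """Return the convert and quantize command lines as argument lists."""
    repo = Path(llama_cpp_dir)
    model_name = Path(hf_model_dir).name
    f16_path = Path(out_dir) / f"{model_name}-FP16.gguf"
    quant_path = Path(out_dir) / f"{model_name}-{quant_type}.gguf"
    convert = [
        "python3", str(repo / "convert_hf_to_gguf.py"), str(hf_model_dir),
        "--outtype", "f16", "--outfile", str(f16_path),
    ]
    quantize = [
        str(repo / "build" / "bin" / "llama-quantize"),
        str(f16_path), str(quant_path), quant_type,
    ]
    return [convert, quantize]


def run_all(commands: list) -> None:
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

For example, run_all(conversion_commands("llama.cpp", "Qwen2.5-7B-Instruct", ".", "Q4_K_M")) would produce a Q4_K_M file next to the FP16 one, assuming both directories exist relative to the current working directory.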

llama-quantize supports the following types:

Allowed quantization types:
   2  or  Q4_0    :  4.34G, +0.4685 ppl @ Llama-3-8B
   3  or  Q4_1    :  4.78G, +0.4511 ppl @ Llama-3-8B
  38  or  MXFP4_MOE :  MXFP4 MoE
   8  or  Q5_0    :  5.21G, +0.1316 ppl @ Llama-3-8B
   9  or  Q5_1    :  5.65G, +0.1062 ppl @ Llama-3-8B
  19  or  IQ2_XXS :  2.06 bpw quantization
  20  or  IQ2_XS  :  2.31 bpw quantization
  28  or  IQ2_S   :  2.5  bpw quantization
  29  or  IQ2_M   :  2.7  bpw quantization
  24  or  IQ1_S   :  1.56 bpw quantization
  31  or  IQ1_M   :  1.75 bpw quantization
  36  or  TQ1_0   :  1.69 bpw ternarization
  37  or  TQ2_0   :  2.06 bpw ternarization
  10  or  Q2_K    :  2.96G, +3.5199 ppl @ Llama-3-8B
  21  or  Q2_K_S  :  2.96G, +3.1836 ppl @ Llama-3-8B
  23  or  IQ3_XXS :  3.06 bpw quantization
  26  or  IQ3_S   :  3.44 bpw quantization
  27  or  IQ3_M   :  3.66 bpw quantization mix
  12  or  Q3_K    : alias for Q3_K_M
  22  or  IQ3_XS  :  3.3 bpw quantization
  11  or  Q3_K_S  :  3.41G, +1.6321 ppl @ Llama-3-8B
  12  or  Q3_K_M  :  3.74G, +0.6569 ppl @ Llama-3-8B
  13  or  Q3_K_L  :  4.03G, +0.5562 ppl @ Llama-3-8B
  25  or  IQ4_NL  :  4.50 bpw non-linear quantization
  30  or  IQ4_XS  :  4.25 bpw non-linear quantization
  15  or  Q4_K    : alias for Q4_K_M
  14  or  Q4_K_S  :  4.37G, +0.2689 ppl @ Llama-3-8B
  15  or  Q4_K_M  :  4.58G, +0.1754 ppl @ Llama-3-8B
  17  or  Q5_K    : alias for Q5_K_M
  16  or  Q5_K_S  :  5.21G, +0.1049 ppl @ Llama-3-8B
  17  or  Q5_K_M  :  5.33G, +0.0569 ppl @ Llama-3-8B
  18  or  Q6_K    :  6.14G, +0.0217 ppl @ Llama-3-8B
   7  or  Q8_0    :  7.96G, +0.0026 ppl @ Llama-3-8B
   1  or  F16     : 14.00G, +0.0020 ppl @ Mistral-7B
  32  or  BF16    : 14.00G, -0.0050 ppl @ Mistral-7B
   0  or  F32     : 26.00G              @ 7B
          COPY    : only copy tensors, no quantizing
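
The sizes in this list follow roughly from the number of bits stored per weight, which is also why quantization saves GPU memory. As a back-of-the-envelope sketch: Q4_0 stores blocks of 32 weights as 16 bytes of 4-bit values plus one 2-byte scale, which gives 4.5 bits per weight. The parameter count used in the usage note is an assumption taken from the Qwen2.5-7B model card (about 7.61 billion); real GGUF files come out somewhat larger because some tensors are kept at higher precision and the file also carries metadata.

```python
GIB = 1024 ** 3


def q4_0_bits_per_weight() -> float:
    """Bits per weight for Q4_0: 32 weights per block, 16 + 2 bytes per block."""
    block_weights = 32
    block_bytes = 16 + 2  # 32 four-bit values plus one fp16 scale
    return block_bytes * 8 / block_weights


def approx_weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a given bits-per-weight figure."""
    return n_params * bits_per_weight / 8 / GIB
```

With the assumed 7.61e9 parameters, approx_weight_gib(7.61e9, 16) gives roughly 14.2 GiB for FP16 and approx_weight_gib(7.61e9, q4_0_bits_per_weight()) gives roughly 4.0 GiB for Q4_0.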

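Test the quantized model

Start the container. The container name qwen2.5-test and host port 8192 are the same as in the earlier example, so remove that container first (docker rm -f qwen2.5-test). Replace /path/to/Qwen2.5-7B-Instruct-Q4_0 with the absolute host path of the directory that holds the .gguf file:

docker run -idt --restart=always -e TZ="Asia/Shanghai" --gpus device=0 --name qwen2.5-test -v /path/to/Qwen2.5-7B-Instruct-Q4_0:/Qwen2.5-7B-Instruct-Q4_0 -p 8192:8000 --ipc=host vllm/vllm-openai:v0.10.2 --model /Qwen2.5-7B-Instruct-Q4_0/Qwen2.5-7B-Instruct-Q4_0.gguf --gpu-memory-utilization 0.90

Endpoint: http://{ip}:{port}/v1/chat/completions
POST request body: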

{
    "model": "/Qwen2.5-7B-Instruct-Q4_0/Qwen2.5-7B-Instruct-Q4_0.gguf",
    "messages": [
        {
            "role": "user",
            "content": "介绍一下你自己"
        }
    ],
    "max_tokens": 200,
    "stream": false
}

Response:

{
    "id": "chatcmpl-60b9025597c346518e6fc73e8fb431db",
    "object": "chat.completion",
    "created": 1758553950,
    "model": "/Qwen2.5-7B-Instruct-Q4_0/Qwen2.5-7B-Instruct-Q4_0.gguf",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "我是阿里云开发的一种超大规模语言模型，我叫Qwen。作为一个AI助手，我的主要功能是生成各种类型的文本，比如文章、故事、诗歌、故事等，并能够根据与用户的对话内容回答问题、表达观点、提供帮助。作为一款来自中国的大规模语言模型，我希望能够用自然、流畅的方式与人类进行交流，理解并满足用户的需求。同时，我也在不断学习和进步，以更好地服务大家。如果您有任何问题或需要帮助，欢迎随时与我交流！",
                "refusal": null,
                "annotations": null,
                "audio": null,
                "function_call": null,
                "tool_calls": [],
                "reasoning_content": null
            },
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": 151645,
            "token_ids": null
        }
    ],
    "service_tier": null,
    "system_fingerprint": null,
    "usage": {
        "prompt_tokens": 31,
        "total_tokens": 139,
        "completion_tokens": 108,
        "prompt_tokens_details": null
    },
    "prompt_logprobs": null,
    "prompt_token_ids": null,
    "kv_transfer_params": null
}
