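This document records launch commands for large models, with a separate entry for each quantization precision and each inference engine of a given model.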

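The GPU model listed with each entry is the environment the model was tested on; test on your own GPUs to confirm whether the model runs there.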
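
Before trying a command on different hardware, it can help to check which GPUs are visible and whether there are enough of them for the tensor-parallel size in the command. The following is a minimal sketch, assuming nvidia-smi is installed on the host; list_gpus and has_enough_gpus are hypothetical helper names, not part of any tool used in this document.

```shell
# Print index, name and total memory for each visible GPU, one per line.
list_gpus() {
  nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader
}

# Succeed if at least $1 non-empty lines (one GPU per line) arrive on stdin.
has_enough_gpus() {
  required="$1"
  count=$(grep -c . || true)
  [ "$count" -ge "$required" ]
}

# Example: check for 4 GPUs before launching with a tensor-parallel size of 4.
# list_gpus | has_enough_gpus 4 && echo "enough GPUs"
```

The count only says that the devices exist; whether a given model fits still depends on GPU memory and the quantization used.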

Embedding

Qwen/Qwen3-VL-Embedding-8B

HuggingFace: Qwen/Qwen3-VL-Embedding-8B
ModelScope: Qwen/Qwen3-VL-Embedding-8B
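
Most launch commands in this document read the weights from /model/<org>--<name>. A sketch of downloading a model into that layout follows, assuming huggingface-cli (from the huggingface_hub package) is installed. The helper names model_dir and download_model are hypothetical, and the directory convention is inferred from the paths in the commands rather than stated by the author.

```shell
# Map a repo id such as "Qwen/Qwen3-VL-Embedding-8B" to the directory
# used by the launch commands: /model/Qwen--Qwen3-VL-Embedding-8B.
model_dir() {
  printf '/model/%s\n' "$(printf '%s' "$1" | sed 's|/|--|g')"
}

# Download a repo into that directory (assumes huggingface-cli is on PATH).
download_model() {
  huggingface-cli download "$1" --local-dir "$(model_dir "$1")"
}

# Example:
# download_model Qwen/Qwen3-VL-Embedding-8B
```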

SGLang

GPU: GB300

docker run -itd --name qwen3-vl-embedding-8b \
    -v /model:/model \
    -e SGLANG_DISABLE_CUDNN_CHECK=1 \
    --ipc=host \
    --network host \
    --shm-size 500G \
    --gpus all \
    lmsysorg/sglang:v0.5.7-cu130-arm64-runtime \
    python -m sglang.launch_server \
      --model-path /model/Qwen--Qwen3-VL-Embedding-8B \
      --served-model-name qwen3-vl-embedding-8b \
      --tp-size 2 \
      --host 0.0.0.0 \
      --port 30000 \
      --is-embedding \
      --trust-remote-code \
      --skip-server-warmup \
      --mem-fraction-static 0.85
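
The container starts in the background (-itd), so the server may still be loading weights when the first request is sent. Below is a minimal sketch of a readiness loop, assuming the server answers GET /health with a success status once it is ready; check this against the engine version you run. wait_for_server is a hypothetical helper name.

```shell
# Poll BASE_URL/health once per second until it succeeds or RETRIES run out.
# Usage: wait_for_server BASE_URL [RETRIES]
wait_for_server() {
  base_url="$1"
  retries="${2:-600}"
  i=0
  while [ "$i" -lt "$retries" ]; do
    # -f makes curl fail on HTTP errors; --max-time bounds each attempt.
    if curl -sf -o /dev/null --max-time 5 "$base_url/health"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Example:
# wait_for_server http://localhost:30000 && echo "server is ready"
```

If the loop times out, docker logs for the container usually shows whether the model is still loading or the launch failed.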

Verification:

curl -X POST http://localhost:30000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-vl-embedding-8b",
    "input": "What is the capital of France?",
    "encoding_format": "float"
  }' | jq '.data[0].embedding | length'

Rerank

BAAI/bge-reranker-v2-m3

HuggingFace: BAAI/bge-reranker-v2-m3
ModelScope: BAAI/bge-reranker-v2-m3

SGLang

docker run -itd --name bge-reranker-v2-m3 \
    -v /model:/model \
    -e SGLANG_DISABLE_CUDNN_CHECK=1 \
    -e CUDA_VISIBLE_DEVICES=2,3 \
    --ipc=host \
    --network host \
    --shm-size 500G \
    --gpus all \
    lmsysorg/sglang:v0.5.7-cu130-arm64-runtime \
    python -m sglang.launch_server \
      --model-path /model/BAAI--bge-reranker-v2-m3 \
      --served-model-name bge-reranker-v2-m3 \
      --host 0.0.0.0 \
      --port 30000 \
      --disable-radix-cache \
      --chunked-prefill-size -1 \
      --attention-backend triton \
      --tp-size 2 \
      --is-embedding \
      --trust-remote-code \
      --mem-fraction-static 0.85 \
      --skip-server-warmup

Verification:

curl -X POST http://localhost:30000/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bge-reranker-v2-m3",
    "query": "法国首都是哪里?",
    "documents": [
      "法国的首都是巴黎。",
      "德国的首都是柏林。",
      "香蕉是黄色的水果。"
    ],
    "return_documents": true
  }'

LLM

DeepSeek-V3.2

TensorRT-LLM

HuggingFace: deepseek-ai/DeepSeek-V3.2
ModelScope: deepseek-ai/DeepSeek-V3.2

mkdir -p /model

cat > /model/extra-llm-api-config-dp-mtp-noreuse-fp8.yml <<EOF
enable_attention_dp: true
cuda_graph_config:
  enable_padding: true
  max_batch_size: 128
kv_cache_config:
  dtype: fp8
  enable_block_reuse: false
  tokens_per_block: 64
moe_config:
  backend: DEEPGEMM
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 1
EOF

docker run -itd \
  --name deepseek-v32 \
  --ipc=host \
  --gpus all \
  --network host \
  -v /model:/model \
  nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc3 \
  mpirun -n 1 \
    --allow-run-as-root \
    --oversubscribe \
    trtllm-serve /model/DeepSeek-V3.2 \
      --host 0.0.0.0 \
      --port 30000 \
      --backend pytorch \
      --tp_size 4 \
      --ep_size 4 \
      --pp_size 1 \
      --gpus_per_node 4 \
      --extra_llm_api_options /model/extra-llm-api-config-dp-mtp-noreuse-fp8.yml \
      --kv_cache_free_gpu_memory_fraction 0.8 \
      --max_num_tokens 61440

Verification:

curl http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-v3.2",
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
            {"role": "user", "content": [{"type": "text", "text": "你好,介绍一下你自己"}]}
        ]
    }'

ZhipuAI/GLM-4.7

Tengyunw/GLM-4.7-NVFP4

HuggingFace: Tengyunw/GLM-4.7-NVFP4

SGLang

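GPU: GB300

With 2 GPUs the model starts without any changes.

With 4 GPUs, the file /sgl-workspace/sglang/python/sglang/srt/layers/quantization/modelopt_quant.py must be modified first; the modified content is shown in the figure below (line numbers differ between SGLang versions). The command below is the 4-GPU variant and mounts the modified file from /model/modelopt_quant.py.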
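
To get a copy of modelopt_quant.py to edit, one option is to read it out of the image itself, so that the copy matches the SGLang version being run. This is a sketch under that assumption; extract_from_image and the DOCKER override are hypothetical conveniences, and the sketch does not apply the modification itself.

```shell
# DOCKER can be overridden (for example DOCKER=echo) to print the command
# instead of running it.
DOCKER="${DOCKER:-docker}"

# Copy one file out of an image by running cat as the entrypoint.
# The output goes to a temporary file first so a failed run does not
# leave a truncated file at the destination.
# Usage: extract_from_image IMAGE PATH_IN_IMAGE DEST_ON_HOST
extract_from_image() {
  image="$1"
  src="$2"
  dest="$3"
  "$DOCKER" run --rm --entrypoint cat "$image" "$src" > "$dest.tmp" &&
    mv "$dest.tmp" "$dest"
}

# Example (image and paths taken from the launch command below):
# extract_from_image lmsysorg/sglang:v0.5.7-cu130-arm64-runtime \
#   /sgl-workspace/sglang/python/sglang/srt/layers/quantization/modelopt_quant.py \
#   /model/modelopt_quant.py
```

After editing /model/modelopt_quant.py on the host, the -v mount in the launch command replaces the file inside the container.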

docker run -itd --name glm-47 \
    -v /model:/model \
    -v /model/modelopt_quant.py:/sgl-workspace/sglang/python/sglang/srt/layers/quantization/modelopt_quant.py \
    --ipc=host \
    --network host \
    --shm-size 500G \
    --gpus all \
    lmsysorg/sglang:v0.5.7-cu130-arm64-runtime \
    python -m sglang.launch_server \
      --model-path /model/Tengyunw--GLM-4.7-NVFP4 \
      --served-model-name glm-4.7 \
      --tp-size 4 \
      --tool-call-parser glm47 \
      --reasoning-parser glm45 \
      --host 0.0.0.0 \
      --port 30000 \
      --trust-remote-code \
      --mem-fraction-static 0.85 \
      --attention-backend trtllm_mha \
      --quantization modelopt_fp4

Verification:

curl http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "glm-4.7",
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
            {"role": "user", "content": [{"type": "text", "text": "你是谁?介绍一下你自己"}]}
        ]
    }'

MiniMax/MiniMax-M2.1

lukealonso/MiniMax-M2.1-NVFP4

HuggingFace: lukealonso/MiniMax-M2.1-NVFP4

SGLang

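GPU: GB300

With 2 GPUs the model starts without any changes.

With 4 GPUs, the file /sgl-workspace/sglang/python/sglang/srt/layers/quantization/modelopt_quant.py must be modified first; the modified content is shown in the figure below (line numbers differ between SGLang versions). The command below is the 4-GPU variant and mounts the modified file from /model/modelopt_quant.py.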

docker run -itd --name minimax-m21 \
    -v /model:/model \
    -v /model/modelopt_quant.py:/sgl-workspace/sglang/python/sglang/srt/layers/quantization/modelopt_quant.py \
    --ipc=host \
    --network host \
    --shm-size 500G \
    --gpus all \
    lmsysorg/sglang:v0.5.7-cu130-arm64-runtime \
    python -m sglang.launch_server \
      --model-path /model/lukealonso--MiniMax-M2.1-NVFP4 \
      --served-model-name minimax-m2.1 \
      --tp-size 4 \
      --tool-call-parser minimax-m2 \
      --reasoning-parser minimax-append-think \
      --host 0.0.0.0 \
      --port 30000 \
      --trust-remote-code \
      --mem-fraction-static 0.85 \
      --attention-backend trtllm_mha \
      --quantization modelopt_fp4

Verification:

curl http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "minimax-m2.1",
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
            {"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
        ]
    }'

Qwen/Qwen3-Coder-480B-A35B-Instruct

OPENZEKA/Qwen3-Coder-480B-A35B-Instruct-NVFP4

HuggingFace: OPENZEKA/Qwen3-Coder-480B-A35B-Instruct-NVFP4

TensorRT-LLM

GPU: GB300

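A model served through TensorRT-LLM cannot be given a custom model name for requests the way vLLM or SGLang allow; however, any value is accepted in the model field of a request.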
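
To see which model id the running server itself reports, the model list can be queried. This is a sketch assuming the server exposes the OpenAI-compatible GET /v1/models route; list_models is a hypothetical helper name.

```shell
# Print the raw JSON model list returned by an OpenAI-compatible server.
# Usage: list_models BASE_URL
list_models() {
  curl -sf --max-time 10 "$1/v1/models"
}

# Example:
# list_models http://127.0.0.1:30000
```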

docker run -itd \
  --name qwen3-coder \
  --ipc=host \
  --gpus all \
  --network host \
  -v /model:/model \
  nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc3 \
  mpirun -n 1 \
    --allow-run-as-root \
    --oversubscribe \
    trtllm-serve /model/OPENZEKA--Qwen3-Coder-480B-A35B-Instruct-NVFP4 \
    --host '0.0.0.0' \
    --port 30000 \
    --backend pytorch \
    --tp_size 4 \
    --ep_size 4 \
    --pp_size 1 \
    --gpus_per_node 4 \
    --max_num_tokens 262144 \
    --kv_cache_free_gpu_memory_fraction 0.8

Verification (the model parameter here can be set to any value):

curl http://127.0.0.1:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen3-coder",
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
            {"role": "user", "content": [{"type": "text", "text": "你好,介绍一下你自己"}]}
        ]
    }'