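LLM Launch Command Cheat Sheet
A record of launch commands for large models. The same model is recorded separately for each quantization precision and each inference engine.
The GPU listed with an entry is the one the model was tested on; check on your own hardware whether the command runs there.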
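To compare your own hardware against the GPU noted for an entry, it helps to list the installed GPUs and their total memory first. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH; the function names `list_gpus` and `total_gpu_mem_mib` are illustrative and not part of any tool.

```shell
#!/usr/bin/env bash

# Print index, name and total memory of every GPU the driver can see.
list_gpus() {
  nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader
}

# Sum the memory of all visible GPUs in MiB.
# "nounits" drops the " MiB" suffix so awk can add the values.
total_gpu_mem_mib() {
  nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits \
    | awk '{ sum += $1 } END { print sum + 0 }'
}
```

The total is only a rough upper bound: the model weights have to fit with room left over for the KV cache.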
Embedding
Qwen/Qwen3-VL-Embedding-8B
HuggingFace: Qwen/Qwen3-VL-Embedding-8B
ModelScope: Qwen/Qwen3-VL-Embedding-8B
SGLang
GPU: GB300
docker run -itd --name qwen3-vl-embedding-8b \
-v /model:/model \
-e SGLANG_DISABLE_CUDNN_CHECK=1 \
--ipc=host \
--network host \
--shm-size 500G \
--gpus all \
lmsysorg/sglang:v0.5.7-cu130-arm64-runtime \
python -m sglang.launch_server \
--model-path /model/Qwen--Qwen3-VL-Embedding-8B \
--served-model-name qwen3-vl-embedding-8b \
--tp-size 2 \
--host 0.0.0.0 \
--port 30000 \
--is-embedding \
--trust-remote-code \
--skip-server-warmup \
--mem-fraction-static 0.85
Verify:
curl -X POST http://localhost:30000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-vl-embedding-8b",
"input": "What is the capital of France?",
"encoding_format": "float"
}' | jq '.data[0].embedding | length'
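Because the container is started detached (`-itd`), a verification request can fail until the model has finished loading. The following is a minimal sketch of a readiness poll, assuming the server answers on a `/health` route (SGLang provides one); the function name `wait_for_server` and the default timeout and interval are illustrative choices.

```shell
#!/usr/bin/env bash

# Poll the health endpoint until it answers with HTTP 2xx or the timeout expires.
# Usage: wait_for_server [base_url] [timeout_seconds] [interval_seconds]
wait_for_server() {
  local base_url="${1:-http://localhost:30000}"
  local timeout="${2:-600}"
  local interval="${3:-5}"
  local waited=0

  # curl -f turns HTTP errors into a non-zero exit code; -s hides progress output.
  until curl -sf -o /dev/null "${base_url}/health"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "server at ${base_url} not ready after ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "server at ${base_url} is ready"
}
```

Running `wait_for_server http://localhost:30000 && curl ...` delays the verification request until the server responds.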
Rerank
BAAI/bge-reranker-v2-m3
HuggingFace: BAAI/bge-reranker-v2-m3
ModelScope: BAAI/bge-reranker-v2-m3
SGLang
docker run -itd --name bge-reranker-v2-m3 \
-v /model:/model \
-e SGLANG_DISABLE_CUDNN_CHECK=1 \
-e CUDA_VISIBLE_DEVICES=2,3 \
--ipc=host \
--network host \
--shm-size 500G \
--gpus all \
lmsysorg/sglang:v0.5.7-cu130-arm64-runtime \
python -m sglang.launch_server \
--model-path /model/BAAI--bge-reranker-v2-m3 \
--served-model-name bge-reranker-v2-m3 \
--host 0.0.0.0 \
--port 30000 \
--disable-radix-cache \
--chunked-prefill-size -1 \
--attention-backend triton \
--tp-size 2 \
--is-embedding \
--trust-remote-code \
--mem-fraction-static 0.85 \
--skip-server-warmup
Verify:
curl -X POST http://localhost:30000/v1/rerank \
-H "Content-Type: application/json" \
-d '{
"model": "bge-reranker-v2-m3",
"query": "What is the capital of France?",
"documents": [
"The capital of France is Paris.",
"The capital of Germany is Berlin.",
"Bananas are yellow fruit."
],
"return_documents": true
}'
LLM
DeepSeek-V3.2
HuggingFace: deepseek-ai/DeepSeek-V3.2
ModelScope: deepseek-ai/DeepSeek-V3.2
TensorRT-LLM
mkdir -p /model
cat > /model/extra-llm-api-config-dp-mtp-noreuse-fp8.yml <<EOF
enable_attention_dp: true
cuda_graph_config:
  enable_padding: true
  max_batch_size: 128
kv_cache_config:
  dtype: fp8
  enable_block_reuse: false
  tokens_per_block: 64
moe_config:
  backend: DEEPGEMM
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 1
EOF
docker run -itd \
--name deepseek-v32 \
--ipc=host \
--gpus all \
--network host \
-v /model:/model \
nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc3 \
mpirun -n 1 \
--allow-run-as-root \
--oversubscribe \
trtllm-serve /model/DeepSeek-V3.2 \
--host 0.0.0.0 \
--port 30000 \
--backend pytorch \
--tp_size 4 \
--ep_size 4 \
--pp_size 1 \
--gpus_per_node 4 \
--extra_llm_api_options /model/extra-llm-api-config-dp-mtp-noreuse-fp8.yml \
--kv_cache_free_gpu_memory_fraction 0.8 \
--max_num_tokens 61440
Verify:
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-v3.2",
"messages": [
{"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
{"role": "user", "content": [{"type": "text", "text": "Hello, please introduce yourself"}]}
]
}'
ZhipuAI/GLM-4.7
Tengyunw/GLM-4.7-NVFP4
HuggingFace: Tengyunw/GLM-4.7-NVFP4
SGLang
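GPU: GB300
With 2 GPUs the model starts without any changes.
With 4 GPUs, /sgl-workspace/sglang/python/sglang/srt/layers/quantization/modelopt_quant.py has to be modified first; the modified content is shown in the figure below (line numbers differ between SGLang versions). The command below bind-mounts the modified copy from /model/modelopt_quant.py over the file in the container.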

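To edit modelopt_quant.py on the host, the file first has to be copied out of the image. The following is one possible way to do that, not necessarily how the original copy was produced: a minimal sketch using `docker create`, `docker cp` and `docker rm`. The function name `extract_from_image` is illustrative.

```shell
#!/usr/bin/env bash

# Copy one file out of an image without running anything inside it.
# Usage: extract_from_image <image> <path-in-image> <destination-on-host>
extract_from_image() {
  local image="$1" src="$2" dest="$3"
  local cid status=0

  # "docker create" prepares a stopped container and prints its ID.
  cid="$(docker create "$image")" || return 1
  # "docker cp" also works on stopped containers.
  docker cp "${cid}:${src}" "$dest" || status=$?
  # Remove the temporary container whether or not the copy succeeded.
  docker rm "$cid" >/dev/null
  return "$status"
}

# Example call with the image and paths used in this entry:
# extract_from_image \
#   lmsysorg/sglang:v0.5.7-cu130-arm64-runtime \
#   /sgl-workspace/sglang/python/sglang/srt/layers/quantization/modelopt_quant.py \
#   /model/modelopt_quant.py
```

After the copy, apply the modification to /model/modelopt_quant.py on the host; the bind mount then makes the container use the edited file.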
docker run -itd --name glm-47 \
-v /model:/model \
-v /model/modelopt_quant.py:/sgl-workspace/sglang/python/sglang/srt/layers/quantization/modelopt_quant.py \
--ipc=host \
--network host \
--shm-size 500G \
--gpus all \
lmsysorg/sglang:v0.5.7-cu130-arm64-runtime \
python -m sglang.launch_server \
--model-path /model/Tengyunw--GLM-4.7-NVFP4 \
--served-model-name glm-4.7 \
--tp-size 4 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--host 0.0.0.0 \
--port 30000 \
--trust-remote-code \
--mem-fraction-static 0.85 \
--attention-backend trtllm_mha \
--quantization modelopt_fp4
Verify:
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "glm-4.7",
"messages": [
{"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
{"role": "user", "content": [{"type": "text", "text": "Who are you? Please introduce yourself"}]}
]
}'
MiniMax/MiniMax-M2.1
lukealonso/MiniMax-M2.1-NVFP4
HuggingFace: lukealonso/MiniMax-M2.1-NVFP4
SGLang
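GPU: GB300
With 2 GPUs the model starts without any changes.
With 4 GPUs, /sgl-workspace/sglang/python/sglang/srt/layers/quantization/modelopt_quant.py has to be modified first; the modified content is shown in the figure below (line numbers differ between SGLang versions). The command below bind-mounts the modified copy from /model/modelopt_quant.py over the file in the container.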

docker run -itd --name minimax-m21 \
-v /model:/model \
-v /model/modelopt_quant.py:/sgl-workspace/sglang/python/sglang/srt/layers/quantization/modelopt_quant.py \
--ipc=host \
--network host \
--shm-size 500G \
--gpus all \
lmsysorg/sglang:v0.5.7-cu130-arm64-runtime \
python -m sglang.launch_server \
--model-path /model/lukealonso--MiniMax-M2.1-NVFP4 \
--served-model-name minimax-m2.1 \
--tp-size 4 \
--tool-call-parser minimax-m2 \
--reasoning-parser minimax-append-think \
--host 0.0.0.0 \
--port 30000 \
--trust-remote-code \
--mem-fraction-static 0.85 \
--attention-backend trtllm_mha \
--quantization modelopt_fp4
Verify:
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "minimax-m2.1",
"messages": [
{"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
{"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
]
}'
Qwen/Qwen3-Coder-480B-A35B-Instruct
OPENZEKA/Qwen3-Coder-480B-A35B-Instruct-NVFP4
HuggingFace: OPENZEKA/Qwen3-Coder-480B-A35B-Instruct-NVFP4
TensorRT-LLM
GPU: GB300
Unlike vLLM or SGLang, a model served through TensorRT-LLM cannot be given a custom served model name; requests may pass any value in the model field.
docker run -itd \
--name qwen3-coder \
--ipc=host \
--gpus all \
--network host \
-v /model:/model \
nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc3 \
mpirun -n 1 \
--allow-run-as-root \
--oversubscribe \
trtllm-serve /model/OPENZEKA--Qwen3-Coder-480B-A35B-Instruct-NVFP4 \
--host '0.0.0.0' \
--port 30000 \
--backend pytorch \
--tp_size 4 \
--ep_size 4 \
--pp_size 1 \
--gpus_per_node 4 \
--max_num_tokens 262144 \
--kv_cache_free_gpu_memory_fraction 0.8
Verify (the model field can be set to any value):
curl http://127.0.0.1:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-coder",
"messages": [
{"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
{"role": "user", "content": [{"type": "text", "text": "Hello, please introduce yourself"}]}
]
}'
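Since the model field is not checked here, it can be useful to see which identifier the server itself reports. The following is a minimal sketch, assuming the trtllm-serve build in this image exposes the OpenAI-compatible GET /v1/models route. The function name `served_model_ids` is illustrative, and the grep/sed extraction is a rough text match rather than a real JSON parser.

```shell
#!/usr/bin/env bash

# Print the "id" of every model the server lists, one per line.
# Usage: served_model_ids [base_url]
served_model_ids() {
  local base_url="${1:-http://127.0.0.1:30000}"
  curl -sf "${base_url}/v1/models" \
    | grep -o '"id" *: *"[^"]*"' \
    | sed -e 's/^"id" *: *"//' -e 's/"$//'
}
```

The same call can be pointed at the SGLang entries in this document to confirm the value passed with --served-model-name.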