[Experiment report] sglang, vllm, and transformers under forced-serial inference

We consider workloads that must run strictly serially: one request has to finish inference before the next one starts.

  • Libraries covered: transformers, vllm, sglang

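  • With speculative sampling / without speculative sampling.

    Speculative sampling here means eagle3. It is easy to find eagle heads trained on English corpora. Note: an eagle head trained on English corpora performs very poorly on Chinese prompts, but it can still reach accept-length > 1.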

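  • The base model is qwen3-8b, running on a single L40. (The results on huggingface, which routinely report several hundred tps, scared me to tears.) Sampling uses a temperature not equal to 0; the model output may differ from run to run, but that clearly does not affect the tps statistics. Everything runs in 16-bit precision.

    The main inference parameters should be the ones mentioned above, but in practice many other factors play a role and it is hard to fully control the variables. Since this is a casual note, I will leave it at that.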

  • The only metric is tokens per second (tps). The goal is to get a feel for the order of magnitude, so I did not bother with mean \(\pm\) std statistics.

  • The input is 11 prompts

  1. transformers + eagle3
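    That is, clone the EAGLE github repo and generate with its eagenerate.
    There is no official timing tool, so tps is computed by measuring the wall-clock time of eagenerate, counting the generated tokens, and dividing the two.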

    Generation time: 19.689866304397583s for 1128 tokens, speed: 57.28835242258957 tokens/s
    Generation time: 24.006053924560547s for 1469 tokens, speed: 61.192897617257664 tokens/s
    Generation time: 33.72415637969971s for 2217 tokens, speed: 65.73922784127895 tokens/s
    Generation time: 24.192238330841064s for 1477 tokens, speed: 61.05263927220292 tokens/s
    Generation time: 21.344391345977783s for 1268 tokens, speed: 59.40670686957521 tokens/s
    Generation time: 16.566300868988037s for 1122 tokens, speed: 67.72785360311629 tokens/s
    Generation time: 25.769388437271118s for 1559 tokens, speed: 60.49813730717671 tokens/s
    Generation time: 35.69959473609924s for 2051 tokens, speed: 57.4516325790679 tokens/s
    Generation time: 24.897949934005737s for 1422 tokens, speed: 57.11313597180247 tokens/s
    Generation time: 16.427077054977417s for 855 tokens, speed: 52.04821266367253 tokens/s
    Generation time: 26.052607536315918s for 1550 tokens, speed: 59.49500439982387 tokens/s
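
The tps figures above come from dividing a token count by a wall-clock time. A minimal sketch of that measurement follows. `measure_tps`, `generate_fn`, and `fake_generate` are hypothetical names for illustration and are not part of the EAGLE repo; the sketch assumes the wrapped call returns only the newly generated token ids and blocks until the GPU work has finished.

```python
import time
from typing import Callable, Sequence, Tuple


def measure_tps(generate_fn: Callable[[], Sequence[int]]) -> Tuple[float, int, float]:
    """Run generate_fn once; return (seconds, new_token_count, tokens_per_second).

    generate_fn is a hypothetical wrapper around the real generation call.
    It must return only the new token ids (prompt excluded) and must not
    return before the GPU work is done, otherwise the time is underestimated.
    """
    start = time.perf_counter()
    new_tokens = generate_fn()
    elapsed = time.perf_counter() - start
    count = len(new_tokens)
    return elapsed, count, count / elapsed


def fake_generate() -> list:
    """Stand-in for a real model call so the sketch runs without a GPU."""
    time.sleep(0.05)
    return list(range(20))


seconds, count, tps = measure_tps(fake_generate)
print(f"Generation time: {seconds}s for {count} tokens, speed: {tps} tokens/s")
```

With a real model, the wrapper would subtract the prompt length from the output length before counting, since a generation call may return the prompt tokens as well.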
    

    GPU memory use is the standard sum: the base_model cost + the eagle_head cost + the kv-cache reserved for max_length tokens
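
To get a feel for the kv-cache term, here is a rough estimate for one sequence. The layer count, KV head count, and head dimension below are assumptions about the qwen3-8b config and should be checked against the model's config.json; the `MAX_LENGTH` value is also an assumption, since the note does not state the max_length that was used.

```python
def kv_cache_bytes(max_length, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to store keys and values for max_length tokens of one sequence."""
    # Factor 2 = one key tensor plus one value tensor per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * max_length


# Assumed config values (not taken from the note); verify against config.json.
ASSUMED_CONFIG = dict(num_layers=36, num_kv_heads=8, head_dim=128)
MAX_LENGTH = 8192  # hypothetical reservation length

total = kv_cache_bytes(MAX_LENGTH, **ASSUMED_CONFIG)  # bytes_per_elem=2 for 16-bit
print(f"kv-cache for {MAX_LENGTH} tokens: {total / 2**30:.3f} GiB")
```

Under these assumptions the reserved kv-cache is small next to the 16-bit weights of an 8B-parameter model, so the base_model term dominates.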

  2. vllm + nothing

    [00:27<00:00, 27.20s/it, est. speed input: 103.04 toks/s, output: 42.86 toks/s]
    [00:15<00:00, 15.15s/it, est. speed input: 166.97 toks/s, output: 42.70 toks/s]
    [00:34<00:00, 34.53s/it, est. speed input: 82.83 toks/s, output: 42.89 toks/s]
    [00:38<00:00, 38.48s/it, est. speed input: 80.00 toks/s, output: 42.73 toks/s]
    [00:19<00:00, 19.70s/it, est. speed input: 147.42 toks/s, output: 42.59 toks/s]
    [00:36<00:00, 36.50s/it, est. speed input: 102.99 toks/s, output: 42.27 toks/s]
    [00:24<00:00, 24.61s/it, est. speed input: 107.77 toks/s, output: 42.91 toks/s]
    [00:51<00:00, 51.44s/it, est. speed input: 57.66 toks/s, output: 42.77 toks/s]
    [00:33<00:00, 33.09s/it, est. speed input: 89.75 toks/s, output: 42.76 toks/s]
    [00:36<00:00, 36.10s/it, est. speed input: 89.97 toks/s, output: 42.60 toks/s]
    [00:51<00:00, 51.08s/it, est. speed input: 55.79 toks/s, output: 42.83 toks/s]
    
  3. sglang + nothing

    Decode batch, #running-req: 1, #token: 3778, token usage: 0.46, cuda graph: True, gen throughput (token/s): 44.34
    Decode batch, #running-req: 1, #token: 4284, token usage: 0.52, cuda graph: True, gen throughput (token/s): 44.16
    Decode batch, #running-req: 1, #token: 4780, token usage: 0.58, cuda graph: True, gen throughput (token/s): 43.91
    Decode batch, #running-req: 1, #token: 5326, token usage: 0.65, cuda graph: True, gen throughput (token/s): 43.71
    Decode batch, #running-req: 1, #token: 4643, token usage: 0.57, cuda graph: True, gen throughput (token/s): 43.93
    Decode batch, #running-req: 1, #token: 4403, token usage: 0.54, cuda graph: True, gen throughput (token/s): 44.13
    Decode batch, #running-req: 1, #token: 4644, token usage: 0.57, cuda graph: True, gen throughput (token/s): 43.94
    Decode batch, #running-req: 1, #token: 4403, token usage: 0.54, cuda graph: True, gen throughput (token/s): 44.07
    Decode batch, #running-req: 1, #token: 4418, token usage: 0.54, cuda graph: True, gen throughput (token/s): 44.11
    Decode batch, #running-req: 1, #token: 5092, token usage: 0.62, cuda graph: True, gen throughput (token/s): 43.81
    Decode batch, #running-req: 1, #token: 5012, token usage: 0.61, cuda graph: True, gen throughput (token/s): 43.84
    

    This looks about 1 tps better than vllm + nothing. The gap is not significant and may just be sampling bias, so we ignore it.
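
The comparison above can be made less eyeball-based by pulling the throughput numbers out of the two log formats. This is a minimal sketch: the regular expression is written against the log lines shown in this note only, and `throughputs` is a hypothetical helper name.

```python
import re
from statistics import mean

# Matches vllm progress bars ("output: 42.86 toks/s") and
# sglang decode logs ("gen throughput (token/s): 44.34").
PATTERN = re.compile(r"(?:output: |gen throughput \(token/s\): )(\d+(?:\.\d+)?)")


def throughputs(log_text: str) -> list:
    """Return every generation throughput found in log_text, in order."""
    return [float(x) for x in PATTERN.findall(log_text)]


# Two sample lines per engine, copied from the logs above.
vllm_log = """
[00:27<00:00, 27.20s/it, est. speed input: 103.04 toks/s, output: 42.86 toks/s]
[00:15<00:00, 15.15s/it, est. speed input: 166.97 toks/s, output: 42.70 toks/s]
"""
sglang_log = """
Decode batch, #running-req: 1, #token: 3778, token usage: 0.46, cuda graph: True, gen throughput (token/s): 44.34
Decode batch, #running-req: 1, #token: 4284, token usage: 0.52, cuda graph: True, gen throughput (token/s): 44.16
"""

vllm_tps = throughputs(vllm_log)
sglang_tps = throughputs(sglang_log)
print(f"vllm mean: {mean(vllm_tps):.2f}, sglang mean: {mean(sglang_tps):.2f}")
```

Fed all 11 lines of each run, the means come out at roughly 42.7 and 44.0 tokens/s, a gap of about 1.3 tokens/s (around 3%).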

  4. vllm + eagle3

    [00:14<00:00, 14.67s/it, est. speed input: 225.82 toks/s, output: 56.59 toks/s]
    [00:23<00:00, 23.12s/it, est. speed input: 107.67 toks/s, output: 59.56 toks/s]
    [00:30<00:00, 30.37s/it, est. speed input: 75.89 toks/s, output: 62.78 toks/s]
    [00:30<00:00, 30.48s/it, est. speed input: 85.35 toks/s, output: 60.75 toks/s]
    [00:21<00:00, 21.64s/it, est. speed input: 142.03 toks/s, output: 60.78 toks/s]
    [00:31<00:00, 31.46s/it, est. speed input: 108.05 toks/s, output: 69.17 toks/s]
    [00:32<00:00, 32.62s/it, est. speed input: 95.64 toks/s, output: 62.65 toks/s]
    [00:39<00:00, 39.54s/it, est. speed input: 83.36 toks/s, output: 61.13 toks/s]
    [00:31<00:00, 31.13s/it, est. speed input: 106.44 toks/s, output: 61.07 toks/s]
    [00:30<00:00, 30.32s/it, est. speed input: 101.60 toks/s, output: 62.59 toks/s]
    

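    With eagle3 added, tokens per second jumps from the earlier 42~43 to about 59~62 (with outliers at 56.59 and 69.17)

    There are github issues asking about the size of this gain, because an improvement of nearly 50% is actually far below expectations. Compare the throughput of sglang + eagle below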
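
As a quick check of the "nearly 50%" figure, the sketch below averages the output throughputs from the two vllm runs in this note and takes their ratio. The eagle3 run lists 10 lines against 11 for the baseline, so the two means are not over identical prompt sets.

```python
from statistics import mean

# Output tokens/s copied from the vllm logs in this note.
vllm_plain = [42.86, 42.70, 42.89, 42.73, 42.59, 42.27,
              42.91, 42.77, 42.76, 42.60, 42.83]
vllm_eagle3 = [56.59, 59.56, 62.78, 60.75, 60.78,
               69.17, 62.65, 61.13, 61.07, 62.59]

speedup = mean(vllm_eagle3) / mean(vllm_plain)
print(f"{mean(vllm_plain):.2f} -> {mean(vllm_eagle3):.2f} tokens/s, "
      f"speedup {speedup:.2f}x")
```

On these numbers the ratio is about 1.44x, i.e. a gain of roughly 44%.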

  5. sglang + eagle

    Decode batch, #running-req: 1, #token: 4365, token usage: 0.53, accept len: 3.45, accept rate: 0.06, cuda graph: True, gen throughput (token/s): 80.28,
    Decode batch, #running-req: 1, #token: 3652, token usage: 0.45, accept len: 3.23, accept rate: 0.05, cuda graph: True, gen throughput (token/s): 75.12,
    Decode batch, #running-req: 1, #token: 4962, token usage: 0.61, accept len: 4.22, accept rate: 0.07, cuda graph: True, gen throughput (token/s): 97.56,
    Decode batch, #running-req: 1, #token: 5539, token usage: 0.68, accept len: 3.08, accept rate: 0.05, cuda graph: True, gen throughput (token/s): 71.04,
    Decode batch, #running-req: 1, #token: 5156, token usage: 0.63, accept len: 3.42, accept rate: 0.06, cuda graph: True, gen throughput (token/s): 79.04,
    Decode batch, #running-req: 1, #token: 4107, token usage: 0.50, accept len: 4.38, accept rate: 0.07, cuda graph: True, gen throughput (token/s): 101.87,
    Decode batch, #running-req: 1, #token: 4976, token usage: 0.61, accept len: 4.00, accept rate: 0.07, cuda graph: True, gen throughput (token/s): 92.61,
    Decode batch, #running-req: 1, #token: 4957, token usage: 0.61, accept len: 4.40, accept rate: 0.07, cuda graph: True, gen throughput (token/s): 101.80,
    Decode batch, #running-req: 1, #token: 4508, token usage: 0.55, accept len: 4.65, accept rate: 0.08, cuda graph: True, gen throughput (token/s): 108.06,
    Decode batch, #running-req: 1, #token: 4950, token usage: 0.60, accept len: 4.10, accept rate: 0.07, cuda graph: True, gen throughput (token/s): 94.86,
    Decode batch, #running-req: 1, #token: 5085, token usage: 0.62, accept len: 3.65, accept rate: 0.06, cuda graph: True, gen throughput (token/s): 84.59,
    

    tps goes from about 44 to 71~108, i.e. roughly 1.6x~2.5x the sglang + nothing throughput. The effect is truly outstanding.
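
One way to read the sglang + eagle logs is to divide the throughput by the accept length. The sketch below assumes that "accept len" is the average number of tokens produced per draft-and-verify step, so the quotient is the number of such steps per second; this reading is an assumption, not something the logs state.

```python
# (accept len, gen throughput in token/s) pairs copied from the logs above.
pairs = [
    (3.45, 80.28), (3.23, 75.12), (4.22, 97.56), (3.08, 71.04),
    (3.42, 79.04), (4.38, 101.87), (4.00, 92.61), (4.40, 101.80),
    (4.65, 108.06), (4.10, 94.86), (3.65, 84.59),
]

# Under the stated assumption: throughput = accept_len * steps_per_second.
steps_per_second = [tps / accept_len for accept_len, tps in pairs]

print(f"steps/s: min {min(steps_per_second):.2f}, max {max(steps_per_second):.2f}")
```

On these 11 lines the quotient stays within a narrow band just above 23 steps per second, compared with about 44 single-token steps per second for sglang + nothing. Under this reading each speculative step costs a bit under twice a plain decode step, and the variation in tps between prompts comes almost entirely from the accept length.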


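For certain reasons we can run with a concurrency of 2.

The EAGLE3 repo does not provide an implementation for batch size \(\neq\) 1, and I did not bother to write one, so the transformers + eagle data is missing for this setting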