
Ascend CANN ops-transformer MoE: NPU Fusion Optimization for Mixture-of-Experts Routing in Practice

news 2026/5/23 22:27:38

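MoE (Mixture of Experts) is a key technique for scaling large models: one huge FFN is split into many small experts, and each token activates only a few of them. DeepSeek-V3 uses 256 experts and routes each token through only 8, so the expert FFN compute per token is roughly 8/256 = 1/32 of that of a dense model of the same size. The price is complicated scheduling logic: router scoring → top-k selection → expert dispatch → result aggregation. The MoE operator in the ops-transformer repository fuses these steps into a single kernel.

MoE computation flow

Single MoE layer, standard implementation:

  1. Router scoring: gate_proj(x) → [batch, seq, num_experts]
     Inner product of x with each expert's gating vector.
  2. Top-K selection: for each token, pick the K highest-scoring experts.
     Outputs: expert_indices [batch, seq, K]
              expert_weights [batch, seq, K] (softmax-normalized)
  3. Token dispatch: send each token to its selected experts.
     Each expert receives a batch of tokens whose size is not fixed.
  4. Expert computation: each expert runs its own FFN (up_proj, gate_proj, down_proj).
     hidden = up_proj(x) × silu(gate_proj(x))
     output = down_proj(hidden)
  5. Result aggregation: weighted sum of expert outputs per token.
     y = Σ_k weight_k × expert_k(x)

Steps 3-5 are the bottleneck. Token dispatch moves a lot of data, expert computation consists of small matrix multiplications with a variable batch size, and result aggregation performs irregular memory accesses.

Design of the fused MoE kernel

ops-transformer fuses steps 1-5 into a single kernel. In simplified form:

    // ops-transformer/kernels/moe_fused.cpp
    __aicore__ void MoEFusedKernel(
        GlobalTensor<float> input,           // [batch, seq, hidden]
        GlobalTensor<float> gate_weights,    // [num_experts, hidden]
        GlobalTensor<float> expert_up,       // [num_experts, hidden, ffn_hidden]
        GlobalTensor<float> expert_gate,     // [num_experts, hidden, ffn_hidden]
        GlobalTensor<float> expert_down,     // [num_experts, ffn_hidden, hidden]
        GlobalTensor<float> output,          // [batch, seq, hidden]
        GlobalTensor<int> expert_indices,    // [batch, seq, top_k], output
        GlobalTensor<float> expert_weights,  // [batch, seq, top_k], output
        int batch, int seq_len, int hidden,
        int num_experts, int top_k, int ffn_hidden)
    {
        // Stage 1: router scoring + top-k. Tokens are independent,
        // so each block handles one token.
        for (int b = 0; b < batch; b++) {
            for (int s = 0; s < seq_len; s++) {
                int token_id = b * seq_len + s;

                // Router scores: x @ gate_weights^T
                // gate_weights: [num_experts, hidden] -> scores: [num_experts]
                LocalTensor<float> scores(num_experts);
                MatMul(scores, input[token_id * hidden], gate_weights);

                // Top-k selection: a partial sort is enough, no full sort needed.
                LocalTensor<int> topk_indices(top_k);
                LocalTensor<float> topk_scores(top_k);
                TopKPartial(scores, topk_indices, topk_scores, num_experts, top_k);

                // Softmax-normalize the top-k weights in place.
                Softmax(topk_scores, topk_scores);

                // Store the top-k results.
                Store(expert_indices[token_id * top_k], topk_indices);
                Store(expert_weights[token_id * top_k], topk_scores);
            }
        }

        // Stage 2: regroup tokens by expert.
        // Count how many tokens each expert receives.
        LocalTensor<int> expert_token_count(num_experts);
        expert_token_count = 0;  // zero-fill
        for (int b = 0; b < batch; b++) {
            for (int s = 0; s < seq_len; s++) {
                int token_id = b * seq_len + s;
                for (int k = 0; k < top_k; k++) {
                    int expert_id = expert_indices[token_id * top_k + k];
                    expert_token_count[expert_id]++;
                }
            }
        }

        // Build the expert -> token mapping in CSR form:
        // expert_tokens[expert_offset[e] .. expert_offset[e + 1]) = tokens of expert e
        LocalTensor<int> expert_tokens(batch * seq_len * top_k);
        LocalTensor<int> expert_offset(num_experts + 1);
        ExclusiveScan(expert_token_count, expert_offset);

        // Fill the token lists.
        LocalTensor<int> expert_cursor(num_experts);
        expert_cursor = 0;  // zero-fill
        for (int b = 0; b < batch; b++) {
            for (int s = 0; s < seq_len; s++) {
                int token_id = b * seq_len + s;
                for (int k = 0; k < top_k; k++) {
                    int expert_id = expert_indices[token_id * top_k + k];
                    int pos = expert_offset[expert_id] + expert_cursor[expert_id];
                    expert_tokens[pos] = token_id;
                    expert_cursor[expert_id]++;
                }
            }
        }

        // Stage 3: expert computation, parallel over experts
        // (each block handles one expert).
        for (int e = 0; e < num_experts; e++) {
            int num_tokens_for_expert = expert_token_count[e];
            if (num_tokens_for_expert == 0) continue;

            // Gather all tokens routed to this expert.
            LocalTensor<float> expert_input(num_tokens_for_expert * hidden);
            for (int t = 0; t < num_tokens_for_expert; t++) {
                int token_id = expert_tokens[expert_offset[e] + t];
                Copy(expert_input[t * hidden], input[token_id * hidden], hidden);
            }

            // FFN: small matrix multiplications on the Cube unit.
            // up = expert_input @ expert_up[e]
            LocalTensor<float> up(num_tokens_for_expert * ffn_hidden);
            MatMul(up, expert_input, expert_up[e]);

            // gate = expert_input @ expert_gate[e]
            LocalTensor<float> gate(num_tokens_for_expert * ffn_hidden);
            MatMul(gate, expert_input, expert_gate[e]);

            // Activation: silu(gate) * up.
            // Named `act` so it does not shadow the `hidden` size parameter.
            LocalTensor<float> act(num_tokens_for_expert * ffn_hidden);
            for (int i = 0; i < num_tokens_for_expert * ffn_hidden; i++) {
                act[i] = up[i] * Silu(gate[i]);
            }

            // expert_output = act @ expert_down[e]
            LocalTensor<float> expert_output(num_tokens_for_expert * hidden);
            MatMul(expert_output, act, expert_down[e]);

            // Stage 4: write back, aggregating per token.
            for (int t = 0; t < num_tokens_for_expert; t++) {
                int token_id = expert_tokens[expert_offset[e] + t];

                // Find this expert's slot in the token's top-k list to get its weight.
                int k_idx = FindKIndex(expert_indices, token_id, e);
                float weight = expert_weights[token_id * top_k + k_idx];

                // Weighted accumulation into the output row.
                AtomicAdd(output[token_id * hidden], expert_output[t * hidden], weight, hidden);
            }
        }
    }

Key fusion points:

  - Router scoring + top-k fusion: the intermediate scores are never written back to HBM.
  - Token regrouping + expert computation fusion: expert_input is reused directly from L1.
  - Atomic accumulation for aggregation: outputs of multiple experts are written back in parallel instead of being accumulated serially.

Top-K partial sort optimization

MoE does not need a full sort of all expert scores; it only needs the K largest. ops-transformer uses the QuickSelect algorithm:

    // ops-transformer/kernels/topk_partial.cpp
    __aicore__ void TopKPartial(
        LocalTensor<float> scores,   // [num_experts]
        LocalTensor<int> indices,    // [top_k], output
        LocalTensor<float> values,   // [top_k], output
        int num_experts, int k)
    {
        // QuickSelect permutes its input, so run it on a scratch copy;
        // otherwise scores[i] would no longer belong to expert i.
        LocalTensor<float> scratch(num_experts);
        Copy(scratch, scores, num_experts);

        // The k-th largest score serves as the pivot.
        float pivot = QuickSelect(scratch, num_experts, k);

        // Single scan: take every element strictly greater than the pivot.
        int count = 0;
        for (int i = 0; i < num_experts && count < k; i++) {
            if (scores[i] > pivot) {
                indices[count] = i;
                values[count] = scores[i];
                count++;
            }
        }

        // Fewer than k selected means some elements equal the pivot: fill up with those.
        while (count < k) {
            // Take the first element equal to the pivot that is not yet selected.
            for (int i = 0; i < num_experts; i++) {
                if (scores[i] == pivot && !IsSelected(indices, i, count)) {
                    indices[count] = i;
                    values[count] = scores[i];
                    count++;
                    break;
                }
            }
        }
    }

    // QuickSelect: k-th largest element (k is 1-based), O(N) on average.
    float QuickSelect(LocalTensor<float> arr, int n, int k)
    {
        int left = 0, right = n - 1;
        int target = k - 1;  // 0-based rank in descending order
        while (left < right) {
            // Pseudo-random pivot to avoid the worst case.
            int pivot_idx = left + (hash(left + right) % (right - left + 1));
            float pivot = arr[pivot_idx];

            // Three-way partition, descending:
            // [left, gt) > pivot, [gt, lt] == pivot, (lt, right] < pivot
            int gt = left, i = left, lt = right;
            while (i <= lt) {
                if (arr[i] > pivot) {
                    Swap(arr[gt], arr[i]); gt++; i++;
                } else if (arr[i] < pivot) {
                    Swap(arr[i], arr[lt]); lt--;
                } else {
                    i++;
                }
            }

            // Decide which part contains the target rank.
            if (target < gt) {
                right = gt - 1;
            } else if (target <= lt) {
                return pivot;  // the pivot sits exactly at rank k
            } else {
                left = lt + 1;
            }
        }
        return arr[left];
    }

A full sort costs O(N log N), while QuickSelect needs only O(N) on average. For num_experts = 256 and top_k = 8, QuickSelect is reported to be about 10× faster than sorting.

Balancing expert parallelism and token parallelism

MoE can be parallelized along two dimensions:

  - Token parallelism: routing and computation for different tokens are independent.
  - Expert parallelism: computation in different experts is independent.

The ops-transformer strategy:

  - Small batch (batch × seq < num_experts): mainly expert parallelism.
  - Large batch (batch × seq ≥ num_experts): mainly token parallelism.

    // Pick the parallel strategy automatically.
    void MoESelectParallelStrategy(int batch, int seq, int num_experts)
    {
        int total_tokens = batch * seq;
        if (total_tokens < num_experts) {
            // Expert parallelism: each block handles one expert.
            // Suits inference, e.g. batch=1, seq=512
            // (this branch is taken only when num_experts > 512).
            LaunchMoEExpertParallel(batch, seq, num_experts);
        } else {
            // Token parallelism: each block handles a batch of tokens.
            // Suits training, e.g. batch=8, seq=2048.
            LaunchMoETokenParallel(batch, seq, num_experts);
        }
    }

Pitfall 1: uneven token distribution causes load imbalance

Popular experts may receive a large share of the tokens while unpopular ones receive almost none. In the worst case, 95% of the tokens go to the same expert and the other experts sit idle.

Mitigation: a capacity factor.

    // Cap the number of tokens each expert may process.
    // Divide in floating point so the factor is applied before truncation.
    int max_tokens_per_expert = static_cast<int>(
        static_cast<float>(batch * seq * top_k) / num_experts * capacity_factor);

    // capacity_factor = 1.0: perfect balance, every expert handles the same number of tokens
    // capacity_factor = 1.5: allows 50% load fluctuation
    // capacity_factor = 2.0: allows a hot expert to handle twice its share

    // Tokens over capacity are either dropped or rerouted.
    if (expert_token_count[e] > max_tokens_per_expert) {
        // Strategy 1: drop the overflow tokens (loses information).
        // Strategy 2: route to the next-best expert (keeps information, slightly worse quality).
        int second_best = FindSecondBestExpert(token_id, scores);
        expert_indices[token_id * top_k + k] = second_best;
    }

Pitfall 2: precision loss in atomic accumulation

The outputs of multiple experts are accumulated with AtomicAdd. In FP16, 8 consecutive accumulations can lose up to about 5% of precision.

Fix: keep the intermediate result in FP32 while accumulating.

    // Wrong: accumulate directly in FP16.
    AtomicAdd(output_fp16, expert_output_fp16, weight);  // precision loss

    // Right: accumulate in FP32 and convert back once at the end.
    float acc_fp32 = float(output_fp16);
    acc_fp32 += float(expert_output_fp16) * weight;  // repeat for each of the token's top_k experts
    output_fp16 = half(acc_fp32);                    // single conversion after the last accumulation

The ops-transformer MoE kernel handles this precision conversion internally, so the FP32 intermediate is invisible to the user.

Pitfall 3: memory layout of expert weights

Every expert's FFN weights have the same shapes (up and gate: [hidden, ffn_hidden]; down: [ffn_hidden, hidden]), but in the weight file they are interleaved expert by expert:

    Weight file layout (poor fit for MoE):
      expert_0_up   [hidden, ffn]
      expert_0_gate [hidden, ffn]
      expert_0_down [ffn, hidden]
      expert_1_up   [hidden, ffn]
      ...

    Optimized layout (suits the fused MoE kernel):
      all_up   [num_experts, hidden, ffn]   ← stored contiguously
      all_gate [num_experts, hidden, ffn]
      all_down [num_experts, ffn, hidden]

ops-transformer provides a weight conversion script:

    python tools/convert_moe_weights.py \
        --input standard_format.pt \
        --output moe_optimized.pt \
        --layout interleaved

The converted weights can be fed straight into the fused MoE kernel with no reordering at runtime.

MoE is a core technique for scaling large models, but its scheduling logic is a performance killer: router scoring, token dispatch, and result aggregation take milliseconds on a CPU and must be squeezed down to microseconds on an NPU. The fused MoE operator in ops-transformer merges the four stages of routing, dispatch, computation, and aggregation into one kernel, and intermediate results never go back to HBM. QuickSelect-based top-k, adaptive switching between expert parallelism and token parallelism, and capacity-factor load balancing are what make MoE run fast on an NPU.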
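To make the routing, regrouping, and capacity steps above concrete, here is a minimal host-side C++ sketch that could serve as a reference when checking a fused kernel's expert_indices, expert_weights, and CSR token lists. It is not code from ops-transformer: the names `TopK`, `SelectTopK`, `ExpertGroups`, `GroupByExpert`, and `CapacityPerExpert` are hypothetical, the top-k uses `std::partial_sort` over an index array rather than QuickSelect, ties are assumed to break toward the lower expert id, and the capacity is assumed to round up.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Hypothetical helper types for illustration; not part of ops-transformer.
struct TopK {
    std::vector<int> indices;    // selected expert ids, best first
    std::vector<float> weights;  // softmax over the selected scores only
};

// Pick the k highest-scoring experts for one token and softmax their scores.
TopK SelectTopK(const std::vector<float>& scores, int k) {
    assert(k > 0 && k <= static_cast<int>(scores.size()));
    std::vector<int> order(scores.size());
    std::iota(order.begin(), order.end(), 0);
    // Partially sort expert ids by score (descending); the score array itself
    // is never permuted, so ids stay valid. Ties go to the lower id (assumption).
    std::partial_sort(order.begin(), order.begin() + k, order.end(),
                      [&](int a, int b) {
                          return scores[a] > scores[b] ||
                                 (scores[a] == scores[b] && a < b);
                      });
    TopK out;
    out.indices.assign(order.begin(), order.begin() + k);
    const float max_score = scores[out.indices[0]];  // subtract max for stability
    float sum = 0.0f;
    for (int idx : out.indices) {
        const float e = std::exp(scores[idx] - max_score);
        out.weights.push_back(e);
        sum += e;
    }
    for (float& w : out.weights) w /= sum;
    return out;
}

struct ExpertGroups {
    std::vector<int> offset;  // size num_experts + 1, CSR row pointers
    std::vector<int> tokens;  // token ids grouped by expert
};

// Counting sort of (token, expert) pairs by expert id, mirroring stage 2.
// expert_indices is row-major [num_tokens, top_k].
ExpertGroups GroupByExpert(const std::vector<int>& expert_indices,
                           int num_tokens, int top_k, int num_experts) {
    assert(static_cast<int>(expert_indices.size()) == num_tokens * top_k);
    ExpertGroups g;
    g.offset.assign(num_experts + 1, 0);
    for (int e : expert_indices) g.offset[e + 1]++;          // per-expert counts
    for (int e = 0; e < num_experts; ++e) g.offset[e + 1] += g.offset[e];  // prefix sum
    g.tokens.resize(expert_indices.size());
    std::vector<int> cursor(g.offset.begin(), g.offset.end() - 1);
    for (int t = 0; t < num_tokens; ++t)
        for (int k = 0; k < top_k; ++k)
            g.tokens[cursor[expert_indices[t * top_k + k]]++] = t;
    return g;
}

// Per-expert token cap; rounding up is an assumption of this sketch.
int CapacityPerExpert(int total_tokens, int top_k, int num_experts,
                      float capacity_factor) {
    const double even_share = static_cast<double>(total_tokens) * top_k / num_experts;
    return static_cast<int>(std::ceil(even_share * capacity_factor));
}
```

For example, with three tokens routed to experts {1, 3}, {1, 0}, and {3, 2} out of four experts, `GroupByExpert` yields offsets {0, 1, 3, 4, 6}, so expert 1 owns slots 1 and 2 of the token list. Any expert whose slot count exceeds `CapacityPerExpert(...)` would be the one to apply a drop or reroute policy to.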


http://www.zskr.cn/news/1360861.html