# MoE Compression
Unlike the dense-FFN pruning setting, the memory and deployment pressure of SMoE (Sparse Mixture of Experts) models comes mainly from the total parameter count of the experts and the cost of loading them. Recent work largely falls into three directions: expert-level or neuron-level removal and reallocation; router- or activation-guided pruning, either inside an expert or of whole experts; and expert merging from a subspace or output-space perspective. A toy sketch of router-guided expert pruning follows; a companion sketch of expert merging appears after the paper list.
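As a concrete illustration of the router-guided pruning family (a minimal sketch, not any specific paper's algorithm), the snippet below counts how often the router's top-k selection picks each expert on calibration data and keeps only the most frequently used ones. All names (`expert_usage_counts`, `keep_ratio`, etc.) are illustrative assumptions.

```python
# Minimal sketch: rank experts by routing frequency, keep the top fraction.
# Not any specific paper's method; names and hyperparameters are assumptions.
import torch

def expert_usage_counts(gate_logits: torch.Tensor, top_k: int = 2) -> torch.Tensor:
    """gate_logits: [num_tokens, num_experts] router scores from a calibration run."""
    num_experts = gate_logits.shape[-1]
    topk_idx = gate_logits.topk(top_k, dim=-1).indices   # [num_tokens, top_k]
    return torch.bincount(topk_idx.flatten(), minlength=num_experts)

def experts_to_keep(counts: torch.Tensor, keep_ratio: float = 0.75) -> torch.Tensor:
    """Indices of the most frequently routed experts, in ascending order."""
    num_keep = max(1, int(counts.numel() * keep_ratio))
    return counts.topk(num_keep).indices.sort().values

# Toy usage: 8 experts, top-2 routing, random "calibration" logits.
logits = torch.randn(10_000, 8)
counts = expert_usage_counts(logits, top_k=2)
keep = experts_to_keep(counts, keep_ratio=0.75)
print("usage:", counts.tolist(), "-> keep experts", keep.tolist())
```

After pruning, the router's output dimension has to be re-indexed to the surviving experts, and the dropped experts' weights can be freed or folded into the survivors.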
Recent papers in this area include:
- Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models
- MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router
- Delta Decompression for MoE-based LLMs Compression
- Cluster-Driven Expert Pruning for Mixture-of-Experts Large Language Models
- Sub-MoE: Efficient Mixture-of-Expert LLMs Compression via Subspace Expert Merging
- Dropping Experts, Recombining Neurons: Retraining-Free Pruning for Sparse Mixture-of-Experts LLMs
- Breaking the MoE LLM Trilemma: Dynamic Expert Clustering with Structured Compression
- REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression
- MergeMoE: Efficient Compression of MoE Models via Expert Output Merging
- Does a Global Perspective Help Prune Sparse MoEs Elegantly?
- μ-MoE: Test-Time Pruning as Micro-Grained Mixture-of-Experts
- MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping
- SlimMoE: Structured Compression of Large MoE Models via Expert Slimming and Distillation
- Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy
- Mixture Compressor for Mixture-of-Experts LLMs Gains More
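To make the merging direction concrete as well, here is a toy sketch of output-space expert merging (an assumption-laden illustration, not the Sub-MoE or MergeMoE algorithm): it finds the pair of experts whose outputs on calibration inputs are most similar, then averages their weights. `ExpertMLP` and every dimension and hyperparameter are made up for the example.

```python
# Toy sketch of output-space expert merging; purely illustrative.
import torch
import torch.nn as nn

class ExpertMLP(nn.Module):
    """Stand-in for one expert FFN; dimensions are arbitrary for the demo."""
    def __init__(self, d: int = 64, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, hidden), nn.GELU(), nn.Linear(hidden, d))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

@torch.no_grad()
def most_similar_pair(experts: list, calib_x: torch.Tensor) -> tuple:
    """Pair of experts whose outputs on calib_x are closest in cosine similarity."""
    outs = torch.stack([e(calib_x).flatten() for e in experts])  # [E, N*d]
    sim = torch.nn.functional.cosine_similarity(
        outs.unsqueeze(1), outs.unsqueeze(0), dim=-1)            # [E, E]
    sim.fill_diagonal_(-1.0)                                     # ignore self-pairs
    i, j = divmod(int(sim.argmax()), sim.shape[1])
    return i, j

@torch.no_grad()
def merge_into(experts: list, i: int, j: int) -> None:
    """Average expert j's weights into expert i; the caller then drops expert j."""
    for p_i, p_j in zip(experts[i].parameters(), experts[j].parameters()):
        p_i.mul_(0.5).add_(0.5 * p_j)

experts = [ExpertMLP() for _ in range(4)]
x = torch.randn(128, 64)
i, j = most_similar_pair(experts, x)
merge_into(experts, i, j)
del experts[j]
print(f"merged expert {j} into expert {i}; {len(experts)} experts remain")
```

A real system would also redirect the router's probability mass for the dropped expert to its merge partner; the uniform 0.5/0.5 averaging here is the simplest possible choice.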