Deploying the Qwen-Image model on EC2

news 2026/6/11 19:17:35

References

https://zhuanlan.zhihu.com/p/1937260131931882107
https://blog.csdn.net/qq_52065352/article/details/150394958
https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/Qwen_Image.pdf
https://modelscope.cn/aigc/imageGeneration?tab=advanced

Environment setup

The test environment is a g5.12xlarge instance with 4 A10G GPUs. The CUDA versions are:

# nvidia-smi
Tue Nov 18 10:52:16 2025
NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8
# nvcc --version
Cuda compilation tools, release 12.4, V12.4.131
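
The two version numbers answer different questions: `nvidia-smi` reports the highest CUDA version the installed driver supports, while `nvcc` reports the toolkit version actually installed. A minimal sketch, assuming the output formats shown above, that extracts both and checks that the toolkit is not newer than what the driver supports (the function names are illustrative):

```python
import re


def driver_cuda_version(nvidia_smi_output: str) -> tuple[int, int]:
    """Return the (major, minor) CUDA version from nvidia-smi's header line."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    if match is None:
        raise ValueError("no 'CUDA Version' field found")
    return int(match.group(1)), int(match.group(2))


def toolkit_cuda_version(nvcc_output: str) -> tuple[int, int]:
    """Return the (major, minor) toolkit version from `nvcc --version` output."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release' field found")
    return int(match.group(1)), int(match.group(2))


def toolkit_supported(nvidia_smi_output: str, nvcc_output: str) -> bool:
    """True when the toolkit version does not exceed the driver's supported version."""
    return toolkit_cuda_version(nvcc_output) <= driver_cuda_version(nvidia_smi_output)


smi = "NVIDIA-SMI 570.172.08  Driver Version: 570.172.08  CUDA Version: 12.8"
nvcc = "Cuda compilation tools, release 12.4, V12.4.131"
print(driver_cuda_version(smi), toolkit_cuda_version(nvcc), toolkit_supported(smi, nvcc))
```

With the values from this environment, toolkit 12.4 is within the driver's supported 12.8, so the combination is consistent.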

Download the model and upload it to EFS

modelscope download --model Qwen/Qwen-Image

Debug inside a container image

FROM public.ecr.aws/deep-learning-containers/pytorch-inference:2.6.0-gpu-py312
RUN pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
ENTRYPOINT ["sleep", "infinity"]

Build the image

docker pull public.ecr.aws/deep-learning-containers/pytorch-inference:2.6.0-gpu-py312
docker build -t qwenllm:imagev1 .
docker tag qwenllm:imagev1 xxxxxxx.dkr.ecr.cn-north-1.amazonaws.com.cn/qwenllm:imagev1
docker push xxxxxxx.dkr.ecr.cn-north-1.amazonaws.com.cn/qwenllm:imagev1

Start a pod from the template below; the repository and the model are mounted into the pod through EFS.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: diffuse-qwen-image
  namespace: aitao
  labels:
    app: diffuse
spec:
  replicas: 1
  selector:
    matchLabels:
      app: diffuse
  template:
    metadata:
      labels:
        app: diffuse
    spec:
      serviceAccount: sa-service-account-api
      nodeSelector:
        eks.amazonaws.com/nodegroup: llm-ng
      containers:
        - name: diffuse-container
          image: xxxxxxx.dkr.ecr.cn-north-1.amazonaws.com.cn/qwenllm:imagev1
          ports:
            - containerPort: 8000
              name: http-api
          resources:
            limits:
              nvidia.com/gpu: 4
              memory: "64Gi"
              cpu: "24"
            requests:
              nvidia.com/gpu: 4
              memory: "32Gi"
              cpu: "16"
          volumeMounts:
            - name: persistent-storage
              mountPath: /efs
      volumes:
        - name: persistent-storage
          persistentVolumeClaim:
            claimName: efs-claim
      restartPolicy: Always

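Inference with the diffusers library

Use the following script.

If device_map is not set to "balanced", the pipeline loads onto a single GPU (an A10G has only 24 GB of memory) and runs out of memory.
Even with it set, only one GPU is busy during generation, and each step takes about 1 minute, which is slow.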

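import torch
from diffusers import DiffusionPipeline

model_path = "/efs/models/Qwen/Qwen-Image"
# "balanced" spreads the pipeline components across all visible GPUs
pipe = DiffusionPipeline.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="balanced",
)
...
# Prompt: "classical Chinese courtyard, bright sunshine, high-definition photorealistic"
image = pipe(
    prompt="中国古典庭院，阳光明媚，高清写实",
    width=width,
    height=height,
    num_inference_steps=20,
    true_cfg_scale=4.0,
    generator=generator,
).images[0]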
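
A rough weight-memory estimate suggests why a single A10G runs out of memory. The sketch below assumes approximate parameter counts (about 20B for the Qwen-Image diffusion transformer and a nominal 7B for the Qwen2.5-VL text encoder) and counts weights only, ignoring activations and the VAE. The numbers are illustrative, not measured.

```python
GIB = 1024 ** 3

# Assumed, approximate parameter counts (not measured from the checkpoint).
PARAMS = {
    "transformer": 20e9,   # Qwen-Image diffusion transformer
    "text_encoder": 7e9,   # Qwen2.5-VL 7B, nominal size
}
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_gib(num_params: float, dtype: str) -> float:
    """Memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / GIB


def total_weight_gib(dtype: str) -> float:
    """Sum of the weight memory of all listed components, in GiB."""
    return sum(weight_gib(n, dtype) for n in PARAMS.values())


def weights_fit(dtype: str, capacity_gib: float) -> bool:
    """True if all listed weights fit within the given memory capacity."""
    return total_weight_gib(dtype) <= capacity_gib


A10G_GIB = 22.3  # per-GPU capacity reported in the OOM message in this article
for dtype in BYTES_PER_PARAM:
    print(
        f"{dtype}: {total_weight_gib(dtype):.1f} GiB of weights, "
        f"one A10G: {weights_fit(dtype, A10G_GIB)}, "
        f"four A10G: {weights_fit(dtype, 4 * A10G_GIB)}"
    )
```

Under these assumptions the bf16 weights need roughly 50 GiB, which exceeds one A10G but fits within the combined memory of the four GPUs. That is consistent with `device_map="balanced"` avoiding the OOM.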

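The official multi-GPU example mainly works as follows:

It calls torch.multiprocessing.set_start_method('spawn', force=True) to set the multiprocessing start method to 'spawn'
Each GPU gets its own process (GPUWorker), which avoids the Python GIL and CUDA context conflicts
A MultiGPUManager class manages all GPU worker processes and assigns tasks automatically to idle workers
A queue provides simple load balancing

However, this code does not appear to run a single generation task in parallel across GPUs.

Each generation task is assigned to one GPU
Tasks are distributed to the GPU worker processes through a task queue (task_queue)
Each task runs on one GPU only, not across several GPUs in parallel

So the OOM still occurs: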

GPU 0 model initialization failed: CUDA out of memory. Tried to allocate 72.00 MiB. GPU 0 has a total capacity of 22.30 GiB of which 68.69 MiB is free. Process 3960521 has 22.23 GiB memory in use. Of the allocated memory 21.79 GiB is allocated by PyTorch, and 196.79 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
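
The dispatch pattern described above can be sketched in a few lines. This is a simplified illustration, not the official example's code: threads stand in for the spawned per-GPU processes so that it runs without GPUs, and `gpu_worker` and `dispatch` are hypothetical names. It shows that each task is consumed by exactly one worker, so the scheme raises throughput across requests but does not reduce the memory a single GPU needs.

```python
import queue
import threading


def gpu_worker(gpu_id: int, tasks: queue.Queue, results: list, lock: threading.Lock) -> None:
    """Pull tasks until a None sentinel arrives; each task is handled by this worker alone."""
    while True:
        task = tasks.get()
        if task is None:  # sentinel: no more work
            break
        # A real worker would run the whole pipeline on its own GPU here,
        # so the full model must fit in that single GPU's memory.
        with lock:
            results.append((task, gpu_id))


def dispatch(prompts: list[str], num_gpus: int) -> list[tuple[str, int]]:
    """Distribute prompts over workers through one shared queue."""
    tasks: queue.Queue = queue.Queue()
    results: list[tuple[str, int]] = []
    lock = threading.Lock()
    workers = [
        threading.Thread(target=gpu_worker, args=(i, tasks, results, lock))
        for i in range(num_gpus)
    ]
    for worker in workers:
        worker.start()
    for prompt in prompts:
        tasks.put(prompt)
    for _ in workers:
        tasks.put(None)  # one sentinel per worker
    for worker in workers:
        worker.join()
    return results


prompts = [f"prompt-{i}" for i in range(8)]
assignments = dispatch(prompts, num_gpus=4)
for task, gpu_id in sorted(assignments):
    print(f"{task} -> GPU {gpu_id}")
```

Which worker picks up which prompt varies between runs, but every prompt appears exactly once in the output.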

Using ComfyUI

https://comfyanonymous.github.io/ComfyUI_examples/qwen_image/

The same pod template as above is used.

# Install modelscope
pip install modelscope

Download the fp8 models

# Download the diffusion model
modelscope download --model Comfy-Org/Qwen-Image_ComfyUI split_files/diffusion_models/qwen_image_fp8_e4m3fn.safetensors
# Download the text encoder
modelscope download --model Comfy-Org/Qwen-Image_ComfyUI split_files/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors
# Download the VAE
modelscope download --model Comfy-Org/Qwen-Image_ComfyUI split_files/vae/qwen_image_vae.safetensors

$ tree
.
├── diffusion_models
│   └── qwen_image_fp8_e4m3fn.safetensors
├── text_encoders
│   └── qwen_2.5_vl_7b_fp8_scaled.safetensors
└── vae
    └── qwen_image_vae.safetensors

Copy the models into ComfyUI's models directory

aws s3 sync ./ s3://bucketname/Comfy-Org/Qwen-Image_ComfyUI/
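
ComfyUI looks for these files under `models/diffusion_models`, `models/text_encoders`, and `models/vae`, the same subfolder names as in the download. A minimal sketch of the copy step, assuming the files are available on a local path; `copy_models` is a hypothetical helper, and the paths in the usage comment are assumptions to adjust:

```python
import shutil
from pathlib import Path

# Subfolders produced by the download step; ComfyUI uses the same names under models/.
SUBDIRS = ("diffusion_models", "text_encoders", "vae")


def copy_models(split_files: Path, comfyui_root: Path) -> list[Path]:
    """Copy every .safetensors file into the matching ComfyUI models subfolder."""
    copied = []
    for sub in SUBDIRS:
        target_dir = comfyui_root / "models" / sub
        target_dir.mkdir(parents=True, exist_ok=True)
        for src in sorted((split_files / sub).glob("*.safetensors")):
            dst = target_dir / src.name
            shutil.copy2(src, dst)
            copied.append(dst)
    return copied


# Example call (paths are assumptions):
# copy_models(Path("./split_files"), Path("/efs/ComfyUI"))
```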

Start ComfyUI

git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI && pip install -i https://mirrors.ustc.edu.cn/pypi/simple/ -r requirements.txt
python main.py --port 8000

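A test generation takes about 38 s.

Resource usage shows that multiple GPUs are still not used; for now, a single powerful GPU with enough memory looks like the better choice.
Some model parameters are offloaded to the CPU, consuming about 20 GB of host memory, so in theory a single-GPU g5 instance can run this.

Download LoRA models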

/efs/ComfyUI/models/loras# modelscope download --model DiffSynth-Studio/Qwen-Image-LoRA-ArtAug-v1 model.safetensors --local_dir ./

The 4-step and 8-step LoRA models greatly speed up generation

https://www.modelscope.cn/models/lightx2v/Qwen-Image-Lightning/files

The result is as follows

To wrap this as an API later, refer to the documentation
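
One possible route for an API wrapper is ComfyUI's built-in HTTP endpoint, which accepts a workflow as JSON via `POST /prompt`. The sketch below is an assumption about how such a wrapper could start, not the approach from the documentation mentioned above. The host is assumed to be local, the port matches `--port 8000` used earlier, and the workflow is expected to be the JSON exported from the ComfyUI interface in API format.

```python
import json
import urllib.request


def build_prompt_request(workflow: dict, host: str = "127.0.0.1", port: int = 8000) -> urllib.request.Request:
    """Build a POST request that queues a workflow on a running ComfyUI server."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}:{port}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def queue_workflow(workflow: dict, host: str = "127.0.0.1", port: int = 8000) -> dict:
    """Send the workflow to the server and return its JSON reply."""
    with urllib.request.urlopen(build_prompt_request(workflow, host, port)) as resp:
        return json.loads(resp.read())


# The empty dict is a placeholder for a workflow exported in API format.
request = build_prompt_request({})
print(request.get_method(), request.full_url)
```

Only the request is built here; calling `queue_workflow` requires a running ComfyUI server and a real workflow.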
