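YOLOv26 Improvements | Attention Mechanisms | Boosting YOLOv26 with SCSA, the Latest Spatial and Channel Synergistic Attention (Including a Derived C2PSA Variant and Network Structure Diagram)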
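Before we start, a quick word about my column: it covers classification, detection, segmentation, tracking, and keypoint detection, and is currently available at a limited-time discount. The column is updated with 5-7 new mechanisms every week. Subscribers also get a file containing all of my improvements and access to a discussion group, where I regularly share methods and experience for publishing papers.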
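1. Introduction
This article introduces SCSA (Spatial and Channel Synergistic Attention), a recent attention module that combines spatial attention and channel attention. SCSA consists of two main parts: Shared Multi-Semantic Spatial Attention (SMSA) and Progressive Channel-wise Self-Attention (PCSA). In my view it is similar in spirit to CBAM. SCSA aims to combine the strengths of channel and spatial attention and to make full use of multi-semantic information, thereby improving performance on vision tasks. This article also provides a derived variant that embeds SCSA in the C2PSA module.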
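You are welcome to subscribe to my column and learn YOLO together!
Column link: the YOLOv26 effective-improvement column covers Conv, attention mechanisms, backbones, loss functions, optimizers, post-processing, and other improvements.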
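Contents
1. Introduction
2. Basic Principles
3. Core Code
4. Adding the Module
4.1 Modification 1
4.2 Modification 2
4.3 Modification 3
4.4 Modification 4
4.5 Modification 5
4.6 Modification 6
5. Training
5.1 YAML files
5.1.1 YAML file 1
5.1.2 YAML file 2
5.2 Training code
5.3 Training screenshots
6. Summary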
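2. Basic Principles
Paper: official paper link
Code: official code link
The paper studies the synergy between spatial attention and channel attention and proposes a new synergistic attention module, SCSA. SCSA consists of two main parts: Shared Multi-Semantic Spatial Attention (SMSA) and Progressive Channel-wise Self-Attention (PCSA).
Main principles of the SCSA mechanism
SCSA (Spatial and Channel Synergistic Attention) aims to combine the strengths of channel and spatial attention and to make full use of multi-semantic information, thereby improving performance on vision tasks. Its main principles are as follows: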
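1. Multi-semantic information integration
- A key feature of SCSA is the integration of multi-semantic information. Through Shared Multi-Semantic Spatial Attention (SMSA), it extracts rich semantic features from spatial information at multiple scales.
- SMSA applies multi-scale, depthwise, shared 1D convolutions to extract spatial information at several levels. This gives the channel self-attention a multi-semantic spatial prior and helps strengthen the representation of different semantics.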
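To make the SMSA idea concrete, here is a minimal pure-Python sketch for a single channel. It pools the map along each axis, filters the two 1D descriptors with a "same"-padded kernel, and gates the map with the results. The fixed averaging kernel and the omission of GroupNorm are simplifying assumptions of this sketch; the module in Section 3 uses learned depthwise `nn.Conv1d` weights and normalizes before the gate. The helper names are hypothetical.

```python
import math


def smooth_same(signal, k):
    """1D filtering with zero padding k // 2, so the length is preserved.

    Assumption: a fixed averaging kernel stands in for learned depthwise weights.
    """
    pad = k // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [sum(padded[i:i + k]) / k for i in range(len(signal))]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def axis_gates(fmap, k):
    """Return (height gate, width gate) for one channel given as an H x W list."""
    h, w = len(fmap), len(fmap[0])
    x_h = [sum(row) / w for row in fmap]  # mean over width  -> length H
    x_w = [sum(fmap[i][j] for i in range(h)) / h for j in range(w)]  # mean over height -> length W
    gate_h = [sigmoid(v) for v in smooth_same(x_h, k)]
    gate_w = [sigmoid(v) for v in smooth_same(x_w, k)]
    return gate_h, gate_w


def apply_gates(fmap, gate_h, gate_w):
    """Scale every position (i, j) by its row gate and its column gate."""
    return [[fmap[i][j] * gate_h[i] * gate_w[j] for j in range(len(gate_w))]
            for i in range(len(gate_h))]


# One kernel size per channel group, mirroring group_kernel_sizes = [3, 5, 7, 9].
fmap = [[float(i + j) for j in range(6)] for i in range(4)]  # toy 4 x 6 map
for k in (3, 5, 7, 9):
    gh, gw = axis_gates(fmap, k)
    out = apply_gates(fmap, gh, gw)
    print(k, len(gh), len(gw), len(out), len(out[0]))
```

Each kernel size sees a different neighbourhood of the same pooled descriptor, which is what "multi-semantic" refers to here; the output keeps the input's spatial shape for every kernel size.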
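2. Progressive compression strategy
- In SCSA, the SMSA module uses a progressive compression strategy to inject discriminative spatial information into PCSA (Progressive Channel-wise Self-Attention), which guides channel recalibration.
- This compression lowers the computational cost while keeping the key information of the spatial structure, so the channel attention can draw on more spatial priors when it is computed.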
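As a rough illustration of how much the compression saves, the sketch below counts spatial positions before and after a pooling step whose kernel and stride both equal the window size, which is how the implementation in Section 3 configures its default `avg_pool` mode with `window_size=7`. The function names are hypothetical, and the input size of 640 with strides 8/16/32 is an assumption taken from the usual P3/P4/P5 layout.

```python
def pooled_size(size: int, window: int) -> int:
    """Output length along one axis for pooling with kernel = stride = window, no padding."""
    if size < window:
        raise ValueError(f"feature map side {size} is smaller than window {window}")
    return (size - window) // window + 1


def token_counts(img_size: int = 640, strides=(8, 16, 32), window: int = 7) -> dict:
    """Map each stride to (positions before pooling, positions after pooling)."""
    counts = {}
    for stride in strides:
        side = img_size // stride
        pooled = pooled_size(side, window)
        counts[stride] = (side * side, pooled * pooled)
    return counts


if __name__ == "__main__":
    for stride, (before, after) in token_counts().items():
        print(f"stride {stride}: {before} positions -> {after} positions")
```

One practical consequence: a feature map whose side is smaller than the window leaves no output position, so very small input resolutions are likely to need a smaller `window_size` or the global-pooling mode (`window_size=-1`).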
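3. Progressive channel-similarity computation in channel self-attention
- The PCSA module uses an input-aware self-attention mechanism that computes the similarity between channels, which mitigates the semantic disparity between the different sub-features inside SMSA.
- PCSA combines the spatial knowledge provided by SMSA to refine the channel features through self-attention, improving the semantic consistency and discriminability of the channel attention.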
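The channel-similarity step can be sketched without any framework. The example below treats each channel as a vector over the pooled spatial positions, builds a C x C attention map from scaled dot products, mixes the channels with it, and squashes the spatial mean into a per-channel gate. Using the raw features as query, key, and value with a single head is an assumption of this sketch; the module in Section 3 first applies GroupNorm and learned 1x1 depthwise convolutions and splits the channels into heads.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def channel_gate(feat):
    """feat: C x N list (C channels, N pooled spatial positions), single head.

    Assumption: q = k = v = feat, i.e. no learned projections.
    """
    c, n = len(feat), len(feat[0])
    scale = c ** -0.5
    # C x C similarity between channels, softmax-normalized per row
    attn = [softmax([scale * sum(a * b for a, b in zip(feat[i], feat[j]))
                     for j in range(c)])
            for i in range(c)]
    # Mix channels: (C x C) @ (C x N) -> C x N
    mixed = [[sum(attn[i][j] * feat[j][t] for j in range(c)) for t in range(n)]
             for i in range(c)]
    # Average over positions, then gate each channel with a sigmoid
    return [1.0 / (1.0 + math.exp(-sum(row) / n)) for row in mixed]


feat = [[1.0, 0.0, 1.0, 0.0],
        [1.0, 0.0, 1.0, 0.0],
        [0.0, 2.0, 0.0, 2.0]]
gate = channel_gate(feat)
print([round(g, 4) for g in gate])
```

The attention map has size C x C regardless of the number of positions N, so shrinking N through pooling reduces the cost of building it without changing its shape.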
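4. Modular design and synergy
- SCSA is modular: SMSA and PCSA are used in series, so that spatial and channel attention work in synergy on the basis of dimension decoupling, lightweight multi-semantic guidance, and mitigation of semantic disparity.
- In this scheme, spatial attention guides channel learning so that the channels focus on the important spatial regions, while channel self-attention further enhances the detail of the spatial structure.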
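Summary
By combining the SMSA and PCSA modules, SCSA makes spatial and channel attention work together. In SMSA, multi-semantic spatial integration and the progressive compression strategy provide a spatial prior for channel attention. PCSA then uses this spatial information to refine the channel features through self-attention, mitigating the disparity between semantic levels. According to the paper's experiments, SCSA performs and generalizes well on image classification, object detection, and semantic segmentation, and outperforms current mainstream attention mechanisms. The design also suggests a direction for future research on multi-dimensional synergistic attention.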
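For readers who prefer formulas, the computation can be written compactly as follows. This is read off from the implementation in Section 3 rather than quoted from the paper, and the symbol names are my own: $X \in \mathbb{R}^{C \times H \times W}$ is the input, $k_i \in \{3, 5, 7, 9\}$ are the kernel sizes of the four channel groups, $d$ is the per-head dimension, and $N$ is the number of positions after pooling.

```latex
\[
\begin{aligned}
X_h &= \tfrac{1}{W} \textstyle\sum_{w} X_{:,:,w}, \qquad
X_w = \tfrac{1}{H} \textstyle\sum_{h} X_{:,h,:} \\
A_h &= \sigma\!\Big(\mathrm{GN}\big(\mathrm{Concat}_{i=1}^{4}\,
        \mathrm{DWConv1d}_{k_i}(X_h^{(i)})\big)\Big), \qquad
A_w = \sigma\!\Big(\mathrm{GN}\big(\mathrm{Concat}_{i=1}^{4}\,
        \mathrm{DWConv1d}_{k_i}(X_w^{(i)})\big)\Big) \\
X_s &= X \odot A_h \odot A_w \\
Y &= \mathrm{GN}\big(\mathrm{Pool}(X_s)\big), \qquad
Q, K, V = \mathrm{DWConv}_{1 \times 1}(Y) \in \mathbb{R}^{d \times N}
        \text{ per head} \\
Z &= \mathrm{softmax}\!\Big(\tfrac{Q K^{\top}}{\sqrt{d}}\Big) V \\
\mathrm{SCSA}(X) &= X_s \odot \sigma\!\Big(\tfrac{1}{N} \textstyle\sum_{n} Z_{:,n}\Big)
\end{aligned}
\]
```

The first three lines correspond to SMSA, where the same four convolutions are shared between the height and width branches, and the last three lines correspond to PCSA.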
The figure below shows several ways of inserting the module that the paper provides.
3. Core Code
See Section 4 for how to use the core code!
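Before reading the listing, it may help to see which values the `SCSA` constructor derives from `dim` alone. The sketch below mirrors that arithmetic and adds two explicit checks. The helper name `scsa_derived` and the checks are my own additions; the head rule (`dim // 64`, at least 1) and the four channel groups come from the code in this section.

```python
def scsa_derived(dim: int) -> dict:
    """Mirror the quantities SCSA derives from its channel count."""
    # The constructor ignores its head_num argument and recomputes it from dim.
    head_num = max(dim // 64, 1)
    if dim % 4 != 0:
        # forward() splits the channels into exactly four equal groups
        raise ValueError(f"dim={dim} is not divisible by 4")
    if dim % head_num != 0:
        # rearrange() needs head_num * head_dim == dim
        raise ValueError(f"dim={dim} is not divisible by head_num={head_num}")
    head_dim = dim // head_num
    return {
        "head_num": head_num,
        "head_dim": head_dim,
        "scaler": head_dim ** -0.5,
        "group_chans": dim // 4,
    }


if __name__ == "__main__":
    for dim in (32, 64, 128, 256):
        print(dim, scsa_derived(dim))
```

Channel counts that are multiples of 64 always pass both checks; an odd width such as 200 would give three heads that do not divide the channels evenly.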
import typing as t

import torch
import torch.nn as nn
from einops import rearrange
from mmengine.model import BaseModule

__all__ = ['SCSA', 'C2PSASCSA']


class SCSA(BaseModule):
    """Spatial and Channel Synergistic Attention: SMSA followed by PCSA."""

    def __init__(
            self,
            dim: int,
            head_num: int = 4,
            window_size: int = 7,
            group_kernel_sizes: t.List[int] = [3, 5, 7, 9],
            qkv_bias: bool = False,
            fuse_bn: bool = False,
            norm_cfg: t.Dict = dict(type='BN'),
            act_cfg: t.Dict = dict(type='ReLU'),
            down_sample_mode: str = 'avg_pool',
            attn_drop_ratio: float = 0.,
            gate_layer: str = 'sigmoid',
    ):
        super(SCSA, self).__init__()
        self.dim = dim
        # Note: the head_num argument is overridden; one head per 64 channels, at least one
        head_num = dim // 64
        if head_num == 0:
            head_num = 1
        self.head_num = head_num
        self.head_dim = dim // head_num
        self.scaler = self.head_dim ** -0.5
        self.group_kernel_sizes = group_kernel_sizes
        self.window_size = window_size
        self.qkv_bias = qkv_bias
        self.fuse_bn = fuse_bn
        self.down_sample_mode = down_sample_mode

        # The channels are split into four equal groups in forward()
        assert self.dim % 4 == 0, 'The dimension of input feature should be divisible by 4.'
        self.group_chans = group_chans = self.dim // 4

        # SMSA: one depthwise 1D conv per channel group, shared by the H and W branches
        self.local_dwc = nn.Conv1d(group_chans, group_chans, kernel_size=group_kernel_sizes[0],
                                   padding=group_kernel_sizes[0] // 2, groups=group_chans)
        self.global_dwc_s = nn.Conv1d(group_chans, group_chans, kernel_size=group_kernel_sizes[1],
                                      padding=group_kernel_sizes[1] // 2, groups=group_chans)
        self.global_dwc_m = nn.Conv1d(group_chans, group_chans, kernel_size=group_kernel_sizes[2],
                                      padding=group_kernel_sizes[2] // 2, groups=group_chans)
        self.global_dwc_l = nn.Conv1d(group_chans, group_chans, kernel_size=group_kernel_sizes[3],
                                      padding=group_kernel_sizes[3] // 2, groups=group_chans)
        self.sa_gate = nn.Softmax(dim=2) if gate_layer == 'softmax' else nn.Sigmoid()
        self.norm_h = nn.GroupNorm(4, dim)
        self.norm_w = nn.GroupNorm(4, dim)

        # PCSA: depthwise 1x1 projections for q, k and v
        self.conv_d = nn.Identity()
        self.norm = nn.GroupNorm(1, dim)
        self.q = nn.Conv2d(in_channels=dim, out_channels=dim, kernel_size=1, bias=qkv_bias, groups=dim)
        self.k = nn.Conv2d(in_channels=dim, out_channels=dim, kernel_size=1, bias=qkv_bias, groups=dim)
        self.v = nn.Conv2d(in_channels=dim, out_channels=dim, kernel_size=1, bias=qkv_bias, groups=dim)
        self.attn_drop = nn.Dropout(attn_drop_ratio)
        self.ca_gate = nn.Softmax(dim=1) if gate_layer == 'softmax' else nn.Sigmoid()

        if window_size == -1:
            self.down_func = nn.AdaptiveAvgPool2d((1, 1))
        else:
            if down_sample_mode == 'recombination':
                self.down_func = self.space_to_chans
                # dimensionality reduction
                self.conv_d = nn.Conv2d(in_channels=dim * window_size ** 2, out_channels=dim,
                                        kernel_size=1, bias=False)
            elif down_sample_mode == 'avg_pool':
                self.down_func = nn.AvgPool2d(kernel_size=(window_size, window_size), stride=window_size)
            elif down_sample_mode == 'max_pool':
                self.down_func = nn.MaxPool2d(kernel_size=(window_size, window_size), stride=window_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        The dim of x is (B, C, H, W)
        """
        # Spatial attention priority calculation
        b, c, h_, w_ = x.size()
        # (B, C, H)
        x_h = x.mean(dim=3)
        l_x_h, g_x_h_s, g_x_h_m, g_x_h_l = torch.split(x_h, self.group_chans, dim=1)
        # (B, C, W)
        x_w = x.mean(dim=2)
        l_x_w, g_x_w_s, g_x_w_m, g_x_w_l = torch.split(x_w, self.group_chans, dim=1)

        x_h_attn = self.sa_gate(self.norm_h(torch.cat((
            self.local_dwc(l_x_h),
            self.global_dwc_s(g_x_h_s),
            self.global_dwc_m(g_x_h_m),
            self.global_dwc_l(g_x_h_l),
        ), dim=1)))
        x_h_attn = x_h_attn.view(b, c, h_, 1)

        x_w_attn = self.sa_gate(self.norm_w(torch.cat((
            self.local_dwc(l_x_w),
            self.global_dwc_s(g_x_w_s),
            self.global_dwc_m(g_x_w_m),
            self.global_dwc_l(g_x_w_l)
        ), dim=1)))
        x_w_attn = x_w_attn.view(b, c, 1, w_)

        x = x * x_h_attn * x_w_attn

        # Channel attention based on self attention
        # reduce calculations
        y = self.down_func(x)
        y = self.conv_d(y)
        _, _, h_, w_ = y.size()

        # normalization first, then generate q, k and v
        y = self.norm(y)
        q = self.q(y)
        k = self.k(y)
        v = self.v(y)
        # (B, C, H, W) -> (B, head_num, head_dim, N)
        q = rearrange(q, 'b (head_num head_dim) h w -> b head_num head_dim (h w)', head_num=int(self.head_num),
                      head_dim=int(self.head_dim))
        k = rearrange(k, 'b (head_num head_dim) h w -> b head_num head_dim (h w)', head_num=int(self.head_num),
                      head_dim=int(self.head_dim))
        v = rearrange(v, 'b (head_num head_dim) h w -> b head_num head_dim (h w)', head_num=int(self.head_num),
                      head_dim=int(self.head_dim))

        # (B, head_num, head_dim, head_dim)
        attn = q @ k.transpose(-2, -1) * self.scaler
        attn = self.attn_drop(attn.softmax(dim=-1))
        # (B, head_num, head_dim, N)
        attn = attn @ v
        # (B, C, H_, W_)
        attn = rearrange(attn, 'b head_num head_dim (h w) -> b (head_num head_dim) h w', h=int(h_), w=int(w_))
        # (B, C, 1, 1)
        attn = attn.mean((2, 3), keepdim=True)
        attn = self.ca_gate(attn)
        return attn * x


def autopad(k, p=None, d=1):  # kernel, padding, dilation
    """Pad to 'same' shape outputs."""
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]  # actual kernel-size
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
    return p


class Conv(nn.Module):
    """Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)."""

    default_act = nn.SiLU()  # default activation

    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        """Initialize Conv layer with given arguments including activation."""
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()

    def forward(self, x):
        """Apply convolution, batch normalization and activation to input tensor."""
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):
        """Apply convolution and activation only (used after BatchNorm has been fused into the conv)."""
        return self.act(self.conv(x))


class PSABlock(nn.Module):
    """
    PSABlock variant that uses SCSA as its attention module.

    Applies attention and a feed-forward network, each with an optional shortcut connection.

    Attributes:
        attn (SCSA): Attention module.
        ffn (nn.Sequential): Feed-forward neural network module.
        add (bool): Flag indicating whether to add shortcut connections.
    """

    def __init__(self, c, attn_ratio=0.5, num_heads=4, shortcut=True) -> None:
        """Initializes the PSABlock; attn_ratio and num_heads are kept for interface compatibility and unused."""
        super().__init__()
        self.attn = SCSA(c)
        self.ffn = nn.Sequential(Conv(c, c * 2, 1), Conv(c * 2, c, 1, act=False))
        self.add = shortcut

    def forward(self, x):
        """Executes a forward pass through PSABlock, applying attention and feed-forward layers to the input tensor."""
        x = x + self.attn(x) if self.add else self.attn(x)
        x = x + self.ffn(x) if self.add else self.ffn(x)
        return x


class C2PSASCSA(nn.Module):
    """
    C2PSA module whose PSABlocks use SCSA attention.

    Attributes:
        c (int): Number of hidden channels.
        cv1 (Conv): 1x1 convolution layer that maps the input channels to 2*c.
        cv2 (Conv): 1x1 convolution layer that maps 2*c channels back to c1.
        m (nn.Sequential): Sequential container of PSABlock modules for attention and feed-forward operations.

    Notes:
        This module essentially is the same as PSA module, but refactored to allow stacking more PSABlock modules.
    """

    def __init__(self, c1, c2, n=1, e=0.5):
        """Initializes the C2PSA module with specified input/output channels, number of layers, and expansion ratio."""
        super().__init__()
        assert c1 == c2
        self.c = int(c1 * e)
        self.cv1 = Conv(c1, 2 * self.c, 1, 1)
        self.cv2 = Conv(2 * self.c, c1, 1)
        self.m = nn.Sequential(*(PSABlock(self.c, attn_ratio=0.5, num_heads=self.c // 64) for _ in range(n)))

    def forward(self, x):
        """Processes the input tensor 'x' through a series of PSA blocks and returns the transformed tensor."""
        a, b = self.cv1(x).split((self.c, self.c), dim=1)
        b = self.m(b)
        return self.cv2(torch.cat((a, b), 1))


if __name__ == '__main__':
    # Quick shape check: the output should have the same shape as the input
    x = torch.ones(8, 128, 32, 32)
    channels = x.shape[1]
    model = C2PSASCSA(channels, channels, 1)
    output = model(x)
    print(output.shape)
4. Adding the Module
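If you are unsure about the steps below or would rather skip them, you can contact the author for the complete project source with everything from this column already added; it can be trained directly.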
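4.1 Modification 1
First, set up the location: under the ultralytics/nn folder, create a directory named 'Addmodules'.
4.2 Modification 2
Then create a new .py file inside the Addmodules folder and paste in the "core code" from Section 3 of this article.
4.3 Modification 3
Next, create another .py file named '__init__.py' in the same directory and import our file inside it, as shown in the figure below.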
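Since the figure is not reproduced here, the following self-contained sketch shows the effect of such an `__init__.py`. It builds a throwaway package in a temporary directory with a stub module and a one-line `__init__.py` that re-exports it, then imports the package. The file name `scsa_module.py` is hypothetical (use whatever name you gave the file in Modification 2), and the stub classes only stand in for the real code.

```python
import importlib
import sys
import tempfile
from pathlib import Path

MODULE_NAME = "scsa_module"  # hypothetical name of the file created in Modification 2

root = Path(tempfile.mkdtemp())
pkg = root / "Addmodules"
pkg.mkdir()

# Stand-in for the real file: only the exported names matter for this demonstration.
(pkg / f"{MODULE_NAME}.py").write_text(
    "__all__ = ['SCSA', 'C2PSASCSA']\n"
    "class SCSA: ...\n"
    "class C2PSASCSA: ...\n"
    "class PSABlock: ...  # not listed in __all__, so not re-exported\n"
)

# The __init__.py of this step: re-export everything the module lists in __all__.
(pkg / "__init__.py").write_text(f"from .{MODULE_NAME} import *\n")

sys.path.insert(0, str(root))
addmodules = importlib.import_module("Addmodules")

for name in ("SCSA", "C2PSASCSA", "PSABlock"):
    print(name, hasattr(addmodules, name))
```

Because the core code declares `__all__ = ['SCSA', 'C2PSASCSA']`, only those two names become visible at package level; helper classes such as `PSABlock` stay private to the module. In the real project, the import in `ultralytics/nn/tasks.py` would then presumably be a single line such as `from .Addmodules import *`.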
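4.4 Modification 4
Next, open 'ultralytics/nn/tasks.py' and import and register our module there (this only needs to be done once; if you also use my other improvements, you do not need to repeat this step).
4.5 Modification 5
Inside the parse_model function in 'ultralytics/nn/tasks.py' (at roughly line 1500+), add the code at the position shown in the figure. This step requires some judgment on your part; if you get stuck, contact the author for a video tutorial.
4.6 Modification 6
Inside the parse_model function in 'ultralytics/nn/tasks.py' (at roughly line 1550+), add the code at the position shown in the figure. Make sure the position and indentation match exactly, otherwise errors are very likely.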
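The branch added in this step takes the attention layer's channel count from the layer it reads from and passes it as the first constructor argument. The stand-alone mimic below illustrates that bookkeeping; `build_layer` and the stub `SCSA` class are hypothetical stand-ins written for this example, not the Ultralytics source.

```python
class SCSA:
    """Stub for the real module; it only records the channel count it was built with."""

    def __init__(self, dim):
        self.dim = dim


def build_layer(m, f, args, ch):
    """Mimic of the added branch.

    m: module class, f: the 'from' index of the yaml row,
    args: the yaml args list, ch: output channels of all layers built so far.
    """
    if m in {SCSA}:
        c2 = ch[f]           # attention does not change the channel count
        args = [c2, *args]   # prepend it, so the call becomes SCSA(c2)
    else:
        raise NotImplementedError(f"no branch for {m.__name__}")
    return m(*args), c2


if __name__ == "__main__":
    ch = [64, 128, 256]                             # channels of three earlier layers
    layer, c2 = build_layer(SCSA, 1, [], ch)        # yaml row: [1, 1, SCSA, []]
    print(layer.dim, c2)
    layer, c2 = build_layer(SCSA, -1, [], ch)       # yaml row: [-1, 1, SCSA, []]
    print(layer.dim, c2)
```

This is why a yaml row for a plain attention layer can leave its args empty: the channel count is filled in from the input layer.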
elif m in {<fill in the name(s) of the module(s) from this article here>}:
    c2 = ch[f]
    args = [c2, *args]
5. Training
5.1 YAML files
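Both configurations in this section carry a `scales` table of `[depth, width, max_channels]`, so the channel count written in a yaml row is not the one a layer actually receives. The sketch below reproduces the width scaling as I understand Ultralytics applies it (cap at `max_channels`, multiply by `width`, round up to a multiple of 8); treat that rule as an assumption and check it against your installed version. The helper names are hypothetical.

```python
import math

# [depth, width, max_channels], copied from the yaml files in this section
SCALES = {
    "n": (0.50, 0.25, 1024),
    "s": (0.50, 0.50, 1024),
    "m": (0.50, 1.00, 512),
    "l": (1.00, 1.00, 512),
    "x": (1.00, 1.50, 512),
}


def make_divisible(x: float, divisor: int = 8) -> int:
    """Round x up to the nearest multiple of divisor."""
    return math.ceil(x / divisor) * divisor


def scaled_channels(c: int, scale: str) -> int:
    """Channel count a layer receives when its yaml row says c (assumed scaling rule)."""
    _, width, max_channels = SCALES[scale]
    return make_divisible(min(c, max_channels) * width, 8)


if __name__ == "__main__":
    for scale in SCALES:
        c = scaled_channels(1024, scale)   # the yaml rows at P5 say 1024
        hidden = int(c * 0.5)              # C2PSASCSA hidden width with e=0.5
        print(f"{scale}: layer channels={c}, attention channels={hidden}")
```

Under this rule, the attention inside `C2PSASCSA` works on 128 channels at scale n and on 256 to 384 channels at the larger scales. Applying the `dim // 64` head rule from the core code then gives between 2 and 6 heads.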
5.1.1 YAML file 1
Training info: YOLO26-C2PSA-SCSA summary: 267 layers, 2,456,860 parameters, 2,456,860 gradients, 5.7 GFLOPs
# Ultralytics 🚀 AGPL-3.0 License - https://ultralytics.com/license

# Ultralytics YOLO26 object detection model with P3/8 - P5/32 outputs
# Model docs: https://docs.ultralytics.com/models/yolo26
# Task docs: https://docs.ultralytics.com/tasks/detect

# Parameters
nc: 80 # number of classes
end2end: True # whether to use end-to-end mode
reg_max: 1 # DFL bins
scales: # model compound scaling constants, i.e. 'model=yolo26n.yaml' will call yolo26.yaml with scale 'n'
  # [depth, width, max_channels]
  n: [0.50, 0.25, 1024] # summary: 260 layers, 2,572,280 parameters, 2,572,280 gradients, 6.1 GFLOPs
  s: [0.50, 0.50, 1024] # summary: 260 layers, 10,009,784 parameters, 10,009,784 gradients, 22.8 GFLOPs
  m: [0.50, 1.00, 512] # summary: 280 layers, 21,896,248 parameters, 21,896,248 gradients, 75.4 GFLOPs
  l: [1.00, 1.00, 512] # summary: 392 layers, 26,299,704 parameters, 26,299,704 gradients, 93.8 GFLOPs
  x: [1.00, 1.50, 512] # summary: 392 layers, 58,993,368 parameters, 58,993,368 gradients, 209.5 GFLOPs

# YOLO26n backbone
backbone:
  # [from, repeats, module, args]
  - [-1, 1, Conv, [64, 3, 2]] # 0-P1/2
  - [-1, 1, Conv, [128, 3, 2]] # 1-P2/4
  - [-1, 2, C3k2, [256, False, 0.25]]
  - [-1, 1, Conv, [256, 3, 2]] # 3-P3/8
  - [-1, 2, C3k2, [512, False, 0.25]]
  - [-1, 1, Conv, [512, 3, 2]] # 5-P4/16
  - [-1, 2, C3k2, [512, True]]
  - [-1, 1, Conv, [1024, 3, 2]] # 7-P5/32
  - [-1, 2, C3k2, [1024, True]]
  - [-1, 1, SPPF, [1024, 5, 3, True]] # 9
  - [-1, 2, C2PSASCSA, [1024]] # 10

# YOLO26n head
head:
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 6], 1, Concat, [1]] # cat backbone P4
  - [-1, 2, C3k2, [512, True]] # 13

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 4], 1, Concat, [1]] # cat backbone P3
  - [-1, 2, C3k2, [256, True]] # 16 (P3/8-small)

  - [-1, 1, Conv, [256, 3, 2]]
  - [[-1, 13], 1, Concat, [1]] # cat head P4
  - [-1, 2, C3k2, [512, True]] # 19 (P4/16-medium)

  - [-1, 1, Conv, [512, 3, 2]]
  - [[-1, 10], 1, Concat, [1]] # cat head P5
  - [-1, 1, C3k2, [1024, True, 0.5, True]] # 22 (P5/32-large)

  - [[16, 19, 22], 1, Detect, [nc]] # Detect(P3, P4, P5)
5.1.2 YAML file 2
Training info: YOLO26-Att-SCSA summary: 275 layers, 2,507,164 parameters, 2,507,164 gradients, 5.8 GFLOPs
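This second configuration appends SCSA layers after the head outputs and rewires the Detect row to read from them, which is easy to get wrong by hand. The helper below is a hypothetical sketch (not part of Ultralytics) that generates the tail rows and the Detect 'from' list for any choice of outputs, assuming the head outputs are layers 16, 19, and 22 and the first appended layer gets index 23, as in the yaml of this subsection.

```python
def attach_attention(enabled, base_outputs=(16, 19, 22), first_new_index=23, module="SCSA"):
    """Return (extra yaml rows, Detect 'from' list) for the chosen head outputs.

    enabled: set of head-output layer indices that should get an attention layer.
    """
    rows, detect_from = [], []
    next_index = first_new_index
    for src in base_outputs:
        if src in enabled:
            rows.append([src, 1, module, []])   # new layer reads from the head output
            detect_from.append(next_index)      # Detect reads the attention output instead
            next_index += 1
        else:
            detect_from.append(src)             # Detect keeps reading the head output
    rows.append([detect_from, 1, "Detect", ["nc"]])
    return rows, detect_from


if __name__ == "__main__":
    for choice in ({16}, {16, 19}, {16, 19, 22}, {19}):
        rows, detect_from = attach_attention(choice)
        print(sorted(choice), "->", detect_from)
```

Note that the new indices depend on how many attention layers are enabled before a given one: enabling only the P4 output makes its attention layer number 23, not 24.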
# Ultralytics 🚀 AGPL-3.0 License - https://ultralytics.com/license

# Ultralytics YOLO26 object detection model with P3/8 - P5/32 outputs
# Model docs: https://docs.ultralytics.com/models/yolo26
# Task docs: https://docs.ultralytics.com/tasks/detect

# Parameters
nc: 80 # number of classes
end2end: True # whether to use end-to-end mode
reg_max: 1 # DFL bins
scales: # model compound scaling constants, i.e. 'model=yolo26n.yaml' will call yolo26.yaml with scale 'n'
  # [depth, width, max_channels]
  n: [0.50, 0.25, 1024] # summary: 260 layers, 2,572,280 parameters, 2,572,280 gradients, 6.1 GFLOPs
  s: [0.50, 0.50, 1024] # summary: 260 layers, 10,009,784 parameters, 10,009,784 gradients, 22.8 GFLOPs
  m: [0.50, 1.00, 512] # summary: 280 layers, 21,896,248 parameters, 21,896,248 gradients, 75.4 GFLOPs
  l: [1.00, 1.00, 512] # summary: 392 layers, 26,299,704 parameters, 26,299,704 gradients, 93.8 GFLOPs
  x: [1.00, 1.50, 512] # summary: 392 layers, 58,993,368 parameters, 58,993,368 gradients, 209.5 GFLOPs

# YOLO26n backbone
backbone:
  # [from, repeats, module, args]
  - [-1, 1, Conv, [64, 3, 2]] # 0-P1/2
  - [-1, 1, Conv, [128, 3, 2]] # 1-P2/4
  - [-1, 2, C3k2, [256, False, 0.25]]
  - [-1, 1, Conv, [256, 3, 2]] # 3-P3/8
  - [-1, 2, C3k2, [512, False, 0.25]]
  - [-1, 1, Conv, [512, 3, 2]] # 5-P4/16
  - [-1, 2, C3k2, [512, True]]
  - [-1, 1, Conv, [1024, 3, 2]] # 7-P5/32
  - [-1, 2, C3k2, [1024, True]]
  - [-1, 1, SPPF, [1024, 5, 3, True]] # 9
  - [-1, 2, C2PSA, [1024]] # 10

# YOLO26n head
head:
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 6], 1, Concat, [1]] # cat backbone P4
  - [-1, 2, C3k2, [512, True]] # 13

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 4], 1, Concat, [1]] # cat backbone P3
  - [-1, 2, C3k2, [256, True]] # 16 (P3/8-small)

  - [-1, 1, Conv, [256, 3, 2]]
  - [[-1, 13], 1, Concat, [1]] # cat head P4
  - [-1, 2, C3k2, [512, True]] # 19 (P4/16-medium)

  - [-1, 1, Conv, [512, 3, 2]]
  - [[-1, 10], 1, Concat, [1]] # cat head P5
  - [-1, 1, C3k2, [1024, True, 0.5, True]] # 22 (P5/32-large)

  - [16, 1, SCSA, []] # 23
  # - [19, 1, SCSA, []] # 24
  # - [22, 1, SCSA, []] # 25
  # Usage note: of the three attention layers above, only layer 23 is currently enabled. To use layer 24,
  # uncomment it and change 19 to 24 in the Detect head below. Likewise, to use layer 25, uncomment it and
  # change 22 to 25 in the Detect head below.
  # This is somewhat involved; if you get stuck, contact the blogger Snu77 for a video tutorial.
  - [[23, 19, 22], 1, Detect, [nc]] # Detect(P3, P4, P5)
5.2 Training code
Create a .py file, paste in the code below, and set your own file paths; it is then ready to run.
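import warnings

warnings.filterwarnings('ignore')
from ultralytics import YOLO

if __name__ == '__main__':
    model = YOLO('path to the model config file, i.e. where you saved the yaml from Section 5.1')
    # To switch model scale, change the yaml name passed above: yolo26s.yaml selects the 26s scale.
    # Likewise, if an improved config is named yolo26-XXX.yaml, pass yolo26l-XXX.yaml to use the l scale
    # (change only the name passed to YOLO(...) above, not the name of the config file itself)!
    # model.load('yolo26n.pt')  # whether to load pretrained weights; not recommended for research, as it makes accuracy gains hard to show
    model.train(
        data=r"path to the dataset yaml file",
        # For other tasks, edit `task` in 'ultralytics/cfg/default.yaml': detect, segment, classify, pose
        cache=False,
        imgsz=640,
        epochs=20,
        single_cls=False,  # whether this is single-class detection
        batch=16,
        close_mosaic=0,
        workers=0,
        device='0',
        optimizer='MuSGD',  # using SGD/MuSGD
        # resume=,  # path to last.pt goes here
        amp=True,  # if the training loss becomes NaN, try turning amp off
        project='runs/train',
        name='exp',
    )
5.3 Training screenshots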
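6. Summary
That concludes this article. I recommend my YOLOv26 effective-improvement column: it is newly launched and has an average quality score of 98. I will keep reproducing papers from the latest top conferences and will also fill in some older improvement mechanisms. If this article helped you, please subscribe to the column and follow for more updates.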
