CVPR文章阅读

¶CVPR文章阅读

标题： Grid-GCN for Fast and Scalable Point Cloud Learning

¶Abstract

由于点云的稀疏性以及不规则性，直接使用原始点的方法变得比较流行，但是基于点(point-based)的方法在数据构建(data structuring)上开销非常大，这限制了算法的速度以及可拓展性(scalability)。在这篇文章中我们提出了一个叫做Grid-GCN的方法用作快速以及可拓展的点云学习。Grid-GCN使用新颖的数据构建策略(data structuring strategy)：覆盖自意识网格查询(Coverage-Aware Grid Query, CAGQ)。CAGQ通过平衡(leverage)网格区域的效率，在增加空间覆盖度的同时减小了理论的时间复杂度。相比与FPS(Farthest Point Sampling)速度提升了50倍。在GCA(Grid Context Aggregation)模组的加持下，Grid-GCN在主流的点云分类以及予以分割基准上达到了state-of-the-art的性能，速度比以前的研究有显著提升。Grid-GCN在ScanNet上使用81920点作为输入的推理速度达到了50fps。

Supplementary: https://xharlie.github.io/papers/GGCN_supCamReady.pdf

Source code: https://github.com/xharlie/Grid-GCN

¶Introduction

点云数据在自动驾驶，机器人，无人飞行器的领域很常见。目前Lidar传感器每秒可产生上百万个点，产生密集的世界表示。有很多方法被用作点云数据处理：

基于体素的模型 基于体素的模型将点云转换成空间量化的体素网格，并使用体素卷积在网格空间进行卷积。优点：网格数据结构高效；缺点：1.为了保留数据位置的力度的话必须要高的体素分辨率，而算力与内存随分辨率三次方增长。2.大约90%的体素都是空的。

基于点的模型 相比基于体素的模型，计算更高效但是需要遭受低效的数据构建(data structuring)。

Grid-GCN模型融合(blend)了二者的优点，在模型中插入多个GridConv层。没各层包含两个阶段：1. data structuring stage: 采集出有代表性的中心并查询临近的点。2. convolution stage: 对每一个点群构建局部图并聚集信息到中心上。

Coverage-Aware Grid Query (CAGQ) module: 用来达到高效的数据构建。模块 1.加速了中心采集以及临近点查询；2.提供了对点云更完整的覆盖。data structuring efficiency通过体素化实现，computation efficiency通过仅计算被占用的区域实现。

Grid Context Aggregation (GCA) module: 用来利用点之间的关系(point relationships)。模型使用Grid context pooling来提取网格邻域的上下文特征，这有利于边缘关系计算(edge relation computation)而不增加额外开销。

We demonstrate the Grid-GCN model on two tasks: point cloud classification and segmentation. Specifically, we perform the classification task on the ModelNet40 and ModelNet10 [43], and achieve the state-of-the-art overall accuracy of 93.1% (no voting), while being on average 5× faster than other models. We also perform the segmentation tasks on ScanNet [8] and S3DIS [1] dataset, and achieve 10× speed-up on average than other models. Notably, our model demonstrates its ability on real-time large-scale point-based learning by processing 81920 points in a scene within 20 ms.

基于体素的三维学习方法 为了延续在卷积神经网络模型的成功，Voxnet和他的变体开始将点云或者深度图转换为被占用的网格并应用体积卷积。为了强调内存使用三次方增长的问题，OctNet为被占用体素构建了树状结构来避免在空区域的计算。这种方法尽管在数据构建上效率较高，但是基于体素的方法有低计算效率以及损失数据粒度的缺陷。

基于点的点云学习方法 基于点的方法首先由[29，30]提出，他们使用pooling来聚合点的信息以追求排列不变性。类似kernel correlation和extended convolutions的方法被提出来更好地捕获局部特征。为了解决计数(ordering)歧义，PointCNN与预测了局部点的顺序，RSNet从不同的方向顺序地消耗点。基于点的学习方法的计算花销随着输入点的数量线性增加。然而，数据构建的花销会在大型点云上成为性能瓶颈。

点云的数据构建策略 大多数基于点的方法使用FPS来平均地采样分散的群组中心。FPS会挑选相互间距离最大的点。如果中心的数量不是很小，这种方法会消耗O(N²)计算性能。一个近似的方法[9]能够达到O(NlogN)。随机点采样(Random Point Sampling, RPS)拥有最小的可能开销，但是算法对密度不均衡很敏感。我们的CAGQ模组和RPS有相同的复杂度，但是他会一次性执行采样以及邻域点(neighbors)查询，这甚至比使用Ball Query的RPS或者K-NN更快(看Table 2)。KPConv使用一个网格欠采样(sub-sampling)从被占用体素中挑选点。不同于我们的CAGQ，这种方法不能从体素邻域中查询点。CAGQ也有一个Coverage-Aware Sampling(CAS)算法，它优化中心点的选择，使得算法嫩能够比FPS达到对点云更好的覆盖。

额外地，SO-Net搭建一个自组织图(self-organizing map)。KDNet使用kd-tree分割(partition)区域。PAT使用Gunble Subset Sampling替代FPS。SPG使用一个聚类的方法将点分组为超级点(super points)所有的这些方法不是慢就是需要构建的预处理(structure preprocessing)。SPLATNet中的晶格投影在体素中保留了更多点的细节，但是方法更缓慢。像VoxelNet的方法通过在每个体素中使用PointNet并进行体素卷积的方式结合了基于点和体素的方法。一个并发(concurrent)的高速模型PVCNN使用了相似的方法但是不会在每一层逐渐地减小点数。目前Grid-GCN可以通过CAGQ下采样大量的点，并且通过考虑局部图的端点(node)关系来聚合信息。

使用GCN进行点云学习 图卷积网络已经被广泛应用在点云学习上。通常每个组都会构建一个局部图，GCN根据点之间的关系聚合点的信息。SpecConv混合了点的特征通过使用一个图的傅里叶变换。其他研究对中心以及端点(node)的边缘特征(edge feature)进行建模。在他们中，[47, 26, 17, 41, 49]使用几何关系，[6, 39]探索了nodes之间的语义关系。除了这些特征，我们提出的Grid Context Aggregation模组考虑了覆盖关系并提取了上下文特征去计算语义关系。

Figure 1: Overview of the Grid-GCN model. (a) Illustration of the network architecture for point cloud segmentation. Our model consists of several GridConv layers, and each can be used in either a downsampling or an upsampling process. A GridConv layer includes two stages: (b) For the data structuring stage, a Coverage-Aware Grid Query (CAGQ) module achieves efficient data structuring and provides point groups for efficient computation. © For the convolution stage, a Grid Context Aggregation (GCA) module conducts graph convolution on the point groups by aggregating local context

¶Methods

¶Method Overview

如figure1所示，Grid-GCN在一系列的Grid-Conv层上构建。每一个GridConv层处理N个点并将他们映射到M个点上。下采样GridConv层(N>M)被重复几次直到最终的特征表达被学习到。这种表达可以被直接用在分类中或者在语义分割任务中更一步被上采样GridConv层(N<M)上采样。

GridConv包含两个模组：

一个CAGQ模组从N个点中采样M个点。每一个组包含K个node点和一个组中心。在上采样过程中，CAGQ从长距离连接(long-range connections)中直接获取中心点，并且仅查询这些中心点的node points。
一个网格上下文聚合(Grid Context Aggregation, GCA)模组为每一个点群构建局部图，并将信息聚合到群中心。M个群中心会被当作点送达下一层。

为了清楚起见，我们在supplementary中列举了所有的说明。

Figure 2: Illustration of Coverage-Aware Grid Query (CAGQ). Assume we want to sample $M = 2$ point groups and query $K = 5$ node points for each group. (a) The input is N points (grey). The voxel id and number of points is listed for each occupied voxel. (b) We build voxel-point index and store up to $n_v = 3$ points (yellow) in each voxel. © Comparison of different sampling methods: FPS and RPS prefer the two centers inside the marked voxels. Our RVS could randomly pick any two occupied voxels (e.g. (2,0) and (0,0)) as center voxels. If our CAS is used, voxel (0,2) will replace (0,0). (d) Context points of center voxel (2,1) are the yellow points in its neighborhood (we use 3 × 3 as an example). CAGQ queries 5 points (yellow points with blue ring) from these context points, then calculate the locations of the group centers.

¶Coverage-Aware Grid Query(CAGQ)

在这个小节，我们讨论CAGQ的细节。CAGQ目标是在给定点云的情况下，有效地构建点云(structure the point cloud)，并使center sampling以及neighbor points query更容易。为了执行CAGQ，

我们首先通过设定体素尺寸$(v_x, v_y, v_z)$将输入空间体素化。
然后我们将每一个点映射到voxel index $Vid(u,v,w)=floor(\frac{x}{v_x},\frac{y}{v_y},\frac{z}{v_z})$。这里我们仅在每个体素中保存至多$n_v$个点。
使得$O_v$代表所有的非空体素。我们接着采样M个canter voxels $O_c\subseteq O_v$.对于每个center voxel $v_i$，我们定义它的voxel neighbors(相邻体素) $\pi(v_i)$为在center voxel周围的voxels(voxels within the neighbor-hood of a center voxel)。在Figure 2d，$\pi(v(2,1))$是在红框中的3X3图体素。我们把存储在$\pi(v_i)$中的点叫做context points(上下文点)。由于之前已经构建了体素化的坐标，CAGQ能够很快地检索(retrive)每个$\pi(v_i)$中的上下文点。
从那之后， CAGQ从每个$v_i$的context points中挑选K个node points。我们计算一个group(群)中的重心(barycenter)，作为group中心的位置。整个流程在Figure 2展示。

这里有两个问题还没有解决，(1)如何采样center voxels $O_c\subseteq O_v$。(2)如何从$\pi(v_i)$的context points中选取K个node points。

为了解决第一个问题，提出了center voxels sampling框架，框架主要包含两种方法：

Random Voxel Sampling (RVS)：每一个被占用的voxel都有同样的被选中的概率。在这些center voxels内部计算的group centers会比RPS直接在原始点上采集的centers分布更加均匀。
Coverage-Aware Sampling (CAS)：每一个被选择的center voxels可以覆盖至多$\lambda$个被占用的voxel beighbors。CAS的目标是选择一系列的center voxels使得他们可以覆盖最多的被占用体素。这里使用了贪心算法(greedy algorthm)来逼近最优解决方法
- 随机从$O_v$中挑选出M个voxels作为候选者
- 从所有的unpicked voxels中，迭代地挑战一个随机的候选者。如果挑战成功，则替换该候选者
对于一个挑战者$v_c$和一个候选者$v_I$， heuristics计算方法：
$$
\delta(x)=
\begin{cases}
1,\quad if\space x=0\
0, \quad otherwise.
\end{cases}
\tag{1}
$$

$$
H_{add}=\sum_{V\in \pi(V_C)}
{
\delta(C_V)-\beta \cdot \frac{C_V}{\lambda}
}
\tag{2}
$$

$$
H_{rmv}=\sum_{V\in \pi(V_I)}
{
\delta(C_V-1)
}
\tag{3}
$$

$\lambda$是一个voxel的neighbors的数量，$C_V$是覆盖体素$V$的候选者数量。$H_{add}$代表添加挑战者的覆盖增益，$H_{rmv}$是移除候选者的覆盖损失。当$H_{add}>H_{rmv}$时即挑战成功。若$\beta$为0，每次替代都一定会提升空间覆盖度。

Node points query CAGQ提供了两种策略去从$\pi(v_i)$的context points中挑选K个node points。
1. Cube Query：随机从context points中挑选K个点。
2. K-Nearest Neighbors：CAGQ中的k-NN仅需要搜索context points，这与传统k-NN不同。

¶Grid Context Aggregation

对于每个CAGQ生成的point group，使用Grid Context Aggregation(GCA)模块将node points 的信息聚合到group center上。首先构建一个局部图$G(V,E)$，其中V由CAGQ提供的group center和K个node points组成。然后将每一个node point连接到group center。GCA将一个node point的特征$f_i$投射到$\widetilde f_i$。基于node和center之间的边缘关系，GCA计算$\widetilde f_i$的贡献并将这些特征聚合作center的特征$\widetilde f_c$。正式地，GCA模组可以被描述为：
$$
\widetilde f_{c,i}=e(\chi _i,f_i)*\mathcal{M}(f_i)
\tag 4
$$

$$
\widetilde f_{c}=\mathcal A({\widetilde f_{c,i}},i\in 1,…,K)
\tag 5
$$

$\widetilde f_{c,i}$是一个node的贡献，$\chi _i$是node的xyz坐标。$\mathcal{M}$是多层感知集(Multi-Layer Perceptron, MLP)，$e$是边缘注意力函数，$\mathcal{A}$是聚合函数。研究设计了新的边缘注意力函数$e$，函数有以下的优点(Figure 4)：

Coverage Weight 从前的研究使用center的$\chi _c$和node的$\chi _i$来将edge attention建模为几何关系的函数(Figure 4b)。然而方程忽略了previous layers的node的贡献。直观地，previous layers拥有更多信息的node points应该被给予更多注意力。这个情景被阐明在Figure 3。在此基础上，我们引进coverage weight的观念，coverage weight被定义为previous layers中构成一个node的点的数量。这个值可以被CAGQ很容易地计算得到，我们认为(argue that) coverage weight是计算edge attention中的一个很重要的特征。

Figure 3: The red point is the group center. Yellow points are its node points. Black points are node points of the yellow points in the previous layer. The coverage weight is an important feature as it encodes the number of black points that have been aggregated to each yellow point.

Grid Context Pooling 当计算边缘注意力时(edge attention)，语义关系是另外一个重要的方面。在以前的工作中，语义关系通过使用group center特征$f_c$和node point’s feature $f_i$，这需要group center从node points选出。在CAGQ中，由于group center被计算作node points的中心，我们提出Grid context pooling，通过池化所有的context points来提取context features $f_{cxt}$。这有以下好处：

$f_{cxt}$模拟了虚拟group center的特征，允许我们计算centers和它的node points之间的语义联系。
即使当group center被选为实际的点，$f_{cxt}$依然是一个有用的特征表示，因为它覆盖了临近的更多的点，而不是仅仅图中的点。
由于之前在CAGQ已经将context points与它的center voxel进行了联系，因此不会有额外的点查询开销。$f_{cxt}$被分享到所有局部图的边缘计算，并且池化是轻量的操作，不需要可学习的参数，这仅增加了微小的计算开销。

GCA模组在Figure 4d中有总结，并且边缘注意力函数可以被建模为：
$$
e=mlp(mlp_{geo}(\chi {c},\chi {i},\omega {i}),mlp{sem}(f{cxt},f{i}))
\tag 6
$$

Figure 4: Different strategies to compute the contribution $\widetilde f_{c,i}$ from a node $n_i$ to its center $c. f_i , χ_i$ are the feature maps and the location of $n_i$. $e_i$ is the edge feature between $n_i$ and $c$ calculated from the edge attention function. (a) Pointnet++ [30] ignores $e_i$. (b) computes ei based on low dimensional geometric relation between $n_i$ and $c$. © also consider semantic relation between the center and the node point, but $c$ has to be sampled on one of the points from the previous layer. (d). Grid-GCN’s geo-relation also includes the coverage weight. It pools a context feature fcxt from all stored neighbors to provide a semantic reference in $e_i$ computing.

Figure 5: The visualization of the sampled group center and the queried node points by RPS, FPS, and CAS. The blue and green balls indicate Ball Query. The red squares indicate Cube Query. The ball and cube have the same volume. (a) RPS covers 45.6% of the occupied space, while FPS covers 65% and CAS covers 75.2%.

¶Analysis of CAGQ

暂时忽略

¶Experiments

暂时忽略

¶Conclusion

在这篇文章中，我们提出了Grid-GCN应用于快速的可拓展的点云学习。Grid-GCN通过CAGQ达到了有效的data structuring与computation。CAGQ通过体素化以及提供覆盖整个点云point groups彻底地(drastically)减小了data structuring的开销。图卷积模组GCA也被提出来吸收(incorporate)计算中的context features和coverage information。在这两个模组的作用下，Grid-GCN在多种个基准集上达到了state-of-the -art的精度和速度。Grid-GCN，有了超凡的性能与非并行的效率，可以被用在大尺度、实时的点云处理应用上。