PointNet is an architecture that operates on unordered sets of points; its key design property is permutation invariance: changing the ordering of the input does not affect the output. Additional transform modules also make the network robust to geometric transformations of the input.

The main backbone of the network (in blue) takes $n$ points of $(x, y, z)$ coordinates. The main idea is a shared MLP that computes features for each point individually. After computing per-point features, we take a global max pool over each feature map, keeping the largest value of each feature across all points. This max pooling is permutation invariant, since the global max of each feature map does not change under reordering. Formally, our operations are equivalent to approximating a general set function with a symmetric function applied to transformed points:

$$f(\{x_1, \dots, x_n\}) \approx \gamma\Big(\max_{i = 1, \dots, n} h(x_i)\Big),$$

where $h$ is the shared per-point MLP, $\max$ is the element-wise max pool, and $\gamma$ is a final MLP.
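
The shared-MLP-plus-max-pool idea can be checked with a minimal numpy sketch; the single linear-plus-ReLU layer here stands in for the full MLP, and the weights are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the shared MLP: one linear layer + ReLU, with the SAME
# weights applied to every point independently (weights are illustrative).
W = rng.standard_normal((3, 8))   # 3 input coords -> 8 feature maps
b = rng.standard_normal(8)

def shared_mlp(points):
    # points: (n, 3) -> per-point features (n, 8)
    return np.maximum(points @ W + b, 0.0)

def global_feature(points):
    # Max pool over the point axis: one scalar per feature map.
    return shared_mlp(points).max(axis=0)

points = rng.standard_normal((16, 3))
shuffled = points[rng.permutation(16)]

# Reordering the points leaves the pooled global feature unchanged.
assert np.allclose(global_feature(points), global_feature(shuffled))
```

Because the max over a set ignores ordering, any permutation of the rows produces the identical global feature vector.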

For classification, we pass the global feature vector to a final MLP that computes class probabilities. For semantic segmentation, we also need local per-point feature information (in yellow), so we concatenate the global feature vector to each point's features and process them with further per-point MLPs that see both global and local information.
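
The segmentation-head concatenation amounts to broadcasting one global vector onto every point; a small sketch, with feature sizes (64 local, 1024 global) chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 16
local_feats = rng.standard_normal((n, 64))   # per-point (local) features
global_feat = rng.standard_normal(1024)      # max-pooled global feature

# Repeat the global vector for every point and concatenate, so the
# per-point segmentation MLP sees both local and global context.
combined = np.concatenate(
    [local_feats, np.broadcast_to(global_feat, (n, 1024))], axis=1
)
assert combined.shape == (n, 64 + 1024)
```

Each row of `combined` then feeds the same shared MLP to produce a per-point segmentation label.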

The input transform at the start is a smaller network that predicts a transformation matrix applied to every input point. The feature transform similarly produces a transform of the per-point feature space that is regularized to stay close to orthonormal.
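
The "close to orthonormal" constraint is enforced with a penalty of the form $\lVert I - AA^\top\rVert_F^2$ on the predicted feature-transform matrix $A$; a small sketch:

```python
import numpy as np

def orthogonality_penalty(A):
    # Regularization term pushing the predicted feature transform A
    # toward an orthonormal matrix: ||I - A A^T||_F^2.
    k = A.shape[0]
    diff = np.eye(k) - A @ A.T
    return np.sum(diff ** 2)

# An exactly orthonormal matrix incurs zero penalty.
assert np.isclose(orthogonality_penalty(np.eye(64)), 0.0)
```

During training this penalty is added to the task loss, keeping the learned transform near a rotation of feature space.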

Key Points

One property of the max pool is robustness. For a max pooling layer of dimension $K$ (called the bottleneck dimension) applied to an input set $S$, there exists a critical point set $C_S \subseteq S$ with $|C_S| \le K$ that fully determines the output of our function; that is,

$$f(C_S) = f(S).$$
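
The critical set can be extracted directly: it is the set of points that achieve the maximum in at least one feature map, so it has at most $K$ members. A minimal sketch with an illustrative linear-plus-ReLU feature function:

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((3, 8))   # illustrative weights, K = 8 feature maps

def features(points):
    return np.maximum(points @ W, 0.0)

points = rng.standard_normal((100, 3))
f = features(points)
global_feat = f.max(axis=0)

# Critical set: points achieving the max in at least one feature map.
# At most one point per feature map, so |C_S| <= K = 8.
critical_idx = np.unique(f.argmax(axis=0))
critical = points[critical_idx]

assert len(critical_idx) <= 8
# The critical set alone reproduces the full global feature.
assert np.allclose(features(critical).max(axis=0), global_feat)
```

Here 8 (or fewer) of the 100 input points are enough to recover the exact pooled output.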

The network is similarly robust to some level of input noise: adding extra points outside of the critical set (up to an upper-bound point set) does not affect the output.
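
This robustness is easy to demonstrate: any added point whose features never exceed the current per-map maxima (duplicates of existing points are a simple such case) leaves the pooled output unchanged. A sketch reusing an illustrative feature function:

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.standard_normal((3, 8))   # illustrative weights

def global_feature(points):
    return np.maximum(points @ W, 0.0).max(axis=0)

points = rng.standard_normal((100, 3))
base = global_feature(points)

# Extra points whose features never exceed the current maxima (here,
# duplicates of existing points) cannot change any per-map max.
extra = points[rng.integers(0, 100, 20)]
noisy = np.concatenate([points, extra], axis=0)

assert np.allclose(global_feature(noisy), base)
```

The max pool only "sees" whichever point is largest in each feature map, so dominated points are invisible to the output.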

We can visualize the critical points below; the first row is the input, the second is its critical set, and the third is the largest possible point set that results in the same output.