rsasaki0109 · rsasaki0109 · May 26, 2026 · May 26, 2026
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -347,6 +347,10 @@ add_executable(gpu_megaparticles_6dof src/gpu_megaparticles_6dof.cu)
 target_link_libraries(gpu_megaparticles_6dof ${OpenCV_LIBS})
 target_compile_options(gpu_megaparticles_6dof PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:--expt-relaxed-constexpr>)
 
+add_executable(gpu_megaparticles_gicp_mcl src/gpu_megaparticles_gicp_mcl.cu)
+target_link_libraries(gpu_megaparticles_gicp_mcl ${OpenCV_LIBS})
+target_compile_options(gpu_megaparticles_gicp_mcl PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:--expt-relaxed-constexpr>)
+
 add_executable(gpu_kld_amcl src/gpu_kld_amcl.cu)
 target_link_libraries(gpu_kld_amcl ${OpenCV_LIBS})
 target_compile_options(gpu_kld_amcl PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:--expt-relaxed-constexpr>)

diff --git a/plan.md b/plan.md
@@ -9,8 +9,52 @@ known sharp edges and lessons learned from the last few attempts, and (5)
 a prioritised menu of candidate next tasks with enough specificity that a
 fresh agent can pick one and start.
 
-Branch `chore/shared-se3-helpers` (-> `master`) is **in flight** (2026-05-26).
-This is the Open Threads A shared-header cleanup recommended after #105: it
+PR for `feat/gpu-megaparticles-gicp-d2d` -> `master` is **IN FLIGHT** (opened
+2026-05-26): a GICP-style distribution-to-distribution (D2D) scan likelihood
+for the MegaParticles line -- the "GICP-like point-cloud likelihood" follow-up
+flagged after #86/#101/#104. New single file `src/gpu_megaparticles_gicp_mcl.cu`
+runs a controlled head-to-head: two 1,048,576-particle filters with IDENTICAL
+MegaParticles machinery (global uniform init, Gauss-Newton particle motion,
+sparse bucket-neighbor Stein attraction/repulsion, posterior smoothing,
+representative-state gate, hidden kidnap + 15-frame scan blackout), differing
+ONLY in the per-particle scoring kernel. Arm A is the #86 distance-field
+endpoint proxy (control; it reproduces #86's ~0.097 m post-kidnap RMSE). Arm B
+is the new GICP D2D likelihood: the map is a point cloud (2,396 points, boundary
+cells thinned to ~0.15 m) with per-point disk covariances (small variance along
+the surface normal, large along the tangent), indexed by a uniform NN grid; each
+particle matches every scan endpoint to the nearest map point and scores the
+surface-aware Gaussian log-likelihood under the combined covariance
+M = (C_map + R C_scan R^T)^{-1} (Segal et al. RSS 2009), summed over the scan,
+with a per-particle full 3x3 Gauss-Newton step driving the Stein motion. Key
+robustness lesson: a PURE D2D likelihood (flat penalty + zero gradient outside
+the match radius) re-localized the hidden kidnap only intermittently -- the
+sharp D2D contracts harder pre-kidnap, leaving thin global support, and a lost
+particle gets no gradient pull. Fix is a coarse-to-fine design: an UNMATCHED ray
+falls back to the distance-field endpoint log-likelihood (smooth long-range
+pull), so the worst case == the field filter and global recovery is robust,
+while MATCHED rays use the sharp surface-aware GICP term for accuracy. Verified
+over 4 runs (GPU atomicAdd noise floor): both arms recover the kidnap in 0
+frames; post-kidnap RMSE field 0.099 m -> GICP D2D ~0.064 m (~35% lower), final
+error 0.040 m -> 0.021 m (~halved), at ~2.4x per-step cost (field ~4.9 ms ->
+D2D ~12.1 ms). Touched files: CMakeLists.txt, readme.md, plan.md,
+src/gpu_megaparticles_gicp_mcl.cu, plus the gh-pages GIF. With this the
+MegaParticles localization line now covers Stein (#86), explicit LSH (#101),
+6-DoF SE(3) (#104), and the GICP D2D likelihood.
+
+PR #112 (`chore/shared-se3-helpers` -> `master`) was **MERGED** (squash) on
+2026-05-26 at `3eb4b4d`; CI Build passed (12m1s; Build + Python tests + CPU
+tests all green; only the Node.js 20 deprecation annotation), the draft was
+marked ready, and the remote branch was deleted. The concurrent agent session
+landed six GPU graph-neural / game-theoretic MPPI demos
+(`gpu_best_response_graph_mppi`, `gpu_belief_risk_graph_mppi`,
+`gpu_intent_graph_neural_mppi`, `gpu_priority_graph_neural_mppi`,
+`gpu_interaction_graph_neural_mppi`, `gpu_multiagent_graph_neural_mppi`) on
+master just before it, so the squash fast-forward landed all of them together;
+#112's own diff is exactly its five files (include/se3_helpers.cuh, plan.md,
+and the three migrated `.cu`). Local `master` is at `3eb4b4d` and in sync with
+origin. There is **no active feature branch** now -- the next agent starts
+fresh from `master`.
+This was the Open Threads A shared-header cleanup recommended after #105: it
 lifts the SE(3) / SO(3) math kernels + the 6x6 SPD Cholesky solve that were
 copied verbatim across the three rotation-matrix pose-graph SLAM back-ends
 (`gpu_pose_graph_slam_3d.cu`, `gpu_pose_graph_slam_3d_switchable.cu`,

diff --git a/readme.md b/readme.md
@@ -20,8 +20,8 @@ Same algorithm on CPU and GPU — GPU enables orders of magnitude more particles
 | <img src="https://rsasaki0109.github.io/CudaRobotics/comparison_reeds_shepp_fan.gif" width="400"/> | <img src="https://rsasaki0109.github.io/CudaRobotics/gpu_kld_amcl.gif" width="400"/> |
 | **MegaParticles + explicit p-stable LSH neighbor index: fixed grid 58% vs LSH 88% neighbor recall (1M particles)** | **MegaParticles 6-DoF: 1M SE(3) particles, hidden-kidnap relocalization (LSH neighbor consensus)** |
 | <img src="https://rsasaki0109.github.io/CudaRobotics/gpu_megaparticles_lsh.gif" width="400"/> | <img src="https://rsasaki0109.github.io/CudaRobotics/gpu_megaparticles_6dof.gif" width="400"/> |
-| **Online 3D SLAM with switchable loop constraints: false loops rejected live (plain 9.10 m vs switchable 0.29 m, 21/21 rejected)** | |
-| <img src="https://rsasaki0109.github.io/CudaRobotics/gpu_online_slam_3d_switchable.gif" width="400"/> | |
+| **Online 3D SLAM with switchable loop constraints: false loops rejected live (plain 9.10 m vs switchable 0.29 m, 21/21 rejected)** | **MegaParticles GICP D2D likelihood: surface-aware scoring halves error vs field proxy (1M particles)** |
+| <img src="https://rsasaki0109.github.io/CudaRobotics/gpu_online_slam_3d_switchable.gif" width="400"/> | <img src="https://rsasaki0109.github.io/CudaRobotics/gpu_megaparticles_gicp_mcl.gif" width="400"/> |
 
 ## Capability matrix
 
@@ -31,7 +31,7 @@ Same algorithm on CPU and GPU — GPU enables orders of magnitude more particles
 | Collision check | `comparison_collision_check` | 1M segments/scan | 1,277x per candidate |
 | Scan matching | `comparison_icp`, `comparison_ndt`, `gpu_ndt_3d_multires`, `gicp` | 10K+ points | parallel correspondences |
 | Pose-graph SLAM | `gpu_pose_graph_slam`, `gpu_pose_graph_slam_3d`, `gpu_pose_graph_slam_3d_robust`, `gpu_pose_graph_slam_3d_switchable`, `gpu_online_slam`, `gpu_online_slam_3d_switchable` | 2D 200 poses / 3D 384-420 poses | robust 3D rejects 36/36 false loops, 6.95→0.28 m; switchable constraints learn per-loop switches jointly with poses, 6.95→0.29 m; online 3D switchable rejects false loops live in a sliding window, plain 9.10 m → switchable 0.29 m (21/21 rejected) |
-| Particle filter | `comparison_pf`, `gpu_global_localization_mcl`, `gpu_megaparticles_stein_mcl`, `gpu_megaparticles_lsh`, `gpu_megaparticles_6dof`, `gpu_kld_amcl`, `diff_pf`, `diff_pf_mlp` | 10K-1M particles | MegaParticles-style range SPF: 14.61 m bootstrap vs 0.097 m recovery; explicit p-stable LSH neighbor index lifts neighbor recall 58%→88%; 6-DoF SE(3) relocalization recovers a hidden kidnap to 0.22 m / 1.9 deg (LSH neighbor consensus); KLD-AMCL adapts 400→65,536 particles, 15.2x vs CPU |
+| Particle filter | `comparison_pf`, `gpu_global_localization_mcl`, `gpu_megaparticles_stein_mcl`, `gpu_megaparticles_lsh`, `gpu_megaparticles_6dof`, `gpu_megaparticles_gicp_mcl`, `gpu_kld_amcl`, `diff_pf`, `diff_pf_mlp` | 10K-1M particles | MegaParticles-style range SPF: 14.61 m bootstrap vs 0.097 m recovery; explicit p-stable LSH neighbor index lifts neighbor recall 58%→88%; 6-DoF SE(3) relocalization recovers a hidden kidnap to 0.22 m / 1.9 deg (LSH neighbor consensus); surface-aware GICP D2D likelihood halves post-kidnap error vs the field proxy (0.099→0.064 m); KLD-AMCL adapts 400→65,536 particles, 15.2x vs CPU |
 | RRT family | `comparison_rrt*`, `comparison_rrtstar_rewire` | 1M paths / 200K nodes | 5,000x per-path; 62x rewire |
 | Crowd / swarm | `gpu_crowd_swarm` | 10,000 boids with uniform-grid neighbours | 105x vs CPU |
 | Graph policy control | `gpu_gnn_swarm_controller`, `gpu_gat_traversability_policy` | 2048 agents / 3072 terrain nodes x 3 heads | 2.88 ms/control; 99.4x GAT policy |
@@ -191,6 +191,7 @@ cd ros2_ws && colcon build --packages-select cuda_robotics
 | MegaParticles-style Stein MCL | 1,048,576 range particles; local bootstrap post RMSE 14.61 m → Stein/bucket posterior recovery 0.097 m |
 | MegaParticles LSH neighbor index | 2 × 1,048,576 particles; explicit p-stable LSH (8 tables × 3 projections) vs fixed grid; neighbor recall vs brute-force kNN 58.2% → 87.8%, post-kidnap RMSE 0.099 → 0.088 m |
 | MegaParticles 6-DoF SE(3) | 1,048,576 SE(3) particles in a 3D voxel world; 3D-ESDF range likelihood, quaternion GN steps, 6-D p-stable LSH neighbor consensus; hidden kidnap: local bootstrap post RMSE 5.97 m → 6-DoF MegaParticles 0.22 m / 1.9 deg, reacquires in 0 frames |
+| MegaParticles GICP D2D likelihood | 2 × 1,048,576 particles, identical Stein machinery; surface-aware GICP distribution-to-distribution scoring (per-point disk covariances, grid-indexed map cloud) vs the distance-field proxy; both recover the hidden kidnap in 0 frames, post-kidnap RMSE 0.099 → 0.064 m and final error 0.040 → 0.021 m, at ~2.4x per-step cost (4.9 → 12.1 ms) |
 | Augmented KLD-AMCL | KLD-sampling adapts 400→65,536 particles, augmented injection reacquires hidden kidnap in 13 steps, settled RMSE 0.014 m, **15.2x** vs CPU |
 | 2D ESDF (640K cells) | **53,404x** per cell (JFA) |
 | 3D ESDF (1M voxels) | **86,613x** per voxel (JFA-3D) |