# BEVFormerFusion
BEVFormerFusion is a BEVFormer-derived multi-modal 3D detector for nuScenes. The active implementation keeps the upstream camera-to-BEV scaffold and adds PointPillars LiDAR BEV features at the encoder and decoder stages, plus a dedicated velocity head that reads from the pre-fusion camera BEV state.
The published comparisons on this site are limited to the repository-tracked checkpoint summaries normalized into docs/assets/data/metrics.json. At the shared 100000-iteration checkpoint, the fused configuration reaches 0.2507 mAP and 0.2546 NDS, compared with 0.2011 mAP and 0.2192 NDS for the local BEVFormer baseline.
Documentation: Architecture | BEVFormer comparison | Experiments | Usage | API reference | GitHub repository
## Overview and motivation
The upstream BEVFormer design builds BEV tokens from multi-view images through temporal self-attention and camera spatial cross-attention. The active BEVFormerFusion path keeps that detector scaffold but adds a LiDAR BEV branch produced by PointPillars. The resulting design injects geometric evidence twice: first inside the encoder by parallel cross-attention, then again before decoder cross-attention through a concatenate-and-project block.
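The encoder-side injection can be pictured as a per-channel gated blend of the two cross-attention outputs. The sketch below is illustrative, not the repository's actual `MM_BEVFormerLayer` code; the class and parameter names are assumptions, and only the sigmoid-gate idea comes from the documentation.

```python
import torch
import torch.nn as nn

class GatedBEVBlend(nn.Module):
    """Hypothetical sketch: blend camera and LiDAR cross-attention outputs
    with a learned per-channel sigmoid gate. Names are illustrative and do
    not mirror the repository's API."""

    def __init__(self, embed_dims: int):
        super().__init__()
        # One gate logit per channel; sigmoid(0) = 0.5, so the two
        # branches start out weighted equally.
        self.gate_logit = nn.Parameter(torch.zeros(embed_dims))

    def forward(self, cam_feat: torch.Tensor, lidar_feat: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate_logit)              # shape (C,)
        return g * cam_feat + (1.0 - g) * lidar_feat

# Dummy BEV tokens: (bev_h * bev_w, C)
blend = GatedBEVBlend(embed_dims=8)
cam = torch.randn(100 * 100, 8)
lidar = torch.randn(100 * 100, 8)
out = blend(cam, lidar)
```

Because the gate is learned per channel, training can let LiDAR evidence dominate geometry-heavy channels while camera evidence dominates appearance-heavy ones.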
The repository also changes how orientation and motion are supervised. Yaw is factorized into discrete-bin and residual heads, while velocity is moved out of the box regression branch and predicted by a dedicated query-to-BEV attention module. This isolates gradients for motion estimation from the heavily LiDAR-enriched detection path.
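The yaw factorization follows the common bin-plus-residual pattern: classify a coarse orientation bin, then regress an offset from that bin's center. This is a hedged sketch of the encoding; the bin count and exact parameterization in `BEVFormerHead` may differ.

```python
import math
import torch

NUM_BINS = 12                     # illustrative bin count, an assumption
BIN_SIZE = 2 * math.pi / NUM_BINS

def encode_yaw(yaw: torch.Tensor):
    """Split yaw into a discrete bin index and a residual offset from the
    bin center (sketch only, not the repository's exact encoding)."""
    yaw = yaw % (2 * math.pi)
    bin_idx = torch.floor(yaw / BIN_SIZE).long()
    center = bin_idx.float() * BIN_SIZE + BIN_SIZE / 2
    return bin_idx, yaw - center

def decode_yaw(bin_idx: torch.Tensor, residual: torch.Tensor) -> torch.Tensor:
    # Invert the encoding: bin center plus residual.
    return bin_idx.float() * BIN_SIZE + BIN_SIZE / 2 + residual

yaw = torch.tensor([0.3, 3.1, 5.9])
b, r = encode_yaw(yaw)
recon = decode_yaw(b, r)
```

The benefit named in the comparison table falls out of this form: the classification branch handles coarse orientation, while the residual branch only ever regresses values bounded by half a bin width.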
Active implementation scope. The published documentation tracks `projects/configs/bevformer/bevformer_project.py`, the `BEVFormer` detector, `PerceptionTransformer`, `MM_BEVFormerLayer`, `BEVFormerHead`, and `CustomNuScenesDataset`.
Excluded from the main narrative. `projects/configs/bevformerv2/`, `transformer copy.py`, `*_old.py`, and the PETR3D test path are treated as legacy or experimental code and are not presented as the canonical method.
## Method
### Original BEVFormer
The official BEVFormer baseline used for this documentation is the public fundamentalvision/BEVFormer repository. Its base configuration is camera-only, uses BEVFormerLayer inside the encoder, and predicts box geometry, yaw, and velocity directly from the standard detection head without a separate LiDAR branch or dedicated motion head.
### Proposed modifications
BEVFormerFusion adds three code-backed changes to that baseline:
- `MM_BEVFormerLayer` adds a LiDAR deformable-attention branch to each encoder layer and blends it with the camera cross-attention output through a learned sigmoid gate.
- `PerceptionTransformer` snapshots the camera-path BEV before decoder-side fusion, then concatenates projected LiDAR BEV tokens with the encoder output and compresses them back to the model dimension through an identity-initialized linear layer.
- `BEVFormerHead` replaces direct yaw and velocity supervision with yaw-bin / yaw-residual heads and a dedicated velocity cross-attention head that reads from the pre-fusion BEV snapshot.
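The decoder-side concatenate-and-project step can be sketched as follows. The class and variable names are assumptions; only the concat-then-linear structure and the identity initialization come from the documentation. Initializing the weight as `[I | 0]` means the fused output equals the camera BEV at step zero, so the LiDAR path is phased in by training rather than perturbing the pretrained camera features at the start.

```python
import torch
import torch.nn as nn

class ConcatLinearFusion(nn.Module):
    """Hedged sketch of the decoder-side fusion in PerceptionTransformer:
    concatenate projected LiDAR BEV tokens with the encoder BEV output and
    compress back to the model dimension with an identity-initialized
    linear layer. Names are illustrative."""

    def __init__(self, embed_dims: int):
        super().__init__()
        self.fuse = nn.Linear(2 * embed_dims, embed_dims, bias=False)
        with torch.no_grad():
            w = torch.zeros(embed_dims, 2 * embed_dims)
            w[:, :embed_dims] = torch.eye(embed_dims)  # pass camera BEV through unchanged
            self.fuse.weight.copy_(w)                  # LiDAR half starts at zero

    def forward(self, cam_bev: torch.Tensor, lidar_bev: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([cam_bev, lidar_bev], dim=-1))

fusion = ConcatLinearFusion(embed_dims=8)
cam = torch.randn(100, 8)
lidar = torch.randn(100, 8)
out = fusion(cam, lidar)
```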
### Comparison table
| Component | BEVFormer | Ours | Change | Benefit |
|---|---|---|---|---|
| Sensor inputs | Multi-view camera only | Multi-view camera plus PointPillars LiDAR BEV branch | Architectural addition | Adds an explicit geometric feature path in BEV coordinates. |
| Encoder layer | BEVFormerLayer with camera spatial cross-attention only | MM_BEVFormerLayer with camera and LiDAR deformable attention branches | Architectural modification | Allows each BEV layer to blend image evidence with LiDAR BEV evidence. |
| Decoder input | Encoder BEV output only | Encoder BEV plus projected LiDAR BEV through concat + linear fusion | Architectural addition | Preserves a second, direct LiDAR path for object-query decoding. |
| Motion supervision | Velocity channels trained inside the box head | Dedicated velocity cross-attention head with separate loss | Architectural modification | Keeps motion estimation tied to the pre-fusion BEV state instead of the LiDAR-heavy decoder input. |
| Orientation supervision | Direct box regression channels | Yaw-bin and yaw-residual branches | Architectural modification | Separates coarse orientation classification from residual refinement. |
| Token budget | bev_h = bev_w = 200, num_query = 900, encoder depth 6, ResNet-101 backbone | bev_h = bev_w = 100, num_query = 450, encoder depth 4, ResNet-50 backbone | Training and config change | Reduces BEV/query budget and backbone depth relative to the local base config. |
| Temporal handling | Standard prev_bev path in official code | Scene-keyed BEV cache plus no-grad history passes in obtain_history_bev | Efficiency and memory change | Avoids backpropagating through history frames while keeping temporal context. |
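The scene-keyed cache with no-grad history can be sketched as below. The class and method names are assumptions (the repository implements this inside `obtain_history_bev`); the point shown is that history BEV states are stored detached, so gradients never flow through previous frames.

```python
import torch

class HistoryBEVCache:
    """Illustrative sketch of a scene-keyed BEV history cache. Stored
    tensors are detached, so temporal context is reused at forward time
    without backpropagating through earlier frames. Names are assumptions,
    not the repository's API."""

    def __init__(self):
        self._cache = {}

    def get_prev_bev(self, scene_token: str):
        # Returns None on a scene change, so the encoder starts fresh.
        return self._cache.get(scene_token)

    def update(self, scene_token: str, bev: torch.Tensor):
        # Detach before storing: history never carries an autograd graph.
        self._cache[scene_token] = bev.detach()

cache = HistoryBEVCache()
bev = torch.randn(100 * 100, 8, requires_grad=True)
cache.update("scene-0001", bev)
prev = cache.get_prev_bev("scene-0001")
```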
## Results

| Model | Best iter | mAP | NDS |
|---|---|---|---|
| BEVFormer (local baseline) | 100000 | 0.2011 | 0.2192 |
| BEVFormerFusion | 100000 | 0.2507 | 0.2546 |
### Experiments
The public experiment pages focus on the normalized checkpoint summaries that are already tracked in the repository:
- validation mAP and NDS across checkpoint milestones,
- the best local baseline checkpoint at 100k iterations,
- the best fused checkpoint at 100k iterations,
- curve-level comparison across the baseline and fusion runs.
The training configuration for the active method is drawn from `projects/configs/bevformer/bevformer_project.py`: AdamW with `lr = 2e-4`, BEV resolution 100 x 100, object-query count 450, temporal queue length 4, and LiDAR fusion mode `encoder_decoder`.
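The key fields can be summarized in a small config fragment. This is a hedged sketch of only the hyperparameters stated above; the field names follow common mmdetection3d config conventions and may not match the actual file.

```python
# Hedged sketch of the hyperparameters reported for
# projects/configs/bevformer/bevformer_project.py. Field names are
# assumptions in mmdetection3d style, not a copy of the file.
bev_h, bev_w = 100, 100

config = dict(
    optimizer=dict(type="AdamW", lr=2e-4),
    bev_h=bev_h,
    bev_w=bev_w,
    num_query=450,                   # object queries
    queue_length=4,                  # temporal queue of history frames
    fusion_mode="encoder_decoder",   # LiDAR fused at both stages
)
```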
## Documentation

- **Architecture**: system summary, module roles, data flow, and training/inference paths.
- **BEVFormer comparison**: code-backed differences between the official baseline and the active fusion path.
- **Experiments**: checkpoint metrics and analysis for the published baseline and fusion runs.
- **Usage**: installation assumptions, dataset expectations, and common train/eval commands.
- **API reference**: code-to-documentation mapping for the active implementation path.