BEVFormer Comparison
This comparison is restricted to two sources: the official fundamentalvision/BEVFormer implementation for the baseline and the active fusion path in this repository for the project column. The local projects/configs/bevformer/bevformer_base.py is used only as a sanity check that the in-repo baseline stays aligned with the upstream configuration.
Baseline references
- Official repository: https://github.com/fundamentalvision/BEVFormer
- Baseline config: `projects/configs/bevformer/bevformer_base.py`
- Baseline transformer: `projects/mmdet3d_plugin/bevformer/modules/transformer.py`
- Baseline encoder: `projects/mmdet3d_plugin/bevformer/modules/encoder.py`
- Baseline head: `projects/mmdet3d_plugin/bevformer/dense_heads/bevformer_head.py`
Core comparison
| Component | BEVFormer | This Project | Change Type | Technical Impact |
|---|---|---|---|---|
| Sensor modality | Camera-only (`use_lidar=False`) | Camera + LiDAR (`use_lidar=True`) | Architectural addition | Introduces a second BEV feature path derived from PointPillars. |
| Camera backbone and preprocessing | ResNet-101 with DCNv2, Caffe-style normalization, `bev_h = bev_w = 200`, `num_query = 900` | ResNet-50 without DCNv2, PyTorch-style normalization, `bev_h = bev_w = 100`, `num_query = 450` | Training and config change | Shrinks the local token/query budget and backbone depth relative to the local base config. |
| LiDAR branch | None in the published base path | PointPillars voxelization, pillar encoding, and scatter-to-BEV pipeline | Architectural addition | Produces a LiDAR BEV map that can be fused directly in BEV space. |
| Encoder layer class | `BEVFormerLayer` | `MM_BEVFormerLayer` | Architectural modification | Replaces camera-only encoder cross-attention with a multi-modal encoder layer. |
| Encoder cross-attention | Camera `SpatialCrossAttention` only | Camera `SpatialCrossAttention` plus LiDAR `CustomMSDeformableAttention` with learned sigmoid blending | Architectural modification | Enables per-layer fusion between image-lifted BEV evidence and LiDAR BEV evidence. |
| Decoder input | Encoder BEV only | Encoder BEV concatenated with projected LiDAR BEV and compressed by `lidar_fuse_linear` | Architectural addition | Adds a direct LiDAR shortcut to the decoder in addition to the encoder fusion path. |
| Decoder fusion initialization | Not applicable | Identity-initialized camera passthrough with zero LiDAR half | Efficiency and memory change | Starts training from the camera-only decoder path and learns LiDAR contribution incrementally. |
| Yaw supervision | Direct box regression channels | Separate yaw-bin and yaw-residual heads | Architectural modification | Splits coarse orientation selection from residual refinement. |
| Velocity supervision | Velocity predicted in the box regression branch | Dedicated velocity cross-attention head on `bev_embed_cam`; box velocity weights zeroed | Architectural modification | Decouples motion estimation from the LiDAR-heavy decoder representation. |
| Temporal history execution | Standard `prev_bev` path | `obtain_history_bev()` runs history in eval mode with no gradients and caches by scene key | Efficiency and memory change | Avoids backpropagating through history frames while preserving temporal context. |
```mermaid
flowchart LR
    accTitle: BEVFormer Versus BEVFormerFusion
    accDescr: The official BEVFormer path is camera-only, while BEVFormerFusion adds PointPillars LiDAR BEV, multi-modal encoder fusion, decoder fusion, and a dedicated velocity head.
    subgraph baseline["Official BEVFormer"]
        cam_b["Multi-view images"]
        enc_b["BEVFormerLayer<br/>camera attention only"]
        dec_b["Decoder on encoder BEV"]
        head_b["Shared class / box / yaw / velocity head"]
        cam_b --> enc_b --> dec_b --> head_b
    end
    subgraph fusion["BEVFormerFusion"]
        cam_f["Multi-view images"]
        lidar_f["LiDAR points"]
        pillars_f["PointPillars LiDAR BEV"]
        enc_f["MM_BEVFormerLayer<br/>camera + LiDAR attention"]
        dec_f["Decoder fusion<br/>concat + project"]
        cam_bev_f["Camera-path BEV snapshot"]
        head_f["Class + box + yaw heads"]
        vel_f["Dedicated velocity head"]
        cam_f --> enc_f
        lidar_f --> pillars_f --> enc_f
        enc_f --> dec_f
        pillars_f --> dec_f
        enc_f --> cam_bev_f --> vel_f
        dec_f --> head_f
    end
    classDef baseline_style fill:#f8f1e4,stroke:#bb6b2c,color:#15263b
    classDef fusion_style fill:#e5f4ef,stroke:#136f63,color:#15263b
    class cam_b,enc_b,dec_b,head_b baseline_style
    class cam_f,lidar_f,pillars_f,enc_f,dec_f,cam_bev_f,head_f,vel_f fusion_style
```
Figure: The comparison diagram isolates the method-level differences behind the table: LiDAR enters only in the fusion path, affects both encoder and decoder BEV representations, and leaves motion estimation on a separate camera-path branch.
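The yaw-bin and yaw-residual split listed in the table can be illustrated with a small framework-free sketch. The bin count (`NUM_BINS = 8`), the bin layout over [-π, π), and the helper names `encode_yaw` and `decode_yaw` are assumptions chosen for illustration; the repository's actual encoding may differ.

```python
import math

NUM_BINS = 8  # assumed bin count; the repository's value may differ
BIN_WIDTH = 2 * math.pi / NUM_BINS


def encode_yaw(yaw: float) -> tuple:
    """Map a yaw angle to (bin index, residual from the bin centre)."""
    # Shift and wrap into [0, 2*pi) so that bin 0 starts at -pi.
    shifted = (yaw + math.pi) % (2 * math.pi)
    bin_idx = min(int(shifted // BIN_WIDTH), NUM_BINS - 1)
    centre = (bin_idx + 0.5) * BIN_WIDTH
    return bin_idx, shifted - centre


def decode_yaw(bin_idx: int, residual: float) -> float:
    """Invert encode_yaw: bin centre plus residual, wrapped to [-pi, pi)."""
    shifted = (bin_idx + 0.5) * BIN_WIDTH + residual
    return (shifted % (2 * math.pi)) - math.pi
```

Under these assumptions the bin index would be a classification target and the residual a regression target bounded by half a bin width, which is one common way to separate coarse orientation selection from refinement.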
New modules and modified attention paths
New modules
- `pts_voxel_layer`, `pts_voxel_encoder`, and `pts_middle_encoder` in the active config add the PointPillars LiDAR encoder path.
- `lidar_encoder_proj`, `lidar_proj`, `lidar_fuse_linear`, and `lidar_fuse_norm` in `PerceptionTransformer` implement the two LiDAR fusion stages.
- `yaw_bin_branches`, `yaw_res_branches`, `vel_cross_attn`, and `vel_branches` in `BEVFormerHead` implement dedicated orientation and motion heads.
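The identity-initialized camera passthrough described for `lidar_fuse_linear` can be sketched in plain Python. This is a minimal illustration, assuming the layer maps a concatenated `[camera, LiDAR]` vector of width 2C to width C with zero bias and that the camera half comes first; the helper names and the toy channel width are hypothetical.

```python
def identity_passthrough_weight(channels: int) -> list:
    """Build a (channels x 2*channels) weight: identity on the camera half, zeros on the LiDAR half."""
    return [
        [1.0 if col == row else 0.0 for col in range(2 * channels)]
        for row in range(channels)
    ]


def fuse(weight: list, cam: list, lidar: list) -> list:
    """Apply the linear layer (zero bias) to the concatenated [cam, lidar] vector."""
    concat = cam + lidar  # assumed order: camera features first
    return [sum(w * x for w, x in zip(row, concat)) for row in weight]


cam_token = [0.5, -1.0, 2.0]
lidar_token = [9.0, 9.0, 9.0]
weight = identity_passthrough_weight(len(cam_token))
fused = fuse(weight, cam_token, lidar_token)  # equals cam_token at initialization
```

With this initialization the fused output ignores the LiDAR half until training moves the zero weights, which matches the stated goal of starting from the camera-only decoder path.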
Modified attention
- Encoder cross-attention now has two branches in `MM_BEVFormerLayer`: `SpatialCrossAttention` for camera features and `lidar_cross_attn_layer` for LiDAR BEV tokens.
- Velocity estimation uses full `nn.MultiheadAttention` from decoder queries to the pre-fusion BEV snapshot rather than reusing the box branch output.
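The two-branch encoder cross-attention can be pictured as a gated blend of the camera and LiDAR attention outputs. The sketch below is only one plausible form: it assumes a single scalar gate logit and a convex combination in which the gate weights the camera branch. The actual gate shape and placement in `MM_BEVFormerLayer` may differ, and `blend` is a hypothetical helper.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def blend(cam_out: list, lidar_out: list, gate_logit: float) -> list:
    """Convex blend of the two cross-attention outputs for one BEV token."""
    g = sigmoid(gate_logit)  # learned during training; fixed here for illustration
    return [g * c + (1.0 - g) * l for c, l in zip(cam_out, lidar_out)]


cam_out = [1.0, 2.0]
lidar_out = [3.0, 6.0]
balanced = blend(cam_out, lidar_out, gate_logit=0.0)   # g = 0.5, equal weighting
cam_heavy = blend(cam_out, lidar_out, gate_logit=4.0)  # g is close to 1, camera dominates
```

Because the gate is bounded to (0, 1), neither branch can be amplified beyond its own output, so the blend stays on the same scale as the single-branch baseline.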
Efficiency and memory-oriented changes
- The active config halves the BEV grid resolution per side (200 to 100, leaving one quarter of the BEV cells) and the query count (900 to 450) relative to the local base config.
- The active config reduces encoder depth from 6 to 4 layers.
- History BEV reconstruction is explicitly run without gradients in `BEVFormer.obtain_history_bev()`.
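The structural reductions above can be made concrete with a short calculation. The numbers come from the comparison table and the list above; the dictionary keys (in particular `encoder_layers`) are illustrative names rather than the repository's config fields, and the ratios count tokens and queries only, not measured runtime or memory.

```python
baseline = {"bev_h": 200, "bev_w": 200, "num_query": 900, "encoder_layers": 6}
active = {"bev_h": 100, "bev_w": 100, "num_query": 450, "encoder_layers": 4}


def bev_tokens(cfg: dict) -> int:
    """Number of BEV grid cells, i.e. BEV query tokens per encoder layer."""
    return cfg["bev_h"] * cfg["bev_w"]


# Halving each side quarters the grid: 40000 vs 10000 tokens.
token_ratio = bev_tokens(baseline) / bev_tokens(active)
# Decoder object queries: 900 vs 450.
query_ratio = baseline["num_query"] / active["num_query"]
# BEV tokens summed over encoder layers: 240000 vs 40000.
token_layer_ratio = (
    bev_tokens(baseline) * baseline["encoder_layers"]
) / (bev_tokens(active) * active["encoder_layers"])
```

These ratios (4x fewer BEV tokens, 2x fewer queries, 6x fewer token-layer products) describe the size of the computation graph only and should not be read as speedup figures.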
Comparison notes
- The official BEVFormer code already includes temporal BEV handling and NMS-free decoding. Those paths are retained rather than replaced.
- The project changes mix architectural and configuration differences. The table labels them separately so that implementation changes are not conflated with local compute-budget reductions.
- Runtime benefits are not claimed quantitatively because the repository does not contain a benchmark artifact for FPS or memory. Only structural compute reductions, such as smaller grids and fewer queries, are described here.