Master's Thesis · 2025 · Unsupervised · CTU FEE

FlowSeg4D: Unsupervised 4D Panoptic Segmentation

An online framework for 4D panoptic segmentation of LiDAR driving scenes that requires no labeled training data. It combines semantic segmentation, scene flow estimation, and temporal clustering to produce consistent instance tracks across time, reaching 46.9 LSTQ on SemanticKITTI val and 52.2 LSTQ on nuScenes val. Fully supervised methods still score higher (62.7 to 73.9 LSTQ on SemanticKITTI val).

46.9 LSTQ · SemanticKITTI val
52.2 LSTQ · nuScenes val
0 Labeled training samples
Online Processing mode

Method

4D panoptic segmentation extends panoptic segmentation to temporal sequences — each point must be assigned both a semantic class and a consistent instance identity across frames. FlowSeg4D achieves this without any labels by combining three components.
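To make the per-point output concrete, the sketch below packs a semantic class and an instance ID into one 32-bit value per point, following the SemanticKITTI label layout (lower 16 bits for the class, upper 16 bits for the instance). The helper names `pack_labels` and `unpack_labels` are hypothetical, and whether FlowSeg4D stores its output in this form is an assumption.

```python
import numpy as np

def pack_labels(semantic: np.ndarray, instance: np.ndarray) -> np.ndarray:
    """Pack per-point class and instance IDs into one uint32 per point."""
    semantic = semantic.astype(np.uint32)
    instance = instance.astype(np.uint32)
    # Upper 16 bits: instance ID, lower 16 bits: semantic class.
    return (instance << np.uint32(16)) | (semantic & np.uint32(0xFFFF))

def unpack_labels(packed: np.ndarray):
    """Recover (semantic, instance) arrays from packed uint32 labels."""
    semantic = packed & np.uint32(0xFFFF)
    instance = packed >> np.uint32(16)
    return semantic, instance

# Illustrative frame of four points: two points of one car, a person, road.
semantic = np.array([10, 10, 30, 40])  # SemanticKITTI raw IDs: car, car, person, road
instance = np.array([1, 1, 2, 0])      # 0 used here for "no instance" (stuff)
packed = pack_labels(semantic, instance)
sem, inst = unpack_labels(packed)
```

For the result to count as 4D panoptic, the instance ID of a given object has to stay the same in every frame in which that object appears.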

Task diagram: semantic + instance segmentation combine into 4D panoptic segmentation

The 4D panoptic segmentation task

Pipeline: LiDAR frames fed through WaffleIron and scene flow into object discovery, association, and panoptic output

FlowSeg4D pipeline — semantic segmentation (yellow), scene flow (beige), instance association (red)

Semantic segmentation

WaffleIron WI-48-768, pretrained unsupervised via ScaLR on four LiDAR sensor types using DINOv2 features. Linear probing on the target dataset provides class labels while keeping annotation requirements minimal.

Scene flow estimation

Let-It-Flow — an unsupervised optimisation-based model selected for low error on vulnerable road users (pedestrians, cyclists). Flow vectors are precomputed and used by the association module to update cluster positions before matching.

Instance association

A clustering and Hungarian-matching module that links object clusters across frames. The long-term variant maintains a temporal window of previous frames and uses WaffleIron embeddings alongside spatial distance to resolve ambiguous matches.

Association pipeline

Four progressively richer association strategies were developed and evaluated. All share the same clustering step; they differ in how clusters are matched across frames.
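As a rough illustration of the shared clustering step, the sketch below runs DBSCAN separately on the points of each foreground class and summarises every cluster by a centre. It assumes scikit-learn is available. The function name `discover_objects`, the `eps` and `min_samples` values, and the use of the per-axis median as the centre are assumptions for illustration, not settings taken from the thesis.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def discover_objects(points, labels, foreground_classes, eps=0.7, min_samples=5):
    """Cluster foreground points per class.

    Returns a list of (class_id, centre, point_indices) tuples.
    eps and min_samples are placeholder values, not tuned settings.
    """
    objects = []
    for cls in foreground_classes:
        idx = np.flatnonzero(labels == cls)
        if idx.size < min_samples:
            continue  # too few points to form a cluster of this class
        cluster_ids = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points[idx])
        for cid in np.unique(cluster_ids):
            if cid == -1:
                continue  # DBSCAN marks noise points with -1
            members = idx[cluster_ids == cid]
            centre = np.median(points[members], axis=0)  # assumed centre definition
            objects.append((cls, centre, members))
    return objects

# Synthetic frame: two compact "car" blobs 10 m apart plus scattered background.
rng = np.random.default_rng(0)
car_a = rng.normal([0.0, 0.0, 0.0], 0.05, size=(20, 3))
car_b = rng.normal([10.0, 0.0, 0.0], 0.05, size=(20, 3))
background = rng.uniform(-20, 20, size=(20, 3))
points = np.vstack([car_a, car_b, background])
labels = np.array([1] * 40 + [0] * 20)  # 1 = hypothetical foreground class
objects = discover_objects(points, labels, foreground_classes=[1])
```

Clustering per class keeps a pedestrian standing next to a car from being merged into the car's cluster, at the price of inheriting any mistakes in the semantic labels.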

1

Naive

  1. Cluster foreground semantic points per class (ALPINE / DBSCAN / HDBSCAN)
  2. Hungarian match between current and previous frame using cluster-centre distance as cost
  3. Accept match if distance < 3.5 m → assign same instance ID

Fast and interpretable, but limited to one previous frame and struggles with occlusions.
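The matching in steps 2 and 3 can be sketched with SciPy's `linear_sum_assignment`, which solves the assignment problem the Hungarian method addresses. The function name `match_naive` and the ID bookkeeping are hypothetical, and per-class matching is left out for brevity; only the 3.5 m threshold comes from the description above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

MAX_DIST = 3.5  # metres, acceptance threshold from the Naive strategy

def match_naive(prev_centres, prev_ids, curr_centres, next_id, max_dist=MAX_DIST):
    """Assign instance IDs to current clusters from the previous frame.

    Unmatched clusters receive fresh IDs starting at next_id.
    Returns (curr_ids, updated next_id).
    """
    curr_ids = np.full(len(curr_centres), -1, dtype=int)
    if len(prev_centres) and len(curr_centres):
        # Pairwise Euclidean distances, shape (n_curr, n_prev).
        cost = np.linalg.norm(curr_centres[:, None, :] - prev_centres[None, :, :], axis=2)
        rows, cols = linear_sum_assignment(cost)
        for r, c in zip(rows, cols):
            if cost[r, c] < max_dist:  # reject implausibly long jumps
                curr_ids[r] = prev_ids[c]
    for i in np.flatnonzero(curr_ids == -1):
        curr_ids[i] = next_id
        next_id += 1
    return curr_ids, next_id

prev_centres = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
prev_ids = np.array([1, 2])
curr_centres = np.array([[10.5, 0.0, 0.0], [0.4, 0.0, 0.0], [30.0, 0.0, 0.0]])
curr_ids, next_id = match_naive(prev_centres, prev_ids, curr_centres, next_id=3)
```

In this toy frame the first two clusters inherit IDs 2 and 1, and the distant third cluster starts a new track.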

2

Naive + Scene Flow

  1. Shift cluster centres by the mean scene flow vector before matching
  2. Otherwise identical to Naive

Raises LSTQ on SemanticKITTI (42.1 to 44.7) but lowers it on nuScenes (50.4 to 47.8); most effective when semantic labels are noisy.
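A minimal sketch of the flow-compensation step, assuming per-point flow vectors are available for the previous frame and that each previous-frame cluster is shifted by the mean flow of its own points. The name `advect_centres` and the array layout are hypothetical.

```python
import numpy as np

def advect_centres(centres, point_flow, cluster_members):
    """Shift each cluster centre by the mean flow of its member points.

    centres:         (K, 3) cluster centres in the previous frame
    point_flow:      (N, 3) per-point flow vectors, previous -> current frame
    cluster_members: list of K index arrays into point_flow
    """
    shifted = np.asarray(centres, dtype=float).copy()
    for k, members in enumerate(cluster_members):
        shifted[k] += point_flow[members].mean(axis=0)
    return shifted

# One object that moves 4 m along x between frames.
prev_centres = np.array([[0.0, 0.0, 0.0]])
point_flow = np.tile([4.0, 0.0, 0.0], (5, 1))  # five points, identical flow
members = [np.arange(5)]
curr_centre = np.array([4.0, 0.0, 0.0])

shifted = advect_centres(prev_centres, point_flow, members)
dist_without_flow = np.linalg.norm(curr_centre - prev_centres[0])
dist_with_flow = np.linalg.norm(curr_centre - shifted[0])
```

Without compensation the 4 m jump exceeds the 3.5 m acceptance threshold and the track would break; after shifting, the two centres coincide.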

3

Long-term window (best)

  1. Maintain a window of N previous frames (optimal: 6)
  2. Represent each cluster by its mean WaffleIron embedding
  3. Cost matrix = cluster-centre distance + feature dissimilarity (1 − cosine similarity), weighted by α = 0.1
  4. Hungarian match; accept only if both distance (< 4.5 m) and feature thresholds (dissimilarity < 0.4) pass

Improves on Naive across all datasets and clustering methods; in the validation tables, LSTQ rises from 42.1 to 46.9 on SemanticKITTI and from 50.4 to 52.2 on nuScenes.
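The cost and gating in steps 3 and 4 can be sketched as follows. The description does not say which term α multiplies; this sketch assumes it scales the feature term, so the cost is distance + α · (1 − cosine similarity). The function names are hypothetical, and the cache of window objects is passed in as plain arrays.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

ALPHA = 0.1       # feature weight (assumed to scale the dissimilarity term)
MAX_DIST = 4.5    # metres
MAX_DISSIM = 0.4  # 1 - cosine similarity

def cosine_dissimilarity(a, b):
    """Pairwise 1 - cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T

def match_long_term(curr_centres, curr_feats, cache_centres, cache_feats, cache_ids):
    """Return {current cluster index: instance ID} for accepted matches."""
    dist = np.linalg.norm(curr_centres[:, None, :] - cache_centres[None, :, :], axis=2)
    dissim = cosine_dissimilarity(curr_feats, cache_feats)
    cost = dist + ALPHA * dissim
    rows, cols = linear_sum_assignment(cost)
    matches = {}
    for r, c in zip(rows, cols):
        # Both gates must pass, otherwise the cluster starts a new track.
        if dist[r, c] < MAX_DIST and dissim[r, c] < MAX_DISSIM:
            matches[int(r)] = int(cache_ids[c])
    return matches

# Two cached objects are equally far from the new cluster;
# only the embedding tells them apart.
cache_centres = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
cache_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
cache_ids = np.array([7, 8])
curr_centres = np.array([[0.5, 0.0, 0.0]])
curr_feats = np.array([[0.0, 1.0]])
matches = match_long_term(curr_centres, curr_feats, cache_centres, cache_feats, cache_ids)
```

Here the spatial term is a tie (0.5 m to both cached objects), so the feature term decides in favour of ID 8. Gating on the raw distance and dissimilarity rather than on the combined cost keeps the two thresholds interpretable on their own.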

4

Long-term + Scene Flow

  1. Update previous-frame cluster centres with mean scene flow before matching
  2. Otherwise identical to Long-term window

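No measurable gain over the long-term variant alone on either validation set (LSTQ stays at 46.9 on SemanticKITTI and 52.2 on nuScenes); the information scene flow adds appears to be already captured by the embedding cost.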
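One way to picture the object cache behind the long-term variants is a small store that moves cached centres with their mean flow and forgets objects that have not been seen within the window. The class and method names below are hypothetical, and leaving objects without a flow estimate in place is an assumption of this sketch, not a documented behaviour.

```python
from dataclasses import dataclass
import numpy as np

WINDOW = 6  # frames, the reported optimum for the long-term window

@dataclass
class CachedObject:
    instance_id: int
    centre: np.ndarray
    feature: np.ndarray
    last_seen: int

class ObjectCache:
    def __init__(self, window=WINDOW):
        self.window = window
        self.objects = {}

    def observe(self, instance_id, centre, feature, frame):
        """Insert or refresh an object that was matched in this frame."""
        self.objects[instance_id] = CachedObject(
            instance_id, np.asarray(centre, float), np.asarray(feature, float), frame
        )

    def advance(self, frame, flows):
        """Prepare the cache for matching at `frame`.

        flows maps instance ID -> mean flow vector of that object's points.
        Objects older than the window are dropped; the rest are moved.
        """
        for oid in list(self.objects):
            obj = self.objects[oid]
            if frame - obj.last_seen > self.window:
                del self.objects[oid]
            elif oid in flows:
                obj.centre = obj.centre + np.asarray(flows[oid], float)

cache = ObjectCache()
cache.observe(1, [0.0, 0.0, 0.0], [1.0, 0.0], frame=0)
cache.advance(frame=1, flows={1: [1.0, 0.0, 0.0]})
centre_after_flow = cache.objects[1].centre.copy()
cache.advance(frame=7, flows={})  # last seen 7 frames ago, outside the window
```

After the first `advance` the cached centre sits at x = 1 m; by frame 7 the object has left the six-frame window and is removed.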

Association pipeline diagram: object discovery (clustering, median, cosine similarity) feeding into instance ID assignment (cost matrix, Hungarian matching, object cache)

Full association pipeline — object discovery (left) feeds cluster centres and embeddings into the instance ID assignment module (right)

SemanticKITTI

FlowSeg4D on SemanticKITTI — semantic (top) and panoptic (bottom), 10× speed

Method progression

DBSCAN gives the best results on SemanticKITTI; HDBSCAN on nuScenes. S_cls is fixed by the semantic model and does not change between association variants.

SemanticKITTI — validation

Method               LSTQ   S_asc   S_cls
Naive                42.1   31.8    55.8
+ Scene Flow         44.7   35.9    55.8
+ Long Window        46.9   39.5    55.8
+ LW + Scene Flow    46.9   39.5    55.8

nuScenes — validation

Method               LSTQ   S_asc   S_cls
Naive                50.4   37.0    68.7
+ Scene Flow         47.8   33.2    68.7
+ Long Window        52.2   39.7    68.7
+ LW + Scene Flow    52.2   39.7    68.7
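LSTQ is defined by Aygun et al. as the geometric mean of the association score and the classification score, LSTQ = sqrt(S_asc × S_cls), which is why association gains move LSTQ even though S_cls stays fixed. The short check below recomputes LSTQ from the rounded values in the two tables above; small deviations from the reported figures are expected because the inputs are already rounded.

```python
import math

def lstq(s_assoc: float, s_cls: float) -> float:
    """Geometric mean of association and classification scores."""
    return math.sqrt(s_assoc * s_cls)

# (S_asc, S_cls, reported LSTQ) copied from the validation tables.
rows = {
    "SemanticKITTI / Naive":         (31.8, 55.8, 42.1),
    "SemanticKITTI / + Scene Flow":  (35.9, 55.8, 44.7),
    "SemanticKITTI / + Long Window": (39.5, 55.8, 46.9),
    "nuScenes / Naive":              (37.0, 68.7, 50.4),
    "nuScenes / + Scene Flow":       (33.2, 68.7, 47.8),
    "nuScenes / + Long Window":      (39.7, 68.7, 52.2),
}

# Absolute difference between recomputed and reported LSTQ per row.
deviations = {
    name: abs(lstq(s_assoc, s_cls) - reported)
    for name, (s_assoc, s_cls, reported) in rows.items()
}

for name, dev in deviations.items():
    print(f"{name}: recomputed LSTQ differs from reported by {dev:.2f}")
```

Because the mean is geometric, a method cannot compensate for weak association with strong classification alone: both factors have to be high.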

Comparison with state of the art

All supervised baselines are trained with full point-level annotations. FlowSeg4D (marked ✓) uses no labels.

SemanticKITTI — validation

Method             Unsup.   LSTQ   S_asc   S_cls
4D-PLS [1]         -        62.7   65.1    60.5
4D-StOP [2]        -        67.0   74.4    60.3
Mask4D [3]         -        71.4   75.4    67.5
Mask4Former [4]    -        70.5   74.3    66.9
4D-Former [5]      -        73.9   80.9    67.6
Ours (LW)          ✓        46.9   39.5    55.8

SemanticKITTI — test

Method             Unsup.   LSTQ   S_asc   S_cls
4D-PLS [1]         -        56.9   56.4    57.4
CIA [6]            -        63.1   65.7    60.6
4D-StOP [2]        -        63.9   69.5    58.8
Mask4D [3]         -        64.3   66.4    62.2
Mask4Former [4]    -        68.4   67.3    69.6
Ours (LW+SF)       ✓        39.3   29.9    51.5

  1. Aygun et al., 4D Panoptic LiDAR Segmentation, CVPR 2021
  2. Kreuzberg et al., 4D-StOP: Panoptic Segmentation of 4D LiDAR Using Spatio-Temporal Object Proposal Generation and Aggregation, ECCV Workshops 2023
  3. Marcuzzi et al., Mask4D: End-to-End Mask-Based 4D Panoptic Segmentation for LiDAR Sequences, RA-L 2023
  4. Yilmaz et al., Mask4Former: Mask Transformer for 4D Panoptic Segmentation, ICRA 2024
  5. Athar et al., 4D-Former: Multimodal 4D Panoptic Segmentation, CoRL 2023
  6. Marcuzzi et al., Contrastive Instance Association for 4D Panoptic Segmentation Using Sequences of 3D LiDAR Scans, RA-L 2022

Qualitative results

SemanticKITTI semantic segmentation: ground truth (top) vs WaffleIron prediction (bottom)

Semantic segmentation — ground truth (top row) vs WaffleIron linear probing (bottom row) across 5 frames

Instance segmentation comparison: ALPINE, DBSCAN, HDBSCAN across frames

Instance association (long-term + scene flow): ALPINE (row 1), DBSCAN (rows 2–3), HDBSCAN (row 4) across 5 frames. Consistent colour = consistent instance identity.

Temporal instance consistency

10 consecutive frames — each coloured blob is a tracked instance; consistent colour across frames indicates correct temporal association.

Failure cases

The three clustering methods each exhibit characteristic failure modes. ALPINE (row 1) over-segments; DBSCAN (row 2) merges nearby clusters; HDBSCAN (row 3) struggles with sparse objects.

Cross-dataset generalisation — PONE

FlowSeg4D was applied to the PONE dataset with no retraining or fine-tuning, using models pretrained on SemanticKITTI and nuScenes. Both semantic and instance outputs transfer without adaptation.

FlowSeg4D on PONE — semantic (left) and panoptic (right), no retraining
PONE dataset semantic segmentation: WaffleIron SemanticKITTI model (top) vs nuScenes model (bottom)

Semantic segmentation on PONE — SemanticKITTI model (top row) vs nuScenes model (bottom row) across 3 frames

PONE dataset instance segmentation: long-term association with ALPINE, DBSCAN, HDBSCAN

Instance association (long-term) on PONE — ALPINE (row 1), DBSCAN (row 2), HDBSCAN (row 3)

Master's thesis at CTU FEE, 2025.