Master's Thesis · 2025 · Unsupervised · CTU FEE

FlowSeg4D: Unsupervised 4D Panoptic Segmentation

An online framework for 4D panoptic segmentation of LiDAR driving scenes that requires no labeled training data. It combines semantic segmentation, scene flow estimation, and temporal clustering to produce consistent instance tracks across time, reaching 46.9 LSTQ on SemanticKITTI val and 52.2 LSTQ on nuScenes val. That is below the fully supervised methods it is compared against on SemanticKITTI, but obtained without their point-level annotations.

Read the thesis Source code

46.9 LSTQ · SemanticKITTI val

52.2 LSTQ · nuScenes val

0 labeled training samples

Online processing mode

Method

4D panoptic segmentation extends panoptic segmentation to temporal sequences — each point must be assigned both a semantic class and a consistent instance identity across frames. FlowSeg4D achieves this without any labels by combining three components.

Task diagram: semantic + instance segmentation combine into 4D panoptic segmentation

The 4D panoptic segmentation task
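
To make the per-point output concrete: SemanticKITTI label files store each point's annotation as one 32-bit integer, with the semantic class in the lower 16 bits and the instance ID in the upper 16 bits. The sketch below packs and unpacks such labels. The function names are illustrative and are not taken from the FlowSeg4D code.

```python
def pack_label(semantic_id: int, instance_id: int) -> int:
    """Combine a semantic class and an instance ID into one 32-bit label."""
    if not (0 <= semantic_id < (1 << 16) and 0 <= instance_id < (1 << 16)):
        raise ValueError("both IDs must fit into 16 bits")
    return (instance_id << 16) | semantic_id


def unpack_label(label: int) -> tuple:
    """Split a 32-bit label into (semantic_id, instance_id)."""
    return label & 0xFFFF, label >> 16


# A point of class 10 that belongs to instance 3.
label = pack_label(10, 3)
semantic_id, instance_id = unpack_label(label)
```

In the 4D setting the instance ID of a given object has to stay the same in every frame in which the object appears, which is what the association module is responsible for.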

Pipeline: LiDAR frames fed through WaffleIron and scene flow into object discovery, association, and panoptic output

FlowSeg4D pipeline — semantic segmentation (yellow), scene flow (beige), instance association (red)

Semantic segmentation

WaffleIron WI-48-768, pretrained unsupervised via ScaLR on four LiDAR sensor types using DINOv2 features. Linear probing on the target dataset provides class labels while keeping annotation requirements minimal.

Scene flow estimation

Let-It-Flow — an unsupervised optimisation-based model selected for low error on vulnerable road users (pedestrians, cyclists). Flow vectors are precomputed and used by the association module to update cluster positions before matching.

Instance association

A clustering and Hungarian-matching module that links object clusters across frames. The long-term variant maintains a temporal window of previous frames and uses WaffleIron embeddings alongside spatial distance to resolve ambiguous matches.

Association pipeline

Four progressively richer association strategies were developed and evaluated. All share the same clustering step; they differ in how clusters are matched across frames.

Naive

Cluster foreground semantic points per class (ALPINE / DBSCAN / HDBSCAN)
Hungarian match between current and previous frame using cluster-centre distance as cost
Accept match if distance < 3.5 m → assign same instance ID

Fast and interpretable, but limited to one previous frame and struggles with occlusions.
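
The naive strategy can be sketched as follows. This is a minimal, dependency-free illustration and not the thesis implementation: `best_assignment` finds the optimal one-to-one assignment by exhaustive search, standing in for the Hungarian algorithm (which returns the same assignment and is what one would use for realistic cluster counts, for example via `scipy.optimize.linear_sum_assignment`). The function names, the `prev` dictionary layout, and the rule that unmatched clusters receive a fresh ID are assumptions.

```python
from itertools import permutations
from math import dist


def best_assignment(cost):
    """Return (row, col) pairs minimising the total cost.

    Exhaustive search, only suitable for a handful of clusters.
    """
    n_rows = len(cost)
    n_cols = len(cost[0]) if cost else 0
    if n_rows == 0 or n_cols == 0:
        return []
    if n_rows <= n_cols:
        # Pick a distinct column for every row.
        cols = min(permutations(range(n_cols), n_rows),
                   key=lambda p: sum(cost[r][c] for r, c in enumerate(p)))
        return list(enumerate(cols))
    # More rows than columns: pick a distinct row for every column.
    rows = min(permutations(range(n_rows), n_cols),
               key=lambda p: sum(cost[r][c] for c, r in enumerate(p)))
    return [(r, c) for c, r in enumerate(rows)]


def naive_associate(prev, curr_centres, next_id, max_dist=3.5):
    """Assign instance IDs to the clusters of the current frame.

    prev:         dict mapping instance ID -> cluster centre (previous frame)
    curr_centres: list of cluster centres in the current frame
    Returns (list of IDs aligned with curr_centres, updated next_id).
    """
    prev_ids = list(prev)
    cost = [[dist(c, prev[i]) for i in prev_ids] for c in curr_centres]
    ids = [None] * len(curr_centres)
    for r, c in best_assignment(cost):
        if cost[r][c] < max_dist:      # accept only matches closer than 3.5 m
            ids[r] = prev_ids[c]
    for r in range(len(ids)):          # assumption: unmatched -> new instance
        if ids[r] is None:
            ids[r] = next_id
            next_id += 1
    return ids, next_id


prev = {7: (10.0, 0.0, 0.0), 8: (0.0, 5.0, 0.0)}
curr = [(0.5, 5.2, 0.0), (10.8, 0.1, 0.0), (30.0, 30.0, 0.0)]
ids, next_id = naive_associate(prev, curr, next_id=9)   # ids == [8, 7, 9]
```

The third cluster is far from both previous centres, so it fails the 3.5 m gate and starts a new track.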

Naive + Scene Flow

Shift cluster centres by the mean scene flow vector before matching
Otherwise identical to Naive

Improves LSTQ on SemanticKITTI (42.1 → 44.7) but lowers it on nuScenes (50.4 → 47.8); most effective when semantic labels are noisy.
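
The shift itself is a small operation. The sketch below assumes that the flow vectors belong to the points of a previous-frame cluster and point from that frame to the current one, so that adding their mean moves the centre to where the object is expected to be. `propagate_centre` is a hypothetical helper name.

```python
def propagate_centre(centre, flows):
    """Shift a cluster centre by the mean of its points' scene flow vectors.

    centre: (x, y, z) of the cluster in the previous frame
    flows:  list of per-point (dx, dy, dz) flow vectors for that cluster
    """
    if not flows:
        return tuple(centre)           # no flow available: leave centre as is
    n = len(flows)
    mean_flow = tuple(sum(f[k] for f in flows) / n for k in range(3))
    return tuple(c + m for c, m in zip(centre, mean_flow))


# A cluster moving roughly 0.9 m along x between frames.
shifted = propagate_centre((10.0, 0.0, 0.0),
                           [(1.0, 0.0, 0.0), (0.8, 0.2, 0.0)])
```

The shifted centres then replace the original ones in the distance cost; the matching step is unchanged.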

Long-term window (best)

Maintain a window of N previous frames (optimal: 6)
Represent each cluster by its mean WaffleIron embedding
Cost matrix = cluster-centre distance + feature dissimilarity (1 − cosine similarity), weighted by α = 0.1
Hungarian match; accept only if both distance (< 4.5 m) and feature thresholds (dissimilarity < 0.4) pass

Consistent improvement over Naive across all datasets and clustering methods.
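
A minimal sketch of the cost and gating terms listed above, under stated assumptions. The page does not say which term α multiplies; the sketch reads it as `cost = distance + α · dissimilarity`. The `Track` record, the function names, and the treatment of zero-length embeddings are likewise illustrative, not the thesis code.

```python
from dataclasses import dataclass
from math import dist, sqrt


@dataclass
class Track:
    instance_id: int
    centre: tuple      # (x, y, z) in metres
    embedding: tuple   # mean feature vector of the cluster's points
    last_seen: int     # index of the frame in which the track was last matched


def mean_embedding(features):
    """Average the per-point feature vectors of one cluster."""
    n = len(features)
    return tuple(sum(f[k] for f in features) / n
                 for k in range(len(features[0])))


def cosine_dissimilarity(a, b):
    """1 - cosine similarity; 1.0 if either vector has zero length (assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0.0 else 1.0 - dot / norm


def pair_cost(track, centre, embedding, alpha=0.1):
    """Return (cost, distance, dissimilarity) for one track/cluster pair."""
    d = dist(track.centre, centre)
    s = cosine_dissimilarity(track.embedding, embedding)
    return d + alpha * s, d, s         # assumed weighting, see lead-in


def passes_gate(d, s, max_dist=4.5, max_dissim=0.4):
    """A match is accepted only if both thresholds pass."""
    return d < max_dist and s < max_dissim


def prune_window(tracks, frame_idx, window=6):
    """Keep only tracks seen within the last `window` frames."""
    return [t for t in tracks if frame_idx - t.last_seen <= window]


track = Track(1, (0.0, 0.0, 0.0), (1.0, 0.0), last_seen=5)
cost, d, s = pair_cost(track, (1.0, 0.0, 0.0), (1.0, 0.1))
accepted = passes_gate(d, s)
```

The costs for all track/cluster pairs fill the matrix handed to Hungarian matching; the gate is applied to each matched pair afterwards, and the window bounds how long an occluded object can be re-identified.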

Long-term + Scene Flow

Update previous-frame cluster centres with mean scene flow before matching
Otherwise identical to Long-term window

No measurable gain over the long-term variant alone: LSTQ and S_asc are identical on both validation sets, which suggests that the motion cue adds little once the embedding cost is in place.

Association pipeline diagram: object discovery (clustering, median, cosine similarity) feeding into instance ID assignment (cost matrix, Hungarian matching, object cache)

Full association pipeline — object discovery (left) feeds cluster centres and embeddings into the instance ID assignment module (right)

SemanticKITTI

FlowSeg4D on SemanticKITTI — semantic (top) and panoptic (bottom), 10× speed

Method progression

DBSCAN gives the best results on SemanticKITTI; HDBSCAN on nuScenes. S_cls is fixed by the semantic model and does not change between association variants.

SemanticKITTI — validation

Method	LSTQ	S_asc	S_cls
Naive	42.1	31.8	55.8
+ Scene Flow	44.7	35.9	55.8
+ Long Window	46.9	39.5	55.8
+ LW + Scene Flow	46.9	39.5	55.8

nuScenes — validation

Method	LSTQ	S_asc	S_cls
Naive	50.4	37.0	68.7
+ Scene Flow	47.8	33.2	68.7
+ Long Window	52.2	39.7	68.7
+ LW + Scene Flow	52.2	39.7	68.7
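
LSTQ is the geometric mean of the association score and the classification score, LSTQ = sqrt(S_asc × S_cls), so with S_cls fixed every change in LSTQ comes from S_asc. The short check below recomputes LSTQ from the S_asc and S_cls values in the two tables above; because those inputs are themselves rounded to one decimal, the results agree with the reported LSTQ only to within about 0.1.

```python
from math import sqrt


def lstq(s_assoc, s_cls):
    """Geometric mean of association and classification scores."""
    return sqrt(s_assoc * s_cls)


# (dataset, method, reported LSTQ, S_asc, S_cls), copied from the tables.
REPORTED = [
    ("SemanticKITTI", "Naive",             42.1, 31.8, 55.8),
    ("SemanticKITTI", "+ Scene Flow",      44.7, 35.9, 55.8),
    ("SemanticKITTI", "+ Long Window",     46.9, 39.5, 55.8),
    ("SemanticKITTI", "+ LW + Scene Flow", 46.9, 39.5, 55.8),
    ("nuScenes",      "Naive",             50.4, 37.0, 68.7),
    ("nuScenes",      "+ Scene Flow",      47.8, 33.2, 68.7),
    ("nuScenes",      "+ Long Window",     52.2, 39.7, 68.7),
    ("nuScenes",      "+ LW + Scene Flow", 52.2, 39.7, 68.7),
]

# Largest gap between the recomputed and the reported LSTQ.
max_gap = max(abs(lstq(a, c) - reported) for _, _, reported, a, c in REPORTED)
```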

Comparison with state of the art

All supervised baselines are trained with full point-level annotations. FlowSeg4D (marked ✓) uses no labels.

SemanticKITTI — validation

Method	Unsup.	LSTQ	S_asc	S_cls
4D-PLS ¹		62.7	65.1	60.5
4D-StOP ²		67.0	74.4	60.3
Mask4D ³		71.4	75.4	67.5
Mask4Former ⁴		70.5	74.3	66.9
4D-Former ⁵		73.9	80.9	67.6
Ours (LW)	✓	46.9	39.5	55.8

SemanticKITTI — test

Method	Unsup.	LSTQ	S_asc	S_cls
4D-PLS ¹		56.9	56.4	57.4
CIA ⁶		63.1	65.7	60.6
4D-StOP ²		63.9	69.5	58.8
Mask4D ³		64.3	66.4	62.2
Mask4Former ⁴		68.4	67.3	69.6
Ours (LW+SF)	✓	39.3	29.9	51.5

¹ Aygun et al., 4D Panoptic LiDAR Segmentation, CVPR 2021
² Kreuzberg et al., 4D-StOP: Panoptic Segmentation of 4D LiDAR Using Spatio-Temporal Object Proposal Generation and Aggregation, ECCV Workshops 2023
³ Marcuzzi et al., Mask4D: End-to-End Mask-Based 4D Panoptic Segmentation for LiDAR Sequences, RA-L 2023
⁴ Yilmaz et al., Mask4Former: Mask Transformer for 4D Panoptic Segmentation, ICRA 2024
⁵ Athar et al., 4D-Former: Multimodal 4D Panoptic Segmentation, CoRL 2023
⁶ Marcuzzi et al., Contrastive Instance Association for 4D Panoptic Segmentation Using Sequences of 3D LiDAR Scans, RA-L 2022

Qualitative results

SemanticKITTI semantic segmentation: ground truth (top) vs WaffleIron prediction (bottom)

Semantic segmentation — ground truth (top row) vs WaffleIron linear probing (bottom row) across 5 frames

Instance segmentation comparison: ALPINE, DBSCAN, HDBSCAN across frames

Instance association (long-term + scene flow): ALPINE (row 1), DBSCAN (rows 2–3), HDBSCAN (row 4) across 5 frames. Consistent colour = consistent instance identity.

Temporal instance consistency

10 consecutive frames — each coloured blob is a tracked instance; consistent colour across frames indicates correct temporal association.

10-frame grid showing temporally consistent instance tracking on SemanticKITTI

Failure cases

The three clustering methods each exhibit characteristic failure modes. ALPINE (row 1) over-segments; DBSCAN (row 2) merges nearby clusters; HDBSCAN (row 3) struggles with sparse objects.

Cross-dataset generalisation — PONE

FlowSeg4D was applied to the PONE dataset with no retraining or fine-tuning, using models pretrained on SemanticKITTI and nuScenes. Both semantic and instance outputs transfer without adaptation.

FlowSeg4D on PONE — semantic (left) and panoptic (right), no retraining

PONE dataset semantic segmentation: WaffleIron SemanticKITTI model (top) vs nuScenes model (bottom)

Semantic segmentation on PONE — SemanticKITTI model (top row) vs nuScenes model (bottom row) across 3 frames

PONE dataset instance segmentation: long-term association with ALPINE, DBSCAN, HDBSCAN

Instance association (long-term) on PONE — ALPINE (row 1), DBSCAN (row 2), HDBSCAN (row 3)

Master's thesis at CTU FEE, 2025.
