StreetTree: A Large-Scale Global Benchmark for Fine-Grained Tree Species Classification

Jiapeng Li, Yingjing Huang, Fan Zhang, Yu Liu
Peking University · University of Vienna
StreetTree overview

StreetTree introduces the first large-scale global benchmark for fine-grained street tree species classification, with more than 12 million images and over 8,300 species collected from 133 countries.

Key Contributions

Global Coverage

12,235,152 street-view images from 133 countries across five continents.

Hierarchical Taxonomy

Four-level labels: 71 orders, 241 families, 1,747 genera, and 8,363 species.

Individual-Tree Scale

3,365,485 individual trees linked with geolocation and temporal records.

Seasonal Signals

Season labels capture phenological changes and intra-class appearance variation.

Temporal Continuity

Long-term observations from 2015 to 2025 support longitudinal analysis.
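
Taken together, these contributions mean each sample carries a four-level taxonomic label plus spatial and temporal metadata. A minimal sketch of how such a record might be represented (the field names and example values are illustrative assumptions, not the dataset's actual schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TreeRecord:
    """Hypothetical per-image record; field names are illustrative."""
    image_path: str
    order: str      # e.g. "Sapindales"
    family: str     # e.g. "Sapindaceae"
    genus: str      # e.g. "Acer"
    species: str    # e.g. "Acer platanoides"
    lat: float      # geolocation of the individual tree
    lon: float
    year: int       # within the 2015-2025 observation window
    season: str     # phenological label, e.g. "summer"

rec = TreeRecord("img_000001.jpg", "Sapindales", "Sapindaceae",
                 "Acer", "Acer platanoides", 48.21, 16.37, 2021, "summer")
# the four-level label used for hierarchical classification
hierarchy = (rec.order, rec.family, rec.genus, rec.species)
```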

Abstract

The fine-grained classification of street trees is a crucial task for urban planning, streetscape management, and the assessment of urban ecosystem services. However, progress in this field has been significantly hindered by the lack of large-scale, geographically diverse, and publicly available benchmark datasets specifically designed for street trees. To address this critical gap, we introduce StreetTree, the world's first large-scale benchmark dataset dedicated to fine-grained street tree classification. The dataset contains over 12 million images covering more than 8,300 common street tree species, collected from urban streetscapes across 133 countries spanning five continents, and supplemented with expert-verified observational data. StreetTree poses substantial challenges for pretrained vision models in complex urban environments: high inter-species visual similarity, long-tailed natural distributions, significant intra-class variation caused by seasonal changes, and diverse imaging conditions such as lighting, occlusion from buildings, and varying camera angles. In addition, we provide a hierarchical taxonomy (order-family-genus-species) to support research in hierarchical classification and representation learning. Through extensive experiments with various visual models, we establish strong baselines and reveal the limitations of existing methods in handling such real-world complexities.

Dataset Statistics

StreetTree dataset statistics

The dataset follows a naturally long-tailed distribution at order, family, genus, and species levels, while maintaining meaningful seasonal coverage and long-term observation depth.
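
Evaluation subsets such as frequent, common, and rare species follow directly from this long-tailed count distribution. A minimal count-based split might look as follows (the thresholds here are illustrative assumptions, not the benchmark's actual cut-offs):

```python
from collections import Counter

def split_by_frequency(labels, frequent_min=100, common_min=20):
    """Partition classes into frequent/common/rare by training-image count.
    Threshold values are illustrative, not the benchmark's definition."""
    counts = Counter(labels)
    frequent = {c for c, n in counts.items() if n >= frequent_min}
    common = {c for c, n in counts.items() if common_min <= n < frequent_min}
    rare = {c for c, n in counts.items() if n < common_min}
    return frequent, common, rare

# toy usage on a small long-tailed label list
labels = ["Acer"] * 120 + ["Tilia"] * 40 + ["Ginkgo"] * 3
freq, com, rare = split_by_frequency(labels)
```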

Data Construction and Filtering Pipeline

We organize data processing into construction and quality filtering stages. The pipeline view and binary filtering examples are shown below.

StreetTree data construction and preprocessing workflow.

Two-stage binary filtering for noisy sample removal.
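
The two filtering stages can be thought of as successive binary keep/discard decisions. A minimal sketch, with `has_visible_tree` and `passes_quality_check` standing in for the two binary classifiers (the predicate names and dict-based image stand-ins are assumptions, not the paper's models):

```python
def two_stage_filter(images, has_visible_tree, passes_quality_check):
    """Apply two binary filters in sequence and keep the survivors.
    The predicate arguments stand in for learned binary classifiers."""
    stage1 = [img for img in images if has_visible_tree(img)]
    stage2 = [img for img in stage1 if passes_quality_check(img)]
    return stage2

# toy usage with dicts standing in for image metadata
imgs = [{"id": 1, "tree": True, "sharp": True},
        {"id": 2, "tree": False, "sharp": True},
        {"id": 3, "tree": True, "sharp": False}]
kept = two_stage_filter(imgs, lambda i: i["tree"], lambda i: i["sharp"])
# only image 1 passes both stages
```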

Real-World Challenges

StreetTree highlights major real-world challenges for fine-grained classification, including high inter-species similarity, seasonal shifts, occlusion, illumination changes, and viewpoint variation.

High inter-species visual similarity.

Strong seasonal intra-class appearance variation.

Robustness under diverse illumination conditions.

Frequent occlusion in complex urban scenes.

Sensitivity to viewpoint and capture-angle changes.

Experiment Results

Model evaluation on StreetTree-12M (mean accuracy % over 5 random runs). Columns are grouped left-to-right into the Frequent, Common, and Rare species subsets; within each group the columns are Order / Family / Genus / Species, and each cell reports Top-1 accuracy with Top-5 in parentheses.

| Model | Order | Family | Genus | Species | Order | Family | Genus | Species | Order | Family | Genus | Species |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **0.1% of training set** | | | | | | | | | | | | |
| Fine-tuned ViT | 17.52 (52.31) | 13.26 (34.56) | 6.33 (20.16) | 1.81 (5.76) | 12.34 (38.11) | 9.47 (24.22) | 4.16 (12.35) | 1.43 (3.28) | 7.82 (25.33) | 5.14 (15.29) | 2.13 (7.41) | 1.16 (1.82) |
| CLIP | 20.35 (60.62) | 16.58 (41.25) | 5.21 (21.28) | 1.54 (5.56) | 14.28 (45.18) | 11.31 (30.12) | 3.14 (14.22) | 1.27 (3.18) | 9.17 (30.16) | 7.18 (18.11) | 1.56 (8.12) | 1.08 (1.52) |
| Fine-tuned CLIP | 20.06 (58.76) | 15.16 (39.84) | 6.08 (22.95) | 1.76 (5.82) | 13.82 (43.21) | 10.43 (28.17) | 4.07 (15.11) | 1.41 (3.59) | 8.87 (28.19) | 6.42 (16.21) | 2.18 (9.14) | 1.19 (1.97) |
| Fine-tuned SigLIP | 23.41 (63.26) | 18.28 (44.31) | 8.42 (25.18) | 1.52 (7.82) | 16.56 (48.11) | 12.87 (33.22) | 5.86 (18.81) | 1.93 (5.16) | 11.18 (33.65) | 8.41 (20.17) | 3.27 (11.18) | 1.46 (2.81) |
| Fine-tuned BioCLIP | 26.17 (66.42) | 21.18 (48.21) | 11.23 (28.14) | 2.87 (9.11) | 19.12 (52.19) | 15.18 (37.51) | 7.82 (21.18) | 1.81 (6.87) | 13.48 (37.11) | 10.56 (24.12) | 4.83 (14.16) | 1.87 (3.92) |
| **1% of training set** | | | | | | | | | | | | |
| Fine-tuned ViT | 21.85 (58.41) | 17.31 (42.49) | 10.45 (27.42) | 3.32 (12.05) | 15.48 (42.17) | 11.82 (29.51) | 6.81 (18.13) | 1.87 (7.82) | 10.72 (28.18) | 7.46 (18.68) | 4.13 (11.19) | 1.81 (4.13) |
| CLIP | 20.18 (60.45) | 16.52 (41.81) | 6.98 (24.42) | 1.78 (8.71) | 14.07 (44.18) | 11.17 (29.35) | 4.19 (16.43) | 0.93 (5.11) | 9.16 (29.78) | 6.82 (18.16) | 2.84 (9.81) | 1.38 (2.87) |
| Fine-tuned CLIP | 22.21 (61.22) | 16.65 (43.02) | 8.92 (26.18) | 3.98 (11.65) | 16.18 (45.11) | 11.52 (30.18) | 5.82 (17.81) | 1.72 (7.58) | 10.82 (30.34) | 7.18 (19.12) | 3.19 (10.82) | 1.86 (3.81) |
| Fine-tuned SigLIP | 25.28 (64.18) | 20.22 (47.48) | 12.31 (30.49) | 5.08 (14.83) | 18.33 (48.68) | 14.07 (34.47) | 8.16 (21.84) | 2.82 (9.98) | 12.16 (33.67) | 9.19 (22.53) | 5.33 (13.07) | 1.48 (5.92) |
| Fine-tuned BioCLIP | 28.17 (67.14) | 23.94 (51.26) | 15.81 (34.01) | 7.40 (17.63) | 21.04 (52.35) | 16.32 (38.44) | 10.66 (25.55) | 4.56 (12.81) | 15.67 (37.93) | 11.64 (26.30) | 6.58 (16.22) | 2.50 (7.02) |
| **10% of training set** | | | | | | | | | | | | |
| Fine-tuned ViT | 30.71 (69.45) | 24.82 (53.05) | 17.25 (38.28) | 10.85 (24.95) | 22.41 (52.18) | 17.08 (38.37) | 11.39 (26.33) | 6.41 (16.23) | 15.16 (36.24) | 11.05 (25.36) | 6.82 (16.30) | 3.18 (9.76) |
| CLIP | 22.01 (62.31) | 17.52 (44.38) | 8.68 (27.14) | 4.32 (13.98) | 15.18 (45.12) | 12.07 (31.18) | 5.14 (18.27) | 2.76 (8.54) | 10.93 (30.25) | 7.82 (20.40) | 2.86 (11.01) | 1.82 (4.31) |
| Fine-tuned CLIP | 27.88 (67.22) | 22.05 (50.25) | 14.72 (35.81) | 10.35 (23.75) | 19.34 (49.30) | 15.38 (36.18) | 9.12 (24.16) | 5.46 (15.80) | 12.86 (34.35) | 9.82 (23.87) | 5.05 (14.77) | 2.82 (8.44) |
| Fine-tuned SigLIP | 34.76 (72.36) | 28.16 (56.80) | 20.52 (42.00) | 13.99 (28.32) | 25.76 (55.43) | 20.67 (41.06) | 14.91 (30.47) | 8.01 (19.24) | 18.35 (40.14) | 14.16 (28.64) | 9.02 (19.82) | 4.67 (11.49) |
| Fine-tuned BioCLIP | 38.16 (76.96) | 32.31 (60.24) | 24.85 (47.70) | 16.13 (32.86) | 29.23 (59.27) | 24.02 (45.37) | 17.14 (34.82) | 11.31 (23.38) | 21.32 (44.06) | 17.62 (32.82) | 11.31 (23.50) | 6.19 (14.57) |
| **100% of training set** | | | | | | | | | | | | |
| Fine-tuned ViT | 41.25 (75.46) | 33.15 (60.25) | 25.60 (45.84) | 17.16 (31.77) | 28.09 (58.24) | 22.51 (44.87) | 16.64 (31.23) | 9.32 (21.62) | 20.85 (42.88) | 15.16 (31.10) | 10.59 (21.22) | 5.14 (13.79) |
| CLIP | 24.99 (66.21) | 20.11 (48.13) | 11.88 (31.56) | 8.67 (17.16) | 18.25 (49.12) | 14.24 (34.81) | 7.43 (21.66) | 3.48 (11.58) | 12.40 (33.02) | 9.43 (23.15) | 4.07 (14.62) | 1.28 (6.13) |
| Fine-tuned CLIP | 42.31 (78.77) | 35.16 (64.83) | 28.25 (50.34) | 23.77 (36.02) | 31.03 (61.81) | 25.27 (48.05) | 18.30 (35.16) | 11.55 (24.79) | 23.35 (45.50) | 18.94 (35.54) | 12.25 (24.12) | 6.16 (15.27) |
| Fine-tuned SigLIP | 42.92 (78.06) | 36.86 (66.43) | 31.41 (57.05) | 28.42 (48.47) | 39.19 (73.52) | 31.34 (60.94) | 25.73 (50.33) | 23.17 (41.85) | 37.51 (71.59) | 23.30 (59.76) | 23.43 (46.63) | 18.29 (38.37) |
| Fine-tuned BioCLIP | 46.41 (81.12) | 40.30 (69.00) | 37.70 (63.28) | 30.26 (52.07) | 41.67 (78.09) | 38.69 (68.81) | 30.96 (52.63) | 27.15 (46.01) | 40.42 (77.20) | 31.64 (64.12) | 28.27 (48.95) | 20.60 (41.76) |
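
The per-level accuracies reported here can be computed independently at each taxonomic level with a standard top-k accuracy routine. A minimal pure-Python sketch (the toy scores and label encodings are hypothetical):

```python
def topk_accuracy(scores, targets, k=1):
    """Fraction of samples whose true label is among the k highest scores.
    scores: list of per-class score lists; targets: integer class labels."""
    hits = 0
    for row, t in zip(scores, targets):
        # indices of the k highest-scoring classes for this sample
        topk = sorted(range(len(row)), key=row.__getitem__, reverse=True)[:k]
        hits += t in topk
    return hits / len(targets)

# toy usage: 2 samples, 3 classes at some taxonomic level
scores = [[0.1, 0.2, 0.7],
          [0.5, 0.4, 0.1]]
targets = [2, 1]
top1 = topk_accuracy(scores, targets, k=1)  # 0.5
top2 = topk_accuracy(scores, targets, k=2)  # 1.0
```

Running the same routine four times, once per level (order, family, genus, species), yields one accuracy column per level as in the tables.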

Overall model evaluation (mean accuracy % ± standard deviation %). The first four columns report Top-1 accuracy per taxonomic level, the last four Top-5.

| Model | Top-1 Order | Top-1 Family | Top-1 Genus | Top-1 Species | Top-5 Order | Top-5 Family | Top-5 Genus | Top-5 Species |
|---|---|---|---|---|---|---|---|---|
| **0.1% of training set** | | | | | | | | |
| Fine-tuned ViT | 16.37 ± 0.58 | 12.39 ± 0.30 | 5.84 ± 0.88 | 1.73 ± 0.22 | 49.15 ± 1.24 | 32.27 ± 0.22 | 18.49 ± 0.64 | 5.23 ± 0.55 |
| CLIP | 19.01 ± 0.28 | 15.43 ± 1.12 | 4.76 ± 1.28 | 1.48 ± 0.56 | 57.15 ± 0.70 | 38.71 ± 1.03 | 19.72 ± 1.46 | 5.05 ± 0.89 |
| Fine-tuned CLIP | 18.69 ± 0.44 | 14.12 ± 1.37 | 5.63 ± 0.19 | 1.69 ± 0.42 | 55.27 ± 0.66 | 37.20 ± 2.34 | 21.24 ± 0.87 | 5.34 ± 0.54 |
| Fine-tuned SigLIP | 21.91 ± 1.55 | 17.09 ± 0.18 | 7.84 ± 1.42 | 1.58 ± 0.67 | 59.86 ± 0.58 | 41.74 ± 1.46 | 23.70 ± 1.12 | 7.23 ± 0.69 |
| Fine-tuned BioCLIP | 24.62 ± 0.79 | 19.87 ± 0.94 | 10.47 ± 0.38 | 2.67 ± 0.91 | 63.18 ± 0.85 | 45.71 ± 1.06 | 26.57 ± 1.32 | 8.58 ± 0.44 |
| **1% of training set** | | | | | | | | |
| Fine-tuned ViT | 20.46 ± 0.54 | 16.11 ± 0.46 | 9.66 ± 0.67 | 3.04 ± 0.27 | 54.82 ± 0.92 | 39.63 ± 0.74 | 25.40 ± 0.42 | 11.11 ± 0.65 |
| CLIP | 18.84 ± 0.33 | 15.34 ± 1.02 | 6.40 ± 0.63 | 1.63 ± 0.82 | 56.84 ± 0.64 | 39.04 ± 1.35 | 22.66 ± 1.82 | 7.94 ± 0.72 |
| Fine-tuned CLIP | 20.87 ± 0.66 | 15.52 ± 0.70 | 8.24 ± 0.85 | 3.55 ± 0.36 | 57.63 ± 0.62 | 40.18 ± 0.48 | 24.33 ± 0.34 | 10.74 ± 1.32 |
| Fine-tuned SigLIP | 23.74 ± 0.61 | 18.87 ± 0.78 | 11.42 ± 0.49 | 4.60 ± 0.85 | 60.70 ± 1.03 | 44.58 ± 2.31 | 28.53 ± 1.04 | 13.76 ± 0.79 |
| Fine-tuned BioCLIP | 26.62 ± 1.22 | 22.31 ± 1.32 | 14.68 ± 0.44 | 6.78 ± 0.96 | 63.81 ± 0.76 | 48.39 ± 0.68 | 32.07 ± 1.18 | 16.51 ± 0.58 |
| **10% of training set** | | | | | | | | |
| Fine-tuned ViT | 28.87 ± 0.42 | 23.13 ± 1.57 | 15.97 ± 0.68 | 9.89 ± 0.32 | 65.59 ± 0.80 | 49.79 ± 0.56 | 35.64 ± 1.52 | 23.05 ± 0.85 |
| CLIP | 20.55 ± 0.44 | 16.33 ± 1.14 | 7.92 ± 0.76 | 3.99 ± 0.62 | 58.50 ± 0.32 | 41.48 ± 1.33 | 25.19 ± 0.38 | 12.79 ± 0.36 |
| Fine-tuned CLIP | 26.02 ± 0.55 | 20.58 ± 0.49 | 13.50 ± 0.84 | 9.32 ± 0.31 | 63.27 ± 2.33 | 47.13 ± 0.91 | 33.25 ± 1.55 | 21.97 ± 0.68 |
| Fine-tuned SigLIP | 32.78 ± 0.76 | 26.50 ± 1.63 | 19.24 ± 1.20 | 12.72 ± 0.85 | 68.59 ± 0.79 | 53.35 ± 1.43 | 39.42 ± 0.76 | 26.31 ± 1.31 |
| Fine-tuned BioCLIP | 36.18 ± 1.34 | 30.50 ± 1.06 | 23.17 ± 0.84 | 15.03 ± 0.61 | 73.05 ± 1.42 | 56.96 ± 0.76 | 44.84 ± 0.39 | 30.74 ± 0.57 |
| **100% of training set** | | | | | | | | |
| Fine-tuned ViT | 38.47 | 30.85 | 23.67 | 15.51 | 71.63 | 56.83 | 42.69 | 29.55 |
| CLIP | 23.50 | 18.82 | 10.91 | 7.59 | 62.38 | 45.18 | 29.42 | 15.90 |
| Fine-tuned CLIP | 39.88 | 33.04 | 26.13 | 21.23 | 74.96 | 61.18 | 47.05 | 33.54 |
| Fine-tuned SigLIP | 42.14 | 35.53 | 30.24 | 27.25 | 77.12 | 65.33 | 55.63 | 47.08 |
| Fine-tuned BioCLIP | 45.45 | 39.76 | 36.31 | 29.45 | 80.50 | 68.81 | 61.10 | 50.76 |

BibTeX

@article{li2026streettree,
      title={StreetTree: A Large-Scale Global Benchmark for Fine-Grained Tree Species Classification},
      author={Li, Jiapeng and Huang, Yingjing and Zhang, Fan and Liu, Yu},
      journal={arXiv preprint arXiv:2602.19123},
      year={2026},
      url={https://arxiv.org/abs/2602.19123}
}