GeoINF: A Geospatial Multimodal Vision Foundation Model for Overhead and Ground-Level Images
Topics:
Keywords: Vision Foundation Model; Multimodal; Geospatial Artificial Intelligence; Remote Sensing; Street View Image; Few-shot Learning
Abstract Type: Paper Abstract
Authors:
Zeping Liu, The University of Texas at Austin
Gengchen Mai, The University of Texas at Austin
Abstract
The recent success of remote sensing vision foundation models (RSVFMs) such as SatMAE and SkySense reveals immense potential for generic geospatial foundation models. However, these models focus solely on overhead remote sensing data and neglect ground-level street view imagery, an important complementary data source; this limitation restricts the variety of tasks they can tackle. To bridge this gap, we propose GeoINF, a geospatial multimodal foundation model pretrained on a curated geospatial multimodal dataset of nearly 10 million samples that combines overhead and ground-level imagery. The model incorporates a multimodal spatiotemporal encoder that takes temporal sequences of overhead remote sensing images and street view data as input. The remote sensing encoder, adopted from existing RSVFMs, provides overhead geospatial features, while the street view encoder captures the ground-level view of the same location. By leveraging the geospatial alignment between overhead and street view images, we develop an implicit neural function (INF) that extracts the geospatial visual feature of a specific location from overhead imagery and contrasts it with the features of the colocated street view image. Experimental results demonstrate that the proposed model effectively captures generic geospatial features across data modalities and outperforms previous state-of-the-art models on various street view imagery tasks.
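The core of this pretraining objective is the contrastive alignment between an INF-queried overhead feature and the colocated street view feature. The following is a minimal, hypothetical PyTorch sketch of such an objective; the class and function names, the coordinate embedding, and the symmetric InfoNCE formulation are illustrative assumptions rather than the authors' released implementation.

# Hypothetical sketch of GeoINF-style overhead/ground-level contrastive alignment.
# All names (ImplicitNeuralFunction, info_nce) are illustrative, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImplicitNeuralFunction(nn.Module):
    """Maps a query location (lon, lat) plus overhead patch features from an RSVFM
    to a location-specific geospatial feature vector."""
    def __init__(self, feat_dim=768, coord_dim=64):
        super().__init__()
        self.coord_embed = nn.Linear(2, coord_dim)      # simple coordinate embedding (assumed)
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + coord_dim, feat_dim),
            nn.GELU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, overhead_feats, coords):
        # overhead_feats: (B, N, D) patch tokens from the overhead encoder
        # coords:         (B, 2) normalized (lon, lat) of the street view location
        pooled = overhead_feats.mean(dim=1)             # (B, D) global overhead context
        query = self.coord_embed(coords)                # (B, C) location query
        return self.mlp(torch.cat([pooled, query], dim=-1))

def info_nce(overhead_emb, street_emb, temperature=0.07):
    """Symmetric InfoNCE loss over colocated overhead/street-view pairs in a batch."""
    z_o = F.normalize(overhead_emb, dim=-1)
    z_s = F.normalize(street_emb, dim=-1)
    logits = z_o @ z_s.t() / temperature                # (B, B) pairwise similarities
    targets = torch.arange(z_o.size(0), device=z_o.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage with random stand-in features:
B, N, D = 8, 196, 768
overhead_tokens = torch.randn(B, N, D)                  # stand-in for RSVFM patch features
street_features = torch.randn(B, D)                     # stand-in for street view encoder output
coords = torch.rand(B, 2)                               # normalized (lon, lat)
inf = ImplicitNeuralFunction(feat_dim=D)
loss = info_nce(inf(overhead_tokens, coords), street_features)

In this sketch, positive pairs are the overhead feature queried at a street view image's location and that image's own embedding, while other images in the batch serve as negatives.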
Submitted by:
Zeping Liu, The University of Texas at Austin
zeping.liu@utexas.edu