Decoding the Visual Pattern Language of Urban Scenes: A Cross-Modal Approach Using Deep Learning and CLIP
Keywords: CLIP; street view analysis; pattern language; cross-modal
Abstract Type: Paper Abstract
Authors:
Song Guo, School of Architecture, Tsinghua University
Abstract
Urban visual scenes are the result of complex interactions among cultural, social, and economic factors. However, analyzing the structural components of these intricate scenes remains challenging. Previous studies have used computer vision techniques to derive high- and low-dimensional feature representations of urban visual scenes, enabling clustering and classification based on those features. A structured representation and analysis of the patterns underlying visual scene composition, however, is still lacking.
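As context for this feature-based prior work, the sketch below embeds street view images with a pretrained ResNet-50 backbone and groups them with k-means, a minimal stand-in for the clustering workflows the abstract refers to; the checkpoint, file names, and cluster count are illustrative assumptions, not details from any cited study.

```python
import torch
from PIL import Image
from sklearn.cluster import KMeans
from torchvision import models, transforms

# Pretrained ResNet-50 with the classification head removed,
# so each image maps to a 2048-d feature vector.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Hypothetical street view image paths.
paths = ["sv_001.jpg", "sv_002.jpg", "sv_003.jpg"]
with torch.no_grad():
    feats = torch.stack([
        backbone(preprocess(Image.open(p).convert("RGB")).unsqueeze(0)).squeeze(0)
        for p in paths
    ])

# Group scenes by visual similarity (the cluster count is an assumption).
labels = KMeans(n_clusters=2, n_init="auto", random_state=0).fit_predict(feats.numpy())
print(labels)
```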
Christopher Alexander introduced the concept of “Pattern Language” in architecture, which deconstructs architectural layout into an organized and coherent set of patterns. For urban visual scenes, which involve more elements and larger scales, the question arises: if visual scenes are analogous to textual structures, what are the visual vocabulary, syntax, and grammar of these scenes? How are the various visual elements organized into recognizable patterns, and what constitutes the “visual pattern language” of cities?
Our approach integrates vector representations of urban street view images, obtained through deep learning models, with textual descriptions of the same images generated and vectorized via large language models. Using the CLIP model, we align visual fragments with textual components to identify the vocabulary and structural rules embedded in urban visual scenes.
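A minimal sketch of this kind of cross-modal matching, using the publicly released openai/clip-vit-base-patch32 checkpoint through Hugging Face transformers; the image path and candidate phrases are hypothetical, and the abstract does not specify which CLIP variant or description pipeline is actually used.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Public CLIP checkpoint (an assumption; the abstract names no variant).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Hypothetical street view crop and candidate "visual vocabulary" phrases,
# e.g. produced by a large language model describing the scene.
image = Image.open("street_view_fragment.jpg").convert("RGB")
phrases = [
    "a tree-lined sidewalk with street furniture",
    "a ground-floor commercial storefront",
    "an elevated highway overpass",
]

inputs = processor(text=phrases, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity logits; softmax turns them into a
# distribution over the candidate pattern phrases.
probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
for phrase, p in zip(phrases, probs.tolist()):
    print(f"{p:.3f}  {phrase}")
```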
This approach not only deconstructs the principles underlying urban design but also enriches visual scenes with high-dimensional semantic information, enhancing their interpretability. The findings provide a foundation for analyses such as urban visual profiling, supporting a deeper understanding of the imageability of cities.
Submitted by: Song Guo, Tsinghua University (guos22@mails.tsinghua.edu.cn)