SilLang: Improving Gait Recognition with Silhouette Language Encoding

📅 2026-03-25
🤖 AI Summary
This work addresses a limitation of existing gait recognition methods: they overlook the discrete nature of binary silhouette sequences, which share a discrete encoding space with natural language, and so miss a way to model temporal motion patterns. To bridge this gap, the authors propose a dual-branch framework built around a Contour-Velocity Tokenizer that maps binary gait silhouettes into a discrete token space better aligned with text tokens. A large language model (LLM) then extracts linguistic embeddings from these tokens, which are fused with the visual silhouette features. The approach is presented as the first application of LLMs' discrete sequence modeling capability to gait recognition. Implemented on mainstream gait backbones, it consistently improves state-of-the-art methods on the SUSTech1K, GREW, and Gait3D benchmarks, supporting cross-modal discrete alignment as a route to better gait representation learning.

📝 Abstract
Gait silhouettes, which can be encoded into binary gait codes, are widely adopted to represent the motion patterns of pedestrians. Recent approaches commonly leverage visual backbones to encode gait silhouettes, achieving strong performance. However, they primarily focus on continuous visual features, overlooking the discrete nature of binary silhouettes, which inherently share a discrete encoding space with natural language. Large Language Models (LLMs) have demonstrated exceptional capability in extracting discriminative features from discrete sequences and modeling long-range dependencies, highlighting their potential to capture temporal motion patterns by identifying subtle variations. Motivated by these observations, we explore bridging binary gait silhouettes and natural language within a binary encoding space. However, the encoding spaces of text tokens and binary gait silhouettes remain misaligned, primarily due to differences in token frequency and density. To address this issue, we propose the Contour-Velocity Tokenizer, which encodes binary gait silhouettes while reshaping their distribution to better align with the text token space. We then establish a dual-branch framework termed Silhouette Language Model (SilLang), which enhances visual silhouettes by integrating discrete linguistic embeddings derived from LLMs. Implemented on mainstream gait backbones, SilLang consistently improves state-of-the-art methods across SUSTech1K, GREW, and Gait3D.
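
To make the idea of "binary gait codes" concrete, here is a minimal sketch of one naive way to turn a binary silhouette into discrete integer codes, by packing fixed-width runs of pixels into integers. This is not the Contour-Velocity Tokenizer, whose details the abstract does not give. The function name `silhouette_to_codes`, the `bits` parameter, and the row-wise packing scheme are all assumptions made for illustration. The toy example also hints at the frequency mismatch the abstract mentions: background regions make a single code (0) dominate.

```python
import numpy as np


def silhouette_to_codes(sil: np.ndarray, bits: int = 8) -> np.ndarray:
    """Pack each horizontal run of `bits` binary pixels into one integer code.

    Hypothetical helper for illustration only; not the paper's tokenizer.
    """
    sil = np.asarray(sil, dtype=np.uint8)
    height, width = sil.shape
    if width % bits != 0:
        raise ValueError("silhouette width must be divisible by `bits`")
    # Group pixels row by row: (height, width // bits, bits).
    groups = sil.reshape(height, width // bits, bits)
    # Most significant bit first, so [0, 0, 1, 1] -> 3.
    weights = 2 ** np.arange(bits - 1, -1, -1)
    return (groups * weights).sum(axis=-1)


# Toy 4x8 silhouette: a small filled rectangle on an empty background.
sil = np.zeros((4, 8), dtype=np.uint8)
sil[1:3, 2:6] = 1

codes = silhouette_to_codes(sil, bits=4)

# Code frequencies over the 2**4 possible values; background code 0 dominates.
counts = np.bincount(codes.ravel(), minlength=2 ** 4)
print(codes)
print(counts)
```

In this toy case half of the emitted codes are the all-background code 0, a much more skewed distribution than typical text tokens, which is the kind of misalignment in token frequency and density that the proposed tokenizer is meant to reduce.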
Problem

Research questions and friction points this paper is trying to address.

gait recognition
binary gait silhouettes
discrete encoding
language modeling
feature alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

gait recognition
binary silhouette encoding
large language models
discrete token alignment
Contour-Velocity Tokenizer
Authors

Ruiyi Zhan
Beihang University
Guozhen Peng
Beihang University
Canyu Chen
CS Ph.D. at Northwestern | Visiting Researcher at UC Berkeley
Foundation Agent, Trustworthiness, Multimodality
Jian Lei
Tsinghua University
Annan Li
Beihang University