🤖 AI Summary
Existing studies employ large language models (LLMs) as semantic feature extractors for sequential recommendation systems (SRS), but suffer from highly heterogeneous prompt design, architectural choices, and adaptation strategies—hindering fair attribution of design factors. To address this, we propose RecXplore, the first modular analytical framework that decouples LLM-driven sequential recommendation into four independently evaluable components: data processing, feature extraction, feature adaptation, and sequence modeling. This enables standardized ablation studies and systematic discovery of effective design patterns. Evaluated on four public benchmark datasets, RecXplore achieves end-to-end improvements of +18.7% in NDCG@5 and +12.7% in HR@5 over strong baselines by composing state-of-the-art modules. Our core contribution is establishing the first decomposable, reproducible, and comparable analytical paradigm for LLM-based feature extraction in sequential recommendation.
📝 Abstract
Using Large Language Models (LLMs) to generate semantic features has been demonstrated as a powerful paradigm for enhancing Sequential Recommender Systems (SRS). This typically involves three stages: processing item text, extracting features with LLMs, and adapting them for downstream models. However, existing methods vary widely in prompting, architecture, and adaptation strategies, making it difficult to fairly compare design choices and identify what truly drives performance. In this work, we propose RecXplore, a modular analytical framework that decomposes the LLM-as-feature-extractor pipeline into four modules: data processing, semantic feature extraction, feature adaptation, and sequential modeling. Instead of proposing new techniques, RecXplore revisits and organizes established methods, enabling systematic exploration of each module in isolation. Experiments on four public datasets show that simply combining the best designs from existing techniques without exhaustive search yields up to 18.7% relative improvement in NDCG@5 and 12.7% in HR@5 over strong baselines. These results underscore the utility of modular benchmarking for identifying effective design patterns and promoting standardized research in LLM-enhanced recommendation.