To address the challenges of fine-grained force control and unintuitive human–robot interaction in dexterous robotic manipulation, this paper proposes a semantics-driven bilateral force modulation framework. We jointly model, for the first time, natural language instructions (e.g., “gently grasp the cup”) and bilateral teleoperation force/motion signals using a multimodal Transformer architecture that integrates the SigLIP language encoder, action tokenization, and fused perception of joint position, velocity, and torque. Our key contributions are: (i) the first end-to-end mapping from linguistic intent to force-level control, enabling real-time, interpretable force modulation, including in bimanual settings; and (ii) empirical validation on single-hand cup-stacking and dual-hand sponge-squeezing tasks, in which multi-level force instructions are accurately reproduced. In these experiments, using SigLIP as the language encoder significantly improves language–force alignment accuracy, supporting the efficacy of semantics-guided imitation learning for force control.