Exploring the encoding of linguistic representations in the Fully-Connected Layer of generative CNNs for Speech

📅 2025-01-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the poorly understood linguistic representation mechanism within fully connected (FC) layers of generative CNNs for speech synthesis. We systematically investigate how FC layers encode phonological and lexical-level linguistic information, a question previously unexplored in this context. To this end, we propose two intervention strategies: (i) targeted injection into the FC weight matrix and (ii) explicit manipulation of FC activations, complemented by analysis of the temporal structure of the weights, feature-map back-injection, symbolic intervention, and distributional modeling. Our findings reveal that FC layers implicitly encode subword-level phonological constraints that are invariant across lexical items. Critically, we demonstrate, for the first time, that precise synthesis of target phonemes can be achieved solely via FC-layer manipulation, confirming that the layer carries a structured, interpretable linguistic encoding. These results establish a novel paradigm for interpretable modeling of generative CNNs and enable fine-grained controllability in neural speech synthesis.

📝 Abstract
Interpretability work on the convolutional layers of CNNs has primarily focused on computer vision, but some studies also explore correspondences between the latent space and the output in the audio domain. However, it has not been thoroughly examined how acoustic and linguistic information is represented in the fully connected (FC) layer that bridges the latent space and the convolutional layers. The current study presents the first exploration of how the FC layer of CNNs for speech synthesis encodes linguistically relevant information. We propose two techniques for exploring the FC layer. In Experiment 1, we use weight matrices as inputs into the convolutional layers. In Experiment 2, we manipulate the FC layer to explore how symbolic-like representations are encoded in CNNs. We leverage the fact that the FC layer outputs a feature map and that variable-specific weight matrices are temporally structured to (1) demonstrate how the distribution of learned weights varies between latent variables in systematic ways and (2) demonstrate how manipulating the FC layer while holding subsequent model parameters constant affects the output. We ultimately present an FC manipulation that can output a single segment. Using this technique, we show that lexically specific latent codes in generative CNNs (ciwGAN) share lexically invariant sublexical representations in the FC-layer weights, showing that ciwGAN encodes lexical information in a linguistically principled manner.
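The logic behind the two experiments can be illustrated with a minimal numpy sketch. This is not the authors' code: the array sizes, variable names (`latent_dim`, `code_dim`), and the toy one-layer "conv stack" are illustrative assumptions. The sketch shows (a) why weight matrices can stand in as feature maps, since with a one-hot classification code and zeroed noise the FC output reduces to the code-specific weight rows plus the bias (Experiment 1's intuition), and (b) how FC activations can be edited while all downstream parameters stay fixed (Experiment 2's intuition).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative; not the paper's ciwGAN sizes).
latent_dim, code_dim = 7, 3        # uniform noise z and one-hot code c
channels, time = 16, 5             # FC output reshaped to a feature map
fc_out = channels * time

# FC layer: concatenated [c, z] -> flat vector -> (channels, time) map.
W = rng.normal(size=(code_dim + latent_dim, fc_out))
b = rng.normal(size=fc_out)

def fc_feature_map(c, z):
    """FC-layer output, reshaped into a (channels, time) feature map."""
    return (np.concatenate([c, z]) @ W + b).reshape(channels, time)

# Experiment-1 intuition: with a one-hot code and zero noise, the feature
# map equals the code-specific weight rows plus the bias, so injecting
# weight matrices into the conv layers probes the same representation.
c = np.eye(code_dim)[1]            # one-hot lexical code
z = np.zeros(latent_dim)
fmap = fc_feature_map(c, z)
weight_as_map = (W[1] + b).reshape(channels, time)
assert np.allclose(fmap, weight_as_map)

# Experiment-2 intuition: manipulate FC activations while holding the
# subsequent (here: toy, fixed) convolutional parameters constant.
conv_kernel = rng.normal(size=(channels,))   # frozen 1x1 "conv" weights
def conv_stack(feature_map):
    return conv_kernel @ feature_map         # toy (time,) output

baseline = conv_stack(fmap)
edited = fmap.copy()
edited[:, 2] = 0.0                           # zero one time step of the map
manipulated = conv_stack(edited)
# Because the FC feature map is temporally structured, only the edited
# time step of the output changes.
```

The point of the assertion is that the FC layer is the only place where the latent code is mixed into the signal path; everything after it is a fixed function, which is what makes both interventions well defined.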
Problem

Research questions and friction points this paper is trying to address.

Convolutional Neural Networks
Speech Synthesis
Language Information Encoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Convolutional Neural Network (CNN)
Fully Connected (FC) Layer Analysis
Speech Synthesis Encoding