Large Deviations for Sequential Tests of Statistical Sequence Matching

📅 2025-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper studies how to identify matched pairs of sequences across two databases whose generating distributions are unknown, both when the number of true matches is known and when it is unknown. For sequential tests with bounded expected stopping time and a known number of matches, it derives the exact exponential decay rate of the mismatch probability (the *mismatch exponent*) under each hypothesis, and shows that optimal sequential tests achieve a larger mismatch exponent than the fixed-length tests of Zhou et al. (TIT 2024). For an unknown number of matches, it proposes a sequential test, gives conditions under which the expected stopping time is bounded, and characterizes the tradeoff among the exponential decay rates of the mismatch, false alarm, and false reject probabilities. The sequential test improves on the two-step fixed-length test of Zhou et al. (TIT 2024), and the paper also proposes a one-step fixed-length test that performs no worse than that two-step test. Methodologically, the work is a large deviations analysis of sequential tests.

📝 Abstract

We revisit the problem of statistical sequence matching initiated by Unnikrishnan (TIT 2015) and derive theoretical performance guarantees for sequential tests that have bounded expected stopping times. Specifically, in this problem, one is given two databases of sequences and the task is to identify all matched pairs of sequences. In each database, each sequence is generated i.i.d. from a distinct distribution, and a pair of sequences is said to be matched if they are generated from the same distribution. The generating distribution of each sequence is unknown. We first consider the case where the number of matches is known and derive the exact exponential decay rate of the mismatch (error) probability, a.k.a. the mismatch exponent, under each hypothesis for optimal sequential tests. Our results reveal the benefit of sequentiality by showing that optimal sequential tests have a larger mismatch exponent than the fixed-length tests of Zhou et al. (TIT 2024). Subsequently, we generalize our achievability result to the case of an unknown number of matches. In this case, two additional error probabilities arise: false alarm and false reject probabilities. We propose a corresponding sequential test, show that the test has bounded expected stopping time under certain conditions, and characterize the tradeoff among the exponential decay rates of the three error probabilities. Furthermore, we reveal the benefit of sequentiality over the two-step fixed-length test of Zhou et al. (TIT 2024) and propose a one-step fixed-length test that performs no worse than the fixed-length test of Zhou et al. (TIT 2024). When either database contains a single sequence, our results specialize to large deviations of sequential tests for statistical classification, the binary case of which was recently studied by Hsu, Li and Wang (ITW 2022).
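The matching setup described in the abstract can be made concrete with a small simulation. The sketch below is illustrative only and is not the test analyzed in the paper: it assumes both databases have the same number of sequences and that every sequence has a match, scores each candidate matching by a Jensen-Shannon-type divergence between empirical distributions, and uses a hypothetical stopping rule (stop once the gap between the best and second-best matching, scaled by the sample count, crosses a threshold). All function names, the scoring statistic, and the `threshold` parameter are assumptions introduced for illustration.

```python
import itertools
import math
import random
from collections import Counter


def empirical(seq, alphabet):
    """Empirical distribution (type) of a sequence over a finite alphabet."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in alphabet]


def kl(p, q):
    """KL divergence D(p || q), skipping zero-probability terms of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def gjs(p, q):
    """Jensen-Shannon-type divergence for two equal-length sequences."""
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return kl(p, mid) + kl(q, mid)


def best_two_matchings(db1, db2, alphabet):
    """Brute-force the two lowest-scoring full matchings (assumes >= 2 sequences)."""
    types1 = [empirical(s, alphabet) for s in db1]
    types2 = [empirical(s, alphabet) for s in db2]
    scored = sorted(
        (sum(gjs(types1[i], types2[j]) for i, j in enumerate(perm)), perm)
        for perm in itertools.permutations(range(len(db2)))
    )
    return scored[0], scored[1]


def sequential_match(samplers1, samplers2, alphabet, threshold, max_len):
    """Draw one symbol per sequence per step; stop when the scaled gap is large.

    The stopping rule is a hypothetical illustration, not the paper's test.
    Returns (perm, n): perm[i] is the index in database 2 matched to sequence i
    of database 1, and n is the number of samples drawn per sequence.
    """
    db1 = [[] for _ in samplers1]
    db2 = [[] for _ in samplers2]
    for n in range(1, max_len + 1):
        for seq, draw in zip(db1 + db2, samplers1 + samplers2):
            seq.append(draw())
        (best, perm), (second, _) = best_two_matchings(db1, db2, alphabet)
        if n * (second - best) >= threshold:
            return perm, n
    return perm, max_len


def make_samplers(dists, alphabet, rng):
    """One i.i.d. symbol source per distribution (p=p avoids late binding)."""
    return [lambda p=p: rng.choices(alphabet, weights=p)[0] for p in dists]


if __name__ == "__main__":
    rng = random.Random(0)
    alphabet = [0, 1, 2]
    pa, pb, pc = [0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]
    perm, n = sequential_match(
        make_samplers([pa, pb, pc], alphabet, rng),
        make_samplers([pc, pa, pb], alphabet, rng),
        alphabet, threshold=20.0, max_len=500,
    )
    print("matching:", perm, "samples per sequence:", n)
```

In this toy setting, a larger `threshold` trades a longer stopping time for a lower chance of returning a wrong matching, which is the qualitative tradeoff that the mismatch exponent quantifies. The brute-force search over permutations is only practical for a handful of sequences.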

Problem

Research questions and friction points this paper is trying to address.

Derive performance guarantees for sequential statistical sequence matching tests

Analyze error probability decay rates for known and unknown numbers of matches

Compare sequential and fixed-length tests in mismatch exponent performance
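The decay rates referred to above can be written in a standard form. The notation here is an assumption for illustration and may differ from the paper's: $\tau_n$ denotes the stopping time of a sequential test whose expected value is bounded by $n$, and $\beta_n$ denotes its mismatch probability under a given hypothesis.

```latex
\[
  E_{\mathrm{mismatch}}
  \;=\; \liminf_{n \to \infty} \, -\frac{1}{n} \log \beta_n ,
  \qquad \text{subject to } \mathbb{E}[\tau_n] \le n .
\]
```

For a fixed-length test, $\tau_n = n$ deterministically, so the same expression applies with $n$ as the sequence length; comparing the two exponents is what "benefit of sequentiality" refers to. When the number of matches is unknown, analogous exponents can be defined for the false alarm and false reject probabilities.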

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sequential tests with bounded expected stopping times

Exponential decay rate analysis for errors

Generalized tests for unknown match counts
