Improving Unsupervised Constituency Parsing via Maximizing Semantic Information

TL;DR: Constituent structures encode fine-grained semantics as constituent substrings, which can be identified as high-information substrings among paraphrases.

The Challenge

Unsupervised constituency parsing, discovering syntactic structure without labeled data, remains a fundamental challenge in Computational Linguistics. While recent approaches rely heavily on language modeling objectives, we ask: Is language modeling alone sufficient for accurate parsing?

Our answer is no. We propose that sentence-level semantic information provides crucial signals for robust parsing that pure distributional methods miss.

Key Insight: Syntax Encodes Semantics

Constituent Structure as Semantic Representation

Each node in a constituent tree corresponds to a meaningful semantic unit. This one-to-one correspondence between syntactic constituents and partial meanings suggests we can invert the problem: identify constituents by finding semantically coherent substrings.

Our Approach

Step 1: Quantifying Semantic Information via Paraphrasing

We leverage paraphrases to identify semantically important substrings. The intuition: constituents should appear consistently across paraphrases of the same sentence, while arbitrary substrings should not.

Step 2: Why Traditional Methods Fall Short

We compared two approaches for estimating substring semantic information:

Neural Similarity (Bge-M3)

Neural embeddings struggle to distinguish between constituent (blue) and random substrings (red), showing nearly identical scores.

Our TF-IDF Approach

By treating substrings “generated” by semantics in paraphrasing sentences (identical to how words are “generated” by topics in documents), TF-IDF naturally captures the substring-semantics interactions (as measured by semantic information).

Step 3: Learning to Parse via Information Maximization

We use reinforcement learning to train a parser that maximizes the semantic information of predicted constituents. This bridges semantic objectives with traditional syntactic parsing through a unified optimization framework.