Misplaced Pages

Chomsky–Schützenberger enumeration theorem

Article snapshot taken from Wikipedia with creative commons attribution-sharealike license. Give it a read and then ask your questions in the chat. We can research this topic together.

In formal language theory, the Chomsky–Schützenberger enumeration theorem is a theorem derived by Noam Chomsky and Marcel-Paul Schützenberger about the number of words of a given length generated by an unambiguous context-free grammar. The theorem provides an unexpected link between the theory of formal languages and abstract algebra.

Statement

In order to state the theorem, a few notions from algebra and formal language theory are needed.

Let N {\displaystyle \mathbb {N} } denote the set of nonnegative integers. A power series over N {\displaystyle \mathbb {N} } is an infinite series of the form

f = f ( x ) = k = 0 a k x k = a 0 + a 1 x 1 + a 2 x 2 + a 3 x 3 + {\displaystyle f=f(x)=\sum _{k=0}^{\infty }a_{k}x^{k}=a_{0}+a_{1}x^{1}+a_{2}x^{2}+a_{3}x^{3}+\cdots }

with coefficients a k {\displaystyle a_{k}} in N {\displaystyle \mathbb {N} } . The multiplication of two formal power series f {\displaystyle f} and g {\displaystyle g} is defined in the expected way as the convolution of the sequences a n {\displaystyle a_{n}} and b n {\displaystyle b_{n}} :

f ( x ) g ( x ) = k = 0 ( i = 0 k a i b k i ) x k . {\displaystyle f(x)\cdot g(x)=\sum _{k=0}^{\infty }\left(\sum _{i=0}^{k}a_{i}b_{k-i}\right)x^{k}.}

In particular, we write f 2 = f ( x ) f ( x ) {\displaystyle f^{2}=f(x)\cdot f(x)} , f 3 = f ( x ) f ( x ) f ( x ) {\displaystyle f^{3}=f(x)\cdot f(x)\cdot f(x)} , and so on. In analogy to algebraic numbers, a power series f ( x ) {\displaystyle f(x)} is called algebraic over Q ( x ) {\displaystyle \mathbb {Q} (x)} , if there exists a finite set of polynomials p 0 ( x ) , p 1 ( x ) , p 2 ( x ) , , p n ( x ) {\displaystyle p_{0}(x),p_{1}(x),p_{2}(x),\ldots ,p_{n}(x)} each with rational coefficients such that

p 0 ( x ) + p 1 ( x ) f + p 2 ( x ) f 2 + + p n ( x ) f n = 0. {\displaystyle p_{0}(x)+p_{1}(x)\cdot f+p_{2}(x)\cdot f^{2}+\cdots +p_{n}(x)\cdot f^{n}=0.}

A context-free grammar is said to be unambiguous if every string generated by the grammar admits a unique parse tree or, equivalently, only one leftmost derivation. Having established the necessary notions, the theorem is stated as follows.

Chomsky–Schützenberger theorem. If L {\displaystyle L} is a context-free language admitting an unambiguous context-free grammar, and a k := | L   Σ k | {\displaystyle a_{k}:=|L\ \cap \Sigma ^{k}|} is the number of words of length k {\displaystyle k} in L {\displaystyle L} , then G ( x ) = k = 0 a k x k {\displaystyle G(x)=\sum _{k=0}^{\infty }a_{k}x^{k}} is a power series over N {\displaystyle \mathbb {N} } that is algebraic over Q ( x ) {\displaystyle \mathbb {Q} (x)} .

Proofs of this theorem are given by Kuich & Salomaa (1985), and by Panholzer (2005).

Usage

Asymptotic estimates

The theorem can be used in analytic combinatorics to estimate the number of words of length n generated by a given unambiguous context-free grammar, as n grows large. The following example is given by Gruber, Lee & Shallit (2012): the unambiguous context-free grammar G over the alphabet {0,1} has start symbol S and the following rules

SM | U
M → 0M1M | ε
U → 0S | 0M1U.

To obtain an algebraic representation of the power series ⁠ G ( x ) {\displaystyle G(x)} ⁠ associated with a given context-free grammar G, one transforms the grammar into a system of equations. This is achieved by replacing each occurrence of a terminal symbol by x, each occurrence of ε by the integer '1', each occurrence of '→' by '=', and each occurrence of '|' by '+', respectively. The operation of concatenation at the right-hand-side of each rule corresponds to the multiplication operation in the equations thus obtained. This yields the following system of equations:

S = M + U
M = M²x² + 1
U = Sx + MUx²

In this system of equations, S, M, and U are functions of x, so one could also write ⁠ S ( x ) {\displaystyle S(x)} ⁠, ⁠ M ( x ) {\displaystyle M(x)} ⁠, and ⁠ U ( x ) {\displaystyle U(x)} ⁠. The equation system can be resolved after S, resulting in a single algebraic equation:

x ( 2 x 1 ) S 2 + ( 2 x 1 ) S + 1 = 0 {\displaystyle x(2x-1)S^{2}+(2x-1)S+1=0} ⁠.

This quadratic equation has two solutions for S, one of which is the algebraic power series ⁠ G ( x ) {\displaystyle G(x)} ⁠. By applying methods from complex analysis to this equation, the number a n {\displaystyle a_{n}} of words of length n generated by G can be estimated, as n grows large. In this case, one obtains a n O ( 2 + ϵ ) n {\displaystyle a_{n}\in O(2+\epsilon )^{n}} but a n O ( 2 ϵ ) n {\displaystyle a_{n}\notin O(2-\epsilon )^{n}} for each ϵ > 0 {\displaystyle \epsilon >0} .

The following example is from Bassino & Nicaud (2011): { S X Y T a T | T b T | Y c Y Y Y a Y | c Y | a b T a Y Y a | X X a | b | c { s ( z ) = x ( z ) y ( z ) t ( z ) = z t ( z ) + z t ( z ) 2 + z y ( z ) 2 y ( z ) = z y ( z ) 2 + z y ( z ) + z 4 t ( z ) y ( z ) 2 + x ( z ) x ( z ) = 3 z {\displaystyle \left\{{\begin{array}{l }{S\rightarrow XY}\\{T\rightarrow aT|TbT|YcY}\\{Y\rightarrow YaY|cY|abTaYYa|X}\\{X\rightarrow a|b|c}\end{array}}\Rightarrow \left\{{\begin{array}{l}s(z)=x(z)y(z)\\t(z)=zt(z)+zt(z)^{2}+zy(z)^{2}\\y(z)=zy(z)^{2}+zy(z)+z^{4}t(z)y(z)^{2}+x(z)\\x(z)=3z\end{array}}\right.\right.} which simplifies to s ( z ) 8 27 ( z 3 z 2 ) s ( z ) 5 + + 59049 z 10 = 0 {\displaystyle s(z)^{8}-27\left(z^{3}-z^{2}\right)s(z)^{5}+\ldots +59049z^{10}=0}

Inherent ambiguity

In classical formal language theory, the theorem can be used to prove that certain context-free languages are inherently ambiguous. For example, the Goldstine language L G {\displaystyle L_{G}} over the alphabet { a , b } {\displaystyle \{a,b\}} consists of the words a n 1 b a n 2 b a n p b {\displaystyle a^{n_{1}}ba^{n_{2}}b\cdots a^{n_{p}}b} with p 1 {\displaystyle p\geq 1} , n i > 0 {\displaystyle n_{i}>0} for i { 1 , 2 , , p } {\displaystyle i\in \{1,2,\ldots ,p\}} , and n j j {\displaystyle n_{j}\neq j} for some j { 1 , 2 , , p } {\displaystyle j\in \{1,2,\ldots ,p\}} .

It is comparably easy to show that the language L G {\displaystyle L_{G}} is context-free. The harder part is to show that there does not exist an unambiguous grammar that generates L G {\displaystyle L_{G}} . This can be proved as follows: If g k {\displaystyle g_{k}} denotes the number of words of length k {\displaystyle k} in L G {\displaystyle L_{G}} , then for the associated power series holds G ( x ) = k = 0 g k x k = 1 x 1 2 x 1 x k 1 x k ( k + 1 ) / 2 1 {\displaystyle G(x)=\sum _{k=0}^{\infty }g_{k}x^{k}={\frac {1-x}{1-2x}}-{\frac {1}{x}}\sum _{k\geq 1}x^{k(k+1)/2-1}} . Using methods from complex analysis, one can prove that this function is not algebraic over Q ( x ) {\displaystyle \mathbb {Q} (x)} . By the Chomsky-Schützenberger theorem, one can conclude that L G {\displaystyle L_{G}} does not admit an unambiguous context-free grammar.

Notes

  1. See Gruber, Lee & Shallit (2012) for a detailed exposition.
  2. Berstel & Boasson (1990).
  3. See Berstel & Boasson (1990) for detailed account.

References

Noam Chomsky
Select
bibliography
Linguistics
Politics
Collections
Academic
works about
Filmography
Family
Related
Categories: