Misplaced Pages

Autocorrelation (words)

Article snapshot taken from Wikipedia with creative commons attribution-sharealike license. Give it a read and then ask your questions in the chat. We can research this topic together.
In combinatorics, the autocorrelation of a word is the set of periods of this word

In combinatorics, a branch of mathematics, the autocorrelation of a word is the set of periods of this word. More precisely, it is a sequence of values which indicate how much the end of a word looks likes the beginning of a word. This value can be used to compute, for example, the average value of the first occurrence of this word in a random string.

Definition

In this article, A is an alphabet, and w = w 1 w n {\displaystyle w=w_{1}\dots w_{n}} a word on A of length n. The autocorrelation of w {\displaystyle w} can be defined as the correlation of w {\displaystyle w} with itself. However, we redefine this notion below.

Autocorrelation vector

The autocorrelation vector of w {\displaystyle w} is c = ( c 0 , , c n 1 ) {\displaystyle c=(c_{0},\dots ,c_{n-1})} , with c i {\displaystyle c_{i}} being 1 if the prefix of length n i {\displaystyle n-i} equals the suffix of length n i {\displaystyle n-i} , and with c i {\displaystyle c_{i}} being 0 otherwise. That is c i {\displaystyle c_{i}} indicates whether w i + 1 w n = w 1 w n i {\displaystyle w_{i+1}\dots w_{n}=w_{1}\dots w_{n-i}} .

For example, the autocorrelation vector of a a a {\displaystyle aaa} is ( 1 , 1 , 1 ) {\displaystyle (1,1,1)} since, clearly, for i {\displaystyle i} being 0, 1 or 2, the prefix of length n i {\displaystyle n-i} is equal to the suffix of length n i {\displaystyle n-i} . The autocorrelation vector of a b b {\displaystyle abb} is ( 1 , 0 , 0 ) {\displaystyle (1,0,0)} since no strict prefix is equal to a strict suffix. Finally, the autocorrelation vector of a a b b a a {\displaystyle aabbaa} is 100011, as shown in the following table:

a a b b a a
a a b b a a 1
a a b b a a 0
a a b b a a 0
a a b b a a 0
a a b b a a 1
a a b b a a 1

Note that c 0 {\displaystyle c_{0}} is always equal to 1, since the prefix and the suffix of length n {\displaystyle n} are both equal to the word w {\displaystyle w} . Similarly, c n 1 {\displaystyle c_{n-1}} is 1 if and only if the first and the last letters are the same.


Autocorrelation polynomial

The autocorrelation polynomial of w {\displaystyle w} is defined as c ( z ) = c 0 z 0 + + c n 1 z n 1 {\displaystyle c(z)=c_{0}z^{0}+\dots +c_{n-1}z^{n-1}} . It is a polynomial of degree at most n 1 {\displaystyle n-1} .

For example, the autocorrelation polynomial of a a a {\displaystyle aaa} is 1 + z + z 2 {\displaystyle 1+z+z^{2}} and the autocorrelation polynomial of a b b {\displaystyle abb} is 1 {\displaystyle 1} . Finally, the autocorrelation polynomial of a a b b a a {\displaystyle aabbaa} is 1 + z 4 + z 5 {\displaystyle 1+z^{4}+z^{5}} .

Property

We now indicate some properties which can be computed using the autocorrelation polynomial.

First occurrence of a word in a random string

Suppose that you choose an infinite sequence s {\displaystyle s} of letters of A {\displaystyle A} , randomly, each letter with probability 1 | A | {\displaystyle {\frac {1}{|A|}}} , where | A | {\displaystyle |A|} is the number of letters of A {\displaystyle A} . Let us call E {\displaystyle E} the expectation of the first occurrence of ? m {\displaystyle m} in s {\displaystyle s} ? . Then E {\displaystyle E} equals | A | n c ( 1 | A | ) {\displaystyle |A|^{n}c\left({\frac {1}{|A|}}\right)} . That is, each subword v {\displaystyle v} of w {\displaystyle w} which is both a prefix and a suffix causes the average value of the first occurrence of w {\displaystyle w} to occur | A | | v | {\displaystyle |A|^{|v|}} letters later. Here | v | {\displaystyle |v|} is the length of v.

For example, over the binary alphabet A = { a , b } {\displaystyle A=\{a,b\}} , the first occurrence of a a {\displaystyle aa} is at position 2 2 ( 1 + 1 2 ) = 6 {\displaystyle 2^{2}(1+{\frac {1}{2}})=6} while the average first occurrence of a b {\displaystyle ab} is at position 2 2 ( 1 ) = 4 {\displaystyle 2^{2}(1)=4} . Intuitively, the fact that the first occurrence of a a {\displaystyle aa} is later than the first occurrence of a b {\displaystyle ab} can be explained in two ways:

  • We can consider, for each position p {\displaystyle p} , what are the requirement for w {\displaystyle w} 's first occurrence to be at p {\displaystyle p} .
    • The first occurrence of a b {\displaystyle ab} can be at position 1 in only one way in both case. If s {\displaystyle s} starts with w {\displaystyle w} . This has probability 1 4 {\displaystyle {\frac {1}{4}}} for both considered values of w {\displaystyle w} .
    • The first occurrence of a b {\displaystyle ab} is at position 2 if the prefix of s {\displaystyle s} of length 3 is a a b {\displaystyle aab} or is b a b {\displaystyle bab} . However, the first occurrence of a a {\displaystyle aa} is at position 2 if and only if the prefix of s {\displaystyle s} of length 3 is b a a {\displaystyle baa} . (Note that the first occurrence of a a {\displaystyle aa} in a a a {\displaystyle aaa} is at position 1.).
    • In general, the number of prefixes of length n + 1 {\displaystyle n+1} such that the first occurrence of a a {\displaystyle aa} is at position n {\displaystyle n} is smaller for a a {\displaystyle aa} than for b a {\displaystyle ba} . This explain why, on average, the first a a {\displaystyle aa} arrive later than the first a b {\displaystyle ab} .
  • We can also consider the fact that the average number of occurrences of w {\displaystyle w} in a random string of length l {\displaystyle l} is | A | l n {\displaystyle |A|^{l-n}} . This number is independent of the autocorrelation polynomial. An occurrence of w {\displaystyle w} may overlap another occurrence in different ways. More precisely, each 1 in its autocorrelation vector correspond to a way for occurrence to overlap. Since many occurrences of w {\displaystyle w} can be packed together, using overlapping, but the average number of occurrences does not change, it follows that the distance between two non-overlapping occurrences is greater when the autocorrelation vector contains many 1's.

Ordinary generating functions

Autocorrelation polynomials allows to give simple equations for the ordinary generating functions (OGF) of many natural questions.

  • The OGF of the languages of words not containing w {\displaystyle w} is c ( z ) z n + ( 1 | A | z ) c ( z ) {\displaystyle {\frac {c(z)}{z^{n}+(1-|A|z)c(z)}}} .
  • The OGF of the languages of words containing w {\displaystyle w} is z n ( 1 | A | z ) ( z n + ( 1 | A | z ) c ( z ) ) {\displaystyle {\frac {z^{n}}{(1-|A|z)(z^{n}+(1-|A|z)c(z))}}} .
  • The OGF of the languages of words containing a single occurrence of w {\displaystyle w} , at the end of the word is z n z n + ( 1 | A | z ) c ( z ) {\displaystyle {\frac {z^{n}}{z^{n}+(1-|A|z)c(z)}}} .

References

Categories: