Misplaced Pages

Anatree

Article snapshot taken from Wikipedia with creative commons attribution-sharealike license. Give it a read and then ask your questions in the chat. We can research this topic together.
Data structure for anagram solving
This article includes a list of general references, but it lacks sufficient corresponding inline citations. Please help to improve this article by introducing more precise citations. (July 2013) (Learn how and when to remove this message)
Anatree

An anatree is a data structure designed to solve anagrams. Solving an anagram is the problem of finding a word from a given list of letters. These problems are commonly encountered in word games like Scrabble or in newspaper crossword puzzles. The problem for the wordwheel also has the condition that the central letter appear in all the words framed with the given set. Some other conditions may be introduced regarding the frequency (number of appearances) of each of the letters in the given input string. These problems are classified as Constraint satisfaction problem in computer science literature.

An anatree is represented as a directed tree which contains a set of words (W) encoded as strings in some alphabet. The internal vertices are labelled with some letter in the alphabet and the leaves contain words. The edges are labelled with non-negative integers. An anatree has the property that the sum of the edge labels from the root to the leaf is the length of the word stored at the leaf. If the internal vertices are labelled as α 1 {\displaystyle \alpha _{1}} , α 2 {\displaystyle \alpha _{2}} ... α l {\displaystyle \alpha _{l}} , and the edge labels are n 1 {\displaystyle n_{1}} , n 2 {\displaystyle n_{2}} ... n l {\displaystyle n_{l}} , then the path from the root to the leaf along these vertices and edges are a list of words that contain n 1 {\displaystyle n_{1}} α 1 {\displaystyle \alpha _{1}} s, n 2 {\displaystyle n_{2}} α 2 {\displaystyle \alpha _{2}} s and so on. Anatrees are intended to be read only data structures with all the words available at construction time.

Mixed anatree

A mixed anatree is an anatree where the internal vertices also store words. A mixed anatree can have words of varying lengths, where as in a regular anatree, all words are of the same length.

Data structures

A number of data structures have been proposed to solve anagrams in constant time. Two of the most commonly used data structures are the alphabetic map and the frequency map.

The alphabetic map maintains a hash table of all the possible words that can be in the language (this is referred to as the lexicon). For a given input string, sort the letters in alphabetic order. This sorted string maps onto a word in the hash table. Hence finding the anagram requires sorting the letters and looking up the word in the hash table. The sorting can be done in linear time with counting sort and hash table look ups can be done in constant time. For example, given the word ANATREE, the alphabetic map would produce a mapping of { A A E E N R T > { a n a t r e e } } {\displaystyle \{AAEENRT->\{''anatree''\}\}} .

A frequency map also stores the list of all possible words in the lexicon in a hash table. For a given input string, the frequency map maintains the frequencies (number of appearances) of all the letters and uses this count to perform a look up in the hash table. The worst case execution time is found to be linear in size of the lexicon. For example, given the word ANATREE, the alphabetic map would produce a mapping of f ( A ) > 2 , f ( E ) > 2 , f ( N ) > 1 , f ( R ) > 1 , f ( T ) > 1 {\displaystyle f(A)->2,f(E)->2,f(N)->1,f(R)->1,f(T)->1} . The words that do not appear in the string are not written in the map.

Construction

The construction of an anatree begins by selecting a label for the root and partitioning words based on the label chosen for the root. This process is repeated recursively for all the labels of the tree. Anatree construction is non-canonical for a given set of words, depending on the label chosen for the root, the anatree will differ accordingly. The performance of the anatree is greatly impacted by the choice of labels.

The following are some heuristics for choosing labels:

  • Start labeling vertices in alphabetical order from the root. This approach reduces construction overhead
  • Start labeling vertices based on the relative frequency. A probabilistic approach is used to assign labels to vertices. If W n α {\displaystyle W_{n}^{\alpha }} is the set of words that contain n α s {\displaystyle n\alpha s} , then we label the vertex with α {\displaystyle \alpha } if it maximizes the expected distance to the leaf. This approach has the most frequently appearing characters (like E) labeled at the root and the least frequently appearing characters labeled at the leaves. The following equation is maximized D α = n n | W n α | | W | {\displaystyle D_{\alpha }=\sum _{n}n{\frac {|W_{n}^{\alpha }|}{|W|}}} . This approach prevents long sequences of zero labeled edges since they do not contribute letters to the words generated by the anatree.

Finding anagrams

To find a word in an anatree, start at the root, depending on the frequency of the label in the given input string, follow the edge that has that frequency till the leaf. The leaf contains the required word. For example, consider the anatree in the figure, to find the word d o g {\displaystyle dog} , the given string may be o g d {\displaystyle ogd} . Start at the root and follow the edge that has 1 {\displaystyle 1} as the label. We follow this label since the given input string has 1 {\displaystyle 1} d {\displaystyle d} . Traverse this edge until the leaf is encountered. That gives the required word.

Space and time requirements

A lexicon that stores w {\displaystyle w} words (each word can be l {\displaystyle l} characters long) in an alphabet O {\displaystyle O} has the following space requirements.

Data Structure Space Requirements
Alphabetic Map O ( w l ( l o g | O | ) {\displaystyle O(wl(log|O|)}
Frequency Map O ( w ( | O | l o g l + l l o g | O | ) ) {\displaystyle O(w(|O|logl+llog|O|))}
Anatree O ( w | O | ( l o g | O | + l ) ) {\displaystyle O(w|O|(log|O|+l))}

The worst case execution time of an anatree is O ( | w | ( l + w | O | 2 ) ) {\displaystyle O(|w|(l+w|O|^{2}))}

References

  1. Reams, Charles (March 2012). "Anatree: A Fast Data Structure for Anagrams". Journal of Experimental Algorithmics. 17 (1): 2012. doi:10.1145/2133803.2133804.

External links

Categories: