
Of course, I'm not the first to take this on: other reimplementations of Peter Norvig's algorithm can be found here, here, here and here.

One thing to note about the candidate-generation code below: the list comprehensions produce duplicate entries. These don't seem to actually impact the result's correctness, and passing them up the call stack instead of performing a filtering pass at each stage turned out to be faster.
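Here is a minimal sketch of that candidate generation, modeled on Norvig's Python `edits1` and `edits2` – the function names and exact structure are assumptions rather than the article's verbatim code:

```haskell
-- A sketch of candidate generation via list comprehensions, modeled on
-- Norvig's Python `edits1` (names and details assumed, not verbatim).
edits1 :: String -> [String]
edits1 word = deletes ++ transposes ++ replaces ++ inserts
  where
    letters    = ['a'..'z']
    splits     = [ splitAt i word | i <- [0 .. length word] ]
    deletes    = [ l ++ tail r | (l, r) <- splits, not (null r) ]
    transposes = [ l ++ r !! 1 : head r : drop 2 r
                 | (l, r) <- splits, length r > 1 ]
    replaces   = [ l ++ c : tail r | (l, r) <- splits, not (null r), c <- letters ]
    inserts    = [ l ++ c : r | (l, r) <- splits, c <- letters ]

-- Everything two edits away. Duplicates are passed along here rather
-- than being filtered out at each stage.
edits2 :: String -> [String]
edits2 word = [ e2 | e1 <- edits1 word, e2 <- edits1 e1 ]
```

Note, for instance, that `replaces` reproduces the input word once per position (whenever `c` equals the character it replaces), so the combined list naturally contains duplicates.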
Loading everything up in GHCi:

```
Loaded GHCi configuration from /Users/noah/.ghci
Compiling Paths_spell ( dist/build/autogen/Paths_spell.hs, interpreted )
Compiling Spell ( src/Spell.hs, interpreted )
```

As you will notice when running it, compared to the original Python implementation, the Haskell version performs significantly worse in one respect: parsing the file big.txt and populating the mapping from each word to its number of occurrences. On my machine, Python manages this in just under a second, while my Haskell implementation takes about 10 seconds. Luckily, this is only required during the first call to the correction function – any subsequent calls can skip this step and will run much faster, at which point the largely unoptimized Haskell translation “only” takes about 100 ms per word, about twice as long as Peter Norvig's Python original (although this varies depending on the input).

Another thing worth noting is that I have decided against using a set-like data type such as Data.Set or an equivalent third-party library, opting for plain lists instead (even though the Python version employs sets). This is mainly due to performance and code clarity considerations: an old version using Data.Set was significantly slower than the implementation described in this article.
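As a rough illustration of the list-based approach – not the article's verbatim code – candidate filtering can stay as simple as a single `filter`, with the name `known` borrowed from Norvig's Python and the word-count map (built below) passed in explicitly:

```haskell
import qualified Data.Map as M

-- Sketch: keep only candidates that actually occur in the corpus.
-- With plain lists this is a single filter; a Data.Set-based variant
-- would also deduplicate, but proved slower in practice.
known :: M.Map String Int -> [String] -> [String]
known wordCounts = filter (`M.member` wordCounts)
```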

So, how does the mapping from words to numbers of occurrences get built in the first place? Like this:

```haskell
import Data.Char (isAlpha, toLower)
import Data.List (group, sort)
import Data.List.Split (splitWhen)
import qualified Data.Map as M
import System.IO.Unsafe (unsafePerformIO)

-- | Note: To simulate the global variable from the original Python
-- implementation, we'll perform some unsafe IO here. Please put on your
-- safety goggles and take a few steps back.
-- (The binding name wordCounts is an assumption.)
wordCounts :: M.Map String Int
{-# NOINLINE wordCounts #-}
wordCounts = unsafePerformIO $ do
    text <- readFile "big.txt"
    return $ M.fromList [ (head l, length l) | l <- (group . sort . words') text ]
  where
    words' = filter (not . null) . splitWhen (not . isAlpha) . map toLower
```

The helper function words' first converts all characters to lower-case and then splits the resulting String on non-alphabetic characters. This yields a list which contains only the words we want (because they are composed of alphabetic characters) plus some empty lists (which is what splitWhen from the split package leaves behind when it encounters two subsequent non-alphabetic characters), the latter of which are subsequently filtered out. This emulates the \w+ regular expression closely and reasonably idiomatically. Finally, we sort the resulting list of words and group duplicate occurrences, which facilitates creating a mapping from words to number of occurrences.

Moving on: Given a word and a text, we need a way of computing the probability of the word.
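A minimal sketch of such a function, assuming the wordCounts map defined above (the name `p` mirrors Norvig's Python; the article's actual definition may differ):

```haskell
-- Sketch: the relative frequency of a word in the corpus, mirroring
-- Norvig's Python `P(word)`; assumes the `wordCounts` map defined above.
p :: String -> Double
p word = fromIntegral (M.findWithDefault 0 word wordCounts)
       / fromIntegral (sum (M.elems wordCounts))
```

With that in place, ranking candidates simply means picking the known candidate with the highest p.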
