Hello World! I’m Maurice, I’m new to .simplicity, nice to meet you, etc.
Language
I’ve been experimenting with some fun stuff lately. It involves language, programming, password dictionaries, spell checkers, and more.
- Get a file with a lot of words (e.g. a spell checker dictionary).
- Count the occurrences of N-letter groups in those words. (Example with N = 3: “hello” gives hel++, ell++, llo++.)
- Find high-scoring letter groups which overlap (e.g. “hel” + “ell” = “hell”) to form a word. (Obviously, repeat this to form bigger words.)
Using an English dictionary, this gives output such as: “anteringlogratio”, “callinesthesional” and “prestionistering”. Although those are absolutely not English words (as far as I know), they look pretty English. Cool huh?
It can obviously also create ‘fake words’ in other languages, by using other dictionaries. For example, using a Dutch dictionary: “verderendelijker”, “heiderdelijkeren” and “eerderingelijkerigen”. (I’m Dutch, and I can tell you those words are easy to read and pronounce, although they mean nothing to me.)
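Here’s a minimal Python sketch of the three steps listed earlier. The function names (`count_ngrams`, `generate_word`) are made up for illustration, and picking the next letter group at random, weighted by its count, is just one assumed way to favor ‘high-scoring’ groups. It is not necessarily how the words quoted above were produced.

```python
import random
from collections import Counter

def count_ngrams(words, n=3):
    """Count every n-letter group in every word ("hello" -> hel, ell, llo)."""
    counts = Counter()
    for word in words:
        word = word.lower()
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts

def generate_word(counts, length, rng=random):
    """Chain overlapping groups until the word has `length` letters.

    Assumption: groups are drawn at random, weighted by their count.
    The result can be shorter than `length` if no group overlaps the
    current ending, and is never shorter than one group.
    """
    if not counts:
        raise ValueError("no letter groups counted")
    groups = list(counts)
    n = len(groups[0])
    # Start with any group, frequent ones being more likely.
    word = rng.choices(groups, weights=[counts[g] for g in groups])[0]
    while len(word) < length:
        # The next group must start with the last n-1 letters so far.
        tail = word[len(word) - (n - 1):]
        candidates = [g for g in groups if g.startswith(tail)]
        if not candidates:
            break
        weights = [counts[g] for g in candidates]
        # Only the last letter of the chosen group is new.
        word += rng.choices(candidates, weights=weights)[0][-1]
    return word
```

To play with it, read a word list (one word per line) into a Python list, pass it to `count_ngrams`, and call `generate_word(counts, 16)` a few times.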
I can hear you thinking: “Nice, but why’d you ever need to create ‘fake words’?”
Well, for example, we could expand a password dictionary by generating new passwords from the passwords already in there. This gives new passwords which are likely to be real passwords, instead of the 99% garbage you get from a simple brute-forcer.
Also, if we can create words that seem to belong to some language, we can probably also recognize if something could be a real word.
We could create a program which can tell random sequences of letters and real words apart by giving each word an ‘Englishness-score’: the more it uses high-scoring letter groups, the higher the score. This might come in handy while brute-forcing a password for some text file, because the program could detect whether it successfully decrypted the file or just got some garbage output. Or we could create a smart spell checker which can check and correct words even when they’re not in the dictionary. If ‘hello’ has a much higher score than ‘helllo’, the user probably wanted to type ‘hello’. Or a language recognizer: if all the words have a higher ‘Dutchness-score’ than ‘Englishness-score’, it’s probably not an English text.
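A rough sketch of such a score in Python. The scoring formula is my own assumption here: it averages the logarithm of each letter group’s count, adding 1 so that unseen groups don’t produce log(0). The names `count_ngrams` and `englishness` and the tiny word list are purely illustrative.

```python
import math
from collections import Counter

def count_ngrams(words, n=3):
    """Count every n-letter group in every word."""
    counts = Counter()
    for word in words:
        word = word.lower()
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts

def englishness(word, counts, n=3):
    """Average log-frequency of the word's letter groups (higher = more familiar).

    Assumed formula: mean of log((count + 1) / (total + 1)) over all groups.
    Words shorter than n letters have no groups and get -infinity.
    """
    word = word.lower()
    groups = [word[i:i + n] for i in range(len(word) - n + 1)]
    if not groups:
        return float("-inf")
    total = sum(counts.values())
    return sum(math.log((counts[g] + 1) / (total + 1)) for g in groups) / len(groups)

# Toy word list, only for illustration; a real dictionary file works the same way.
counts = count_ngrams(["hello", "help", "yellow", "bell", "fellow"])
print(englishness("hello", counts) > englishness("helllo", counts))
```

A language recognizer would be the same idea with two sets of counts: build one from a Dutch dictionary and one from an English dictionary, score each word against both, and see which score is higher.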
When we take this a level higher, using sentences and word groups instead of words and letter groups, the awesomeness grows exponentially. A spell checker could correct “Ive is fun” to “Ice is fun”, but “Ive been running” to “I’ve been running”. An internet spider could tell real text apart from those random-keywords pages. Lots of possibilities.
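To make the “Ive” example concrete, here’s a small Python sketch that counts word pairs instead of letter groups and picks the candidate that fits its neighbours best. Everything in it is an assumption for illustration: the names `count_word_pairs` and `best_correction`, the four-sentence toy corpus, and the idea that the candidate corrections are already given.

```python
from collections import Counter

def count_word_pairs(sentences):
    """Count every pair of neighbouring words in every sentence."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        counts.update(zip(words, words[1:]))
    return counts

def best_correction(words, index, candidates, counts):
    """Return the candidate for words[index] whose word pairs are most common.

    Ties go to the candidate listed first.
    """
    def fit(candidate):
        trial = words[:index] + [candidate] + words[index + 1:]
        trial = [w.lower() for w in trial]
        return sum(counts[pair] for pair in zip(trial, trial[1:]))
    return max(candidates, key=fit)

# Toy corpus, made up for this example.
corpus = ["ice is cold", "ice is slippery", "i've been busy", "i've been there"]
pairs = count_word_pairs(corpus)

print(best_correction(["Ive", "is", "fun"], 0, ["Ice", "I've"], pairs))
print(best_correction(["Ive", "been", "running"], 0, ["Ice", "I've"], pairs))
```

With this toy corpus the first call picks “Ice” (because “ice is” was seen) and the second picks “I've” (because “i've been” was seen). A real version would need a much bigger corpus and some way to come up with the candidates.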
Anyway, I’ll try to code some of this stuff to see how well it all works. Stay tuned.
