<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>.simplicity &#187; Language Recognizer</title>
	<atom:link href="http://www.dotsimplicity.net/tag/language-recognizer/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dotsimplicity.net</link>
	<description>Simple, reliable, simplicity. A software discussion blog</description>
	<lastBuildDate>Sun, 04 Jul 2010 09:44:42 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Re: Language</title>
		<link>http://www.dotsimplicity.net/2009/11/re-language/</link>
		<comments>http://www.dotsimplicity.net/2009/11/re-language/#comments</comments>
		<pubDate>Tue, 24 Nov 2009 12:05:41 +0000</pubDate>
		<dc:creator>Nick</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Software Idea]]></category>
		<category><![CDATA[Language Recognizer]]></category>

		<guid isPermaLink="false">http://www.dotsimplicity.net/?p=507</guid>
		<description><![CDATA[These features are pending a proof-of-concept implementation. I’m currently very busy with my studies and my job, and so is Maurice, so you probably won’t see anything soon. To be honest, on my side it’s also laziness. Today, though, I sat down to come up with some algorithms.

Some definitions [...]]]></description>
			<content:encoded><![CDATA[<p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Helvetica;">These features are pending a proof-of-concept implementation. I’m currently very busy with my studies and my job, and so is Maurice, so you probably won’t see anything soon. To be honest, on my side it’s also laziness. Today, though, I sat down to come up with some algorithms.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Helvetica;"><span id="more-507"></span></p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Helvetica; min-height: 14.0px;">Some definitions for the following paragraphs:</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Helvetica;">Text: any text. Can be a complete book, a word, a sentence, etc.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Helvetica;">LR&lt;Languages&gt;: a Language Recognizer for the given languages, e.g. LR&lt;English&gt;.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Helvetica;">Possible applications of the LR</p>
<h2>Advanced Texthelper</h2>
<h3>Features</h3>
<h5>Can correct words without a dictionary: recognizes when something probably isn’t a word.</h5>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Helvetica;">Words of length X have an average score AVG. If a word of length X scores lower than AVG by more than some significance threshold S, it probably isn’t a word in the current language. I don’t yet know how to determine S.</p>
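<p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Helvetica;">One possible way to pin down S, purely as an assumption and not something we have tested: measure how many standard deviations a word’s score lies below AVG. The function name, the default threshold of 2.0 and the sample scores below are all made up for illustration.</p>

```python
from statistics import mean, stdev


def is_probably_not_a_word(score, same_length_scores, s=2.0):
    """Flag a score lying more than s standard deviations below AVG.

    same_length_scores holds the scores of known words of the same length X.
    Using standard deviations for S is an assumption, not the post's method.
    """
    avg = mean(same_length_scores)
    spread = stdev(same_length_scores)
    if spread == 0:
        # All known words score the same: anything lower is suspicious.
        return avg > score
    return (avg - score) / spread > s


# Hypothetical scores of known five-letter words.
five_letter_scores = [40.0, 42.0, 38.0, 41.0, 39.0]

print(is_probably_not_a_word(12.0, five_letter_scores))  # far below AVG
print(is_probably_not_a_word(39.5, five_letter_scores))  # close to AVG
```

<p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Helvetica;">With these made-up numbers the first call flags the word and the second does not. Whether 2.0 is a sensible value for S would have to be found by experiment.</p>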
<h5>Can correct sentences, since it also recognizes sentence structures.</h5>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Helvetica;">Basically the same as the word recognition, but for sentences.</p>
<h5>Can auto-complete sentences (and parts thereof).</h5>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Helvetica;">Since the LR keeps track of how many times a word follows another word, it should be able to predict which word you are going to type next, which could speed up writing a text. As you can tell, there is a lot of shoulda, coulda, woulda here, but maybe we are clever enough to pull this off.</p>
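<p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Helvetica;">A minimal sketch of that idea, assuming the LR simply stores a count for every pair of adjacent words. The function names and the sample sentence are hypothetical, and splitting on whitespace is a simplification.</p>

```python
from collections import Counter, defaultdict


def count_followers(text):
    """Count how often each word directly follows another word."""
    followers = defaultdict(Counter)
    words = text.lower().split()
    for current, following in zip(words, words[1:]):
        followers[current][following] += 1
    return followers


def predict_next(followers, word, k=3):
    """Return up to k words most often seen right after `word`."""
    seen = followers.get(word.lower(), Counter())
    return [candidate for candidate, _ in seen.most_common(k)]


sample = "the cat sat on the mat and the cat slept"
model = count_followers(sample)
print(predict_next(model, "the"))  # most frequent followers of "the" first
```

<p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Helvetica;">A word that was never seen gives an empty list, so a text helper built on this would simply offer no suggestion in that case.</p>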
<h2>Language Recognizer</h2>
<h3>Features</h3>
<h5>Can tell if text A is more English than text B</h5>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Helvetica;">If text A scores higher on LR&lt;English&gt; than text B, A is more likely to be English than B.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Helvetica;"><em>Test results indicate very few false positives with a simple implementation.</em></p>
<h5>Can help to automatically recognize whether something is human-readable text.</h5>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Helvetica;">If a text gets a higher score than previous texts, it is more likely to be human-readable text, which is useful for the “recovery” of encrypted text.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dotsimplicity.net/2009/11/re-language/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>HAI! / Language</title>
		<link>http://www.dotsimplicity.net/2009/11/hai-language/</link>
		<comments>http://www.dotsimplicity.net/2009/11/hai-language/#comments</comments>
		<pubDate>Mon, 16 Nov 2009 20:56:29 +0000</pubDate>
		<dc:creator>Maurice</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Software Idea]]></category>
		<category><![CDATA[Language Recognizer]]></category>

		<guid isPermaLink="false">http://www.dotsimplicity.net/?p=499</guid>
		<description><![CDATA[Hello World! I&#8217;m Maurice, I&#8217;m new to .simplicity, nice to meet you, etc.
Language
I&#8217;ve been experimenting with some fun stuff lately. It involves language, programming, password-dictionaries, spell checkers, and more.


Get a file with a lot of words (e.g. a spell checker dictionary).
Count the occurrences of N-letter-groups in those words. (example with N = [...]]]></description>
			<content:encoded><![CDATA[<p>Hello World! I&#8217;m Maurice, I&#8217;m new to .simplicity, nice to meet you, etc.</p>
<h2>Language</h2>
<p>I&#8217;ve been experimenting with some fun stuff lately. It involves language, programming, password-dictionaries, spell checkers, and more. <img src='http://www.dotsimplicity.net/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /></p>
<p><span id="more-499"></span></p>
<ol>
<li>Get a file with a lot of words (e.g. a spell checker dictionary).</li>
<li>Count the occurrences of N-letter-groups in those words. (example with N = 3: &#8220;hello&#8221;, hel++, ell++, llo++)</li>
<li>Find high-scoring letter-groups which overlap (e.g. &#8220;hel&#8221; + &#8220;ell&#8221; = &#8220;hell&#8221;) to form a word. (Obviously, repeat this to form bigger words.)</li>
</ol>
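<p>The three steps above could be sketched roughly like this. It is only an illustration: the tiny word list stands in for a real dictionary file, the function names are made up, and picking the next letter-group at random, weighted by its count, is just one assumed way of favouring high-scoring groups.</p>

```python
import random
from collections import Counter


def count_groups(words, n=3):
    """Step 2: count every n-letter group in every word."""
    counts = Counter()
    for word in words:
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


def make_fake_word(counts, length, rng, n=3):
    """Step 3: chain overlapping groups, favouring frequent ones (n >= 2)."""
    groups = list(counts)
    word = rng.choices(groups, weights=[counts[g] for g in groups])[0]
    while len(word) < length:
        tail = word[-(n - 1):]
        # Groups that overlap with the end of the word, like "hel" + "ell".
        options = [g for g in groups if g.startswith(tail)]
        if not options:
            break  # dead end: nothing continues this word
        chosen = rng.choices(options, weights=[counts[g] for g in options])[0]
        word += chosen[-1]
    return word


# Step 1: a hypothetical stand-in for a dictionary file.
words = ["hello", "help", "yellow", "mellow", "bell", "shell"]
counts = count_groups(words)
print(counts["ell"], counts["hel"])
print(make_fake_word(counts, 8, random.Random(0)))
```

<p>Every letter-group in a generated word occurs somewhere in the word list, which is why the results look like the source language. Words can come out shorter than requested when the chain hits a dead end.</p>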
<p>Using an English dictionary, this gives output such as: &#8220;anteringlogratio&#8221;, &#8220;callinesthesional&#8221; and &#8220;prestionistering&#8221;. Although those are absolutely not English words (as far as I know), they look pretty English. Cool, huh?</p>
<p>It can obviously also create &#8216;fake words&#8217; in other languages, by using other dictionaries. For example, using a Dutch dictionary: &#8220;verderendelijker&#8221;, &#8220;heiderdelijkeren&#8221; and &#8220;eerderingelijkerigen&#8221;. (I&#8217;m Dutch, and I can tell you those words are easy to read and pronounce, although they mean nothing to me.)</p>
<p>I can hear you thinking: &#8220;Nice, but why would you ever need to create &#8216;fake words&#8217;?&#8221;<br />
Well, for example, we could expand a password-dictionary by generating new passwords from the passwords already in there. This gives new candidates which are likely to be real passwords, instead of the 99% garbage a simple brute-forcer produces.</p>
<p>Also, if we can create words that seem to belong to some language, we can probably also recognize if something could be a real word.</p>
<p>We could create a program which can tell random sequences of letters and real words apart by giving each word an &#8216;Englishness-score&#8217;: the more it uses high-scoring letter-groups, the higher the score. This might come in handy while brute-forcing a password for some text file. The program could detect whether it successfully decrypted the file or just got some garbage output. Or we could create a smart spell checker which can check and correct words even when they&#8217;re not in the dictionary. If &#8216;hello&#8217; has a much higher score than &#8216;helllo&#8217;, the user probably wanted to type &#8216;hello&#8217;. Or a language-recognizer: if all the words have a higher &#8216;Dutchness-score&#8217; than &#8216;Englishness-score&#8217;, it&#8217;s probably not an English text.</p>
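<p>Here is a rough sketch of such a score, under a few assumptions: the score of a word is taken to be the average count of its 3-letter groups, a text is assigned to the language with the highest summed word score, and the two tiny word lists stand in for real English and Dutch dictionaries. All names are hypothetical.</p>

```python
from collections import Counter


def count_groups(words, n=3):
    """Count every n-letter group in a list of known words."""
    counts = Counter()
    for word in words:
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


def score(word, counts, n=3):
    """Average count of the word's letter groups (assumed scoring rule)."""
    groups = [word[i:i + n] for i in range(len(word) - n + 1)]
    if not groups:
        return 0.0  # word is shorter than n letters
    return sum(counts[g] for g in groups) / len(groups)


def guess_language(text, models):
    """Pick the language whose model gives the words the highest total score."""
    words = text.lower().split()
    totals = {name: sum(score(w, counts) for w in words)
              for name, counts in models.items()}
    return max(totals, key=totals.get)


# Hypothetical miniature dictionaries.
english = count_groups(["hello", "help", "yellow", "shell", "bell"])
dutch = count_groups(["verder", "eerder", "heide", "lijken"])

print(score("hello", english), score("helllo", english))
print(guess_language("eerder verder", {"english": english, "dutch": dutch}))
```

<p>In this toy setup &#8216;hello&#8217; outscores &#8216;helllo&#8217; because the group &#8216;lll&#8217; never occurs in the word list and drags the average down. With a real dictionary the gap would depend on how the score is normalised.</p>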
<p>When we take this a level higher, using sentences and word-groups instead of words and letter-groups, the awesomeness grows exponentially. A spell checker could correct &#8220;Ive is fun&#8221; to &#8220;Ice is fun&#8221;, but &#8220;Ive been running&#8221; to &#8220;I&#8217;ve been running&#8221;. An internet spider could tell real text apart from those random-keywords-pages. Lots of possibilities.</p>
<p>Anyway, I&#8217;ll try to code some of this stuff to see how well it all works. Stay tuned.</p>
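<p>To make the &#8220;Ive&#8221; example a bit more concrete, here is a small sketch that assumes we count pairs of adjacent words and pick the candidate correction that most often precedes the next word. The four-sentence corpus, the candidate list and the function names are invented for illustration.</p>

```python
from collections import Counter


def count_word_pairs(sentences):
    """Count how often each pair of adjacent words occurs."""
    pairs = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        pairs.update(zip(words, words[1:]))
    return pairs


def best_correction(candidates, next_word, pairs):
    """Pick the candidate seen most often right before next_word."""
    return max(candidates,
               key=lambda c: pairs[(c.lower(), next_word.lower())])


# Hypothetical training sentences.
corpus = ["ice is cold", "ice is slippery",
          "i've been busy", "i've been there"]
pairs = count_word_pairs(corpus)

print(best_correction(["ice", "i've"], "is", pairs))
print(best_correction(["ice", "i've"], "been", pairs))
```

<p>Only the word after the typo is considered here. A real checker would probably look at the words on both sides and would need a way to produce the candidate list in the first place.</p>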
]]></content:encoded>
			<wfw:commentRss>http://www.dotsimplicity.net/2009/11/hai-language/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

