<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>notebook &#187; chinese</title>
	<atom:link href="http://notebook.bwong.net/category/chinese/feed/" rel="self" type="application/rss+xml" />
	<link>http://notebook.bwong.net</link>
	<description>Just another WordPress weblog</description>
	<lastBuildDate>Sun, 11 Jan 2009 20:51:15 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Cedict Sqlite Database</title>
		<link>http://notebook.bwong.net/2008/04/27/cedict-sqlite-database/</link>
		<comments>http://notebook.bwong.net/2008/04/27/cedict-sqlite-database/#comments</comments>
		<pubDate>Sun, 27 Apr 2008 19:03:30 +0000</pubDate>
		<dc:creator>Benny</dc:creator>
				<category><![CDATA[chinese]]></category>
		<category><![CDATA[nerd]]></category>
		<category><![CDATA[cedict]]></category>
		<category><![CDATA[sqlite]]></category>

		<guid isPermaLink="false">http://notebook.bwong.net/?p=27</guid>
		<description><![CDATA[Since my last post (3 hours ago), I felt like the data in Unihan was great, but basic. Most Chinese &#8220;words&#8221; consist of more than one character. For example, the word &#8220;adult&#8221; is &#8220;大人&#8221;. And what do you know, there&#8217;s a database for that too! It&#8217;s called CEDICT (Wikipedia entry) and surprise, surprise, it&#8217;s a [...]]]></description>
			<content:encoded><![CDATA[<p>Since my <a href="http://notebook.bwong.net/2008/04/27/unihan-sqlite-database/">last post</a> (3 hours ago), I felt like the data in Unihan was great, but basic. Most Chinese &#8220;words&#8221; consist of more than one character. For example, the word &#8220;adult&#8221; is &#8220;大人&#8221;. And what do you know, there&#8217;s a database for that too! It&#8217;s called <a href="http://www.mandarintools.com/cedict.html">CEDICT</a> (<a href="http://en.wikipedia.org/wiki/CEDICT">Wikipedia entry</a>) and surprise, surprise, it&#8217;s a flat file like the Unihan database. This time, it&#8217;s a bit nicer to use:<br />
<code><br />
Traditional Simplified [pin1 yin1] /English equivalent 1/equivalent 2/<br />
中國 中国 [Zhong1 guo2] /China/Middle Kingdom/<br />
</code></p>
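<p>As a rough sketch of how one such line splits into its four fields (this assumes Python 3, and the <code>parse_cedict_line</code> helper and its regex are my own illustration, not something CEDICT ships):</p>

```python
import re

# Hypothetical helper pattern: headwords are assumed to contain no spaces,
# and the pinyin group is non-greedy so it stops at the first "] /".
CEDICT_LINE = re.compile(r'(\S+) (\S+) \[(.*?)\] /(.*)/')

def parse_cedict_line(line):
    """Split one CEDICT line into a dict, or return None if it does not match."""
    match = CEDICT_LINE.match(line.strip())
    if match is None:
        return None  # comment, blank, or malformed line
    traditional, simplified, pinyin, definitions = match.groups()
    return {
        'Traditional': traditional,
        'Simplified': simplified,
        'Pinyin': pinyin,
        # English equivalents are separated by '/' in the source file
        'Definition': definitions.split('/'),
    }

entry = parse_cedict_line('中國 中国 [Zhong1 guo2] /China/Middle Kingdom/')
print(entry)
```

<p>Returning <code>None</code> for lines that don&#8217;t match keeps comment and blank lines from blowing up the caller.</p>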
<p>I&#8217;ve written a(nother) quick Python script that&#8217;ll convert this file into a(nother) SQLite database. Enjoy!</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #808080; font-style: italic;">#! /usr/bin/python</span>
<span style="color: #808080; font-style: italic;">#</span>
<span style="color: #808080; font-style: italic;">#  A script to parse the CEDICT file into a sqlite</span>
<span style="color: #808080; font-style: italic;">#  database.</span>
<span style="color: #808080; font-style: italic;">#</span>
<span style="color: #808080; font-style: italic;">#	Author:	Benny Wong &lt;bwong.net&gt;</span>
<span style="color: #808080; font-style: italic;">#	Date:	2008.04.27</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">re</span> 
<span style="color: #ff7700;font-weight:bold;">from</span> pysqlite2 <span style="color: #ff7700;font-weight:bold;">import</span> dbapi2 <span style="color: #ff7700;font-weight:bold;">as</span> sqlite
&nbsp;
columns = <span style="color: black;">&#91;</span><span style="color: #483d8b;">'Traditional'</span>, <span style="color: #483d8b;">'Simplified'</span>, <span style="color: #483d8b;">'Pinyin'</span>, <span style="color: #483d8b;">'Definition'</span><span style="color: black;">&#93;</span>
&nbsp;
f = <span style="color: #008000;">open</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'cedict_ts.u8'</span>, <span style="color: #483d8b;">'r'</span><span style="color: black;">&#41;</span>
p = <span style="color: #dc143c;">re</span>.<span style="color: #008000;">compile</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'(.*) (.*) <span style="color: #000099; font-weight: bold;">\[</span>(.*)<span style="color: #000099; font-weight: bold;">\]</span> /(.*)/'</span><span style="color: black;">&#41;</span>
&nbsp;
conn = sqlite.<span style="color: black;">connect</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'Cedict.sqlite'</span><span style="color: black;">&#41;</span>
cursor = conn.<span style="color: black;">cursor</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
&nbsp;
cursor.<span style="color: black;">execute</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;DROP TABLE IF EXISTS Cedict&quot;</span><span style="color: black;">&#41;</span>
cursor.<span style="color: black;">execute</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;CREATE TABLE Cedict (&quot;</span> + <span style="color: #483d8b;">&quot;, &quot;</span>.<span style="color: black;">join</span><span style="color: black;">&#40;</span><span style="color: black;">&#91;</span>i + <span style="color: #483d8b;">&quot; TEXT&quot;</span> <span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> columns<span style="color: black;">&#93;</span><span style="color: black;">&#41;</span> + <span style="color: #483d8b;">&quot;)&quot;</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">for</span> line <span style="color: #ff7700;font-weight:bold;">in</span> f:
	<span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #ff7700;font-weight:bold;">not</span> line.<span style="color: black;">startswith</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'#'</span><span style="color: black;">&#41;</span>:
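		tokens = p.<span style="color: black;">match</span><span style="color: black;">&#40;</span>line<span style="color: black;">&#41;</span>
		<span style="color: #ff7700;font-weight:bold;">if</span> tokens <span style="color: #ff7700;font-weight:bold;">is</span> <span style="color: #008000;">None</span>:
			<span style="color: #808080; font-style: italic;"># skip blank or malformed lines instead of crashing on tokens.group()</span>
			<span style="color: #ff7700;font-weight:bold;">continue</span>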
		traditional = tokens.<span style="color: black;">group</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>
		simplified = tokens.<span style="color: black;">group</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">2</span><span style="color: black;">&#41;</span>
		pinyin = tokens.<span style="color: black;">group</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">3</span><span style="color: black;">&#41;</span>
		definition = tokens.<span style="color: black;">group</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">4</span><span style="color: black;">&#41;</span>.<span style="color: black;">replace</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'/'</span>, <span style="color: #483d8b;">'|'</span><span style="color: black;">&#41;</span>
&nbsp;
		cursor.<span style="color: black;">execute</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;INSERT INTO Cedict (&quot;</span> + <span style="color: #483d8b;">', '</span>.<span style="color: black;">join</span><span style="color: black;">&#40;</span>columns<span style="color: black;">&#41;</span> + \
			<span style="color: #483d8b;">&quot;) VALUES (?, ?, ?, ?)&quot;</span>, \
			<span style="color: black;">&#91;</span>traditional, simplified, pinyin, definition<span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">;</span>
&nbsp;
f.<span style="color: black;">close</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
&nbsp;
conn.<span style="color: black;">commit</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></pre></div></div>
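<p>Once the table exists, a lookup might go something like this. It&#8217;s a minimal sketch, assuming Python 3 and its built-in <code>sqlite3</code> module rather than pysqlite2, with an in-memory table standing in for Cedict.sqlite; the <code>lookup</code> helper is hypothetical:</p>

```python
import sqlite3

# In-memory stand-in for Cedict.sqlite, using the same four TEXT columns.
conn = sqlite3.connect(':memory:')
conn.execute(
    "CREATE TABLE Cedict (Traditional TEXT, Simplified TEXT, Pinyin TEXT, Definition TEXT)"
)
conn.execute(
    "INSERT INTO Cedict (Traditional, Simplified, Pinyin, Definition) VALUES (?, ?, ?, ?)",
    ['中國', '中国', 'Zhong1 guo2', 'China|Middle Kingdom'],
)

def lookup(conn, word):
    """Hypothetical helper: find entries whose simplified or traditional form equals word."""
    rows = conn.execute(
        "SELECT Pinyin, Definition FROM Cedict WHERE Simplified = ? OR Traditional = ?",
        (word, word),
    ).fetchall()
    # The conversion script stores multiple definitions joined with '|'.
    return [(pinyin, definition.split('|')) for pinyin, definition in rows]

results = lookup(conn, '中国')
print(results)
```

<p>Against the real file you would pass <code>'Cedict.sqlite'</code> to <code>connect</code> instead of <code>':memory:'</code> and skip the CREATE/INSERT lines.</p>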

<p>PS: &lt;3 <a href="http://en.wikipedia.org/wiki/Regex">Regex</a></p>
]]></content:encoded>
			<wfw:commentRss>http://notebook.bwong.net/2008/04/27/cedict-sqlite-database/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Unihan Sqlite Database</title>
		<link>http://notebook.bwong.net/2008/04/27/unihan-sqlite-database/</link>
		<comments>http://notebook.bwong.net/2008/04/27/unihan-sqlite-database/#comments</comments>
		<pubDate>Sun, 27 Apr 2008 17:14:37 +0000</pubDate>
		<dc:creator>Benny</dc:creator>
				<category><![CDATA[chinese]]></category>
		<category><![CDATA[nerd]]></category>
		<category><![CDATA[personal]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[sqlite]]></category>
		<category><![CDATA[unihan]]></category>

		<guid isPermaLink="false">http://notebook.bwong.net/?p=25</guid>
		<description><![CDATA[Once upon a time I wrote an extension for Firefox where you could highlight Chinese characters, right-click, and &#8220;Pinyinize&#8221; them. It would then convert the characters into Pinyin, the phonetic representation of those Chinese characters.
Now, the only way I could do that was to pretend to make a request to [...]]]></description>
			<content:encoded><![CDATA[<p>Once upon a time I wrote an extension for <a href="http://www.mozilla.com/en-US/firefox/">Firefox</a> where you could highlight <a href="http://en.wikipedia.org/wiki/Chinese_character">Chinese characters</a>, right-click, and &#8220;Pinyinize&#8221; them. It would then convert the characters into <a href="http://en.wikipedia.org/wiki/Pinyin">Pinyin</a>, the phonetic representation of those Chinese characters.</p>
<p>Now, the only way I could do that was to pretend to make a request to <a href="http://www.pin1yin1.com/">pin1yin1.com</a> and parse the HTML page that comes back. That&#8217;s a silly way to do things. There should be a way/service where I could make a query, and have the results come back in a known schema, in <a href="http://en.wikipedia.org/wiki/JSON">JSON</a> or <a href="http://en.wikipedia.org/wiki/XML">XML</a> or otherwise.</p>
<p>I haven&#8217;t been able to find one (if you know of one, let me know <img src='http://notebook.bwong.net/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  ) so I thought I&#8217;d see if I could make my own. I haven&#8217;t gotten that far, but I found out what database they were using. pin1yin1.com uses the <a href="http://unicode.org/charts/unihan.html">Unihan database</a>. The problem is that it&#8217;s a flat text file where the lines look like:<br />
<code><br />
&lt;Chinese Character&gt;   &lt;Key&gt;   &lt;Value&gt;<br />
</code><br />
like:<br />
<code><br />
U+340C  kDefinition a tribe of savages in South China<br />
</code></p>
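<p>To make the format concrete, here&#8217;s a small sketch that pulls one of these lines apart and turns the code point into the actual character. It assumes Python 3, and <code>parse_unihan_line</code> is a made-up name for illustration:</p>

```python
def parse_unihan_line(line):
    """Hypothetical helper: split one Unihan line into (character, key, value)."""
    # Split on the first two runs of whitespace only, so the value keeps its spaces.
    codepoint, key, value = line.strip().split(None, 2)
    # 'U+340C' -> 0x340C -> the character at that code point
    character = chr(int(codepoint[2:], 16))
    return character, key, value

character, key, value = parse_unihan_line(
    'U+340C  kDefinition a tribe of savages in South China'
)
print(character, key, value)
```

<p>Limiting the split to two cuts matters because the value (a definition, say) usually contains spaces of its own.</p>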
<p>It&#8217;s totally unusable in most situations so I decided to write a quick <a href="http://www.python.org/">Python</a> (thanks <a href="http://www1.cs.columbia.edu/~hila/">Hila</a>!) script to pivot it into a SQLite database:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #808080; font-style: italic;">#! /usr/bin/python</span>
<span style="color: #808080; font-style: italic;">#</span>
<span style="color: #808080; font-style: italic;">#  A script to convert/pivot the Unihan.txt file into a sqlite</span>
<span style="color: #808080; font-style: italic;">#  database.</span>
<span style="color: #808080; font-style: italic;">#</span>
<span style="color: #808080; font-style: italic;">#	Author:	Benny Wong &lt;bwong.net&gt;</span>
<span style="color: #808080; font-style: italic;">#	Date:	2008.04.27</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">from</span> pysqlite2 <span style="color: #ff7700;font-weight:bold;">import</span> dbapi2 <span style="color: #ff7700;font-weight:bold;">as</span> sqlite
&nbsp;
charmap = <span style="color: black;">&#123;</span><span style="color: black;">&#125;</span>
keys = <span style="color: #008000;">set</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
keys.<span style="color: black;">add</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'Character'</span><span style="color: black;">&#41;</span>
&nbsp;
f = <span style="color: #008000;">open</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'Unihan.txt'</span>, <span style="color: #483d8b;">'r'</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">for</span> line <span style="color: #ff7700;font-weight:bold;">in</span> f:
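	<span style="color: #ff7700;font-weight:bold;">if</span> line.<span style="color: black;">strip</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">and</span> <span style="color: #ff7700;font-weight:bold;">not</span> line.<span style="color: black;">startswith</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'#'</span><span style="color: black;">&#41;</span>: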
		tokens = line.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
		key = tokens<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>.<span style="color: black;">replace</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'U+'</span>, <span style="color: #483d8b;">''</span><span style="color: black;">&#41;</span>
		<span style="color: #ff7700;font-weight:bold;">if</span> tokens<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span> <span style="color: #ff7700;font-weight:bold;">not</span> <span style="color: #ff7700;font-weight:bold;">in</span> charmap:
			charmap<span style="color: black;">&#91;</span>tokens<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span><span style="color: black;">&#93;</span> = <span style="color: black;">&#123;</span><span style="color: black;">&#125;</span>
&nbsp;
			unichar = tokens<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>.<span style="color: black;">replace</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'U+'</span>, <span style="color: #483d8b;">'0x'</span><span style="color: black;">&#41;</span>
			unichar = <span style="color: #008000;">unichr</span><span style="color: black;">&#40;</span><span style="color: #008000;">long</span><span style="color: black;">&#40;</span>unichar, <span style="color: #ff4500;">16</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
&nbsp;
			charmap<span style="color: black;">&#91;</span>tokens<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span><span style="color: black;">&#93;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'Character'</span><span style="color: black;">&#93;</span> = unichar.<span style="color: black;">encode</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'utf8'</span><span style="color: black;">&#41;</span>
&nbsp;
		charmap<span style="color: black;">&#91;</span>tokens<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span><span style="color: black;">&#93;</span><span style="color: black;">&#91;</span>tokens<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><span style="color: black;">&#93;</span> = <span style="color: #483d8b;">&quot; &quot;</span>.<span style="color: black;">join</span><span style="color: black;">&#40;</span>tokens<span style="color: black;">&#91;</span><span style="color: #ff4500;">2</span>:<span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
		keys.<span style="color: black;">add</span><span style="color: black;">&#40;</span>tokens<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
&nbsp;
f.<span style="color: black;">close</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
&nbsp;
keystring = <span style="color: #483d8b;">&quot;, &quot;</span>.<span style="color: black;">join</span><span style="color: black;">&#40;</span>key + <span style="color: #483d8b;">&quot; TEXT&quot;</span> <span style="color: #ff7700;font-weight:bold;">for</span> key <span style="color: #ff7700;font-weight:bold;">in</span> keys<span style="color: black;">&#41;</span>
&nbsp;
conn = sqlite.<span style="color: black;">connect</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'Unihan.sqlite'</span><span style="color: black;">&#41;</span>
cursor = conn.<span style="color: black;">cursor</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
&nbsp;
cursor.<span style="color: black;">execute</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;DROP TABLE IF EXISTS Unihan&quot;</span><span style="color: black;">&#41;</span>
cursor.<span style="color: black;">execute</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;CREATE TABLE Unihan (key TEXT, &quot;</span> + keystring + <span style="color: #483d8b;">&quot;)&quot;</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">while</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>charmap<span style="color: black;">&#41;</span> <span style="color: #66cc66;">&gt;</span> <span style="color: #ff4500;">0</span>:
	key, values = charmap.<span style="color: black;">popitem</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
&nbsp;
	columns = <span style="color: #483d8b;">&quot;,&quot;</span>.<span style="color: black;">join</span><span style="color: black;">&#40;</span>values.<span style="color: black;">keys</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
	cells = <span style="color: #483d8b;">'&quot;,&quot;'</span>.<span style="color: black;">join</span><span style="color: black;">&#40;</span>values.<span style="color: black;">values</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
&nbsp;
	sql = <span style="color: #483d8b;">'INSERT INTO Unihan (key, '</span> + columns + <span style="color: #483d8b;">') VALUES (&quot;'</span> + key + <span style="color: #483d8b;">'&quot;, &quot;'</span> + cells + <span style="color: #483d8b;">'&quot;)'</span>.<span style="color: black;">encode</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'utf8'</span><span style="color: black;">&#41;</span>
	cursor.<span style="color: black;">execute</span><span style="color: black;">&#40;</span>sql<span style="color: black;">&#41;</span>
&nbsp;
cursor.<span style="color: black;">execute</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;CREATE INDEX key ON Unihan(key)&quot;</span><span style="color: black;">&#41;</span>
<span style="color: #ff7700;font-weight:bold;">for</span> key <span style="color: #ff7700;font-weight:bold;">in</span> keys:
	cursor.<span style="color: black;">execute</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;CREATE INDEX &quot;</span> + key + <span style="color: #483d8b;">&quot; ON Unihan(&quot;</span> + key + <span style="color: #483d8b;">&quot;)&quot;</span><span style="color: black;">&#41;</span>
&nbsp;
conn.<span style="color: black;">commit</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></pre></div></div>
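<p>The pivot itself (one row per character, one column per key) can be sketched on a few rows like this. It assumes Python 3 with the built-in <code>sqlite3</code> module and an in-memory database; the <code>kExample</code> key and its values are made up purely to show a value containing double quotes, which is where <code>?</code> placeholders help:</p>

```python
import sqlite3
from collections import defaultdict

# (codepoint, key, value) triples. The first comes from the sample line above;
# the 'kExample' rows are made up to show a value containing double quotes.
triples = [
    ('U+340C', 'kDefinition', 'a tribe of savages in South China'),
    ('U+340C', 'kExample', 'a "quoted" value'),
    ('U+340D', 'kExample', 'another value'),
]

# Pivot: one dict per character, one entry per key.
charmap = defaultdict(dict)
for codepoint, key, value in triples:
    charmap[codepoint][key] = value
keys = sorted({key for _, key, _ in triples})

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE Unihan (key TEXT, ' + ', '.join(k + ' TEXT' for k in keys) + ')')

for codepoint, values in charmap.items():
    columns = ', '.join(values)
    placeholders = ', '.join('?' for _ in values)
    # Column names are concatenated (they come from the file's key field),
    # but cell values go through ? placeholders so quotes cannot break the SQL.
    conn.execute(
        'INSERT INTO Unihan (key, ' + columns + ') VALUES (?, ' + placeholders + ')',
        [codepoint, *values.values()],
    )

result = conn.execute('SELECT key, kDefinition, kExample FROM Unihan ORDER BY key').fetchall()
print(result)
```

<p>Characters that lack a given key simply end up with NULL in that column.</p>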

<p>I haven&#8217;t worked with Python much, so if this code is crappy, let me know how to fix it <img src='http://notebook.bwong.net/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  I&#8217;ll see when I have time to actually create the service (if anyone&#8217;s interested!) but yeah, here&#8217;s the basis that I&#8217;m going to be using.</p>
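<p>You can port this database over from SQLite to <a href="http://www.mysql.com/">MySQL</a>, <a href="http://www.postgresql.org/">PostgreSQL</a>, etc. by running &#8220;.dump&#8221; in the sqlite3 shell (it&#8217;s a dot-command, so no semicolon is needed). The output is plain SQL, though it may need small tweaks for the target database.</p>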
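<p>If you&#8217;d rather stay in Python, the built-in <code>sqlite3</code> module can produce the same kind of SQL text through <code>Connection.iterdump()</code>. A minimal sketch, assuming Python 3 and a tiny in-memory table standing in for Unihan.sqlite:</p>

```python
import sqlite3

# Tiny in-memory stand-in for Unihan.sqlite.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE Unihan (key TEXT, kDefinition TEXT)')
conn.execute(
    'INSERT INTO Unihan VALUES (?, ?)',
    ('U+340C', 'a tribe of savages in South China'),
)
conn.commit()

# iterdump() yields the database as a sequence of SQL statements,
# much like the shell's .dump command.
dump_sql = '\n'.join(conn.iterdump())
print(dump_sql)
```

<p>Writing <code>dump_sql</code> to a file gives you something to feed into another database, after whatever dialect tweaks that database needs.</p>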
<p>Enjoy!</p>
]]></content:encoded>
			<wfw:commentRss>http://notebook.bwong.net/2008/04/27/unihan-sqlite-database/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>
