Cedict Sqlite Database

April 27th, 2008 by benny

Since my last post (3 hours ago), I felt like the data in Unihan was great, but basic. Most Chinese “words” consist of more than one character. For example, the word “adult” is “大人”. And what do you know, there’s a database for that too! It’s called CEDict (wikipedia entry) and surprise surprise, its a flat file like the Unihan database is. This time, it’s a bit nicer to use:

Traditional Simplified [pin1 yin1] /English equivalent 1/equivalent 2/
中國 中国 [Zhong1 guo2] /China/Middle Kingdom/

I’ve written a(nother) quick python script that’ll convert this file into a(nother) sqlite database. Enjoy!

#! /usr/bin/python
#
#  A script to parse the CEDICT file into a sqlite
#  database.
#
#	Author:	Benny Wong <bwong.net>
#	Date:	2008.04.27
 
import re 
from pysqlite2 import dbapi2 as sqlite
 
columns = ['Traditional', 'Simplified', 'Pinyin', 'Definition']
 
f = open('cedict_ts.u8', 'r')
p = re.compile('(.*) (.*) \[(.*)\] /(.*)/')
 
conn = sqlite.connect('Cedict.sqlite')
cursor = conn.cursor()
 
cursor.execute("DROP TABLE IF EXISTS Cedict")
cursor.execute("CREATE TABLE Cedict (" + ", ".join([i + " TEXT" for i in columns]) + ")")
 
for line in f:
	if not line.startswith('#'):
		tokens = p.match(line)
		traditional = tokens.group(1)
		simplified = tokens.group(2)
		pinyin = tokens.group(3)
		definition = tokens.group(4).replace('/', '|')
 
		cursor.execute("INSERT INTO Cedict (" + ', '.join(columns) + \
			") VALUES (?, ?, ?, ?)", \
			[traditional, simplified, pinyin, definition]);
 
f.close()
 
conn.commit()

PS: <3 Regex

Unihan Sqlite Database

by benny

Once upon a time I wrote an extension for Firefox where you could highlight Chinese characters and be able to right-click and “Pinyinize” the characters. It would then take the characters into Pinyin, the phonetic representation of those Chinese characters.

Now, the only way I could do that was to pretend to make a request to pin1yin1.com and parse the HTML page that comes back. That’s a silly way to do things. There should be a way/service where I could make a query, and have the results come back in a known schema, in JSON or XML or otherwise.

I haven’t been able to find one (if you know, let me know :) ) so I thought I’d see if I could make my own. I haven’t gotten that far, but I found out what database they were using. pin1yin1.com uses the Unihan database. The problem with that is is that its a flat text file where the lines look like:

<Chinese Character> <Key> <Value>

like:

U+340C kDefinition a tribe of savages in South China

It’s totally unusable in most situations so I decided to write a quick python (thanks Hila!) script to do this:

#! /usr/bin/python
#
#  A script to convert/pivot the Unihan.txt file into a sqlite
#  database.
#
#	Author:	Benny Wong <bwong.net>
#	Date:	2008.04.27
 
from pysqlite2 import dbapi2 as sqlite
 
charmap = {}
keys = set()
keys.add('Character')
 
f = open('Unihan.txt', 'r')
 
for line in f:
	if not line.startswith('#'):
		tokens = line.split()
		key = tokens[0].replace('U+', '')
		if not charmap.has_key(tokens[0]):
			charmap[tokens[0]] = {}
 
			unichar = tokens[0].replace('U+', '0x')
			unichar = unichr(long(unichar, 16))
 
			charmap[tokens[0]]['Character'] = unichar.encode('utf8')
 
		charmap[tokens[0]][tokens[1]] = " ".join(tokens[2:])
		keys.add(tokens[1])
 
f.close()
 
keystring = ", ".join(key + " TEXT" for key in keys)
 
conn = sqlite.connect('Unihan.sqlite')
cursor = conn.cursor()
 
cursor.execute("DROP TABLE IF EXISTS Unihan")
cursor.execute("CREATE TABLE Unihan (key TEXT, " + keystring + ")")
 
while len(charmap) > 0:
	key, values = charmap.popitem()
 
	columns = ",".join(values.keys())
	cells = '","'.join(values.values())
 
	sql = 'INSERT INTO Unihan (key, ' + columns + ') VALUES ("' + key + '", "' + cells + '")'.encode('utf8')
	cursor.execute(sql)
 
cursor.execute("CREATE INDEX key ON Unihan(key)")
for key in keys:
	cursor.execute("CREATE INDEX " + key + " ON Unihan(" + key + ")")
 
conn.commit()

I haven’t worked with python much, so if this code is crappy, let me know and how to fix it :) I’m seeing when I’ll have time to actually create the service (if anyone’s interested!) but yeah, here’s the basis that I’m going to be using.

You can easily port this database over from SQLite to MySQL, PostgreSQL, etc. by using “.dump;”

Enjoy!