Unihan Sqlite Database

April 27th, 2008 by benny

Once upon a time I wrote an extension for Firefox where you could highlight Chinese characters and be able to right-click and “Pinyinize” the characters. It would then take the characters into Pinyin, the phonetic representation of those Chinese characters.

Now, the only way I could do that was to pretend to make a request to pin1yin1.com and parse the HTML page that comes back. That’s a silly way to do things. There should be a way/service where I could make a query, and have the results come back in a known schema, in JSON or XML or otherwise.

I haven’t been able to find one (if you know, let me know :) ) so I thought I’d see if I could make my own. I haven’t gotten that far, but I found out what database they were using. pin1yin1.com uses the Unihan database. The problem with that is is that its a flat text file where the lines look like:

<Chinese Character> <Key> <Value>

like:

U+340C kDefinition a tribe of savages in South China

It’s totally unusable in most situations so I decided to write a quick python (thanks Hila!) script to do this:

#! /usr/bin/python
#
#  A script to convert/pivot the Unihan.txt file into a sqlite
#  database.
#
#	Author:	Benny Wong <bwong.net>
#	Date:	2008.04.27
 
from pysqlite2 import dbapi2 as sqlite
 
charmap = {}
keys = set()
keys.add('Character')
 
f = open('Unihan.txt', 'r')
 
for line in f:
	if not line.startswith('#'):
		tokens = line.split()
		key = tokens[0].replace('U+', '')
		if not charmap.has_key(tokens[0]):
			charmap[tokens[0]] = {}
 
			unichar = tokens[0].replace('U+', '0x')
			unichar = unichr(long(unichar, 16))
 
			charmap[tokens[0]]['Character'] = unichar.encode('utf8')
 
		charmap[tokens[0]][tokens[1]] = " ".join(tokens[2:])
		keys.add(tokens[1])
 
f.close()
 
keystring = ", ".join(key + " TEXT" for key in keys)
 
conn = sqlite.connect('Unihan.sqlite')
cursor = conn.cursor()
 
cursor.execute("DROP TABLE IF EXISTS Unihan")
cursor.execute("CREATE TABLE Unihan (key TEXT, " + keystring + ")")
 
while len(charmap) > 0:
	key, values = charmap.popitem()
 
	columns = ",".join(values.keys())
	cells = '","'.join(values.values())
 
	sql = 'INSERT INTO Unihan (key, ' + columns + ') VALUES ("' + key + '", "' + cells + '")'.encode('utf8')
	cursor.execute(sql)
 
cursor.execute("CREATE INDEX key ON Unihan(key)")
for key in keys:
	cursor.execute("CREATE INDEX " + key + " ON Unihan(" + key + ")")
 
conn.commit()

I haven’t worked with python much, so if this code is crappy, let me know and how to fix it :) I’m seeing when I’ll have time to actually create the service (if anyone’s interested!) but yeah, here’s the basis that I’m going to be using.

You can easily port this database over from SQLite to MySQL, PostgreSQL, etc. by using “.dump;”

Enjoy!


blog comments powered by Disqus