Cedict Sqlite Database

April 27th, 2008 by benny

After my last post (3 hours ago), I felt like the data in Unihan was great, but basic. Most Chinese “words” consist of more than one character. For example, the word “adult” is “大人”. And what do you know, there’s a database for that too! It’s called CEDICT (wikipedia entry) and, surprise surprise, it’s a flat file just like the Unihan database. This time, the format is a bit nicer to work with, one entry per line:

Traditional Simplified [pin1 yin1] /English equivalent 1/equivalent 2/
中國 中国 [Zhong1 guo2] /China/Middle Kingdom/
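
To make the format concrete, here is a rough sketch of pulling a single line apart in Python. The `ENTRY` pattern and the `parse_entry` helper are names made up for this illustration, and the pattern assumes that neither headword contains a space.

```python
import re

# Assumed shape of one entry: Traditional Simplified [pin1 yin1] /def 1/def 2/
ENTRY = re.compile(r'(\S+) (\S+) \[([^\]]*)\] /(.*)/')

def parse_entry(line):
    """Return (traditional, simplified, pinyin, [definitions]), or None if the line doesn't match."""
    m = ENTRY.match(line)
    if m is None:
        return None
    traditional, simplified, pinyin, defs = m.groups()
    # Definitions are separated by slashes inside the last field
    return traditional, simplified, pinyin, defs.split('/')

print(parse_entry('中國 中国 [Zhong1 guo2] /China/Middle Kingdom/'))
# ('中國', '中国', 'Zhong1 guo2', ['China', 'Middle Kingdom'])
```

Comment lines (the ones starting with `#`) don't match the pattern, so the helper returns None for them.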

I’ve written a(nother) quick python script that’ll convert this file into a(nother) sqlite database. Enjoy!

#!/usr/bin/env python3
#
#  A script to parse the CEDICT file into a sqlite
#  database.
#
#	Author:	Benny Wong <bwong.net>
#	Date:	2008.04.27

import re
import sqlite3

columns = ['Traditional', 'Simplified', 'Pinyin', 'Definition']

# One entry per line: Traditional Simplified [pin1 yin1] /def 1/def 2/
# (assumes the two headwords never contain spaces)
p = re.compile(r'(\S+) (\S+) \[([^\]]*)\] /(.*)/')

conn = sqlite3.connect('Cedict.sqlite')
cursor = conn.cursor()

cursor.execute("DROP TABLE IF EXISTS Cedict")
cursor.execute("CREATE TABLE Cedict (" + ", ".join(c + " TEXT" for c in columns) + ")")

insert_sql = "INSERT INTO Cedict (" + ", ".join(columns) + ") VALUES (?, ?, ?, ?)"

# The CEDICT file is UTF-8 encoded
with open('cedict_ts.u8', 'r', encoding='utf-8') as f:
	for line in f:
		# Skip the comment header at the top of the file
		if line.startswith('#'):
			continue
		tokens = p.match(line)
		# Skip blank or malformed lines instead of crashing on None
		if tokens is None:
			continue
		traditional, simplified, pinyin, definition = tokens.groups()
		# All definitions go into one column, separated by '|'
		cursor.execute(insert_sql,
			[traditional, simplified, pinyin, definition.replace('/', '|')])

conn.commit()
conn.close()
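
Once the database exists, looking a word up is a single query. Here's a minimal sketch, assuming the table and column names from the script above. The `lookup` helper is hypothetical, and the in-memory table with one sample row just stands in for the real Cedict.sqlite so the snippet runs on its own.

```python
import sqlite3

def lookup(conn, word):
    """Return (pinyin, [definitions]) for rows whose simplified or traditional form equals word."""
    rows = conn.execute(
        "SELECT Pinyin, Definition FROM Cedict "
        "WHERE Simplified = ? OR Traditional = ?",
        (word, word))
    # The script stores all definitions in one column, separated by '|'
    return [(pinyin, definition.split('|')) for pinyin, definition in rows]

# Stand-in for sqlite3.connect('Cedict.sqlite'): an in-memory table with one sample row
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE Cedict (Traditional TEXT, Simplified TEXT, Pinyin TEXT, Definition TEXT)")
conn.execute("INSERT INTO Cedict VALUES (?, ?, ?, ?)",
             ('中國', '中国', 'Zhong1 guo2', 'China|Middle Kingdom'))

print(lookup(conn, '中国'))
# [('Zhong1 guo2', ['China', 'Middle Kingdom'])]
```

For lots of lookups on the full dictionary, an index on the Simplified and Traditional columns would probably help, since the table is created without one.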

PS: <3 Regex

