Cedict Sqlite Database
April 27th, 2008 by benny
Since my last post (3 hours ago), I’ve felt that the data in Unihan was great, but basic. Most Chinese “words” consist of more than one character. For example, the word “adult” is “大人”. And what do you know, there’s a database for that too! It’s called CEDICT (wikipedia entry) and, surprise surprise, it’s a flat file like the Unihan database. This time, the format is a bit nicer to use:
Traditional Simplified [pin1 yin1] /English equivalent 1/equivalent 2/
中國 中国 [Zhong1 guo2] /China/Middle Kingdom/
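To make the format concrete, here is a minimal Python 3 sketch that pulls one line apart. The `parse_entry` helper and the stricter `ENTRY` pattern are my own illustration, assumed from the format line above, not something that ships with CEDICT:

```python
import re

# Pattern assumed from the format line above: two space-separated headwords,
# bracketed pinyin, then slash-delimited English glosses.
ENTRY = re.compile(r'^(\S+) (\S+) \[([^\]]*)\] /(.*)/\s*$')

def parse_entry(line):
    """Return a dict for one CEDICT line, or None for comments and malformed lines."""
    if line.startswith('#'):
        return None
    m = ENTRY.match(line)
    if m is None:
        return None
    traditional, simplified, pinyin, glosses = m.groups()
    return {
        'traditional': traditional,
        'simplified': simplified,
        'pinyin': pinyin,
        'definitions': glosses.split('/'),
    }

entry = parse_entry('中國 中国 [Zhong1 guo2] /China/Middle Kingdom/')
print(entry['pinyin'], entry['definitions'])
```

Using `\S+` for the headwords and `[^\]]*` for the pinyin keeps each group from swallowing its neighbours, which a plain `.*` can do on unusual lines.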
I’ve written a(nother) quick python script that’ll convert this file into a(nother) sqlite database. Enjoy!
#!/usr/bin/env python3
#
# A script to parse the CEDICT file into a sqlite
# database.
#
# Author: Benny Wong <bwong.net>
# Date: 2008.04.27

import re
import sqlite3 as sqlite  # standard-library version of pysqlite2

columns = ['Traditional', 'Simplified', 'Pinyin', 'Definition']

# One entry per line: Traditional Simplified [pin1 yin1] /def 1/def 2/
p = re.compile(r'(.*) (.*) \[(.*)\] /(.*)/')

conn = sqlite.connect('Cedict.sqlite')
cursor = conn.cursor()

# Start from a clean table on every run
cursor.execute("DROP TABLE IF EXISTS Cedict")
cursor.execute("CREATE TABLE Cedict (" + ", ".join([i + " TEXT" for i in columns]) + ")")

f = open('cedict_ts.u8', 'r', encoding='utf-8')
for line in f:
    # Lines starting with '#' are header comments
    if not line.startswith('#'):
        tokens = p.match(line)
        if tokens is None:
            continue  # skip blank or malformed lines
        traditional = tokens.group(1)
        simplified = tokens.group(2)
        pinyin = tokens.group(3)
        # Store multiple definitions in one column, separated by '|'
        definition = tokens.group(4).replace('/', '|')
        cursor.execute("INSERT INTO Cedict (" + ', '.join(columns) +
                       ") VALUES (?, ?, ?, ?)",
                       [traditional, simplified, pinyin, definition])
f.close()

conn.commit()
conn.close()
PS: <3 Regex
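Once the script has run, looking a word up is a single query. Here is a rough sketch, assuming the `Cedict` table layout created by the script above. The `lookup` helper is hypothetical, and the in-memory database with one sample row is only a stand-in so the snippet runs on its own; point `sqlite3.connect` at `Cedict.sqlite` to use the real data:

```python
import sqlite3

def lookup(conn, word):
    """Return (pinyin, [definitions]) pairs whose Simplified or Traditional form equals word."""
    rows = conn.execute(
        "SELECT Pinyin, Definition FROM Cedict "
        "WHERE Simplified = ? OR Traditional = ?",
        (word, word),
    ).fetchall()
    # The import script stores definitions joined with '|'
    return [(pinyin, definition.split('|')) for pinyin, definition in rows]

# Stand-in for Cedict.sqlite: same schema, one sample row
conn = sqlite3.connect(':memory:')
conn.execute(
    "CREATE TABLE Cedict (Traditional TEXT, Simplified TEXT, Pinyin TEXT, Definition TEXT)"
)
conn.execute(
    "INSERT INTO Cedict VALUES (?, ?, ?, ?)",
    ('中國', '中国', 'Zhong1 guo2', 'China|Middle Kingdom'),
)

results = lookup(conn, '中国')
print(results)
```

With the full dictionary loaded, an index on `Simplified` and `Traditional` would probably be worth adding, since the script creates the table without one.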