Tue, Feb 10th
Importing Enron into CouchDB
I have been goofing around with CouchDB for about a year now. To do anything fun or interesting with it, you first need some data to play with. To solve this, I imported the Enron email dataset into CouchDB so we can have a couple hundred thousand documents. How? First I downloaded all the Enron data from Carnegie Mellon University's Enron Email Dataset page. Then I used the 'mail trends' project's enron.py code to convert the loose files into Unix mbox format so it's easily understood by code. Once the Enron data is in a format we like, we can use the Ruby script below to take the email and push it into CouchDB. (Make sure CouchDB is installed and running.) The code is as follows:
cat >rakefile.rb <<\THEEND
%w(time tmail find restclient json).each { |l| require l }
# Download the dataset tarball unless it is already present.
file_create('enron_mail_030204.tar.gz') do
  `curl -O http://download.srv.cs.cmu.edu/~enron/enron_mail_030204.tar.gz`
end
# Unpack the tarball into ./maildir.
file_create('maildir' => 'enron_mail_030204.tar.gz') do
  `tar xzof enron_mail_030204.tar.gz`
end
desc('import the email to localhost couchdb')
task(:import => 'maildir') do
  # Create the database; ignore the error if it already exists.
  RestClient.put('http://localhost:5984/enron', '') rescue nil
  cutoff = Time.parse("1999-01-01")
  Find.find('maildir') do |path|
    next if FileTest.directory?(path)
    begin
      txt = IO.read(path)
      msg = TMail::Mail.parse(txt)
      next if msg.date < cutoff # skip mail dated before 1999
      # Merge the parsed headers with the address lists and body,
      # dropping any field that is empty.
      attrs = msg.header.merge('to' => msg.to_addrs,
                               'cc' => msg.cc_addrs,
                               'bcc' => msg.bcc_addrs,
                               'body' => msg.body).reject { |k, v| v.to_s.empty? }
      RestClient.post('http://localhost:5984/enron',
                      attrs.to_json,
                      :content_type => 'application/json')
    rescue Interrupt
      exit(1)
    rescue Exception => ex
      # Log the file that failed to parse or convert, then keep going.
      puts "#{path} #{ex.inspect}"
    end
  end
end
THEEND
sudo gem install rake rest-client json tmail
rake -T
rake import
# .....wait for it.....
rake irb
This will take a while. Not long after the script starts, you will see documents showing up in your CouchDB. A couple dozen emails are not properly formatted or won't convert to JSON, but you'll still end up with 290k+ emails in your CouchDB. That's 99.9% complete and 4 years of Enron email to play with. Navigate to http://localhost:5984/_utils and start mappin' and reducin' :)
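If you'd rather script a first view than type it into the web UI, here is a rough sketch of what that could look like. It builds a design document whose view counts emails per recipient and PUTs it into the enron database. The design document name "mail" and the view name "by_recipient" are made up for this example, and the map function assumes each stored document has a `to` array, which depends on how TMail's address objects were serialized during the import. It uses only Ruby's standard library rather than the gems above.

```ruby
require 'json'
require 'net/http'
require 'uri'

# JavaScript map function, run by CouchDB for every document.
# Assumption: the import left a `to` array on each document.
MAP_JS = <<~JS
  function(doc) {
    if (doc.to && doc.to.forEach) {
      doc.to.forEach(function(addr) { emit(String(addr), 1); });
    }
  }
JS

# JavaScript reduce function; sum() is a helper provided by CouchDB's view server.
REDUCE_JS = 'function(keys, values) { return sum(values); }'

# Build the design document as a plain hash.
# "_design/mail" and "by_recipient" are hypothetical names for this sketch.
def recipient_count_design_doc
  {
    '_id' => '_design/mail',
    'language' => 'javascript',
    'views' => {
      'by_recipient' => { 'map' => MAP_JS, 'reduce' => REDUCE_JS }
    }
  }
end

# PUT the design document into the database and return the HTTP response.
# Assumes CouchDB is listening on localhost:5984 and the enron database exists.
def upload_design_doc(base_url = 'http://localhost:5984/enron')
  doc = recipient_count_design_doc
  uri = URI("#{base_url}/#{doc['_id']}")
  request = Net::HTTP::Put.new(uri)
  request['Content-Type'] = 'application/json'
  request.body = JSON.generate(doc)
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
end
```

Calling `upload_design_doc` once should create the view. After that, a GET on http://localhost:5984/enron/_design/mail/_view/by_recipient?group=true should return one row per recipient with a count, though the first request will be slow while CouchDB builds the index over all the documents.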