I have been goofing around with CouchDB for about a year now. To do anything fun or interesting with it, you first need some data to play with. To solve this, I imported the Enron email dataset into CouchDB, which gives us a couple hundred thousand documents.
How? First I downloaded all the Enron data from Carnegie Mellon University’s Enron Email Data. Then I used the ‘mail trends’ project’s enron.py code to convert the loose files into Unix mbox format so it’s easily understood by code. Once the Enron data is in a format we like, we can use the Ruby script below to take each email and push it into CouchDB. (Make sure CouchDB is installed and running.) The code is as follows:
cat > rakefile.rb <<\THEEND
%w(time tmail find restclient json).each {|l| require l}

# Download the dataset archive unless it is already present.
file_create('enron_mail_030204.tar.gz') do
  `curl -O http://download.srv.cs.cmu.edu/~enron/enron_mail_030204.tar.gz`
end

# Unpack the archive into ./maildir.
file_create('maildir' => 'enron_mail_030204.tar.gz') do
  `tar xzof enron_mail_030204.tar.gz`
end

desc('import the email to localhost couchdb')
task(:import => 'maildir') do
  # Create the database; ignore the error if it already exists.
  RestClient.put('http://localhost:5984/enron', '') rescue nil
  # Messages dated before this cutoff are skipped.
  cutoff = Time.parse("1999-01-01")
  Find.find('maildir') do |path|
    next if FileTest.directory?(path)
    begin
      txt = IO.read(path)
      msg = TMail::Mail.parse(txt)
      next if msg.date < cutoff
      # Merge the headers with the parsed address lists and body,
      # then drop any fields whose value is empty.
      attrs = msg.header.merge('to' => msg.to_addrs,
                               'cc' => msg.cc_addrs,
                               'bcc' => msg.bcc_addrs,
                               'body' => msg.body).reject {|k,v| v.to_s.empty?}
      RestClient.post('http://localhost:5984/enron',
                      attrs.to_json,
                      :content_type => 'application/json')
    rescue Interrupt
      exit(1)
    rescue Exception => ex
      # Report the file that failed and carry on with the next one.
      puts "#{path} #{ex.inspect}"
    end
  end
end
THEEND
sudo gem install rake rest-client json tmail
rake -T
rake import
# .....wait for it.....
rake irb
This will take a while. Not long after the script starts, you will see documents showing up in your CouchDB. You will see errors printed for a couple dozen emails that are not properly formatted or that won’t convert to JSON, but you’ll still end up with most of the emails in your CouchDB. Navigate to CouchDB’s Futon and start mappin’ and reducin’ :)
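If you would rather script a first view than click through Futon, here is a rough sketch of a design document built in Ruby. It is only an illustration: the names "mail" and "by_recipient" are made up for this example, and it assumes each imported document ends up with a "to" field that is an array of address strings, which you should check against a few real documents first. It also assumes a CouchDB version that provides the built-in _count reduce.

```ruby
require 'json'

# Hypothetical map function (JavaScript, as CouchDB views expect).
# Assumption: doc.to is an array of address strings; inspect a few
# imported documents before relying on this.
MAP_BY_RECIPIENT = <<~JS
  function(doc) {
    if (doc.to && doc.to.length) {
      for (var i = 0; i < doc.to.length; i++) {
        emit(doc.to[i], 1);
      }
    }
  }
JS

# Build the design document as a plain Hash. The names "mail" and
# "by_recipient" are illustrative, not from the import script.
def design_doc
  {
    '_id'      => '_design/mail',
    'language' => 'javascript',
    'views'    => {
      'by_recipient' => {
        'map'    => MAP_BY_RECIPIENT,
        'reduce' => '_count' # built-in reduce that counts emitted rows
      }
    }
  }
end

# Print the JSON so it can be inspected before uploading.
puts JSON.pretty_generate(design_doc)
```

To try it, the same RestClient gem used for the import should be able to upload it with something like RestClient.put('http://localhost:5984/enron/_design/mail', design_doc.to_json, :content_type => 'application/json'), after which querying http://localhost:5984/enron/_design/mail/_view/by_recipient?group=true should return a count per recipient, provided the assumption about the "to" field holds.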