Tim Dysinger

Life & Tech on Kauai

Importing Enron Into CouchDB

I have been goofing around with CouchDB for about a year now. To do anything fun or interesting with it, you first need some data to play with. To solve this, I imported the Enron email dataset into CouchDB, which gives us a couple hundred thousand documents.

How? First I downloaded all the Enron data from Carnegie Mellon University’s Enron Email Dataset. Then I used the ‘mail trends’ project’s enron.py code to convert the loose files into Unix mbox format so it’s easily understood by code. Once the Enron data is in a format we like, we can use the Ruby script below to take the email and push it into CouchDB. (Make sure CouchDB is installed and running.) The code is as follows:

cat > rakefile.rb <<\THEEND
%w(time tmail find restclient json).each {|l| require l}

file_create('enron_mail_030204.tar.gz') do
  `curl -O http://download.srv.cs.cmu.edu/~enron/enron_mail_030204.tar.gz`
end

file_create('maildir' => 'enron_mail_030204.tar.gz') do
  `tar xzof enron_mail_030204.tar.gz`
end

desc('import the email to localhost couchdb')
task(:import => 'maildir') do
  RestClient.put('http://localhost:5984/enron', '') rescue nil
  Find.find('maildir') do |path|
    next if FileTest.directory?(path)
    begin
      txt = IO.read(path)
      msg = TMail::Mail.parse(txt)
      # skip mail dated before 1999; parse the cutoff only once
      next if msg.date < (@t ||= Time.parse("1999-01-01"))
      attrs = msg.header.merge('to' => msg.to_addrs,
                               'cc' => msg.cc_addrs,
                               'bcc' => msg.bcc_addrs,
                               'body' => msg.body).reject {|k,v| v.to_s.empty?}
      RestClient.post('http://localhost:5984/enron',
                      attrs.to_json,
                      :content_type => 'application/json')
    rescue Interrupt
      exit(1)
    rescue Exception => ex
      puts "#{path} #{ex.inspect}"
    end
  end
end
THEEND

sudo gem install rake rest-client json tmail
rake -T
rake import
# .....wait for it.....
rake irb

This will take a while. Not long after the script starts, you will see documents showing up in your CouchDB. A couple dozen emails are not properly formatted or won’t convert to JSON, but you’ll still end up with most of the emails in your CouchDB. Navigate to CouchDB’s Futon and start mappin’ and reducin’ :)
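
If you would rather script a first view than click it together in Futon, here is a minimal sketch of a design document that counts emails per sender. It is only an illustration: the design document name `_design/enron`, the view name `by_sender`, and the helper `sender_count_design_doc` are made up for this example, and it assumes the imported documents ended up with a usable `from` field and that your CouchDB version supports the built-in `_count` reduce.

```ruby
require 'json'

# Build a CouchDB design document as a plain Ruby hash.
# The names '_design/enron' and 'by_sender' are hypothetical, chosen for this sketch.
def sender_count_design_doc
  {
    '_id'      => '_design/enron',
    'language' => 'javascript',
    'views'    => {
      'by_sender' => {
        # Map functions are JavaScript stored as a string.
        # Assumption: each imported document has a 'from' field.
        'map'    => 'function(doc) { if (doc.from) { emit(doc.from, 1); } }',
        # Built-in reduce that counts the emitted rows per key.
        'reduce' => '_count'
      }
    }
  }
end

# Print the JSON that would be sent to CouchDB.
puts JSON.pretty_generate(sender_count_design_doc)
```

Storing that JSON with a PUT to http://localhost:5984/enron/_design/enron (the same RestClient.put call the rakefile uses to create the database would do) and then requesting the view with group=true should give one row per sender with a message count, provided the assumptions above hold for your data.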