Tim Dysinger

Tue, Feb 10th

Importing Enron into CouchDB

I have been goofing around with couchdb for about a year now. To do anything fun or interesting with it, you first need some data to play with. To solve this, I imported the enron email dataset into couchdb so we have a couple hundred thousand documents. How? First I downloaded all the enron data from Carnegie Mellon University's Enron Email Data. Then I used the 'mail trends' project's enron.py code to convert the loose files into unix mbox format so they are easily understood by code. Once we have the enron data in a format we like, we can use the ruby script below to take the email and push it into couchdb. (Make sure couchdb is installed and running.) The code is as follows:
cat >rakefile.rb <<\THEEND
%w(time tmail find restclient json).each { |l| require l }

file_create('enron_mail_030204.tar.gz') do
  `curl -O http://download.srv.cs.cmu.edu/~enron/enron_mail_030204.tar.gz`
end

file_create('maildir' => 'enron_mail_030204.tar.gz') do
  `tar xzof enron_mail_030204.tar.gz`
end

desc('import the email to localhost couchdb')
task(:import => 'maildir') do
  RestClient.put('http://localhost:5984/enron', '') rescue nil
  Find.find('maildir') do |path|
    next if FileTest.directory?(path)
    begin
      txt = IO.read(path)
      msg = TMail::Mail.parse(txt)
      # skip mail dated before 1999; the cutoff is parsed once and cached
      next if msg.date < (@t ||= Time.parse("1999-01-01"))
      attrs = msg.header.merge('to' => msg.to_addrs,
                               'cc' => msg.cc_addrs,
                               'bcc' => msg.bcc_addrs,
                               'body' => msg.body).reject { |k, v| v.to_s.empty? }
      RestClient.post('http://localhost:5984/enron',
                      attrs.to_json,
                      :content_type => 'application/json')
    rescue Interrupt
      exit(1)
    rescue Exception => ex
      puts "#{path} #{ex.inspect}"
    end
  end
end
THEEND

sudo gem install rake rest-client json tmail
rake -T
rake import
# .....wait for it.....
rake irb
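
While the import is running, one way to keep an eye on progress is to ask couchdb for the database info, which includes a doc_count field. The following is only a rough sketch and not part of the rakefile above: the helper names doc_count and fetch_doc_count are made up for illustration, and it uses Ruby's bundled net/http instead of rest-client so it runs without extra gems.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Pull the document count out of the JSON that couchdb returns for GET /enron.
# (Hypothetical helper name, not part of the rakefile.)
def doc_count(db_info_json)
  JSON.parse(db_info_json).fetch('doc_count')
end

# Fetch the database info from a running couchdb and return the count.
# Assumes the same localhost URL and database name the import uses.
def fetch_doc_count(db_url = 'http://localhost:5984/enron')
  doc_count(Net::HTTP.get(URI(db_url)))
end
```

From irb, with couchdb running, calling fetch_doc_count every so often should show the number climbing.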
This will take a while. Not long after the script starts you will see documents showing up in your couchdb. A couple dozen emails are not properly formatted or won't convert to json (the script prints those and moves on), but you'll still end up with 290k+ emails in your couchdb. That's 99.9% complete and 4 years of enron email to play with. Navigate to http://localhost:5984/_utils and start mappin' and reducin' :)
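
As a starting point for the mappin' and reducin', here is a minimal sketch of a view that counts emails per sender. Everything named here is an assumption for illustration: the design document "stats", the view "by_sender", and the idea that each imported document ended up with a usable "from" field (check a few documents in _utils first). View functions in couchdb are JavaScript stored as strings, so the Ruby part only builds the design document and PUTs it.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Same database the import writes to.
DB_URL = 'http://localhost:5984/enron'

# Build a design document with one view that counts emails per sender.
# "stats" and "by_sender" are hypothetical names chosen for this sketch.
def sender_count_design_doc
  {
    '_id'      => '_design/stats',
    'language' => 'javascript',
    'views'    => {
      'by_sender' => {
        # Assumes documents carry a "from" field taken from the mail headers.
        'map'    => 'function(doc) { if (doc.from) { emit(doc.from, 1); } }',
        # sum() is a helper available to couchdb's JavaScript view functions.
        'reduce' => 'function(keys, values) { return sum(values); }'
      }
    }
  }
end

# PUT the design document into the database (needs a running couchdb).
def install_design_doc(db_url = DB_URL)
  doc = sender_count_design_doc
  uri = URI("#{db_url}/#{doc['_id']}")
  request = Net::HTTP::Put.new(uri, 'Content-Type' => 'application/json')
  request.body = JSON.generate(doc)
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
end
```

After calling install_design_doc once, the view should show up in _utils. On reasonably recent couchdb versions it can also be queried at /enron/_design/stats/_view/by_sender?group=true; older releases used a different view URL, so treat that path as an assumption about your install.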