Tim Dysinger

Tue, Feb 10th

Importing Enron into CouchDB

I have been goofing around with CouchDB for about a year now. To do anything fun or interesting with it, you first need some data to play with. To solve this I imported the Enron email dataset into CouchDB, which gives us a couple hundred thousand documents. How? First I downloaded all the Enron data from Carnegie Mellon University's Enron Email Dataset page. Then I used the 'mail trends' project's enron.py code to convert the loose files into Unix mbox format so it's easily understood by code. Once the Enron data is in a format we like, we can use the Rake script below to take the email and push it into CouchDB. (Make sure CouchDB is installed and running.) The code is as follows:
cat >rakefile.rb <<\THEEND
%w(time tmail find restclient json).each { |l| require l }

file_create('enron_mail_030204.tar.gz') do
  `curl -O http://download.srv.cs.cmu.edu/~enron/enron_mail_030204.tar.gz`
end

file_create('maildir' => 'enron_mail_030204.tar.gz') do
  `tar xzof enron_mail_030204.tar.gz`
end

desc('import the email to localhost couchdb')
task(:import => 'maildir') do
  RestClient.put('http://localhost:5984/enron', '') rescue nil
  Find.find('maildir') do |path|
    next if FileTest.directory?(path)
    begin
      txt = IO.read(path)
      msg = TMail::Mail.parse(txt)
      next if msg.date < @t = Time.parse("1999-01-01")
      attrs = msg.header.merge('to' => msg.to_addrs,
                               'cc' => msg.cc_addrs,
                               'bcc' => msg.bcc_addrs,
                               'body' => msg.body).reject { |k, v| v.to_s.empty? }
      RestClient.post('http://localhost:5984/enron',
                      attrs.to_json,
                      :content_type => 'application/json')
    rescue Interrupt
      exit(1)
    rescue Exception => ex
      puts "#{path} #{ex.inspect}"
    end
  end
end
THEEND

sudo gem install rake rest-client json tmail
rake -T
rake import
# .....wait for it.....
rake irb
This will take a while. Not long after the script starts you will see documents showing up in your CouchDB. A couple dozen emails are not properly formatted or won't convert to JSON, but you'll still end up with 290k+ emails in your CouchDB. That is 99.9% of the dataset and 4 years of Enron email to play with. Navigate to http://localhost:5984/_utils and start mappin' and reducin' :)
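The step in the import task that merges the headers with the address lists and drops empty fields can be tried on its own, without TMail or a running CouchDB. The following is only a minimal sketch of that idea: `blank_value?` and `build_doc` are hypothetical helper names, and the sample headers are made up. One assumption worth flagging: the `v.to_s.empty?` test in the Rakefile only drops empty arrays on Ruby 1.8, where `[].to_s` is an empty string. On Ruby 1.9 and later `[].to_s` is `"[]"`, so the sketch checks for emptiness directly instead.

```ruby
require 'json'

# Hypothetical helper (not in the Rakefile): true for nil, "" and [].
def blank_value?(value)
  value.nil? || (value.respond_to?(:empty?) && value.empty?)
end

# Hypothetical helper: merge parsed headers with the extra fields and
# drop every field that has nothing in it, as the import task does.
def build_doc(headers, extras)
  headers.merge(extras).reject { |_key, value| blank_value?(value) }
end

# Made-up sample data standing in for a parsed message.
headers = { 'subject' => 'Q3 numbers', 'x-folder' => '' }
extras  = { 'to'   => ['a@example.com'],
            'cc'   => nil,
            'bcc'  => [],
            'body' => 'See attached.' }

doc = build_doc(headers, extras)
puts doc.to_json
```

Only `subject`, `to` and `body` survive here, so the JSON that would be posted to CouchDB stays free of empty fields.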

Mon, Oct 13th

Using Amazon EC2 Metadata as a Simple DNS

I use the Amazon EC2 metadata to build /etc/hosts, and I do this on a cron schedule. This does everything I need. Instead of fancy DynDNS tricks or having to run and manage an internal DNS server, I just have a Ruby script that looks at the EC2 metadata to build /etc/hosts. It's easy. To set it up yourself and try it, all you need are 3 easy steps.
Step 1- Start each of your instances with a uniquely named key that matches what you want its internal hostname to be (the script reads it back as keyName). Such as "onion" or "potato" or whatever you want to call them.
Step 2- Make sure you have Ruby, RubyGems and amazon-ec2 (rubygem) installed. Then create a Ruby script in /usr/local/sbin/hosts that has the following:
#!/usr/bin/env ruby
%w(optparse rubygems EC2 resolv pp).each { |l| require l }
options = {}
parser = OptionParser.new do |p|
  p.banner = "Usage: hosts [options]"
  p.on("-a", "--access-key USER", "The user's AWS access key ID.") do |aki|
    options[:access_key_id] = aki
  end
  p.on("-s",
       "--secret-key PASSWORD",
       "The user's AWS secret access key.") do |sak|
    options[:secret_access_key] = sak
  end
  p.on_tail("-h", "--help", "Show this message") {
    puts(p)
    exit
  }
  p.parse!(ARGV) rescue puts(p)
end
if options.key?(:access_key_id) and options.key?(:secret_access_key)
  puts "127.0.0.1 localhost"
  EC2::Base.new(options).describe_instances.reservationSet.item.each do |r|
    r.instancesSet.item.each do |i|
      if i.instanceState.name =~ /running/
        puts(Resolv::DNS.new.getaddress(i.privateDnsName).to_s +
             " #{i.keyName}.ec2 #{i.keyName}")
      end
    end
  end
else
  puts(parser)
  exit(1)
end
Step 3- Set up a cron job to update /etc/hosts as often as you like. I do it once per hour on all my machines:
0 * * * * /usr/local/sbin/hosts -a myaccess -s mysecret >/etc/hosts
All my machines use this EC2 key name + script + cron approach. I do not have to run DynDNS or any private DNS servers to keep track of all my internal server IP addresses. My /etc/hosts looks like the following on the three machines in the test cluster:
127.0.0.1 localhost
10.252.202.221 oahu.ec2 oahu
10.253.115.175 maui.ec2 maui
10.253.114.190 hawaii.ec2 hawaii
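To see the line-building logic without AWS credentials, here is a minimal sketch that runs against plain Ruby structs instead of the amazon-ec2 response. The `Instance` struct, the `hosts_lines` helper and the stopped "kauai" instance are all hypothetical. The sketch also assumes the private IP is already known, where the real script resolves `privateDnsName` with `Resolv::DNS`.

```ruby
# Hypothetical stand-in for one instance record from describe_instances.
Instance = Struct.new(:key_name, :state, :private_ip)

# Hypothetical helper: build the /etc/hosts lines for running instances,
# using each instance's key name as its hostname.
def hosts_lines(instances)
  lines = ['127.0.0.1 localhost']
  instances.each do |i|
    next unless i.state =~ /running/
    lines << "#{i.private_ip} #{i.key_name}.ec2 #{i.key_name}"
  end
  lines
end

instances = [
  Instance.new('oahu',   'running', '10.252.202.221'),
  Instance.new('maui',   'running', '10.253.115.175'),
  Instance.new('hawaii', 'running', '10.253.114.190'),
  Instance.new('kauai',  'stopped', nil) # made-up, skipped because not running
]

puts hosts_lines(instances)
```

One design note: the shell truncates /etc/hosts before the script starts when cron runs it with `>/etc/hosts`, so a failed run can leave the file empty or holding the usage message. Writing to a temporary file and renaming it over /etc/hosts is one way to avoid that.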

Gentoo, EC2, Portage and Puppet Make a Good Smoothie

This last spring I blogged about creating Gentoo images for EC2. Since then I have evolved it into a project and added Portage tools and Puppet to the mix. You can find the project over at github.com. Basically I now use Puppet to bootstrap and configure servers from scratch for EC2, including completely optimizing the entire operating system for the instance type at EC2. After the image is created you can keep using Puppet to maintain your instances at EC2. Puppet is a great tool for Unix server management. The combination of Gentoo (my fav) with EC2 (another fav) and automation is awesome. Even if you just want to use Gentoo and don't want to bother with Puppet, this project can be of use to you. The base images are nothing more than the Gentoo base system, dhcp, ddclient, openntp, syslog, vixie-cron, postfix, java, ruby & rubygems, puppet and the Amazon api/ami tools. This is pretty small and (besides Puppet) is the bare minimum to run an EC2 instance and exercise the api/ami tools. I have prebuilt the following public images for people to use if they would like:
ami-d22cc8bb m1.small  32-bit 1-core "athlon-xp" optimized
ami-c02cc8a9 m1.large  64-bit 2-core "opteron"   optimized
ami-c12cc8a8 m1.xlarge 64-bit 4-core "opteron"   optimized
ami-5d2dc934 c1.medium 32-bit 2-core "prescott"  optimized
ami-332cc85a c1.xlarge 64-bit 8-core "nocona"    optimized

Thu, Oct 9th

Mochirest: Rails UI and Erlang/Mochiweb REST JSON Service

It's been a while. I have been super busy at work. I have mostly been doing Puppet sysadmin work on Gentoo (more on that later) and Erlang coding. I released a demo app on GitHub (mostly for fun): a simple Rails scaffold app talking to a Mochiweb REST (JSON) backend. Erlang and Ruby are a great partnership. I think the languages support each other well, and I am excited to continue on my path to Erlang mastery.