Elastic Web Mining Talk

November 2, 2009

Here’s the presentation I gave at the ACM data mining unconference on elastic web mining – how to create scalable, reliable and cost-effective web mining solutions using an open source stack (Hadoop, Cascading, Bixo) running in Amazon’s Elastic Compute Cloud (EC2). [slideshare id=2407600&doc=acmuctalk-091102194640-phpapp02] But I don’t see my notes showing up, so here’s the PDF version with full notes, which make the resulting slides a lot more meaningful. [slideshare more…

Session writeups for ACM data mining unconference

November 2, 2009

I wound up being the scribe for two sessions at this past Sunday’s ACM data mining unconference. The first session was on open/public datasets, which are very useful for people working on data mining algorithms. The second session (last one of the day) was on open source data mining tools. Lots of people attended this one, with a nice demo of KNIME and a good discussion of the R language more…

Announcing the Public Terabyte Dataset project

November 1, 2009

We’re very excited to announce the Public Terabyte Dataset project. This is a high-quality crawl of top web sites, using AWS’s Elastic MapReduce, Concurrent’s Cascading workflow API, and Bixo Labs’ elastic web mining platform. Hosting for the resulting dataset will be provided by Amazon in S3, and the dataset will be freely available to all EC2 users. In addition, the code used to create and process the dataset will be available for download more…

Presenting at 2009 Silicon Valley Data Mining Camp

October 30, 2009

This coming Sunday is the big Bay Area data mining “unconference”, and with more than 200 people already signed up, it’s going to be a lot of fun. I’ll be presenting at some point during the day – since it’s an unconference, you don’t really know who’s going to be talking about what/when. My topic is “Elastic web mining using open source (Hadoop/Cascading/Bixo) in Amazon’s EC2 cloud”. If you scan more…

Amazon Drops EC2 Prices

October 28, 2009

And that’s good news for our customers! In their official announcement, Amazon said: Finally, we are also lowering prices on all Amazon EC2 On-Demand compute instances, effective on November 1st. Charges for Linux-based instances will drop 15% — a small Linux instance will now cost just 8.5 cents per hour of computing, compared to the previous price of 10 cents per hour. Since your Bixolabs usage fee is based on more…

Bixolabs Less Stealthy

October 27, 2009

It’s time to raise the curtain a bit on our new web mining platform. We’re currently running test crawls for early partners, and tuning up the GUI for the Bixolabs admin console. In the meantime, we’ll be adding more details to this site about web mining in general, and how Bixolabs deals with some of the very unusual issues you run into while crawling the web (video poker link farms more…