Proposals for Big Data web mining talk

November 16, 2009

I’m going to be giving a talk at the Bay Area ACM data mining SIG in December, and I need to finalize my topic soon – like today 🙂 I was going to expand on my Elastic Web Mining talk (“Web mining for SEO keywords”) from the ACM data mining unconference a few weeks back. But the fact that I’ll have 10s to 100s of millions of web page data more…

Web Miners vs Web Masters – An Uneasy Truce

November 11, 2009

The life of a webmaster is hard, and web crawlers make it harder (photo: http://www.flickr.com/photos/absolutely_loverly/ / CC BY 2.0)

There’s the daily drama of keeping both web site users and web site developers happy. Now mix in the unpredictable side effects of having automated agents hitting the site, and you can see why webmasters might think many web crawlers are evil. But web crawlers serve a very real, important role more…

Paul O'Rorke's summary of elastic web mining talk

November 4, 2009

Paul posted a nice summary of my elastic web mining talk over at his blog. He captured one of the key points I was trying to make when he said: “It was impressive to see how much of the processing was generated by Bixo and Cascading and how only a small fraction of the code needed to be custom coded ‘by hand.’” That’s a recurring theme when I show workflow more…

Elastic Web Mining Talk

November 2, 2009

Here’s the presentation I gave at the ACM data mining unconference on elastic web mining – how to create scalable, reliable and cost-effective web mining solutions using an open source stack (Hadoop, Cascading, Bixo) running in Amazon’s Elastic Compute Cloud (EC2). But I don’t see my notes showing up, so here’s the PDF version with full notes, which make the resulting slides a lot more meaningful. more…

Session writeups for ACM data mining unconference

November 2, 2009

I wound up being the scribe for two sessions at this past Sunday’s ACM data mining unconference. The first session was on open/public datasets, which are very useful for people working on data mining algorithms. The second session (last one of the day) was on open source data mining tools. Lots of people at this one, with a nice demo on KNIME and a good discussion of the R language more…

Announcing the Public Terabyte Dataset project

November 1, 2009

We’re very excited to announce the Public Terabyte Dataset project. This is a high-quality crawl of top web sites, using AWS’s Elastic MapReduce, Concurrent’s Cascading workflow API, and Bixo Labs’ elastic web mining platform. The resulting dataset will be hosted by Amazon in S3 and freely available to all EC2 users. In addition, the code used to create and process the dataset will be available for download more…