Crawler-commons project gets started

December 3, 2009

Back in November we helped put together a small gathering for web crawler developers at ApacheCon 2009. One of the key topics was how to share development effort, rather than having each project independently implement similar functionality. The crawler-commons project was born out of that discussion. As the main page says: "The purpose of this project is to develop a set of reusable Java components that implement functionality common to any web crawler."

Filed under: Uncategorized by kkrugler

Public web crawler projects

December 2, 2009

Tags: heritrix, nutch, public terabyte dataset, web crawler

Several people have pointed me to other public/non-profit projects doing large-scale public web crawls, so I thought I’d summarize the ones I now know about below. And if you know of others, please add your comments and I’ll update the list.

Wayback Machine – A time-series snapshot of important web pages, from 1996 to present. 150B pages crawled in total as of 2009. The data is searchable, but not available …

Filed under: Uncategorized by kkrugler
