Text feature selection for machine learning – part 2

July 21, 2013

In my previous blog post on text feature selection, I’d covered some of the key steps: Extract the relevant text from the content. Tokenize this text into discrete words. Normalize these words (case-folding, stemming) (and a bit of filtering out “bad words”). In this blog post I’m going to talk about improving the quality of the terms. But first I wanted to respond to some questions from part 1, about more…

Text feature selection for machine learning – part 1

July 10, 2013

We do a lot of projects that require extracting text features from documents, for use with recommendation systems, clustering and classification. Often the “document” is an entity like a person, a company, or a web site. In these cases, the text for each document is the aggregation of all text associated with each entity – for example, it could be the text from all pages crawled for a given blog more…

The Durkheim Project goes live!

July 3, 2013

As of today, the Durkheim Project is now live. This is a research project involving Patterns and Predictions, the Geisel School of Medicine at Dartmouth, the U.S. Department of Veterans Affairs (VA) and Facebook. See the Durkheim Project launch announcement for full details. The worthy goal of the Durkheim Project is to improve the medical community’s ability to predict suicides. The driving force was original the military’s concern about increasing more…