Text classification 1
Text classification is top of mind, after the recent completion of a big classifier build – 550 or so categories in 3 languages. (Note to self – categorize this blog!)
So there will be a few posts on this topic, beginning with the choice of technology: statistical, or linguistic (rule-based)?
I will confess to a bias. At my previous company we built and deployed a statistical classifier, using a Support Vector Machine (SVM). The SVM classifier was successful. We were able to achieve precision/recall of roughly 70% at a first cut. We trained the classifier using example documents (as you would). About 20 documents per category were required. This makes the training process pretty fast. An intelligent layman can assemble 20 training documents for a general category in less than an hour. Compiling the training set is only a small part of the task, but nonetheless significant. If you want to read more about statistical classifiers, I’d suggest a visit to Susan Dumais’ home page.
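For anyone unsure what "precision/recall of roughly 70%" means for a single category, here is a minimal sketch. The document ids are made up for illustration and are not figures from the project: precision is the share of documents the classifier put in the category that belong there, and recall is the share of documents that belong there which the classifier found.

```python
def precision_recall(predicted, actual):
    """Precision and recall for one category.

    predicted: set of document ids the classifier assigned to the category
    actual:    set of document ids an editor says belong in the category
    """
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical data: the classifier picked documents 1-10,
# while the editor's answer key is documents 4-13.
predicted = set(range(1, 11))
actual = set(range(4, 14))

precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.0%} recall={recall:.0%}")
```

In this toy case seven documents are in both sets, so both figures come out at 70%.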
We doubted that the statistical classifier could scale to very large taxonomies, or that it could deal with very granular categories. Machine classifiers tend to run into the same issues as people do when they try to classify documents. This page, for example, could be dropped in the “classifiers” category. But if there were two more granular categories – “statistical classifiers” and “linguistic classifiers” – you would have to place it in both. Or neither? We can have another post about category design and the idea of “disjoint” categories.
In the recent project, the approach was already chosen, and it was linguistic. A category in a linguistic categorizer contains a rule, which specifies the terms the target document must contain (how many, where in the page, in which combinations and so on).
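To make the idea of a rule concrete, here is a rough sketch of what such a category might look like. It is not the rule language of any particular product: the `Rule` fields, the `matches` and `classify` helpers, and the example terms are all hypothetical, chosen only to illustrate "how many, where in the page, in which combinations".

```python
import re
from dataclasses import dataclass, field


@dataclass
class Rule:
    category: str
    any_terms: set                                   # occurrences of these are counted
    min_hits: int = 1                                # "how many"
    title_terms: set = field(default_factory=set)    # "where in the page"
    all_terms: set = field(default_factory=set)      # "in which combinations"


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def matches(rule, title, body):
    title_tokens = set(tokenize(title))
    page_tokens = tokenize(title) + tokenize(body)
    hits = sum(1 for token in page_tokens if token in rule.any_terms)
    if hits < rule.min_hits:
        return False
    # If title terms are specified, at least one must appear in the title.
    if rule.title_terms and not (rule.title_terms & title_tokens):
        return False
    # Every term in all_terms must appear somewhere on the page.
    return rule.all_terms <= set(page_tokens)


def classify(rules, title, body):
    # A page may fall into several categories, or none.
    return [rule.category for rule in rules if matches(rule, title, body)]


rules = [
    Rule("classifiers",
         any_terms={"classifier", "classification", "categorizer"},
         min_hits=2,
         title_terms={"classification"}),
    Rule("pharmacy", any_terms={"pharmacy", "prescription"}),
]

labels = classify(
    rules,
    "Text classification 1",
    "We built a statistical classifier and a linguistic categorizer.",
)
print(labels)
```

Even this toy version hints at where the effort goes: every category needs its own hand-picked terms and thresholds.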
I had always thought this approach rather simplistic. Words are too ambiguous to be tied down by rules. My reservations may apply to some types of content, but for general web content, rule-based classification works pretty well. The difficulty lies in constructing the rules. This process is more laborious than anyone might have thought, and there are very few people (information scientists) with the necessary skills.
Recently, there have been experiments in using Wikipedia and other open source datasets to train or to provide background knowledge for classifiers, and these look promising. But it seems inevitable that most taxonomies will be highly customized, and this poses a problem.
The bottom line is that editorial talent and labour are required in the building of a classifier, and that even then, the resulting indexing process will require supervision.
