Text classification 1
Text classification is top of mind, after the recent completion of a big classifier build - 550 or so categories in 3 languages. (Note to self - categorize this blog!)
So there will be a few posts on this topic, beginning with the choice of technology: statistical, or linguistic (rule-based)?
I will confess to a bias. At my previous company we built and deployed a statistical classifier, using a Support Vector Machine (SVM). The SVM classifier was successful. We were able to achieve precision/recall of roughly 70% at a first cut. We trained the classifier using example documents (as you would). About 20 documents per category were required. This makes the training process pretty fast. An intelligent layman can assemble 20 training documents for a general category in less than an hour. Compiling the training set is only a small part of the task, but nonetheless significant. If you want to read more about statistical classifiers, I’d suggest a visit to Susan Dumais’ home page.
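For anyone unsure what a precision/recall figure actually measures, here is a minimal sketch in Python. The function name, the two categories and the ten sample labels are all invented for illustration; none of it comes from the project described above.

```python
def precision_recall(predicted, actual, category):
    """Return (precision, recall) for one category.

    precision: of the documents the classifier put in the category,
               what fraction really belong there?
    recall:    of the documents that really belong in the category,
               what fraction did the classifier find?
    """
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p == category and a == category)
    false_pos = sum(1 for p, a in pairs if p == category and a != category)
    false_neg = sum(1 for p, a in pairs if p != category and a == category)

    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical labels for ten documents (made-up data).
predicted = ["sport", "sport", "sport", "news", "news",
             "sport", "news", "news", "news", "news"]
actual = ["sport", "sport", "news", "sport", "news",
          "sport", "news", "news", "sport", "news"]

precision, recall = precision_recall(predicted, actual, "sport")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy sample the classifier is right about three of the four documents it calls "sport", but finds only three of the five that really are sport. The two numbers pull in different directions, which is why they are usually quoted together.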
We doubted that the statistical classifier could scale to very large taxonomies, or that it could deal with very granular categories. Machine classifiers tend to run into the same issues as people do when they try to classify documents. This page, for example, could be dropped in the “classifiers” category. But if there were two more granular categories - “statistical classifiers” and “linguistic classifiers” - you would have to place it in both. Or neither? We can have another post about category design and the idea of “disjoint” categories.
In the recent project, the approach was already chosen, and it was linguistic. A category in a linguistic categorizer contains a rule, which specifies the terms the target document must contain (how many, where in the page, in which combinations and so on).
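To make the idea of a category rule concrete, here is a rough sketch in Python. It is not the rule language of the product we used; the Rule fields (all_of, any_of, min_any, title_terms) and the sample categories are my own assumptions, chosen to mirror "how many, where in the page, in which combinations".

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    """A hypothetical category rule (field names are illustrative)."""
    category: str
    all_of: tuple = ()       # every one of these terms must appear
    any_of: tuple = ()       # at least `min_any` of these must appear
    min_any: int = 1         # "how many"
    title_terms: tuple = ()  # at least one must be in the title ("where in the page")


def tokens(text):
    """Lower-case the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def matches(rule, title, body):
    """Return True if the page (title + body) satisfies the rule."""
    title_tokens = tokens(title)
    page_tokens = title_tokens | tokens(body)
    if not all(term in page_tokens for term in rule.all_of):
        return False
    if rule.any_of and sum(term in page_tokens for term in rule.any_of) < rule.min_any:
        return False
    if rule.title_terms and not any(term in title_tokens for term in rule.title_terms):
        return False
    return True


def classify(rules, title, body):
    """Return every category whose rule matches; a page may land in several."""
    return [rule.category for rule in rules if matches(rule, title, body)]


RULES = [
    Rule("statistical classifiers", all_of=("classifier",),
         any_of=("svm", "statistical", "training"), min_any=2),
    Rule("linguistic classifiers", all_of=("classifier",),
         any_of=("rule", "rules", "linguistic", "terms"), min_any=2),
    Rule("classification", title_terms=("classification",)),
]

print(classify(RULES, "Text classification",
               "We built a statistical classifier using SVM and training documents."))
```

Even this toy version hints at where the labour goes: every term list and threshold is an editorial decision, and a page that mentions both approaches will match both of the granular rules.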
I had always thought this approach rather simplistic. Words are too ambiguous to be tied down by rules. My reservations may apply to some types of content, but for general web content, rule-based classification works pretty well. The difficulty lies in constructing the rules. This process is more laborious than anyone might have thought, and there are very few people (information scientists) with the necessary skills.
Recently, there have been experiments in using Wikipedia and other open source datasets to train or to provide background knowledge for classifiers, and these look promising. But it seems inevitable that most taxonomies will be highly customized, and this poses a problem.
The bottom line is that editorial talent and labour are required in the building of a classifier, and that even then, the resulting indexing process will require supervision.
Google’s “forgotten attachment detector”
We all forget attachments.
Now Google is automating this failing out of our lives.
I love that.
In 1999 it could have been a start up.
When my mind changes, I change the facts.
I am hardly qualified to add to the deluge of commentary on the financial crisis. But given that the qualified led us to this pass, I will plunge in regardless.
Supposedly, the crisis was caused by a lack of information. Or perhaps by a misinterpretation of information. It does astonish me that this perspective on the crisis has been largely ignored.
There have been several attempts over the last few years to improve the flow and interpretation of financial information. The aim of the Basel II accord (from its host):
“The Basel II Framework describes a more comprehensive measure and minimum standard for capital adequacy that national supervisory authorities are now working to implement through domestic rule-making and adoption procedures. It seeks to improve on the existing rules by aligning regulatory capital requirements more closely to the underlying risks that banks face. In addition, the Basel II Framework is intended to promote a more forward-looking approach to capital supervision, one that encourages banks to identify the risks they may face, today and in the future, and to develop or improve their ability to manage those risks. As a result, it is intended to be more flexible and better able to evolve with advances in markets and risk management practices.”
There was Sarbanes-Oxley, designed to assure the adequacy of corporate systems to withstand abuse and failure.
Then XBRL, designed to standardise the reporting of financial information so that apples, as it were, could be compared to apples.
So, many fine minds were bent on financial transparency and control just as the errors stoking the crisis were creeping up. And measures were taken more energetically in the US than elsewhere - while it was in the US that the storm brewed up.
Some senior banking executives say that they didn’t know what their employees were doing. Though this may be hard to like, it is not hard to believe. Information is filtered as it floats to the top of organisations. Employees don’t like to be the bearers of bad news. And in any case, the writing on the wall can be ignored.
Would improved information management really make a difference? Surely it would. The motion “if information were improved, it would make no difference to our decisions” would surely be defeated.
“When the facts change, I change my mind”, said Keynes and, I think, Kissinger after him. But to change your mind, you must know of the change in the facts.
Keynes’ dictum also skirts the (mis-)appropriation of facts to support opinions:
“Scholars assume that citizens perform better when they know pertinent facts. Factual beliefs, however, become relevant for political judgments only when people interpret them. Interpretations provide opportunities for partisans to rationalize their existing opinions.”
You could almost reverse him, thus: “When my mind changes, I change the facts. What do you do, Sir?”
Welcome
Hi there and thanks for visiting. This is my third attempt at establishing a blog - previous efforts have fallen casualty to consulting projects with deadlines, children with bath times, ever lingering uncertainties about privacy and confidentiality, glasses of red wine … I could go on, but you doubtless get the picture.
There is a bit of a separation now between Church and State, and you can find my business affairs at www.content-engineers.com.
This blog will deal with information. If ever there was a Cinderella profession, it is that of information management. As obvious in its utility and as ubiquitous as water, information is no more - much less - glamorous. How can we change this? It’s not immediately obvious how, other than by writing intelligently about it.
If you share my interests, please comment.
