Hello there – thanks for visiting my site.

I offer expertise in “enterprise information architecture”. Put in everyday terms, this means “the ways in which enterprises organise their information”.

I believe that one of the best things I can offer is my combination of experience in business, information science and technology. I certainly have 10,000 hours in each of the first two, and several thousand in the third.

I’m now a director of Motif Consulting, a consulting practice focused on this area. You can find a more up-to-date blog, along with SharePoint and Office 365 information, on Motif’s site: www.motif-consulting.com.

Information sharing

Information sharing – between organisations and systems – is one of the hardest nuts to crack. Yet it is one of the most pressing priorities, particularly in areas relating to government and the health of our society. Part of the solution is provided by information standards.

Two examples have arisen in the last couple of days. The hideous story of a man in Sheffield who fathered many children by raping his daughters has prompted calls for greater information sharing (Roger Graef on the Today programme). Mass murder in India prompts calls for greater information sharing between nations and their respective security agencies.

The truth is that most information systems are still walled gardens. The walls are sturdily constructed in part because of the need for privacy and data protection. It is no trivial matter to pass information from one system to another, where a host of prying eyes can look at it. In a local government environment, for example, it is quite difficult even to think through the implications of exchanging information between schools, social services and hospitals, still less across local and regional boundaries. However, I understand that much valuable work has already been done in the area of children’s services.

Let’s assume that the policy and data protection issues could be managed. We then have to deal with the issue of extracting information from one system and passing it easily to another. This is where standards-based information management should triumph. If ways of recording and storing information were standardised, it would be much easier to exchange it freely.

This has been tried, of course, within e-GIF, the British e-Government Interoperability Framework, with its Dublin Core based metadata scheme and its standardised taxonomy. e-GIF, however, was only a start. Work needs to be re-initiated to produce up-to-date, detailed metadata standards for government activities and to begin to negotiate their enforcement. Standardised information will be much more readily sharable.
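To make the idea of a shared metadata standard concrete, here is a minimal sketch of a record built from Dublin Core elements. It is not the e-GIF schema itself: the wrapper element name `record`, the helper `dublin_core_record` and all the field values are assumptions invented for illustration. Only the element names (title, creator, subject, date, identifier) and the namespace come from the Dublin Core element set.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Serialise a dict of Dublin Core element names to a small XML record."""
    record = ET.Element("record")  # wrapper element name is an assumption
    for name, value in fields.items():
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(record, encoding="unicode")


# Hypothetical values, invented purely for illustration.
xml_text = dublin_core_record({
    "title": "Referral to children's services",
    "creator": "Example Borough Council",
    "subject": "Children's services",
    "date": "2008-11-28",
    "identifier": "ref-0001",
})
print(xml_text)
```

If two systems agree on the same element names, the receiving system can read the record without knowing anything about how the sending system stores it.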

The more obvious “national database of X” is often proposed, but such projects are often unsustainable. The work of developing and spreading a standard is much less expensive, but requires both political will and grassroots demand. We must hope.

Taxonomy trends

Google Trends tells us the sad news that taxonomy, ontology, metadata and tagging all enjoyed a peak of search attention in 2004, and have been in slow decline ever since.


At the same time, usage of the terms in news stories has increased, and is now at a peak.

What’s going on here?

At first glance, it might seem that our field hit the peak of inflated expectation in 2004, and is now languishing in the “trough of disillusionment”. Possibly.

But let’s look at Google Trends a bit more closely. Over the same period, almost every geeky term has seen the same decline, even those in widespread use like “Linux” or “PostGres”. Maybe the web is broader now, and therefore geeky terminology is relatively less prominent.

Following the thought, terms relating to Western people and places have also shown relative declines – “New York City”, for example, and “San Francisco”. The same principle could easily apply as the web becomes more global. And on the subject of globalisation, far and away the leading country in ontology is Korea, according to Google.

So I return to Protégé without worrying unduly. Time will tell.

Text classification 1

Text classification is top of mind, after the recent completion of a big classifier build – 550 or so categories in 3 languages. (Note to self – categorize this blog!)

So there will be a few posts on this topic, beginning with the choice of technology. Statistical? Or linguistic (rule-based)?

I will confess to a bias. At my previous company we built and deployed a statistical classifier, using a Support Vector Machine (SVM). The SVM classifier was successful. We were able to achieve precision/recall of roughly 70% at a first cut. We trained the classifier using example documents (as you would). About 20 documents per category were required. This makes the training process pretty fast. An intelligent layman can assemble 20 training documents for a general category in less than an hour. Compiling the training set is only a small part of the task, but nonetheless significant. If you want to read more about statistical classifiers, I’d suggest a visit to Susan Dumais’ home page.
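For readers unfamiliar with the two measures, here is a minimal sketch of how precision and recall are worked out for a single category. The function name `precision_recall` and the document ids are hypothetical, chosen so that both figures come out at 70%; they are not data from the project described above.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for one category.

    predicted: set of document ids the classifier put in the category
    actual:    set of document ids an editor put in the category
    """
    true_positives = len(predicted & actual)
    # Precision: of the documents the classifier chose, how many were right?
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: of the documents that belong, how many did the classifier find?
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical document ids, invented purely for illustration.
predicted = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"}
actual = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d11", "d12", "d13"}

precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here the classifier picked ten documents, seven of them correctly, and missed three that an editor would have included.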

We doubted that the statistical classifier could scale to very large taxonomies, or that it could deal with very granular categories. Machine classifiers tend to run into the same issues as people do when they try to classify documents. This page, for example, could be dropped in the “classifiers” category. But if there were two more granular categories – “statistical classifiers” and “linguistic classifiers” – you would have to place it in both. Or neither? We can have another post about category design and the idea of “disjoint” categories.

In the recent project, the approach was already chosen, and it was linguistic. A category in a linguistic categorizer contains a rule, which specifies the terms the target document must contain (how many, where in the page, in which combinations and so on).
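A rule of this kind can be sketched in a few lines. This is an illustration only, not the product used in the project: the class `CategoryRule`, its fields and the sample rules are all hypothetical, and real rule languages are far richer (stemming, proximity, weighting and so on).

```python
import re
from dataclasses import dataclass, field


@dataclass
class CategoryRule:
    """A hypothetical rule: which terms a document must contain, and where."""
    name: str
    required_all: set = field(default_factory=set)  # every term must appear
    any_of: set = field(default_factory=set)        # at least min_any must appear
    min_any: int = 1
    title_terms: set = field(default_factory=set)   # if set, one must be in the title

    def matches(self, title, body):
        # Crude tokenisation: lower-case alphabetic words, no stemming.
        title_words = set(re.findall(r"[a-z]+", title.lower()))
        all_words = title_words | set(re.findall(r"[a-z]+", body.lower()))
        if not self.required_all <= all_words:
            return False
        if self.any_of and len(self.any_of & all_words) < self.min_any:
            return False
        if self.title_terms and not (self.title_terms & title_words):
            return False
        return True


def classify(title, body, rules):
    """Return the names of every category whose rule matches (several or none)."""
    return [rule.name for rule in rules if rule.matches(title, body)]


# Hypothetical rules and document, invented purely for illustration.
rules = [
    CategoryRule("classifiers", required_all={"classifier"}),
    CategoryRule("statistical classifiers", required_all={"classifier"},
                 any_of={"svm", "statistical", "training"}, min_any=2),
    CategoryRule("linguistic classifiers", required_all={"classifier"},
                 any_of={"rule", "rules", "linguistic"},
                 title_terms={"linguistic", "rule"}),
]
title = "Text classification"
body = ("We built a statistical classifier using SVM. "
        "A linguistic classifier uses a rule instead.")
print(classify(title, body, rules))
```

The sample document lands in two categories at once, and misses the third only because of the title condition, which hints at how much care each rule needs.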

I had always thought this approach rather simplistic. Words are too ambiguous to be tied down by rules. My reservations may apply to some types of content, but for general web content, rule-based classification works pretty well. The difficulty lies in constructing the rules. This process is more laborious than anyone might have thought, and there are very few people (information scientists) with the necessary skills.

Recently, there have been experiments in using Wikipedia and other open source datasets to train or to provide background knowledge for classifiers, and these look promising. But it seems inevitable that most taxonomies will be highly customized, and this poses a problem.

The bottom line is that editorial talent and labour are required in the building of a classifier, and that even then, the resulting indexing process will require supervision.

November 5, 2008 • Posted in: Text classification

Google’s “forgotten attachment detector”

We all forget attachments.

Now Google is automating this failing out of our lives.

I love that.

In 1999 it could have been a start up.

November 4, 2008 • Posted in: Bits and bobs

When my mind changes, I change the facts.

I am hardly qualified to add to the deluge of commentary on the financial crisis. But given that the qualified led us to this pass, I will plunge in regardless.

Supposedly, the crisis was caused by a lack of information. Or perhaps by a misinterpretation of information. It does astonish me that this perspective on the crisis has been largely ignored.

There have been several attempts over the last few years to improve the flow and interpretation of financial information. The aim of the Basel II accord (from its host):

“The Basel II Framework describes a more comprehensive measure and minimum standard for capital adequacy that national supervisory authorities are now working to implement through domestic rule-making and adoption procedures. It seeks to improve on the existing rules by aligning regulatory capital requirements more closely to the underlying risks that banks face. In addition, the Basel II Framework is intended to promote a more forward-looking approach to capital supervision, one that encourages banks to identify the risks they may face, today and in the future, and to develop or improve their ability to manage those risks. As a result, it is intended to be more flexible and better able to evolve with advances in markets and risk management practices.”

There was Sarbanes-Oxley, designed to assure the adequacy of corporate systems to withstand abuse and failure.

Then XBRL, designed to standardise the reporting of financial information so that apples, as it were, could be compared to apples.

So, many fine minds were bent on financial transparency and control just as the errors stoking the crisis were creeping up. And measures were taken more energetically in the US than elsewhere – while it was in the US that the storm brewed up.

Some senior banking executives say that they didn’t know what their employees were doing. Though this may be hard to like, it is not hard to believe. Information is filtered as it floats to the top of organisations. Employees don’t like to be the bearers of bad news. And in any case, the writing on the wall can be ignored.

Would improved information management really make a difference? Surely it would. The motion “if information were improved, it would make no difference to our decisions” would surely be defeated.

“When the facts change, I change my mind”, said Keynes and, I think, Kissinger after him. But to change your mind, you must know of the change in the facts.

Keynes’ dictum also skirts the (mis-)appropriation of facts to support opinions:

“Scholars assume that citizens perform better when they know pertinent facts. Factual beliefs, however, become relevant for political judgments only when people interpret them. Interpretations provide opportunities for partisans to rationalize their existing opinions.”

You could almost reverse him, thus: “When my mind changes, I change the facts. What do you do, Sir?”