More commentary on why we are going to continue needing human intervention in the parsing of information. Peter Norvig, Director of Search Quality at Google, speaks on the problem of how to ensure proper "tagging" of information.
One challenge is ensuring consistency, even in seemingly minor matters of spelling, style (for instance, how names are written and presented), "correct" transliteration from one alphabet to another and the "proper" handling of abbreviations. Anyone trained at the grindstone of a newspaper knows how such style, grammar and spelling rules are knocked into heads. Not everything published on the Web goes through that same vetting process.
Somebody’s got to do that kind of canonicalization. So the problem of understanding content hasn’t gone away; it’s just been forced down to smaller pieces between angle brackets. […]
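As a rough sketch of what that kind of canonicalization involves, the snippet below uses invented rules and names (nothing any search engine actually uses): accents are stripped as a crude stand-in for transliteration, optional punctuation is dropped, and one abbreviation is expanded according to a made-up house style.

```python
import re
import unicodedata

def canonicalize_name(name: str) -> str:
    """Toy canonicalizer: fold surface variants of a name to one form."""
    # Strip accents ("Müller" -> "Muller") -- a crude stand-in for real
    # transliteration, which needs per-language tables and human review.
    name = unicodedata.normalize("NFKD", name)
    name = "".join(ch for ch in name if not unicodedata.combining(ch))
    # Drop punctuation that style guides treat as optional ("Jr." vs "Jr").
    name = name.replace(".", "")
    # Expand an abbreviation per a made-up style rule.
    name = re.sub(r"\bjr\b", "junior", name, flags=re.IGNORECASE)
    # Collapse whitespace and normalize capitalization.
    return " ".join(name.split()).title()

# Two spellings of the same person end up as the same string.
print(canonicalize_name("Müller, Hans Jr."))   # Muller, Hans Junior
print(canonicalize_name("müller,  hans jr"))   # Muller, Hans Junior
```

Real pipelines still need human editors to decide which of the competing forms is the canonical one; the code only enforces a decision once it has been made.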
Another challenge is battling intentional deception on the part of some of the people tagging information on the Internet.
The last issue is the spam issue. When you’re in the lab and you’re defining your ontology, everything looks nice and neat. But then you unleash it on the world, and you find out how devious some people are. What this indicates is, one, we’ve got a lot of work to do to deal with this kind of thing, but also you can’t trust the metadata. You can’t trust what people are going to say. In general, search engines have turned away from metadata, and they try to hone in more on what’s exactly perceivable to the user. For the most part we throw away the meta tags, unless there’s a good reason to believe them, because they tend to be more deceptive than they are helpful. And the more there’s a marketplace in which people can make money off of this deception, the more it’s going to happen. Humans are very good at detecting this kind of spam, and machines aren’t necessarily that good. So if more of the information flows between machines, this is something you’re going to have to look out for more and more.
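To make the "trust what the user can perceive" point concrete, here is a minimal, hypothetical sketch, not anything a real search engine does: the suspicious_metadata function and the sample page are invented, and BeautifulSoup is assumed to be available. It flags declared meta keywords that never show up in the page's visible text.

```python
from bs4 import BeautifulSoup  # assumed third-party dependency

def suspicious_metadata(html: str) -> set[str]:
    """Return declared meta keywords that never appear in the visible text."""
    soup = BeautifulSoup(html, "html.parser")
    meta = soup.find("meta", attrs={"name": "keywords"})
    content = meta.get("content", "") if meta else ""
    declared = {kw.strip().lower() for kw in content.split(",") if kw.strip()}
    # Only the text a reader actually sees, with the markup thrown away.
    visible = soup.get_text(" ", strip=True).lower()
    return {kw for kw in declared if kw not in visible}

page = """<html><head>
  <meta name="keywords" content="cheap flights, hotels, gardening tips">
</head><body><p>Book cheap flights and hotels here.</p></body></html>"""

print(suspicious_metadata(page))  # {'gardening tips'} -- metadata not backed by content
```

A spammer who controls both the tags and the page can of course defeat a check this naive, which is exactly Norvig's point about humans still being better at spotting deception than machines.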