Yesterday, Microsoft announced that it would be acquiring Powerset, a semantic analysis company. The idea is a good one: bring in linguistic specialists to help improve search and advertising targeting.
Don Dodge explains this further on his blog.
I do have a couple of comments for Don on linguistics, search, and ads.
Both of these sites go a step beyond PageRank and try to cluster and rank content from a limited set of sources. It's a simple idea: the quality that comes out goes up if quality goes in. Now, leveraging trusted sources will get you only so far. There's more clustering work to do, and TechMeme does a pretty good job here.
Notice how Powerset is able to leverage trusted sources too. At the time Powerset went public with its search, Michael Arrington pointed out how good Powerset's search results were, so a few of us took the search challenge. And what did we find? That in large part, by restricting a search to Wikipedia pages on non-semantic-aware search engines, we could return the same results. Limiting the search to Wikipedia, in all its permutations, was the key. Powerset just does this by default because that's all it indexes.
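To make the trick concrete, here's a minimal sketch of restricting an ordinary keyword search to Wikipedia with the standard `site:` operator. The engine URL is just an illustration; any keyword engine that honors `site:` would do.

```python
from urllib.parse import urlencode

def wikipedia_restricted_query(terms: str) -> str:
    """Build a search URL that limits results to Wikipedia pages,
    approximating a Powerset-style result set on a plain keyword
    engine. The engine endpoint below is illustrative."""
    q = f"{terms} site:en.wikipedia.org"
    return "https://www.bing.com/search?" + urlencode({"q": q})

# Example: the same kind of factual query we tried in the challenge.
print(wikipedia_restricted_query("capital of Mongolia"))
```

The point isn't the two lines of code; it's that the index restriction, not semantic analysis, did most of the work.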
Now, this doesn't mean that context isn't a good thing. Knowing whether a post is a review or commentary is worthwhile and should help when matching it against content. However, for high-quality content you'd be surprised how often context is self-contained or easily extractable.
I do suggest that Microsoft take a step back and realize that there's quite a bit of low-hanging fruit here, courtesy of some human input.
For instance, I've long advocated to Microsoft that it focus some of its efforts on delivering quality ads to none other than its MVP community. First, the sites and blogs that the MVP community members manage have already been vetted. It's like TechMeme's and thredr's blog lists. I can guarantee you that few if any of the sites these people run want ads outside their domains, yet most of the ad services are just as happy to deliver ads from eBay or NexTag and the like. There's economic incentive to do this, unfortunately. These poorly targeted ads, though, junk up the sites and lower the overall content quality. Is it any wonder that TechCrunch, et al., use focused advertising and not Google's or Microsoft's ad service? Nope.
One other point here: a computer algorithm isn't going to be the ultimate answer. Remember, people write the programs. In fact, the code is doing the editorializing itself; it's just automated. Further, a great search or ad algorithm is going to require constant adjustments. That will be its value: it will have scale plus freshness. Take either of these away and it'll be less interesting.
Oh, and another issue that most online sites run into: a big company can't buy an ad on a little site. A large enterprise won't have the processes to cut a payment to an unknown vendor. If you don't believe me, ask around. Plain and simple, they have to go through intermediaries, which only creates inefficiencies. It's a simple fact. Creating an ad-purchasing infrastructure that lets the biggest buyers buy from the smallest sites and the smallest buyers buy from the largest sites is the key here. I can't explain it any more simply than that. It's not rocket science.
As you may know, I've been advocating for a while that we index content in ways that leverage more than text. I'll dig up a link when I get a chance. But until then, it doesn't take much brainstorming to realize that there's great value out there that electronic devices can sense and record, and leveraging this information can be quite useful, more so than text in many instances, I believe.
Finally, enterprise search. Is there a partially naive approach to enterprise search like there is to content clustering of blogs? Could be. Inside an organization there's quite a bit of standardization, even among its loose collections of information. Some of it will already be in databases with pseudo-meaningful column headings. But there's also a bunch of information that's easily mineable in chart and column data that needs to be "reverse generated." In other words, the data was once structured in a computer and then generated as flat output, yet its value to a typical search engine is as flat as all the other words. It's actually not, but most indexing and extraction tools don't know any better. Is linguistics the solution? Not here. There are much simpler and more direct approaches available.
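As a rough sketch of what "reverse generating" can look like: machine-generated reports keep their column layout even after they've been flattened to text, so a few lines of parsing recover the original records. The report text and column names below are made up for illustration.

```python
import re

# Hypothetical flat output from some internal reporting system.
REPORT = """\
REGION      Q1 SALES   Q2 SALES
North         120000     135000
South          98000     101500
West          143250     150900
"""

def parse_columnar_report(text: str) -> list[dict]:
    """Recover structured rows from machine-generated column output.
    Assumes a header row and columns separated by runs of two or
    more spaces -- the kind of regularity generated reports have."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = re.split(r"\s{2,}", lines[0].strip())
    return [dict(zip(header, re.split(r"\s{2,}", line.strip())))
            for line in lines[1:]]

for row in parse_columnar_report(REPORT):
    print(row)
```

Once the rows are records again, they're queryable like any database table instead of being just another run of words to the indexer.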
I worked at a company a while back that did quite well turning computer-generated content back into queryable content. You'd be surprised how simple and valuable this can be. Yeah, you could try to hook all of your disparate databases together so the CEO/CFO can ask this or that question of the enterprise, but it's actually far easier to analyze the computer-generated content. Go figure. So my advice to everyone is to look for this type of enterprise content first and analyze it. I bet you'll be surprised how valuable this lost content will be. Just think about it: there was a reason someone purchased this content from the outside or generated it internally. It has value. It's not like all the other words.