Google’s BlogSearch has some bugs in its index

There’s no doubt in my mind: something is up with Google’s BlogSearch. Completely valid blog results sometimes appear and sometimes don’t, depending on whether I go to the BlogSearch results the first time, click on Sort by relevance or Sort by date, or change the time range from one hour to, let’s say, 12 hours. In the latter case, I fully expect blog entries that were picked up in the last hour to also appear in the last 12 hours list. Sometimes they do. Sometimes they don’t.
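To make that expectation concrete, here is a minimal sketch in Python of the check I have in mind: anything found in the narrow time window should also be in the wider one. The function name and the example URLs are hypothetical placeholders, and the two lists are assumed to have been copied by hand from the result pages, not fetched from any BlogSearch API.

```python
def missing_from_wider_window(narrow_results, wide_results):
    """Return URLs that appear in the narrow time window (e.g. last hour)
    but are absent from the wider one (e.g. last 12 hours)."""
    wide = set(wide_results)
    return [url for url in narrow_results if url not in wide]


# Hypothetical result lists, copied by hand from two result pages.
last_hour = ["http://blog-a.example/inkwell-post"]
last_12_hours = [
    "http://blog-b.example/tablet-post",
    "http://blog-c.example/other-post",
]

# If the index were consistent, this list would be empty.
print(missing_from_wider_window(last_hour, last_12_hours))
```

An empty list means the wider window contains everything the narrow one does. Anything printed is a result that dropped out when the time range was widened.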

Here are some screenshots that show the problem in one small case. Here’s a Search By Relevance query I did for “InkWell and Tablet” for the last hour:

SortByRelevanceLastHour.PNG

It has one result from winbeta.org.

And now here are the results from doing a Search by Relevance query for the last day:

SortByRelevanceLastDay.PNG

Interestingly, the winbeta.org link isn’t there. Was it pruned out accidentally, or is this maybe Google’s effort to prune out spam posts? Weird.

To make it even more confusing, here are the results for Search by Date for the last day:

SearchByDateLastDay.PNG

(By the way, I cropped out a spam link at the bottom.)

Surprisingly, the winbeta.org link now shows up, but the link to Gottabemobile–which is a high-quality link–is gone. Hmmm. This shakes my confidence in Google’s BlogSearch.

What about the other search engines?

Technorati does an OK job, but doesn’t get all the links. Nor does BlogLines. Oh, and good old reliable TechMeme has picked up the story and a couple of links, although it’s not comprehensive–it’s missing Gottabemobile. Not bad, but since the other search results are so bad, I wish TechMeme were even better. I trust it more and more.

Why can’t someone do a good blog search????

Update: An engineer from Google emailed me to let me know that at the end of the listings there is some text you can click on to reveal blog entries that are very similar. The filtering is a good idea because there are quite a few “spam” sites that copy/paste news from other blogs. There’s no reason to give them any exposure. This might explain some of what I was–or should I say, was not–seeing. However, I’m not sure it explains the differences I was encountering when switching between “Search by Relevance” and “Search by Date.” There was at least one high-quality site that got dumped. Maybe it was removed because of the duplicates. The problem is that it was the site others were copying. I’ll have to pay more attention to this. It may not be a bug that I was encountering, but something by design. Either way, it was producing results that didn’t make sense to me. I rely on Google’s BlogSearch quite a bit, so I hope this is more of a problem with me and how I use it than a problem with the search engine itself. I’ll be watching more closely.