At this early stage of the project, I wasn't too concerned about human visitors (there aren't many, honestly); I was concerned about the search engine bots. The log file showed that Googlebot visited my site daily, but it stopped at the main page and did not crawl any further. So every day there was a single, isolated Googlebot log entry for the main page and nothing else.
Like this:
2007-04-25 22:05:58 66.249.66.138 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) /xoops/modules/wtHome/ ref=
That does not make sense: there are plenty of simple links on my front page that any search engine crawler should be able to follow. Yet these isolated log entries repeated every day, and Google just didn't crawl my project website. What's worse, searching for "site:wt-toolkit.sourceforge.net" on Google still gives me the "Generated Javascript Documentation" result, which indicates that Google completely ignored the new project website despite the fact that it had already seen the main page a few times.
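The "one isolated Googlebot entry per day" pattern can be checked mechanically instead of by eye. Here is a minimal Python sketch, assuming every log line has the same layout as the single entry quoted above (date, time, IP, user agent, path, then "ref="). The regular expression and the function names are my own illustration, not part of any logging tool.

```python
import re
from collections import defaultdict

# Assumed layout, inferred from the one sample entry in the post:
# date time ip user-agent path ref=referrer
LOG_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<ip>\S+) (?P<agent>.*) (?P<path>/\S*) ref=(?P<ref>.*)$"
)


def googlebot_paths_by_day(lines):
    """Map each date to the list of paths Googlebot requested that day."""
    visits = defaultdict(list)
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and "Googlebot" in match.group("agent"):
            visits[match.group("date")].append(match.group("path"))
    return dict(visits)


def isolated_days(visits):
    """Return the dates on which Googlebot made exactly one request."""
    return sorted(day for day, paths in visits.items() if len(paths) == 1)


# The log entry quoted in the post, split across two string literals.
sample = [
    "2007-04-25 22:05:58 66.249.66.138 Mozilla/5.0 (compatible; Googlebot/2.1; "
    "+http://www.google.com/bot.html) /xoops/modules/wtHome/ ref=",
]

visits = googlebot_paths_by_day(sample)
print(visits)
print(isolated_days(visits))
```

If the crawler starts following links, the days listed by `isolated_days` should thin out and the per-day path lists should grow beyond the main page.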
While there's some fancy Javascript trickery on my project website (like the project logo), most of the site is written in traditional PHP/HTML so that search engine crawlers can easily understand it. The project website looks perfectly legible even with Javascript disabled. What could possibly be going wrong here?
Today I found a tool that claims to simulate what Googlebot sees on your website.
"Be The Bot"
http://www.avivadirectory.com/bethebot/#
So I entered "http://wt-toolkit.sourceforge.net" into the tool, and surprise! It says Googlebot sees a completely empty page there.
How could that happen? Immediately I thought of the redirecting index.php I had put in the root directory of WT Toolkit's project website. It had only one line of PHP code (three lines if you count the PHP opening and closing tags):
<?php
header("location: xoops/");
?>
I put it there because I installed XOOPS (the CMS behind WT Toolkit's project website) under the xoops directory rather than the root directory. I did that for convenience. Going to "xoops/" gives you yet another redirection, which takes you to the "Home" module's URL, "/xoops/modules/wtHome/".
Was Googlebot not able to process the redirection? It seems able to follow redirections; otherwise it wouldn't show up in the log file visiting "/xoops/modules/wtHome/". Be The Bot's simulation left the same log entry in my site log file, however.
So I entered the URL without redirections into Be The Bot: http://wt-toolkit.sourceforge.net/xoops/modules/wtHome/
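To see what each hop of the redirect chain actually returns, one option is to request the pages without auto-following redirects and print the status, Location header, and body size at every step. The Python sketch below does that with a Googlebot-like User-Agent. It is only an approximation: it does not reproduce Googlebot or Be The Bot, and `http_fetch` and `trace_redirects` are names I made up for illustration. One thing worth looking at in the output is the Location value itself. The index.php above sends a relative one ("xoops/"), while the HTTP/1.1 specification of the time (RFC 2616) asked for an absolute URI there. Whether that has anything to do with the empty page is not established.

```python
import http.client
from urllib.parse import urljoin, urlsplit

# User-Agent string copied from the log entry in the post.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

REDIRECT_CODES = (301, 302, 303, 307, 308)


def http_fetch(url, user_agent=GOOGLEBOT_UA):
    """Fetch one URL without following redirects.

    Returns (status, Location header or None, body length in bytes).
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path, headers={"User-Agent": user_agent})
        response = conn.getresponse()
        body = response.read()
        return response.status, response.getheader("Location"), len(body)
    finally:
        conn.close()


def trace_redirects(url, fetch=http_fetch, max_hops=5):
    """Follow redirects one hop at a time, recording what each hop returned."""
    hops = []
    for _ in range(max_hops):
        status, location, size = fetch(url)
        hops.append((url, status, location, size))
        if status not in REDIRECT_CODES or not location:
            break
        # A relative Location such as "xoops/" is resolved against the current URL.
        url = urljoin(url, location)
    return hops
```

Calling `trace_redirects("http://wt-toolkit.sourceforge.net/")` and printing each tuple would show where the chain ends and how many bytes the final page returned. The `fetch` parameter is there so the hop logic can be tried against canned responses without touching the network.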
This time, it displayed the project website correctly, albeit without the images.
Something was definitely wrong there. The log file indicates that Be The Bot was redirected to "/xoops/modules/wtHome" successfully, yet it couldn't retrieve the HTML correctly. Without redirection, the correct HTML content was retrieved. XOOPS might be part of the problem here, but I'm not sure.
Anyway, this means I have to restructure the project web site a bit so that the main page can be retrieved without redirection. This is not difficult... Done. No redirections for the main page now.
Let's see whether Google crawls it correctly tomorrow or in a few days.
1 comment:
Try phpsitemapNG for XOOPS to generate a Google Sitemap and
register your site in Google Webmaster Central
HTH