Monday, April 30, 2007

Why is Google Bot not crawling WT Toolkit website?

To monitor traffic to WT Toolkit's website, I sneaked in some PHP code that has been logging information about incoming visitors since 24th April - just six days ago.

At this early stage of the project, I wasn't too concerned about human visitors (there aren't too many, honestly); I was concerned about the search engine bots. The log file indicated that Googlebot would visit my site daily, but it stopped at the main page and did not crawl further. So every day, there's an isolated log entry of Googlebot visiting the main page once and doing nothing else.

Like...
2007-04-25 22:05:58 66.249.66.138 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) /xoops/modules/wtHome/ ref=
That does not make sense: there are plenty of plain links on my front page that any search engine crawler should be able to follow. But these isolated log entries repeated every day; Google just didn't crawl my project website. What's worse, searching for "site:wt-toolkit.sourceforge.net" on Google still gives me the old "Generated Javascript Documentation" result, which indicates that Google completely ignored the new project website despite the fact that it has seen the main page a few times already.

While there's some fancy Javascript trickery on my project website (like the project logo), most of the site is written in traditional PHP/HTML so that search engine crawlers can easily understand it. The project website looks perfectly legible even with Javascript disabled. What could possibly go wrong here?

I found a tool today that claims to be able to simulate what Googlebot sees on your website.

"Be The Bot"
http://www.avivadirectory.com/bethebot/#

So I entered "http://wt-toolkit.sourceforge.net" into the tool, and surprise! It says Googlebot sees a completely empty page there.

How could that happen? Immediately I thought of the redirecting index.php I put up in the root directory of WT Toolkit's project website. It only had one line of PHP code (three lines if you count the PHP opening and closing tags):
<?php
header("Location: xoops/");
?>
I put it there because I installed XOOPS (the CMS behind WT Toolkit's project website) under the xoops directory rather than the root directory, for convenience. Going to "xoops/" gives you yet another redirection, which lands you at the "Home" module's URL, "/xoops/modules/wtHome/".
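
So a crawler hitting the site root actually walks a chain like this (reconstructed from the log entries; the exact status codes are my guess - XOOPS may well send 301s):

GET /                        -> 302, Location: xoops/
GET /xoops/                  -> 302, Location: modules/wtHome/
GET /xoops/modules/wtHome/   -> 200, the actual front page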

Was Googlebot not able to process the redirection? It seemed perfectly able to follow the redirections - otherwise "/xoops/modules/wtHome/" wouldn't show up in the log file at all. And Be The Bot's simulation left the same log entry in my site log, yet it still reported an empty page.

So I entered the redirect-free URL into Be The Bot: http://wt-toolkit.sourceforge.net/xoops/modules/wtHome/

This time, it displayed the project website correctly, albeit without the images.

Something was definitely wrong there. The log file indicated that Be The Bot was redirected to "/xoops/modules/wtHome/" successfully, yet it couldn't retrieve the HTML correctly; without the redirections, the correct HTML content came through. XOOPS might be part of the problem here, but I'm not sure.

Anyway, this means I have to restructure the project website a bit so that the main page can be retrieved without redirection. This is not difficult... Done. No redirections for the main page now.

Let's see if Google can crawl it correctly tomorrow, or in a few days.

WT Toolkit 0.3.3 Performance Optimizations

A problem that has plagued WT Toolkit ever since its birth is performance. WT Toolkit 0.1.x and 0.2.x felt slow all the time because of the garbage collector running in the background. 0.3.0 eliminated the background garbage collector while keeping collection automatic, so the programmer doesn't have to care about lapsed listeners (well, most of the time).

But as of 0.3.2, our performance is still bad compared to other popular toolkits like Dojo Toolkit and Qooxdoo. Widget creation latency grows linearly - and steeply - with the number of on-screen widgets. The effect isn't very noticeable under Firefox, but WT Toolkit 0.3.2 definitely felt slow under Internet Explorer 6 or 7.

Well... not anymore for the upcoming WT Toolkit 0.3.3! Even though I've already submitted my FYP final report, new work has begun on performance optimizations! Yeah, baby!

How much have we optimized? Let's see what a little trick called "delayed execution" (available in WT Toolkit 0.3.3) can do...

Before optimizations:

[image]

After optimizations:

[image]
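
For the curious, the general trick looks something like the sketch below. This is my own minimal illustration of the technique (the names are made up; it's not WT Toolkit's actual API): widget-creation calls return immediately, while the expensive DOM work is queued up and flushed later in one batch.

// A minimal sketch of "delayed execution": instead of doing the DOM work
// synchronously inside every widget constructor, queue it and run the
// whole batch in a single timeout.
var pendingTasks = [];
var flushScheduled = false;

function defer(task) {
    pendingTasks.push(task);
    if (!flushScheduled) {
        flushScheduled = true;
        setTimeout(function () {
            flushScheduled = false;
            var tasks = pendingTasks;
            pendingTasks = [];
            for (var i = 0; i < tasks.length; i++) {
                tasks[i](); // run the whole batch in one go
            }
        }, 0);
    }
}

// Creating 500 widgets now just queues 500 small tasks; the creation
// calls return instantly and the DOM work happens in one later batch.
for (var i = 0; i < 500; i++) {
    defer(function () {
        document.body.appendChild(document.createElement("div"));
    });
}

The win is biggest on slow DOM implementations like IE6/7, where doing the work eagerly, one widget at a time, keeps the browser busy inside every single creation call.
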
With me occupied by the performance optimizations, I left the work on the WT Toolkit website to Marco. He couldn't complete it by 29th April because he had other academic work to do at the time. But anyway, we're making steady progress on WT Toolkit's website; we'll be seeing more and more amazing things as time goes on. :-)

Sunday, April 29, 2007

Deleted

Group norms really are not to be challenged.

Whatever, I'm a slacker. I have nothing to say.

Friday, April 27, 2007

FYP Poster for WT Toolkit

[poster image]

Drawn and color-printed last night with GIMP and Inkscape. Marco pasted the individual A4 sheets onto the poster board and handed it in to the CSE department.

Wednesday, April 25, 2007

Plans for WT Toolkit

Now that we are fairly feature-complete, it's time for some publicity.

27/4/2007 - FYP Poster
29/4/2007 - Completed WT Toolkit Website
1/5/2007 - Submit WT Toolkit to Ajaxian
1/5/2007 - Submit WT Toolkit to freshmeat
1/5/2007 - Submit WT Toolkit to Open Directory Project
13/5/2007 - Visual programming demo for WT Toolkit
19/5/2007 - FYP Code CD (Not sure what contents are needed, probably a Linux LiveCD)
21/5/2007 - FYP Presentation (schedule - we are group DE3)

Monday, April 23, 2007

Watched Michael and John's FYT pre-presentation today

Why did I go to the presentation? Because I talked to Michael about his research this morning and found it interesting.

The topic was implementing an efficient DHT on an ad-hoc mobile network. Efficient DHTs for the fixed-line, broadband Internet already exist - Chord and Pastry, for example - and everybody is using them, knowingly or unknowingly. Michael's research is about how to make DHTs efficient on ad-hoc mobile networks, which is much harder than implementing DHTs on top of our everyday IP network. Some difficulties include:

1. Message routing. Mobile nodes do not and should not have fixed routes like our desktop computers. Although routing on the physical network can be partially solved by things like AODV, you still have to make sure the hops on the DHT's overlay network are efficient. For example, even if you've got perfect data routing in the physical network, it's still useless if one of the DHT hops goes to another country with a 12-hour timezone difference - your message will hop across a great many nodes in the physical network for just one DHT hop.

2. Bandwidth overhead. (??) I don't know how bad the problem is since I haven't seen the simulations myself. Probable causes I've heard are AODV-style flooding and Bloom filter inefficiencies. Gnutella-style implementations were mentioned for the audience to point and laugh at, I guess.

One of the papers related to Michael and John's work:
http://www.cs.ucsb.edu/~ravenben/publications/pdf/idlp-comsware07.pdf

Now, what's Michael and John's proposed solution? They proposed a DHT organized in a tree-like fashion, instead of the ring/skip-list style seen in Chord or Pastry. The root node of the tree is called a "landmark"; it should have a fixed location, but it needs no extra hardware resources compared to other nodes. Their algorithm takes care of the physical routing as well, so there's no need for AODV flooding or playing with Dijkstra's algorithm as in SrcRR. No AODV, no route request flooding, less bandwidth overhead. Bloom filters are used to narrow down and select paths in the tree; despite the seemingly cryptic name, they are very intuitive and easy to understand - just a simple trick with bits, with the hard probability maths done for you 30 years ago.
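
For reference, the bit trick goes roughly like this - a toy Bloom filter of my own for illustration, not Michael and John's code. Inserting a key sets k bits chosen by k hash functions; a lookup tests the same bits, so it can return a false positive but never a false negative, and the false positive probability is exactly the part that was worked out decades ago.

// Toy Bloom filter: numHashes hash functions set/test numHashes bits.
function BloomFilter(numBits, numHashes) {
    this.bits = new Array(numBits);
    this.numBits = numBits;
    this.numHashes = numHashes;
}

// A cheap seeded string hash - real implementations use stronger ones.
BloomFilter.prototype.hash = function (key, seed) {
    var h = seed;
    for (var i = 0; i < key.length; i++) {
        h = (h * 31 + key.charCodeAt(i)) % this.numBits;
    }
    return h;
};

BloomFilter.prototype.add = function (key) {
    for (var j = 1; j <= this.numHashes; j++) {
        this.bits[this.hash(key, j)] = true;
    }
};

BloomFilter.prototype.mayContain = function (key) {
    for (var j = 1; j <= this.numHashes; j++) {
        if (!this.bits[this.hash(key, j)]) {
            return false; // definitely not present
        }
    }
    return true; // probably present (small chance of a false positive)
};

var filter = new BloomFilter(1024, 3);
filter.add("node-42");
filter.mayContain("node-42"); // true
filter.mayContain("node-99"); // false, with high probability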

Prof. Gary Chan asked lots of questions during the presentation; he has a very sharp sense for things that seem "strange" or inefficient. The object duplication algorithm in John's presentation (put in to make p2p swarming possible) was one of the quirks Gary spotted - it seemed like a placeholder, though I guessed it shouldn't be too hard to correct.

So what have I got from the seminar? Let's see:
1. A revision of some old algorithms (Bloom filters... I had almost completely forgotten them, having never used them once in the past few years), some newly learned ones, and some new problems.
2. The 40 minutes of presentation time I've got for my FYP is preciously short. Michael and John's presentation ran for about 1.5 hours, and they were still missing some details.
3. I need to keep my audience interested by doing demonstrations, with both WT Toolkit and WT Toolkit's competitors.

Regular Expressions - how good theory is ignored in popular software

You'd think the regular expression implementations in Java, Perl, Python, PHP, Ruby, and PCRE (a C library) must have been refined many times over and are therefore highly optimized? Think again.

http://swtch.com/~rsc/regexp/regexp1.html

The title of the article is "Regular Expression Matching Can Be Simple And Fast", but what's more interesting is the subtitle: "(but is slow in Java, Perl, PHP, Python, Ruby, ...)". Slow? How slow? Look at the first graph in the article: for some pattern matching inputs, Perl 5.8.7's built-in regular expression matching is millions of times slower than a 40-year-old algorithm.
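
You can reproduce the blow-up yourself. Here's a quick sketch (my code, not the article's) of the pathological family from the article - the pattern a?^n a^n matched against a string of n a's. Javascript's regex engine backtracks just like Perl's, so the runtime roughly doubles every time you bump n by one:

// n optional a's followed by n required a's, matched against exactly n
// a's. The match succeeds (every a? matches empty), but a backtracking
// engine may explore up to 2^n combinations of the optionals first.
function repeat(s, n) {
    return new Array(n + 1).join(s);
}

var n = 25; // try raising this one step at a time
var pattern = new RegExp("^" + repeat("a?", n) + repeat("a", n) + "$");
var input = repeat("a", n);

var start = new Date().getTime();
pattern.test(input);
alert("n = " + n + ": " + (new Date().getTime() - start) + " ms");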

How can that happen? Well... it could be argued that the expression used in the example is a pathological case. But is it pathological in theory, i.e. a problem not in P, or in P with a very large exponent? Obviously not - otherwise the 40-year-old algorithm wouldn't be able to perform the matching quickly either.

What actually happened here is this: all the popular language implementers (Java, Python, Perl, PHP, etc.) copied or borrowed their implementations from a popular backtracking extended-regular-expression matcher that was known to be "fast enough", but not provably fast. Forty-plus years of finite automata theory went into the trash bin when programmers (including Ken Thompson, the very guy who invented the correct algorithm 40 years ago!) needed to release software fast and neglected to spend time thinking about the mathematics behind it.
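
To see what the theory buys you, here's a minimal sketch of the state-set idea behind that 40-year-old algorithm - my own illustration, handling only the literals-plus-'?' patterns from the benchmark above, not a full regex engine. Instead of trying one alternative at a time and backtracking, it advances the set of all reachable pattern positions one input character at a time, so each character costs at most pattern-length work and there's no exponential blow-up:

// Linear-time matching for patterns made of literal characters, each one
// optionally followed by '?'. states[i] means "the first i pattern atoms
// can match the input consumed so far".
function matchSimple(pattern, text) {
    // Parse the pattern into atoms: { ch, optional }.
    var atoms = [];
    for (var i = 0; i < pattern.length; i++) {
        var optional = pattern.charAt(i + 1) === "?";
        atoms.push({ ch: pattern.charAt(i), optional: optional });
        if (optional) i++;
    }

    // Optional atoms may be skipped without consuming any input.
    function addSkips(states) {
        for (var j = 0; j < atoms.length; j++) {
            if (states[j] && atoms[j].optional) states[j + 1] = true;
        }
        return states;
    }

    var states = addSkips([true]); // position 0 is reachable at the start
    for (var t = 0; t < text.length; t++) {
        var next = [];
        for (var j = 0; j < atoms.length; j++) {
            if (states[j] && atoms[j].ch === text.charAt(t)) next[j + 1] = true;
        }
        states = addSkips(next);
    }
    return !!states[atoms.length]; // did we reach the end of the pattern?
}

// Reusing repeat() from the sketch above: instant even where the
// backtracking version takes forever.
matchSimple(repeat("a?", 25) + repeat("a", 25), repeat("a", 25)); // true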

The regular expression engine the article's author described is only a very simple one, however. Can it be extended to handle modern extended regular expressions without falling into the same performance hell as Perl, PCRE, Python, etc.? The author gave some justification that it can, but he was very light on the details. Yet even if he has missed some detail that makes his proposal infeasible, it still stands that the regex engines we use every day are far from optimal.

Biometrics a fad?

How secure is it to use your fingerprint as an authentication token? Much research has been done on it, so it must be secure, right?

But wait a moment... you leave your fingerprints everywhere, every day. It's pretty much public information. And using public information as a secret key sounds like a dumb idea, doesn't it?

[embedded video]

Yup... it's dumb. Anybody can crack a fingerprint scanner with a printer, transparency slides, PCB etching tools, and some moldable plastic. It is, at its heart, security by obscurity. And it's remarkable how much bullshit went into that "unbreakable door lock" in the video. Using moisture as an authentication condition?! Oh come on, is moisture really so scarce or secret on Earth? Now what's next? Iris scanners? Your iris pattern can be captured anywhere - in 3D, even. It might be a little more difficult to capture and reproduce, but it's public information nonetheless. If what they are betting on is the resolution of cameras (which will definitely improve as time goes on), then they're relying on security by obscurity.

It's remarkable how far snake oil technologies can make it into the market, government institutions, and even academia.

By the way, the video rocks! It feels like reading an early issue of Phrack magazine (most of the hacks don't work anymore, of course - but wait, the fork bomb still works) or some of the classic papers/theses (like Chord's). Easy to read, concrete procedures, concrete results, and profound implications.

WT Toolkit listed under opensource.hk

[screenshot]

Just saw this when I was searching on Google. Good to know there are people who know my project exists, and that there are other people doing the same thing as me.

URL: http://opensource.hk/opensrcproj

Among the projects, the only other ones I recognize are CK-ERP and RMSS. CK-ERP's author, C.K. Wu, has worked on his project for many years and has posted many advertisements in local newsgroups. Sadly, there are rarely any public replies to him. There were probably quite a number of people talking to him privately, though, as ERP systems are generally very expensive and have a major impact on business.