Web Finger proposals overview


If all you had was an email address, would it not be nice to have a mechanism to find someone’s home page or OpenId from it? Two proposals have been put forward to show how this could be done. I will look at them and add a sketch of my own that should hopefully lead us to a solution that takes the best of both proposals.

The WebFinger GoogleCode page explains what WebFinger is very well:

Back in the day you could, given somebody’s UNIX account (email address), type

$ finger email@example.com 

and get some information about that person, whatever they wanted to share: perhaps their office location, phone number, URL, current activities, etc.

The new ideas generalize this to the web, by following a very simple insight: if you have an email address like henry.story@sun.com, then the owner of sun.com is responsible for managing that email. That is the same organization responsible for managing the web site http://sun.com. So all that is needed is some machine-readable pointer from http://sun.com/ to a lookup giving more information about the owner of the email address. That’s it!

The WebFinger proposal

The WebFinger proposed solution showed the way, so I will start from there. It is not too complicated, at least as described by John Panzer’s “Personal Web Discovery” post.

John suggests a convention whereby servers have a file at the /host-meta root location of the HTTP server to describe metadata about the site. (This seems to me to break web architecture. But never mind: the resource http://sun.com/ can have a link to some file that describes a mapping from email ids to information about them.) The WebFinger solution is to have that resource be in a new application/host-meta file format (not XML, by the way). This file would contain mappings of the form

Link-Pattern: <http://meta.sun.com/?q={%uri}>; 

So if you wanted to find out about me, you’d be able to do a simple HTTP GET request on http://meta.sun.com/?q=henry.story@sun.com, which will return a representation in another new application/xrd+xml format about the user.
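The client side of this is simple enough to sketch. Assuming (my reading, not the spec’s words) that {%uri} means “substitute the percent-encoded identifier here”, expanding the Link-Pattern template could look like this:

```python
from urllib.parse import quote

def expand_link_pattern(template: str, identifier: str) -> str:
    # {%uri} is the slot for the percent-encoded identifier (assumed semantics)
    return template.replace("{%uri}", quote(identifier, safe=""))

# expand the pattern published in sun.com's host-meta file
lookup_url = expand_link_pattern("http://meta.sun.com/?q={%uri}",
                                 "henry.story@sun.com")
# the client would then issue a plain HTTP GET on lookup_url
```

The client never needs to understand anything about the server’s internal layout; it just fills the template and dereferences the result.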

The idea is really good, but it has three more or less important flaws:

  • It seems to require, by convention, that all web sites set up a /host-meta location on their web servers. Making such a global requirement seems a bit strong, and does not in my opinion follow web architecture. It is not up to a spec to describe the meaning of URIs, especially those belonging to other people.
  • It seems to require a non-XML application/host-meta format.
  • It creates yet another file format to describe resources, application/xrd+xml. It is better to describe resources at a semantic level using the Resource Description Framework, and not enter the format battle zone. To describe people there is already the widely known friend-of-a-friend ontology, which can be extended by anyone. Luckily it would be easy for the XRD format to participate in this, by simply creating a GRDDL mapping to the semantics.

All this new format creation is a real pain. New formats require new parsers, testing of the spec, mappings to semantics, etc… There is no reason to do this anymore; it is a solved problem.

But lots of kudos for the good idea!

The FingerPoint proposal

Toby Inkster, co-inventor of foaf+ssl, authored the fingerpoint proposal, which avoids the problems outlined above.

Fingerpoint defines one useful relation, sparql:fingerpoint (available at the namespace of the relation of course, as all good linked data should be), which is defined as

sparql:fingerpoint
	a owl:ObjectProperty ;
	rdfs:label "fingerpoint" ;
	rdfs:comment """A link from a Root Document to an Endpoint Document 
                        capable of returning information about people having 
                        e-mail addresses at the associated domain.""" ;
	rdfs:subPropertyOf sparql:endpoint ;
	rdfs:domain sparql:RootDocument .

It is then possible to have the root page link to a SPARQL endpoint that can be used to query very flexibly for information. Because the link is defined semantically, there are a number of ways to point to the SPARQL endpoint:

  • Using the up-and-coming HTTP Link header,
  • Using the well-tried html <link> element,
  • Using RDFa embedded in the html of the page,
  • By having the home page return any other representation that may be popular or not, such as RDF/XML, N3, or XRD…

Toby does not mention those last two options in his spec, but the beauty of defining things semantically is that one is open to such possibilities from the start.
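Discovering the endpoint from the html <link> element, for instance, is a few lines of code. A sketch (the rel URI here is a placeholder of my own; the real one is whatever namespace the fingerpoint spec assigns to sparql:fingerpoint):

```python
from html.parser import HTMLParser

# placeholder -- substitute the actual namespace URI from the fingerpoint spec
FINGERPOINT_REL = "http://example.org/sparql#fingerpoint"

class FingerpointFinder(HTMLParser):
    """Collects href values of <link> elements carrying the fingerpoint rel."""
    def __init__(self):
        super().__init__()
        self.endpoints = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == FINGERPOINT_REL:
            self.endpoints.append(a.get("href"))

finder = FingerpointFinder()
finder.feed('<html><head>'
            '<link rel="http://example.org/sparql#fingerpoint" '
            'href="http://sun.com/sparql" />'
            '</head><body></body></html>')
# finder.endpoints now holds the SPARQL endpoint(s) the page advertises
```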

So Toby gets more power than the WebFinger proposal, by only inventing one new relation! All the rest is already defined by existing standards.

The only problem one can see with this is that SPARQL, though not that difficult to learn, is perhaps a bit too powerful for what is needed. You can really ask anything of a SPARQL endpoint!

A possible intermediary proposal: semantic forms

What is really going on here? Let us think in simple HTML terms, and forget about machine readable data a bit. If this were done for a human being, what we really would want is a page that looks like the webfinger.org site, which currently is just one query box and a search button (just like Google’s front page). Let me reproduce this here:

Here is the html for this form at its purest, without styling:

     <form action='/lookup' method='GET'>
         <img src='http://webfinger.org/images/finger.png' />
         <input name='email' type='text' value='' />
         <button type='submit' value='Look Up'>Look Up</button>
     </form>

What we want is some way to make it clear to a robot, that the above form somehow maps into the following SPARQL query:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?homepage
WHERE {
   [] foaf:mbox ?email;
      foaf:homepage ?homepage
}

Perhaps this could be done with something as simple as an RDFa extension such as:

     <form action='/lookup' method='GET'>
         <img src='http://webfinger.org/images/finger.png' />
         <input name='email' type='text' value='' />
         <button type='submit' value='homepage'
                sparql='PREFIX foaf: <http://xmlns.com/foaf/0.1/>
                 GET ?homepage
                 WHERE {
                   [] foaf:mbox ?email;
                      foaf:homepage ?homepage
                 }'>Look Up</button>
     </form>

When the user (or robot) submits the form, the page he ends up on is the result of the SPARQL query, where the values of the form variables have replaced the identically named variables in the SPARQL query. So if I entered henry.story@sun.com in the form, I would end up on the page
http://sun.com/lookup?email=henry.story@sun.com, which could perhaps just be a redirect to this blog page… This would then be the answer to the SPARQL query

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?homepage
WHERE {
   [] foaf:mbox "henry.story@sun.com";
      foaf:homepage ?homepage
}

(note: that would be wrong as far as the definition of foaf:mbox goes, which relates a person to a mailbox, not a string… but let us pass over this detail for the moment)

Here we would be defining a new GET method in SPARQL, which finds the web page that the form submission would land on: namely, a page that is the homepage of whoever owns the email address we have.
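The substitution itself is mechanical: bind each submitted form value to the SPARQL variable of the same name. A naive sketch (using SELECT rather than the hypothetical GET verb, and ignoring the foaf:mbox string-vs-mailbox caveat noted above):

```python
from urllib.parse import urlparse, parse_qs

QUERY_TEMPLATE = """\
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?homepage
WHERE {
  [] foaf:mbox ?email ;
     foaf:homepage ?homepage
}"""

def bind_form_values(template: str, submitted_url: str) -> str:
    # replace each SPARQL variable with the form value of the same name
    params = parse_qs(urlparse(submitted_url).query)
    for name, values in params.items():
        template = template.replace("?" + name, '"%s"' % values[0])
    return template

query = bind_form_values(QUERY_TEMPLATE,
                         "http://sun.com/lookup?email=henry.story@sun.com")
# query now asks for the homepage of whoever has that mbox;
# ?homepage remains a variable, since no form field binds it
```

A real implementation would of course need proper escaping and variable-boundary matching, but the shape of the mapping is this simple.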

The nice thing about this is that, as with Toby Inkster’s proposal, we would only need one new relation from the home page to such a finder page, and once such a SPARQL form mapping mechanism is defined, it could be used in many other ways too, so that it would make sense for people to learn it. For example it could be useful to make web sites available to shopping agents, as I had started thinking about in RESTful semantic web services before RDFa was out.

But most of all, something along these lines would allow services to have a very simple CGI to answer such a query, without needing to invest in a full-blown SPARQL query engine. At the same time it makes the mapping to the semantics of the form very clear. Perhaps someone has a solution to do this already. Perhaps there is a better way of doing it. But it is along these lines that I would be looking for a solution…

(See also an earlier post of mine SPARQLing AltaVista: the meaning of forms)

How this relates to OpenId and foaf+ssl

One of the key use cases for such a Web Finger comes from the difficulty people have in thinking of URLs as identifiers of people. Such a WebFinger proposal, if successful, would allow people to type their email address into an OpenId login box; from there the Relying Party (the server that the user wants to log into) could find their homepage (usually the same as their OpenId page), and from there find their FOAF description (see “FOAF and OpenID“).

Of course this user interface problem does not come up with foaf+ssl, because by using client side certificates, foaf+ssl does not require the user to remember his WebID. The browser does that for him – it’s built in.

Nevertheless it is good that OpenId is creating the need for such a service. It is a good idea, and could be very useful even for foaf+ssl, but for different reasons: making it easy to help people find someone’s foaf file from the email address could have many very neat applications, if only for enhancing email clients in interesting new ways.


Update: It was remarked in the comments to this post that the /host-meta file format is now XRD. So that removes one criticism of the first proposal. I wonder how flexible XRD is now. Can it express everything RDF/XML can? Does it have a GRDDL mapping?

November 30th 2009 Uncategorized

Identity in the Browser, Firefox style


Mozilla’s User Interface chief Aza Raskin just put forward some interesting thoughts on what Identity in the Browser could look like for Firefox. As one of the Knights in search of the Golden Holy Grail of distributed Social Networking, he believes he has found it in giving the browser more control of the user’s identity.

The mock-up picture reproduced below shows how Firefox, by integrating identity information into the browser, could make it clear what persona one is logged into a site as. It would also create a common user interface for logging in to a site under a specific identity, as well as for creating a new one. Looking at the Weave Identity Account Manager project site, one finds that it would also make it easy to automatically generate passwords for each site/identity, to sync one’s passwords across devices, and to change the passwords for all enabled sites simultaneously if one feared one’s computer had fallen into the wrong hands.
These are very appealing properties, and the UI is especially telling, so I will reproduce the main picture here:

The User Interface

One thing I very strongly support in this project is the way it makes it clear to the user, in a very visible location – the URL bar – what identity he is logged in as. Interestingly this is the same location as the https information bar shown when you connect to secure sites. Here is what the URL bar looks like when connected securely to LinkedIn:

One enhancement the Firefox team could immediately work on, without inventing a new protocol, would be to reveal in the URL bar the client certificate used when connected to a https://... url. This could be done in a manner very similar to the way proposed by Aza Raskin in his Weave Account Manager prototype pictured above. This would allow the user to

  • know what HTTPS client cert he was using to connect to a site,
  • as well as allow him to log out of that site,
  • change the client certificate used if needed

The last two features of TLS are currently impossible to use in browsers because of the lack of such a user interface handle. This would be a big step toward closing the growing Firefox Bug 396441: “Improve SSL client-authentication UI”.

From there it would be just a small step, though one that I think would require more investigation, to use foaf+ssl to enhance the drop-down description of both the server and the client with information taken from the WebID. A quick reminder: foaf+ssl works simply by adding a WebID – which is just a URL to identify a foaf:Agent – as the subject alternative name in the X509 certificate’s version 3 extensions, as shown in detail in the one-page description of the protocol. The browser could then GET the meaning of that URI, i.e. GET a description of the person, by the simplest of all methods: an HTTP GET request. In the case of the user himself, the browser could use the foaf:depiction of the user to display a picture of him. In the case of the web site certificate, the browser could GET the server information at its WebID, and display the information placed there.

Now if the foaf file is not signed by a CA, then the information given by the remote server about itself should perhaps be placed on a different background, or distinguished in some other way from the information in the certificate. So there are a few issues to work on here, but these involve only well-developed standards – foaf and TLS – and some user interface engineers to get them right. Easier, it seems to me, than inventing a whole new protocol – even though it is perhaps every engineer’s desire to have developed a successful one.

The Synchronization Piece

Notice how foaf+ssl enables synchronization. Any browser can create a public/private key pair using the keygen element and get a certificate from a WebID server, such as foaf.me. Such a server will then add that public key to the foaf file as an identifier for that WebID. Any browser that has a certificate whose public key matches the one published on the server will be able to authenticate to that server and download all the information it needs from there. This could be information

  • about the user (name, depiction, address, telephone number, etc, etc)
  • a link to a resource containing the bookmarks of the user
  • his online accounts
  • his preferences

Indeed you can browse all the information foaf.me can glean just from my public foaf file here. You will see my bookmarks taken from delicious, my tweets and photos all collected in the Activity tab. This is just one way to display information about me. A browser could collect all that information to build up a specialized user interface, and so enable synchronization of preferences, bookmarks, and information about me.
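The authentication check at the heart of this reduces to a simple comparison. In a toy sketch (made-up key numbers, and skipping the TLS handshake that proves the client actually holds the private key), the server just verifies that the public key in the presented certificate is among those published at the WebID:

```python
# what the server fetched (via a plain HTTP GET) from the WebID's foaf file;
# the modulus/exponent values here are made up for illustration
published_keys = {
    "http://bblfish.net/people/henry/card#me": [
        {"modulus": 0xB6C2F00D, "exponent": 65537},
    ],
}

def webid_matches(webid: str, cert_modulus: int, cert_exponent: int) -> bool:
    """True if the certificate's RSA public key is published at the WebID."""
    return any(key["modulus"] == cert_modulus and
               key["exponent"] == cert_exponent
               for key in published_keys.get(webid, []))

ok = webid_matches("http://bblfish.net/people/henry/card#me",
                   0xB6C2F00D, 65537)
```

Note that only public keys ever live on the server; the private key never leaves the browser’s key chain.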

The Security Problem

So what problem is the Weave team solving in addition to the problem solved above by foaf+ssl?

The Weave synchronization of course works in a similar manner: data is stored on a remote server, and clients fetch and publish information to that server. One thing that is different is that the Weave team wish to store the passwords for each of the user’s accounts on a remote server that is not under the user’s control. As a result that information needs to be encrypted. In foaf+ssl only the public key is stored on a remote server, so there is no need to encrypt that information: the private key can remain safely on the client key chain. Of course there is a danger with the simple foaf+ssl server: the owner of the remote service can both see and change the information published remotely, depending on who is asking for it. So an unreliable server could add a new public key to the foaf file, and thereby allow a malicious client to authenticate as the user on a number of web sites.

It is to solve this problem that Weave was designed: to publish remotely encrypted information that only the user can understand. The publication piece uses a nearly RESTful API. This allows it to store encrypted content such as passwords, identity information, or indeed any content on a remote server. The user would just need to remember that one password to be able to synchronize his various identities from one device to another. There is a useful trick worth highlighting: each piece of data is encrypted using a symmetric key, which is stored on the server encrypted with a public key. As a result, one can give someone access to a piece of data just by publishing the symmetric key encrypted with one of her public keys.
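The trick can be sketched with a toy cipher (a keyed XOR stands in here for the real AES and RSA; this is illustration, not cryptography):

```python
import secrets

def xor_cipher(data: bytes, key: bytes) -> bytes:
    # toy reversible 'encryption': applying it twice with the same key
    # returns the original; real Weave would use AES and RSA here
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

# one symmetric key encrypts the stored data...
sym_key = secrets.token_bytes(16)
ciphertext = xor_cipher(b"my bookmarks and passwords", sym_key)

# ...and to grant Alice access, the symmetric key itself is published
# wrapped with Alice's (public) key -- the data is never re-encrypted
alice_key = secrets.token_bytes(16)
wrapped_for_alice = xor_cipher(sym_key, alice_key)

# Alice unwraps the symmetric key, then decrypts the data
recovered = xor_cipher(ciphertext, xor_cipher(wrapped_for_alice, alice_key))
```

Granting one more reader costs one more wrapped copy of the small symmetric key, not a re-encryption of the (possibly large) data itself.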

Generalization of Weave

To make the above protocol fully RESTful, it needs to follow Roy Fielding’s principle that “REST APIs must be hypertext driven“. As it stands, the protocol fails in this respect by forcing a directory layout ahead of time. This could be fixed by creating a simple ontology for the different roles of the elements required in the protocol, such as public keys, symmetric keys, and data objects. This would then enable the Linked Data pattern, allowing each of the pieces of data to be anywhere on the web. Of course nothing would stop the data from being laid out the way the current specification prescribes. But it immediately opens up a few interesting possibilities. For example, if one wanted a group of encrypted resources to be viewed by the same group of people, one would need only one encrypted symmetric key, which each of those resources could point to, avoiding duplication.

By defining both a way of getting objects and their encoding, the project reveals its status as a good prototype. To be a standard, those should be separated. That is, I can see a few separate pieces required here:

  1. An ontology describing the public keys, the symmetric keys, the encrypted contents,…
  2. Mime types for encrypted contents
  3. Ontologies to describe the contents: such as People, bookmarks, etc…

Pieces (1) and (2) above would already be very useful for any number of scenarios. The content in the encrypted bodies could then be left completely general, and applied in many other places. Indeed, being able to publish information on a remote untrusted server could be very useful in many different scenarios.

By separating the first two from (3), the Weave project would avoid inventing yet another way to describe a user, for example. We already have a large number of those, including foaf, Portable Contacts, vCard, and many many more… I side with data formats being RDF-based, as this separates the issues of syntax and semantics. It also allows the descriptions to be extensible, so that people can think of themselves in more complex ways than those which the current developers of Weave have been able to think of. That is certainly going to be important if one is to have a distributed social web.

Publishing files in an encrypted manner remotely does guard one against malicious servers. But I think it also reduces the usability of the data. Every time one wants to give someone access to a resource, one needs to encrypt the symmetric key for that user. If the user loses his key, one has to re-encrypt that symmetric key. By trusting the server, as foaf+ssl does, the server can encrypt the information just in time for the client requesting it. But well, these are just different usage scenarios. For encrypting passwords – which we should really no longer need – the Weave solution is certainly going in the right direction.

The Client Side Password

Finally, Weave is going to need to fill out forms automatically for the user. To do this, I would again develop a password ontology, and then mark up the forms in such a way that the browser can deduce which pieces of information need to go where. Deciding what syntax to use to mark up the html should be a separate effort. RDFa is one solution, and I hear the HTML5 solution is starting to look reasonable now that they have removed the reverse DNS namespace requirement. In any case such a solution can be very generic, so the Firefox engineers could go with the flow there too.

RDF! You crazy?

I may be, but so is the world. You can get a light triple store, embeddable in Mozilla, that is open source and written in C. Talk to the Virtuoso folks. Here is a blog entry on their Lite version. My guess is they could make it even lighter. KDE is using it…

November 25th 2009 security

my time at Sun is coming to an end


Many have been laid off at Sun over the past few years, and we are in a new round now in France: it looks like it may be my turn next.

I am lucky to be working from Europe, where these things take quite some time to be processed. There may even be some way I can extend my pay for 3 months, if I volunteer to depart and don’t take time to find another job inside Sun. In France people don’t get fired, unless they did something really bad – their jobs are terminated.

I have known this was on the cards for the past 6 months, and so I had really hoped that the Social Web Camp in Santa Clara would help me demonstrate the value of what I had been doing to a larger cross-section of people in the Bay Area. Sadly that was messed up by the decision of the SFO Homeland Security bureaucrats to send me to jail instead; a very interesting experience with hindsight, which has triggered a number of new interests that could well guide me to a radical departure in my career – as writer, sociologist, psychologist, political scientist. So many interesting things to do in life…

My time at Sun has certainly been the best experience of work I have ever had. I learned so much here. Certainly, I would have preferred it if we could have launched a large and successful semantic web project while I was here, but somehow that just seemed to be a very elusive task. My hope was to simplify the Semantic Web down to a core, and to show how there is a tremendous opportunity in distributed Social Networks. But Sun’s current financial difficulties and the uncertainties of the takeover by Oracle, have meant that the company had to focus more on its core business. Much bigger projects have failed, and many much better engineers have lost their job here.

Still, this means that I am a bit in limbo now. I will certainly continue to work on Decentralized Social Networks (esp. foaf+ssl), as I believe these have a huge potential. But even more so than over the past few months, I will be doing this under my own steam.

November 24th 2009 Uncategorized



The PARTNERKA: what is it, and why you should care.

Copyright 2010, BLACKHAT-SEO.COM
This post is originally from http://www.blackhat-seo.com/. If you want more like this, visit my website to subscribe.

November 21st 2009 News

Building Links “Outside The Box”


You often hear people say “think outside the box” when it comes to building links; it’s a way to say “do something different” or “be creative”. But what exactly is “the box”, and how do you “think outside” it when it comes to links?

Good questions, but hard to give stock answers to, so I went looking for an example to illustrate the point. I found a good one after reading a press release today from the Cable & Telecommunication Association for Marketing. Let’s take a look at how “thinking outside the box” can help you find credible resources and build links.

Recently the CTAM released a report analyzing four generational groups and their online behavior. No surprises overall, save one as it relates to the Mature (age 65+) market. Here are some of the findings:

Seniors aged 65 and older (also referred to as “Matures”) have made the Internet an integral part of their everyday lives. In a recent study, 77 percent report that they shop online. In fact, Matures lead all other generational groups when it comes to this online activity. They regularly use email (94 percent), go to the Internet to look up health and medical information (71 percent), read news (70 percent), and manage their finances and banking (59 percent). Matures also turn to the Internet for gaming, approximately half (47 percent) of online Matures regularly play free online games.  

Bold in red mine, because it’s the part that raised an eyebrow and got the link brain going. People 65 and older are playing games online? At first I was surprised, since I equate “online games” with things like WarCraft and WhackAToad, but then I remembered hearing how intellectually stimulating activities such as crossword puzzles, Sudoku and word search have the potential to keep Alzheimer’s at bay in older people, and it made perfect sense. Here’s where the “thinking outside the box” kicks in.

Developing a widget for a crossword puzzle or a daily email blast would be easy, helpful and a great passive tool to expose your brand to a segment of the market with a lot of disposable income. If you’re catering to this crowd, create the puzzle (do something different) and make a lot of noise (be creative) when doing so:

  • Launch a media blitz
  • Take out an ad in on/offline magazines
  • Get involved on social networking sites like Eons, ThirdAge
  • Get involved on blogs like Aging Hipsters 
  • Co-partner with another company selling to same demographic, drop puzzles in items e/mailed 

The demographic itself may not link, since they tend not to have websites, but all the organizations who cater to them will. This is the “thinking outside the box” part. :)

The real secret to good link building isn’t about redirects or directories or librarians, it’s about opening the box and looking beyond the obvious for opportunities and openings.   Might be time to start unpacking!

November 20th 2009 News

Correcting Corrupted Characters


At some point, for some reason I cannot quite fathom, a WordPress or PHP or MySQL or some other upgrade took all of my WordPress database’s UTF-8 and translated it to (I believe) ISO-8859-1, and then dumped the result right back into the database. So “Emil Björklund” became “Emil BjÃ¶rklund”. (If those looked the same to you, then I see “BjÃ¶rklund” for the second one, and you should tell me which browser and OS you’re using in the comments.) This happened all throughout the WordPress database, including to commonly-used characters like ‘smart’ quotes, both single and double; em and en dashes; ellipses; and so on. It also apparently happened in all the DB fields, so not only were posts and comments affected, but commenters’ names as well (for example).

And I’m pretty sure this isn’t just a case of the correct characters lurking in the DB and being downsampled on their way to me, as I have WordPress configured to use UTF-8, the site’s head contains a meta that declares UTF-8, and a peek at the HTTP response headers shows that I’m serving UTF-8. Of course, I’m not really an expert at this, so it’s possible that I’ve misunderstood or misinterpreted, well, just about anything. To be honest, I find it deeply objectionable that this kind of stuff is still a problem here on the eve of 2010, and in general, enduring the effluvia of erroneous encoding makes my temples throb in a distinctly unhealthy fashion.

Anyway. Moving on.

I found a search-and-replace plugin—ironically enough, one written by a person whose name contains a character that would currently be corrupted in my database—that lets me fix the errors I know about, one at a time. But it’s a sure bet there are going to be tons of these things littered all over the place and I’m not likely to find them all, let alone be able to fix them all by hand, one find-and-replace at a time.

What I need is a WordPress plugin or something that will find the erroneous character strings in various fields and turn them back into good old UTF-8. Failing that, I need a good table that shows the ISO-8859-1 equivalents of as many UTF-8 characters as possible, or else a way to generate that table for myself. With that table in hand, I at least have a chance of writing a plugin to go through and undo the mess. I might even have it monitor the DB to see if it happens again, and give me a big “Clean up!” button if it does.

So: anyone got some pointers they could share, information that might help, even code that might make the whole thing go away?
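One pointer worth offering: the usual repair for this kind of double encoding is mechanical rather than table-driven. Re-encode the mangled string as ISO-8859-1 to recover the raw bytes actually sitting in the database, then decode those bytes as the UTF-8 they always were. A sketch (which would need care around fields that were only partially mangled):

```python
def fix_double_encoded(text: str) -> str:
    """Undo 'UTF-8 bytes misread as ISO-8859-1' mangling."""
    try:
        # encoding as ISO-8859-1 recovers the raw bytes the DB holds;
        # decoding them as UTF-8 yields the originally intended characters
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # the string was not mangled this way (or not mangled at all)
        return text

repaired = fix_double_encoded("Emil BjÃ¶rklund")  # -> "Emil Björklund"
```

A plugin built around this would not need a lookup table at all, since the round-trip itself reverses the corruption; strings that fail to decode as UTF-8 are left untouched.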

November 20th 2009 wordpress

http://openid4.me/ — OpenId ♥ foaf+ssl


OpenId4.me is the bridge between foaf+ssl and OpenId we have been waiting for.

OpenId and foaf+ssl have a lot in common:

  • They both allow one to log into a web site without requiring one to divulge a password to that web site
  • They both allow one to have a global identifier to log in, so that one does not need to create a username for each web site one wants to identify oneself at.
  • They also allow one to give more information to the site about oneself, automatically, without requiring one to type that information into the site all over again.

OpenId4.me allows a person with a foaf+ssl profile to automatically log in to the millions of web sites that enable authentication with OpenId. The really cool thing is that this person never has to set up an OpenId service. OpenId4.me does not even store any information about that person on its server: it uses all the information in the user’s foaf profile and authenticates him with foaf+ssl. OpenId4.me does not yet implement attribute exchange, I think, but it should be relatively easy to do (depending on how easy it is to adapt the initial OpenId code, I suppose).

If you have a foaf+ssl cert (get one at foaf.me) and are logging into an openid 2 service, all you need to type in the OpenId box is openid4.me. This will then authenticate you using your foaf+ssl certificate, which works with most existing browsers without change!

If you then want to own your OpenId, then just add a little html to your home page. This is what I placed on http://bblfish.net/:

    <link rel="openid.server" href="http://openid4.me/index.php" />
    <link rel="openid2.provider openid.server" href="http://openid4.me/index.php"/>
    <link rel="meta" type="application/rdf+xml" title="FOAF" href="http://bblfish.net/people/henry/card%23me"/>

And that’s it. Having done that you can then in the future change your openid provider very easily. You could even set up your own OpenId4.me server, as it is open source.

More info at OpenId4.me.

November 19th 2009 security

Letting Igons be Igons



 From "Blowing Up," by Malcolm Gladwell, The New Yorker, April 22,2002

November 19th 2009 News

More on Quarterbacks



A few more thoughts on quarterbacks:


There are two separate issues with respect to quarterbacks. The first is whether, historically, NFL teams have done a good job of predicting which college quarterbacks will succeed in the pros. Dave Berri and Rob Simmons’ paper in the Journal of Productivity Analysis (that I relied on in the essay “Most Likely to Succeed” in my new book “What The Dog Saw”) proves pretty convincingly, I think, that the answer is no. One of the best parts of that paper is how Berri and Simmons demonstrate how much NFL teams tend to irrationally over-weight “combine” variables like speed, height and Wonderlic score.

There’s a second wonderful paper on this general subject by Cade Massey and Richard Thaler—Thaler being, of course, one of the leading lights in behavioral economics—called “The Loser’s Curse.” The argument of the Thaler-Massey paper goes something like this (and I encourage anyone who is interested in sports to read the whole thing, because I can’t do it justice here). By looking at the trades that NFL teams make, we can estimate the “market value” of a draft pick. And what we find is that teams place a very high value on high first round picks. The first pick in the draft, they write, has historically been valued as much as “the 10th and 11th picks combined, and as much as the sum of the last four picks in the first round.” Then Thaler and Massey calculate the true value of draft picks, using what they call “surplus value.” The key here is that all NFL teams operate under a strict salary cap. So a player’s real worth to a team is the extent to which his performance exceeds the average performance of someone making his salary. And what do they find? That market value and surplus value are radically out of sync: that teams irrationally over-weight the importance of high first round picks. In fact, according to their analysis, the most useful draft picks are in the second round, not the first: that’s where surplus values tend to be highest. Hence the title of the paper: “The Loser’s Curse.” The NFL rewards its weakest teams by giving them the highest draft picks—but those picks are actually not the most valuable picks in the draft.

It is important to note here that we are talking about relative value. Personnel decisions in the NFL have clear opportunity costs: if you pay $15 million for a quarterback who only gives you $10 million of value, then you have $5 million less to pay for a good linebacker. As they write: “To be clear, the player taken with the first pick does have the highest expected performance . . . but he also has the highest salary, and in terms of performance per dollar, is less valuable than players taken in the second round.”

What Massey and Thaler are saying, in essence, is that NFL general managers are not rational decision-makers. That’s why I think it’s so useful in this particular discussion. Those who believe that draft position is a good predictor of quarterback performance are essentially voting for the good judgment of the people who make draft decisions. And what Berri and Simmons in particular—and Massey and Thaler in general—remind us is that that kind of blind faith in the likes of Matt Millen and Al Davis simply isn’t justified. And, by the way, why should that fallibility come as a surprise? We’ve known for a long time that it is not easy to make decisions under conditions of extreme uncertainty. Here are Massey and Thaler from their conclusion:


Numerous studies find, for example, that physicians, among the most educated professionals in our society, make diagnoses that display overconfidence and violate Bayes’ rule. The point, of course, is that physicians are experts at medicine, not necessarily probabilistic reasoning. And it should not be surprising that when faced with difficult problems, such as inferring the probability that a patient has cancer from a given test, physicians will be prone to the same types of errors that subjects display in the laboratory. Such findings reveal only that physicians are human.

Our modest claim in this paper is that the owners and managers of National Football League teams are also human, and that market forces have not been strong enough to overcome these human failings. The task of picking players, as we have described here, is an extremely difficult one . . . Teams must first make predictions about the future performance of (frequently) immature young men. Then they must make judgments about their own abilities: how much confidence should the team have in its forecasting skills? As we detailed in section 2, human nature conspires to make it extremely difficult to avoid overconfidence in this task.


This brings up the second question. Is it possible to ever accurately predict which college quarterbacks will succeed in the pros? Both the Thaler analysis and the Berri analysis hold out the real possibility that teams can be a lot smarter than they currently are. The New England Patriots clearly have taken some of Thaler’s lessons to heart, for example. There has also been a real effort by the folks over at Pro Football Outsiders to come up with a more useful algorithm for making quarterback selections. David Lewin’s “career forecast” zeroes in on career college starts and career college completion percentage as the best predictors of professional performance. I took the position in my essay “Most Likely to Succeed” that I didn’t think that quarterbacking (as opposed to other positions on the field) was predictable in this sense—that there is so much noise in the data, and so much variability between the college and professional games, that attempts at rationalizing draft day decisions have real limits. I’m still of that inclination. I’m willing to be convinced, though. I’d love to see more statistically-minded people weigh in on the Lewin analysis, and I’d also like to have a better handle on how the recent innovations in college offenses—particularly the use of ever more aggressive spread formations—affect the accuracy of that algorithm.


November 19th 2009 News

Josh Cohen Interviewed by Eric Enge


Published: November 15, 2009

Josh Cohen is the Senior Business Product Manager for Google News. He is responsible for global product strategy, marketing and publisher outreach for Google News, which is currently available in 26 languages and more than 50 countries. Prior to joining Google, Josh was Vice President of Business Development for Reuters Media, the world’s largest news agency. While there, he led business development for Reuters’ Consumer Media team, including all activities with major strategic partners. He was responsible for agreements with AOL, Google, MSN, Yahoo! and numerous media companies around the world for content distribution, revenue generation and strategic investments.

Before joining Reuters, Josh was Director of Business Development for SmartMoney.com where he led business development and licensing activities for the site, a joint venture between Dow Jones and Hearst. Cohen holds degrees from the University of Michigan and Columbia Business School, where he graduated Beta Gamma Sigma.

Interview Transcript

Eric Enge: Can you tell me what your responsibility is within Google?

Josh Cohen: I am the business product manager for Google News. I work with other folks on the news team on figuring out our roadmap: what are the features we are working on, and what do we want to do with the product in the next 6 months, 12 months, 18 months, and so on.

A big focus of my job is really working with people outside of Google; so talking to publishers, talking to people in the media and at conferences; just putting a face on Google News and trying to demystify it as much as possible. I also work with a lot of the different cross-functional teams who interact with publishers on a day-to-day basis and try to tie those efforts together a little bit better.

Eric Enge: Tell us what Google News is and what it does, and who uses it.

Josh Cohen: Google News was launched in beta back in 2002. The idea behind Google News is really similar to what we are trying to do in search. Not to throw the company mantra at you, but, it really is about organizing all the news information out there and making it even more accessible and useful for users.

We are trying to do this in every single country, and in every different language. We want as many different sources as possible, so that when people are looking for that information, they can find it. The interest in news overall is probably higher than it has ever been. More and more people are getting this online, and so the challenge is trying to find that information and to provide some context and organization. So, we really are operating as a search engine specifically for news.

Eric Enge: How do you define news versus other types of content?

Josh Cohen: We really try to keep it as black or white as possible, and we don’t get into qualitative discussions about the nature of the news site. We don’t include any hate speech or pornography. What we look for is whether or not the site is covering current events: is it specifically covering the topics of the day, is there some evidence of an editorial organization, is there at least some editorial review process before something actually gets published? But our bias is really toward inclusion.

Eric Enge: Right. So, you try to be as broad as possible and include as many different sources as you can. Are you looking for the content that would be unique, rather than somebody just republishing stuff off of a news wire?

Josh Cohen: Absolutely. We don’t include sites that are just pure aggregators; there needs to be some original content on the site.

Eric Enge: That makes a lot of sense. What is the process that people go through when they want to have their site or some portion of their site considered for Google News?

Josh Cohen: It is actually pretty straightforward. There is a whole help center on Google News that is specifically for users and explains to them how it works. A whole portion of that is dedicated specifically to publishers, which explains to them how it works, and how to submit their content. Ultimately, they simply submit their sites or the portion of their sites that they’d like to be reviewed for inclusion, and then we take a look at it.

Eric Enge: There is a form people can use?

Josh Cohen: Yes. It is located here.

Eric Enge: What type of questions are covered in the form?

Josh Cohen: There are a few basic questions about the organization itself. We do not make editorial judgments about the nature of the site. It is really up to the user at the end of the day to make those decisions about whether or not they think it is a site that adds value to them. So in the form, we are looking for objective information about their site, and we are not looking for them to make a pitch about their site.

Eric Enge: Evaluating whether it is unique news content is something that your reviewers just do.

Josh Cohen: Yes, there is a support team that will review those sites as they come in. There are no editors or journalists working on Google News. Once a site is included in Google News and in our index, there is no manual intervention around the rankings. It is all done algorithmically.

Eric Enge: Right. Yes, but the people who review the site check to make sure that it is unique content as opposed to duplicated.

Josh Cohen: Yes, they ensure that it meets those criteria. A lot of that can be done algorithmically. We understand duplicate content, and we can do a full-text analysis. But yes, there needs to be original content.

Eric Enge: Right. And suppose, for some reason, something goes wrong in the process and the site gets turned down, but the publisher thinks that there is a fit, and they really believe that they should be reconsidered. Is there a process you would suggest for that?

Josh Cohen: Our bias is towards inclusion, so if there are things that we miss, we certainly want to be able to understand the site better.

Eric Enge: I know one example of a site that got turned down, and it turned out what happened is that, the person who had reviewed it had not looked at the news portion of the site.

Josh Cohen: That is really why we try and ask for as much information about their site as possible, because obviously the webmaster, the owner of the site, the publisher is going to know a lot more about it, understands the details of it. We are looking at thousands of different sites, and so that is the one real manual part of Google News; so the more information we can get about this site, during that submission process, the better.

Eric Enge: We have heard things about other kinds of requirements, like there needs to be a certain volume of news for example.

Josh Cohen: No. There is not a volume requirement in terms of the number of articles published a day or anything like that. Volume can certainly have an impact on rankings, but not on inclusion. We have sites that are publishing hundreds of articles on a daily basis, and we have others that are longer analytical pieces or investigative pieces that are publishing just a handful a week. So, there is really a pretty wide range.

Eric Enge: There is also the notion that the URL needs to have a 3-digit code on it.

Josh Cohen: That is correct, there are certain technical requirements, which have nothing to do with the nature of the site, but with the ways in which we can pick up that content. The 3-digit identifier is one of the ways we pick up the news content on a site. As you mentioned, there are sites that have a section that is devoted to news, but maybe the rest of their content is inappropriate for Google News. Oftentimes, on those sites that 3-digit identifier is a way for us to pick up the specific news content, so that is a requirement for crawling that content.

However, when sites are included in Google News, they are able to submit a News Sitemap, and if you are submitting the News Sitemap to us, then we don’t need the 3-digit URL requirement anymore, and you can ignore that if you are submitting the content via sitemaps, as we can pick it up that way.

Editors Note: Since this interview took place, Google News Sitemaps went through an update into a new format.
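As a point of reference, a News Sitemap entry at the time of this interview looked roughly like the sketch below. The schema has since been updated, and the domain, path, and values here are illustrative placeholders rather than a definitive template:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <!-- One <url> entry per news article you want crawled. -->
    <loc>http://www.example.com/business/article123.html</loc>
    <news:news>
      <news:publication>
        <news:name>Example Times</news:name>
        <news:language>en</news:language>
      </news:publication>
      <news:publication_date>2009-11-15</news:publication_date>
      <news:title>Example headline for the article</news:title>
    </news:news>
  </url>
</urlset>
```

Submitting entries like this explicitly is what lets Google News skip the 3-digit-identifier heuristic: the sitemap, not the URL pattern, tells the crawler which pages are news.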

Eric Enge: Do the sitemaps bring any other kind of specific advantages?

Josh Cohen: Yes. It doesn’t change the ranking; there is no bias towards a site that submits a site map versus one that doesn’t. The real benefits of submitting a sitemap are, it provides a greater level of control over which of the articles appear on Google News, and it allows for specific metadata to be communicated about each of those individual articles.

Right now it is fairly limited, but we are certainly looking to expand what we do within sitemaps, because the more information we have about a publisher’s site, the better. For individual articles there can be basic stuff like attribution, and bylines, and location, and so forth. Ultimately, sitemaps are a really good way to clearly identify the information that you want to get crawled.

Most questions that a publisher will have around ranking of their content on Google News boil down to some technical issue: we didn’t pick up an article, or when we tried to crawl it, it failed the extraction process. So, sitemaps are a really good way to ensure that we are crawling that content, and they also allow you to proactively address any of those issues, because you can go right in and see when we are having problems crawling your site, whether it is a technical issue on our side or on yours. I won’t say sitemaps eliminate all the technical issues, but they can certainly limit the impact of some of them, and they give you a better way of monitoring them.

Eric Enge: It will reduce errors, and will not affect ranking of included stories. It can affect whether or not the story is included at all.

Josh Cohen: Yes, exactly. And, that is a pretty big difference.

Eric Enge: Yes, it is. Are there other technical issues that people need to be concerned with to make sure that their news articles are friendly to the Google News crawler?

Josh Cohen: There are definitely challenges with images; so there are certain best practices that we try to encourage publishers to do. Larger-sized images with good aspect ratios are always easier for us to pick up; having more description within the captions is always helpful, having them near the title, having them inline and non-clickable. And, for the most part we prefer JPEGs.

Another thing is to have relevant and useful titles that are going to help the readers and to help our crawler know what your page is about.

Try not to break up the body of the article, such as having dates between the title and the body. These are tips that are not just specific to Google News, but certainly help for Google News.

Eric Enge: These things can also influence click-through.

Josh Cohen: Absolutely.

Eric Enge: Who are the people who consume Google News?

Josh Cohen: The focus of Google News, and I think one of the real appeals of it, is trying to offer as many different perspectives as possible on a given story. It can be a different political perspective or a different geographical perspective, and you have people who want to understand a story and all the different angles around it; they really want to delve into a story. That is why we cluster the articles not by source, but by story. People click on a bunch of these different links, and those are the people who by and large get a lot of value from Google News, because they get that diversity from Google News.

Eric Enge: From our experience, that certainly includes reporters and editors from a variety of sites.

Josh Cohen: They are certainly heavy users of Google News. There are those who will come to the front page and like the fact that we will aggregate the top stories out there on the web, and allow them to browse the top stories, see what is there, click on them, and go read them on the publisher’s site. Looking at those top stories is not dramatically different from somebody who may go to the publisher themselves directly to look for those top stories.

They may just be looking to see what is out there from across the web, from both their favorite sources and sources they don’t know. Then there is the other half of the users, who are using us pretty specifically as a search engine: they type in keywords for news stories they have heard about, whether they heard it in the office, or on the web, or somebody emailed it to them and they want to learn more about it. They will just type in a name or a few keywords, and use it much more as a search.

Eric Enge: People also set up news alerts, right?

Josh Cohen: Absolutely. They can set up alerts, use our RSS feeds, so there are a number of different ways where they can try and keep on top of stories. We see our role not as a destination site, just as a starting point. Our goal, very similar to what we are trying to do with web search, is to help people find what they are looking for and then send them on their way.

Eric Enge: One of the subtleties of this is that it is obvious to have a title that entices a click-through. But then, you also want that title to include whatever it is that the editors you want to reach use as search terms.

Josh Cohen: To be clear, having a clean title matters, and the placement of that title in your page matters; but there are a few different elements that we are going to look for in trying to pick up the correct story. Certainly, the title matters, but the URL and, most importantly, the text in the article itself matter too. If you have got a URL that is somewhat unclear, or the information is not that clear in the body of the article itself, then the title takes on more weight.

These are all different components that we are looking for; so if you have got a URL that has information, the text is very clear for us; then the title I would say is no more important than the other ones.

Eric Enge: Are there other things that go into ranking news stories?

Josh Cohen: Yes. There are two separate ranking processes that take place. One is just the story ranking, such as what is the top sports story of the day, what is the top entertainment story of the day, science and technology, and so on. There are a number of different factors that go into that, but the easiest way to think about it is we are really relying on what editors think the most important stories are. What is the aggregate editorial interest in a given story: that is to say, how many people are covering it, and where are they putting it on their page? These factors do not impact an individual source’s results, but do influence what story lines we think are most important. So, that is the story ranking.

For article ranking there are a number of signals that we are trying to use: is it original content, is it timely, is it relevant? Is this a local story, and is there a local source reporting original content on it? That is, again, not always relevant to every single story, but it is something else we will look for. Other questions we ask are: is it novel, or is it just a rehash of an article that was out there before, a story that somebody else broke that you just happened to publish later? These are things that we look for—hard to do, but increasingly something that we are trying to include in our rankings.

Then, there are also source-specific signals that we try to use. This is where volume comes in: what is the volume of publication of original content in a given category? The example that I would like to use is, looking at the business category, you have got the Wall Street Journal, or Bloomberg, or Reuters, all of whom, any given day, are publishing probably hundreds of original stories in business. By itself, that is a decent signal that this is a quality source in that category.

You can compare that then with their volume of publication of original content in the sports category, you are probably not going to see a whole lot, if any, of original publication there.

I would say another really important signal for us in recent quarters has been the user behavior. Their behavior has become a really helpful signal for us in trying to determine that same trusted quality of a given source. So in a given cluster, the first link will get the most clicks, the second gets less clicks, and the third, the fourth, and so on, keep getting fewer and fewer clicks. But, if you look at a user who comes in, and instead of clicking on that first link which is what they were “supposed to do,” and instead let’s say they click on the fourth link; that is a very strong signal about both the source that they clicked on and also the three sources above it that they didn’t click on, even though they were “supposed to” click on that.

Over time, as you aggregate that information, normalize it for different click positions, you can look at this section-by-section to get a sense of what users feel are the best sources in given categories. Again, sticking with the business example, if I have got some random source as the #1 link in Google News, and Reuters in the #3 link, somebody may come to that and say “Wait a second, this is a business story, I want to see what Reuters has to say, I am clicking on that link in the third spot.”
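The position-normalization Josh describes can be sketched as a toy calculation. The expected click-through rates and the weighting scheme below are invented for illustration—this is not Google's actual model, just a minimal sketch of the idea that a click on a lower-ranked link is a stronger quality signal than a click on the top link:

```python
from collections import defaultdict

# Hypothetical expected click-through rate by result position (1-indexed):
# the first link in a cluster is the one users are "supposed to" click.
EXPECTED_CTR = {1: 0.50, 2: 0.25, 3: 0.15, 4: 0.10}

def normalized_scores(click_log):
    """click_log: list of (source, position) pairs, one per observed click.
    Each click is weighted by how unexpected it was at that position, so
    sources that draw clicks from low positions accumulate higher scores."""
    scores = defaultdict(float)
    for source, position in click_log:
        # A click at position 3 or 4 counts for more than one at position 1,
        # because the user went out of their way to choose that source.
        scores[source] += 1.0 / EXPECTED_CTR[position]
    return dict(scores)

# Example: Reuters draws clicks even when listed third in the cluster.
log = [("random-blog", 1), ("reuters", 3), ("reuters", 3), ("random-blog", 1)]
scores = normalized_scores(log)
```

With equal raw click counts, `reuters` scores higher than `random-blog` here, because its clicks came from a position where clicks were less expected—the aggregate, section-by-section version of this is what Josh describes as a source-quality signal.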

That type of behavior takes place again and again, and it has become another important signal. Now, that doesn’t trump everything else; all these other scores and factors still matter, but all things being equal, we certainly want to take a look at some of the qualitative aspects of a source. We try to algorithmically determine the qualitative nature of a source in addition to the story-variable signals.

Eric Enge: Are inbound links a factor?

Josh Cohen: Not really. It is obviously a signal on the search side of things; with PageRank, links certainly, as you know, are an important factor. On the news side, just because of the nature of news and how quickly that information comes out, building up links over time is something that isn’t really all that applicable.

Eric Enge: What about social media signals, such as Twitter?

Josh Cohen: There is nothing specific I can say on those, but I think it is safe to say that we are always looking at new signals. We will always keep working on this, because it continues to remain imperfect. We will test certain ones, and we will do evaluations against them as we did with the user click behavior.

Eric Enge: Anything you can say about plans for Google News?

Josh Cohen: We are trying to experiment in a number of different ways. For example, we launched Fast Flip two months ago.

With Fast Flip we tried to introduce that element of serendipity that you get in the offline world. When you pick up a paper and you see the top stories, you may spot the article at the bottom of the page. It is something you would never think to read, you would never really look for, but you do because you spot it.

How do you introduce some of that quality into the online experience? Fast Flip is an attempt to do that. Another key component to that is the speed with which you can browse those pages. If a page takes five to ten seconds to load, you are not going to want to explore different types of content. Fast Flip is an attempt, both in terms of how it is presented visually, and also the speed with which it loads, to allow you to introduce some of the best of the offline experience online. That is a good example of one of the things that we are experimenting with; and I think we like to keep trying to innovate and figure out ways in which we can help our users and work with our partners.

Eric Enge: From my perspective, for a publisher looking to get exposure for what they are doing, implementing a quality-relevant news feed and working with Google News is an outstanding opportunity. I mean, you get visibility that a lot of people would die for. Of course there is an expense in implementing such a news feed. You have to do a quality job, because you don’t want to get in front of people and then have them say this is crap.

Josh Cohen: I think that is well-said. The way that we look at it is that it is a real partnership with the publishers that we have. We are a search index, we are focused on news; but we don’t have any content, we don’t have editors, we don’t have any journalists, and we don’t create any information. We get that from the publishers. For publishers, we think that we bring value in helping them get found and driving the traffic to them. In a given month, Google News sends almost a billion clicks to publishers worldwide.

Eric Enge: Better still, a significant percentage of that is from news editors and bloggers. So, not only you are getting the traffic from Google News, but you are getting the possibility of being written about in other news environments.

Josh Cohen: Sure, getting written about by others within the market is interesting, but we also help publishers obtain loyal users, who may like the aggregation qualities of Google News, but will discover their content and like it.

Eric Enge: Thanks so much for taking the time Josh, to speak with me today.

Josh Cohen: Thank you!



About the Author

Eric Enge is the President of Stone Temple Consulting. Eric is also a founder in Moving Traffic Incorporated, the publisher of Custom Search Guide, a directory of Google Custom Search Engines, and City Town Info, a site that provides information on 20,000 US Cities and Towns.

Stone Temple Consulting (STC) offers search engine optimization and search engine marketing services, and its web site can be found at: http://www.stonetemple.com.

For more information on Web Marketing Services, contact us at:

Stone Temple Consulting
(508) 485-7751 (phone)
(603) 676-0378 (fax)

November 18th 2009 News