So I have been away for a while, dealing with the more mundane aspects of academia: trying to finish papers before a deadline, attending meetings …
But in the meantime I also managed to find some really interesting patterns in the way people tag resources.
This is the reason scientists obsess over observations and theories … to make sense of the mess. Imagine the buzz you get when patterns begin to emerge from the noise: when everyone else sees clouds, you see the regular little formations that lie behind the otherwise amorphous outgrowths.
Consider the set of most popular tags for the New York Times web site. Here they are, followed by the number of people using that tag (as of January, 2006). The numbers were obtained through the Scrumptious plugin for Firefox.
news (2093), newspaper (550), daily (370), nyc (298), newspapers (251), media (229), usa (156), politics (68), newyork (56), new (43), world (40), york (38), nytimes (38), nyt (36), times (34), us (32), culture (28), international (23), safari_export (22), reference (19), national (18), ny (18), business (17), noticias (16), english (15).
The individual tags are all obvious or semi-obvious in the way they describe the resource (the NYT web site). There are some puzzling ones … “new” for example. But presumably this was meant to go together with “york” to form “new york”. A slight problem with this interpretation is that “new” appears slightly more often than “york” which means that if my speculation is correct, then some people must have used “new” independently of “york”. Why? Perhaps they used it in conjunction with another word? Which one? We need to obsess about this problem a little bit in the background until we have an answer. Loose threads are annoying.
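To put a number on that loose thread, here is a minimal Python sketch. It parses the tag list above and computes the smallest number of people who must have used “new” without “york”, assuming (generously) that everyone who used “york” also used “new”. The names `RAW_TAGS`, `parse_tag_counts` and `min_new_without_york` are my own, made up for illustration; they are not part of any tagging tool.

```python
import re

# Tag counts copied verbatim from the list above (January 2006).
RAW_TAGS = (
    "news (2093), newspaper (550), daily (370), nyc (298), newspapers (251), "
    "media (229), usa (156), politics (68), newyork (56), new (43), world (40), "
    "york (38), nytimes (38), nyt (36), times (34), us (32), culture (28), "
    "international (23), safari_export (22), reference (19), national (18), "
    "ny (18), business (17), noticias (16), english (15)"
)


def parse_tag_counts(raw):
    """Turn 'tag (count), tag (count), ...' into a dict mapping tag -> count."""
    return {tag: int(n) for tag, n in re.findall(r"(\w+) \((\d+)\)", raw)}


counts = parse_tag_counts(RAW_TAGS)

# Lower bound only: if every "york" user also used "new", the leftover
# "new" users are the ones who used it with something else (or alone).
min_new_without_york = max(0, counts["new"] - counts["york"])

print(len(counts), "tags parsed")
print("at least", min_new_without_york, "people used 'new' without 'york'")
```

With the counts above this gives a lower bound of 5. The true number could be higher, since nothing guarantees that every “york” was paired with a “new”.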
But the really interesting question is, why do people use these particular tags? The typical, simple answer is that people use whatever tags they think will help them retrieve the documents later. These tags happen to be useful for most people in most contexts.
But I have to wonder, why?? Why are these words useful for this resource? Can we characterize the most used tags in some interesting way? Can we predict for a new URL how people will tag it?
This last point is important, because prediction is one of the most important aspects of science. Most people won’t use random words to describe a resource. Can we tell which ones they will use?
But I think the claim that tagging is so different from categorization is meant in a strong sense. I think the claim must be that no formal and predictable pattern can be ascribed to user classifications (apart from localized, transient, culturally transmitted patterns). That is, in complete contrast to the sort of formal ontologies used in the Yahoo directory, for instance, user tags are supposed to be parasitic on whatever mental association happens to work for a given person at a given time for a given URL in a given context. It is precisely their lack of formal pattern that is supposed to make tags so useful for individuals, and that allows anyone to enter into the classification game.
But is this true? Is there really no formal pattern to be seen in user tags? I think there is plenty, and my next few blog posts will look at the patterns I found and how I found them. So think about the tags above … what can you see?? Can you see the handiwork of “naive ontologists”?