More thoughts on tag frequency

I was slightly on the wrong track in a previous blog, I think. It happens …
So, the mistake was that I was thinking that all of the popular tags that had survived in a historical context could be thought of as “category labels”, and that the less popular ones were the highly individual ones. This would certainly have made things easy. Of course it is good that this is not the case, because that would have been TOO easy, and things that are easy are not interesting. So why is it not true? Just look at the table below, which shows the four most popular tags for some of the 50 most popular sites on delicious:

Site: Tags
Slashdot: News, Technology, Geek, Daily
Flickr: Photos, Flickr, Photography, Sharing
Pandora: Music, Radio, Recommendation, MP3
Digg: News, Technology, Blog, Daily
BBC News: News, BBC, UK, Daily
New York Times: News, Newspaper, Daily, NYC

It is self-evident that the popular tags form a heterogeneous collection. Some are clear category labels that would feel at home in a formal taxonomy (e.g. “News”, “Movies”, “Music”). Some, like “Daily” and “Recommendation”, appear to describe resources with a particular property which is nevertheless fixed and user independent. Others, like “Fun” and “Geek”, describe more personal properties that depend on individual interpretation. Finally there are proper names like “UK” and “NYC”.

So, what are the facts telling us???



Folksonomies are not a “bad” thing!

I have been busy lately testing some of my ideas about the semantics of tags .. which I will blog as soon as I have a coherent set of things to say. In the meantime I wanted to write this post in case there is a misconception that I have somehow been maligning this whole tagging business. I have not. I think there is lots of cool stuff being done with them, and I especially love the potential of the Yahoo My Web 2.0 stuff.

I also love the new browser, Flock. It is based on the Firefox code base, so I hope they don’t all start fighting together, but I think Flock really shows the potential of bringing together the content on “my web”.

My point is that folksonomies can do even more, if we are able to extract the full richness of their semantics. This is what I am trying to do right now.


Our Mind the Taxonomist

So finally, after the many pages of blogs and thoughts, I think my main points came into focus after the last blog, after Clay Shirky’s clear and direct claim, which I reiterate here:

Here’s what’s radical about what protends: My vocabulary on folksonomy is personal, not vernacular — no one knows or needs to know which class I’m talking about when I tag something ‘class’, or that I use LOC to mean Library of Congress. This isn’t the same as, say, the dictionary of thieves slang from the mid-18th c. because no one else needs to know my bookmark system, and I don’t need to know anyone else’s …

So here is what’s radical about what Shirky protends: if I want to, I can tag my bookmarks with any vocabulary I choose. So here goes, a sample of my tags: “hdfjkfb”, “orjfkido”, “hjfoå”, “krlofpke”. This certainly ensures that no one else knows my system, and if everyone else does the same, I won’t know theirs. But how useful would this be to anyone?? So at this rather radical extreme the Shirky claim is without content.
But maybe the example is too radical because the notion of vocabulary precludes the use of random tags. Fine. Let us think of a different example which is perhaps closer to the spirit of Clay’s idea. So I make up a system that only I know, where all interesting sites end with “xyz”, technical sites begin with “krp”, and so on. Everyone else can make up their own system, and no one knows anybody else’s. But an immediate problem with this is that it is cheating … it smuggles in the more natural English vocabulary via the back door by simply equating each English term with an expression in the new “vocabulary system”. Still, this is private knowledge so maybe that is O.K. for the example. But this leads to the interesting hypothesis: suppose you made up vocabularies like “krp…”, “…xyz”, and so on, and got users to adopt them. These can be mapped onto their “parent” vocabularies that they are derived from (remember “krp…” = “technical site”, etc.). To the individual user, each system would be equivalent (more or less, as Katie Melua would say). But which vocabularies would make a better

I think it is pretty clear that we DO need to know SOMETHING about everyone else’s bookmark system! The success of a “social bookmarking system” depends on the fact that we do understand each other’s vocabularies (to some extent), and can extract value from that shared understanding.

What I have been arguing is that the aggregate data shows us something about what we know about everyone else’s bookmark system. Why are some terms so “natural”? Do all “natural” terms become popular? Are all “natural” terms natural in the same way, or are there different ways in which terms can naturally relate to their categories (a bit like facets in taxonomies)?
So my position is that folksonomies can provide interesting data which can give a clue about the way humans organize knowledge, AND about the ways in which we share each other’s organizational systems. My Ph.D. was originally in Cognitive Psychology, and during my studies I came to believe in the hypothesis that mental architecture fundamentally shapes our perceptions and organization of the world in which we live. Further, essential aspects of the mental architecture are fixed and therefore shared by all humans, which is what makes communication and shared understanding possible. The overlap is not perfect. I say Library of Congress, but Clay wants to say LOC. Good for him. But pity the poor soul who calls it “the square root of negative 2”!
The mind creates categories, because that is what minds do (there is an awful lot more to say about that!). The mental architecture constrains the range of possible ontologies and taxonomies that we can bring to bear on the understanding of our universe. All humans share fundamental aspects of mental architecture and therefore properties of possible taxonomies. Folksonomies provide a fantastic window into the workshop of the mental taxonomist!

More Differences

Just when I finished my previous post on the problems with focusing on individual differences, I discovered this post, once again by Clay Shirky. He makes a good point about why it might be important to think about individual differences when it comes to tag use. He says:

Here’s what’s radical about what protends: My vocabulary on folksonomy is personal, not vernacular — no one knows or needs to know which class I’m talking about when I tag something ‘class’, or that I use LOC to mean Library of Congress. This isn’t the same as, say, the dictionary of thieves slang from the mid-18th c. because no one else needs to know my bookmark system, and I don’t need to know anyone else’s …

So pretty clearly the view is that an individual can do whatever s/he wants, in complete isolation from every other user, and the “radical” system will accommodate each individual equally. Thus, while many people might agree, the system is “ensuring that the emergent consensus view does not have to be pushed onto any given participant”.

So this is why individual differences matter, because they are allowed to exist. Taxonomies are bad because they force everyone to be the same, folksonomies are good because they allow everyone to be individuals. This is probably good ideology, but is it good science? Is it even interesting? Is it useful?

These are the questions I have been asking all along. Where, pray tell, is the evidence that highly unusual tags that differ from the “norm” are even useful? Here are some of my less frequent tags: adhoc, bar, and controlled. I have NO IDEA what any of them stand for! Maybe I am a bad tagger? So ignore me.

Here is an observation based on a single example (so take with a grain of salt .. the same grain you use for all other single examples!): a few weeks ago the most popular tag for the New York Times was “news” with 2093 instances. The next few are “newspaper” with 550, “daily” with 370, “nyc” and “media” with 229. These tags kinda make sense .. but after this you have “english” with 15 instances, “business” with 17, and “noticias” with 16. These last few are a real mix. I wouldn’t be surprised if “business” turned out to be useless for most users: it sounds like a tag you add when there is pressure to improve on just “news”, but which is probably never used as a retrieval aid! (Clearly an empirical question, but would someone so interested in business news ever bookmark news sources that DIDN’T have business news??). “English” and “noticias” are interesting in that they appear to cater to internationalization needs. This is definitely an important requirement for category systems, which many don’t address adequately.

But the question is, is there a significant benefit after “news”, “newspaper”, and say, “daily”? How many users would have so many links that they would not locate the New York Times without also including “business”???

But apart from the usefulness issue, the really interesting observation is still the remarkable overall agreement. Surely the New York Times is “daily”, and it includes a “business” section .. so these are all equally valid features by which to reference it. But why do (roughly) 6 to 120 times as many people prefer “news” to the other two?
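To put a number on that preference, here is a minimal sketch that just divides the counts quoted above for “news”, “daily” and “business”. The function name preference_ratio is made up for this illustration, and the counts are the single-example figures from this post, nothing more.

```python
# Tag counts for the New York Times, as quoted earlier in this post.
counts = {"news": 2093, "daily": 370, "business": 17}


def preference_ratio(counts, popular, other):
    """How many times more often `popular` was used than `other`."""
    return counts[popular] / counts[other]


for tag in ("daily", "business"):
    ratio = preference_ratio(counts, "news", tag)
    print(f"news vs {tag}: {ratio:.1f}x")
# news vs daily: 5.7x
# news vs business: 123.1x
```

So the gap runs from a bit under 6 times to well over 100 times, depending on which of the two "equally valid" tags you compare against.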

And then there is my thought experiment. If the evil millionaire convinced 100000 people to tag the New York Times as “finglewick”, how long would that survive? What about “really super cool site”? What about “should subscribe”? Why are some of these tags good but others not? Why not “finglewick”? Why would some good tags, that would probably generally be judged as appropriate (like “should subscribe”) not survive (in my opinion) while some others, like “news”, definitely do?? The system lets us label the New York Times any which way we like. But in spite of infinite freedom we all label it “news”. Now THAT is news to me!

Individual Differences and the trouble with the Standard Social Science Model

I have just realized that in my thoughts about all this tagging, I have really been focusing on the historical and other forms of aggregate data, and not really been worried about individual users. This is really in contrast to the analyses carried out by the die-hard tags vs. categories types. A prototypical example of this is again Clay Shirky’s (classic?) post. Here are the titles of his figures: “Tags per user”, “A single user’s tags”, “Different tag ‘signatures’ for different URLs”. O.K., so he only has three graphs. But a full 66% of them refer to single users.
This reflects differences in bias …. no, not really bias …. but attitudes toward the appropriate level of description at which phenomena are best understood. I remember the moment my attitude suddenly made sense to me, when I came across the brilliant book containing an introductory chapter by the psychologists John Tooby and Leda Cosmides (the last of whom I met at a lunch once, to my great pleasure. Leda is a brilliant and really nice lady). In presenting their Adaptationist view of cognition, they argue that the traditionally popular Standard Social Science Model, in which human minds are presented as “blank slates” to be written on by experience and culture, is simply wrong. If I remember correctly (this is reaching back a few years!) they illustrate their argument with the failures of Anthropological research in much of the 20th century. The problem is that researchers were so keen on identifying the vast power of cultural differences in shaping human behavior that they over-focused on finding those differences while ignoring the similarities. So in one culture men marry one woman, in others they marry 100. There is clearly a difference here, but is that difference more important than the striking similarity that both cultures have some concept of “marriage”? The truly important observation made by Tooby and Cosmides is that there is an absolutely remarkable series of commonalities between ALL cultures on the face of the planet; similarities to do with affiliation, punishment, and so on. How can this be? It is the “aggregated” similarities rather than the specific differences that pose the really interesting puzzles.
This observation is in many ways similar to the work of the great linguist Noam Chomsky, who changed the face of linguistics when he postulated that the really important part of understanding language is to understand the nature of the mental faculties that must exist in order to learn and to use human language in the way that humans do. The study of these faculties has shown that the knowledge of language in speakers of all human languages probably overlaps a great deal, and that the apparently vast differences between languages are due to relatively unimportant factors like parametric variations and different vocabularies. I am being somewhat cavalier in this, I know, but let me illustrate the basic point with my previous linguistic example: it is true that different individual English speakers might choose to say either “Who do you want to go with” or “Who do you wanna go with”. We could of course have all sorts of theories about the sorts of people who would want to say one or the other. Or we could say “look .. I really (really) couldn’t care less about why you would choose to say one or the other”. What is really interesting is that both speakers will admit (unless they have some sort of non-linguistic psychological problems!) that either way of saying it is fine. Furthermore, they will also both agree that “Who do you want to help you” is good, but the superficially similar variation “Who do you wanna help you” is not good. All of a sudden, questions about the superficial choice of different surface forms are superseded by a much deeper question about the nature of the (shared) knowledge that allows the expressions in the first place. Of course the differences between languages are less trivial, and we can learn a great deal about language from the ways in which different languages can differ. But again it is the generalizations and categorizations of the kinds of possible differences that are important.
So maybe the point of all this for tags is obvious? Well, for what it’s worth … I have a feeling that the aggregate information is telling us something interesting about the way people (in general) use tags, that will subsequently also help us understand why a person (individually) chooses to use them in their own unique (to a point?) way.
So in a way this is just an attitude (bias?) that I have. But the wonderful thing about science is that our respective biases are eventually tested and weeded out. Does one level of analysis give us deeper understanding than the other? Which one has more predictive power, in the sense that we can uncover more new facts and explain more old ones?? In some cases like linguistics the answer is fairly clear, in psychology less so, and with respect to tags … we have a long way to go.
But I am optimistic about my attitude. Here is a prediction I can make. In an earlier post I suggested an experiment in which an evil millionaire who wants the whole world of computers to bend to his will (hmm .. sounds familiar!), pays a lot of people to insert his favorite tags into popular web bookmarks. So he pays people to insert tags like really_super_cool, must_have_for_my_birthday, blue, and so on. What would happen? My feeling is that, once he stopped paying, the popularity of these tags would quickly drop off. How could we explain that if it were true? Again my feeling is that every possible explanation would eventually boil down to the claim that people just don’t like tags like that to describe resources, in the same way that they like tags like news. Why not? Well, we are back to the linguistics-style argument again.
I have other predictions of course, which Joshua could easily confirm or disconfirm. I bet people mostly click on the big words in tag clouds. But I am not too deeply committed to this feeling. On the other hand I would bet more on this: when people search with combinations of tags, they tend to use relatively few, high frequency tags. And I bet the aggregate data would show this!

More “order from chaos”

This morning I discovered the following words of wisdom from a blog that aims to provide technical tips and tricks to other bloggers. This one is about ways to tag your own blog in order to make it more visible to searches. Here is a really interesting claim about the differences between tags and categories. It seems to me to be a very clear and concise summary of much of the public opinion in favor of tags:

“One of the common distinctions that comes up is whether you’re using tags or categories (“tags-as-categories”) for your posts. The distinction is that categories are fewer in number, generic, chosen beforehand, possibly hierarchical (sub-categories) and persist for a long time. Sometimes people will file a post under only one category. Tags are much more specific, made up on the spot, are “flat”, may be single-use and each post may have half a dozen or more of them.”

But this is really interesting in light of the emerging patterns I discussed last time. First, the claim that categories are fewer in number is not really true if you look at the relatively few tags that become popular for a given site. Second, the claim seems to be that categories persist for a long time but tags don’t .. again blatantly false if the historical trends are any indication! Third, there is a distinction set up between “generic” categories and “specific” tags. While I have not presented much evidence about this, my feeling is that this is also not true, at least for the most popular tags. Consider again the running example. How much more generic can you get than “news” and “New York”? Finally, tags are supposed to be made up “on the spot” and categories “chosen beforehand”. This is a tricky one, which involves the whole process of creating consensual categories vs. “inventing” tags. But it is a lot trickier than the writer suggests: surely there are prior constraints on what kinds of tags are chosen, whether internal (cognitive), external (social/cultural) or pragmatic (the tagging tool often suggests tags based on prior community agreement).

It seems to me that the aggregate information from tags is telling a very different story from the “popular wisdom”. I hinted earlier that the tags which become really popular might be systematically different from those which do not. The writer I am quoting above seems to agree with this to some extent in a later part:

“In practice, the category/tag distinction is a question of degree and many authors use a hybrid strategy: tag their posts wildly but reserve ten or so to use as categories. This seems to be a best-of-both-worlds scenario, as your top tags (when sorted by frequency) function like categories while the long-tail of one-shot tags is useful for tag-searchers. “

So here is a first shot crude categorization of tags: high frequency = category, low frequency = something else.
But die-hard anti-categorists would probably be revolted by this conclusion. Here is one of the most emotionally charged pro-tagging blogs I have found! He says:

“What’s more, is that I can cross-reference those bad boys! Yessiree! I can see what posts I have written containing any combination of tags. That’s power. I’d like to see your wack-ass categories do that!”

So this is a different argument … it does not care about the abstract nature of categories vs. tags .. what it cares about is how they can be used. For many practical reasons, systems that use categories tend to be exclusive in the sense that a particular resource is filed under just one category. This is why the category structure has to be just right, otherwise the resource will never be found. Of course, it is never possible to get this just right for every conceivable need. Cross referencing is possible (and probably necessary), but results in vastly increased complexity. But tags appear to represent a new paradigm: resources can be tagged with as many tags as one desires, and these tags can cross-classify the resource in any which way. Retrieval is then accomplished by finding suitable intersecting regions of the relevant tag space. The retrieval process seems to make sense with simple examples. In a recent paper, Golder and Huberman illustrate the new paradigm with the following example:

“For example, consider a hypothetical researcher who downloads an article about cat species native to Africa. If the researcher wanted to organize all her downloaded articles in a hierarchy of folders, there are several hypothetical options, of which we consider four:
1. c:\articles\cats all articles on cats
2. c:\articles\africa all articles on Africa
3. c:\articles\africa\cats all articles on African cats
4. c:\articles\cats\africa all articles on cats from Africa
Each choice reflects a decision about the relative importance of each characteristic. Folder names and levels are in themselves informative, in that, like tags, they describe the information held within them (Jones et al. 2005). Folders like 1. and 2. make central the fact that the folders are about “cats” and “africa” respectively, but elide all information about the other category. 3. and 4. organize the files by both categories, but establish the first as primary or more salient, and the second as secondary or more specific. However, looking in 3. for a file in 4. will be fruitless, and so checking multiple locations becomes necessary.”

The promise of tags is that they will eliminate the need to check multiple locations because they eliminate the need to “guess” which category to look in: you simply search for “Africa+cat”. But this idyllic situation soon begins to look a little worse … suppose the researcher downloads some more articles, this time specifically about “cheetahs”. Neglecting to tag them with the existing tag “cat”, she decides to use the more specific tag “cheetah”. But now the search for “Africa+cat” fails to find these important articles! So the user has to look elsewhere, possibly realizing that the third tag is also relevant. This needs sophisticated tools that can do some fancy computations over the set of existing tags .. information extraction and reasoning techniques. How bad will this get with millions of resources and millions of tags? No one knows, but some speculate the confusion will become so big, and the cost of retrieving useful information in such a landscape so prohibitive, that the whole enterprise will one day disappear!
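The “Africa+cat” search and its cheetah problem can be sketched in a few lines. This is only an illustration under my own assumptions: the file names, the bookmarks dictionary, and the narrower mapping (which says that “cheetah” is a kind of “cat”) are all invented here, and the second function stands in for the “fancy computations” mentioned above in the simplest way I could think of.

```python
# Hypothetical collection: each downloaded article maps to the set of tags
# the researcher gave it. File names and tags are invented for illustration.
bookmarks = {
    "african-wildcats.pdf": {"africa", "cat"},
    "lions-of-the-serengeti.pdf": {"africa", "cat"},
    "cheetah-hunting-speed.pdf": {"africa", "cheetah"},
    "house-cat-care.pdf": {"cat"},
}


def search(bookmarks, *query_tags):
    """Return the items that carry every one of the query tags."""
    wanted = set(query_tags)
    return sorted(name for name, tags in bookmarks.items() if wanted <= tags)


def search_expanded(bookmarks, narrower, *query_tags):
    """Like search(), but a query tag also matches its known narrower tags."""
    alternatives = [{tag} | narrower.get(tag, set()) for tag in query_tags]
    return sorted(
        name
        for name, tags in bookmarks.items()
        if all(tags & alt for alt in alternatives)
    )


# Assumed extra knowledge that the plain tag intersection does not have.
narrower = {"cat": {"cheetah"}}

print(search(bookmarks, "africa", "cat"))
# ['african-wildcats.pdf', 'lions-of-the-serengeti.pdf']
print(search_expanded(bookmarks, narrower, "africa", "cat"))
# ['african-wildcats.pdf', 'cheetah-hunting-speed.pdf', 'lions-of-the-serengeti.pdf']
```

The plain intersection misses the cheetah article, exactly as described. The expanded search finds it, but only because somebody supplied the cat/cheetah relation, which is a little piece of taxonomy sneaking back in.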

Since we don’t have any direct evidence (or at least none that I am aware of!) about the way in which people use their general as well as (possibly very many) idiosyncratic tags for the purpose of retrieval, we have to guess through indirect evidence. Speaking from personal experience though, I have NO IDEA what many of my least frequently used tags are even supposed to be about … so mostly I just use those BIG TAGS on my TAG CLOUD!

So now we need some evidence on how people use tags. I suppose the first thing to look at is how many tags individual people tend to use with individual URLs. Speroni, who has gathered many users’ information for his mindmaps, estimates this figure at around 10. This site is a very interesting read because he goes on to explain how you can use Pascal’s triangle to calculate the number of URLs that can be uniquely indexed by a combination of n out of a total of m tags. But this is an idealistic calculation which assumes that the tags are used independently (I think!). So I gathered some numbers from the users who generated mindmaps. There were 2202 maps from which I calculated the means and medians. The mean and median number of links were 179 and 103, respectively. To uniquely index 179 links, you need only 10 tags used in combinations of 4 (which gives 210 distinct combinations). But the mean number of tags per user is 100! Obviously a suboptimal strategy by the users!!
This suggests that people use many more tags than they really need for each bookmark. Many of those tags must be redundant, or even unused. Of course the result is consistent with the idea that a few, frequent tags are used as categories in a way that efficiently narrows the search. The rest of the tags are there either for redundancy, or perhaps to facilitate alternative, albeit more infrequent, access paths.
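The Pascal’s triangle arithmetic is easy to check. The sketch below assumes the idealized case described above, where every link gets a distinct combination of exactly n tags out of a vocabulary of m, and where tags are used independently. The helper name smallest_vocabulary is mine, not Speroni’s.

```python
from math import comb  # comb(m, n) is the binomial coefficient "m choose n"


def smallest_vocabulary(links, n):
    """Smallest vocabulary size m such that C(m, n) >= links.

    Idealized: each link gets its own combination of exactly n tags.
    """
    m = n
    while comb(m, n) < links:
        m += 1
    return m


print(comb(10, 4))                  # 210 distinct 4-tag combinations from 10 tags
print(smallest_vocabulary(179, 4))  # 10, enough for the mean of 179 links
print(smallest_vocabulary(103, 4))  # 9, enough for the median of 103 links
```

Against an ideal of 9 or 10 tags, a mean of 100 tags per user is an order of magnitude more than the indexing job strictly requires.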
Here is another interesting observation that supports this view. Golder and Huberman analyzed the tags assigned to individual URLs by individual users, in the order that the tags were assigned to that URL. What they found was that people tend to use the highest frequency tags first, then start using the lower frequency, more idiosyncratic tags. This is again consistent with the view that high frequency tags are like categories or folders to hold relevant items, while lower frequency tags add more personally oriented distinguishing features to each resource.

So, popular tags (and many less popular ones) are stable, intrinsically sensible and communally accepted categories!!

But what sort of categories are they? How do they structure the information space? Do they really cross-categorize? Again, back to the beginning!

Emerging patterns

There are many ways to see patterns. The most common perhaps is mathematical and statistical. What tends to go with what? How do things change over time? What mathematical equations can describe patterns of behavior? The very existence of physics is a testament to the power of this approach.

But mathematical analysis is not always appropriate. Complex equations can be used to analyze movement on the stock market, but they can’t explain why some people sell and some don’t, when they appear to be in very similar circumstances. As another example, statistical regularities are plentiful in language, but an analysis of these regularities cannot tell us much about what humans know about language when they use it. Thus, the pair of sentences “Who do you want to go with?” and “Who do you wanna go with?” both sound fine but when we look at a similar pair “Who do you want to help you?” is fine, but “Who do you wanna help you?” is not. Explaining how we know this requires a different sort of analysis and explanation.

In what follows I summarize some of the existing statistical observations. I note their inadequacy, and some time later show an alternative sort of explanation which I think is much more powerful.

There is a very interesting web site that provides visualizations of the historical patterns in tag use for a given URL. The site is cloudalicious.

Here is a visualization of the way in which the most popular tags emerged for the New York Times web site. (An important note: in what follows I will illustrate my claims with the NYT example, which is but a single example. This is terrible practice, although one which seems to be followed by an awful lot of people in the “Blog world”. It is terrible because we have no way of knowing if a particular observation is true in general, or only for a specially selected example. As a result, the proper evaluation of any claim has to be done by looking at a reasonable set of randomly selected sites, to see if the observation repeats itself often enough for us to believe in its generality).

There are a number of interesting observations that can be made about this graphic.

First, there is a hint of the famous “power law curve” apparent in the curves, although this graph is not a particularly good illustration. But the idea is that there are a few tags which are used very often and very many tags which are used less often. In this example the use of the tag “news” dominates more strongly than is typically the case, and the rest of the tags are a bit too close together at the bottom.
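One common way to look for a power law is to plot frequency against rank on log-log axes and see whether the points fall roughly on a straight line. Here is a minimal sketch of that check. The function name loglog_slope and the idealized counts are my own assumptions for illustration, and a straight-ish line is only a hint, not a proof, that a power law is at work.

```python
import math


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two strictly positive counts. For counts proportional
    to rank ** -a the slope is -a.
    """
    ordered = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(freq) for freq in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Idealized (made-up) tag counts that fall off exactly as 1 / rank.
ideal = [1000 / rank for rank in range(1, 21)]
print(round(loglog_slope(ideal), 2))  # -1.0
```

Real tag counts will be much messier than this idealized list, especially when one tag dominates as strongly as “news” does here.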

But the most striking observation is how much agreement there is in the use of the tags. After a brief unsettled period at the beginning of the history, the pattern pretty much stabilizes and the dominance of the most popular tags is never challenged. (The sudden drop in all tags is linked to a sudden influx of new users tagging that site, which is marked with the faint gray line in the background).
There are some interesting exceptions to this relative stability, and Pietro Speroni comments on some interesting cultural factors which appear to be at work in the tagging pattern of some sites, resulting in interesting changes in the usage of particular tags. But in spite of his optimistic “Change, change everywhere” conclusion, the relative number of sites which are dominated by stability is unknown. I have looked at quite a few, and have very rarely seen such interesting “cultural patterns”. Clearly we need some real data if this turns out to be an important question.

Added Feb. 9: It is always a great feeling when you are working on a hypothesis like the one above, to find some corroborating evidence. So after I blogged this, I discovered a very nice paper by Scott Golder and Bernardo Huberman at HP Labs. They do an extensive study of tagging behavior and come to the same conclusion about the amazing stability of tag use. But they make two additional points of interest. First, they show that tags tend to stabilize after just 100 bookmarks have been assigned to the URL. Thus even not so popular sites end up with stable tag clouds. Secondly they argue that “imitation” made possible by the user interface can’t be the whole explanation of the stability of tags since the less popular tags, which are not shown as suggestions through the interface, display the same stability over time!

An important issue seems to have been overlooked concerning the power law curve observed with tags. Tim Vanderwall attempts to explain the power law curve with a process in which people who share common vocabularies are brought together into clusters, because the shared vocabulary helps them identify the resources. Since new users typically have access to the most popular tags already used for that site, it is likely that they will choose at least some of those existing tags. But this sets up a positive feedback loop. If the popular tags are picked by new users they will become more popular still, which will influence even more new users to pick that tag, and so on. But this “social” explanation can’t be the whole story. Consider the following experiment: Suppose I am a really rich guy who wants to influence tags. So I pay 10000 people to tag resources according to my schema. I tell them to mark one site with “eek”, another one with “woo hoo”, a third one with “grumpy grumpy head”, and so on. With enough people, these should become the most popular tags. But how long will the dominance of these tags last? This is an experiment that does not really need doing! Another version of the experiment might be worth doing. Suppose I wanted the most popular tag for each URL to be some sort of emotional evaluation like “cool”, “interesting”, “awful”, and the like. Would these stick? My feeling is, NO. My feeling is that “subjective” tags of this sort won’t make it because there is too much individual variability in the emotional reaction to a site … and this reaction is hard to coerce. On the other hand the tags which do survive the “popularity contest” are those about which, contrary to popular opinion, there is not much disagreement and individual difference. Put another way, even though I might not have thought to label “The New York Times” with “News” on a particular occasion, I certainly would not argue that it should not be labeled with “News”.
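The positive feedback loop can be written down as a toy simulation. Everything here is an assumption made for illustration: the function simulate_tagging, the imitation probability, and the rule that a new user either copies an existing tag in proportion to its current count or picks one at random from a fixed vocabulary. It is not a model anyone has fitted to real tagging data.

```python
import random


def simulate_tagging(n_users, vocabulary, imitation, seed=0):
    """Toy rich-get-richer process for the tags on a single URL.

    With probability `imitation` a new user copies a tag already in use,
    chosen in proportion to how often it has been used so far. Otherwise
    the user picks uniformly from `vocabulary`.
    """
    rng = random.Random(seed)
    counts = {tag: 0 for tag in vocabulary}
    history = []  # one entry per past use, so a uniform draw is count-weighted
    for _ in range(n_users):
        if history and rng.random() < imitation:
            tag = rng.choice(history)
        else:
            tag = rng.choice(vocabulary)
        counts[tag] += 1
        history.append(tag)
    return counts


vocabulary = ["news", "daily", "finglewick", "eek", "cool"]
for seed in range(3):
    print(seed, simulate_tagging(2000, vocabulary, imitation=0.9, seed=seed))
```

Note what the sketch leaves out: every tag in the vocabulary starts with exactly the same chance, so imitation alone can amplify an early lead but says nothing about which tag gets it. In this toy world “finglewick” is as good a candidate as “news”, which is precisely why the social explanation cannot be the whole story.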

The point of course is that in order to “make it” as a popular tag, you have to have some pretty special properties in relation to the resource you are tagging. My whole point up till now has been, “What are those??”. My claim is that these properties involve cognitive and linguistic processes, and the emergence of patterns and clouds can only be understood with some insight into those processes. It is not the mathematical generalizations but linguistic/cognitive facts which will give greatest insight into user tags.

There is an additional implication of the fact that highly idiosyncratic tags like “must-read” don’t tend to dominate the distribution (I think there is a side issue that there are many different ways to be idiosyncratic .. it can be a tag only used by a particular individual, or alternatively it can be a tag used by more people, but each time in a highly individual way). If this is generally true, it shows that examples of this sort, which are often cited against the “tags as ontologies” notion, lose some of their power. These highly individual tags seem to me to take on a completely different role in tagging behavior. My feeling is that popular tags are about “collective categories” (of various sorts, to be discussed) and idiosyncratic tags are about user-centered, context dependent memory cues. This has at least two implications. First, we need to consider different sorts of tags .. tags are not all the same. Second, we need to find evidence that highly individual tags are even useful. They are often assumed to be, in the interest of the new, empowering, free-to-choose-as-you-like paradigm. But how many things can you label “to read” before the tag loses its meaning? Like the piles of papers rising like mountains in the corners of many of our desks, I am sure! In fairness, I acknowledge the sometimes made claim that user tags have an additional (or perhaps predominant?) role in pointing to similar, possibly useful URLs. According to this view tags are different to formal categories in that the latter are about locating resources in some precise manner while the former are about navigating among potentially useful sites, using tags as pointers. But even if this is true, the point remains that tags like “must-read” and “cool” will add very different amounts of value to different audiences.
There is another interesting attempt to find patterns in tags using statistical co-occurrence. Here is an example of my tags translated into a mindmap. The mindmap shows two interesting patterns. First, it shows groups of tags which tend to be used for the same URL. The amount of overlap can be adjusted by a parameter, but the default is set around 60%. That is, if two tags share 60% of their URLs they are clustered together. An example on my map is [Wikepedia encyclopedia]. More than two tags can be clustered as in [emoticons messenger smiley yahoo]. Actually this is a little more complicated because the parametrically determined number of shared tags also depends on the depth of the nodes. Nodes at the leaves can be clustered even if they share much fewer than 60%. The meaning of the hierarchical relation is the second interesting point in this map. Any tag which appears as a sub-tag in the mindmap only ever labels URLs that its super-tag also labels. For example on my mindmap “rss” labels two URLs with the names “RSS Readers for Linux” and “FeedXs”. In turn “rss” has the sub-tags “reader”, “feeds”, “free” and “publishing”. Of these, the first is used to tag “RSS Readers for Linux” and the last three each tag “FeedXs.”
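
The two patterns can be sketched in a few lines of code. This is not the mindmap tool's actual algorithm, which I have not seen: the overlap measure (shared URLs divided by all URLs carrying either tag), the function names and the flat 60% threshold are my assumptions, and the depth-dependent adjustment is left out. The bookmarks are just the “rss” example from my map.

```python
from itertools import combinations

# Bookmarks modelled on the "rss" example in the text: URL name -> tags.
bookmarks = {
    "RSS Readers for Linux": {"rss", "reader"},
    "FeedXs": {"rss", "feeds", "free", "publishing"},
}

def urls_by_tag(bookmarks):
    """Invert url -> tags into tag -> set of urls."""
    index = {}
    for url, tags in bookmarks.items():
        for tag in tags:
            index.setdefault(tag, set()).add(url)
    return index

def overlap(a, b):
    """Share of URLs two tags have in common (Jaccard index; an assumption)."""
    return len(a & b) / len(a | b)

def clustered_pairs(index, threshold=0.6):
    """Pairs of tags whose URL overlap reaches the threshold."""
    return sorted(
        (s, t)
        for s, t in combinations(sorted(index), 2)
        if overlap(index[s], index[t]) >= threshold
    )

def sub_tags(index, super_tag):
    """Tags whose URLs are a strict subset of the super-tag's URLs."""
    return sorted(t for t, urls in index.items() if urls < index[super_tag])

index = urls_by_tag(bookmarks)
print(clustered_pairs(index))
print(sub_tags(index, "rss"))
```

On this tiny example “feeds”, “free” and “publishing” cluster together because they label exactly the same URL, and all four remaining tags come out as sub-tags of “rss”.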
So what are the additional tags doing?
One possibility is that they are in a sense redundant … rss is always free, so the two tags provide alternate routes for finding the site. In my particular folksonomy either one would have done the trick on its own, but the redundancy might help finding the resource from two different sources.
Another possibility is that the additional tags refine the search. “RSS” would give two links but “rss” + “reader” gives only one. As such they act like subclasses in a formal taxonomy. Except .. they don’t. In the current example it is obvious that “reader” is not supposed to be a subclass of “rss”. Instead, I meant to have a single category “rss reader” .. but the site does not allow two-word tags! But there are other reasons for two tags to go together, apart from a design side effect and a genuine subclass. For example “September11” and “GeorgeBush” might go together ’til the end of time, but not because one is a subclass of the other, nor is one in any sense a refinement of the other.
These relationships contain valuable information, which I haven’t really thought enough about. But it is pretty clear that a number of different patterns could emerge.
One observation which is pretty clear is that individual taggers (not an aggregation now) have a selected set of tags which in some sense dominates the others. Look at some numbers on the main site again. My map has the following numbers next to it: (78, 168, 56), meaning that I have 78 unique URLs tagged with a total of 168 tags, but only 56 of those are unique. The pattern here varies widely, with some people having many more total tags than main tags (lots of hierarchical clustering) and others having hardly any hierarchical use of tags.
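
For anyone who wants to compute the same three numbers for their own bookmarks, here is a minimal sketch. The data layout (a dict from URL to a set of tags) and the function name `tagging_summary` are assumptions of mine, not how the site stores things.

```python
# Hypothetical bookmarks: each URL maps to the set of tags applied to it.
bookmarks = {
    "RSS Readers for Linux": {"rss", "reader"},
    "FeedXs": {"rss", "feeds", "free", "publishing"},
}

def tagging_summary(bookmarks):
    """Return (unique URLs, total tag uses, unique tags)."""
    n_urls = len(bookmarks)
    total_uses = sum(len(tags) for tags in bookmarks.values())
    unique_tags = len(set().union(*bookmarks.values()))
    return n_urls, total_uses, unique_tags

summary = tagging_summary(bookmarks)
print(summary)  # (2, 6, 5)
```

Read this way, my own (78, 168, 56) says that each distinct tag gets used 168 / 56 = 3 times on average.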
There is clearly lots of interesting information hidden in these relationships.
But I haven’t yet told you about what I think is going on with the popular tags. I think this might also help us understand the individual ones …..

Naive ontologists

So I have been away for a while dealing with the more mundane aspects of academia. Trying to finish papers before a deadline, attending meetings …

But in the meantime I also managed to find some really interesting patterns in the way people tag resources.

This is the reason scientists obsess over observations and theories …. to make sense of the mess. Imagine the buzz you get when patterns begin to emerge from the noise: when everyone else sees clouds, you see the regular little formations that lie behind the otherwise amorphous outgrowths.

Consider the set of most popular tags for the New York Times web site. Here they are, followed by the number of people using that tag (as of January, 2006). The numbers were obtained through the Scrumptious plugin for Firefox.

news (2093), newspaper (550), daily (370), nyc (298), newspapers (251), media (229), usa (156), politics (68), newyork (56), new (43), world (40), york (38), nytimes (38), nyt (36), times (34), us (32), culture (28), international (23), safari_export (22), reference (19), national (18), ny (18), business (17), noticias (16), english (15).

The individual tags are all obvious or semi-obvious in the way they describe the resource (the NYT web site). There are some puzzling ones … “new” for example. But presumably this was meant to go together with “york” to form “new york”. A slight problem with this interpretation is that “new” appears slightly more often than “york” which means that if my speculation is correct, then some people must have used “new” independently of “york”. Why? Perhaps they used it in conjunction with another word? Which one? We need to obsess about this problem a little bit in the background until we have an answer. Loose threads are annoying.
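
The arithmetic behind this loose thread is simple enough to write down. The sketch below assumes that each person applies a given tag to the URL at most once, which the raw counts cannot confirm.

```python
# Tag counts quoted above for the New York Times site.
counts = {"new": 43, "york": 38, "newyork": 56}

# Assumption: each person applies a given tag to this URL at most once.
# Then everyone who used "york" could have paired it with "new", and the
# surplus is a lower bound on people who used "new" without "york".
new_without_york = max(0, counts["new"] - counts["york"])
print(new_without_york)  # 5
```

So at least five people used “new” with something other than “york”, and it could be more if some “york” users never typed “new” at all.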

But the really interesting question is, why do people use these particular tags? A typical and simple answer that is generally given is that people use whatever tags they think will be useful for them to later retrieve the documents. These tags happen to be useful for most people in most contexts.

But I have to wonder, why?? Why are these words useful for this resource? Can we characterize the most used tags in some interesting way? Can we predict for a new URL how people will tag it?
This last point is important, because prediction is one of the most important aspects of science. Most people won’t use random words to describe a resource. Can we tell which ones they will use?

But I think the claim that tagging is so different to categorization is meant in a strong sense. I think the claim must be that no formal and predictable pattern can be ascribed to user classifications (apart from localized, transient, culturally transmitted patterns). That is, in complete opposition to the sort of formal ontologies used in the Yahoo directory for instance, user tags are supposed to be parasitic on whatever mental association happens to work for a given person at a given time for a given URL in a given context. It is precisely their lack of formal pattern that is supposed to make tags so useful for individuals, and which allows anyone to enter into the classification game.

But is this true? Is there really no formal pattern to be seen in user tags? I think there is plenty, and the topic of my next few blogs will be to see what patterns I found, and how I found them. So think about the tags above … what can you see?? Can you see the handiwork of “naive ontologists”?