Emerging patterns

There are many ways to see patterns. The most common perhaps is mathematical and statistical. What tends to go with what? How do things change over time? What mathematical equations can describe patterns of behavior? The very existence of physics is a testament to the power of this approach.

But mathematical analysis is not always appropriate. Complex equations can be used to analyze movement on the stock market, but they can’t explain why some people sell and some don’t, when they appear to be in very similar circumstances. As another example, statistical regularities are plentiful in language, but an analysis of these regularities cannot tell us much about what humans know about language when they use it. Thus, the sentences “Who do you want to go with?” and “Who do you wanna go with?” both sound fine. But in a similar pair, “Who do you want to help you?” is fine while “Who do you wanna help you?” is not. Explaining how we know this requires a different sort of analysis and explanation.

In what follows I summarize some of the existing statistical observations and note their inadequacy. Later I present an alternative sort of explanation which I think is much more powerful.

There is a very interesting web site that provides visualizations of the historical patterns in tag use for a given URL. The site is cloudalicious.

Here is a visualization of the way in which the most popular tags emerged for the New York Times web site. (An important note: in what follows I will illustrate my claims with the NYT example, which is but a single example. This is terrible practice, although one which seems to be followed by an awful lot of people in the “Blog world”. It is terrible because we have no way of knowing whether a particular observation is true in general or only for a specially selected example. The proper evaluation of any claim therefore has to be done by looking at a reasonable set of randomly selected sites, to see if the observation repeats itself often enough for us to believe in its generality.)

There are a number of interesting observations that can be made about this graphic.

First, there is a hint of the famous “power law curve” apparent in the curves, although this graph is not a particularly good illustration. But the idea is that there are a few tags which are used very often and very many tags which are used less often. In this example the use of the tag “news” dominates more strongly than is typically the case, and the rest of the tags are a bit too close together at the bottom.

But the most striking observation is how much agreement there is in the use of the tags. After a brief unsettled period at the beginning of the history, the pattern pretty much stabilizes and the dominance of the most popular tags is never challenged. (The sudden drop in all tags is linked to a sudden influx of new users tagging that site, which is marked with the faint gray line in the background).
There are some interesting exceptions to this relative stability, and Pietro Speroni comments on some interesting cultural factors which appear to be at work in the tagging pattern of some sites, resulting in interesting changes in the usage of particular tags. But in spite of his optimistic “Change, change everywhere” conclusion, the relative number of sites which are dominated by stability is unknown. I have looked at quite a few, and have very rarely seen such interesting “cultural patterns”. Clearly we need some real data if this turns out to be an important question.

Added Feb. 9: It is always a great feeling, when you are working on a hypothesis like the one above, to find some corroborating evidence. After I blogged this, I discovered a very nice paper by Scott Golder and Bernardo Huberman at HP Labs. They do an extensive study of tagging behavior and come to the same conclusion about the amazing stability of tag use. But they make two additional points of interest. First, they show that tags tend to stabilize after just 100 bookmarks have been assigned to the URL. Thus even not so popular sites end up with stable tag clouds. Second, they argue that “imitation” made possible by the user interface can’t be the whole explanation of the stability of tags, since the less popular tags, which are not shown as suggestions through the interface, display the same stability over time!

An important issue seems to have been overlooked concerning the power law curve observed with tags. Tim Vanderwall attempts to explain the power law curve with a process in which people who share common vocabularies are brought together into clusters, because the shared vocabulary helps them identify the resources. Since new users typically have access to the most popular tags already used for that site, it is likely that they will choose at least some of those existing tags. But this sets up a positive feedback loop. If the popular tags are picked by new users they will become more popular still, which will influence even more new users to pick those tags, and so on. But this “social” explanation can’t be the whole story. Consider the following experiment: Suppose I am a really rich guy who wants to influence tags on del.icio.us. So I pay 10,000 people to tag resources according to my schema. I tell them to mark one site with “eek”, another one with “woo hoo”, a third one with “grumpy grumpy head”, and so on. With enough people, these should become the most popular tags. But how long will the dominance of these tags last? This is an experiment that does not really need doing! Another version of the experiment might be worth doing. Suppose I wanted the most popular tag for each URL to be some sort of emotional evaluation like “cool”, “interesting”, “awful”, and the like. Would these stick? My feeling is, NO. My feeling is that “subjective” tags of this sort won’t make it because there is too much individual variability in the emotional reaction to a site … and this reaction is hard to coerce. On the other hand, the tags which do survive the “popularity contest” are those about which, contrary to popular opinion, there is not much disagreement and individual difference. Put another way, even though I might not have thought to label “The New York Times” with “News” on a particular occasion, I certainly would not argue that it should not be labeled with “News”.
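
To make the feedback-loop part of this argument concrete, here is a minimal sketch of a "rich get richer" tagging process. Everything in it is an assumption for illustration: the function name simulate_tagging, the imitation_prob parameter, and the candidate tag list are hypothetical, and nothing here is taken from del.icio.us or from any published model. Each simulated user either copies an existing tag with probability proportional to its current count, or picks a candidate tag at random.

```python
import random
from collections import Counter


def simulate_tagging(candidate_tags, n_users, imitation_prob=0.7, seed=0):
    """Toy positive-feedback model of tag choice (illustrative assumption only).

    With probability `imitation_prob` a new user copies a tag that is already
    in use, chosen in proportion to how often it has been used so far.
    Otherwise the user picks uniformly from `candidate_tags`.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_users):
        if counts and rng.random() < imitation_prob:
            # Imitation: popular tags are more likely to be picked again.
            tags, weights = zip(*counts.items())
            choice = rng.choices(tags, weights=weights, k=1)[0]
        else:
            # Independent choice: ignore what earlier users did.
            choice = rng.choice(candidate_tags)
        counts[choice] += 1
    return counts


if __name__ == "__main__":
    result = simulate_tagging(["news", "nyt", "daily", "media"], n_users=1000)
    for tag, count in result.most_common():
        print(tag, count)
```

Note what the sketch leaves out: it treats all tags as interchangeable, so it has no way to say why “news” rather than “eek” should win for a given site. That gap is exactly the point of the thought experiment above.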

The point of course is that in order to “make it” as a popular tag, a tag has to have some pretty special properties in relation to the resource being tagged. My whole point up till now has been, “What are those properties?”. My claim is that these properties involve cognitive and linguistic processes, and the emergence of patterns and clouds can only be understood with some insight into those processes. It is not the mathematical generalizations but linguistic/cognitive facts which will give greatest insight into user tags.

There is an additional implication of the fact that highly idiosyncratic tags like “must-read” don’t tend to dominate the distribution (I think there is a side issue that there are many different ways to be idiosyncratic .. it can be a tag only used by a particular individual, or alternatively it can be a tag used by more people, but each time in a highly individual way). If this is generally true, it shows that examples of this sort, which are often cited against the “tags as ontologies” notion, lose some of their power. These highly individual tags seem to me to take on a completely different role in tagging behavior. My feeling is that popular tags are about “collective categories” (of various sorts, to be discussed) and idiosyncratic tags are about user-centered, context-dependent memory cues. This has at least two implications. First, we need to consider different sorts of tags .. tags are not all the same. Second, we need to find evidence that highly individual tags are even useful. They are often assumed to be, in the interest of the new, empowering, free-to-choose-as-you-like paradigm. But how many things can you label “to read” before the tag loses its meaning? It ends up, I am sure, like the piles of papers rising like mountains in the corners of many of our desks! In fairness, I acknowledge the sometimes made claim that user tags have an additional (or perhaps predominant?) role in pointing to similar, possibly useful URLs. According to this view tags are different to formal categories in that the latter are about locating resources in some precise manner while the former are about navigating among potentially useful sites, using tags as pointers. But even if this is true, the point remains that tags like “must-read” and “cool” will add very different amounts of value to different audiences.
There is another interesting attempt to find patterns in tags using statistical co-occurrence. Here is an example of my tags translated into a mindmap. The mindmap shows two interesting patterns. First, it shows groups of tags which tend to be used for the same URL. The amount of overlap can be adjusted by a parameter, but the default is set around 60%. That is, if two tags share 60% of their URLs they are clustered together. An example on my map is [Wikepedia encyclopedia]. More than two tags can be clustered, as in [emoticons messenger smiley yahoo]. Actually this is a little more complicated, because the parametrically determined amount of overlap also depends on the depth of the nodes. Nodes at the leaves can be clustered even if they share far fewer than 60%. The meaning of the hierarchical relation is the second interesting point in this map. Any tag which appears as a sub-tag in the mindmap is one that never labels a URL which the super-tag does not also label. For example, on my mindmap “rss” labels two URLs with the names “RSS Readers for Linux” and “FeedXs”. In turn “rss” has the sub-tags “reader”, “feeds”, “free” and “publishing”. Of these, the first is used to tag “RSS Readers for Linux” and the last three each tag “FeedXs.”
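
The two relations described above, overlap clustering and sub-tags, can be sketched in a few lines. This is only my reading of the description, not the mindmap tool's actual code: I assume overlap is measured as shared URLs divided by all URLs carrying either tag (the Jaccard index), I ignore the depth-dependent adjustment, and the function names are hypothetical. The sample data is the “rss” example from my own map.

```python
from itertools import combinations


def tag_index(bookmarks):
    """Invert {url: set of tags} into {tag: set of urls}."""
    index = {}
    for url, tags in bookmarks.items():
        for tag in tags:
            index.setdefault(tag, set()).add(url)
    return index


def overlap(urls_a, urls_b):
    """Share of URLs common to both tags (Jaccard index, an assumption)."""
    return len(urls_a & urls_b) / len(urls_a | urls_b)


def clustered_pairs(index, threshold=0.6):
    """Pairs of tags whose URL overlap reaches the threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(index), 2)
        if overlap(index[a], index[b]) >= threshold
    ]


def is_subtag(index, sub, sup):
    """True if `sub` never labels a URL that `sup` does not also label."""
    return sub != sup and index[sub] <= index[sup]


bookmarks = {
    "RSS Readers for Linux": {"rss", "reader"},
    "FeedXs": {"rss", "feeds", "free", "publishing"},
}

if __name__ == "__main__":
    index = tag_index(bookmarks)
    print(clustered_pairs(index))
    print(is_subtag(index, "reader", "rss"))
```

On this tiny sample the three tags that only ever label “FeedXs” cluster together, and “reader” comes out as a sub-tag of “rss” but not the other way round.
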
So what are the additional tags doing?
One possibility is that they are in a sense redundant … rss is always free, so the two tags provide alternate routes for finding the site. In my particular folksonomy either one would have done the trick on its own, but the redundancy might help finding the resource from two different sources.
Another possibility is that the additional tags refine the search. “RSS” would give two links but “rss” + “reader” gives only one. As such they act like subclasses in a formal taxonomy. Except .. they don’t. In the current example it is obvious that “reader” is not supposed to be a subclass of “rss”. Instead, I meant to have a single category “rss reader” .. but del.icio.us does not allow two-word tags! But there are other reasons for two tags to go together, apart from a design side effect and a genuine subclass. For example “September11” and “GeorgeBush” might go together ’til the end of time, but not because one is a subclass of the other, nor is one in any sense a refinement of the other.
These relationships contain valuable information, which I haven’t really thought enough about. But it is pretty clear that a number of different patterns could emerge.
One observation which is pretty clear is that individual taggers (not an aggregation now) have a selected set of tags which in some sense dominates the others. Look at some numbers on the main site again. My map has the following numbers next to it: (78, 168, 56), meaning that I have 78 unique URLs tagged with a total of 168 tags, but only 56 of those are unique. The pattern here varies widely, with some people having many more total tags than main tags (lots of hierarchical clustering) and others having hardly any hierarchical use of tags.
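
For what it is worth, the three numbers are easy to compute from a list of bookmarks. The sketch below is my own illustration and not the site's code: the function name tagging_summary and the two-bookmark sample are assumptions, and I am reading the triple as (unique URLs, total tag assignments, unique tags).

```python
def tagging_summary(bookmarks):
    """Return (unique URLs, total tag assignments, unique tags).

    `bookmarks` maps each URL (or title) to the set of tags applied to it.
    """
    n_urls = len(bookmarks)
    total_tags = sum(len(tags) for tags in bookmarks.values())
    unique_tags = len(set().union(*bookmarks.values()))
    return n_urls, total_tags, unique_tags


sample = {
    "RSS Readers for Linux": {"rss", "reader"},
    "FeedXs": {"rss", "feeds", "free", "publishing"},
}

if __name__ == "__main__":
    print(tagging_summary(sample))
```

The gap between the second and third numbers is a rough measure of how often a tagger reuses the same tags across different URLs.
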
There is clearly lots of interesting information hidden in these relationships.
But I haven’t yet told you about what I think is going on with the popular tags. I think this might also help us understand the individual ones …..