More “order from chaos”

This morning I discovered the following words of wisdom from a blog that aims to provide technical tips and tricks to other bloggers. This one is about ways to tag your own blog in order to make it more visible to searches. Here is a really interesting claim about the differences between tags and categories. It seems to me to be a very clear and concise summary of much of the public opinion in favor of tags:

“One of the common distinctions that comes up is whether you’re using tags or categories (“tags-as-categories”) for your posts. The distinction is that categories are fewer in number, generic, chosen beforehand, possibly hierarchical (sub-categories) and persist for a long time. Sometimes people will file a post under only one category. Tags are much more specific, made up on the spot, are “flat”, may be single-use and each post may have half a dozen or more of them.”

But this is really interesting in light of the emerging patterns I discussed last time. First, the claim that categories are fewer in number is not really true if you look at the relatively few tags that become popular for a given site. Second, the claim seems to be that categories persist for a long time but tags don’t ... again blatantly false if the historical trends are any indication! Third, there is a distinction set up between “generic” categories and “specific” tags. While I have not presented much evidence about this, my feeling is that this is also not true, at least for the most popular tags. Consider again the running example. How much more generic can you get than “news” and “New York”? Finally, tags are supposed to be made up “on the spot” and categories “chosen beforehand”. This is a tricky one, which involves the whole process of creating consensual categories vs. “inventing” tags, and it is a lot trickier than the writer suggests: surely there are prior constraints on what kinds of tags are chosen, whether internal (cognitive), external (social/cultural) or pragmatic (the tagging tool often suggests tags based on prior community agreement).

It seems to me that the aggregate information from tags is telling a very different story from the “popular wisdom”. I hinted earlier that the tags which become really popular might be systematically different from those which do not. The writer I am quoting above seems to agree with this to some extent in a later part:

“In practice, the category/tag distinction is a question of degree and many authors use a hybrid strategy: tag their posts wildly but reserve ten or so to use as categories. This seems to be a best-of-both-worlds scenario, as your top tags (when sorted by frequency) function like categories while the long-tail of one-shot tags is useful for tag-searchers.”

So here is a crude first shot at categorizing tags: high frequency = category, low frequency = something else.
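To make the crude split concrete, here is a minimal sketch. The posts, the tag names and the cut-off of 3 are all made up for illustration; nothing here comes from real data, and where to put the threshold is exactly the open question.

```python
from collections import Counter

# Hypothetical tagged posts; the tag names are invented for illustration.
posts = [
    ["news", "new-york", "subway-delay"],
    ["news", "politics"],
    ["news", "new-york", "pizza"],
    ["new-york", "marathon"],
]

def split_by_frequency(posts, threshold):
    """Split tags into category-like ones (count >= threshold) and the rest."""
    counts = Counter(tag for tags in posts for tag in tags)
    frequent = {tag for tag, n in counts.items() if n >= threshold}
    rare = set(counts) - frequent
    return frequent, rare

# The threshold is an arbitrary assumption, not an empirical finding.
frequent, rare = split_by_frequency(posts, threshold=3)
print(sorted(frequent))  # the tags that behave like categories
print(sorted(rare))      # the long tail of one-shot tags
```

With these toy posts, “news” and “new-york” land on the category side and the four single-use tags fall into the “something else” bucket.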
But die-hard anti-categorists would probably be revolted by this conclusion. Here is one of the most emotionally charged pro-tagging blogs I have found! He says:

“What’s more, is that I can cross-reference those bad boys! Yessiree! I can see what posts I have written containing any combination of tags. That’s power. I’d like to see your wack-ass categories do that!”

So this is a different argument ... it does not care about the abstract nature of categories vs. tags; what it cares about is how they can be used. For many practical reasons, systems that use categories tend to be exclusive in the sense that a particular resource is filed under just one category. This is why the category structure has to be just right, otherwise the resource will never be found. Of course, it is never possible to get this just right for every conceivable need. Cross-referencing is possible (and probably necessary), but results in vastly increased complexity. But tags appear to represent a new paradigm: resources can be tagged with as many tags as one desires, and these tags can cross-classify the resource in any which way. Retrieval is then accomplished by finding suitable intersecting regions of the relevant tag space. The retrieval process seems to make sense with simple examples. In a recent paper, Golder and Huberman illustrate the new paradigm with the following example:

“For example, consider a hypothetical researcher who downloads an article about cat species native to Africa. If the researcher wanted to organize all her downloaded articles in a hierarchy of folders, there are several hypothetical options, of which we consider four:
1. c:\articles\cats    all articles on cats
2. c:\articles\africa    all articles on Africa
3. c:\articles\africa\cats    all articles on African cats
4. c:\articles\cats\africa    all articles on cats from Africa
Each choice reflects a decision about the relative importance of each characteristic. Folder names and levels are in themselves informative, in that, like tags, they describe the information held within them (Jones et al. 2005). Folders like 1. and 2. make central the fact that the folders are about “cats” and “africa” respectively, but elide all information about the other category. 3. and 4. organize the files by both categories, but establish the first as primary or more salient, and the second as secondary or more specific. However, looking in 3. for a file in 4. will be fruitless, and so checking multiple locations becomes necessary.”

The promise of tags is that they will eliminate the need to check multiple locations because they eliminate the need to “guess” which category to look in: you simply search for “Africa+cat”. But this idyllic situation soon begins to look a little worse ... Suppose the researcher downloads some more articles, this time specifically about “cheetahs”. Neglecting to tag them with the existing tag “cat”, she decides to use the more specific tag “cheetah”. But now the search for “Africa+cat” fails to find these important articles! So the user has to look elsewhere, possibly realizing that the third tag is also relevant. This needs sophisticated tools that can do some fancy computations over the set of existing tags: information extraction and reasoning techniques. How bad will this get with millions of resources and millions of tags? No one knows, but some speculate that the confusion will become so great, and the cost of retrieving useful information in such a landscape so prohibitive, that the whole enterprise will one day disappear!
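The “Africa+cat” search and its cheetah problem can be sketched in a few lines. This is only a toy illustration: the file names and the helper functions `build_index` and `search` are hypothetical, and no real tagging tool is being described.

```python
from collections import defaultdict

def build_index(tagged):
    """Map each tag to the set of resources carrying it."""
    index = defaultdict(set)
    for resource, tags in tagged.items():
        for tag in tags:
            index[tag].add(resource)
    return index

def search(index, *tags):
    """Return the resources that carry every one of the given tags."""
    sets = [index.get(tag, set()) for tag in tags]
    return set.intersection(*sets) if sets else set()

# Hypothetical downloads; the cheetah article was never tagged "cat".
articles = {
    "lions.pdf": {"africa", "cat"},
    "housecats.pdf": {"cat"},
    "cheetah-speed.pdf": {"africa", "cheetah"},
}

index = build_index(articles)
found = search(index, "africa", "cat")       # misses the cheetah article
other = search(index, "africa", "cheetah")   # only this query finds it
print(found, other)
```

The intersection itself is trivial. What the index cannot know is that a cheetah is a cat, and that is the part that would need the extra reasoning machinery.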

Since we don’t have any direct evidence (or at least none that I am aware of!) about the way in which people use their general as well as (possibly very many) idiosyncratic tags for the purpose of retrieval, we have to guess through indirect evidence. Speaking from personal experience though, I have NO IDEA what many of my least frequently used tags are even supposed to be about ... so mostly I just use those BIG TAGS on my TAG CLOUD!

So now we need some evidence on how people use tags. I suppose the first thing to look at is how many tags individual people tend to use with individual URLs. Speroni, who has gathered many users’ information for his mindmaps, estimates this figure at around 10. This site is a very interesting read because he goes on to explain how you can use Pascal’s triangle to calculate the number of URLs that can be uniquely indexed by a combination of n out of a total of m tags. But this is an idealistic calculation which assumes that the tags are used independently (I think!). So I gathered some numbers from the users who generated mindmaps. There were 2202 maps from which I calculated the means and medians. The mean and median number of links were 179 and 103, respectively. To index 179 links, you need only 10 tags used in combinations of 4 (which gives 210 distinct combinations). But the mean number of tags per user is 100! Obviously a suboptimal strategy by the users!!
This suggests that people use many more tags than they really need for each bookmark. Many of those tags must be redundant, or even unused. Of course the result is consistent with the idea that a few, frequent tags are used as categories in a way that efficiently narrows the search. The rest of the tags are there either for redundancy, or perhaps to facilitate alternative, albeit more infrequent, access paths.
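The combination count above is easy to check. The sketch below assumes the same idealized setting as the Pascal’s triangle argument: each link gets its own distinct combination of exactly k tags and the tags are used independently. The helper name `min_combo_size` is mine, not Speroni’s.

```python
from math import comb  # binomial coefficient, available in Python 3.8+

def min_combo_size(num_tags, num_links):
    """Smallest k with C(num_tags, k) >= num_links, or None if no k works."""
    for k in range(num_tags + 1):
        if comb(num_tags, k) >= num_links:
            return k
    return None

# 10 tags, 179 links (the mean number of links per user).
k_ten = min_combo_size(10, 179)
print(k_ten, comb(10, k_ten))

# The mean user actually has about 100 tags.
k_hundred = min_combo_size(100, 179)
print(k_hundred, comb(100, k_hundred))
```

Under these assumptions 10 tags in combinations of 4 are enough, while a user with 100 tags could in principle index the same 179 links with pairs alone, which is another way of seeing how much slack there is.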
Here is another interesting observation that supports this view. Golder and Huberman analyzed the tags assigned to individual URLs by individual users, in the order that the tags were assigned to that URL. What they found was that people tend to use the highest frequency tags first, then start using the lower frequency, more idiosyncratic tags. This is again consistent with the view that high frequency tags are like categories or folders to hold relevant items, while lower frequency tags add more personally oriented distinguishing features to each resource.

So, popular tags (and many less popular ones) are stable, intrinsically sensible and communally accepted categories!!

But what sort of categories are they? How do they structure the information space? Do they really cross-categorize? Again, back to the beginning!


One comment on “More “order from chaos”

  1. Greg Hill says:

    Hi Csaba,

    I’m the guy you quoted above (Greg Hill). I’d like to preface my remarks by pointing out that this article was written from the point of view of individual bloggers setting up navigation systems, rather than an analysis of a site from the entire community’s perspective.
    1) “Categories are fewer in number than tags”. I still stand by this. Just because a few tags become popular for a particular site doesn’t mean the unpopular tags somehow don’t count. Blogs that employ the “pure” tagging approach (ie not tags-as-categories) typically have dozens or hundreds of tags. You might be able to find the odd pathological case – but I doubt it.

    2) “Categories persist for a long time, but tags don’t”. I could have been clearer. I meant the “active life” ie still being associated with content, not just used for retrieval. This is in contrast to the “one-shot” tags.

    3) Generic vs specific. If you’re using “news” as a tag, I’d suggest that’s a category! Which pretty much gets to the nub of it. The flavour of my post is to define a category as a particular (restricted) kind of tag. Why?

    Well, in the context of organising blog posts, some blogging platforms support “true” conventional, traditional categories (like WordPress). Others (like Blogger) don’t, and they have to be bolted on with hackery. The major focus on Freshblog is doing this with “web2.0” tech (tags, feeds, web services, AJAX etc). So, for people doing it this way, all they have is tags. The question is whether they wish to “replicate” categories with tags or use tags “natively”. In practice (as I pointed out and you seem to agree), many people use a hybrid approach, where frequency indicates the “degree of categoriness”.

    Cheers,

    -Greg.
