Different kinds of tags

So what do you make of this?

This picture shows a number of tags that have been grouped according to a high level linguistic analysis of the terms. I will not explain the specific semantic abstraction in this post, because I want to see if people (if there are any out there watching my poor blog!) can make any sense of it. That is, do the terms in each branch seem to belong to the same “kind”?
Enough for now.


Tagging: Words versus Magic

The idea for this entry came from some very interesting comments on my previous ramblings about the nature of tagging. It seems to me that there are a lot of strange notions out there about the relationship between concepts, words, and tags. Let me first give you my simple view on tagging, which I think is probably right.

1. Our heads are filled with concepts. We don’t really know what these are like, and philosophers and psychologists (and, unfortunately, computer scientists, mathematicians and witch doctors) will argue for a long time about the best way to characterize them. Nevertheless, we know that children have them before they learn to speak. For example, the psychologist Liz Spelke has done fascinating work on the nature of “innate concepts” in infants. I think animals also have a pretty rich conceptual system.

2. Humans (but not animals) have an instinct to communicate with language. Language provides words and grammar to communicate the concepts in our head. I won’t talk about grammar. Words are pretty much arbitrary sounds that we use in speech. Words have a sound that we learn. They also have some syntactic properties that tell us how they should be used in sentences. So I am thinking about cats. I can say “I am thinking about cats”. “Cats” is the sound I use to talk about the cute furry things I am thinking about. If I were talking to a Hungarian about cute furry things I would say “macska”. It pretty much doesn’t matter which sound I use as long as the other person also knows it. But I wouldn’t say “I catted home today” because I know that the word “cat” is a noun and I shouldn’t use it as a verb. I might also suspect this is the case because nouns tend to be about things, and we can’t use them in place of actions.

3. So words are pretty handy. So now I go to del.icio.us and find a web site which I think is about cats. What should I tag it? Geez. Hmmm.

So there you go. Of course there are some messy steps here. There always are. How do I decide the site is about cats? We don’t really know, but this popular blog suggests one mechanism by which humans (and birds) can make quick intuitive category decisions. But once I decide it is about cats, why wouldn’t I tag it “cats”??? Of course I might add other tags. I can add adjectives which are usually descriptors of some sort. “cute” comes to mind.

I don’t believe that adjectives form categories the way nouns do. So I don’t think there is a category of “cute things” or “big things”. Nouns have pretty fixed meanings, but adjectives like “big” are practically meaningless without the nouns they modify. A tiger is a big cat, but is it a big thing? So of course tags which are adjectives are a lot more personal than nouns, and don’t represent categories.

There are of course other issues. There is ambiguity: someone might label a bulldozer a “cat”.  Or, maybe someone thinks the site is not about “cats” but “obsessions”.

But neither of these issues is a problem for my simple view. Ambiguity occurs because we chose not to have enough words. We have the same sound for two different categories. But the process is still the same: “categorize then name”. The second issue is also O.K. as far as I am concerned.

Tagging works because people are free to go with their “gut feeling” on what categories things belong to, and they are free to use the words they have always used to describe those categories. And then they get to put some spice into the mix with some colorful adjectives (emotional terms?), proper names, and verbs (action words?).

Simple! And what is the alternative? I have not seen one. All I have seen is suggestions of magic …

Vander Wal and Categorizing

Not one day after my posting that tagging is really at least a little bit like categorizing, I discover this blog, which links to a presentation by Thomas Vander Wal himself, in which Thomas says:

People are not so much categorizing, as
providing a means to connect items
(placing hooks) and provide their meaning
in their own understanding.

I guess the first part of this disagrees with my claim that tagging does involve categorizing. Although the phrase “not so much” is a little bit confusing. Maybe they are categorizing a little bit even by Vander Wal’s reckoning?

The second part I must admit I don’t understand. What is it to “connect items” and “place hooks”?  Does it literally mean to say that I can “connect” elephant with subatomic particle? What is the mechanism for this? How does it work? Does an arbitrary connection really lead to “tagging that works”?

What is Tagging?

Tagging has been around a while now, but we still don’t seem to know what it is. Some people argue that it is the complete opposite of categorizing, while some believe that tagging shares much with categorization. I go even beyond that, and I have argued in several publications that rich structural information can be extracted from tags, if we know where to look.

But why does it matter what tagging is? Can’t we go on tagging even if we don’t really know what we are doing? (Works well in the rest of our lives!)

Well, yes, we could go on tagging. The question is what to do with the tags once we have them. If tags are really not categories, and if the extreme view is right that they are simply some sort of completely individualistic associations, then it seems to me that we can’t do ANYTHING with them across the user base. If on the other hand there is some agreement between users about basic categories at least, then we should be able to aggregate them across users and use them in interesting ways.

The fact that we already do aggregate them and interesting patterns of popular tags can be found, suggests to me that the extreme view against tags as categories must be wrong.
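To make “aggregate them across users” a bit more concrete, here is a minimal sketch in Python. The bookmarks, user names and the `min_bookmarks` threshold are all invented for illustration; this is not how del.icio.us itself does it, just one simple way to see which tags are shared and which are individual.

```python
from collections import Counter

# Hypothetical bookmarks as (user, site, tags); invented for illustration only.
bookmarks = [
    ("alice", "nytimes.com", ["news", "daily"]),
    ("bob", "nytimes.com", ["News", "newspaper"]),
    ("carol", "nytimes.com", ["news", "toread"]),
    ("dave", "nytimes.com", ["nyc", "news"]),
]

def aggregate_tags(bookmarks, site):
    """Count, for one site, how many bookmarks carry each tag."""
    counts = Counter()
    for _user, bookmarked_site, tags in bookmarks:
        if bookmarked_site == site:
            # Lowercase and de-duplicate so a tag counts once per bookmark.
            counts.update({tag.lower() for tag in tags})
    return counts

def shared_tags(counts, min_bookmarks=2):
    """Return tags that appear on at least min_bookmarks bookmarks (assumed threshold)."""
    return sorted(tag for tag, n in counts.items() if n >= min_bookmarks)

nyt_counts = aggregate_tags(bookmarks, "nytimes.com")
print(nyt_counts.most_common(3))
print(shared_tags(nyt_counts))
```

If tags were purely individualistic associations, the list returned by `shared_tags` would be empty for almost every site. The fact that it usually is not is the whole point.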

Ad hoc categories

I have not written in a while, because I have been busy developing a theory about the cognition of tags … and probably more importantly, implementing a system that actually does something useful with what I discover. I will write about the first part here, the psychology part. By the way, there is a popular blog which is also about roughly this topic, but is very different in its scope. Rashmi is mainly concerned with why tagging seems cognitively less effortful than other forms of more structured categorization. Her analysis is therefore not very informative about what the tags themselves are like. Is there something interesting to say about the sorts of categories people use in their tags? If so, can this tell us about interesting things to do with those tags by means of post processing? This is what I am going to write about.

First, why I think tags are, with few exceptions, real categories. What sort of categories? Ad hoc ones, of course.

The analysis of ad hoc categories comes from the work of the cognitive psychologist Lawrence Barsalou. He was particularly interested in categories like “things to sell at a garage sale” and “things to take on a camping trip”, which are spontaneously generated categories that group entities in goal directed ways. By comparing these to “natural” categories he hoped to discover some interesting differences. The thing is, he didn’t find too many differences! Both types reveal strong typicality (prototype) effects, which are stable across time and people. As a result he proposed a more general theory of categorization which subsumes both common and ad hoc categories. Here is a quick summary of his general model.

Ad hoc categories are made up to fulfil some goal. The critical role of ad hoc categories is to provide an interface with a person’s world model, in a way that can help achieve a goal. A world model is “… a person’s knowledge of locations in the environment, together with knowledge of the entities and activities that exist currently in these locations”. The world model is not the general knowledge one has about the world, but an instantiation involving “… specific knowledge and beliefs about the current state of the world”, which might include culturally shared information and the like. The primary building blocks of the world model are the common taxonomic categories like bird, flower, chair, and so on.

Finally, whenever people wish to achieve any goal, they instantiate an event frame which describes the necessary components for achieving the goal. The successful realization of the goal described by the frame depends on a satisfactory interface between the event frame and the individual world model. For example if one wishes to buy groceries then the relevant frame will include things like locations to find groceries, times the store is likely to be open, forms of payment, and so on. But to achieve this goal we need to know specific locations, times and forms of payment. This is where ad hoc categories provide the mapping, by establishing specific categories like places to buy groceries, and so on.

Crucially, “mapping different event frames into the same world model defines different partitions on entities in the world model”, requiring flexible ad hoc categorization. Taxonomic and goal directed categories are two complementary ways to categorize the world: taxonomies describe the relatively stable kinds of things in the world whereas goal directed categories are ad hoc collections of different taxonomic kinds that are created to map particular event frames to particular world models.

One potential problem in applying this theory to the folksonomy data is that Barsalou’s ad hoc, goal derived categories tend to be expressed as multi word phrases whereas the majority of tags are, well, single word tags. Part of the reason for this, on del.icio.us at least, is artefactual since the user interface specifically prevents the use of compound words as tags. Marieke reports that 10% of tags recovered from delicious showed evidence that people were trying to form compounds by using some sort of punctuation symbol to represent a space. For example, there are tags like “Devel/C++” and “Devel/perl”. But this number does not include the items in which words are simply concatenated, so the prevalence of complex tags may be quite high, opening up the possibility that complex ad hoc categories are used in folksonomies. But an equally important point is that goal derived categories may become lexicalized with common usage. Thus “buyer”, “payment”, “donor” and “gift” are lexicalized concepts that have an important role in many commonly used event frames. But Barsalou gives no indication of how many lexical items are of this sort, or how to identify them.
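Here is a rough sketch of how one might spot compound tags of the “Devel/C++” kind. The set of separator characters is my own assumption, not the rule Marieke used, and the sample tags other than the two quoted above are made up. As noted, this approach cannot see tags where the words are simply concatenated, and it will wrongly split tags that contain punctuation for other reasons (a tag like “del.icio.us”, say).

```python
import re

# Assumption: any of these characters may be standing in for a space.
SEPARATORS = re.compile(r"[/_.\-:]")

def split_compound(tag):
    """Return the parts of a punctuation-joined tag, or None for a simple tag."""
    parts = [part for part in SEPARATORS.split(tag) if part]
    return parts if len(parts) > 1 else None

def compound_share(tags):
    """Fraction of tags that look like punctuation-joined compounds."""
    if not tags:
        return 0.0
    compounds = sum(1 for tag in tags if split_compound(tag) is not None)
    return compounds / len(tags)

# Two tags from the text plus two invented simple tags.
sample = ["Devel/C++", "Devel/perl", "news", "cats"]
for tag in sample:
    print(tag, "->", split_compound(tag))
print("share of compounds:", compound_share(sample))
```

The share computed this way is best read as a lower bound on how many complex tags there are, since the concatenated ones are missed entirely.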

This is where my work has been focused, since I believe that very many words are lexicalized ad hoc categories. What remains to be done is to discover what goal they serve .. are there high level abstract descriptions of these goals which might allow us to lump words into semantically coherent groups? Are these the sorts of words people are using as tags?

I think the answer to all these questions is “yes”. Next time I will show why.

More thoughts on tag frequency

I was slightly on the wrong track in a previous blog, I think. It happens …
So, the mistake was that I was thinking that all of the popular tags that had survived in a historical context, could be thought of as “category labels” and the less popular ones were the highly individual ones. This would certainly have made things easy. Of course it is good that this is not the case, because that would have been TOO easy. And things that are easy, are not interesting. So why is it not true? Just look at the table below, which shows the top four most popular tags for some of the 50 most popular sites on delicious:

Site              Top four tags
Slashdot          News, Technology, Geek, Daily
Flickr            Photos, Flickr, Photography, Sharing
Pandora           Music, Radio, Recommendation, MP3
Digg              News, Technology, Blog, Daily
BBC News          News, BBC, UK, Daily
New York Times    News, Newspaper, Daily, NYC

It is self evident that the popular tags form a heterogeneous collection. Some are clear category labels that would feel at home in a formal taxonomy (e.g. “News”, “Movies”, “Music”). Some, like “Daily” and “Recommendation” appear to describe resources with a particular property which is nevertheless fixed and user independent. Others like “Fun” and “Geek” describe more personal properties that depend on individual interpretation. Finally there are proper names like “UK” and “NYC”.

So, what are the facts telling us???

Folksonomies are not a “bad” thing!

I have been busy lately testing some of my ideas about the semantics of tags .. which I will blog as soon as I have a coherent set of things to say. In the meantime I wanted to write this post in case there is a misconception that I have somehow been maligning this whole tagging business. I have not. I think there is lots of cool stuff being done with them, and I especially love the potential of the Yahoo My Web 2.0 stuff.

I also love the new browser, Flock. It is based on the Firefox code base, so I hope they don’t all start fighting together, but I think Flock really shows the potential of bringing together the content on “my web”.

My point is that folksonomies can do even more, if we are able to extract the full richness of their semantics. This is what I am trying to do right now.

Our Mind the Taxonomist

So finally after the many pages of blogs and thoughts, I think my main points finally came into focus after the last blog, after Clay Shirky’s clear and direct claim which I reiterate here:

Here’s what’s radical about what del.icio.us protends: My vocabulary on del.icio.us folksonomy is personal, not vernacular — no one knows or needs to know which class I’m talking about when I tag something ‘class’, or that I use LOC to mean Library of Congress. This isn’t the same as, say, the dictionary of thieves slang from the mid-18th c. because no one else needs to know my bookmark system, and I don’t need to know anyone else’s …

So here is what’s radical about what Shirky protends: if I want to, I can tag my bookmarks with any vocabulary I choose. So here goes, a sample of my tags: “hdfjkfb”, “orjfkido”, “hjfoå”, “krlofpke”. So this certainly ensures that no one else knows my system, and if everyone else does the same, I won’t know theirs. But how useful would this be to anyone?? So at this rather radical extreme the Shirky claim is without content.
But maybe the example is too radical because the notion of vocabulary precludes the use of random tags. Fine. Let us think of a different example which is perhaps closer to the spirit of Clay’s idea. So I make up a system that only I know, where all interesting sites end with “xyz”, technical sites begin with “krp”, and so on. Everyone else can make up their own system, and no one knows anybody else’s. But an immediate problem with this is that it is cheating … it smuggles in the more natural English vocabulary via the back door by simply equating each English term with an expression in the new “vocabulary system”. Still, this is private knowledge so maybe that is O.K. for the example. But this leads to the interesting hypothesis: suppose you made up vocabularies like “krp…”, “…xyz”, and so on, and got users to adopt them. These can be mapped onto their “parent” vocabularies that they are derived from (remember “krp…” = “technical site”, etc.). To the individual user, each system would be equivalent (more or less, as Katie Melua would say). But which vocabularies would make a better del.icio.us?

I think it is pretty clear that we DO need to know SOMETHING about everyone else’s bookmark system! The success of a “social bookmarking system” depends on the fact that we do understand each other’s vocabularies (to some extent), and can extract value from that shared understanding.

What I have been arguing is that the aggregate data shows us something about what we know about everyone else’s bookmark system. Why are some terms so “natural”? Do all “natural” terms become popular? Are all “natural” terms natural in the same way, or are there different categories according to which terms can naturally relate to their categories (a bit like facets in taxonomies)?
So my position is that folksonomies can provide interesting data which can give a clue about the way humans organize knowledge, AND about the ways in which we share each other’s organizational systems. My Ph.D. was originally in Cognitive Psychology, and during my studies I came to believe in the hypothesis that mental architecture fundamentally shapes our perceptions and organization of the world in which we live. Further, essential aspects of the mental architecture are fixed and therefore shared by all humans, which is what makes communication and shared understanding possible. The overlap is not perfect. I say Library of Congress, but Clay wants to say LOC. Good for him. But pity the poor soul who calls it “the square root of negative 2”!
The mind creates categories, because that is what minds do (there is an awful lot more to say about that!). The mental architecture enforces the range of possible ontologies and taxonomies that we can bring to bear on the understanding of our universe. All humans share fundamental aspects of mental architecture and therefore properties of possible taxonomies. Folksonomies provide a fantastic window into the workshop of the mental taxonomist!

More Differences

Just when I finished my previous post on the problems with focusing on individual differences, I discovered this post, once again by Clay Shirky. He makes a good point about why it might be important to think about individual differences when it comes to tag use. He says:

Here’s what’s radical about what del.icio.us protends: My vocabulary on del.icio.us folksonomy is personal, not vernacular — no one knows or needs to know which class I’m talking about when I tag something ‘class’, or that I use LOC to mean Library of Congress. This isn’t the same as, say, the dictionary of thieves slang from the mid-18th c. because no one else needs to know my bookmark system, and I don’t need to know anyone else’s …

So pretty clearly the view is that an individual can do whatever s/he wants, in complete isolation from every other user, and the “radical” system will accommodate each individual equally. Thus, while many people might agree, the system is “ensuring that the emergent consensus view does not have to be pushed onto any given participant”.

So this is why individual differences matter, because they are allowed to exist. Taxonomies are bad because they force everyone to be the same, folksonomies are good because they allow everyone to be individuals. This is probably good ideology, but is it good science? Is it even interesting? Is it useful?

These are the questions I have been asking all along. Where, pray tell, is the evidence that highly unusual tags that differ from the “norm” are even useful? Here are some of my less frequent tags: adhoc, bar, and controlled. I have NO IDEA what any of them stand for! Maybe I am a bad tagger? So ignore me.

Here is an observation based on a single example (so take with a grain of salt .. the same grain you use for all other single examples!): a few weeks ago the most popular tag for the New York Times was “news” with 2093 instances. The next few are “newspaper” with 550, “daily” with 370, “nyc” and “media” with 229. These tags kinda make sense .. but after this you have “english” with 15 instances, “business” with 17, and “noticias” with 16. These last few are a real mix. I wouldn’t be surprised if “business” turned out to be useless for most users: it sounds like a tag you add when there is pressure to improve on just “news”, but is probably never used as a retrieval aid! (Clearly an empirical question, but would someone so interested in business news ever bookmark news sources that DIDN’T have business news??). “English” and “noticias” are interesting in that they appear to cater to internationalization needs. This is definitely an important requirement for category systems, which many don’t address adequately.

But the question is, is there a significant benefit after “news”, “newspaper”, and say, “daily”? How many users would have so many links that they would not locate the New York Times without also including “business”???

But apart from the usefulness issue, the really interesting observation is still the remarkable overall agreement. Surely the New York Times is “daily”, and it includes a “business” section .. so these are all equally valid features by which to reference it. But why do (roughly) 6 to 120 times as many people prefer “news” to the other two?
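For anyone who wants to check the size of that preference, here is a small sketch that works out the ratios from the counts quoted above. The counts come from this post; the function name is just something I made up for the example.

```python
# Tag counts for the New York Times as quoted in this post.
nyt_tag_counts = {"news": 2093, "newspaper": 550, "daily": 370, "business": 17}

def preference_ratio(counts, preferred, other):
    """How many times more often `preferred` was used than `other`."""
    return counts[preferred] / counts[other]

for other in ("daily", "business"):
    ratio = preference_ratio(nyt_tag_counts, "news", other)
    print(f"news vs {other}: {ratio:.1f}x")
# prints:
# news vs daily: 5.7x
# news vs business: 123.1x
```

This is a single site at a single moment, so the exact numbers matter much less than the fact that the gap spans two orders of magnitude.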

And then there is my thought experiment. If the evil millionaire convinced 100000 people to tag the New York Times as “finglewick”, how long would that survive? What about “really super cool site”? What about “should subscribe”? Why are some of these tags good but others not? Why not “finglewick”? Why would some good tags, that would probably generally be judged as appropriate (like “should subscribe”) not survive (in my opinion) while some others, like “news”, definitely do??

Del.icio.us lets us label the New York Times any which way we like. But in spite of infinite freedom we all label it “news”. Now THAT is news to me!