I recently attempted to transform a boring, alphabetical list of tags into a more visually-appealing (and more informational) tag cloud. No doubt you’ve seen tag clouds on blogs and other websites (there’s even one on here!) but for the uninformed:
A tag cloud is a form of ‘weighted list’ where reusable tags (keywords, categories, etc) are arranged on a diagram in such a way that more frequently-used tags appear in a larger font. The arrangement of the tags themselves can be ordered (usually alphabetically) or a more organic layout can be used (e.g. a force-based algorithm). Their purpose is to show an extra dimension of information (i.e. frequency) and thus draw attention to more popular tags.
This blog has already dedicated a lot of time to the generation and layout of diagrams, so this article will focus on another important part of any tag cloud algorithm: how to assign a suitable font size to each tag. This is a somewhat tricky problem in that:
- The total number of tags and upper bound of each tag’s frequency are unknown
- We obviously want to constrain the font size so that it falls within a sensible range
- Regardless of the distribution of the frequencies, we want to have strong variation in font sizes
Before I settled on a solution, I tried a number of other approaches:
Proportional sizing
Given a query to determine the frequency of use for tags, such as:
SELECT Tag, COUNT(*) as [Frequency] FROM PostsTags GROUP BY Tag
We can use a formula like this to assign font sizes:
size = minSize + ((maxSize-minSize) * (frequency / maxFrequency))
This formula scales each tag according to its proportion to the maximum value, ensuring the font size is between minSize and maxSize. The problem is that it is adversely affected by outliers; e.g. if 90% of tags vary by only 10^1 and 10% vary by 10^3, there will be one or two enormous tags and lots of very small ones (with not much variance between them).
Sizing by random sample
In this approach, n tags are randomly selected and their frequencies become the threshold values for each font size (where n is the number of different font sizes that can be used). This takes advantage of the fact that psuedo-random numbers are uniformly-distributed, so querying for the number of frequency levels we want (instead of all values) will give a good indication of the different values in the set.
The SQL query for a random sample (in this case, of size 10) is of the form:
SELECT TOP(10) Tag, COUNT(*) as [Frequency] FROM PostsTags GROUP BY Tag ORDER BY newid()
The result, however, is less-than ideal; the resulting font sizes are random (since we are working with a random sample) and outliers are likely to be ignored (in particular the minimum and maximum, e.g. the most frequent tag will appear as large as the second most-frequent).
Sizing by order using ROW_NUMBER()
This approach uses the T-SQL ROW_NUMBER() windowing function to assign a sequential number to each tag, sorted according to frequency; so, the least-used tag will be numbered as 1, the most-used tag as n (total number of tags). This means that, regardless of the gap between each tag’s frequency, we will get a uniform increase in font size.
The SQL query would look something like this:
SELECT Tag, COUNT(*) as [Frequency], ROW_NUMBER() OVER (ORDER BY COUNT(*)) as [Order] FROM PostsTags GROUP BY Tag
Then, sizes would be assigned using the following formula:
size = minSize + ((maxSize - minSize) * ((order - 1) / maxOrder)))
(Since order starts at 1, we subtract 1 because we want the right-hand size of the expression to evaluate to zero, i.e. size = minSize)
The problem with this approach is that tags with equal frequencies will not be assigned the same font size; rather, they will increase in font size according to table-order. Since a large number of tags are likely to have been used between 0 and 1 times, the result will be unsatisfactory.
The solution: sizing using DENSE_RANK()
The approach above is based on a good premise, but isn’t quite right. If only there was some way of sizing according to order where equal frequencies were assigned the same value… oh wait, there is! The T-SQL DENSE_RANK() windowing function behaves much like ROW_NUMBER(), except that it assigns the same value when there are ties in the set. (We could have used the normal RANK() function, but this leaves gaps in the numbering)
The SQL looks something like this:
SELECT Tag, COUNT(*) as [Frequency], DENSE_RANK() OVER (ORDER BY COUNT(*)) as [Rank] FROM PostsTags GROUP BY Tag
Then, sizes are assigned using the following formula:
size = minSize + ((maxSize - minSize) * ((rank - 1) / maxRank))
(I’ve opted to make maxSize an exclusive boundary, to avoid a situation where maxRank is 1 and we risk dividing by zero)
The result is a tag cloud where:
- Font sizes increase uniformly from the least-used tag to the most-used tag
- Outliers are not ignored, nor do they result in hugely disproportionate font sizes
- Tags with a frequency of zero are always assigned minSize
- Upon first use, a tag immediately stands apart from the unused tags
Keeping font sizes low until tags reach maturity
If you don’t want the maximum font size to be assigned until a certain frequency is reached, you can apply a transformation to the ratio:
size = minSize + ((maxSize - minSize) * ((rank - 1) / maxRank) * Math.Min(1, maxFrequency / threshold))
This will allow font sizes to gradually approach maxSize as the most-used tag’s frequency approaches threshold.
Final words
Hopefully this article gives you an idea of some of the complexities involved in producing good-looking tag clouds, and allows you to avoid some of the common pitfalls associated with this.