Archive for May, 2010

I came across this interesting pattern while trying to visualize some of the Twitter streaming data. The following charts plot the ‘following’ count (y axis) against the ‘followers’ count (x axis) for ~200K user accounts. The data is one hour's worth of the stream, obtained via the streaming API. User accounts falling near the line y ≈ 0 (many followers, following almost nobody) tend to be celebrities (musicians, sportsmen etc.), companies, and news and info bots (like the WSJ, CNN etc.). The general population usually falls around the line y = x (the ‘I follow you, you follow me’ kind). But that's not what's interesting here (we all knew that).

Looking at the zoomed-in plots (figure 2 and figure 5), we see a distinct square with corners at (0,0) and (2000,2000). It also shows up in another day's data (figure 5), so it's not just an anomaly. The plateau formed at y = 2000 is a bit perplexing; I can't seem to get my head around it. Figure 3 looks at the user accounts with ~2000 ‘following’: a large number of these users turn out to be spam bots. I suspect most spam accounts (bots) are concentrated around this region. It's as if the spam bots follow around 2000 users at most, so as not to alert the spam controls by mass-following users.
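For anyone who wants to check the plateau on their own data, here is a minimal sketch of the kind of counting involved. It assumes the accounts have already been reduced to (followers, following) pairs; the function names, the bin width, the tolerance around 2000 and the sample values are all made up for illustration and are not taken from the data behind the charts.

```python
from collections import Counter


def following_histogram(accounts, bin_width=100):
    """Count accounts per 'following' bin.

    accounts: iterable of (followers, following) pairs (hypothetical input shape).
    Returns a dict mapping the lower edge of each bin to the number of accounts in it.
    """
    bins = Counter()
    for _followers, following in accounts:
        bins[(following // bin_width) * bin_width] += 1
    return dict(sorted(bins.items()))


def near_cap(accounts, cap=2000, tolerance=50):
    """Return the accounts whose 'following' count is within `tolerance` of `cap`.

    cap=2000 is the plateau seen in the plots; tolerance is an arbitrary assumption.
    """
    return [(fo, fg) for fo, fg in accounts if abs(fg - cap) <= tolerance]


# Made-up sample: (followers, following)
sample = [
    (15, 1999),
    (40, 2001),
    (300, 310),
    (1_000_000, 12),
    (120, 2000),
    (5000, 4800),
]

if __name__ == "__main__":
    print(following_histogram(sample))
    print(near_cap(sample))
```

A spike in the bin around 2000, made up mostly of accounts with few followers, would be consistent with the plateau in the zoomed plots. It would not by itself show that those accounts are spam.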

Any hypothesis that comes to your mind?

Figure 1: plot for day 1

Figure 2: plot for day 1 (zoomed)

Figure 3: plot for day 1 with y ~ 2000

Figure 4: plot for day 2

Figure 5: plot for day 2 (zoomed)