Word clouds for Slack

Hey there! It’s been a while. I’ve been working on lots of stuff, but here’s a small thing I did recently.

My friends and I have a Slack we’ve now been using casually for a few years. You can download the full logs of your Slack workspace, even on the free plan (which, I believe, only shows you the most recent 10,000 messages in the app, but exports everything). So I wanted to do a few little projects with it.

One thing my friends and I were talking about was making bots that were crappy, funny imitations of us. So there would be a Declan-bot, Ben-bot, etc., that would talk like we do. Maybe we’ll try that in the future, but after doing the thing in this post, I suspect the bots might be kind of indistinguishable from each other without extreme tailoring (though I’d love it if they weren’t!).

So, what I wanted to do here was make a word cloud of each person’s total corpus in Slack. It actually all came together pretty quickly, mostly because I did it in a quick, hacky way.

First, to get the Slack data, you have to be an administrator. You go to the menu, then Administration, Workspace settings, Import/Export, Export, and choose a date range. The export is pretty huge, several GB. You only get public data (not private messages), which makes sense. It’s organized into folders corresponding to the channels, and each of those contains one .json file per day.
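For reference, the unzipped export is laid out roughly like this (channel names will obviously be your own, and there are a couple of metadata files like users.json at the top level):

```
export/
    users.json
    channels.json
    general/
        2018-06-01.json
        2018-06-02.json
        ...
    random/
        ...
```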

JSON is fortunately super easy to use in Python, since each file basically gets read in as a dictionary (or a list of them). I made a little program that goes through the export recursively, gets all the files, and piles everything together into one huge json… dictionary? That lets me easily select everything from a single user, or channel, or whatever else.
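Here’s a minimal sketch of that loader (the name load_messages and the export/ path are mine, not anything official; I’m assuming each per-day file is a list of message dicts with 'user' and 'text' keys, which is what the export looks like):

```python
import json
from pathlib import Path

def load_messages(export_dir):
    """Recursively read every per-day .json file and pile all
    the message dicts together into one big list."""
    messages = []
    for path in Path(export_dir).rglob('*.json'):
        # skip the top-level metadata files like users.json and channels.json
        if path.parent == Path(export_dir):
            continue
        with open(path) as f:
            day = json.load(f)  # one day's messages in one channel
        for msg in day:
            msg['channel'] = path.parent.name  # remember which channel it came from
            messages.append(msg)
    return messages

messages = load_messages('export')
```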

Another little detail is that it doesn’t label the users by our names (which you can change); it labels us by a unique identifier string, like U2189232 or something. So I had to make a little translation dictionary to go back and forth between names and IDs.
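The top-level users.json file in the export has the ID-to-name mapping, so (assuming that standard format) the translation dictionary is just a couple of comprehensions:

```python
import json

with open('export/users.json') as f:
    user_list = json.load(f)

# go back and forth: 'declan' -> 'U2189232' and 'U2189232' -> 'declan'
name_to_id = {u['name']: u['id'] for u in user_list}
id_to_name = {u['id']: u['name'] for u in user_list}
```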

I decided to use this guy’s great Python word cloud generator to make the word clouds. It’s even installable via pip3!
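Just to show how simple the generator is to use, here’s a sketch of the per-user step. I’m assuming the pip3 package is wordcloud (whose generate() takes one big string), and reusing the messages list and name_to_id dictionary from above:

```python
from wordcloud import WordCloud

def cloud_for(name):
    user_id = name_to_id[name]
    # all the text this user ever posted, mashed into one big string
    texts = [m['text'] for m in messages if m.get('user') == user_id]
    corpus = ' '.join(texts)
    wc = WordCloud(width=800, height=600).generate(corpus)
    wc.to_file('{}_cloud.png'.format(name))

cloud_for('declan')
```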

So, that’s the basics. Import all the data into one big json database/dictionary thing, choose a user, translate to their ID, grab all the text with that ID, turn it into a big ol’ list of words (with repeats), and then feed it into that word cloud generator. And it works! Here are a few:

But you’ll immediately notice a few things. One is that there’s some stuff we probably don’t want there, like http and user tags (because when you @-mention someone, the message actually stores their ID, and Slack just renders it as their name). Additionally, there are a ton of common words. It turns out that we like saying “yeah” and “one” a lot, and that tends to give rise to kind of lame word clouds.
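In the raw export, those show up in angle brackets, something like <@U2189232> for mentions and <http://...|label> for links, so a crude regex pass cleans most of it up. A sketch, where the patterns are just my guess at what covers the common cases:

```python
import re

def clean(text):
    text = re.sub(r'<@[^>]+>', '', text)     # user mentions like <@U2189232>
    text = re.sub(r'<http[^>]+>', '', text)  # links like <http://example.com|label>
    return text
```

Running each message’s text through clean() before joining the corpus gets rid of most of that junk.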

This problem is actually a lot more interesting than you might think at first glance. I wanted to give the clouds a little more “personality”. That is, there are a few unique words in each word cloud that, knowing my friends, I’m able to point out and say “yeah, Max plays Magic, so he probably does say ‘deck’ a lot”, but there aren’t many of those words. What I’d really like is if the top N words of each person’s word cloud were pretty unique to them, but it turns out this is actually kind of tricky.

One thing I tried, with some success, was taking the “megacorpus” of everyone’s combined corpuses (corpi? or is it like octopuses?), taking the 400 most common words from that, removing them from each person’s corpus, and then making the clouds from what’s left. This is definitely a slight improvement:
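That step is a natural fit for collections.Counter. A sketch, where users and all_words are hypothetical names: all_words maps each user to their full word list, repeats included:

```python
from collections import Counter

# combine everyone's words into one "megacorpus" count
megacorpus = Counter()
for user in users:
    megacorpus.update(all_words[user])

# treat the 400 most common words overall as stopwords
too_common = {word for word, count in megacorpus.most_common(400)}

# strip them from each person's corpus before making their cloud
filtered = {user: [w for w in all_words[user] if w not in too_common]
            for user in users}
```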

It’s not great, though. For example, a word that’s used a ton by one or two users, and no one else, might still get removed. That would be a very “personal” word that I’d definitely want to keep in their word clouds. It’s also hard to know the point at which you stop removing common, lame words and start removing interesting, personal ones.

Here’s what I’d ideally like: to make a list of the top N words, for each user, that are in the top N words of at most X other users I’m considering. So, if I’m making the top 10 lists for 8 of my friends, I might say that a “top 10” word can stay in a given list if 2 people have it in their top 10, but not if 3 do (then it’s just too common).

How do you do this, though? Here’s the naive way I tried it. Get each user’s corpus and sort it by commonality (within that corpus alone), so you have a list with one occurrence of each word, sorted by decreasing use. Then start with user 0, at their most common word (index 0 of that list), and check whether it’s in the top N of each other user. Each time it is, add to a tally of how many others share that word (the tally starts at 0 for each new word). If more users have that word in their top N than are allowed, remove the word from everyone’s corpus. When you remove a word, keep the index the same: the list shifts up, so the same index now points at a new word in the top N. If the word didn’t have to be removed, increase the index, so you’re now looking at user 0’s next most common word. When you get to index N of user 0, go on to the next user and restart. Here’s the relevant code for that:

```python
from copy import copy

for user in users:
    print('\n\ngetting unique words for', user)
    other_users = copy(users)
    other_users.remove(user)
    print('other users:', other_users)
    # walk through this user's top N words, guarding against a shrunken corpus
    index = 0
    while index < N_unique and index < len(users_corpuses[user]):
        others_with_word = 0
        cur_word = users_corpuses[user][index]
        print('\ncur word is \'{}\''.format(cur_word))
        # tally how many other users have this word in their own top N
        for other_user in other_users:
            if cur_word in users_corpuses[other_user][:N_unique]:
                print('{} also has this word'.format(other_user))
                others_with_word += 1
        if others_with_word > allowed_others:
            # too common: remove it everywhere, and leave the index alone,
            # since the list shifted and it now points at a new word
            print('removing \'{}\' from all users'.format(cur_word))
            for tempuser in users:
                if cur_word in users_corpuses[tempuser]:  # .remove() raises if absent
                    users_corpuses[tempuser].remove(cur_word)
        else:
            index += 1

    print('\n\nTop {} unique words for {} at this point'.format(N_unique, user))
    for i, word in enumerate(users_corpuses[user][:N_unique]):
        print('{}. {}'.format(i, word))
```