Python Practice #3
I originally shared this story on Medium.com
Previous articles: PP#1, PP#2
https://www.interviewcake.com/question/python/word-cloud
The problem posed by Interviewcake:
You want to build a word cloud, an infographic where the size of a word corresponds to how often it appears in the body of text. To do this, you’ll need data. Write code that takes a long string and builds its word cloud data in a dictionary, where the keys are words and the values are the number of times the words occurred. Think about capitalized words. For example, look at these sentences:
‘After beating the eggs, Dana read the next step:’
‘Add milk and eggs, then add flour and sugar.’
What do we want to do with “After”, “Dana”, and “add”? In this example, your final dictionary should include one “Add” or “add” with a value of 2. Make reasonable (not necessarily perfect) decisions about cases like “After” and “Dana”. Assume the input will only contain words and standard punctuation.
Looking at the problem, I immediately think of the previous practice questions. First of all, it looks a lot like the counting sort algorithm I made in the last exercise, but with a dictionary instead of an array, so I should be able to borrow some code from there. Secondly, a big part of the problem is string parsing and handling, so I looked into the Python documentation for built-in string handling methods, and on that front I’ve gotten great results!
I want to remove all standard punctuation from any string I run into, and I’m also going to make an executive decision that I want all my words to be lower case, so the first step is to get every string into that standardized form. To this end I’m going to use string constants from the Python string module in conjunction with the str.maketrans() and str.translate() methods. Of the two, maketrans() is the key part of the operation and the more difficult one to grasp; see this link for a nice explanation.
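To make the translation table a little less mysterious, here is a tiny sketch of my own on a toy input (the two-letter alphabet and the sample string are just for illustration). With three arguments, str.maketrans() maps each character of the first string to the character at the same position in the second, and maps every character of the third string to None, which translate() treats as "delete". The table itself is just a dictionary keyed by character code points.

```python
# Toy table: 'A' -> 'a', 'B' -> 'b', and delete ',' and '.'
tbl = str.maketrans('AB', 'ab', ',.')
print(tbl)
# {65: 97, 66: 98, 44: None, 46: None}

# translate() applies the table character by character.
cleaned = 'A, B.'.translate(tbl)
print(cleaned)
# a b
```

The real table below works the same way, only with the full uppercase alphabet, the full lowercase alphabet, and every character in string.punctuation.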
import string

# Takes in a string and returns a dictionary of word keys with the number of times each word appears in the string as values.
def wordclouder(words):
    caps = string.ascii_uppercase
    lows = string.ascii_lowercase
    punc = string.punctuation
    # make the translation table
    tbl = str.maketrans(caps, lows, punc)
    # remove punctuation and make uppers into lowers
    clean = words.translate(tbl)
So now we’ve got the strings all clean like we want them, and it’s time to count them up and stick them in a dictionary.
    # let's count the words!
    # split() with no argument splits on any whitespace and skips repeated spaces
    warray = clean.split()
    wdict = {}
    for w in warray:
        # if it's already there add 1 to the value.
        try:
            wdict[w] += 1
        # KeyError means we gotta initialize it.
        except KeyError:
            wdict[w] = 1
    return wdict

solution = wordclouder('After beating the eggs, Dana read the next step: Add milk and eggs, then add flour and sugar.')
print(solution)
There you have it. Ultimately it’s much easier than the counting sort value buckets were to create. I could build it all on the fly by looping over the words in the array I made with the .split() method, using a try/except block to add one if the word had already been inserted, or to initialize it if I ran into a KeyError. When I think about it, that logic is kind of backwards from how I might do it by hand: the try assumes the key already exists, even though I know it will raise an error the first time a word shows up, and I only initialize after I run into that error. But if I initialized everything first I would have to loop over the array a second time, and as I know from the last lessons that just isn’t efficient.
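For comparison, here is a small sketch of two other ways to get the same counts in a single loop without catching an error. This is not what I used above, just an alternative: dict.get() returns a default when the key is missing, and collections.Counter from the standard library does the whole tally for you. The function name count_with_get and the sample word list are made up for this example.

```python
from collections import Counter

def count_with_get(word_list):
    """Count words in one pass, using get() to supply 0 for unseen words."""
    counts = {}
    for w in word_list:
        counts[w] = counts.get(w, 0) + 1
    return counts

word_list = 'add milk and eggs then add flour and sugar'.split()

get_counts = count_with_get(word_list)
print(get_counts)
# {'add': 2, 'milk': 1, 'and': 2, 'eggs': 1, 'then': 1, 'flour': 1, 'sugar': 1}

# Counter is a dict subclass that counts hashable items.
counter_counts = dict(Counter(word_list))
print(counter_counts == get_counts)
# True
```

All three versions still loop over the words only once, so which one to use is mostly a matter of taste.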
But of course, upon looking at Parker’s answer, what I’ve got just doesn’t quite cut it! Parker’s post goes really in depth and explains each part thoroughly, so I highly recommend checking out the link at the beginning of this article if you want a good explanation; for the purposes of this post I’ll just give a brief summary. When confronted by the following example sentence that Parker provides, my solution runs into a couple of key issues: “We came, we saw, we conquered…then we ate Bill’s (Mille-Feuille) cake.”
Since I’m deleting all punctuation (mapping it to None in the translation table) and then splitting the sentence on spaces, things like “conquered…then” become one word: conqueredthen. Bill’s becomes bills, and Mille-Feuille becomes millefeuille. If we weren’t being sticklers then the apostrophe and hyphen cases might not be too problematic, but to me the conqueredthen case is a little too bad to let slide.
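Here is a quick way to reproduce the problem with the same translation table. One assumption to flag: I typed the sentence with a plain ASCII "..." and a straight apostrophe, because string.punctuation only contains ASCII characters and would leave a typographic ellipsis or curly apostrophe untouched.

```python
import string

# Same table as in wordclouder: lowercase everything, delete ASCII punctuation.
tbl = str.maketrans(string.ascii_uppercase, string.ascii_lowercase, string.punctuation)

sentence = "We came, we saw, we conquered...then we ate Bill's (Mille-Feuille) cake."
words = sentence.translate(tbl).split()
print(words)
# ['we', 'came', 'we', 'saw', 'we', 'conqueredthen', 'we', 'ate', 'bills', 'millefeuille', 'cake']
```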
Parker makes three decisions in his solution. First, he uses a class to keep things more readable and centralized, in his words “allowing us to tie our methods together, calling them on instances of our class instead of passing references.” Second, he chooses to make a word uppercase in his dictionary only if it is always uppercase in the original string. Follow the link to his post for a more detailed explanation of why; for my purposes I will stick to just forcing lower case. The third decision is to my mind the most important: he builds his own split_words() method instead of using a built-in one, so that he can add each word to the dictionary as it is split, and split words and eliminate punctuation at the same time. This saves space and time by not creating a new array out of the split words and only then putting them into the dictionary, like I did in my solution.
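Just to make that third idea concrete, here is a rough sketch of my own of a single-pass counter. To be clear, this is not Parker’s code: there is no class, it forces everything to lower case like my version, and the name count_words_single_pass plus the rule of keeping apostrophes and hyphens only inside a word are my own assumptions. It walks the string once, builds up the current word character by character, and adds the word to the dictionary as soon as it hits a separator.

```python
def count_words_single_pass(text):
    """Hypothetical single-pass word counter (not Parker's actual solution)."""
    counts = {}
    current = []  # characters of the word currently being built

    def flush():
        # Drop stray apostrophes/hyphens at the edges, then record the word.
        word = ''.join(current).strip("'-").lower()
        if word:
            counts[word] = counts.get(word, 0) + 1
        current.clear()

    for ch in text:
        # Letters always extend the word; ' and - only count inside a word.
        if ch.isalpha() or (ch in "'-" and current):
            current.append(ch)
        else:
            flush()
    flush()  # record the final word if the text does not end with a separator
    return counts

result = count_words_single_pass(
    "We came, we saw, we conquered...then we ate Bill's (Mille-Feuille) cake."
)
print(result)
# {'we': 4, 'came': 1, 'saw': 1, 'conquered': 1, 'then': 1, 'ate': 1, "bill's": 1, 'mille-feuille': 1, 'cake': 1}
```

With this approach "conquered" and "then" come out as separate words, and no intermediate list of split words is ever created.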
I’m not going to paste Parker’s solution here because it is quite large and if you want to see it you really ought to check out his post and page, it’s worth it. The main things I took away from this lesson are:
Take the time to think through every possible case instead of only solving for the example I have in front of me.
Even though there are great built-in methods that should often be taken advantage of, I should not be afraid to write my own methods to solve specific problems, especially if it will ultimately save runtime and space by eliminating unnecessary intermediate arrays and references.
The string library has a lot of useful constants and methods, and even though they didn’t do everything we needed here, I think that the maketrans() and translate() methods could be really useful for me in the future.