RPi camera, part 3: a few incremental fixes

Round 3! Okay, this is where I try and polish it up in a couple ways.

Here are the things I said last time I needed to make better:

  • Send pics more conventionally
  • Fix detection sensitivity (still often picking up strong shade/sunlight quirks)
  • Total design flaw: since the log file currently gets sent with each picture, but is updated when each picture is written, it is actually sometimes more updated than the pics in the folder. That is, if 30 pictures are created by the camera function, and those are immediately added to the log file, the log file is sent with the first of those 30, and it contains all 30 of them even though only one has been sent
  • Better image classifier architecture
  • Better labeled image dataset (32×32 is tiny)
  • Sliding windows over detected images

Sliding windows

What is “sliding windows”? It’s a pretty simple idea. Here’s the motivating problem, first. When my convnet looks at an image I give it and tries to guess whether it has a car in it, it was trained on images where the car mostly filled the frame, so it will tend to only correctly classify images where the car also mostly fills the frame. If you have an image (like the one below) where there is a car but it’s mostly in one corner, it may not get it (because it’s looking for features bigger than that. There are a couple other effects too. One is that if we’re only classifying 32×32 square images, then what I’ve been doing so far (resizing the image to have the smaller side be 32 and then squeezing the bigger side to also be 32) will distort the image, which will make it harder to classify. Lastly, you can imagine that if the actual image size we’re giving it is something like 256×512, then even if it actually would have correctly classified it given these other problems, by the time it smooshes it down to 32×32, it might not have the resolution in the car region of the image to do it.

So what can fix a lot of these problems? You define a “window”, a subset of the image, and “slide” it over the image in steps. Then, you pass this window subset to the classifier, so it’s actually only classifying the subset. You might choose a stride for the sliding window such that you get M windows in the vertical dimension and N windows in the horizontal. So you’re really doing MxN total classifications for the whole window, and then if one of them says “car!”, you say that the image contains a car.

Here’s a little illustration of mine, where the red grid over the green outlined window shows the windows being used (it’s a little hard to tell them apart, but they’re squares. There are three in the vertical direction):

There are of course a million little quirks and variants and choices to make with this. For example, I think it’s common to choose two sizes for the window, which should let you look at two different “scales”. Also, you have to choose some balance between having more sub windows and the computation time it will take to actually process them. I’m also pretty sure some convnets can have this built in by choosing different filter sizes (like, one that would group a block of pixels as a single pixel to make a larger “window”).

Anyway, how does it work? Here are the results using my CIFAR-10 trained convnet from last time, on the same little group of detected images. I show the certainty distribution, which is the certainty that it thinks it detects a car.

No sliding windows:

Detected:

Not detected:

Sliding windows:

Detected:

Not detected:

Definitely better! But still getting a ton of false positives, which is annoying. Honestly it may be because they’re 32×32.

Fixing image sending

So I had a bit of a mystery on my hands. I was finding that after a while, my program was just…stopping. Not crashing, not giving any error, just stopping after about an hour. What I mean by stopping is that, while normally it outputs a line for each image it detects (on the RPi side), it would stop outputting anything, but not stop running. It took me embarrassingly long to figure out, but here’s what I did. I first made a Debug class that basically logs everything the program does, right at the moment of doing it. This is actually a pretty handy thing to have around anyway, and basically doesn’t slow it down. You’ll notice that I’m periodically logging the CPU/Mem/temp, since I read somewhere that that can cause a problem, but all the values I’ve seen on mine are fine. Anyway, here was the first clue, you can see where it stops, after about an hour:

So you can see that it’s saving them steadily for a while, and then stops saving them, but continues to send them. Welp, you probably guessed before I did, but while I was aware of how little space my RPi had on it (~700MB to play with), I thought that because I was removing the files right after sending them, I’d be okay. Howeverrrr:

So I was running out of space!

One thing I did was immediately get a 32GB micro SD card and clone my RPi onto it, just to have a bit more wiggle room. To be honest, that might solve the problem, since I doubt I’d ever keep the program running long enough to generate that much data, but that would be not addressing the real problem here, which is that my files are sending way too heckin’ slow!

My files are usually ~100kB, which should be easy to send and keep up with, even if something like 10 a second are being produced. For example, I know off the top of my head that when I send files via scp between my desktop and RPi, the transfer rate it shows is usually something like 1.5 MB/s. So what’s going on?

It turns out that that “S” that stands for “secure” in SCP (or SSH, which it’s based on) is pretty important! As they discuss in this thread where it seems like the person was having exactly my problem, there’s actually some pretty nasty overhead involved in encrypting the file you’re going to send. Of course, I don’t care about that! I’m just sending stuff I don’t care about over my LAN.

So one option in that thread was using a weaker cipher, while another was to use the rcp command, which is kind of like a totally unencrypted scp. I’m going to do a little diversion for a minute here because I wanted to know just how much these compared.

What I did was create a few dummy files, smallfile.txt (100 kB), mediumfile.txt (1 MB), and bigfile.txt (5 MB). First I just sent smallfile.txt 10 times to get a rough sense of the speed and overhead:

for i in range(10): file = files[0] print('processing file',file) start = time() subprocess.check_call(['scp',file,remoteHostPath]) total = time()-start print('time elapsed:',total) times.append(total) print('done') print(times) read more

Kaggle Housing challenge, my take

In this article, I’m doing the Kaggle Housing challenge, which is probably the second most popular after Titanic. This was very much a “keeping track of what I’m doing for learning/my own sake” thing, but by the end I’ve gotten a ranking of 178/5419 on the public leaderboard (LB). That said, this is super long because I tried a million things and it’s kind of a full log of my workflow on this problem.

I’ve really learned a bunch from going through this very carefully. What I did here was to try the few techniques I knew when I started, and then I looked at notebooks/kernels for this challenge on Kaggle. A word on these kernels: even the very most top rated ones vary in quality immensely. Some are excellently explained and you can tell they tried different things to try and get an optimal result. Others are clearly people just trying random stuff they’ve heard of, misapplying relatively basic techniques, and even copying code from other kernels. So I viewed these as loose suggestions and guideposts for techniques.

This challenge is a good one. In contrast to the Titanic’s classification, this is a regression for the label “SalePrice”, the price a given property sold for. Another key characteristic of this one is that it has 80 features, which is a large number (for me, anyway), and of those, a decent number of missing values (NA’s).

I’m evaluating the scores by sometimes looking at the accuracy from the score() function of sklearn models (which is out of 100, increasing score is better), and then towards the end only looking at the Root mean square logarithmic error (RMSLE), for which lower is better.

Starting off:

Using literally only the feature “GrLivArea”, the most obviously linear:

TTdat = train_df[["GrLivArea","SalePrice"]] display(TTdat.isnull().sum()) X_train, X_test, y_train, y_test = train_test_split(TTdat.drop("SalePrice",axis=1), TTdat["SalePrice"], test_size=0.3, random_state=42) lm = linear_model.LinearRegression() lm.fit(X_train,y_train) print("\nLM score: {}".format(lm.score(X_test,y_test))) print("\n\n\tLM RMSLE: {}".format(rmsle(np.array(y_test),np.array(lm.predict(X_test))))) rfr = RandomForestRegressor(n_estimators=300) rfr.fit(X_train,y_train) print("\n\nRFR score: {}".format(rfr.score(X_test,y_test))) print("\n\n\tRFR RMSLE: {}".format(rmsle(np.array(y_test),np.array(rfr.predict(X_test))))) read more

Grouping IMDb top movies by runtime

Howdy!

This is a fun lil one. For an upcoming article, I need to know a list of (hopefully good) movies I haven’t yet seen, with similar runtimes. Now, I could have just scrolled down the list of IMDb.com’s top 250 movies, ctrl + clicking on the ones I haven’t seen, and then compared them by eye, because, to be honest, I think I’ve seen many (/most?) of them (we’ll see shortly).

But of course, that would be an efficient use of my time, and I’m learning pandas these days anyway, so why not use it!

To do what I want, I basically need to take that top 250 list (let’s say I don’t really care about the ratings within that list, I just want to select movies from it), get a column with the runtimes, and then manually make a column with 0/1 entries for if I haven’t/have seen it, and then select for the ones I have. Then, I could group, or at least visualize them.

The first (tiny) hurdle comes from the data source. IMDb.com very helpfully provides several zipped files of ALL their movies (beware, the “basics” one is 420MB unzipped!), buuut… they have separate files for the table with the runtimes (title.basics.tsv.gz) and the ratings (title.ratings.tsv.gz). That wouldn’t be too bad, you might think: you could just sort the ratings file by the rating column, take those entries, and select from the basics file, to get the runtimes.

Buuuut… a quick glance (or if you’ve ever just perused the dark back alleys of IMDb itself) reveals that there are many entries with very high ratings, higher than the top 250 scores (which range from 8.0 to 9.2). This is probably because there are lots of smaller productions where you get a selection bias, such that you pretty much only get people who really like the movie voting, so they’re all voting 10. Of course, IMDb links actually links to an explanation of this effect:

As indicated at the Top Rated Movies page, only votes from regular IMDb voters are considered when creating the top 250 out of the full voting database. This explains any difference between the vote averages reported in the top 250 lists and those on the individual title pages. This also explains why movies or shows you might think from their averages ought to appear on the list yet do not actually appear there.

To maintain the effectiveness of the top 250 lists, we deliberately do not disclose the criteria used for a person to be counted as a regular voter.

and says on the top 250 page itself:

The list is ranked by a formula which includes the number of ratings each movie received from users, and value of ratings received from regular users
To be included on the list, a movie must receive ratings from at least 25000 users

I assumed this before starting this thing but wasn’t sure how they did it. I always assumed that the simplest way would be to just have some threshold of a minimum number of votes (which they do), to even qualify. The “small, religiously devoted fanbase theory” of those artificially high ratings would probably break if you set it correctly — I mean, once you set the threshold of minimum votes high enough, if the rating is still high, it’s not really “artificial” anymore, is it? There are potential problems with that, like only selecting for really large productions (depending on the threshold). It appears they actually do this, but also add a secret “special sauce” where they weigh certain votes more, but they don’t say how.

Anyway, that’s a bit of a long winded way to say that it’d be hard to do what I said above, to get the runtimes of the top 250. At this point, I saw a few options: I probably could try their method manually, using the ratings file (which has average ratings and number of votes for each one), just taking the subset of movies that have a rating at and above the minimum of the top 250 list, and then thresholding those for the minimum number of votes. Maybe I’ll try this anyway, but I assumed (because they say they do something else in addition) that I might end up with another list if I did that.

Another thing I briefly considered (that, at this point, may have been much easier) would be some sort of web scraping. It’d be reaaaaaal easy (in theory) to have a script go to the link for each entry on the top 250 page (which would lead directly to the actual movie, which as we’ll see shortly, is actually a bit of a pain), and then each page has a well defined “runtime” field right below the title. I briefly debated this (and maybe I’ll try it later), but I don’t actually know how to do web scraping in python yet, so it would probably be a really hacky job on my part.

So, speaking of hack-y, here’s what I ended up doing. Everything is presented in bits because I did it in a Jupyter notebook.

To get around the “which are actually the top 250 movies” problem, I literally copied and pasted the top 250 list from the page, and pasted it into a text file, which I imported and dropped two things that ended up being garbage columns. Then I had to do a tiny bit of parsing, because using “\t” as a delimiter worked to separate the title and year, but not the rank and title. So, I had to delimit that column with the “.” after the rank #, but setting n=1, because some titles have a period in their name as well (Monsters, Inc. for example), so you wouldn’t want to chop those up into separate fields. Ahhh details!

movieRatings_df = pd.read_csv("copypasteRatings.csv","\t") display(movieRatings_df.head()) movieRatings_df = movieRatings_df.drop(["Your Rating","Unnamed: 3"],axis=1) display(movieRatings_df.head()) dotsplit = movieRatings_df["Rank & Title"].str.split(".",n=1) titleYears = pd.Series([entry[1] for entry in dotsplit]) rank = pd.Series([entry[0] for entry in dotsplit]).rename("Rank") years = pd.Series([int(title[-5:-1]) for title in titleYears]).rename("Year") titles = pd.Series([title[1:-7] for title in titleYears]).rename("Title") ratings_df = pd.concat([rank,titles,years],axis=1) ratings_df.head() read more

Machine Learning with Python, 1

Here, I’m beginning my ML journey by reading: A. C. Mueller and S. Guido – Introduction to Machine Learning with Python, 2017

These will very much be just random notes, thoughts, and questions on the book as I go along.

Chapter 1 is mostly just getting set up and some general background. I skimmed it but not a ton worth writing here. A lot of it is about getting and setting up the packages we’re going to use.

scikit-learn: pretty easy install with pip3, which I got with apt-get install pip3, because I want to be using Python 3.

mglearn is a package that (I think?) corresponds just to the stuff in this book, but is actually in pip3 as well, which is nice.

They appear to leave some stuff out. For example, to do the plots they have in Ch1, you need to add:

import matplotlib.pyplot as plt

plot.show()

Which is a little inconsistent. They first teach you:

knn.fit(X_train,y_train)

y_pred=knn.predict(X_test)
 print("predictions:"+format(y_pred))

print("test set score:"+str(np.mean(y_pred==y_test)))

then later use:

knn.fit(X_train, y_train) print("Test set score: {:.2f}".format(knn.score(X_test, y_test))) read more