Kaggle Housing challenge, my take

In this article, I’m working through the Kaggle Housing challenge, which is probably the second most popular challenge after Titanic. This was very much a “keeping track of what I’m doing for my own learning” exercise, but by the end I’d gotten a ranking of 178/5419 on the public leaderboard (LB). That said, this is super long, because I tried a million things and it’s essentially a full log of my workflow on this problem.

I’ve learned a lot from going through this very carefully. My approach was to try the few techniques I knew when I started, and then look at notebooks/kernels for this challenge on Kaggle. A word on these kernels: even the top-rated ones vary immensely in quality. Some are excellently explained, and you can tell the authors experimented to get an optimal result. Others are clearly people trying random techniques they’ve heard of, misapplying relatively basic methods, or even copying code from other kernels. So I treated the kernels as loose suggestions and guideposts for techniques.

This challenge is a good one. In contrast to Titanic’s binary classification, this is a regression problem: the label is “SalePrice”, the price a given property sold for. Another key characteristic is that it has 80 features, which is a large number (for me, anyway), and a decent number of them have missing values (NAs).
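As a quick way to get a feel for the NA situation, here’s a minimal sketch, assuming the training set is loaded into a DataFrame called train_df (the same name used in the snippets below):

import pandas as pd

train_df = pd.read_csv("train.csv")

# Count NAs per column and show only the columns that actually have any,
# worst offenders first.
na_counts = train_df.isnull().sum()
print(na_counts[na_counts > 0].sort_values(ascending=False))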

I’m evaluating models in two ways: early on, I sometimes look at the score() function of sklearn regressors (which returns R², where 1.0 is a perfect fit and higher is better), and towards the end I only look at the root mean square logarithmic error (RMSLE), for which lower is better.
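Since the snippets below call an rmsle() helper, here’s a minimal sketch of it, assuming the standard definition (RMSE computed on log(1 + y)):

import numpy as np

def rmsle(y_true, y_pred):
    # Root mean square logarithmic error: RMSE on log(1 + y), so the
    # penalty is on relative (percentage-like) error rather than
    # absolute dollar error.
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))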

Starting off:

Using literally only the feature “GrLivArea” (above-ground living area), the one with the most obviously linear relationship to the sale price:

import numpy as np
from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from IPython.display import display  # display() is a notebook built-in; imported for completeness

# Keep only the one feature and the label, and confirm there are no NAs.
TTdat = train_df[["GrLivArea", "SalePrice"]]
display(TTdat.isnull().sum())

X_train, X_test, y_train, y_test = train_test_split(
    TTdat.drop("SalePrice", axis=1), TTdat["SalePrice"],
    test_size=0.3, random_state=42)

# Baseline 1: ordinary least squares on the single feature.
lm = linear_model.LinearRegression()
lm.fit(X_train, y_train)
print("\nLM score: {}".format(lm.score(X_test, y_test)))
print("\n\n\tLM RMSLE: {}".format(rmsle(np.array(y_test), np.array(lm.predict(X_test)))))

# Baseline 2: a random forest on the same single feature.
rfr = RandomForestRegressor(n_estimators=300)
rfr.fit(X_train, y_train)
print("\n\nRFR score: {}".format(rfr.score(X_test, y_test)))
print("\n\n\tRFR RMSLE: {}".format(rmsle(np.array(y_test), np.array(rfr.predict(X_test)))))
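Running a dead-simple linear model and a random forest side by side on the same split is a cheap sanity check: if the forest can’t clearly beat a one-feature linear fit, the extra model capacity isn’t buying anything yet.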