Descending into modular neuroevolution for logic circuits

A while ago, I did a post on beating OpenAI games using neuroevolution (NE). Go read that if you’re interested, but here’s the gist: a typical strategy for training an agent to beat those games is to have a neural network (NN) play the games a bunch, and then improve the weights of the NN using a reinforcement learning algorithm that uses gradient descent (GD), and it of course works pretty well.

However, an alternative to those methods is to use a gradient free method (which I’ll call “GD-free”), like I did in that post: you try a bunch of random changes to the NN’s weights, and only keep the resulting NNs that play the game well. That’s the “evolutionary” aspect of it, and using methods like that to create NNs is often called “neuroevolution” (NE).

Training a real robot to play Puckworld with reinforcement learning

After I trained an agent to play “puckworld” using Q-learning, I thought “hey, maybe I should make a real robot that learns this. It can’t be that hard, right?”

Hooooooooo boy. I did not appreciate how much harder problems in the physical world can be. Examples of amateurs doing Reinforcement Learning (RL) projects are all over the place on the internet, and robotics are certainly touted as one of the main applications for RL, but in my experience, I’ve only found a few examples of someone actually using RL to train a robot. Here’s a (very abridged!) overview of my adventure getting a robot to learn to play a game called puckworld.

Beating OpenAI games with neuroevolution agents: pretty NEAT!

Let’s start with a fun gif!

Something I’ve been thinking about recently is neuroevolution (NE). NE is changing aspects of a neural network (NN) using principles from evolutionary algorithms (EA), in which you try to find the best NN for a given problem by trying different solutions (“individuals”) and changing them slightly (and sometimes combining them), and taking the ones that have better scores.

Solving the Brachistochrone and a cool parallel between diversity in genetic algorithms and simulated annealing

In my first post on Genetic Algorithms (GA), I mentioned at the end that I wanted to try doing some other applications of them, rather than just the N Queens Problem. In the next post, I built the “generic” GA algorithm structure, so it should be easy to test with other “species”, but didn’t end up using it for any applications.

I thought I’d do a bunch of applications, but the first one actually ended up being pretty interesting, so… here we are.

Training an RL agent to play Puckworld with a DDQN

Last time I messed around with RL, I solved the classic Mountain Car problem using Q-learning and Experience Replay (ER).

However, it was very basic in a lot of ways:

  • There are really only two actions, and the state space had only two dimensions (position and velocity).
  • The way I was representing the state space was very simple, “coarse coding”, which breaks the continuous state space into discrete chunks, so in a way it still has discrete states. More interesting problems have continuous, many dimensional state spaces.
  • The representation of Q was just a state vector times a weight vector, so just linear. You can actually get a decent amount done with linear, but of course all the rage these days is in using neural networks to create the Q function.
  • The problem was very “stationary”, in the sense that the flag (where the car wanted to go) was always in the same place. Even if I had the flag move around from episode to episode, the strategy would always be the same: try to pick up enough momentum by going back and forth. A more interesting problem is one where the goal moves.

Genetic Algorithms, part 2

Last time, in case you missed it, I left off with a laundry list of things I wanted to expand on with Genetic Algorithms (GA). Let’s see which of those I can do this time!

This is pretty wordy and kind of dry, since I was just messing around and figuring stuff out, but I promise the next one will have some cool visuals.

Using Reinforcement Learning to solve the Egg drop puzzle

So last time, I solved the egg drop puzzle in a few ways. One of them used something I’d recently learned about, Markov Decision Processes (MDP). It worked, which got me really stoked about them, because it was such a cool new method to me.

However, it’s kind of a baby process that’s mostly used as a basis to learn about more advanced techniques. In that solution to the problem, I defined the reward matrix and the transition probability matrix, and then used them explicitly to iteratively solve for the value function v and the policy p. This works, but isn’t very useful for the real world, because in practice you don’t know the reward and transition matrices; you just get to try stuff and learn the best strategy through experience. So the real challenge would be letting my program try a bunch of actual egg drops, and have it learn the value function and policy from them.

Skyscraper fun with OR-Tools!

My friend Mike recently showed me a puzzle game called Skyscrapers, which you can play here. It’s a neat idea, in the general theme of “fill in the numbers with these constraints” puzzles like Sudoku or Verbal Arithmetic.

The rules are like so. You’re given a board like this, representing a group of city blocks (one building per square), with numbers around the sides:

Your goal is to fill in the squares with the numbers 1 to the width of the puzzle (4 in this case), where the number represents the height of the building on that square. There can’t be any repeats of numbers in a given row or column.

For each number on the side, that’s the number of buildings you can see, looking down that row or column, in the direction of the arrow next to it. If there’s a bigger building (number) in front of a smaller building (number) (from the viewpoint of the number on the side), you can’t see the smaller building behind it. So if you were looking down a column that had [1, 2, 4, 3] in that order, you would see buildings 1, 2, 4, but the building with height 3 is hidden behind the one with height 4.

So, you can always see at least 1 (e.g., if it were [4, 2, 1, 3]), and at most 4 ([1, 2, 3, 4]). You have to place the numbers such that all the “number of buildings seen” from each side panel are satisfied, as well as the constraint I mentioned above about the numbers in each row and column all being different.
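To make the visibility rule concrete, here's a minimal sketch in plain Python. The helper name count_visible is hypothetical (it isn't part of the solver code further down); it just walks a row from the viewer's side and counts a building whenever it's taller than everything in front of it:

```python
def count_visible(heights):
    """Count the buildings seen when looking down `heights` from index 0."""
    seen = 0
    tallest_so_far = 0
    for h in heights:
        # A building is visible only if it is taller than all buildings before it.
        if h > tallest_so_far:
            seen += 1
            tallest_so_far = h
    return seen


print(count_visible([1, 2, 4, 3]))  # 3: the 3 hides behind the 4
print(count_visible([4, 2, 1, 3]))  # 1: the 4 hides everything
print(count_visible([1, 2, 3, 4]))  # 4: every building is visible
```

Looking from the other end of the same row is just the same count on the reversed list.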

Here’s that puzzle solved, to show it:

Note that for each “seen” number on a side, it’s *from that viewpoint*, looking up or to the left or whatever, just to be clear.

One more complication to add. There are ones of bigger sizes, like 8×8 ones, but they also make them harder by removing clues along the sides, and give you hints by adding numbers that have to be in the solution. For example:

So I wanted to solve this using techniques I’ve been learning. There are probably a few ways to go about it. I actually tried both Genetic Algorithms and Simulated Annealing, with varying success, but I’ll save that for another post because I think they can do better than they are currently if I tweak them a bit.

This immediately appeared to me as a Constraint Satisfaction Problem (CSP), like we did in our Coursera Discrete Optimization course, which I’ve made a few posts about in the past. CSP are basically where you set up a set of constraints that represent the problem, such that if you find a model that satisfies all of them, you’ve found the solutions. The actual algorithms you use to solve these CSP are some things we used in the DO course (like branch and bound), but in practice you probably use a CP solver that someone else has already written, because it will probably do something special like look at the structure of the problem to set it up in an optimal way. If you do this, then you simply get a CP solver and set up the variables and constraints, which can actually be tricky itself.

There are many, many subtypes of CSP, and it’s an insanely important, dense field (that’s actually doing a lot of work behind the scenes that you might not know of). There’s actually a related field (/subfield?) of CP called Integer Programming (IP), where all the variables are restricted to be integers, so I guess we’ll technically be doing that. To be honest, it wasn’t totally clear to me what the difference was, but this and this shed a little light on the distinctions. I think we’ll actually be doing CP now, because I use a few constraints like Min and Max, whereas IP only uses linear in/equalities.

We actually used Gurobi for our course, mostly because Phil knew the most about CP solvers and suggested it. It was actually really straightforward and pleasant to use in python, and we used it to solve a Vehicle Routing Problem, which is basically the Traveling Salesman Problem on crack. My only qualm was that it seemed a little annoying to install, and it’s commercial, so you can either get a free license that limits the number of variables you can use, or get an academic license for free if you’re part of a school.

I instead opted to use OR-Tools, Google’s Optimization Tools (the “OR” is for “operations research”). I did it partly because I was curious, partly because I usually like Google’s style, partly because I didn’t want to have to deal with the Gurobi license thing, and partly because it was super easily installed through pip3. Literally just “pip3 install ortools”. I was actually flying back home from Washington state on a plane that had surprisingly fast free wifi, so I downloaded it and was off to the races.


Now, on to the problem!


I mostly hacked around with code from the OR-tools guides they have here, since there are some details that probably don’t matter immensely for my simple application. I’ll go through my code bit by bit, and use this 9×9 puzzle as an example:

The first part was to actually write the puzzle in code, which is probably going to be messy no matter what. I opted to do it as a list of 4 lists, one for each side, in the order [left, right, top, bottom], where the left and right sides are read in the order top to bottom, and the top and bottom sides are both read left to right. I call this see_list, since that’s what you’d see from the sides. If any aren’t given (like in some puzzles), I make them a 0. I also define the list of the given numbers (if there are any) as const_list, a list of tuples, each of which is the location and value of the given number. I count down and then right, starting at index 0, for the indices, so the first const_list entry is ([1,0],3). So here’s the above puzzle:

ss_99 = [[1,4,5,2,3,2,3,2,4],[2,2,2,3,4,3,1,3,5],[1,4,3,3,4,3,3,2,2],[4,2,3,2,1,3,2,3,3]]
ss_99_constlist = [([0,3],6),([0,7],2),([1,0],3),([1,8],6),([2,0],5),([2,3],7),

see_list = ss_99
const_list = ss_99_constlist

Next, I create the solver object and variable list:

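import numpy as np
from ortools.constraint_solver import pywrapcp
from time import time

size = len(see_list[0])  # puzzle width (9 here); assumed to be derived from the clue list

# Creates the solver.
solver = pywrapcp.Solver("simple_example")

#Create the variables we'll solve for
ss_vars = np.array([[solver.IntVar(1, size, "a_{}{}".format(i,j)) for j in range(size)] for i in range(size)])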

solver.IntVar() creates an integer that’s bounded between 1 and size, inclusive, and you can give it a name. So we actually have a numpy array of these solver variable objects.

Now we have to add the constraints!

The first constraints are having all the numbers in each row and column different. I think this is the first part that makes what I’m doing CP rather than IP, because I get to use the handy AllDifferent() constraint rather than having to specify them all individually. Note that I can very handily slice the numpy array, but it has to be converted to a traditional python list before getting handed to AllDifferent(), or it whines:


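# All rows and columns must be different.
for i in range(len(ss_vars)):
    solver.Add(solver.AllDifferent(ss_vars[i,:].tolist()))  # row i
    solver.Add(solver.AllDifferent(ss_vars[:,i].tolist()))  # column i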

Next, we have to add the constraints for see_list. This is the part, if any, that is a little tricky. It’s pretty easy to look at the puzzle and say “yeah, you can see 3 buildings looking down that row from the right”, but it’s not as immediately clear (to me anyway) how you would actually take the row of numbers and extract the number of ones you can see (and then set that equal to the number you’re supposed to see).

This code is pretty ugly, but I’m not sure of a cleaner way to do it. Here’s what I did. I add the constraints in a loop for each entry of a side (so 1 to the size of the puzzle), and for each iteration, I first add the constraints for the left/right sides, then the top/bottom sides. I’ll just show the first one for now, the constraints looking from the left, to illustrate the principle:

for entry in range(size):
    #left and right
    sidepair = 0
    left_top = 2*sidepair
    right_bot = 2*sidepair + 1
    #print('adding constraint for left/right sidepair {}, entry {}: {} and {}'.format(sidepair,entry,see_list[left_top][entry],see_list[right_bot][entry]))
    if see_list[left_top][entry]!=0:
        solver.Add((1 + solver.Sum([solver.Min(solver.Max(ss_vars[entry,:j+1].tolist()) - solver.Max(ss_vars[entry,:j].tolist()),1) for j in range(1,size)])) == see_list[left_top][entry])

The sidepair, left_top, and right_bot things are just indices to get the relevant element of see_list. The if statement is just making sure the value isn’t 0, i.e., there actually is a value we need to constrain (not a blank, like for the harder puzzles).

The instrumental part is the last line. What it’s doing is the following. It’s basically starting at the first element in the row (from the left in this case), and then taking two subsets of elements of the row, in order from left to right. The first subset goes from indices [1,i] and the second subset goes from [1,i+1]. The first subset represents the buildings you can see counting just the ones from 1 to i, and the second is the buildings from 1 to i+1. It takes the max of each of these (note that because we’re adding a constraint, not just calculating it, we have to use the solver’s Max function, not the python one). The idea here is that, as you include the “next” building (the one in the i+1 position), if the max changes, that means you could see another building (so you have to add 1 to the building count) but if it doesn’t, the max was already in the range [1,i], so you don’t. So we want to iterate over i such that this process will cover every subset in the row, adding 1 to the building count each time the max increases.

Because you just want a count of 1 even if the max changes by more than one (i.e., if adding the i+1 variable increased the max from 2 to 4), it seemed like I had to use something like the Heaviside step function, which I don’t think is in OR-tools, but I was able to figure out a sneaky workaround. If the next building doesn’t increase the max, then the difference between the maxes will be 0, which is what we want to add to the building count anyway. If the next building does increase the max, then their difference will be at least 1, if not more (but never negative, because it can only increase when we take into account more buildings). Therefore, we can take the Min (again, the solver version) of this difference and 1.

Then, we take the Sum (the fancy solver version) of these Min’s, plus 1 (because you always see the first building), and set that equal to the number of buildings we should see. Now that I look at it, you could actually just use the value at the i+1 index, not the subset, since that’s the only one that matters, but you’d still need to use the subset for the [1,i] range I think. I think you could also get rid of the whole subset thing by doing it in an explicit loop for each one, keeping an updated “max seen so far” variable, but I did this and it works.
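As a sanity check on the Min/Max trick, here's a small sketch that evaluates the same expression with ordinary Python min, max and sum (no solver involved) and compares it against the "max seen so far" loop, for every ordering of a 4-building row. Both function names are hypothetical, just for illustration:

```python
from itertools import permutations


def count_minmax(heights):
    """Same shape as the solver constraint: 1 + sum of min(max difference, 1)."""
    n = len(heights)
    return 1 + sum(
        min(max(heights[:j + 1]) - max(heights[:j]), 1)
        for j in range(1, n)
    )


def count_running_max(heights):
    """Explicit loop that keeps an updated 'max seen so far'."""
    seen, tallest = 0, 0
    for h in heights:
        if h > tallest:
            seen, tallest = seen + 1, h
    return seen


# The two counts should agree on every possible row of heights 1..4.
for row in permutations(range(1, 5)):
    assert count_minmax(row) == count_running_max(row)

print(count_minmax([1, 2, 4, 3]))  # 3
```

The plain-Python version only works on known numbers, of course; inside the solver the heights are still unknown variables, which is why the constraint has to be built from solver.Max, solver.Min and solver.Sum.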

I won’t go over it, but you have to do the same for see_list from the right, but you have to reverse the subsets. You also then have to change the sidepair variable so it’s doing it on the top and bottom, and just slice the var matrix differently, but it’s the same idea.
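To show how the four sides relate to each other, here's a rough sketch of a plain-Python checker for a filled-in grid. It's not the solver constraint itself, and the names visible and satisfies_clues are made up for this example. It assumes the same [left, right, top, bottom] ordering as see_list, with 0 meaning "no clue", and it only checks the side clues (not the all-different rule):

```python
def visible(line):
    """Buildings seen from index 0; assumes the heights in a line are distinct."""
    return sum(1 for i, h in enumerate(line) if h == max(line[:i + 1]))


def satisfies_clues(grid, see_list):
    """Check a filled grid against clues ordered [left, right, top, bottom]."""
    size = len(grid)
    left, right, top, bottom = see_list
    for k in range(size):
        row = grid[k]
        col = [grid[r][k] for r in range(size)]
        # Right and bottom views read the same line, just reversed.
        views = [(left[k], row), (right[k], row[::-1]),
                 (top[k], col), (bottom[k], col[::-1])]
        for clue, line in views:
            if clue != 0 and visible(line) != clue:
                return False
    return True


# A small hand-made 4x4 example (not one of the puzzles from the site).
grid = [[1, 2, 3, 4],
        [2, 3, 4, 1],
        [3, 4, 1, 2],
        [4, 1, 2, 3]]
clues = [[4, 3, 2, 1], [1, 2, 2, 2], [4, 3, 2, 1], [1, 2, 2, 2]]
print(satisfies_clues(grid, clues))  # True
```

The reversed slices here play the same role as reversing the subsets in the solver version.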

Lastly, we add the constraints for the given constants:

#Add constraints for given constants, if there are any
for const in const_list:
    ind = const[0]
    val = const[1]
    solver.Add(ss_vars[ind[0],ind[1]] == val)

Now to actually solve it, we have to use a bit of ~~MaGicK!~~. We need a “collector”, which basically just collects the solution. We also need a “decision builder”, which we pass a few magic options. Then, we just hit Solve!

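#Solution collector
collector = solver.AllSolutionCollector()
#Tell the collector which variables to record, so collector.Value() works below
collector.Add(ss_vars.flatten().tolist())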

#The "decision builder". I just used the one from:
db = solver.Phase(ss_vars.flatten().tolist(), solver.CHOOSE_FIRST_UNBOUND, solver.ASSIGN_MIN_VALUE)

#Solve it!
time_limit_s = 1
print('\n\nstarting solver with {}s time limit!'.format(time_limit_s))
start_time = time()
solver.Solve(db, [collector])
print('\ndone after {:.2f}s'.format(time()-start_time))

#print solutions
print('\nthis many solutions found:',collector.SolutionCount())
for sol_num in range(collector.SolutionCount()):
    sol = np.array([[collector.Value(sol_num,ss_vars[i,j]) for j in range(size)] for i in range(size)])
    print('\nsolution #{}:\n'.format(sol_num))
    print(sol)

I added the time limit because it seems like if you give it a broken puzzle (like, you enter a wrong number), it just hangs and ctrl+c won’t kill it, so you have to exit that terminal. However, the time limit doesn’t seem to work. Hmmm. I also made it print out all solutions it finds (though it’s usually just 1, unless you give it a really simple general puzzle).

How does it do?

It cranks the 9×9 from above in half a second:

That was actually an ‘easy’ one though, not having any blanks. For the hardest one I can find, the 8×8 hard, it actually takes about 5 minutes of sitting there on my Asus Zenbook, but it gets it!

The website notes that larger puzzles are harder than smaller ones, such that the easy 8×8 is “far harder” than the hard 5×5, by the way.


Welp, you get the point. I’m pretty sure it’ll slam anything that doesn’t have too many more blanks. I’ll probably make a post soon about my attempts on this puzzle through other, more handmade means, since this is honestly pretty black-box-y and feels a little like cheating (though it’s a very valuable tool!). I know only the very, very basics of what CP solvers are actually doing under the hood, so it’d be cool to solve it with something where I actually know how it functions.


The egg drop puzzle: brute force, Dynamic Programming, and Markov Decision Processes

I first heard this puzzle when taking an algorithms class in undergrad. The prof presented it as a teaser for the type of thing you could solve using algorithmic thinking, though he never told us the answer, or what the way of thinking is. Then, it more recently came up with my friends while we were hiking, and we were talking about it. I’ll talk about what I have so far, but first let’s say what the puzzle actually is.

There’s a building with 100 floors. You have two identical crystal eggs. They will break if dropped from (or above) some height (the same height for both), and you’d like to find that height using the fewest number of drops possible. If you drop an egg from some height and it doesn’t break, you can use that egg again. Once an egg is broken (i.e., you dropped it from that breaking height or above), you can’t use that egg again. So the question is, what’s the best dropping strategy?

So you can use the first egg to do a faster “search” between floors 1-100 (i.e., if you drop it at floor 20 and it doesn’t break, you’ve saved searching floors 1-20!). However, once your first egg is broken, the second egg has to be used to “scan” the remaining floors from the highest one you know it doesn’t break at to the one you just broke it from. For example, if you dropped your first egg at 20 and it didn’t break, and so you tried again at floor 40 and it broke, you know the breaking height is somewhere between floors [21-39], so you have to drop your second egg on floor 21, 22, 23, etc, until it breaks and you’ve found the floor.

So you can tell that there’s a bit of a balance here between “search” and “scan”. The more aggressively you search with the first egg, the more you get to skip floors when it doesn’t break, but you also have to scan a larger number of floors in between when it does.

Another detail is that this problem is a little ill-posed (at least as we remember it): it’s unclear whether the question wants the average number of drops needed for a strategy to find the height, or maybe the least-bad worst case (i.e., if someone knew your dropping strategy and they could choose the floor to make you have to use the max number of drops)? The average number seems better, but I think it actually ends up not mattering much (see below).

I solved this in three ways. First I did a brute search thing, which should give the actual average/worst case numbers, because it’s literally trying a given strategy for the egg being on each floor, and then averaging the results. Then I do a method Phil suggested based on Dynamic Programming. Last, I do a similar thing, but with a Markov Decision Process.


Brute force

In this section, I’m going to try the brute force method. I.e., for a given drop strategy I test, I’m going to build an “ensemble” where I have it use that strategy for every possible “break floor” (the floor the eggs happen to break on). This will give me the definitive average and worst case numbers for each strategy, though it will be only “empirical”; i.e., I’ll only get the best strategy for ones I try, but not necessarily the best strategy there is (see below for that!).

So I’ll admit: both when I heard this problem long ago and when I heard it again here, my mind instantly went to a bisecting log type search. That means drop the first egg on floor 50. If it doesn’t break, you split the difference of the remaining floors and do 75. If it doesn’t break again, drop from 88 (ceil(87.5)), etc. I guess I carelessly thought that because it seems like so many CS things end up being like that, but I should’ve thought more!

Here’s my quick and dirty code for trying the bisecting search. Excuse any messiness; I wanted to keep those comments in for diagnosing different variants I tried during this post and I’m kind of a code hoarder. The instrumental line is the curFloor = ceil((buildingHeight + curFloor)/2) one, which defines the next floor that it will be dropped from.

import matplotlib.pyplot as plt
import numpy as np
from math import log,ceil,floor

minList = []
maxList = []
avgList = []

buildingHeight = 100

dropCountList = []

for breakFloor in range(1,buildingHeight+1):

    firstEggWhole = True
    drops = 0
    lastUnbrokenFloor = 0
    curFloor = floor(buildingHeight/2)
    while firstEggWhole:
        drops += 1
        #print("drop {}: dropping from floor {}".format(drops,curFloor))
        if curFloor>=breakFloor:
            #print("first egg broken dropping from floor",curFloor)
            firstEggWhole = False
        else:
            #first egg survived: remember this floor and bisect what's left above it
            lastUnbrokenFloor = curFloor
            curFloor = ceil((buildingHeight + curFloor)/2)

    #second egg scans up, one floor at a time, from the last floor known to be safe
    if curFloor==breakFloor:
        #print("have to search {} more floors ({} to {})".format(((breakFloor-1) - lastUnbrokenFloor),lastUnbrokenFloor+1,(breakFloor-1)))
        drops += ((breakFloor-1) - lastUnbrokenFloor)
        #print("total drops:",drops)
    else:
        #print("have to search {} more floors ({} to {})".format((breakFloor - lastUnbrokenFloor),lastUnbrokenFloor+1,breakFloor))
        drops += (breakFloor - lastUnbrokenFloor)
        #print("total drops:",drops)

    dropCountList.append(drops)

minDrops = min(dropCountList)
maxDrops = max(dropCountList)
avgDrops = sum(dropCountList)/len(dropCountList)
print("min: {}, max: {}, avgDrops: {}".format(minDrops,maxDrops,avgDrops))

So I calculate both the max (worst case) and avg of the ensemble of the 100 cases of when the breaking height is on each floor. I get:

min: 2, max: 50, avgDrops: 19.12

That 50 is really a killer, because for half of the ensemble (floors < 50), you lose your first egg immediately and then have to scan up to it. That’s painful and probably why this method doesn’t work.

Anyway, my smarter friends more immediately thought of a constant search method, where the first egg has some stride that it checks at until the egg breaks. So if you had a stride of 20, you might take the first egg and check at 20, 40, 60, etc, until it breaks, and then scan the rest. It’s pretty easy to calculate the worst case for this method; it’s basically setting the breaking floor to use the first egg as much as possible, and then set it again to make the second egg have to scan as much as possible. So for 20, you want to make the first egg do 20, 40, 60, 80, and break on 100, meaning it will have to use the second egg to search 81-99.

Like above, there’s definitely going to be some sweet spot where the aggressiveness of the search stride isn’t outweighed by the cases where you have to scan a bunch with the second egg. My friends immediately recognized that it was probably a stride of 10, and it’s probably not a coincidence that 10 = sqrt(100). So I made another quick dirty little program (be merciful pls) to try a bunch of these strides and plot the avg and worst cases:

import matplotlib.pyplot as plt
import numpy as np

#skipLengthList = list(range(15,70))+[100,200,300,500,800,980]
skipLengthList = list(range(1,20))+[30,40]
minList = []
maxList = []
avgList = []

buildingHeight = 1000

for skipFloorLength in skipLengthList:
    dropCountList = []
    #print("\nskipFloorLength =",skipFloorLength)
    for breakFloor in range(1,buildingHeight+1):
        firstEggWhole = True
        drops = 0
        lastUnbrokenFloor = 0
        #skipFloorLength = 5
        curFloor = skipFloorLength
        while firstEggWhole:
            drops += 1
            #print("drop {}: dropping from floor {}".format(drops,curFloor))
            if curFloor>=breakFloor:
                #print("first egg broken dropping from floor",curFloor)
                firstEggWhole = False
            else:
                #first egg survived: remember this floor and move up one stride
                lastUnbrokenFloor = curFloor
                curFloor += skipFloorLength

        #second egg scans up, one floor at a time, from the last floor known to be safe
        if curFloor==breakFloor:
            #print("have to search {} more floors ({} to {})".format(((breakFloor-1) - lastUnbrokenFloor),lastUnbrokenFloor+1,(breakFloor-1)))
            drops += ((breakFloor-1) - lastUnbrokenFloor)
            #print("total drops:",drops)
        else:
            #print("have to search {} more floors ({} to {})".format((breakFloor - lastUnbrokenFloor),lastUnbrokenFloor+1,breakFloor))
            drops += (breakFloor - lastUnbrokenFloor)
            #print("total drops:",drops)

        dropCountList.append(drops)

    minDrops = min(dropCountList)
    maxDrops = max(dropCountList)
    avgDrops = sum(dropCountList)/len(dropCountList)
    #store the stats for this stride so they can be compared and plotted below
    minList.append(minDrops)
    maxList.append(maxDrops)
    avgList.append(avgDrops)
    #print("min: {}, max: {}, avgDrops: {}".format(minDrops,maxDrops,avgDrops))

avgListArgMin = np.argmin(np.array(avgList))
maxListArgMin = np.argmin(np.array(maxList))
print('best skip length in terms of avg case:',skipLengthList[avgListArgMin])
print('best skip length in terms of worst case:',skipLengthList[maxListArgMin])
print('best avg case:',avgList[avgListArgMin])
print('best worst case:',maxList[maxListArgMin])
avgLine = plt.plot(skipLengthList,avgList,'bo-',label='avg')
maxLine = plt.plot(skipLengthList,maxList,'ro-',label='worst')
plt.axvline(skipLengthList[avgListArgMin], color='b', linestyle='dashed', linewidth=1)
plt.axvline(skipLengthList[maxListArgMin], color='r', linestyle='dashed', linewidth=1)
plt.xlabel('skip floors length')
plt.ylabel('# of drops')
plt.title('building height = '+str(buildingHeight))
plt.legend()
plt.show()

So you can see that my friends were right. The dotted lines show the positions of the best values for the two metrics (the red one is actually a plateau at that, it just selected 8 because argmin() selects the first value it finds, so ignore that). Interestingly, the best avg case is also about 10. Hmmm. You can also see that the worst case pretty much perfectly tracks the average case.
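For what it's worth, here's a back-of-the-envelope argument for why the sweet spot lands near the square root. It's a sketch under simplifying assumptions that are mine, not something computed by the program: the stride $s$ divides the building height $n$ evenly, and $s$ is treated as a continuous number. In the worst case the first egg gets dropped $n/s$ times and the second egg has to scan the $s-1$ floors of the last gap:

```latex
W(s) = \frac{n}{s} + (s - 1),
\qquad
\frac{dW}{ds} = -\frac{n}{s^{2}} + 1 = 0
\;\Longrightarrow\;
s = \sqrt{n},
\qquad
W(\sqrt{n}) = 2\sqrt{n} - 1 .
```

For $n = 100$ that gives $s = 10$ and a worst case of $19$ drops, and for the stride of 20 from the earlier example it gives $5 + 19 = 24$.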

To check, I also tried it with a building size of 1000 (and therefore a step size of about sqrt(1000) ≈ 32), which gave similar results:

Same deal. Here’s a neat little detail, though. This is the very “zoomed in” set of test strides, because I guessed it would be around the sqrt(building height). If you zoom out:

In the region that’s a bad strategy anyway, the worst case increases as you might expect, but you see some pretty interesting behavior of the average case, where there are different “regimes” or something. Maybe I’ll investigate this more at some point, but I wonder if it’s a coincidence that there’s what looks like a cusp at (building height)/2…

Anyway, one last thing for now. When coding this, I realized that if you’re doing the optimal stride of sqrt(building height), and your first egg doesn’t break… well, it’s almost as if you’re restarting the problem, but with a slightly shorter building! So, you shouldn’t use the same stride that was calculated for the “original” building. That is, if building height = 100 and stride = 10, then if it survives the 10 and 20 floor drops, you still have two eggs and now it’s kind of like you’re doing the same problem with building height prime = 80, and ceil(sqrt(80)) = 9. So if you adjust this as you drop for each case, it should be better.
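Here's a minimal sketch of that adjusting-stride idea, written as a standalone function rather than as a tweak to the program above. The function names are hypothetical, and the choice of stride = ceil(sqrt(floors remaining above the last safe floor)) is my reading of the description, so treat it as one possible version:

```python
from math import isqrt


def ceil_sqrt(n):
    """Smallest integer s with s*s >= n, for n >= 1 (integer math, no float rounding)."""
    return isqrt(n - 1) + 1


def adaptive_stride_drops(break_floor, building_height):
    """Total drops when the stride is recomputed after every surviving drop."""
    drops = 0
    last_unbroken = 0
    while True:
        remaining = building_height - last_unbroken
        cur_floor = last_unbroken + ceil_sqrt(remaining)
        drops += 1
        if cur_floor >= break_floor:
            break  # first egg breaks here
        last_unbroken = cur_floor
    # Second egg scans up one floor at a time from the last safe floor.
    if cur_floor == break_floor:
        drops += (break_floor - 1) - last_unbroken
    else:
        drops += break_floor - last_unbroken
    return drops


building_height = 100
counts = [adaptive_stride_drops(b, building_height)
          for b in range(1, building_height + 1)]
print("worst:", max(counts), "avg:", sum(counts) / len(counts))
```

The scanning logic at the end is the same as in the fixed-stride program; only the way the next floor is chosen changes.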

And it is! …very incrementally.

Here it is for building height = 100:

and 1000:

(I just included more output so you can see it adjusting the stride.)

So it’s better but not crazy better. At this point, there’s a really good chance this method (best stride = sqrt of remaining floors) is the best strategy for this more general method (the “stride” method), though I still haven’t actually proven it. But even if I did, that would only prove that it’s the best strategy within the stride method. How do we know there isn’t something better?


DP method

My much smarter friend Phil came up with a very clever way to get what has to theoretically be the best strategy, independent of any general strategy like the stride thing, using concepts from dynamic programming (DP). At the time I didn’t understand it beyond the vaguest idea of “build up from the solutions to smaller subproblems”, but since then I’ve learned a tiny bit about DP. So, I’m going to try it here in a few ways!

Until recently, my only experience with DP was a brief mention of building the Fibonacci sequence up from smaller terms we did in some CS class ages ago. I’ll take the Fibonacci sequence as an example. It’s frequently defined recursively as f(n) = f(n-1) + f(n-2), with the base case of f(1) = 1 and f(2) = 1. So you can calculate it that way. If you want f(100), now just calculate f(99) and f(98), and then to calculate each of those, calculate… etc. The problem is that, if you called this in the most naive way with a recursive function, it would explode into a huge number of terms, and more importantly, very redundant terms. For example (if you’re trying to get f(100)), calculating f(99) needs f(98) and f(97), but you also need f(98) for f(100). So it’s basically totally impractical for anything large.

An alternate way, the DP way, is to “build up to” the goal you want. So if we want f(100), we calculate f(3), which is easy since we have f(1) and f(2). Then f(4) from f(3) and f(2), etc. So you can see that it’s way easier and involves storing way fewer numbers (basically just a single list of the ones you have calculated so far). Incidentally, this feels like a pretty contrived example, since I don’t think any person who isn’t already familiar with and eager to use recursion would define the Fibonacci sequence that way. I’m pretty sure I’ve usually heard people say “start with 1, 1, and then add the last two numbers to get the next number”. So, peoples’ intuitive default seems to be more like DP anyway.
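To make the contrast concrete, here's a small sketch of both approaches (the function names are just for illustration). The naive version takes a one-element list as a crude call counter, so you can see how much redundant work it does:

```python
def fib_naive(n, counter):
    """Direct recursion on f(n) = f(n-1) + f(n-2); counter[0] tallies the calls."""
    counter[0] += 1
    if n <= 2:
        return 1
    return fib_naive(n - 1, counter) + fib_naive(n - 2, counter)


def fib_bottom_up(n):
    """DP style: start from f(1), f(2) and build up one term at a time."""
    values = [1, 1]
    for _ in range(n - 2):
        values.append(values[-1] + values[-2])
    return values[n - 1]


calls = [0]
print(fib_naive(20, calls), "computed with", calls[0], "recursive calls")
print(fib_bottom_up(20), "computed with a single pass over a list")
print(fib_bottom_up(100))  # instant, while fib_naive(100, ...) would never finish
```

You could also keep the recursion and cache the results (memoization), which gets you the same savings from the other direction.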

But the point still stands: if we can break the problem into sub problems and then solve those easy subproblems first, the bigger problem will be a lot easier.

As I mentioned above, an important part of this problem is that, if there are 100 floors and you initially drop it at d=10 and it doesn’t break, you now have to solve it for floors 11-100. However, since you still have two eggs and there’s nothing at all special about these remaining floors, the remaining problem is completely identical to solving the initial problem, but with 90 floors instead of 100!

So here’s my strategy. There’s inherently a probabilistic nature to this that gets neatly taken care of with this formulation. I define a function v(f,d,e), which is the average number of egg drops that will be needed if there are f floors left to search, you drop the egg from floor d, and you have e eggs left, and for all following drops, you choose the optimal drop height. The last part is important because it means that if you’re calculating v(100,78,2), which is obviously a bad move (from what we’ve seen before, d=78 with 100 floors), we’re calculating that quantity as if after that drop, all the following choices are optimal. When there are two eggs, I’ll refer to the optimal value of v(f,d,2) as v(f) as shorthand.

This function can be divided up into two groups: 1 egg, and 2 eggs. 1 egg is simple, you just have to scan up from your highest non-break floor to the top. Therefore, v(f,*,1) = f-1, where I write * because it doesn’t matter.

For two eggs, for a given drop, there are really two possibilities: it breaks or it doesn’t. So we can define v(f,d,2) as the probability of it breaking times the drops needed if it breaks (v(d,*,1) because now you only have to scan up to floor d) plus the probability of it not breaking times the drops needed for solving the new subproblem with two eggs (v(f-d)).

If there are f floors left and you drop it from d, assuming the probability of it breaking from each floor is equal, then the chance of it breaking is (d/f) and the chance of it not breaking is ((f-d)/f).

So, using the above: v(f,d,2) = 1 + (d/f)*v(d,*,1) + ((f-d)/f)*v(f-d). The +1 is because you need to include this current drop for all the ones you consider.

And now we have a recursive relationship that we can use DP to build from! Remember, that v(f-d) is the number of drops using best play.

Lastly, how do you actually use this? It’s assuming you can just plug in the value for the best play, but how do we get that? Let’s just start.

So we start with v(1) and v(2). If you have one floor to search, I think we’re basically defining the problem to say it’s not at height 0, so you can assume that you don’t have to search at all because you know it has to break on floor 1 (maybe this is wrong, but it shouldn’t change the principle, or the answers by more than 1, I think). So v(1) = 0. For v(2), you can either do d=1 or d=2. If you do d=1, and it breaks, you’re set. If it doesn’t, you know it breaks on 2 anyway, so still good. If you do d=2, it has to break, but you still don’t know whether it would have broken on d=1, so you haven’t learned anything and have to try again.

So: v(2,1,2) =1 + (1/2)*0 + (1/2)*v(1) = 1. On the other hand, v(2,2,2) = 1 + (2/2)*1 + (0)*v(0) = 2. So we see that d=1 is optimal on average for f=2, so v(2) = 1.

Now, you can do the same thing to get v(3), which will have to consider d={1,2,3} and evaluate each. So, at each step, we’re taking the argmin to get the best d, and add it to the list.
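As a sanity check on the recurrence, here’s a small sketch that evaluates v(3) exactly using Python’s fractions module. The helper names (v_one_egg, v_two_eggs, v_best) are made up for this example, and v(0) is just a placeholder that only ever gets multiplied by a probability of 0.

```python
from fractions import Fraction


def v_one_egg(f):
    # One egg left: scan up floor by floor, v(f,*,1) = f-1.
    return f - 1


def v_two_eggs(f, d, v_best):
    """v(f,d,2) = 1 + (d/f)*v(d,*,1) + ((f-d)/f)*v(f-d), where v_best[k] holds v(k)."""
    p_break = Fraction(d, f)
    return 1 + p_break * v_one_egg(d) + (1 - p_break) * v_best[f - d]


# Values already worked out in the text: v(1) = 0 and v(2) = 1.
v_best = {0: Fraction(0), 1: Fraction(0), 2: Fraction(1)}

candidates = {d: v_two_eggs(3, d, v_best) for d in (1, 2, 3)}
v_best[3] = min(candidates.values())

for d, value in candidates.items():
    print('v(3,{},2) = {}'.format(d, value))
print('v(3) =', v_best[3])
```

Here d=1 and d=2 tie at 5/3 and d=3 gives 3, so v(3) = 5/3. When there’s a tie like this, an argmin just returns the first of the tied entries.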

So how does it do? Pretty sweet!

You can see that the best d tracks sqrt(f) perfectly (with some weird wobbles?). v(100), which should be the average number of drops needed from 100, doing it perfectly, is actually 12.8. That’s a little weird because if you look above, my brute force method for d = 10 gave an average of 10.9. I haven’t looked into why this is yet, but I’m probably undercounting in one or overcounting in another (I’m guessing at one of the edge cases, like d=1 or d=100).

Here’s the code for that:

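import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime

def v1egg(f):
    #With one egg you scan up floor by floor, so v(f,*,1) = f-1.
    return f - 1

#This is for floors 0 and 1, the anchor cases.
v2egg = [0,0]
best_d = [0,1]

f_limit = 100

for f in range(2,f_limit+1):

    #v(f,d,2) for every possible drop floor d = 1..f
    vd = [1 + (d*1.0/f)*v1egg(d) + ((1.0*f-d)/f)*v2egg[f-d] for d in range(1,f+1)]

    min_d = min(vd)
    best_d_ind = np.argmin(vd)

    #Store the optimal value and the drop floor that achieves it.
    v2egg.append(min_d)
    best_d.append(best_d_ind + 1)

print('best for f = {}: avg={}, best move={}'.format(f_limit,v2egg[f_limit],best_d[f_limit]))
print('best d', best_d[:20])

#Plot v(f) and the best d against f.
x = np.arange(1,f_limit+1)
plt.plot(x, v2egg[1:], label='v(f)')
plt.plot(x, best_d[1:], label='best d')
plt.xlabel('f')
plt.ylabel('v(f), best d')
plt.legend()

#The filename here is just a placeholder.
date_string = datetime.now().strftime("%H-%M-%S")
plt.savefig('eggdrop_dp_' + date_string + '.png')
plt.show()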


Markov Decision Process

So that worked pretty quick, but I recently learned a little about Markov Decision Processes (MDP) and wanted to use my new fancy knowledge! I realized that it could probably also be used to solve this, so I tried that. I don’t think I can fully explain what MDPs are, but I’ll just say a tiny bit that’s relevant for what I did.

In an MDP, you have a set of states your system can be in. For this problem, a state is (f,e), where f is the number of floors left (i.e., floors you still need to search to find the break floor) and e is how many eggs left, plus one state I call the “solved” state. So if there are N floors, there are 2*N+1 states. You also have a set of actions that can take you from one state to another. Here, it’s the set of floors you can drop to (plus the solved state), so there are N+1 actions. You also have a reward matrix, R_s,a, which is a reward you get if you take action a from state s. You also have a transition matrix, P_a,s,s’, which is the probability, if you take action a in state s, that you actually get to state s’. So if the system is stochastic, and if you’re in state s and take action a, you could end up in a few different states defined by P_a,s,s’. Because this matrix has an entry for every action and every pair of states, it can be pretty big. In our case, it will be (N+1)*(2*N+1)^2. You also have a state-value function, v(s), which gives you the value of being in a given state. It has one entry per state, so here it is just 2*N+1 long.

Lastly, you have your policy, p_s,a (usually denoted by pi but I’ll use p here). This is generally the goal in MDP or maybe more generally reinforcement learning, because if you solve p, it tells you the optimal action to take in every state. For every state s, it has an array of weights for each action a, so is size (2*N+1)*(N+1). Eventually, these weights should settle (at least in this problem, but maybe always as long as P and R are not time dependent?) so that only one of them is nonzero. Until then, it will add its own stochastic-ness (like P_a,s,s’) by trying different actions in the same state.

In this problem, p and v will be changing, because you’ll be solving for them. P and R will be static and known. P and R will essentially fully describe the system, but kind of indirectly and not in the most useful way. Therefore, we want to solve for v, which will describe the system in a more useful way (i.e., “it’s good to be in this state, bad to be in that one”). We’ll use this to solve for p, because we want to know what to actually do, also.

Okay, so how do we solve this?

It gets a little mathy at this point. Here’s the idea, though. There’s also a matrix, similar to v(s), called the action-value function, q_s,a. Similar to how v(s) is just the value of being in that state, q_s,a is the value of taking action a in state s. This is obviously going to contain R_s,a, but also has to include all the possible future rewards the system could get from other states and their rewards/actions.

So: q_s,a = R_s,a + gamma*sum_s'(P_a,s,s’*v(s’)). So q is basically the reward for immediately taking that action, plus the value of every state it could immediately end up in times the chance it’ll actually get in that state. The gamma variable is just how much you want to count future rewards. We want to count them completely in this problem, so we’ll say gamma=1 and forget about it.

Okay, so that’s q. On the other hand, the original v(s) can be calculated by, for each action a you can take from s, adding up the product of your policy for taking that action in this state, p_s,a, and the action-value for taking a in state s, q_s,a: v(s) = sum_a(p_s,a*q_s,a). So now we have v(s) as a function of q_s,a, and q_s,a as a function of v(s). This allows us to plug q in, and get a kind of recursive function of v:

v(s) = sum_a(p_s,a*(R_s,a + gamma*sum_s'(P_a,s,s’*v(s’))))

This is kind of funny at first glance, v(s) containing itself (partly). However, there’s a theorem that says that if you update v with the equation above using the values it currently has, and the current values of p (as well as the static values of P and R), it has to approach the correct v. At each step, we also update p by, for each state, taking the argmax of the action-value q for all the actions it can take in that state. Basically, looking at the best choice it has in a state, based on our current v.
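To see that update loop in action on something tiny, here’s a sketch with a made-up toy system (3 states, 2 actions, nothing to do with eggs). The numbers in R and P are invented for illustration, P is stored as an array indexed [a, s, s'], and I start v at zero rather than randomly so the absorbing state stays at value 0.

```python
import numpy as np

# Toy system, made up for illustration: 3 states, 2 actions, state 2 is absorbing.
R = np.array([[-1.0, -2.0],
              [-1.0, -5.0],
              [ 0.0,  0.0]])           # R[s, a]
P = np.zeros((2, 3, 3))                # P[a, s, s2]
P[0, 0] = [0.0, 1.0, 0.0]              # action 0 in state 0 always leads to state 1
P[1, 0] = [0.0, 0.5, 0.5]              # action 1 in state 0 leads to state 1 or 2
P[:, 1] = [0.0, 0.0, 1.0]              # both actions in state 1 lead to state 2
P[:, 2] = [0.0, 0.0, 1.0]              # state 2 stays in state 2
gamma = 1.0


def action_values(v):
    # q[s, a] = R[s, a] + gamma * sum_s2 P[a, s, s2] * v[s2]
    return R + gamma * np.einsum('ast,t->sa', P, v)


v = np.zeros(3)
p = np.full((3, 2), 0.5)               # start from a uniform policy p[s, a]
for _ in range(10):
    v = np.sum(p * action_values(v), axis=1)     # v(s) = sum_a p_s,a * q_s,a
    best = np.argmax(action_values(v), axis=1)   # best action per state under the new v
    p = np.zeros_like(p)
    p[np.arange(3), best] = 1.0                  # put all the weight on that action

print(v)
print(p)
```

For this toy system, v settles within a few iterations and each row of p ends up with a single 1 in it, which is the “only one of them is nonzero” behavior described earlier.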


Oof, okay. To the actual problem. So the main task here is choosing P and R matrices that accurately describe the system. In these matrices, I’ve ordered the states such that index 0 is the solved state, [1,N] are the 1 egg states, and [N+1,2N] are the 2 egg states. For the actions, index 0 is going into the solved state, and [1,N] are dropping on that index’s floor.
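One way to keep that ordering straight is a pair of small helper functions. This is only a sketch, and state_index and describe_state are names I made up for it. With 2N+1 states in total, the last index comes out as 2N.

```python
N = 10       # number of floors, as in the N=10 example
SOLVED = 0   # index of the solved state


def state_index(f, e):
    """Index of the state with f floors left (1..N) and e eggs left (1 or 2)."""
    return f if e == 1 else N + f


def describe_state(s):
    """Inverse mapping, handy when reading rows of R or p."""
    if s == SOLVED:
        return 'solved'
    return (s, 1) if s <= N else (s - N, 2)


n_states = 2 * N + 1   # solved + N one-egg states + N two-egg states
n_actions = N + 1      # go to solved + drop from floor 1..N

print(state_index(10, 2), describe_state(state_index(10, 2)))
```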

So it’s a little confusing, because you have to incentivize the process in a strange way. For example, I only want states with one egg to be able to go to the solved state (which will mean they’re scanning up from floor 1), so R_(f,1 egg state),(to solved state) will have reward -(f-1), but 2 egg states will have a huge negative reward (like -900). R_(solved, to solved) is zero, which is fine.

For ordinary 2 egg drops, the reward for every action to a state with the floor below its current floor will be -1 (takes one drop). I think we could actually also let the reward for going to states above our current state also be -1, and it would be disincentivized by it just being a longer path to the solved state, but to be sure, I also set the reward for those to be a large negative.

So at this point, for N=10 floors, here’s what R_s,a looks like. The rows are for each s, the columns are for a.

You can see that the only “allowed” (i.e., not massively negative) actions are either the scan up for 1 egg states, or an ordinary drop for two egg states.
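Here’s a rough sketch of how an R like that could be built with numpy. It’s a reconstruction from the description, not the exact matrix, and it bakes in a few assumptions: two-egg drops from d = 1..f count as “ordinary” drops (the same range the DP loop considers), every other move, including drop actions from the solved state, gets the same large negative value, and -900 is the “huge negative” from the text.

```python
import numpy as np

N = 10
BAD = -900.0   # "huge negative reward" for moves that should never be chosen

# Rows are states (0 = solved, 1..N = 1 egg, N+1..2N = 2 eggs),
# columns are actions (0 = go to solved, 1..N = drop from that floor).
R = np.full((2 * N + 1, N + 1), BAD)
R[0, 0] = 0.0                          # solved -> solved costs nothing
for f in range(1, N + 1):
    R[f, 0] = -(f - 1)                 # 1 egg, f floors left: scan up, f-1 drops
    # 2 eggs, f floors left: an ordinary drop from floor d costs one drop.
    # Assumption: d = 1..f are the allowed drop floors.
    R[N + f, 1:f + 1] = -1.0

np.set_printoptions(linewidth=150)
print(R.astype(int))
```

Printing it gives the same general picture: a column of scan-up costs for the 1 egg rows, a lower-triangular block of -1s for the 2 egg rows, and -900 everywhere else.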

So that’s for R. P_a,s,s’ is basically three dimensional (or 2D and huge), so I can’t really write it here. I’m not even sure I could have it in code as a matrix, because its size would be on the order of N^3 (maybe for 100 would be okay). So I basically made it a function instead, which you call with the three arguments a,s,s’. Because it’s kind of enforcing some rules, it is a trainwreck of if/else statements, but it works. The main idea is that most of the P values will be 0 or 1, except for the “ordinary 2 egg drops”, which will “branch” with probabilities d/f and 1-d/f (like above).
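To give a flavor of what such a function can look like, here’s a hypothetical sketch. It is not the actual Pmat used below: the name Pmat_sketch, the extra N argument, and the choice to send every disallowed move straight to the solved state (the rewards already rule those moves out) are all my assumptions. It uses the state ordering from above and Python 3 division.

```python
def Pmat_sketch(a, s, s2, N):
    """Probability of landing in state s2 after taking action a in state s."""
    if s <= N:                        # solved state or a 1 egg state
        return 1.0 if s2 == 0 else 0.0
    f = s - N                         # 2 egg state with f floors left
    d = a                             # action index = floor to drop from
    if d == 0 or d > f:               # go-to-solved, or a drop above the remaining floors
        return 1.0 if s2 == 0 else 0.0
    if s2 == d:                       # egg breaks: 1 egg, d floors left
        return d / f
    if d < f and s2 == N + (f - d):   # egg survives: 2 eggs, f-d floors left
        return 1.0 - d / f
    return 0.0


N = 10
s = N + 10   # 2 eggs, 10 floors left
row = [Pmat_sketch(3, s, s2, N) for s2 in range(2 * N + 1)]
print(row)
```

The printed row is the “branching” for dropping from floor 3 with 10 floors left: some weight on the 1 egg state with 3 floors, the rest on the 2 egg state with 7 floors, and zeros elsewhere.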

Anyway, once I’ve set up P and R, I just initialize p and v randomly, and I’m off to the races! For some iterations, I’ll repeatedly calculate the new v, use that to update p, and then repeat. Here’s the relevant code:

def policyEval(v,pi,R):
    #One sweep of v(s) = sum_a(pi_s,a*(R_s,a + gamma*sum_s'(P_a,s,s'*v(s')))).
    gamma = 1
    v_next = np.zeros(v.shape)

    for s in range(len(v_next)):
        v_sum = sum( [ pi[s][a]*(R[s][a] + gamma*sum([Pmat(a,s,s2)*v[s2] for s2 in range(len(v))])) for a in range(pi.shape[1])] )
        v_next[s] = v_sum

    return v_next


def policyImprove(v,pi,R):
    #Greedy policy: put all the weight on the action with the best q_s,a.
    pi_new = np.zeros(pi.shape)

    for s in range(len(v)):
        q_list = [sum( [R[s][a]] + [Pmat(a,s,s2)*v[s2] for s2 in range(len(v))]) for a in range(pi.shape[1])]
        best_a = np.argmax(q_list)
        pi_new[s][best_a] = 1.0

    return pi_new

v_log = np.array([v])

for i in range(20):
    v = policyEval(v,Pi_sa,R_sa)
    v_log = np.concatenate((v_log,[v]))
    Pi_sa = policyImprove(v,Pi_sa,R_sa)

After this is done, we can look at p:

A little harder to read, but if you look at the bottom half of it, those are the probabilities for each action (column) in each state (row). We can plot the values of v(s) as it was improving:

And also the 2 egg state entries of p, for each row (if you look, there is just a single 1 in each row. So we’re plotting the indices of those 1’s here):

Pretty cool! You can see that for N=10, v plateaus at roughly iteration 8.


v values:


With N=100, this actually takes a couple of minutes to run. You can see that while N=10 needed about 10 iterations, N=100 only needed about 20. Hmmmm.

The last value for v is actually -12.85, which is exactly what my DP method above found, so either they’re both wrong in the same way, or I’m guessing my original brute force method wasn’t counting the first or last drop or something.

Well, that’s it for now. It’s worth pointing out that this is only useful when you’re already given R and P, so it’s not really RL at this point, more just a method for solving some fully, but inconveniently, described system. However, it should also be solvable if P and R are defined but not given to it, and it’s allowed to sample many drops. Maybe I’ll try that next time!