Spelunky is probably my favorite PC video game of 2013 (pending when I finally give Divekick a try). There’s a lot of good stuff going on in the gameplay – find your way down through randomly assembled platforming configurations and collect as much treasure as you can. You have a clear goal, but how to pursue that goal is ambiguous, the ways hazards can combine are treacherous, and the ghostly dread of the soft time limit adds a sharp tension to every level you attempt. It has all the trappings of a great game you can play basically forever.
There was one feature that really stood out to me when I began playing: the Daily Challenge. A master server randomly generates a single configuration for the day, and all players get exactly one chance to score on it. It nicely counters the random arrangement and allows players to directly compete on an equal footing while still keeping the core gameplay completely intact.
I loved it – it was exciting to boot up the game each day to try my hand at today’s challenge. Knowing that my one shot for the day was on the line added even more tension to the run, and it really brought out my best. I really had to play things smart – I had to know when to take a big risk with low resources and when to just move on, when to hold ‘em and when to fold ‘em. It seemed like the game was at its best. Then I began to see a disturbing trend in the high score list for each Daily Challenge. That’s when I realized that I was playing all wrong and came to a surprising conclusion:
Daily mode as implemented in Spelunky is actually a bad idea.
When to Hold ‘Em and When to Fold ‘Em
Why this 180 in my opinion? It has to do with Spelunky’s design and the way daily mode changes the way you play. Spelunky is largely about risk management. This shows up in several forms throughout the design, but largely ties to resources. There’s a crate behind a rock wall – if I use a bomb to get it, it might pay off with more bombs, which is always good. It might give ropes, which is maybe good depending on how many ropes I have. It might be a great item, or it might be an item I already have, which means using my bomb was a waste. You have a risky choice, but the payoff is predetermined. You just have to decide whether to buy in or not.
This manifests itself in other ways, too. You can sometimes bomb a wall open or use a rope to climb a cliff to get guaranteed treasure. If you can see it on your screen, you even get to know exactly how much. Since bombs and ropes can be bought, you can use that knowledge to quantify how far ahead you came out. But that assessment assumes that you’re going to find a shop that carries bombs or ropes on a later level and that you won’t be in a situation that requires that bomb or rope before you find that next shop. It could be that you use your last rope to get 2,500 points only to find a situation on the next level where a rope would’ve gotten you 4,500, maybe more. Maybe you even fall into a pit that requires that rope to escape, and your run is just over now. You didn’t really have a way of knowing for sure what lies ahead, and this requires you to plan for future levels and situations you haven’t even seen yet. You have to draw on your overall experience to know what risks tend to pay off when.
A really major long-term risk you have to decide on is the secret levels. In each “world” (1 through 4), there’s an item that requires some extra effort to get, and they must (mostly) be acquired in sequence. If you miss an item along the way, your attempt at the secret world 5 fails, and each step is progressively riskier.
Getting the Udjat Eye in the Mines might require some extra ropes and bombs to secure the locked chest and its key. Getting the Ankh in the Jungle requires a $50,000 buy-in, which is directly deducted from your score. Alternatively, you could steal it, which requires killing seven shopkeepers, arguably the hardest, and certainly the most volatile enemy in the game, plus you’ll be fighting additional ones throughout the rest of your run. Getting the Hedjet in the Ice Caves requires you to sacrifice the extra life that the Ankh otherwise grants you, which takes away a powerful safety net, and you must be willing to sacrifice any items or weapons you might be carrying. Getting the Scepter in the Temple requires you to fight Anubis, a powerful enemy with homing one-hit-kill projectiles. Subsequently getting the Book of the Dead in the Temple’s City of Gold requires you to fight Anubis II, who will follow you to all subsequent levels until killed.
Even after all this extra effort and risk, the entrance to the secret world 5 requires even more precision as you fight Olmec and ride him down into lava to enter the door. Doing all this gives you not only access to the City of Gold (as earlier described) but also the secret world Hell, which gives you four more levels worth of treasure to try for and another boss with a guaranteed payload of bonus treasure in his room. It’s the riskiest thing in the game but is extremely well rewarded.
So far it seems like maybe we’re doing pretty well as far as the game dynamics go. You make the best decisions you can with limited information, and when your run is over, you get a place on the leader board that gives you feedback for your decisions. Maybe you got to 3-1 with $200,000, and you got within the top 1,000 players. Then you look at the top slots: loads of them are on level 5-4. All of them approach a million, sometimes ranging over two million. Every single one of them shot the works, went for broke, and got the big payoff. Then it hits you:
If you want to win at Spelunky, you have to do the same.
Every Hand is All In!
You have to take every risk you possibly can. You have to play dangerously at all times. You have to bet the farm on every hand. If you decline a risk that would’ve paid off, if you fail to bomb a wall that leads to treasure you had no other way of knowing is there, you lost automatically because someone else in the world with the exact same information as you (i.e. not much) did take the risk, and it paid off for them.
It’s not that you’re being punished for guessing wrong – it’s that you’re forced to guess that all risks will pay off. You have to be as greedy as possible and just hope it all works out, because inevitably the top ranks on the leader board are full of even greedier people who don’t even think about playing it safe. And really they shouldn’t: if they play it safe, someone else among the thousands of daily participants will beat them. The odds are in the numbers.
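To put rough numbers on “the odds are in the numbers,” here is a minimal sketch. It assumes, purely for illustration, that every all-in run pays off with the same small probability and that players’ outcomes are independent. Neither is strictly true, since everyone shares one seed, and the 1-in-100 figure and the function name are invented.

```python
def chance_someone_hits(p_single: float, players: int) -> float:
    """Probability that at least one of `players` all-in runs pays off,
    assuming each run succeeds independently with probability p_single."""
    return 1.0 - (1.0 - p_single) ** players


# Hypothetical: a go-for-broke run works out 1 time in 100.
for field_size in (1, 2, 100, 5000):
    print(field_size, round(chance_someone_hits(0.01, field_size), 4))
```

Under those assumptions a lone gambler almost always busts, but with thousands of entrants it is close to certain that at least one of them does not, and that is the score you have to beat.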
And thus the risk management game breaks down. We’ve optimized the strategy into maximizing risk at all times. You always have to try for the secret levels, even though the Ankh is a great safety net and fighting Anubis is so dangerous, because without 5-1 through 5-4, chances are you just can’t compete. You always have to rob and kill every shopkeeper you see, regardless of how dangerous the setup, because you need all the bombs they carry to blow up as much of the City of Gold as you can (all exploded tiles spawn more treasure) and still have enough bombs to collect the column of gems in King Yama’s chamber on 5-4. And forget about buying bombs instead, because not only does buying deduct from your score, but the shopkeepers themselves drop yet more gold when killed – it’s a deficit you just can’t afford.
And if you have to do all these things, that means everyone is doing them. So really that means there’s not much reason to believe that the top slots on the leader board are much better at the risk-management game than anyone else is. In fact, there is no risk-management game at all anymore – the optimal strategy is known, and everyone just has to execute. And on top of it all, the dedicated player can cheat by watching someone else play out the Daily Challenge before making his own attempt, thus utterly crushing any hint of risky decisions.
You want proof that Daily Challenges have problems measuring decision skill? Just check the Daily Challenge leader boards. You’d think that if it’s a good measurement of skill, you’d see a lot of repeat names among the top ranks. As of the time of this writing, of the 56 entries for the top slots for the Daily Challenge (the top eight for each of the previous seven days), only 7 of the 52 total names are repeated. And from what I remember, when the game first released on PC (when the most people were playing and competition was fiercest), that number was typically zero. Even today, within the past week, no one person holds the top slot more than once. I’m not saying these players aren’t skilled, but I do think that perhaps this mode measures the wrong skill.
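If you want to run the same check yourself, a small sketch like the following would do it. The function name and the three-day sample are hypothetical; the game does not export leader boards in this form as far as I know, so you would have to copy the names in by hand.

```python
from collections import Counter


def repeat_names(daily_top: list[list[str]]) -> set[str]:
    """Names that show up in the top slots on more than one day."""
    # Count each name at most once per day, then keep those seen on 2+ days.
    days_seen = Counter(name for day in daily_top for name in set(day))
    return {name for name, days in days_seen.items() if days > 1}


# Hypothetical sample: top three names for three consecutive days.
sample = [
    ["ana", "bo", "cy"],
    ["dee", "ana", "eli"],
    ["fay", "gus", "bo"],
]
print(sorted(repeat_names(sample)))
```

The fewer names this returns over a long stretch of days, the less the top slots seem to track any stable skill.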
High Score By Suicide
The Daily Challenge isn’t the only mode where the risk-management game breaks down. What about the regular Adventure mode? If the decisions are largely optimized for the Daily Challenge, the problem is ten times worse when you’ve got unlimited chances to get a high score. See, even though we know what methods are required to get the absolute best scores, there will still occasionally be times in the Daily Challenge when you might decide that you just can’t pull off that major risk and that you need to play it safe.
Not so in Adventure mode! Now that our only directive is to get the highest score ever, any time our optimal strategy of maximized risk doesn’t pay off, we can just start over! Did you lose two of your starting four hit points on level 1-1? Suicide. Did you misplant a bomb or not get the item you wanted from a crate? Suicide. Does the first shop not have a weapon you can use to kill the shopkeeper? Suicide.
This isn’t anything remotely like managing risk. You don’t have to deal with the consequences of your actions, nor do you have to deal with challenging configurations. This is the game on easy mode. Just keep starting over until things happen just the way you want them to. Anyone who’s watched Spelunky on Twitch knows that by and large, this is the primary way the game is played.
In fact, that’s exactly the way we got the highest video-documented Spelunky score ever: Bananasaurus_Rex’s amazing $3.1 million run was predicated on hours and hours of restarting the game in order to get a plasma cannon (0.1% drop from a crate that isn’t even guaranteed to be present) on level 1-1 and a shop with a jetpack on level 1-2. Does that mean that every run from now on that doesn’t go for that once-in-a-lifetime seed is just a waste of time? Without a scoring rubric to validate scores that aren’t the highest one that has ever been, the answer is unavoidably yes.
At this point, the game has more in common with a slot machine than a risk-management engine. Forever waiting for that best seed ever in order to justify your new high score is not only tedious, but also largely fruitless when you consider that someone else hunting for the perfect seed might get an even better one than you got. In fact, that’s the very reason why we have the Daily Challenge mode to begin with – to mitigate the random seed’s effect on performance! Without that mitigation, it’s hard to justify calling the high score holder the world’s best player. While he’s certainly well-versed and skilled at the game, all we really know about Bananasaurus_Rex from his world record is maybe that he’s just the most patient player.
The same is true for speed runs, a type of play for which Spelunky is ill-suited. I’ve been watching some Twitch streams of players with quite impressive skills in that regard, but invariably speed runners will abandon their run the moment it goes the least bit wrong, especially in the early stages. Maybe the player made a mistake, maybe the seed was unfavorable, but either way the player can simply choose not to deal with even the tiniest hiccup.
The problem is the same whether you’re running for score or for speed: as long as you’re shooting for the best score, all failed attempts are essentially discarded and unscored. Your sub-optimal attempts end up not reflecting on the way the game measures your skill. It’s just as well – the only way you could look at your failed attempts is basically as losses, and the better the score or time you’re trying to beat, the more often you just plain lose.
Dealing With Your Mistakes
Don’t let the article so far convince you that I don’t like Spelunky. It isn’t a bad game by any means – it’s just that the scoring mechanism is a little broken. It encourages behavior that seems contrary to how you’d think the game is played. But we can fix this! We just need some way to reward the player for sticking with his run no matter how bad it gets. That’s what the original goal of the Daily Challenge seems to have been – to make you deal with your mistakes and make the most of them.
A while ago I was talking to a friend of mine who’s an SCA member. He participates in ranked archery shoots, and he told me about how they worked. When you shoot a flight of six arrows at one of many qualifying events, you can choose to publish that score. Your ranking is determined by the average of your top three scores. What’s more, the averaged scores have to happen within a designated twelve-month period, the “tourney season” if you will. When the season ends, you start over and your old scores go away completely no matter how high they were.
When I heard this, I realized right away that SCA archery tested a quality that Spelunky doesn’t – consistency. It isn’t enough to get a high score once. It isn’t enough that you got the highest of scores three years ago. What matters is, can you keep doing it? I realized that’s exactly what Spelunky needed.
So how do we make Spelunky test your consistency? I started with the idea of averaging your scores together. SCA archery is onto something good, but it still has that same problem wherein it motivates players to ditch scoring attempts early. If your first shot is zero points, that might already be enough to make you abandon the whole flight! Also, since you can shoot as much as you want all year long, but only submit the top three scores, you can run into some strange scenarios: maybe archer Robin only had time to shoot three times during that year, but averaged an exceptional score on those three times alone. Meanwhile, archer Joe shoots five hundred flights, most of which are well under Robin’s average. But every once in a great while, Joe gets a little lucky and shoots higher than Robin’s average. Maybe he does so three times all year long. Who’s the better archer? The scoring rubric says Joe, but it seems pretty obvious that Robin has more skill. His scores have 100% consistency, whereas Joe is better than Robin less than 1% of the time.
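The Robin and Joe scenario is easy to check with made-up numbers. This is only a sketch: the per-flight scores below are invented for illustration, and `top_three_average` is my own name for the rubric as my friend described it.

```python
from statistics import mean


def top_three_average(scores):
    """Ranking rubric as described: the mean of the three best scores."""
    return mean(sorted(scores, reverse=True)[:3])


# Hypothetical score sheets (points per flight are invented).
robin = [40, 41, 42]              # three flights all year, all strong
joe = [20] * 497 + [43, 44, 45]   # five hundred flights, three lucky ones

print("Robin:", top_three_average(robin), "overall", mean(robin))
print("Joe:  ", top_three_average(joe), "overall", round(mean(joe), 2))
```

With these numbers Joe outranks Robin on the top-three rubric even though his typical flight scores about half of what Robin’s does, which is exactly the distortion described above.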
One solution I’ve devised, one that I’m currently pioneering on my Twitch channel, is “Average of Ten.” All you need to do is play the game whenever you like while taking the average score from your previous ten consecutive runs. When a score is more than ten games ago, you bump it off the list like it never existed. This already has a lot of good things going for it. Foremost, it encourages the consistency factor. Sure, you scored $2.1 million on that one magic run six months ago. Who cares: Can you keep doing it? No longer can you just ride on the coattails of past performance – how good are you now right this second? That’s what I really care about.
Also, it makes every single run matter. Are you down to one hit point on the very first level? Tough. Make the best of it, and minimize the impact it has on your average. Turn it around despite all odds. Can you do that? Only average of ten can tell you. It also very much feels like a ladder system. One bad game will hurt, but it won’t just completely kill you like it does in the Daily Challenge. Just keep playing well, and eventually the score will disappear, and your average will be none the worse for it. Conversely when you get an uncharacteristically high score, you only gain so much from it, and with no repeat performances, even that small boost will quickly go away.
It also counters the random seed pretty well since it’s an average of ten games. Sure, you’ll get those occasional seeds that are particularly nasty, but ten times in a row? Not likely. And even if it does happen by freak accident, you’ll climb right back out afterwards when your bad luck runs out. If nothing else, it feels like playing a Daily Challenge every time you play the game! I’ve been testing this method with a lot of success. It changes the way you play the game, you can easily track it on your own without the game’s help, and with some effort you can even potentially adapt it to speed runs.
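Tracking “Average of Ten” by hand works fine, but a few lines of code do it too. Here is a minimal sketch; the class name and the sample scores are hypothetical, and it assumes that before you have ten runs logged you simply average whatever you have so far, which is my choice rather than a fixed rule.

```python
from collections import deque


class AverageOfTen:
    """Rolling average over the most recent `window` runs (ten by default)."""

    def __init__(self, window: int = 10):
        # A bounded deque drops the oldest score automatically once full.
        self.scores = deque(maxlen=window)

    def record(self, score: int) -> float:
        """Log a finished run, however it went, and return the new average."""
        self.scores.append(score)
        return sum(self.scores) / len(self.scores)


tracker = AverageOfTen()
for score in [120_000, 0, 85_000, 2_100_000]:  # hypothetical run results
    current = tracker.record(score)
print(round(current))
```

The important habit is that every run gets recorded, including the ones that end on 1-1 with nothing to show for them; skipping those would bring the restart problem right back.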
“Average of Ten” is hardly the only way we can effectively rank players in a single-player game. Dinofarm Games’ upcoming title Auro also gives you a randomized level every time you play, but instead of asking you to get the highest score ever, Auro sets a target score based on your skill level. Reach it and you earn points towards graduating to a higher skill level with a higher ranking to show for it – and a harder score to try for. Win several times in a row, and your rank increases faster. Lose and expect to see your rank drop.
Now, here is a single-player game that rewards consistency. It gives you the most honest assessment of your skill, and it gives you something to compare with your friends and rivals. On top of that, every single game has a reasonable, attainable score goal that ensures that games don’t become prohibitively long, like the seven-and-a-half-hour marathon session that poor Bananasaurus_Rex had to endure. Auro’s scheme is designed to produce quick games that still manage to give you feedback that’s extremely reliable. This idea is so strong that I feel it should be the standard for scoring and ranking single-player games. Keep an eye out for this title when it releases in August!
So what about Daily Challenge mode itself? I don’t think it’s beyond help by any means. Earlier I mentioned the problem of there being thousands of players for each Daily Challenge. Really, that’s the root of the problem: when you have thousands of participants, only one guy gets to win, and that’s the dynamic that makes everyone go for broke every game. But what if you changed the Daily Challenge into an individualized Player to Player Challenge? Imagine if instead of a worldwide free-for-all, the game matched you with someone close to your skill level and gave a pre-seeded dungeon to just the two of you? Now you don’t have to worry about someone out of thousands lucking out with a payout on every long-shot bet – you have just one opponent and a reasonable expectation that he might get killed if he tries any unwise risks.
It turns out that the problem with Daily Challenge mode wasn’t in the format, it was in its scope. Giving the same seed to just two people gives them an opportunity to properly manage risk while still retaining a good chance of winning, even if they fall short of the million dollar mark. Add in a ranking system and an unranked “Challenge a Friend” option to the mix, and this mode is ready to ship! It’s too bad we can’t do this on our own, like we can with the “Average of Ten” method – I’d play Player to Player Challenge all the time.
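As for the ranking system, one plausible candidate is an Elo-style rating, the scheme commonly used for head-to-head games. The sketch below is an assumption on my part, not anything Spelunky does: the 400-point scale and K-factor of 32 are the conventional Elo defaults, the function names are mine, and it assumes each challenge has a clear winner (whoever banked more money on the shared seed), with no ties.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that player A beats player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return both players' new ratings after one shared-seed challenge."""
    exp_a = expected_score(rating_a, rating_b)
    actual_a = 1.0 if a_won else 0.0
    delta = k * (actual_a - exp_a)  # what A gains, B loses, and vice versa
    return rating_a + delta, rating_b - delta


# Two evenly rated players; the first one wins the challenge.
print(update(1500, 1500, a_won=True))
```

Because the rating moves a little after every match and beating a stronger opponent is worth more than beating a weaker one, a single lucky run cannot carry you for long, which is the consistency property the global leader board lacks.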
Does It Matter?
It may seem to some that I’m overanalyzing a simple scoring mechanism, but it’s anything but simple. Score is the single most important part of Spelunky. It represents the goal of the game, the force that drives you to do anything at all. It’s what gives you feedback for your actions and calculates your skill. That makes scoring solely responsible for the way we play the game, and the way we’ve been playing Spelunky isn’t exactly ideal, as evidenced above.
We’ve largely solved it down to just execution, and while execution is certainly a skill, it’s just not as interesting as risk-management decisions. Risk management is what separates Spelunky from Super Mario Bros. Once you beat a Mario game, you can do so any time without much trouble. But Spelunky is never truly beaten. Sure, it has an “ending” after Olmec or King Yama, but dying on level 2-4 is just as valid an ending as those end bosses, provided your score is up to snuff. And since score is the only thing that matters, there’s always room to improve.
See, if you’re only playing the execution game, you’re only playing half the game. Whether or not you can do something is a short-term skill. Whether or not you should do something is a long-term skill, one that you can spend a long time cultivating, indeed one that gives Spelunky life far beyond just completing the last level. If your game is all about execution, you’re missing out on the best part of the game.
I have to say, I strongly disagree with the premise of this article. Here is why:
Your argument is based on the assumption that “winning” the Daily Challenge is getting the top score. Yet, only one person in the world can get that each day. One’s inability to reach that goal necessarily will contain a large luck component in a randomized game where you only get one try.
It is not true that it is required to bet the farm every time. In fact, this will only provide optimal results a small portion of the time. Many games of Spelunky simply don’t generate enough bombs to milk the City of Gold and Yama’s pillar for the maximum cash. If you spend all your bombs on cash in the wrong place, you might not have one for an instance where one is needed. Rather, Spelunky’s Daily Challenge scoring rewards the player who took the greatest risk within the bounds posed by that day’s levels. Because you (generally) have no prior knowledge of how that game will generate, there is an unavoidable luck component to it, and that explains why there are so many different names on the scorelist each day.
It is a mistake to take Bananasaurus Rex’s record game as an indication that the game is broken. He may have hit an upper bound on possible scoring, or close to it, but that doesn’t prove anything, just that a hypothetical has been achieved, once. If enough people play a game enough times, something like that will happen eventually. You may not be able to beat it, and tying it might require a one-in-ten-thousand convergence of events. But this is true of nearly all random games where score may vary according to random events. Spelunky is rare only in that it has been played enough that someone actually managed to attain it. The fact that the max score is not, in fact, infinity or MAXINT-1 is a great design strength of the game, because having watched the Nethack player community, I can tell you that if there is a way to “mint points,” someone will exploit it ruthlessly, and then the race for the high score will become whoever is best able to fill the milk pail, either through skill, luck, boredom or ability to stave off sleep.
Thanks for the comment!
I don’t actually disagree with you at all. But if what you say is true, that doesn’t speak too well of the strategy space. The core of the problem is the ambiguous goal of “get a high score.” It doesn’t mean anything by itself – get a high score compared to what? The only context for what scores are “high” is other scores. And once you realize what you’re up against by looking at the daily challenge leader board, you realize you have to go for broke. You can’t manage risk anymore – you have to always maximize risk using all the known best techniques. And you hope. I mean, you said it yourself, there’s an unavoidable luck component to it. Without the knowledge of whether or not maximizing all risk will pay out, you actually still have to maximize risk just in case it does. Because if you didn’t, and someone else did, you lose. Then again, if maximizing 90% risk was actually the right answer, maybe you win instead. So what should you do? In a sense, players are just guessing at what the best strategy will be for any given seed. That doesn’t sound good.
What’s missing from the skill measurement is consistency. When the leader board immortalizes people who got a high number once, it becomes optimal to get the highest number possible from the seed. You shoot in the dark and hope. If you blink, your score fades to obscurity. What competitive ladders need is a firmer permanence. They need a way for your performance to contribute towards an overall skill assessment. They need a way to show the player “Your skill is rated at ____” and “Your skill has gone up/down by ____.” Instead, players get a leader board place number that fluctuates by thousands every day and doesn’t really mean much of anything. A player sees “Today you’re a 2057th place player!” and shrugs and goes “Huh,” before going for broke again trying to hit the #1 slot, because that’s what really excites him. Anything below is basically “You lost.”
I agree that games where you can infinitely farm points are basically broken (I actually talk about that in the following article), but global high scores can actually create a similar dynamic. If the top score in Nethack is held by the guy who played the longest using an infinite point farm technique, the top score in something like Spelunky is held by the guy who can comb the seeds the longest. If you watch streamers of the game today, the way they all play it is by speed running the first level or so to see what kind of game that the seed will support. If it’s a good speed run seed, they’ll do that. If it’s good for score, they’ll do that. If it’s neither (usually the case), or if anything goes remotely wrong, they suicide and abandon the seed. It’s commonplace for a Spelunky streamer going for any world record to abandon 20 to 30 seeds waiting for one that looks promising, many of which are later abandoned after all for one reason or another. This takes hours at the extreme end, and will only get worse as the global high score climbs. The irony is that their play still isn’t 100% optimal, because they’re willing to sacrifice most of the money on the first board in exchange for seeing more seeds per hour. I mean, isn’t it weird that some practical, real-life situation is externally informing the strategy of a game? Their priorities prove that generating a good seed is the single most important thing, even over playing the seed optimally.
What these players are doing isn’t an anomaly or extreme – it’s just what you have to do eventually. When the only thing that the game measures is “the single best run of all time,” eventually you get to a point where you can’t even play most seeds, and who cares about those seeds anyway seeing as they’re not scored or counted in any way. This has the very real and practical effect of having players waste a lot of time trying to generate seeds worth playing, which could be seen in a way as a form of grinding. Maybe someone other than B_Rex could’ve gotten an even higher score total than he did on the seed he played, but we’ll never know because maybe whoever that person is just hasn’t generated as many seeds as B_Rex. This is no way to compare skill in a game, even a single player one.
A game shouldn’t punish higher skilled players with longer and longer down time between “real” games. It shouldn’t allow him to skip bad seeds in the first place. A game needs to have an immediacy that says “This game counts! No matter what the situation is, you need to make the most of it!” You should have to (get to!) play the game every time. The Daily Challenge is a good attempt to do this, but I think the scope is still wrong. If it’s limited to individual challenges between two players at a time, all of whom are ranked with an Elo-like system, only then do we actually have a meaningful leader board that rewards consistently good play.
It doesn’t say anything about the strategy space of Spelunky. It is an artifact of the nature of scoring systems applied to a random game. They’re not perfect, but the fact that a variety of names come up in the daily challenge scores indicates to me that the process is healthy, not degenerate.
Yes, high scores are only high compared to other scores; that’s an aspect of linear measurement. The fact that the highest score is very very high is simply an artifact of having a large player community and a variety of strategies. If you have a player base of ten people, the highest score will probably be lower than the highest score out of a player base of a hundred. The more players there are, and the more they care about playing well, the more unobtainable the top slot is going to seem to be.
You seem to be inching towards my realization about how scoring in randomized skill games works. I’ve been observing these kinds of patterns for many years, in such places as arcade game and pinball vanity screens, alt.org’s Nethack scoreboard, the Twin Galaxies Crazy Taxi scoreboard and, more recently, Pac-Man CE and CE-DX scoreboards (those last three boards I’ve been competitive on, at one time or another). All of these boards have different attributes from each other, but collectively give a good breadth of experience.
On players guessing on what the strategy will be: somewhat, but it is an educated guess. It is something I am planning on touching upon in a future @Play column, tentatively on the difference between three categories of playing skill I call “knowledge,” “skill,” and “wit.” This falls squarely (or as squarely as anything can fall) into the category of wit, or intuition, or hunches, a way of suspecting what may come ahead based on experience but outside of conscious involvement. Randomness does add luck to the game, but that is to Spelunky’s benefit, not detriment. If there was a best solution then someone would probably have discovered it, and there would be only a small number of names on the top of the scoreboard each week, those being the players who best implement that strategy or else beat everyone else to the maximum score.
You talk about consistency, about the daily challenge board “immortalizing” players who get to the top once, but that’s a petty immortality, isn’t it? I certainly can’t name the player who was champ 12 days ago.
“When the leader board immortalizes people who got a high number once, it becomes optimal to get the highest number possible from the seed.”
I think this is a non-sequitur. I don’t follow your logic.
I do think the average score you earn, or maybe your average place on the scoreboard, is a better measurement of your general skill as a Spelunky player than a single daily challenge. But Spelunky does count your average score throughout your career, and average board position is more of a personal thing: what does it mean to rate average performance on a list of averages?
A player shooting for 1st and getting 2,057, that sounds like a problem for his perceptions when faced with the realities of optimizing performance in an uncertain environment more than anything. How many players are below him? Because I can tell you, Spelunky is not a lottery, there is a very strong skill component to performance, it’s not a roulette wheel spin.
Spelunky seed combing is exactly what the Daily Challenge was designed to combat, and it does a good job. In normal games, by the way, quitting until you get a good setup is analogous to what some traditional roguelike players call “start scumming.” It is a well-known tactic and difficult to defend against, but the alternative is really letting people start fewer games, or increasing the cost (like in terms of time) of starting a new game. It’s possible to do things to combat start scumming, but they tend to cause bigger problems than they fix. (BTW, Nethack’s point minting options have a strong upper-bound because players have managed to get to MAXINT-1, the highest score the game can hold in a 32-bit signed variable. But everyone, the Dev Team included, knows Nethack’s scoring is broken.)
On Spelunky streamers playing the first level to try to guess what kind of game it’ll be, what you describe them doing actually sounds to me like highly interesting and strategic behavior. I think what you might be experiencing is culture shock from observing the lengths players will go to to optimize their play. These are valid ways to play. Even if they suicide, they basically are taking themselves out of the running for that seed. Really, player behavior like this is nearly bound to happen for a game as strategically rich and chaotic as Spelunky, if you can get enough people to care about it.
“I mean, isn’t it weird that some practical, real-life situation is externally informing the strategy of a game?”
From one point of view, sure it is. But when competition gets that significant, players will seize upon any advantage, no matter how slight. It’s the nature of competition, not Spelunky. As everyone works out optimal or nearly-optimal strategies to the point where they are taken for granted, lesser aspects naturally rise in importance. (This, by the way, is exactly the reason why corporate lawbreaking is such a huge problem; when companies compete against each other optimally, eventually it becomes the company that gains a tiny advantage who wins. If that advantage comes from breaking the law, there had better be a credible law enforcement threat to catch it, or else companies that don’t break the law will get out-competed.)
“When the only thing that the game measures is ‘the single best run of all time,'”
But it doesn’t. It tracks a full leaderboard for each daily challenge. Of course only one player can be best, but I’ve already addressed that. For most players, ranking in the thousands is fine. If you want to rank higher, you have to expend exponentially more time and energy to advance each successive place. Again, that’s the nature of leaderboards, not Spelunky; to fix it, one would probably have to devise a new ranking structure, and that’s actually not as easy to do successfully as you think. I will tell you: ELO systems are not a panacea themselves.
(Some of this may get used in a future @Play. If it does, I will send you a note about it. Also, I am having a devil of a time getting this comment to post successfully. If it somehow gets made more than once I’m sorry, but WordPress doesn’t seem to like me today.)
The actual realization from all of this is that the “high score” model is broken in general. This isn’t to discount the achievements of players who’ve been forced to labor under it for so long, but it has some pretty severe negative aspects, and for as much as Spelunky tries to minimize them, they’re still fundamentally there.
The single largest problem is that it makes games take a stupidly long time to play. The better your skill at a high score game, the longer the game tends to go. This is primarily a problem in “survival” games where the longer you survive, the higher your potential score earnings. The natural implication is that in order to distinguish players of higher and higher skill, the game needs to be able to go theoretically infinitely long with an infinitely high score bound. This is totally absurd and is actually pretty disrespectful to the player. Rewarding a player for sitting down to a 7-hour endurance run for a single game is wasteful of his time and completely unnecessary. There are simply better ways to distinguish player score.
Spelunky combats this to some degree by putting a hard cap on the number of levels in each game, but then it turns sharply the opposite direction with the implementation of ghost mining, especially with a jetpack. With unlimited upward mobility, getting the ghost to run over every gem in the stage is easy, it’s a no-brainer, and it takes forever. But it’s just something you gotta go through the motions and do if you want to be competitive, because everyone else has to do the same thing. What’s even weirder is that the ghost was supposed to put a soft cap on the length of time you spend in a level, which is actually another good design stroke, but the jetpack reduces the ghost from a serious threat to a rote task. In order for the ghost to retain its original function, it’d be pretty easy to give it an acceleration over time that makes it impossible to avoid after a certain point, thus reaffirming its “soft time-cap” function.
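To make the accelerating-ghost idea a bit more concrete, here is a minimal sketch of how a linear speed-up would restore the soft time cap. Everything in it is an assumption for illustration: the function names, the linear formula, and the numbers are hypothetical and are not taken from Spelunky itself.

```python
def ghost_speed(seconds_since_spawn, base_speed=1.0, acceleration=0.125):
    """Hypothetical ghost speed that grows linearly the longer the ghost is out."""
    return base_speed + acceleration * seconds_since_spawn


def seconds_until_unavoidable(player_top_speed, base_speed=1.0, acceleration=0.125):
    """Time after which the ghost is at least as fast as the player's best movement."""
    if acceleration <= 0:
        raise ValueError("the ghost must speed up for the cap to exist")
    return max(0.0, (player_top_speed - base_speed) / acceleration)


# Assumed top speed for a jetpack player, in the same made-up units.
jetpack_top_speed = 4.0
cap = seconds_until_unavoidable(jetpack_top_speed)
print(cap, ghost_speed(cap))
```

The point of the sketch is only that any positive acceleration yields a finite time after which outrunning the ghost stops working, no matter how mobile the player is. Tuning that time is a separate balance question.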
The problem of game length is precisely what Pac-Man CE was trying to solve. Unfortunately it carried with it a different problem where randomness played a large role in potential scoring, which gave rise to the need to “comb seeds” in that game as well. Matthewmatosis encapsulated this conflict perfectly on his YouTube series. His conclusion is that ultimately the only answer for distinguishing skill perfectly is for games to be infinitely long. However, what he doesn’t exactly highlight is that this is only true for games that use the high score method.
It’s notable that the avenue of video games is the only type of game where solo high-score is ever used. Even in solo sports like golf, the objective is to get a low score, which naturally pushes the game to get shorter and shorter. Of course, the score can only get so low, so as a result the natural way to think about relative skill at the game is (go figure) consistency. A given player can have a phenomenal game, but if they don’t have the ability to repeat it, you wouldn’t really regard them as the best overall player. So while it’s true that Spelunky has an average of your top ten daily scores, that eventually becomes a function of how many daily challenge attempts you’ve made. The fact that you can “bump off” older, worse scores from your average is an artifact from “bumping off” lesser scores of an arcade leader board. Again, if one player plays ten dailies ever and scores over a million on each, is he worse than the player who has ten games that average slightly higher, but also hide 100 other scoring runs that were 1-1 deaths? What if one player’s best ten were on better scoring seeds than another player’s best ten? So actually, the average of your best ever ten dailies brings it right back to the problem of seed scumming!
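The two-player comparison above can be worked through with a small sketch. The scores are made up purely to mirror the example, and the function names are hypothetical; this is not how Spelunky computes anything internally.

```python
from statistics import mean


def best_ten_average(scores):
    """Average of the ten highest scores; every other run is ignored."""
    return mean(sorted(scores, reverse=True)[:10])


def all_runs_average(scores):
    """Average over every recorded run, early deaths included."""
    return mean(scores)


# Hypothetical numbers chosen only to match the scenario in the text.
player_a = [1_000_000] * 10              # ten dailies ever, all strong
player_b = [1_050_000] * 10 + [0] * 100  # ten slightly better runs plus 100 deaths on 1-1

print(best_ten_average(player_a), best_ten_average(player_b))
print(all_runs_average(player_a), all_runs_average(player_b))
```

Under the best-ten rule the second player ranks higher, because the 100 failed runs simply vanish. Averaging every run reverses the order, which is the consistency argument in miniature.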
The fact is a game is responsible for how it’s played, and single-player high-score games have fundamental problems. Seed scumming is “hard to defend against” only in those games, and I think it’s time to move on from that little experiment. Like I said, high score is really only a thing in video games, and I think the reason it’s stuck around as long as it has (not really that long in the grand scheme of things) is that gamers have kind of a “love affair” with the idea. The people who first strove for high scores are still around, are still doing it, and we praise and idolize their efforts (and rightly so, for as much work as they put in). But what I’m getting at is that from a design standpoint, high scores are not viable in the end. They have a way of making games go on far longer than they need to, they reward seed scumming, and generally degrade the play experience of games they appear in. Ranking players using a high score makes the play experience absolutely Spelunky’s fault. Not any more so than any other high score game, but not any less.
And no, it isn’t easy to devise a new ranking structure, but it’s time we got on it. For now, I highly suggest you check out the mobile game Auro I mentioned in the text. It has a combination rank/difficulty system that simultaneously fixes the game length problem and the seed scumming problem, and above all rewards consistency. Remember, when we talk about the reasons why the best athletes are the best, our conversations very rarely start with “There was that ONE time when…” We always describe them more like “One of the most solid, dependable, long-running players the game has ever known.” It’s time for video games to follow.
The problem isn’t that high score is broken, it’s that it’s been called to do so much that the problems with the system, the kinds of problems that are visible in any system when you look at it too closely, have become visible. Game designers have been working on overcoming the issues for a long time; I’ve found, if you look at enough classic arcade games, you can usually find one that has at least anticipated whatever problem with scoring you have (or other game design issues, too!). But that’s beside the point.
It remains: if you want to rate players in a list, you’re going to have to do it by some metric. Even if you don’t use an explicit score, whatever means you use becomes functionally analogous to score, and the problem arises again. So really, it’s not a problem with score, but with rating linearly itself. There is probably interesting work to do at other kinds of one-dimensional sequenced rating systems, means of summarizing play and expressing skill. I’ve thought, a little, about that, but also how other systems tend to have the drawback of not being immediately understandable to a player trying to improve and rate his skill against others.
Unfortunately, this discussion has gotten to the place where it’s become kind of tiring to keep this up, especially when I have other things I have to work on. But I have a lot more to say, especially since I am more familiar than most (probably more than the person who made that video, in fact!) with the game systems of Pac-Man, Pac-Man CE and Pac-Man CE DX (which the video maker misidentifies as CE). Part of the problem with randomness concerning Pac-Man comes from the game’s weird, “discrete” nature, where the player can coerce the game into patterns because of things like integral maze lengths and turn buffering. The fact that patterns work in Pac-Man is entirely because that game’s design encourages them, although maybe not intentionally. Randomness, or rather pseudo-randomness, is used very interestingly in Pac-Man; it’s not absent, but done in a way that ensures that patterns can still work. I think the Pac-Man CE games use it in a similar way. (I’ve thought enough about Pac-Man’s design that I’ve written a Pac-Man-like game, Octropolis, that uses some of my ideas. It’s on itch.io. I can send you a free copy if you like.)
It might be a good idea to have an interactive chat to talk about these issues. Are you on IM or Skype at all?
Sorry for the delay. I’ll send you an e-mail with my info. Looking forward to hearing from you, and thanks for the comments!
Eh. Ultimately, people can just keep restarting each of those 10 times. A bit less than they probably would care to otherwise, but still, it does not invalidate that tactic. To completely eliminate it you’d have to make the score independent of the RNG, reward poor luck, or remove scoring altogether.
Yeah, I wouldn’t say that people who constantly restart are playing the game wrong or anything – actually they’re definitely playing the game right given the design of the game. What I’m saying, though, is combing through random seeds endlessly until you finally get one that has the potential to beat your previous high score isn’t a very fun way to play the game. I mean, you’re basically not playing at all until you get a good seed, and as your high-score-to-beat keeps getting higher, the problem only gets worse as potentially playable seeds get more and more rare.
My Average of Ten method somewhat reduces the role of luck in the scoring process, though not entirely. But I think the most important thing that it does is it lets you finally get to actually play the game! Every time, even! Since every time you start a game means that you have to record that score, it means you have to play every time. That’s the biggest problem with the way high score games like Spelunky work: competitive play eventually boils down to putting in a lot of meaningless work between actual games. That’s the problem that games need to fix so that players don’t feel it necessary to do that work between play.
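For anyone who wants to track this outside a spreadsheet, here is a rough sketch of a ledger where every started run has to be recorded. It assumes the average is taken over the ten most recently recorded runs; that window rule, the class name, and the sample scores are my own placeholders rather than a fixed specification.

```python
class AverageOfTenLedger:
    """Keeps every started run; nothing can be thrown away after the fact."""

    WINDOW = 10  # assumed: the average covers the ten most recent runs

    def __init__(self):
        self.scores = []

    def record(self, score):
        """Called once per started game, including instant deaths and restarts."""
        self.scores.append(score)

    def average(self):
        """Average of the most recent ten runs, or None until ten exist."""
        if len(self.scores) < self.WINDOW:
            return None
        recent = self.scores[-self.WINDOW:]
        return sum(recent) / self.WINDOW


ledger = AverageOfTenLedger()
# Made-up scores; the two zeros stand for runs abandoned on a bad seed.
for score in [120_000, 0, 85_000, 300_000, 0, 150_000, 95_000, 210_000, 40_000, 100_000]:
    ledger.record(score)
print(ledger.average())
```

Because the abandoned runs stay in the ledger, quitting a weak seed costs something, so the sensible move is to play each run out.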
Neat article. Nice to see how you’ve expanded your thoughts on the matter.
The big problem with the Daily Challenge is that there are guaranteed payoffs and that the RNG swings too wildly. The Secret World is easily the worst offender, since despite the risks, you are guaranteed a massive payoff, without which you cannot compete. Effectively, the players have to go all in every time because the system (the Secret Level) is ALSO going all in every time. The best strategy would not be so obvious if instead of one massive chunk of score at the end, players had to make moment to moment decisions regarding smaller pots. This also applies to shopkeepers – sure they’re tough and dangerous, but when you rob a shopkeeper you have a guaranteed payoff, so you have to take the “risk” and rob them all.
The RNG swings are things like the plasma cannon having a tiny chance of being spawned on level one… not acceptable, pure slot machine design. The spawn rates of baddies and goodies need to be consistent between the seeds. What really gives the situation tension in a risk-management sense is that you have to make a strategy involving uncertain details. Even if the only equipment the game has are ropes and bombs, crates would still have that risk-management situation because you have to understand the consequences of both outcomes.
Sorry for the late reply, slipped my mind.
One of the things I’ve strongly considered in my Twitch streams is banning the plasma cannon and (especially!) the jetpack. They just completely trivialize the runs so badly, and scores with them are just miles above runs without. If I were running an Average of Ten league, I’d definitely do it from the outset.
What’s also been interesting is the conversation that those items have inspired between me and my viewers. One of my clever viewers actually proposed a change to the jetpack that would potentially save it from being banned: limited fuel. Obviously you’d have to rebalance the jetpack around that fact (e.g. price in the shops and whatnot), but what’s so interesting about the limited fuel solution is that the jetpack becomes to ropes what the mattock is to bombs. Mattocks are super useful and save you a lot of bombs potentially, but they don’t last forever. I’d perhaps consider also turning the jetpack into a carried item same as the mattock (and probably changing its name/theme accordingly), but that’s just an initial idea without really any play testing. I thought it was a very savvy suggestion from my viewer, though!
I agree wholeheartedly that any game that has a randomness factor to the levels or design would have an issue ranking people by “The Best Score Ever”. What happens when a person gets that magic seed where everything goes as well as it could? Your “best of 10” ranking system has a lot of merit and would probably be useful in a lot of single player games/events. Ranking for thrown weapons in the SCA works about the same with a “top 3 scores” setup, but the same issue happens when people are going for score and they mess up enough that they stop the whole run because it won’t affect their average. They do a top 10 list at the end of the year so there is plenty of time to get scores in and to practice, but if you had an amazing day and everything went well it sticks with you for the whole year and can improve your average in significant ways.
Would you think that the more randomness that a game introduces, the more runs should be taken into consideration? I know that there are some games out there with enough randomness to make a 10 game bad streak seem plausible.
As for the “one vs. one with a single seed for those two” option using similarly ranked individuals it brings to mind puzzle games with vs. modes like some of the Tetris games and Dr. Mario so it seems like it would fit right in!
All in all, great article and I hope to see more soon.
Even using Average of Ten (maybe “A10?”) there’s probably a cut off point where a game has introduced sufficient randomness to basically be just pure noise. It really varies by the design, but the randomness is more likely to be valid for a ranking system if it’s randomized configurations rather than randomized determiners. In the former case, you have a reasonable opportunity to react to a randomized situation and exploit or mitigate it. In the latter, everyone’s performance is artificially pulled towards the statistical average, so at that point, adding more runs to the average is a greater measure of probability than of the players themselves.
What’s interesting about Player to Player Challenge is that it takes the old head to head model used by abstract games like Puzzle Bobble and makes it asynchronous. Now people can play against each other worlds apart and not have to worry about each others’ schedules or even latency. They each play at their own pace and the system takes care of the rest. In fact, it’s better than a lot of asynchronous turn-based games because when you have a lot of versus games live at a time, you don’t have to spend the extra mental overhead of relearning the game state for five different games as you return to each one. It does lose the element of players being able to act upon each other, but that actually makes games like Spelunky probably the ideal type of game for asynchronous play.
868-Hack (iPhone) uses the “best of last 10” scoring system, iirc. Free version here: http://868-hack.neocities.org/
I liked the condensed write-up of the issue.
I looked into 868-Hack as I’d heard good things about it, and what I found was that it uses a Streak Score method, wherein you sum your score from all your previous deathless games until you finally do die. The method looked intriguing at first, but ultimately I don’t think it matches the game play too well. This In Machinam article offers a great write-up on why that is so, and I largely agree with it. Amazingly, James Lantz even suggests that Streak Scoring be limited to ten runs, which is pretty spot on with what I’ve devised!
Edit: I just realized that Streak Score introduces a problem that is basically the inverse of the Daily Challenge problem: The optimal way to play is to MINIMIZE all risk. Only grab the easiest points you can and get out of levels as fast as possible so as not to jeopardize your entire Streak Score. Interesting!
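Here is a quick sketch of Streak Score as I understand it, to show why one death is so expensive. The rule that a death resets the running total to zero is my assumption about how it works, and the function name and scores are made up for illustration.

```python
def streak_totals(runs):
    """Running streak total after each run.

    runs is a list of (score, died) pairs in play order.
    Assumed rule: a death wipes the streak and the run earns nothing.
    """
    totals = []
    streak = 0
    for score, died in runs:
        if died:
            streak = 0
        else:
            streak += score
        totals.append(streak)
    return totals


# Hypothetical players: one grabs only safe points, one pushes for more and dies once.
cautious = [(20, False)] * 5
greedy = [(60, False), (60, False), (60, False), (0, True), (60, False)]

print(streak_totals(cautious))
print(streak_totals(greedy))
```

The greedy player's total climbs faster but collapses to nothing on the fourth run, while the cautious total only ever grows. That asymmetry is what pushes play toward minimizing risk.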
Thanks for another insightful article!
This “average of 10” method might actually be a way to make a LOT of games better, not just Spelunky. I mean, we usually can’t just put in a sophisticated and game-specific ranking method like the one in Auro. That’d actually have to be part of the design itself.
But I can imagine “average of 10” working decently well as kind of a “generic plugin scoring system” with e.g. any roguelike out there. In fact, any “highscore” game with non-fixed content should work much better that way! If there’s randomness and decision-making involved, it’s basically never reasonable to go for “the highest score ever”.
I think I might give Spelunky another shot, using your spreadsheet to rank myself over time. The whole scoring mechanic and the therefore “induced aimlessness” of the system frustrated me rather quickly when I first played it.
How true! This problem is far older than Spelunky, and it won’t be the last game it rears its head in, I’m sure. It was the Daily Challenge that got me to thinking about it, though, and I applaud Derek Yu even for the attempt to solve the high score problem – it means that he acknowledges it as a problem, a step not many designers have taken or even thought of.
Be sure to let me know how you do with your average! I don’t think many people are using this method yet, so I have trouble being able to tell if my average is good or not. 🙂
Great job on this article