Artificial Intelligence and Hold’em, Part 1: Counter-Factual Regret Minimization

It has been a good year for machine learning and artificial intelligence. And no, I’m not just referring to the Russian scientist who claims his Robotchef can make a perfect bowl of crab bisque.

More relevant to the poker world, this was the year that researchers at the University of Alberta claimed to have “solved” heads-up limit hold’em, calculating a strategy for all possible cases of the game within 1% of an unexploitable Nash equilibrium. Meanwhile the poker group at Carnegie Mellon University in Pittsburgh has tested a heads-up no limit hold’em system against some of online poker’s best heads-up players, a match that the humans won, but only by a statistically inconclusive margin.

These results have been covered here before. What I thought we’d do today is to dive into an algorithm intriguingly called “counter-factual regret minimization” or CFR for short, one that has a lot to do with how poker-related AI works. Besides having a great name, CFR is important — it’s the math doing the heavy lifting for both the near-perfect limit hold’em solution, and the highly competitive no-limit hold’em AI.

Introducing CFR

Viewed from a distance, CFR is pretty straightforward, especially to those who are familiar with the Nash equilibrium.

A Nash equilibrium, where one exists, is an unexploitable, probabilistic strategy. “Probabilistic” means that in some cases, you would bet 50% of the time and check 50% of the time, subject to a random coin flip. “Unexploitable” means that no opponent can gain a long-term advantage, averaged over all possible game actions (that is, all different card and bet combinations), even if that opponent knows exactly what strategy you are using, so long as he does not know your face-down cards or the results of the coin flips you’ll use to make those 50-50 decisions.
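To make “unexploitable” concrete, here is a tiny sketch in a toy game (matching pennies, my own illustration, not poker): at the 50-50 equilibrium mix, the strategy earns the same expected amount against every pure counter-strategy, so knowing the mix gains the opponent nothing.

```python
# Toy illustration (mine, not from the article): matching pennies.
# We win 1 if the coins match, lose 1 otherwise.
def ev_vs_pure(our_mix, opp_choice):
    """Our expected payoff when the opponent always plays opp_choice."""
    return sum(p * (1 if ours == opp_choice else -1)
               for ours, p in our_mix.items())

equilibrium = {"heads": 0.5, "tails": 0.5}
evs = [ev_vs_pure(equilibrium, choice) for choice in ("heads", "tails")]
print(evs)  # [0.0, 0.0] -- every counter-strategy does equally well
```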

Nash equilibriums are known to exist for two-player games meeting certain conditions, a category of games that includes both heads-up LHE and heads-up NLHE. Nash equilibriums exist for large multi-player games as well, but they are far harder to compute, and playing one no longer carries the same guarantee against exploitation. In practice, though, approaching these games by trying to approximate and apply an unexploitable equilibrium strategy can be effective, especially against tough opponents, a topic I explored previously in “Game Theory Optimal Solutions and Poker: A Few Thoughts.”

There are about 320 trillion “information sets” in heads-up limit hold’em, if we record each combination of dealt cards and sequence of bets as a separate event. This sounds like a huge number, but it’s not like the number of atoms in the galaxy, or anything like that. Stephen Hawking makes a good point in A Brief History of Time that every equation that appears in a book cuts readership by half, so I’m going to follow that advice and avoid equations in these articles. You’ll have to take my word on the math, then, and I’ll try to avoid numbers where they are not strictly necessary.

The MacBook Air I’m typing on includes a 2.13 gigahertz Intel processor. Dividing the 320 trillion possible information sets for limit hold’em by those 2.13 billion cycles per second yields about 1.74 days’ worth of computing work. Of course, looping over all the possible hands is not quite that simple, but it can be done, even on your laptop.
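The back-of-envelope arithmetic works out like this (assuming, wildly optimistically, one information set per clock cycle; the point is the scale, not the precision):

```python
# One information set per clock cycle is an idealized assumption.
info_sets = 320e12      # ~320 trillion information sets in heads-up LHE
clock_hz = 2.13e9       # 2.13 GHz processor
seconds = info_sets / clock_hz
days = seconds / 86_400  # 86,400 seconds in a day
print(round(days, 2))   # ~1.74
```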

However, you can’t hope to loop over all hands (or even 0.01% of all hands) every time the AI needs to make a decision. When you do loop over all of the hands, it would be nice for the computer to crunch a few numbers, update a counter, and move on. Furthermore, it would be great if you could loop over the different hands while excluding all duplicates. Lastly, it would be especially helpful if you could use several computers to examine different situations in parallel, and then combine their work into a single Nash equilibrium solution.

This is where the elegance of counter-factual regret minimization comes in.

Minimizing Our Regret

Let’s consider a specific game situation. You have an offsuit hand on the button and there is \$400 in the middle as the flop comes rainbow (three different suits). Your opponent checks and the action is on you. What do you do?

Well, you can check or bet. For each choice you can then play out the rest of the hand against a large random set of opponent hands, through something called a brute-force tree traversal. We can fast-forward past the particulars of how this is done. Think of it as solving a variant of hold’em every time, where your preflop cards and any face-up cards are fixed, and we solve for perfect responses by both players as we play out the rest of the hand. Obviously, neither player knows what cards will come next, and your opponent cannot distinguish between states where you hold your actual cards and states where you hold different cards, but the betting and face-up cards remain the same.

You’d think that the CFR algorithm would now choose the action with the highest score against a perfect opponent, but this is where the “regret” comes in. What the algorithm does is look at all strategies that do not include the move in question, and count how much we “regret” having excluded this action from our mix. Do we regret it (CFR asks) if our opponent knows that we will never check? And do we regret it if our opponent knows that we will never bet in this spot, with this given hand?

In other words, our baseline is to bet 50% of the time and check 50% of the time, simply because these are the two actions that are allowed. Then we explore what happens if we never perform one of those actions. Say we never bet. Do we regret it? By how much? And what happens if we never check? Would we regret that as well?

I think that with our offsuit hand on the button and this rainbow flop, we would regret never betting, although we might also regret never checking the hand. The amazing thing about the CFR algorithm is that we can take each action in proportion to its cumulative regret (i.e., the remorse of never taking that action), and that ratio gets us provably close to a Nash equilibrium.
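That proportional rule is usually called “regret matching,” and it is simple enough to sketch (a minimal version, with made-up regret numbers):

```python
def regret_matching(cumulative_regret):
    """Mix actions in proportion to positive cumulative regret.
    Falls back to a uniform mix when no action has positive regret."""
    positive = {a: max(r, 0.0) for a, r in cumulative_regret.items()}
    total = sum(positive.values())
    if total == 0:
        n = len(cumulative_regret)
        return {a: 1.0 / n for a in cumulative_regret}
    return {a: r / total for a, r in positive.items()}

# Invented numbers: we regret never betting three times as much
# as we regret never checking, so we bet 75% of the time.
print(regret_matching({"bet": 300.0, "check": 100.0}))
# {'bet': 0.75, 'check': 0.25}
```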

It also turns out that we don’t need to solve each game state by playing it out to the end. As long as we’re improving the strategy at each individual game state by counting the regret and attempting to minimize it, we can use the current “best strategy” for playing out the rest of the game to look up the value of the next state in simulating our hand, and that’s good enough. As each state of the regret-minimizing solution improves a little bit as we cycle over all of the game states, we eventually converge on a good average strategy.
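The convergence can be seen in a toy setting. The following is my own minimal illustration, not the hold’em implementation: both players in rock-paper-scissors repeatedly apply regret matching, and the time-averaged strategy drifts toward the uniform Nash equilibrium even though any single round’s strategy may wobble.

```python
import random

random.seed(0)

ACTIONS = ("rock", "paper", "scissors")
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def payoff(mine, theirs):
    """+1 for a win, 0 for a tie, -1 for a loss."""
    if mine == theirs:
        return 0
    return 1 if BEATS[mine] == theirs else -1

def regret_match(regret):
    """Mix in proportion to positive cumulative regret (uniform fallback)."""
    pos = {a: max(r, 0.0) for a, r in regret.items()}
    total = sum(pos.values())
    if total == 0:
        return {a: 1.0 / len(regret) for a in regret}
    return {a: r / total for a, r in pos.items()}

def sample(strategy):
    return random.choices(list(strategy), weights=list(strategy.values()))[0]

regrets = [dict.fromkeys(ACTIONS, 0.0) for _ in range(2)]
strategy_sums = [dict.fromkeys(ACTIONS, 0.0) for _ in range(2)]

for _ in range(50_000):
    strats = [regret_match(r) for r in regrets]
    plays = [sample(s) for s in strats]
    for i in (0, 1):
        opp = plays[1 - i]
        realized = payoff(plays[i], opp)
        for a in ACTIONS:
            # regret grows when pure action a would have beaten what we did
            regrets[i][a] += payoff(a, opp) - realized
            strategy_sums[i][a] += strats[i][a]

avg = {a: s / 50_000 for a, s in strategy_sums[0].items()}
print({a: round(p, 2) for a, p in avg.items()})  # each close to 1/3
```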

Bring In the (Virtual) Bucket Brigade

It is still not practical to visit all 320 trillion possible game states. However, there are really only 2,428,287,420 “canonical” game states, if we ignore bets and just focus on one player’s cards and the community cards. For example, we map every version of a given offsuit hand preflop to a single state, rather than treating its 12 possible suit combinations separately.
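Preflop canonicalization is easy to verify by brute force over a deck: collapse each two-card holding to its ranks plus a suited flag. That yields the 169 canonical preflop hands discussed later in this article, and shows the 12-fold collapse for any one offsuit hand (ace-seven offsuit is just an arbitrary example here).

```python
from itertools import combinations

RANKS = "23456789TJQKA"
SUITS = "cdhs"
DECK = [r + s for r in RANKS for s in SUITS]

def canonical(c1, c2):
    """Collapse a two-card holding to (high rank, low rank, suited?)."""
    hi, lo = sorted((c1[0], c2[0]), key=RANKS.index, reverse=True)
    return (hi, lo, c1[1] == c2[1])

canonical_hands = {canonical(a, b) for a, b in combinations(DECK, 2)}
print(len(canonical_hands))  # 169

# Any one offsuit hand covers 12 of the 1,326 raw two-card combos.
a7o = [h for h in combinations(DECK, 2) if canonical(*h) == ("A", "7", False)]
print(len(a7o))  # 12
```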

Besides canonicalization, we can further reduce these 2.4 billion-ish game states to something more reasonable by grouping similar hands together, both preflop and after the flop. This grouping is called “bucketing.” Think of the bars in a bar graph. The data points in a bar aren’t the same, but they are close enough, and it’s easier to visualize them as a category.
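As a sketch of the bar-graph analogy (my own illustration; real bucketing schemes are the subject of the next article), imagine summarizing each hand with a single strength number between 0 and 1 and grouping hands into equal-width ranges:

```python
# Simplified stand-in for bucketing: the strength score is an invented
# one-number summary, not how the actual abstractions score hands.
def bucket(strength, n_buckets=10):
    """Map a strength in [0, 1) to one of n_buckets equal-width buckets."""
    return min(int(strength * n_buckets), n_buckets - 1)

print(bucket(0.03), bucket(0.55), bucket(0.999))  # 0 5 9
```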

By bucketing similar hands together, the CFR algorithm drifts further from a true Nash equilibrium, but it still approximates one in the real poker game, as long as the buckets maintain “perfect recall,” meaning that hands grouped together on the flop must come from the same preflop bucket, and likewise on later streets.

For example, the first CFR implementation for heads-up limit hold’em — which dominated all other limit hold’em bots back at the 2007 Annual Computer Poker Competition — grouped preflop hands into 10 buckets, hands on the flop into 100 buckets (10 flop buckets for each preflop bucket), 1,000 buckets on the turn (10 for each flop bucket), and finally 10,000 buckets on the river.

Since then, others have shown that in practice, buckets need not satisfy this perfect recall requirement. If we’re limited to CFR with 50 million information sets (the equivalent of 10,000 river buckets once we include betting history information for each bucket), it is best to use 169 preflop buckets (since there are only 169 possible preflop hands in canonical form), then 9,000 buckets each on the flop, the turn, and the river.

To a poker player, grouping similar hands together into a single decision point makes sense. Most of us do all right keeping far fewer than 10,000 canonical cases in mind when making our decisions. That said, I would not be surprised if a very strong short-handed hold’em player really could rattle off 1,000 cases for which he or she has an instant strategy ready. Someone who plays 30 tables online would probably know even more cases, and be able to look them up and apply them immediately.

It might sound tedious to look at every possible game state (even if we chop these down to 10,000 regret buckets), to consider every possible action, and then to calculate the regret of never taking that action. But to a computer these are just calculations — tree traversals, sums, and averages — and after about two days on a fast computer, the 10,000-bucket CFR algorithm stabilizes.

To make a move in a heads-up limit hold’em game, it just needs to look up the current bucket and apply a near-equilibrium strategy in the form of X% bet/raise, Y% check/call, Z% fold. Done and done.

CFR and NLHE

How do we arrange similar hands into buckets? We’ll get to that eventually. First let’s ask another question — how close does CFR get to an unexploitable Nash equilibrium?

The CFR with 10,000 buckets that I described above beat all other competitors in the computer poker world circa 2007, and played poker pros Phil Laak and Ali Eslami to a draw (albeit over a limited sample of hands).

More recent analysis from 2013 showed that this CFR could be exploited for 289 “millibets per hand” by a perfect player. That means it would lose \$28.90 per hand of LHE at \$100/\$200, or a bit more than a quarter of a small bet per hand. This sounds like a big edge, but in practice no real player would win that much, since he would have to know the CFR’s strategy and compute a perfect counter-strategy at each decision point to exploit the CFR’s weaknesses.

By mimicking this perfect counter-strategy within each bucket, the team behind CFR improved it to play within 64.8 millibets per hand, or \$6.48 per hand at \$100/\$200 stakes, even while staying limited to 10,000 hand buckets. That’s very close to an unexploitable player.

With a heroic disregard for silicon, earlier this year the University of Alberta researchers took this process to its logical conclusion, with more buckets and several months of parallel computing on 200 machines, to get the CFR within 0.986 millibets, or just under 10 cents per hand, of the Nash equilibrium at \$100/\$200 LHE.
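The dollar figures in this section follow from one conversion, assuming a “millibet” means a thousandth of the \$100 small bet at these stakes:

```python
# One bet at $100/$200 limit is the $100 small bet, so a millibet
# is worth ten cents; check the three exploitability figures above.
millibet = 100 / 1000  # dollars per millibet
for mb in (289, 64.8, 0.986):
    print(mb, "->", round(mb * millibet, 2))
# 289 -> 28.9, 64.8 -> 6.48, 0.986 -> 0.1
```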

Within their equilibrium, the button is worth \$8.90 per hand. They also showed that it’s worth raising the vast majority of preflop hands on the button, except for a few combinations that are better folded (all of them unsuited hands including a deuce, trey, or four). Interestingly, their equilibrium strategy recommends 100% raising, 100% folding, or 100% calling with the vast majority of preflop hands. I found it intriguing that a strategy that solves across all possible situations, and is able to learn nuanced mixed strategies (like 20% bet, 80% check), usually suggests taking a specific action 100% of the time in individual cases.

None of this is necessarily true for no-limit hold’em. While it’s a challenge to visit all of the game states a sufficient number of times to reach a near-equilibrium solution in limit hold’em, it is possible. In no-limit hold’em, it is not possible. Think about it — instead of two or three decisions, in NLHE we can check, min-bet, 2x min-bet, 3x min-bet, bet half the pot, bet the pot, raise all-in, or anything in between. Even if the AI only uses a few different bet sizes, it still has to react to an opponent that will make bets that the AI itself would not consider.

Even so, in practice CFR has still been by far the best strategy for heads-up no-limit hold’em AIs. The winning NLHE team from Carnegie Mellon in last year’s heads-up no-limit competition at the Annual Computer Poker Competition — whose AI is called Tartanian7 — used a form of CFR, although they also mixed in other methods. Meanwhile the second-, third-, and fourth-place teams all used CFR to approximate an unexploitable strategy. On his website, the creator of the second-place AI, called Prelude, boasts of pre-computing “25.5 billion separate probabilities that determine its decision in every possible situation.”

It is daunting having to respond instantly to any game position. I’m not sure how the no-limit hold’em CFR implementations handle bet amounts that their own systems do not generate. I suppose one can always interpolate any value, if you have estimates for a larger and a smaller bet amount. A former Google colleague of mine, whose AI named Slumbot also uses CFR and placed fourth in the same NLHE competition, says his implementation considers many bet amounts for common states, but only a couple of bets for states that are not often visited in practice.
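Here is one way such interpolation might look (my guess at the general idea, not any team’s actual method; the solved bet sizes are invented):

```python
import bisect

# Hypothetical abstraction: the only bet sizes the system solved for.
SOLVED_BETS = [50, 100, 200, 400, 800]

def nearest_solved(bet):
    """Return (lower, upper, weight_on_upper) for an off-tree bet size,
    clamping bets outside the solved range to the nearest endpoint."""
    if bet <= SOLVED_BETS[0]:
        return SOLVED_BETS[0], SOLVED_BETS[0], 0.0
    if bet >= SOLVED_BETS[-1]:
        return SOLVED_BETS[-1], SOLVED_BETS[-1], 0.0
    i = bisect.bisect_left(SOLVED_BETS, bet)
    lo, hi = SOLVED_BETS[i - 1], SOLVED_BETS[i]
    return lo, hi, (bet - lo) / (hi - lo)

# An opponent's $300 bet falls halfway between the $200 and $400 trees.
print(nearest_solved(300))  # (200, 400, 0.5)
```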

This matters when your algorithm has to consider not only 10,000 hand buckets, but also the betting histories for entering all of those buckets. I’m sure all of the players have a secret sauce in their personal take on CFR, just as human NLHE experts each arrive at their brilliant strategies just a little bit differently.

No-limit hold’em is too big a game to solve, or even to come close to solving. It is perhaps even too big to measure how close one might be to an unexploitable equilibrium. But looking for a near-equilibrium solution for a simplified version of heads-up NLHE is effective, and it is the dominant approach for creating a strong no-limit hold’em computer AI player.

That’s all for now. Next time, we’ll take a closer look at bucketing, the method by which it’s possible to reduce 2.4 billion canonical hold’em hands into something a bit more manageable. This involves a lot more creativity than you might think. Can we treat all medium preflop pairs as the same hand? Can a computer understand flop texture? As another friend at Google likes to put it, grouping a large number of data points into a smaller number of similar cases is “more art than science.”

If you’d like to read more about CFR, check out the original 2007 paper, and for more updated techniques and details see this one from 2013. Thanks to all of the researchers at the University of Alberta, Carnegie Mellon University, and elsewhere who created these systems and make them available for anyone to learn from, free of charge.

Image (foreground): “Artificial Intelligence,” Alejandro Zorrilal Cruz. Public domain.

Nikolai Yakovenko is a professional poker player and software developer residing in Brooklyn, New York who helped create the ABC Open-Face Chinese Poker iPhone App.