|Context||Columbia Neural Network Reading Group/Seminar Series|
Poker can be seen as a game where, given the game state (your own private cards, the public shared cards, and all player's previous betting behavior) and an action policy (probability of betting, checking/calling, or folding given the game state), it would be beneficial to be able to compute the value of each action. Actions must be completed at a few discrete times, “turns”; and each sequence of actions will eventually produce a reward (amount of money received after a round of play). To design a poker playing system, we need to be able to compute the expected reward for each action at each turn given the game state. Poker can thus be thought of as a special case of video games, where the value (eventual reward) of each action is estimated at each frame (turn) in the game. One difference is that the number of turns/events is much smaller.
In Texas Hold'em, both players are dealt two private cards, and are then allowed to bet. Three public cards are dealt, and the players can bet again. Fourth and fifth private cards are then dealt, each followed by betting rounds. Finally, the player who can construct the best hand from their private cards and the public cards wins all of the money bet in the round.
Ideally, a poker-playing system would be flexible across different game types. The system proposed here has been trained to play casino video poker, heads-up (1-on-1) limit (meaning there is a maximum bet) Texas Hold'em and 2-7 triple draw. It could ostensibly learn to play other games variations, too.
The game state is represented as a tensor. The cards present are represented by a 4×13 (number of suits x number of cards per suit) binary matrix with 1s where cards are present. This tensor is fed into a sequence of convolutional and pooling layers, followed by dense layers. As a first pass, the network is trained to predict the win % against a random hand, e.g. try to find the probability of a win, on average, given a certain hand type. To learn to make bets, the pot size is numerically encoded as another matrix in the tensor, the player's position (whether start player or not) is encoded as an all-1 or all-0 matrix, and the current betting round is encoded as 5 all-1 or all-0 matrices in the tensor. The network is then trained to predict the value of each action (bet/raise, check/call, or fold), with a training mask used to only allow valid actions at each turn. To allow improvements, multiple models are trained, where each model is trained to beat the previous model. Interestingly, the first iteration of the model bluffs more than it should. The resulting system was better than a random player, a heuristic player, and an open-source poker AI, and was competitive against a professional human player though it lost.
One issue with the model is that it doesn't seem to learn “hard truths”, e.g. game states where it would be very unlikely to win or lose. It also tends to overfit, even with lots of training data. It may be able to learn patterns/generalize better with cards in “canonical” representation. Using different hyperparameter/model structure choices (leaky ReLUs, input normalization, more training data, fully automated self-play, experience replay, recurrence for memory) may help too.
CFR is the dominant algorithm in automatic poker playing research, used by all winners in the Annual Computer Poker Competition since 2007. The best system, from the University of Alberta, achieved a statistical tie against world-class players in one-on-one games, with useful solutions for games with more players. CFR proceeds by computing the “regret” for each action by modeling the opponent's optimal response, then re-balances its strategy in proportion to regret until the strategy doesn't change (“equilibrium”). In order to achieve this search in a reasonable amount of time, the game states are grouped into roughly equivalent “buckets”. Knowing how to design the buckets is hand-made and game-dependent. This approach has the benefit of (approximately) visiting every state, computing the regret for all actions, computing the optimal opponent response, and converging on an un-exploitable equilibrium. We have no such guarantees for a data-based method, which only learns the states/actions present in the training data. However, this may allow for better generalization, and it's also unclear how important finding the equilibrium is in practice.
Most tournament poker games are no-limit (no upper limit to bet amount). Because limit hold'em was “weakly solved” by a computer system, all computer poker competitions will be no-limit from now on. To extend “limit” systems to a variable-betting amount, discrete actions must be chosen, e.g. call, raise 2x, raise 5x, raise 10x, raise all-in, fold. However, the result can be easily exploitable, especially because the systems do not adapt their strategy to each player. To add this functionality to the CNN-based model, it could be modified to predict the value of calling and betting, as well as a confidence. And, given a bet action, the bet amount could be sampled from some distribution.