I am a pretty mediocre chess player, but at the start of 2019 I was playing a lot of online Blitz. My opening repertoire was pretty slim and I kept repeating the same mistakes. I wanted to try out some new openings, but looking at existing opening databases, I found a few areas were lacking:
I decided to play around with processing some PGNs of chess games myself and see what I could make. After talking about the project with a friend who is less interested in chess but more interested in web design, we decided to collaborate.
Initially, we were going to process a few thousand games, mostly for personal interest. But pretty soon we came across the Lichess database, a whopping 800 million games as of 2019. This was perfect: it contained a wide range of player abilities and time controls, and had so many games that even after filtering, the sample size would be large enough to draw conclusions from.
After some work, we were able to process every chess state that was reached in roughly 800 games or more (out of the 800 million game database). Using this threshold allowed us to transform the more than 1TB of Lichess PGNs into a much more manageable 60GB of chess state data.
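As a rough illustration of this kind of threshold filter, here is a minimal Python sketch. It is not the code used for the site. It makes a simplifying assumption: a "state" is keyed by the sequence of moves played so far, so transpositions are not merged. The names `count_states` and `filter_states`, the `max_ply` cutoff, and the tiny sample games are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple

State = Tuple[str, ...]  # assumption: a state is the tuple of moves so far


def count_states(games: Iterable[Sequence[str]], max_ply: int = 20) -> Counter:
    """Count how many games pass through each state (move-sequence prefix)."""
    counts: Counter = Counter()
    for moves in games:
        # Each prefix of the game is one state that this game reached.
        for ply in range(1, min(len(moves), max_ply) + 1):
            counts[tuple(moves[:ply])] += 1
    return counts


def filter_states(counts: Counter, min_games: int) -> Dict[State, int]:
    """Keep only states reached in at least `min_games` games."""
    return {state: n for state, n in counts.items() if n >= min_games}


# Hypothetical toy input: four very short games as lists of SAN moves.
games = [
    ["e4", "e5", "Nf3"],
    ["e4", "e5", "Nc3"],
    ["e4", "c5"],
    ["d4", "d5"],
]
counts = count_states(games)
kept = filter_states(counts, min_games=2)  # the text above uses roughly 800
```

With the toy input, only the states after 1. e4 and after 1. e4 e5 survive the threshold of 2. Rarely reached states are dropped in the same way at full scale, which is where the reduction in data size comes from.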
However, 60GB is still completely unreasonable to ask someone to download when they open a webpage! This made it necessary for us to build a server that, when requested, returns the most relevant information for a particular chess state. This lets us show you an opening in much greater depth while sending you MBs of data instead of GBs.
A server capable of searching through 60GB of data and quickly sending you the most relevant results is by far the most expensive part of this project.
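To make the "most relevant information for a state" idea concrete, here is a small Python sketch of the lookup such a server might perform. It is only an assumption about the general shape, not a description of our actual server. The `INDEX` structure, the `top_moves` function, the state keys, and the game counts are all made up for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical index: state key -> {next move: number of games}.
# The counts below are invented for illustration, not real statistics.
INDEX: Dict[str, Dict[str, int]] = {
    "start": {"e4": 5200, "d4": 3900, "c4": 800, "Nf3": 950},
    "start e4": {"e5": 2100, "c5": 1900, "e6": 820},
}


def top_moves(state: str, limit: int = 3) -> List[Tuple[str, int]]:
    """Return the `limit` most played next moves for a state, most played first."""
    moves = INDEX.get(state, {})  # unknown states yield an empty result
    ranked = sorted(moves.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


result = top_moves("start", limit=2)
```

The client asks about one state and receives only a short ranked list, so the response stays small no matter how large the full dataset is.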
Update: Unfortunately, due to a lack of Supporters, we have had to take down our server. You can read more about this and our future plans on our Blog.
After processing the Lichess database, we found it was great for representing online Blitz chess, but had fewer classical time control games, especially at a very high skill level. This led us to add Kingbase, with its 2 million tournament games between >2000 Elo players.
The graph produced was interesting enough that we decided to go for even higher Elo and add chess engine games. Luckily, the Computer Chess Rating Lists (CCRL) provide a free database of such games. We used the longest time control, equivalent to 40 moves in 40 minutes on an Athlon 64 X2 4600+ (2.4 GHz), to get the highest quality games.
We make use of the following freely provided datasets:
Lichess: A very large free database of public chess games, provided as public domain information by the Lichess website. It contains a large range of player skill and game modes. lichess.org
Kingbase: A medium-sized database of tournament and TWIC archive games. It contains only games with >2000 Elo players after 1990, usually collected from tournaments. www.kingbase-chess.net
ChessEngine (CCRL 40/40): A medium-sized database of a large range of chess engines playing each other, with a time control equivalent to 40 moves in 40 minutes on an AMD Athlon 64 X2 4600+ at 2.4 GHz. There are no human players in the dataset. ccrl.chessdom.com/ccrl/4040