Friday, April 1, 2016

The GDL engine performance-testing framework

The Game Description Language is declarative, not imperative. It defines the rules of a game, but not the process by which a program should apply those rules to produce facts. This has led to a wide variety of GDL interpreters, or engines, that handle this computation, and they vary considerably in speed and accuracy across games.

A paper presented at the GIGA'13 conference by Yngvi Björnsson and Stephan Schiffel compared the performance of six such engines on 12 games. (To access the paper, select "proceedings" on the conference website and scroll to page 55.) To follow up on this experiment, and to test my own creations, I decided to write an open-source framework for this type of performance testing.


The gdl-perf project


The following were design goals of the framework:
  • It should support engines written in any programming language.
    • Corollary: The engine being tested should run in a separate process from the framework code.
  • It should be possible to set up testing and let it run for as long as it needs with minimal handholding from the user. If the computer it runs on is shut down, the framework should be able to pick up where it left off (a sketch of one way to do this follows this list).
  • Results should be stored in a format usable by different kinds of analysis.
  • It should be easy to test a large number of games.
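To make the resumption goal concrete, here is a minimal sketch of one way to decide whether an engine/game pair still needs testing: check whether a results file already exists for it. The results/<engine>/<game>.csv layout and the class and method names are assumptions for illustration, not necessarily how gdl-perf organizes its output.

    import java.io.File;
    import java.util.List;
    import java.util.Optional;

    // Minimal sketch: an (engine, game) pair counts as already tested if a
    // results file for it exists, so a restarted run naturally skips finished work.
    // The results/<engine>/<game>.csv layout is an assumption for illustration.
    public class UntestedPairFinder {
        public static Optional<String[]> findUntestedPair(
                File resultsDir, List<String> engineVersions, List<String> gameKeys) {
            for (String engine : engineVersions) {
                for (String game : gameKeys) {
                    File resultFile = new File(new File(resultsDir, engine), game + ".csv");
                    if (!resultFile.exists()) {
                        return Optional.of(new String[] {engine, game});
                    }
                }
            }
            return Optional.empty(); // everything has already been tested
        }
    }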
The framework itself is written in Java. The main loop for running tests:
  • Find an engine/game pair that hasn't been tested yet.
  • Start a process that will run a test for the given engine. As arguments, give it the name of a file containing the game to run, the name of a file to write results to, and the length to run the test for.
    • It is the test process's responsibility to run the test and write results to the file.
    • The measured time does not include one-time initialization; the framework assumes initialization may take some additional time beyond the test length. The test process is responsible for recording and reporting the actual time taken.
  • The framework waits for the created process to complete. If a response takes much longer than expected, the framework terminates the created process and records an error. (This prevents a single bad case from causing the overall testing to hang; a rough sketch of this step follows.)
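Here is a rough sketch of that spawn-and-wait step, assuming hypothetical command-line arguments, timeout policy, and a placeholder error-recording method; the framework's actual details may differ.

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.TimeUnit;

    // Sketch of one iteration of the main loop: launch the engine-specific test
    // command in its own process, then wait with a generous timeout so a single
    // hung engine/game pair cannot stall the whole run. The argument order,
    // timeout, and error handling are illustrative placeholders.
    public class TestProcessRunner {
        public static void runOneTest(List<String> engineCommand, File gameFile,
                                      File outputFile, int testSeconds) throws Exception {
            List<String> command = new ArrayList<>(engineCommand);
            command.add(gameFile.getAbsolutePath());
            command.add(outputFile.getAbsolutePath());
            command.add(Integer.toString(testSeconds));

            Process process = new ProcessBuilder(command)
                    .inheritIO() // let the test process's output show up in our console
                    .start();

            // Allow extra time beyond the nominal test length for startup and shutdown.
            boolean finished = process.waitFor(testSeconds * 4L + 60L, TimeUnit.SECONDS);
            if (!finished) {
                process.destroyForcibly();
                recordError(outputFile, "Test process timed out");
            }
        }

        private static void recordError(File outputFile, String message) {
            // Placeholder: the real framework records the error in its results.
            System.err.println(message + " for " + outputFile);
        }
    }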
Tests themselves consist of continually exploring the outcomes of random series of moves and computing the goal values of the end states. Each test runs for 30 seconds, which should give adequate time for cache warmup and on-the-fly optimizations (including Java's JIT compiler) to take effect. This is slightly different from the original paper, which reported a similar "Monte Carlo" (MC) statistic but also a "Minimax" (MM) statistic with a different balance of operations between finding legal moves and exploring future states. As Monte Carlo methods are currently dominant, and as this form of testing explores a better balance of early- and late-game states, I've used only a Monte Carlo-type method for testing here.
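On the test-process side, the measurement loop might look roughly like the following. The GameEngine interface here is a hypothetical stand-in for whatever API a given engine exposes (loosely modeled on GGP-Base's StateMachine), not the framework's actual code.

    import java.util.Random;

    // Sketch of the timed Monte Carlo test: repeatedly play random joint moves
    // from the initial state to a terminal state, computing goal values at the
    // end, and count the work done within a fixed time window.
    public class MonteCarloTest {
        // Hypothetical minimal engine interface; real engines expose their own APIs.
        public interface GameEngine<State> {
            State getInitialState();
            boolean isTerminal(State state);
            State getRandomNextState(State state, Random random);
            int[] getGoals(State state); // goal values for all roles at a terminal state
        }

        public static <State> String run(GameEngine<State> engine, long testSeconds) {
            Random random = new Random();
            long stateChanges = 0;
            long rollouts = 0;
            long start = System.currentTimeMillis();
            long deadline = start + testSeconds * 1000L;

            while (System.currentTimeMillis() < deadline) {
                State state = engine.getInitialState();
                while (!engine.isTerminal(state)) {
                    state = engine.getRandomNextState(state, random);
                    stateChanges++;
                }
                engine.getGoals(state); // goal computation is part of the measured work
                rollouts++;
            }

            long millisTaken = System.currentTimeMillis() - start;
            // Report raw counts; analysis can derive state changes per second.
            return "stateChanges=" + stateChanges
                    + ",rollouts=" + rollouts
                    + ",millisTaken=" + millisTaken;
        }
    }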

The statistics collected include the number of state changes made, the number of rollouts made, and the length of time taken. The main statistic used on the analysis pages is the average number of state changes per second.

The games used are nearly all of the games available from ggp.org's game repositories, including the Base, Dresden, and Stanford repositories. Some games are blacklisted for violating parts of the GDL specification (for example, having unsafe rules, or having terminal states with more than one goal value for a role). Since these games may have their rules changed while their names remain the same, a hash is attached to the game ID to ensure we do not conflate two different versions.
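The exact hashing scheme is an implementation detail; as a sketch of the general idea, one could hash the rule sheet text and append it to the repository's game name. SHA-1 and the key format below are illustrative choices, not necessarily what gdl-perf uses.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    // Sketch of deriving a version-specific game key: hash the rule sheet so that
    // two versions of the same game name end up with different keys.
    public class GameKeys {
        public static String versionedKey(String repoGameName, String rulesText) throws Exception {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            byte[] hash = digest.digest(rulesText.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return repoGameName + "-" + hex;
        }
    }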

There are instructions for trying out the framework yourself in the repository's README.

Findings


I recently ran a set of tests with the current version of my framework on dedicated hardware. The results are here. This includes most of the engines tested in the GIGA'13 paper, including Fluxplayer and CadiaPlayer. There are some substantial differences in my findings, due to a mix of new engines, updates to old engines, and the significantly broader range of games.

I should note that it's also possible that I've misconfigured some engines or written their test code poorly. CadiaPlayer in particular required a bit of finagling to get it to build on my machine, and it currently experiences segmentation faults on many games. Anyone experienced with these players is free to update the code I used to test them (Fluxplayer, CadiaPlayer) or to install and run the players on their own machines.

Another caveat is that this currently does not test the accuracy of these engines' results; I'm experimenting with a separate approach to verifying correctness across engines, but I haven't done any serious testing with this method.

That said, here are the qualitative impressions from the results:
  • The most stable engine is the GGP-Base Prover, which had no errors on any game tested. This is at least in part due to changes made since 2013 to improve its correctness, in particular to make it handle complex types of recursion correctly. Its performance has also improved over that period, though not enough to make it competitive with the high-performance engines.
    • Most engines have trouble on at least a few games, and unsurprisingly, these games often overlap. Even when only checking for crashing and not for correctness bugs, this framework is quite useful for tracking down these bad cases.
  • The highest-performance game engine on nearly all games is Sancho's engine, based on propositional networks and written in Java.
    • This particular build was taken from the version used in the 2015 Tiltyard Open. A bug in propnet optimization has been found and fixed since then, so the current version may have fewer errors.
  • After Sancho, the next two fastest engines are from Alloy: one is based on propositional networks, and the other is a "compiled prover". Both are written in Java. The compiled prover generates additional Java code implementing the game rules, which is then compiled and run (a general sketch of this compile-and-load approach follows this list).
    • The compiled prover engine, in particular, may be suitable for open-sourcing as a ready-to-use component for other gamers.
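For readers unfamiliar with the compile-at-runtime approach, here is a generic Java sketch of generating, compiling, and loading code on the fly. This illustrates the general technique only; it is not Alloy's actual implementation, and the generated class here is a toy.

    import javax.tools.JavaCompiler;
    import javax.tools.ToolProvider;
    import java.net.URL;
    import java.net.URLClassLoader;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Generic sketch: write generated Java source to disk, compile it with the
    // JDK's built-in compiler (requires a JDK, not just a JRE), load the result,
    // and call into it reflectively.
    public class RuntimeCompilationSketch {
        public static void main(String[] args) throws Exception {
            Path dir = Files.createTempDirectory("generated");
            Path sourceFile = dir.resolve("GeneratedRules.java");
            Files.write(sourceFile, (
                    "public class GeneratedRules {\n" +
                    "  public static int initialCellCount() { return 9; }\n" +
                    "}\n").getBytes(StandardCharsets.UTF_8));

            JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
            if (compiler == null || compiler.run(null, null, null, sourceFile.toString()) != 0) {
                throw new IllegalStateException("Compilation failed");
            }

            try (URLClassLoader loader = new URLClassLoader(new URL[] {dir.toUri().toURL()})) {
                Class<?> generated = loader.loadClass("GeneratedRules");
                Object count = generated.getMethod("initialCellCount").invoke(null);
                System.out.println("initialCellCount() = " + count);
            }
        }
    }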
