Data-Based Tests

May 07, 2018

There are different testing methods and one of the most popular is unit testing. Usually, to test the source code you write more code that runs the system under test and makes sure it behaves as expected. The data to test your code is defined in your tests. But what if you flip things around and make data define your tests instead of tests defining data? To understand the idea better, let’s look at some examples.

One of the simplest regression test suites is a set of input files and a set of expected output files. You run the system under test on each input and check that the actual output matches the expected output. These actions can be executed in a short shell script. For a concrete example, see a small bc test suite and specifically the test runner.
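To make the idea concrete without relying on the linked runner, here is a minimal sketch of the same loop in Python. The file layout is an assumption made for illustration: every `name.in` file in a directory has a matching `name.expected` file, and the tool reads its input from stdin. The function name `run_suite` is hypothetical too.

```python
import subprocess
from pathlib import Path


def run_suite(command, test_dir):
    """Feed every *.in file to `command` and compare stdout with *.expected.

    Returns the names of the input files whose output did not match.
    The .in/.expected naming scheme is an assumption of this sketch.
    """
    failures = []
    for input_path in sorted(Path(test_dir).glob("*.in")):
        expected_path = input_path.with_suffix(".expected")
        with input_path.open("rb") as stdin:
            result = subprocess.run(
                command, stdin=stdin, stdout=subprocess.PIPE, check=False
            )
        # Byte-for-byte comparison, the same verdict a plain diff would give.
        if result.stdout != expected_path.read_bytes():
            failures.append(input_path.name)
    return failures
```

Calling `run_suite(["bc"], "tests")` would then run bc over every input in a `tests` directory and return the list of failing files. Adding a new test means adding two files, with no new test code.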

A shell script might not be sufficient for all situations. You can hit its limits when:

Output data is complex and a plain diff comparison is too strict.
Output data is complex and you want more details about a mismatch, not just a binary test pass/fail.
You want to process different input files in different ways.
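The first two limits can be illustrated with a small sketch. It assumes a hypothetical tool whose output is JSON: instead of diffing text, the test parses both sides and reports where they differ. The helper name `find_mismatches` and the `$.key[index]` path notation are choices made for this example only.

```python
import json


def find_mismatches(expected, actual, path="$"):
    """Recursively compare two JSON-like values.

    Returns a list of human-readable descriptions, one per mismatch.
    An empty list means the values are equal.
    """
    if isinstance(expected, dict) and isinstance(actual, dict):
        mismatches = []
        for key in sorted(expected.keys() | actual.keys()):
            child = f"{path}.{key}"
            if key not in actual:
                mismatches.append(f"{child}: missing in actual output")
            elif key not in expected:
                mismatches.append(f"{child}: unexpected key in actual output")
            else:
                mismatches.extend(
                    find_mismatches(expected[key], actual[key], child)
                )
        return mismatches
    if isinstance(expected, list) and isinstance(actual, list):
        mismatches = []
        if len(expected) != len(actual):
            mismatches.append(
                f"{path}: expected {len(expected)} items, got {len(actual)}"
            )
        for index, (exp_item, act_item) in enumerate(zip(expected, actual)):
            mismatches.extend(
                find_mismatches(exp_item, act_item, f"{path}[{index}]")
            )
        return mismatches
    if expected != actual:
        return [f"{path}: expected {expected!r}, got {actual!r}"]
    return []


# Key order and whitespace differ, but only the real difference is reported.
expected = json.loads('{"result": 4, "steps": [1, 2]}')
actual = json.loads('{ "steps": [1, 3], "result": 4 }')
print(find_mismatches(expected, actual))
# prints ['$.steps[1]: expected 2, got 3']
```

A text diff would flag the whole line here because of the reordered keys. The structural comparison ignores formatting and points at the one value that changed.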

In this case I would recommend using more advanced scripting. Here is an example of the same test suite for bc but written in Python. The only fancy thing it does is read the expected output from the input file itself. In a similar way, you can define a run line and execute some tests with the --mathlib flag, for instance. Also, instead of comparing strings you can parse the output as JSON and compare the resulting objects.
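Here is a rough sketch of how expectations and a run line could live inside the test file. The `# RUN:` and `# EXPECT:` directives are invented for this example and are not necessarily the format the linked test suite uses. Directive lines are stripped before the input reaches the tool, so the sketch does not depend on the tool's comment syntax.

```python
import shlex
import subprocess

RUN_PREFIX = "# RUN:"        # hypothetical directive: command line to execute
EXPECT_PREFIX = "# EXPECT:"  # hypothetical directive: one expected output line


def parse_test_file(text, default_command=("bc",)):
    """Split test file contents into (command, program_input, expected_output)."""
    command = list(default_command)
    input_lines, expected_lines = [], []
    for line in text.splitlines():
        if line.startswith(RUN_PREFIX):
            command = shlex.split(line[len(RUN_PREFIX):])
        elif line.startswith(EXPECT_PREFIX):
            expected_lines.append(line[len(EXPECT_PREFIX):].strip())
        else:
            input_lines.append(line)
    program_input = "".join(line + "\n" for line in input_lines)
    expected_output = "".join(line + "\n" for line in expected_lines)
    return command, program_input, expected_output


def run_test(text):
    """Run one test; return True when stdout equals the embedded expectation."""
    command, program_input, expected_output = parse_test_file(text)
    result = subprocess.run(
        command, input=program_input, capture_output=True, text=True, check=False
    )
    return result.stdout == expected_output
```

With this layout, a test that needs the math library only has to start with `# RUN: bc -l`, while tests without a run line fall back to the default command. Each file carries everything needed to run and check it.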

This kind of test isn’t suitable for all cases. It works best for command-line tools. In my experience it is also used for small-scale integration testing, so all the upsides and downsides of integration testing apply here too. Another downside is that data has no built-in composability and can be harder to reuse than tests written in an actual programming language.

In the real world, data-based tests are used at least in include-what-you-use, where I first encountered them, and in LLVM. LLVM has more detailed documentation about its testing infrastructure and about lit, the tool used for integration testing. Mercurial’s test format uses the same ideas and can be better for testing interactive command-line tools. This post is a good introduction.