Data-Driven Unit Tests (or Finding Crazy Unicode Errors in Python)

This is a pattern I use when I want to test a lot of edge cases in a similar way:

import unittest
import sys

ENCODING = 'ascii'

def convert_to_string(arg):
    return arg.encode(ENCODING)

class ConvertToStringTest(unittest.TestCase):
    """
    Example of a test that applies the same logic
    to multiple cases.
    """

    def test_handles_unicode(self):

        # Check different unicode codepoints
        # to make sure we don't get an error
        for codepoint in [0, 1, 2, 127, 128, 129, 130,
                          sys.maxunicode - 2, sys.maxunicode - 1, sys.maxunicode]:

            try:
                convert_to_string(unichr(codepoint))

            # Catch only encoding errors so unrelated bugs still surface
            except UnicodeError:
                self.fail('Failed for codepoint {0}'.format(codepoint))

With ENCODING = 'ascii', the test fails at codepoint 128, the first value outside the ASCII range. It passes once you set ENCODING = 'utf-8'.

The for loop lets us re-use the test logic for different edge cases. If the test fails, we include the codepoint under test in the message so we can track down the error.
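One limitation of a loop that calls self.fail is that it stops at the first bad codepoint. As a rough sketch of an alternative, assuming Python 3.4 or later (where chr replaces unichr and unittest provides subTest), each codepoint can be reported as its own sub-case. The class name ConvertToStringSubTest and the run_suite helper are made up for this illustration, and ENCODING is set to 'utf-8' here so the sketch passes.

```python
import sys
import unittest

ENCODING = 'utf-8'  # assumption: the encoding the test is expected to pass with


def convert_to_string(arg):
    return arg.encode(ENCODING)


class ConvertToStringSubTest(unittest.TestCase):  # hypothetical name

    CODEPOINTS = [0, 1, 2, 127, 128, 129, 130,
                  sys.maxunicode - 2, sys.maxunicode - 1, sys.maxunicode]

    def test_handles_unicode(self):
        for codepoint in self.CODEPOINTS:
            # Each subTest is reported separately, labelled with its codepoint,
            # and a problem in one does not stop the remaining cases.
            with self.subTest(codepoint=codepoint):
                convert_to_string(chr(codepoint))


def run_suite():
    """Run the test case programmatically and return the TestResult."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(ConvertToStringSubTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling run_suite() returns a result object whose wasSuccessful() method says whether every codepoint passed. With ENCODING = 'ascii', I would expect one reported error per codepoint from 128 upward rather than a single failure.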

For numerical edge case boundaries (in this case, the codepoints 0, 128, and sys.maxunicode), I usually test the boundary plus or minus two, where mistakes are most likely to occur.
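If you find yourself typing these lists by hand a lot, the "boundary plus or minus two" rule can be sketched as a small helper. The function name around_boundaries and its parameters are hypothetical, not something from a library.

```python
import sys


def around_boundaries(boundaries, spread=2, lowest=0, highest=sys.maxunicode):
    """Return sorted, de-duplicated values within `spread` of each boundary,
    clipped to the valid range [lowest, highest]."""
    values = set()
    for boundary in boundaries:
        for offset in range(-spread, spread + 1):
            candidate = boundary + offset
            if lowest <= candidate <= highest:
                values.add(candidate)
    return sorted(values)


# The three boundaries discussed in the text.
codepoints = around_boundaries([0, 128, sys.maxunicode])
```

Clipping matters here because 0 and sys.maxunicode sit at the ends of the valid range, so only one side of each can be tested. For these three boundaries the helper produces the same codepoints as the hand-written list in the test, plus 126.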

There’s a temptation here to be really thorough and change the for loop to:

        for codepoint in range(0, sys.maxunicode + 1):

Since the new version iterates over about 17 * 2**16 values (roughly 1.1 million on a build with the full Unicode range), execution time will likely increase by at least 1 second. Most of those test cases aren’t likely to find additional errors, but sometimes they do.
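To get a feel for that cost, here is a rough sketch of an exhaustive sweep that also times itself. It assumes Python 3 (chr instead of unichr), and failing_codepoints is a name I made up for the illustration.

```python
import sys
import time


def failing_codepoints(encoding):
    """Return every codepoint whose single-character string fails to encode."""
    failures = []
    for codepoint in range(sys.maxunicode + 1):
        try:
            chr(codepoint).encode(encoding)
        except UnicodeError:
            failures.append(codepoint)
    return failures


start = time.perf_counter()
ascii_failures = failing_codepoints('ascii')
elapsed = time.perf_counter() - start

print('{0} codepoints checked in {1:.2f}s, {2} failures'.format(
    sys.maxunicode + 1, elapsed, len(ascii_failures)))
```

Collecting the failures instead of stopping at the first one makes the sweep more useful: you see the whole shape of the problem (for 'ascii', everything from 128 upward) rather than one codepoint at a time. The timing will vary by machine, so treat the printed number as a ballpark.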

For example, for certain codepoints (called surrogates), under certain operating systems (Linux), in certain configurations (when you include a single surrogate, not a pair), this code will raise an exception:

import json
 
# This works:
json.loads(json.dumps(unichr(55295)))
 
# This doesn't:
json.loads(json.dumps(unichr(55296)))

The exception it raises is: ValueError: Unpaired high surrogate

Unbelievable but true: json can create an encoding that it cannot load. (As of this writing, there’s an unresolved Python bug report about this, and it will hopefully get patched soon.)
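The same data-driven pattern works for checking the round trip itself. This is a minimal sketch, assuming Python 3 (so chr rather than unichr); json_round_trip_failures is a hypothetical helper name. It records every codepoint for which dumps followed by loads either raises or returns something different.

```python
import json


def json_round_trip_failures(codepoints):
    """Return (codepoint, reason) pairs for which dumps -> loads does not
    give back the original one-character string."""
    failures = []
    for codepoint in codepoints:
        original = chr(codepoint)
        try:
            restored = json.loads(json.dumps(original))
        except ValueError as error:
            failures.append((codepoint, str(error)))
            continue
        if restored != original:
            failures.append((codepoint, 'round trip changed the value'))
    return failures


# 0xD7FF (55295) is the last codepoint before the surrogate range,
# 0xD800 (55296) is the first high surrogate, and 0xDFFF is the last surrogate.
print(json_round_trip_failures([0xD7FF, 0xD800, 0xDBFF, 0xDC00, 0xDFFF, 0xE000]))
```

As far as I know, the json module in current Python 3 releases accepts lone surrogates, so I would expect no failures there. On an interpreter with the behavior described above, 55296 should show up in the list along with its error message.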

How to Teach a Robot

For the last few weeks, I’ve been planning my final project for CS 181: developing AI for a robot that finds and eats plants. There are poisonous plants and nutritious plants, but the robot has only noisy input from the world. It has to decide where to look and what to eat without knowing if its decisions are correct.

Designing such a robot is, in some surprising ways, similar to classroom teaching. For example, you can’t just feed the robot raw data from the world and expect it to learn anything — or, at least, expect it to learn quickly. You have to figure out a mapping from raw inputs to higher-level features, preferably features that make the pattern easy to learn. Or you have to provide the robot with a model, so that it has assumptions about what the world is like. But if you simplify the world too much, the robot might not perform well under “real-world” conditions, because it isn’t using all the available information or it’s making incorrect assumptions.

Very similar issues arise in teaching. Do you teach multiplication by having students memorize a table (a huge abstraction from the “real world”)? Or do you start with the “real world” — how many fingers does the group have altogether?

This is a huge debate in education. On the one hand, you have people like Steven Levy and Paul Skilton Sylvester who don’t simplify the world at all. Levy started the first day of school with a completely empty classroom: no books, no chairs, no pencils. The students then designed and built their classroom, which, incidentally, involved quite a lot of math. You have to be pretty good at measurement to build a desk. And Sylvester (whom I met in grad school) had his students design a town-within-a-classroom in which the students ran businesses, took out loans, and applied for jobs. The students interviewed local business people and even met the mayor of Philadelphia.

On the other hand, you have teachers like Gilbert Strang, whose Linear Algebra course I took online last summer. Professor Strang explains concepts so clearly that you can almost visualize how 9-dimensional vectors can be “close” together, or why a positive definite matrix multiplied by a vector on either side looks like a bowl, or how a space can “contain” a lower-dimensional subspace. But these are all abstractions — powerful, but far removed from my day-to-day experience.

In education, these decisions are contentious and value-laden. How you teach human beings says a lot about who you think they are and what you think they should do.

Fortunately, my robot has a very clear metric for success: survive as long as possible. Learn to avoid poison and find food. And look reasonably intelligent before the project is due next month!

Testing: Laser Versus Bug Spray

Recently, I’ve been rethinking some of the assumptions of Test-Driven Development. Don’t get me wrong: I believe in creating high-coverage unit tests to verify software correctness, and I love how this helps me write better code.

On the last project I completed, I had a test suite with 91% branch coverage, mostly achieved by unit tests. I also had integration tests and UI-level acceptance tests. And the system ran with high reliability: in the two months that I operated the site, I can remember one or two minor bug fixes in the actual code (other issues arose from configuration conflicts).

However.

When you write the test first, you have zero confidence in the code because it isn’t written. So any test, no matter how specific or simple, is going to give you useful information as you implement functionality. A test failed; now it passes. That’s evidence that you’ve done something right.

Once the software is written, though, it’s a different story. You know that the software is probably mostly working, because you can run it and manually verify the core functionality. Likewise, you can verify a small set of inputs using unit tests, but that mostly tells you what you already know. If a newly written unit test fails, it’s more likely that there’s a mistake in the test than in the program. The information gain is much smaller relative to the effort.

When you’re trying to kill bugs, and you don’t know where they are, you don’t use a laser. Lasers precisely target one small point. If a bug happens to be at that point, it’s dead. But the odds of scoring a direct hit are pretty low.

What you really want is bug spray: a way to target a large region at once, hopefully hitting as many bugs as possible.

If the laser beam is a unit test, then what is the bug spray?

The answer I’m exploring right now is: randomized inputs, over a large number of trials, with a strong oracle. Figure out a way to generate inputs that result in high coverage, then run a test over and over again. I’m not sure how effective this will turn out to be in practice, but I’m curious to find out. I will post an update when I have some results.
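Here is a minimal sketch of what that could look like, assuming Python 3 and using a deliberately simple oracle: encoding a string and decoding it again must give back the original. The names random_text, round_trip_oracle, and run_trials are all hypothetical, and a real project would need its own input generator and oracle.

```python
import random
import sys


def random_text(rng, max_length=20):
    """Build a random string, skipping surrogates (0xD800-0xDFFF), which
    Python 3 refuses to encode as UTF-8."""
    chars = []
    for _ in range(rng.randint(0, max_length)):
        codepoint = rng.randint(0, sys.maxunicode)
        if 0xD800 <= codepoint <= 0xDFFF:
            continue
        chars.append(chr(codepoint))
    return ''.join(chars)


def round_trip_oracle(text, encoding='utf-8'):
    """Oracle: decoding the encoded text must give back the original."""
    return text.encode(encoding).decode(encoding) == text


def run_trials(trials=10000, seed=0):
    """Return the inputs that violated the oracle (empty if none did)."""
    rng = random.Random(seed)  # fixed seed so a failure can be reproduced
    counterexamples = []
    for _ in range(trials):
        text = random_text(rng)
        if not round_trip_oracle(text):
            counterexamples.append(text)
    return counterexamples


counterexamples = run_trials()
print('{0} counterexamples found'.format(len(counterexamples)))
```

The fixed seed is the design choice I care most about here: random inputs are only useful if a failing case can be replayed. Returning the counterexamples, rather than a pass/fail flag, keeps the failing inputs around so they can be turned into ordinary unit tests later.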