cat /dev/brain

Unittests in github3.py

After mostly finishing github3.py[1], I started writing the unit tests for the library. I waited until the end because I knew I wanted to test directly against the API, and to do that I needed all the functionality in place that would return the proper objects for testing. The problem I knew I would face was the speed of my personal internet connection: ~1 MB/s down and ~400 KB/s up. My first approach was to run the tests individually, until I realized I also needed a way to test the calls that require authentication. It then became obvious that a better way to test the library was needed.

But why test directly against the API?

While I was working on developing the library, I saw seemingly random changes that were in fact leaks from development. One change, which broke everything one day, was a plan to append a parameter to the URL that automatically told the API the time of the last request. After I contacted them, they reverted the change but assured me it would arrive eventually. I wanted to ensure that changes like this would be caught, even though I would not be running the test suite every day. If I received an issue report from a user, I could run the test suite to get an idea of where the problem was introduced.

But testing against the API takes so long!

It does if you do it linearly (i.e., making each request one after the other), but if you run the tests asynchronously it moves a lot faster. The question then becomes: how do you make the tests issue their requests asynchronously? Well, this is the eas*ier* part. I was already requesting everything when the test classes were being constructed, so I just needed to know when they were being instantiated so I could hack on that process.

This is exactly what I did. I found that the tests were being instantiated in unittest.defaultTestLoader.loadTestsFromName(). So the first thing I did was look into the multiprocessing library to see how I could load the tests as quickly as possible. I did not want to launch each one individually, and I did not think starting all of them simultaneously would be a great idea, so it became obvious that a pool of workers would do the job well.

First Attempt

On my first attempt, I quickly inserted the following into unittests.py:

import unittest
import os
import re
from getpass import getpass
from multiprocessing import Pool

# ...

def load_tests(test):
    return unittest.defaultTestLoader.loadTestsFromName(test)

if __name__ == "__main__":
    # ...

    pool = Pool(5)

    names = os.listdir("tests")
    regex = re.compile(r"(?!_)\w+\.py$")
    join = '.'.join
    # Make a list of the names like 'tests.test_name'
    names = [join(['tests', f[:-3]]) for f in names if regex.match(f)]
    # Start running the pool
    result = pool.map_async(load_tests, names)
    # Wait for the results which will be of the form:
    #  [TestSuite(), TestSuite(), TestSuite()]
    suites = result.get(timeout=(10 * 60))
    # Now turn them into one TestSuite()
    suite = suites.pop(0)
    for s in suites:
        suite.addTests(s._tests)

    # Run the tests
    unittest.TextTestRunner(verbosity=1).run(suite)

Clearly, since this was my first attempt, it failed. It failed because I was trying to turn as many things as possible into objects without writing a ton of classes for them. To do this, I used type() to build classes from the dictionaries the API gave me. Instances of those classes, as it turns out, are not picklable: pickle stores an instance by referencing its class's import path, and a class built at runtime with type() has no module-level name for that lookup to find.
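
Here is a minimal sketch of the failure, with a hypothetical User class and made-up data standing in for the objects I was building (this is not github3.py's actual code):

import pickle

# Build a class on the fly from an API-style dictionary,
# roughly what I was doing with type().
data = {'login': 'octocat'}
obj = type('User', (object,), data)()

try:
    pickle.dumps(obj)
except pickle.PicklingError as err:
    # pickle cannot find a module-level 'User' to reference
    print(err)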

Second Attempt

The fix was easy: leave these as dictionaries, as I had done with similar attributes on other objects. Changing them led to some errors in the tests, but those were fixed quickly. The problem that tripped me up, however, was that when an exception is raised in a worker, the Pool catches it and re-raises it when you try to get the result. I had changed how the author attribute behaves on the Commit object: I assumed the dictionary would be populated, so I tried accessing the 'name' value. This led to the unit tests failing with an unhelpful traceback that gave me only: KeyError: 'name'. This baffled me until I realized where it had to be originating from.
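
A minimal sketch of that behavior, with a hypothetical fetch() standing in for the real test-loading code:

from multiprocessing import Pool

def fetch(name):
    commit = {}            # imagine the API omitted the author info
    return commit['name']  # KeyError is raised in the worker

if __name__ == '__main__':
    pool = Pool(2)
    result = pool.map_async(fetch, ['a', 'b'])
    # The KeyError surfaces only here, and the traceback points at
    # get() rather than at the line in fetch() that actually failed.
    result.get(timeout=60)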

After fixing that, the tests passed on Python 2.6, 2.7, and 3.2! Woohoo! All three runs together also took about 11 minutes on Travis, compared to roughly 40 minutes before the change; a clear improvement.

Final Attempt

But wait, everything was working fine, right? Right. My final attempt was to remove the load_tests() function and do:

result = pool.map_async(unittest.defaultTestLoader.loadTestsFromName, names)

This does not work. Why not? The Pool has to be able to pickle the function you're mapping with, and on Python 2 a bound method is not picklable. When you try this, you receive:

PicklingError: Can't pickle <type 'instancemethod'>: attribute lookup
__builtin__.instancemethod failed

This means that load_tests() stays, since a plain module-level function is pickled by its qualified name and is therefore picklable.
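
On Python 2, which is what produced the instancemethod error above, you can reproduce the difference without a Pool at all. This is a sketch, not part of the test suite:

import pickle
import unittest

def load_tests(name):
    return unittest.defaultTestLoader.loadTestsFromName(name)

# Fine: a module-level function is pickled by its qualified name.
pickle.dumps(load_tests)

# On Python 2 this raises the PicklingError shown above, because
# bound methods (instancemethod objects) cannot be pickled.
pickle.dumps(unittest.defaultTestLoader.loadTestsFromName)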

Some Last Details

  • pool = Pool(5)

    I used 5 workers for the multiprocessing.Pool object because the task at hand is network-bound, not processor intensive. If I were doing something crazy on the processor, I would size the pool by the number of cores on the machine or 2, whichever is greater (there is a short sketch of this after the list).

  • Why not use nose2? They have the mp extension

    This would only have benefited me if the requests for objects lived in the setUp() methods of the classes, and even then the tests would still be horribly slow because setUp() would re-request those objects for every single test. By requesting them once:

    • I allow the tests to be run more than once in an hour (by keeping the number of requests I make to the API low, I do not hit the rate limit)
    • I keep the bandwidth usage down for people with caps

    Also, by sticking to the standard library, I do not need to introduce another dependency, even if just for testing. Having learned to program in C, I'm a fan of creating as few dependencies as possible.

  • Are there better ways of doing this?

    Probably, but I have not found them yet. It seems my problem was fairly unique (which I find a little hard to believe) so if someone else has found a better way, Google did not find it ;).
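
Here is the sketch of the pool-sizing rule mentioned in the first bullet. It is a rough rule of thumb, not a hard requirement; multiprocessing.cpu_count() reports the number of cores:

from multiprocessing import Pool, cpu_count

if __name__ == '__main__':
    # Network-bound work: a small fixed pool is plenty.
    io_pool = Pool(5)

    # Processor-bound work: scale with the machine, never below two.
    cpu_pool = Pool(max(cpu_count(), 2))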

[1] I still have to distinguish between stargazers and subscribers, and introduce pagination.