Over the last few months I've been assigned to work on a project called Craton. The project will be an inventory system that integrates with most of our popular configuration management tools (Ansible, Chef, Puppet, Salt, etc.) and performs automated remediation. My team started Craton and moved it into OpenStack, which allows us to use Zuul, OpenStack's continuous integration service. After writing some functional tests and letting development continue, I realized that we were not running them in our CI.
Tests, Tests, Tests
Craton has a few hundred unit tests and fewer than one hundred functional tests. Before this week, the functional tests were not run by OpenStack's CI, so we never received automated feedback from them unless reviewers downloaded the patch and ran the tests themselves. While reviewing other people's patches, I found the absence of this feedback bothersome. This led me on a brief journey to enable them in Zuul.
The functional tests are separated from the normal test invocation because they add minutes to each run. To run our unit tests we invoke tox -e py35; to run both suites we instead call tox -e functional. These tests are slow because they use docker-py to build an image from our Dockerfile and create a new container for each test that runs. Building the image takes roughly 1-2 minutes, and starting the container adds approximately 15-20 seconds per test. Further, each of our functional tests talks over the container network, incurring a further time penalty. At the end of the test run we delete the built image, but we don't delete the intermediate cached images. This can lead to some different behaviour between different test environments.
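That per-test container flow can be sketched roughly as follows. This is a hedged illustration using the modern Docker SDK for Python (the successor to docker-py); the class name, image tag, and structure are mine, not Craton's actual test code:

```python
import unittest

try:
    import docker  # the Docker SDK for Python (historically "docker-py")
except ImportError:  # allow this module to import even without the SDK
    docker = None


@unittest.skipIf(docker is None, "Docker SDK for Python is not installed")
class ContainerBackedTestCase(unittest.TestCase):
    """Build the image once, then boot a fresh container per test."""

    @classmethod
    def setUpClass(cls):
        cls.client = docker.from_env()
        # Building from the Dockerfile is the expensive step (~1-2 minutes).
        # Recent SDK versions return an (image, build_logs) tuple.
        cls.image, _ = cls.client.images.build(path=".", tag="craton-functional")

    def setUp(self):
        # Each test pays ~15-20 seconds to start its own container.
        self.container = self.client.containers.run(self.image.id, detach=True)

    def tearDown(self):
        self.container.remove(force=True)

    @classmethod
    def tearDownClass(cls):
        # Delete the built image; intermediate cached layers are left behind.
        cls.client.images.remove(cls.image.id, force=True)
```

The guarded import keeps the module loadable on machines without the SDK, which is exactly the failure mode the rest of this post deals with: the code is fine, but docker itself has to be present and reachable.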
After ensuring that the functional tests passed (with and without cached intermediate images), I added a job to OpenStack CI which promptly began to fail. It turns out docker needs to be installed before you can use it. I began updating our test automation to do that.
Docker on Linux
If you've ever used docker on Linux, you might recognize the following dance:

1. Install docker (or docker.io, or whatever your package archive calls it).
2. Ensure that the docker group exists (sudo groupadd docker). If you're using a remotely modern distribution (Fedora 25, Ubuntu 16.04, etc.), simply installing the package does this.
3. Add your preferred user to the docker group (sudo usermod -aG docker $USER).
4. Completely log out as that user and log back in for the group membership to be recognized by id.
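Put together as a single script, the dance looks roughly like this (a sketch; the apt-get package name is an assumption and varies by distribution):

```shell
#!/bin/bash
set -e

# 1. Install docker (docker.io on Ubuntu; dnf install docker on Fedora, etc.)
sudo apt-get install -y docker.io

# 2. Ensure the docker group exists (-f succeeds if it already does).
sudo groupadd -f docker

# 3. Add the current user to the docker group.
sudo usermod -aG docker "$USER"

# 4. The catch: the new membership only applies to sessions started after
#    this point, so an interactive user must log out and back in before
#    `id` (and docker) will see the docker group.
```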
Now, that last step is the problem on OpenStack CI. There's no way for us to tell the CI user to log out of the box and back in. This led me to look on the internet for what other people have done to side-step this.
For people using a PTY interactively, the common advice is that running:

exec su -l $USER

would replace the current shell and update the group memberships in the new one. I tried this on OpenStack CI. As you may have guessed, however, the CI user does not have an interactive PTY, so this failed.
I then tried several variations on this theme, but none worked. We faced a problem: at the time we set up and enable docker, the CI user has sudo privileges, but by the time our tests run, those privileges have been (rightfully) revoked. Even if sudo weren't revoked (which we are allowed to configure), running one's tests as root is far from ideal (and I'd go so far as to say it's a bad idea).
Let's look at the problem from a higher level for a second. What our CI user needs is permission to talk to docker's API over the UNIX socket the daemon creates, which means the user and the socket must share a group. Everything on the internet about this talks about using the docker group. That's the current best practice, but it rests on assumptions that do not hold in our case. How does a non-root user who cannot join the docker group communicate with the socket that the docker daemon creates? What if the socket's permissions were changed to allow the CI user's primary group to access it?
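To make the requirement concrete, here is a small, hypothetical Python helper (not part of docker or our CI scripts) that mirrors the check the kernel effectively performs when a process opens the socket:

```python
import os
import stat


def can_use_socket(path):
    """Return True if the current user can read and write ``path``
    through its owner or group permission bits.

    This mirrors how /var/run/docker.sock gates access: the socket is
    typically owned by root with mode 0660, so a non-root user must
    share a group with it to talk to the daemon.
    """
    st = os.stat(path)
    mode = st.st_mode
    if st.st_uid == os.getuid():
        # We own the file: the owner bits decide.
        return bool(mode & stat.S_IRUSR) and bool(mode & stat.S_IWUSR)
    if st.st_gid == os.getgid() or st.st_gid in os.getgroups():
        # We share a group with the file: the group bits decide.
        return bool(mode & stat.S_IRGRP) and bool(mode & stat.S_IWGRP)
    return False
```

With the stock packaging (root:docker, mode 0660), this returns False for any user outside the docker group, which is exactly our CI user's situation.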
In our case, we're using Ubuntu 16.04 (Xenial), which includes systemd, and docker's packaging follows systemd conventions. I took a look at /lib/systemd/system/docker.service and noticed that it reads extra options for the daemon from /etc/defaults/docker. Further, the docker daemon allows you to specify the group it will use when creating the UNIX socket (e.g., docker daemon --group $GROUP). My next step was to create the /etc/defaults/docker file with the following contents via bash:
echo "DOCKER_OPTS=\"--group=$(id -gn)\"" | \
    sudo tee -a /etc/defaults/docker
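Assuming, purely for illustration, that the CI user's primary group were jenkins, the file written by that command would contain:

```shell
# /etc/defaults/docker, as written by the tee command above
# ("jenkins" is an assumed group name for illustration)
DOCKER_OPTS="--group=jenkins"
```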
We then restarted the docker service:
sudo systemctl daemon-reload
sudo systemctl restart docker
But still, our CI user couldn't communicate with the socket. I then tried stopping the service and starting the daemon manually in the script, in the hope that it would work:
sudo systemctl stop docker
docker daemon -G $(id -gn) -H fd://
But it then gave us a more informative error, conveying roughly that it:
- Needed to be started by systemd
- Required /lib/systemd/system/docker.socket
Having not looked at that file, I opened it in vim and noticed that it had its own declaration of what group to run as. The next (and last) test of our automation set-up was the following:
# tools/test-setup.sh
sudo dd of=/lib/systemd/system/docker.socket << _EOF_
[Unit]
Description=Docker Socket for the API
PartOf=docker.service

[Socket]
ListenStream=/var/run/docker.sock
SocketMode=0660
SocketUser=root
SocketGroup=$(id -gn)

[Install]
WantedBy=sockets.target
_EOF_

sudo systemctl daemon-reload
sudo systemctl restart docker
docker version
Now our CI user can access the socket, talk to the docker APIs, build the image, run the tests, and clean some of its mess up.
As you're probably aware, docker is not a security panacea. For example, using the docker group doesn't exactly protect the system from malicious users: membership in it grants root-equivalent access to the host. For many, however, that threat model doesn't apply.
We still need to answer "How safe is this workaround?".
In our very particular case, each time the CI user is doing this, they're on a VM in a cloud that will be either destroyed or completely reimaged after the tests are done. When we flip the permissions around, then, we're not affecting any other OpenStack CI user who might use docker after us.
With that in mind, however, I'd strongly advise against taking this route if you're trying to quickly get docker set up in a development environment. Upgrading docker may overwrite the docker.socket file you created. If it doesn't, then any changes to docker.socket that a newer daemon requires will never be applied, breaking the daemon. Either way, your system ends up broken.
In other words, this is a necessary hack for us in this one very specific instance. It may be necessary for other users with a similar CI set-up. But as a general rule, this whole thing is a very bad idea. I'm only documenting it here to save others a few hours if they happen to be in the same situation.
An old acquaintance emailed me after I published this, asking if I had tried using newgrp(1). I had never seen it before, and after reading its man page I'm skeptical that it would work. The description reads:
> newgrp changes the current real group ID to the named group
And some research turned up a Unix StackExchange answer explaining that, in a bash script (like the ones we use for our CI), this would require the rest of our test automation to execute inside newgrp's subshell.
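In script form, that constraint looks something like this sketch (the docker group name and the tox invocation are assumptions for illustration):

```shell
# Everything that needs the new group must run inside the shell that
# newgrp spawns, fed to it via a here-document; commands after the
# here-document run back in the original group.
newgrp docker << 'COMMANDS'
id -gn            # reports "docker" inside the subshell
tox -e functional
COMMANDS
```

Having to funnel the remainder of the automation through that here-document is exactly why this didn't appeal as a fix for our CI scripts.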
Footnotes

1. The name comes from geological terminology for the "old and stable part of the continental lithosphere". https://en.wikipedia.org/wiki/Craton
2. Yes, I enjoy using the word "Zuul" because I enjoy the Ghostbusters franchise. (And yes, I also enjoyed the 2016 Ghostbusters movie. It is excellent.)
3. We've found that running docker rmi $(docker images -q --filter 'dangling=true') works well for cleaning these up.