To Virtualenv or not to Virtualenv for Docker? This is the question.

Jarek Potiuk
Jan 17, 2022

EDIT: Updated 27th of January 2022 to clarify my statement: virtualenv is an antipattern for containers in cases where an optimized image size, and the multi-stage builds used to achieve it, are important, as well as when you need to create dynamic virtualenvs in the image. In other cases it might not be an antipattern.

This is the result of a long discussion we had in https://github.com/pypa/pip/issues/10556.

I’ve recently been involved in a discussion with pip maintainers about using virtualenv for Docker/Container building. In short (TL;DR): I think the approach promoted by pip, “use virtualenv” (and apparently heavily supported by some pip maintainers), does not work well for the case where Docker containers are involved.

In my opinion, the pip approach of promoting virtualenv as the only way of dealing with installs (also when building containers), in pip 20.3.1 at the time of this writing, is rather opinionated. It likely comes from outdated information about the needs, capabilities and optimisation goals of pip users who want to build containers. Containers nowadays are a modern building block of application deployments, both in the cloud and on-premise. Not recognizing that, and not treating the use cases connected with it seriously, is a rather bad idea.

In this post, I want to dissect why the approach is, in my opinion, rather bad for an important set of use cases (related to building containers containing Python) and what should be done to fix it. I believe that presenting virtualenv as the “recommended solution” in all cases, including all cases of container building, is a decision that pip maintainers should rethink.

The beginning of the story

It starts with the issue “Disable warning from pip install when executed as ‘root’ user”. The issue was raised and discussed by a number of people, and I joined it because I found a warning message that pip printed really annoying (and saw that others had a problem with it too). The message is (as of pip 20.3.1):

Running pip as the ‘root’ user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

It’s printed every time you try to install packages as the root user without activating a virtualenv first. The message, as has been confirmed by one of the pip maintainers, is targeted only at command line users, but in my opinion it’s very misleading and disruptive for users who want to use pip to build container images.
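The condition under which the message appears can be sketched in a few lines of Python. This is only an illustration of the “root and no virtualenv” combination, not pip’s actual implementation; the function name `root_without_venv` is made up for this post, and comparing `sys.prefix` with `sys.base_prefix` is simply the common way to detect an active venv.

```python
import os
import sys


def root_without_venv(euid: int, prefix: str, base_prefix: str) -> bool:
    """Return True when both conditions hold: root user AND no active virtualenv.

    Hypothetical helper for illustration; pip's real check may differ.
    """
    is_root = euid == 0
    # Inside a venv, sys.prefix points at the venv while sys.base_prefix
    # still points at the interpreter the venv was created from.
    in_venv = prefix != base_prefix
    return is_root and not in_venv


if __name__ == "__main__":
    # os.geteuid() is only available on Unix-like systems.
    euid = os.geteuid() if hasattr(os, "geteuid") else -1
    print(root_without_venv(euid, sys.prefix, sys.base_prefix))
```

Run as root in a plain container image this would print True; run as root inside an activated venv, or as any non-root user, it would print False.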

Let’s look at the message in detail. I always think about warning messages as a way to communicate with my users. The reason I print a warning is always that I want my user to take some action to remove the warning. Ultimately this is the only reason why warnings exist at all. In my opinion, good warning messages should be:

  • actionable: they should provide users with an action they can take to remove the warning.
  • factual: the recommendations/proposals included in the warning should not be misleading for various kinds of users (especially if there are significant groups of users who are misled).
  • helpful: rather than intimidating users (and essentially telling them “you are wrong, go figure out why”), they should be kind and helpful. Users, especially when things change, should be gently guided into improving their behaviour rather than shouted at for doing it wrong.
  • provide context: when the explanation of the warning cannot fit in one or a few lines, they should provide a helpful link to a place where the context, and a more elaborate discussion of why the warning is important, is explained.
  • silenceable: when the warning might be a “false positive”, there should be a way to disable it. This is especially important for “power users” who might be able to assess the situation and decide on their own whether the warning is applicable.

How does this warning stack up against these criteria for the class of users I call “Container developers”?

  • actionable and helpful: while command line users get reasonable advice, the advice is not really actionable for users who build containers. Using virtualenv is rather alien to users who build containers (reasons explained below), and the link to the general ‘venv’ documentation does not explain what action should be taken to fix it when you are building containers. It’s also intimidating, because you do not know what you’ve done wrong or why.
  • factual: the message is rather confusing, and I am not sure how factual it is because it is somewhat ambiguous. The first part of the message (the diagnosis) talks about “using root”, while the second part explains that “using virtualenv is a recommended solution”. The message suggests that the problem is using root, while really it is about “using root without virtualenv” (you only get the message when both conditions are met).
  • provide context: this message provides no context whatsoever (also for regular “command line users”). The discussion led to many (sometimes angry and impatient) responses of the form “there are those and those discussions, and those and those PEPs that will improve the situation”, explaining that distribution maintainers are the main reason for this warning, and it looked like pip maintainers expected their users to know all that. But the message and links do not contain any of that. The user is left in the dark, and getting into the discussion leads to very intimidating responses of the form “you know nothing, Jon Snow, go learn”, with an attitude as if the user asking the question was guilty of asking it in the first place. At the end of the discussion it turns out that the problem is “subtle bugs” introduced when distro-provided Python packages clash with those installed with pip.
  • silenceable: this is the biggest problem. This warning cannot be disabled. Full stop. It’s there. Always. No matter if your case is different and you do not care about the subtle bugs. The maintainers of pip effectively decided that the problem is so important that it should never be possible to disable this warning. It almost looks like they were convinced that using root without virtualenv is wrong in absolutely all cases. This sounds strange when it is about “subtle bugs” and specific clashes, especially since there are cases where it should not matter.

Why is Container building different?

Maybe surprisingly, after reading the discussions and PEP proposals and even trying to propose some adjustments, I do agree with the assessment for command line users who manage their environment via command line pip. You should definitely avoid using root and installing your packages without virtualenv. I, for one, always use virtualenv to install my packages.

There are some great tools like pipx that even automatically create a virtualenv for your applications, and for development too, the first thing I do is create a virtualenv for my case.

But …..

Building containers is different. Why? Many of the differences below come from best practices of building containers.

1. Containers are built using non-interactive tools. Some other packaging tools already recognize this and provide two variants of their tools; apt and apt-get are a good example. The first is really targeted at interactive command-line users, whereas apt-get is the tool that should be used in automated scripts. Using apt-get is the recommended approach for building containers:

The `apt` command is meant to be pleasant for end users and does not need
to be backward compatible like apt-get(8).

pip serves both cases. Limiting pip to serve only “interactive” users, without catering for non-interactive use, heavily limits the audience of pip and the use cases it serves.

2. Containers are mostly meant to be immutable. The best practice is to build containers to be ‘ephemeral’, in the sense that they should be created and destroyed at will whenever needed. This means that the image should provide everything needed to start a container. Once you create the container, you are not supposed to install any more packages. You write a series of instructions (in a Dockerfile, for example) and they prepare your environment. After that you do not install packages at the “global” level any more. This means that the “Container developer” can build the right set of instructions without inducing the “subtle bugs”. The “subtle bugs” only happen when you mix “distro packages” and “pip packages”, and usually the author of a container chooses one approach or the other but never mixes them.

3. Containers are supposed to be executable by multiple, arbitrary users, unlike a “command line” virtualenv (which belongs to a single user). Containers are often executed with an arbitrary (seemingly random) user id. This was introduced by OpenShift and became a de-facto standard for running containers and maintaining isolation and security between different containers. Virtualenvs are meant to be “single-user” only: they are not relocatable, and they do not provide guarantees that multiple different users will be able to use the same virtualenv. Installing packages in the “global” space as the root user is a way to make sure that the installed Python apps behave the same way for arbitrary users.

4. Power users use multi-stage builds to build their Python images. One of the most important optimisation goals when building containers is to optimize the resulting image for size. Many Python packages require compilation steps at installation time, which requires build-essential to be installed (essentially, various compilers). Some popular extensions require even more sophisticated and less common toolchains such as Rust (for example the cryptography package). And with upcoming Python versions and OS versions, many packages require compilation because they do not have an appropriate pre-compiled version available on PyPI. The build-essential package, and often many “dev” packages containing sources and headers, are only needed to build the libraries; they are not needed at runtime. This results in the best practice where multi-stage builds are used: the development stage builds the packages, and the prod stage only takes the compiled packages and libraries and has neither build-essential nor dev libraries. This results in much smaller images. One simple example: in Apache Airflow the approach might result in a >200MB smaller image (>20%) or even more, depending on the optional dependencies used.

Those multi-stage builds, by definition, do not have the “subtle bugs” problems mentioned above. The “development” images where pip install commands are executed cannot use distro-provided packages, because the whole point of those steps is to provide a folder with a “comprehensive” installation that can be copied to the “runtime” image. This is best done without creating a new user. The root user is perfectly OK for preparing such an environment, especially with the --user flag, which guarantees that all necessary compiled libraries and packages will be prepared in the ${HOME}/.local folder, and this folder can simply be copied to the runtime image. In this case, container developers optimise for size, not for the possibility of “subtle bugs”, which cannot occur here.
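To see which folder a --user install actually ends up in, you can ask the interpreter itself. This is a minimal sketch using the standard library site module; on a typical Linux image the user base is ${HOME}/.local unless the PYTHONUSERBASE environment variable overrides it. The helper name `user_install_dirs` is my own and purely illustrative.

```python
import site
from typing import Tuple


def user_install_dirs() -> Tuple[str, str]:
    """Return (user base, user site-packages) as used by `pip install --user`."""
    base = site.getuserbase()
    packages = site.getusersitepackages()
    return base, packages


if __name__ == "__main__":
    base, packages = user_install_dirs()
    # The user base is the folder you would copy to the runtime image.
    print("user base:         ", base)
    print("user site-packages:", packages)
```

Running this in the build stage is a quick way to confirm the path before copying it into the runtime stage.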

5. Non-power users do not care about virtualenv or subtle bugs. On the other hand, non-power users want to do stuff as quickly and painlessly as they can. They don’t care about multi-stage builds; they just want to install a few packages and be done with it. This is an example Dockerfile:

Simple Dockerized app

Unfortunately venv does not have any tools that make it easy and obvious how to use and activate the virtualenv inside a Dockerfile, and this is surprisingly not straightforward. The user needs not only to know virtualenv, but also to decide where and how to create the venv, and to know that activating the venv requires a few subtle steps.

Changing the example application to use virtualenv leads to this Dockerfile:

Same simple app with virtualenv

The first, non-venv image is 52 MB; the second is 63 MB. Just adding a virtualenv made the Dockerfile 2x as long and the resulting image ~20% bigger.

This is not something Container developers are after, especially since it brings them no benefits. Their images are immutable and contain just one application of their own plus some dependencies. Increasing the size of the image by 20% just to address a “subtle bugs” problem that does not apply is a huge loss, not to mention the environmental impact of the disk space used and the traffic generated when pulling/pushing the image.
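The “~20%” figure follows directly from the two sizes quoted above. As a quick check (the helper name is just for illustration):

```python
def percent_increase(before_mb: float, after_mb: float) -> float:
    """Relative growth of `after_mb` over `before_mb`, in percent."""
    return (after_mb - before_mb) / before_mb * 100


if __name__ == "__main__":
    # Sizes quoted in the text: 52 MB without venv, 63 MB with venv.
    print(f"{percent_increase(52, 63):.1f}%")
```

That is 11 MB on top of 52 MB, or roughly 21%.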

6. Virtualenv limitation for dynamic virtualenv creation

There is a class of complex applications (Apache Airflow is an example) that, despite their mostly immutable images, need to create virtualenvs containing the original application plus some packages added or reinstalled. In the case of Apache Airflow, we have PythonVirtualenvOperator that does this. Users can (dynamically) create a virtualenv that by default contains the basic Airflow installation (same as in the image) plus some extra dependencies, updated or replaced as quickly as possible. Installing the whole of Airflow might take minutes, and it might include a number of optional extras, so reinstalling everything from scratch is not an option (the user would have to know the exact extras used when installing Airflow and the exact versions of all dependencies used). Instead, creating a virtualenv with the --system-site-packages flag is a great solution when Airflow is installed not in a virtualenv but as a system site package. This makes it possible to quickly “duplicate” the existing installation and create a new virtualenv with only some dependencies changed:

Creating and using venv dynamically
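A minimal sketch of the idea using the standard library venv module. This is not the PythonVirtualenvOperator code; the function name and the target directory are placeholders of mine. Passing `system_site_packages=True` is the programmatic equivalent of the --system-site-packages flag: the new environment sees everything already installed globally, and only the packages you then install into it are added or overridden.

```python
import tempfile
import venv
from pathlib import Path


def create_overlay_venv(env_dir: str) -> Path:
    """Create a venv that can see the system site-packages.

    Hypothetical helper; returns the path to the generated pyvenv.cfg.
    """
    builder = venv.EnvBuilder(
        system_site_packages=True,  # same as `python -m venv --system-site-packages`
        with_pip=False,  # keeps the sketch fast; use True to pip-install extras
    )
    builder.create(env_dir)
    return Path(env_dir) / "pyvenv.cfg"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cfg = create_overlay_venv(str(Path(tmp) / "overlay"))
        # The config records that system site-packages are visible.
        print(cfg.read_text())
```

Creating such an environment takes a fraction of the time a full reinstall would, which is the whole point when the base installation is as large as Airflow.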

This is not easy (and not possible with standard venv tools) when Airflow is installed in a virtualenv to begin with. Venv (which is now the recommended, standard-library built-in way of creating a virtual environment) does not have a feature to clone a virtualenv. Even its predecessor, virtualenv, which you can still use, had only an “experimental” --relocatable flag that was disabled in virtualenv 20.0. Some 3rd-party packages that can clone a non-relocatable virtualenv exist, but they do not work perfectly (and have some edge cases). I made an actual attempt to use them and have Airflow installed in a virtual environment and, despite help from some pip maintainers, I was not able to do it to a satisfactory level and abandoned it.

Is virtualenv an antipattern for container images?

Looking at all the above points, and having experience with both approaches (and even having attempted to convert Airflow to use virtualenv), I truly believe virtualenv is an anti-pattern for container building in a number of cases. Not always, but there are valid and important cases where it is. Virtualenv is an antipattern especially in cases where you care about the size of the images produced and multi-stage builds are used to achieve this optimisation, and also when you want to create dynamic virtualenvs in the image. There are, of course, cases where the size of the image or dynamic virtualenv creation is not important; then, by all means, a virtualenv in the image might be a good choice.

Some of the comments I received prior to writing this article directed me to hynek.me/articles/virtualenv-lives, which claims quite the opposite. Of course I read that article thoroughly before writing mine, and I think that at the time it was written (in 2014, updated in 2015) the author had a lot of good arguments and reasons to believe this approach was right.

However, you must realise that there were many things the author could not take into account because they did not exist yet. The world of containers has changed dramatically since 2014/2015, and many of the changes made those arguments much less appealing. They are:

  • the introduction of multi-stage builds (in 2017) and of best practices for building optimized images
  • the introduction and popularization of arbitrary user ids as a best practice by OpenShift, in OpenShift 4.1 in 2019
  • the abandoning of the --relocatable flag in virtualenv 20.0 (in 2020)
  • containers have, since 2015, become the basic building blocks of modern cloud and on-premise application deployment. While in 2014/2015 they could be treated as an afterthought, they are absolutely mainstream now, and not taking the use cases for building containers seriously is simply not a good idea.

What can be done about it?

I think the solution is rather simple, and it was already proposed by many users of pip in the issue “Disable warning from pip install when executed as ‘root’ user”:

  • allow disabling the warning (for power users)
  • make the message more factual and helpful (and provide more context by linking to a paragraph of documentation), and do not mention virtualenv as the only solution
  • or maybe even provide a different message if pip is used in a non-interactive fashion, such as while building containers
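Detecting non-interactive use is not hard in principle. As a purely hypothetical sketch (I am not claiming pip works this way or should do exactly this, and the function name is my own), a tool can check whether its output stream is attached to a terminal; during a container build it normally is not:

```python
import io
import sys
from typing import TextIO


def is_interactive(stream: TextIO = sys.stderr) -> bool:
    """Return True if `stream` is attached to a terminal (TTY)."""
    return stream.isatty()


if __name__ == "__main__":
    # Result depends on how the script is launched (terminal vs. pipe/build step).
    print(is_interactive())
    # An in-memory stream is never a TTY, so this prints False.
    print(is_interactive(io.StringIO()))
```

A tool could use a check like this to choose between a message aimed at command line users and one aimed at automated builds.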
