Limiting time and memory consumption of a program in Linux
Updated since the previous post
Limiting resource consumption of running programs is an important aspect of OS functionality. It keeps servers up even if a program bug causes it to consume the whole memory (usually it's implemented in an OS kernel--see OOM killer). Time is also a resource, and could also be limited--to prevent a badly written program from entering an infinite loop (hence making your server unavailable to users).
However, default memory and time limiting utilities in Linux suffer from inflexibility. Recently, I had to solve the problem of limiting resource consumption of a black box process in our project. In this post I explain why obvious ways don't work, describe the solution I found, and outline the ways it could be improved.
For those impatient developers, here's the link to the script that limits time and memory. It's implemented in Perl, weights 14 kilobytes, and has several additional features, such as hangup detection or collecting resource usage statistics. You may also browse the timeout GitHub project. The sources are licensed under Apache-2.0.
The others, more patient, are welcome to read the rest of this article.
Why do we need to enforce limits?
Most programs we are used to dealing with are well-written. Several years of development, stable versions, bug tracker with a lot of problems fixed — this describes the majority of the programs we work with. Good software is usually developed to do its job in a minimal and predictable time. And if we deal with persistent programs, which are designed to stay in memory, then its memory consumption is usually watched for, and the bugs about violation of the pre-specified limits are fixed... well, at lest, such issues are considered as defects :-)
However, in certain domains, some programs are not like that. They're immature, support only some narrow use-cases (while claiming that they're more generic tools), and contain a lot of bugs. The time for simple usage is more or less predictable, but no one can tell if it won't suddenly take much more time to finish a particular case. Yes, I'm talking about "research-quality software".
These programs usually attempt to solve NP-complete problems with heuristics (and we know it's a good thing). And for many such heuristics, neither time nor memory consumption of an implementation have known theoretical limits, let alone practical ones. Usually such algorithms work well for most cases, and in these cases the limits are reasonable. However, for certain cases, which can't be easily distinguished, they can easily go off of these limits.
This doesn't seem to be a big problem if your aim is just to do some research and throw your code away (it's sometimes even referred to as "throwaway code"). However, if you're trying to ship something of production quality, you can't accept such immaturity. If the program works with a particular case for too long, it may block providing service to the other customers. And if it consumes all the memory available, it may throw the whole machine to the swap hole until reaped by OOM killer.
How the solution would look like
To solve the problems listed above, we need to run programs under control, and enforce time and memory limits, terminating the program upon their violation. These limits should be set manually (in order to enable tweaking, and allow setting memory limit lower than the available maximum). The most obvious command-line interface would be:
limit -t 600s -m 1000000k command arguments ...
The commans is executed with arguments, and is constantly monitored. As soon as one of the limits is violated, a relevant message is printed to STDERR, and the program is terminated.
The biggest problem with setting these limits is that the processes we run may be "black boxes". I.e. we should be able to limit the processes, to source code of which we don't have access. We can't install signal handlers to them, or affect their code in any other way.
Another hard problem is that these processes spawn children, which do considerable fraction of the overall work. Some of these children are mature and well-written programs, while the others are the very same "research soft", which can eat all the memory with as great appetite as its parents have.
This means that we should count the time and memory consumed by all processes spawned by the one we watch for.
"But what's the problem?" you might ask, "Linux provides enough standard utils to solve this problem!" However, not all of them fulfill all the requirements to a good limiting tool. I'll enumerate them after the explanation of how the time limiting script is implemented.
The following solution to the problem is a result of several months of experimenting, which involved studying of a lot of manuals, and several "Eureka!" moments.
It looks fairly dumb. Just poll each process in the tree the process spawned for time and memory consumption several times per second, and kill them on violation. However, it was no simple task to make this actually work correctly.
Tracking memory consumption is very simple. At each moment of time, every process contains a valid information about its virtual memory. The total memory is no greater than the sum of current virtual memory values for all the controlled processes in the tree. If the sum of memory stamp sizes exceeds the pre-specified limit, just terminate the controlled processes in the tree, and print the relevant message.
However, it's not that simple for time consumption. Time consumption accounts for the history of the process tree evolution, as well as for the snapshot in a specific moment of time. By examining the CPU time for each process in the tree, we can learn for how long they had been running, but we don't know how much time was spent there in the processes that had already finished. The total time is greater than the sum of such snapshots. Even if we manually track which processes die, and store the sum of their times, we miss very short processes that are born and finished between two out polls (see Figure 1).
I found out that Linux kernel calculates, for each process, the cumulative time of its reaped children (like in man 2 times). If a process has been terminated then its total runtime is accounted in the cumulative children time of its parent; if the parent is also finished, then the time's added to its grandparent, et cetera. This way, the total CPU time of the already finished children of the processes we watch for is just a sum of their cumulative user times.
Thus, we may learn the CPU time of all the processes currently running, and of all the processes that have already finished. All we need is to sum these two metrics, and to compare the result with the specified time limit. If the limit is violated, the message is printed, and the controlled processes are terminated.
We may fetch these values accurately fetched via /proc even for the black-box processes.
Implementation — the timeout script
Of course, the bare text isn't very inspiring. It just has to be accompanied with a script. And here it is, 14kb of Perl code. The sources are licensed under Apache-2.0. They were a part of the project we develop at work, but we decided to fork this particular script as a separate project.
Being seemingly stupid, that's the only thing that works. At first glance, any person, who knows something about signals, children, wait calls and all this crunching, will say that the solution is quite dumb and inefficient. However, I couldn't find anything better, and it will be much appreciated if you suggest any improvements. Here's the list of the things I tried
What doesn't work
Ulimit (the console tool, not the system call) is the first thing a Linux user would think about. You launch it in the shell, set "soft" and "hard" limits for each resource, and children, which are invoked in this shell, are limited in memory and time.
Its limits are imposed like this:
- Time limit is set by setrlimit system call. When the soft limit is met, a SIGCPU is sent (which kills by default). This signal can be handled by a process; for example, it can decide to limit its own resource consumption on receiving it. On hard limit violation, a simple SIGKILL does its dirty job without any possible impediments.
- Memory limit is set via the same system call. Memory limit is enforced by returning ENOMEM from a violating malloc() syscall. This usually leads to an unhandled exception of sorts, and the program (or its interpreter) is forced to terminate.
- setrlimit() call only limits the process, in which it's invoked. Children of the process inherit the limit, but their accounting "starts over," whereas we are interested here in limiting sum of child and parent resource consumption.
- You can't determine if the reason to terminate the
process was a limit violation. When a memory limit is violated, we should print
a message about that, but for that we should know all possible outcomes of a no-memory exception. Shell, Perl, Ruby, Python, C and other languages print message about lack of free memory in different, non-portable ways, and we can't possibly enumerate them all.
However, setrlimit notifies the process when it exhausts its CPU time limit by sending SIGCPU. This, theoretically, can be handled, if we have the sources of a controlled process available, which we don't have—all processes are "black boxes". We also can't create a wrapper around the program, and catch this signal in a wrapper: the wrapper's time would be limited then, not the target's.
Sleep && kill
Another "easiest" solution to track time consumption is to sleep for a time equal to the limit and kill the controlled process afterwards. This only limits real time, while it's more interesting to limit CPU+sys time.
This may sound an overkill, but believe me, you will have to measure CPU time. Hardware develops in such a way that a performance boost may only be achieved by employing parallelization. And parallelization involves competition for a shared CPU between multiple threads and programs, and when that happens, the "astronomical" time elapsed becomes much greater than the CPU time, the time the process had actually been running for.
A special wait-like call, wait3, can harvest information about a deceased child. It can return CPU time spent by the child. This would present precise information about the runtime of deceased processes.
However, we can't reap grandchildren, and such information should be harvested by each process in the hierarchy then. Glibc has a capability of making program send SIGHUP to parent when it dies. However, this function should be invoked from a process that spawns children. But the processes we watch for can't be modified. So this option can't be the solution.
One more thing is that if a child process hangs up in an infinite loop, and never terminates, we can't wait for it, hence we can't detect it, and watching for such a limit will fail.
Thus, this way of tracking is not close to the solution for limiting time. Let alone that this way won't allow us to track memory consumption: it's an "instantaneous" property, not a cumulative one.
Using process groups to control children
One of the previous versions of the script used process groups to control the children, just like the shell does!. It put the spawned process into a separate process group, and its children were assigned to that group automatically. It was easy to kill them all with a single kill command, which is capable for sending a signal to the whole such group.
However, the issues arisen by that "just like the shell does" part were too severe. If your black-box process contains a shell script somewhere inside (it may be a system()-like call or a mere Bash script), it would unconditionally interfere with your process group setting, and all the processes spawned through such a shell would lurk out of our control.
The current version of the script tracks process trees by default (but you may switch to tracking process groups). OF course, it will miss the children explicitly detached from their parents (such as daemons spawned). But in an only such a case we noticed in our project (an ssh process spawning to serve a network mount), we did not actually want to track such children! So we switched to trees over process groups without hesitation.
What could be improved
Of course, the sources are not ideal. They contain several problems, which are already identified, but not fixed (due to YAGNI principle). Here's the list of them.
Terminating several processes in a group correctly
Let's first elaborate the case with process groups. The Linux command that sends signals, kill, is capable to sending signals to the whole process group. I.e., after such sending, each process in the group receives the specified signal.
However, the processes in the group we track may have internal dependencies. Assume that two processes are connected by a pipe, and the group receives SIGTERM. If there is an interval between delivery of these signals, then, after death of one process, the other may receive SIGPIPE. This signal may cause reporting an error that would look like there's an internal error in that black box. However, this problem was actually caused by the resource controlling script.
I asked a question on Stack Overflow about that, and an only solution close to the truth was to pause the process group, then send term signal, then continue execution of the processes in the group. I.e., here's the part of the script:
If there's a better solution, or it's proved that after kill syscall the next instruction in all processes would be the signal handler, please, let me know!
Terminating several processes in a tree correctly
Process trees are even more tricky to terminate. Here's how we do it:
- Get a list of all PIDs in the controlled tree
- Stop all these processes by PIDs
- Repeat 1 and 2 for the processes spawned between getting list of PIDs and sending signals
- Repeat 3 until no new processes is spawned
- Send TERM signal to all the processes in the tree
- Get list of PIDs in the tree
- Send "continue" signal to them
If this solution can be simplified, please, let me know.
Error due to insufficient frequency
The case when a grandchildren is born and finished between two measurements, was mitigated by taking cumulative children time into account. However, the controlled process is still free to go insane between two wakeups of the polling script. This is especially important on an overloaded system, where timeout script has to compete for its CPU share with other processes. Perhaps, making the controlled process more nice would help.
Perl quirks or lack of knowledge?
The actual source contains more tricks undisclosed here. For instance, Perl has issues with ALARM signal, and I had to work this around the way I found in the article linked.
However, I'm not sure if I worked this around correctly, especially in the part of terminating the processes after the script detects limit violations. It involves the usual sequence of SIGTERM-sleep-SIGKILL, but that sleep sometimes never ends, leading to hangups in the slave's SIGTERM handlers sometimes.
In this small article, I shared what I experienced during the development of a script that limits time and memory consumption of a "black-box" process and its children. The result was a "dumb" but a sophisticated solution implemented in Perl, and reasoning why there's not many options to choose from. Further directions, in which this script could be improved, are also listed. I also linked the GitHub project for the timeout script.
I hope, that will help you in your affairs. Apache license allows you to use the script in commercial products in return of just mentioning the usage of the script and its original copyright owners.
Log in and make a comment >>