Opened 3 years ago
Last modified 2 years ago
#64368 closed defect
cmake @3.22.1_0: "Error running link command: Invalid argument" when run under trace mode — at Version 5
Reported by: | chrstphrchvz (Christopher Chavez) | Owned by: | michaelld (Michael Dickens) |
---|---|---|---|
Priority: | Normal | Milestone: | |
Component: | ports | Version: | 2.7.1 |
Keywords: | tracemode | Cc: | ryandesign (Ryan Carsten Schmidt) |
Port: | cmake |
Description (last modified by chrstphrchvz (Christopher Chavez))
I observe that when building ports under trace mode, the file descriptors given to processes can reach/exceed the default value of FD_SETSIZE
, which is 1024 when not manually defined before including <sys/select.h>. CMake—and possibly other programs—are compiled without manually defining FD_SETSIZE
on macOS, and without defining _DARWIN_UNLIMITED_SELECT
or _DARWIN_C_SOURCE
. This leads to at least two problems using CMake under trace mode: Error running link command: Invalid argument
being output; and out-of-bounds memory accesses, one effect being corruption of the Invalid argument
message.
More detailed description:
If a program is compiled without defining _DARWIN_UNLIMITED_SELECT
(on macOS 10.5 Leopard and later, and before including <sys/time.h> for more recent macOS versions) or _DARWIN_C_SOURCE
(unsure which macOS versions), then calling select()
for more than 1024 file descriptors fails, and errno
is set to EINVAL
.
The select()
call in kwsysProcessWaitForPipe()
in ProcessUNIX.c (inherited from KWSys—Kitware System Library) is affected by this limitation: select()
will fail when max+1
exceeds 1024. One use of the affected call is when running cmake -E cmake_link_script
. If a trace mode build uses enough file descriptors to cause the select()
call to fail, then strerror(errno)
is used to get the error message (for EINVAL
, it is "Invalid argument"
), which CMake eventually prints out as:
Error running link command: Invalid argument
Using the FD_CLR()
/FD_SET()
/FD_ISSET()
macros with too high of a file descriptor is known to cause out-of-bounds accesses (on various platforms—not just macOS), as sizeof(fd_set)
is directly dependent on FD_SETSIZE
. On macOS, these macros implement bounds checking, but only for an fd_set
residing on the stack, and only on very recent macOS versions. The relevant fd_set
in CMake is a member of a structure allocated on the heap, and so corruption/bogus reads do occur for file descriptors which are too high: one example is corruption to the error message buffer located only a few slots after the fd_set
in the same structure, causing CMake to instead output e.g.
Error running link command: Invalid `rgument
or
Error running link command: In^Valid argument
where ^V
is the ASCII SYN character.
Some programs (e.g. Tcl, in tclUnixPort.h) manually define FD_SETSIZE
to OPEN_MAX
(from <sys/syslimits.h>). Currently, CMake only defines FD_SETSIZE
manually (in ProcessUNIX.c) for Cygwin.
I observe that defining FD_SETSIZE
to OPEN_MAX
and defining _DARWIN_UNLIMITED_SELECT
in ProcessUNIX.c prevents select()
from failing with EINVAL
when called for more than 1024 file descriptors.
-
Source/kwsys/ProcessUNIX.c
old new 40 40 #if defined(__CYGWIN__) 41 41 /* Increase the file descriptor limit for select() before including 42 42 related system headers. (Default: 64) */ 43 43 # define FD_SETSIZE 16384 44 #elif defined(__APPLE__) 45 /* Increase the file descriptor limit for select() before including 46 related system headers. (Default: 1024) */ 47 # define _DARWIN_UNLIMITED_SELECT 48 # include <limits.h> /* OPEN_MAX */ 49 # define FD_SETSIZE OPEN_MAX 44 50 #endif 45 51 46 52 #include <assert.h> /* assert */
I believe that both of these definitions are necessary to avoid issues. I observe that if _DARWIN_UNLIMITED_SELECT
is defined but FD_SETSIZE
is left at its default value, the fd_set
structure is still not large enough to avoid out-of-bounds accesses by FD_SET()
/FD_CLR()
/FD_ISSET()
; and the select()
call, which previously would fail, now erroneously returns early (because fd_set
is not large enough to indicate which file to block on), causing CMake to poll/busy-wait rather than block as intended, but without causing the build to fail. (I wonder if it is possible for the opposite to happen, where select()
erroneously blocks on a file descriptor that fd_set
is too small to describe, and cause the build to stall indefinitely; this would resemble unresolved trace mode issues I have noticed and reported for other ports.)
The port I used to investigate this issue is mysql8
, but I hesitate to claim that making these changes to CMake would fix #63455, at least for anyone other than myself, since I’ve already led others who are much more knowledgable to believe that that issue is due to outdated libtool. I also don’t know if this fixes #55086.
I don’t know whether these changes are something upstream CMake/KWSys should incorporate, because I don’t know whether this is necessarily a bug in CMake, i.e. whether CMake (particularly when running cmake_link_script
or other command needing select()
) is normally able to get a file descriptor 1024 or higher, and encounter this issue outside of trace mode. But I do find that it is possible, at least on certain recent macOS versions, to naïvely compile a simple program which successfully opens more than 1024 files without defining FD_SETSIZE
or _DARWIN_UNLIMITED_SELECT
.
I don’t know if file descriptors growing to unusually large values under trace mode is expected behavior. If it isn’t—e.g. due to file descriptor leak in trace mode itself, or not cleaning up after programs with file descriptor leaks—then I don’t know whether it is causing “Too many open files” errors I sometimes observe under trace mode, but have neglected to report. I also don’t know if this issue should instead indicate something missing from the design of trace mode, like abstraction of file descriptor values. And I don’t know if this issue should be considered grave enough to condemn previous trace mode builds because FD_SET()
/FD_CLR()
/FD_ISSET()
may have been silently causing programs to misbehave due to invalid memory accesses.
One idea for making this issue easier to reproduce and find affected programs would be to reserve file descriptors under 1024 at the beginning of trace mode, forcing programs to use file descriptors 1024 and higher. Also note that parallel building—and the number of processors if enabled—are factors in triggering this issue, as these affects the order in which steps in a build are done, and in turn how large the file descriptor values are at a certain point in the build.
Change History (5)
comment:1 Changed 3 years ago by chrstphrchvz (Christopher Chavez)
Description: | modified (diff) |
---|
comment:2 Changed 3 years ago by ryandesign (Ryan Carsten Schmidt)
Cc: | ryandesign added |
---|
comment:3 Changed 3 years ago by neverpanic (Clemens Lang)
comment:4 Changed 3 years ago by neverpanic (Clemens Lang)
Maybe we should just give up on persisting sockets in darwintrace code, and open new ones whenever socket communication is needed. That would probably avoid this entire problem, since our interposing functions all run to completion, so there should never be an open socket left over. Maybe I'll find a few minutes to benchmark how bad this would be from a performance point of view.
comment:5 Changed 3 years ago by chrstphrchvz (Christopher Chavez)
Description: | modified (diff) |
---|
Adjusted patch: POSIX would say to include <limits.h> for OPEN_MAX
(which is also where Tcl actually picks it up from), not <sys/syslimits.h>.
Submitted upstream: https://gitlab.kitware.com/cmake/cmake/-/merge_requests/6909
Trace mode should only use one file descriptor per thread in addition to what is being used without trace mode. I doubt that CMake's linking tests would usually use 1024 file descriptors, so that means there's probably a leak somewhere, possibly related to dynamically creating and destroying threads.
I'm in favor of sending this patch upstream, since from what I understand this will also lead to memory corruption if a CMake build ever tried to legitimately open more than 1024 files, right?
Thanks for the extensive debugging. I had seen those error messages before and traced them to the invocation of select(), but couldn't figure out what was wrong.