Opened 7 years ago
Closed 3 years ago
#54370 closed defect (fixed)
gmic @2.0.2: buildbot build gets cancelled because 20 minutes go by with no output
Reported by: | ryandesign (Ryan Carsten Schmidt) | Owned by: | Schamschula (Marius Schamschula) |
---|---|---|---|
Priority: | Normal | Milestone: | |
Component: | ports | Version: | |
Keywords: | haspatch | Cc: | Blokkendoos (Johan), mascguy (Christopher Nielsen), cooljeanius (Eric Gallager) |
Port: | gmic |
Description
The buildbot is configured to assume that the build system will produce new output at least once every 20 minutes, but building gmic.cpp takes longer than that:
[ 25%] Building CXX object CMakeFiles/gmic.dir/src/gmic.cpp.o /usr/bin/clang++ -I/opt/local/include -I/opt/local/include/opencv -I/opt/local/include/GraphicsMagick -I/opt/local/include/OpenEXR -I/opt/local/include/gimp-2.0 -I/opt/local/include/gdk-pixbuf-2.0 -I/opt/local/include/libpng16 -I/opt/local/include/cairo -I/opt/local/include/glib-2.0 -I/opt/local/lib/glib-2.0/include -I/opt/local/include/pixman-1 -I/opt/local/include/freetype2 -I/opt/local/include/gtk-2.0 -I/opt/local/lib/gtk-2.0/include -I/opt/local/include/pango-1.0 -I/opt/local/include/gio-unix-2.0 -I/opt/local/include/atk-1.0 -I/opt/local/include/harfbuzz -pipe -Os -stdlib=libc++ -O3 -mtune=generic -Dgmic_build -Dcimg_use_vt100 -Dgmic_is_parallel -Dcimg_use_abort -mmacosx-version-min=10.8 -std=c++11 -stdlib=libc++ -Wno-error=c++11-narrowing -Wc++11-extensions -fpermissive -Dcimg_use_zlib -Dcimg_display=1 -Dcimg_appname=\"gmic\" -Dcimg_use_xshm -Dcimg_use_fftw3 -Dcimg_use_fftw3_singlethread -arch x86_64 -mmacosx-version-min=10.12 -fPIE -Dcimg_use_opencv -Dcimg_use_magick -Dcimg_use_tiff -Dcimg_use_png -Dcimg_use_jpeg -Dcimg_use_openexr -Dgmic_main -o CMakeFiles/gmic.dir/src/gmic.cpp.o -c /opt/local/var/macports/build/_opt_bblocal_var_buildworker_ports_build_ports_science_gmic/gmic/work/gmic-2.0.2/src/gmic.cpp In file included from /opt/local/var/macports/build/_opt_bblocal_var_buildworker_ports_build_ports_science_gmic/gmic/work/gmic-2.0.2/src/gmic.cpp:2023: In file included from /opt/local/var/macports/build/_opt_bblocal_var_buildworker_ports_build_ports_science_gmic/gmic/work/gmic-2.0.2/src/gmic.h:174: In file included from /opt/local/var/macports/build/_opt_bblocal_var_buildworker_ports_build_ports_science_gmic/gmic/work/gmic-2.0.2/src/./CImg.h:540: In file included from /opt/local/include/OpenEXR/ImfRgbaFile.h:52: In file included from /opt/local/include/OpenEXR/ImfRgba.h:45: /opt/local/include/OpenEXR/half.h:462:2: warning: 'register' storage class specifier is deprecated and incompatible with C++1z [-Wdeprecated-register] register int e = (x.i >> 23) & 0x000001ff; ^~~~~~~~~ /opt/local/include/OpenEXR/half.h:473:6: warning: 'register' storage class specifier is deprecated and incompatible with C++1z [-Wdeprecated-register] register int m = x.i & 0x007fffff; ^~~~~~~~~ In file included from /opt/local/var/macports/build/_opt_bblocal_var_buildworker_ports_build_ports_science_gmic/gmic/work/gmic-2.0.2/src/gmic.cpp:2023: In file included from /opt/local/var/macports/build/_opt_bblocal_var_buildworker_ports_build_ports_science_gmic/gmic/work/gmic-2.0.2/src/gmic.h:174: In file included from /opt/local/var/macports/build/_opt_bblocal_var_buildworker_ports_build_ports_science_gmic/gmic/work/gmic-2.0.2/src/./CImg.h:543: In file included from /opt/local/include/OpenEXR/ImfMatrixAttribute.h:50: /opt/local/include/OpenEXR/ImathMatrix.h:2530:5: warning: 'register' storage class specifier is deprecated and incompatible with C++1z [-Wdeprecated-register] register const T * IMATH_RESTRICT ap = &a.x[0][0]; ^~~~~~~~~ /opt/local/include/OpenEXR/ImathMatrix.h:2531:5: warning: 'register' storage class specifier is deprecated and incompatible with C++1z [-Wdeprecated-register] register const T * IMATH_RESTRICT bp = &b.x[0][0]; ^~~~~~~~~ /opt/local/include/OpenEXR/ImathMatrix.h:2532:5: warning: 'register' storage class specifier is deprecated and incompatible with C++1z [-Wdeprecated-register] register T * IMATH_RESTRICT cp = &c.x[0][0]; ^~~~~~~~~ /opt/local/include/OpenEXR/ImathMatrix.h:2534:5: warning: 'register' storage class specifier is deprecated and incompatible with C++1z [-Wdeprecated-register] register T a0, a1, a2, a3; ^~~~~~~~~ /opt/local/include/OpenEXR/ImathMatrix.h:2534:5: warning: 'register' storage class specifier is deprecated and incompatible with C++1z [-Wdeprecated-register] register T a0, a1, a2, a3; ^~~~~~~~~ /opt/local/include/OpenEXR/ImathMatrix.h:2534:5: warning: 'register' storage class specifier is deprecated and incompatible with C++1z [-Wdeprecated-register] register T a0, a1, a2, a3; ^~~~~~~~~ /opt/local/include/OpenEXR/ImathMatrix.h:2534:5: warning: 'register' storage class specifier is deprecated and incompatible with C++1z [-Wdeprecated-register] register T a0, a1, a2, a3; ^~~~~~~~~ command timed out: 1200 seconds without output running ['./mpbb/mpbb', '--prefix', '/opt/local', 'install-port', 'gmic'], attempting to kill
The purpose of the 20-minute timeout is to ensure the buildbot doesn't get stuck forever waiting for a build system that has gotten into an infinite loop. We could increase the timeout, but how far would we need to increase it to accommodate gmic's build?
We may want to bring this matter to the attention of the gmic developers and see if they have any thoughts on how to make the gmic build system produce more frequent output so that it would not run into this timeout. This would not only be for our buildbot's benefit: users might also rightly wonder if their computer has gotten "stuck" when a build system doesn't produce output for a long time.
Attachments (1)
Change History (36)
comment:1 Changed 7 years ago by Schamschula (Marius Schamschula)
comment:2 follow-up: 7 Changed 7 years ago by Schamschula (Marius Schamschula)
Issue reported upstream:
comment:3 Changed 7 years ago by ryandesign (Ryan Carsten Schmidt)
I ran a build of gmic 2.0.2 on my 2.7GHz Core i7 MacBook Pro with Touch Bar, timing how long each step took:
sudo port -dns install gmic -universal 2>&1 | ts -i "%H:%M:%.S" | tee gmic.log
It looks like the 14,500-line source file gmic.cpp gets compiled four times, into four different things:
- the gmic executable, which took 14m29s to build
- the gmic_gimp_gtk gimp plugin, which took 16m8s to build
- the libgmic.a static library, which took 14m15s to build
- the libgmic.dylib dynamic library, which took 14m5s to build
The buildbot worker machines are slower than my laptop—they're Xserves from 2009 with either 2.26 or 2.66GHz Xeon Nehalem processors—which is why they take longer.
I looked at Activity Monitor while some of those products were being built, and the single clang process got up to 3.75GB of memory used, so it's understandable that trying to build all four in parallel would exceed the available memory of even a Mac with 16GB RAM, since the OS takes some of the RAM. Our 64-bit buildbot workers have 8GB RAM each.
I suppose the size of this source file explains why it would take so long to build and takes so much memory. I'm not sure if there would be a way for the developers to split this file up and compile it in smaller sections.
I don't know anything about gmic but if code can be used in several ways—e.g. as an executable, and a plugin, and a library—it seems like the majority of the code would be shared and would only need to be compiled once.
Changed 7 years ago by ryandesign (Ryan Carsten Schmidt)
debug log with elapsed time for each line
comment:4 Changed 6 years ago by ryandesign (Ryan Carsten Schmidt)
comment:5 Changed 6 years ago by Blokkendoos (Johan)
Cc: | Blokkendoos added |
---|
comment:6 Changed 5 years ago by ryandesign (Ryan Carsten Schmidt)
Problem still happens even with an hour-long timeout.
comment:7 Changed 3 years ago by ryandesign (Ryan Carsten Schmidt)
Replying to Schamschula:
Issue reported upstream:
That apparently wasn't the right place to report it, and I've now re-filed it as:
https://github.com/dtschump/gmic/issues/325
I recommend we disable parallel building in this port to work around the problem.
comment:8 Changed 3 years ago by ryandesign (Ryan Carsten Schmidt)
Keywords: | haspatch added |
---|
comment:9 Changed 3 years ago by ryandesign (Ryan Carsten Schmidt)
Resolution: | → fixed |
---|---|
Status: | new → closed |
comment:10 Changed 3 years ago by mascguy (Christopher Nielsen)
Hey Ryan, I asked Marius to re-enable parallel builds, as it cuts build times in half. And that's important, given that the license precludes use of buildbot binaries.
Is there any other way to fix this...?
comment:11 Changed 3 years ago by mascguy (Christopher Nielsen)
Cc: | mascguy added |
---|
comment:12 follow-up: 13 Changed 3 years ago by mascguy (Christopher Nielsen)
Can a port override the timeout at present? If not, are there any alternative mechanisms, such as generating dummy output at regular intervals...?
comment:13 Changed 3 years ago by ryandesign (Ryan Carsten Schmidt)
Replying to mascguy:
Is there any other way to fix this...?
My preferred fix would be to implement #62554. Then a port could override MacPorts' expectation that a compile job won't exceed 1GB RAM. For example, gmic might set that new variable to 4 instead of 1 (assuming we write the formula to leave some RAM for the OS, or if we don't, then gmic could set the variable to e.g. 4.1). If your computer has more than 8 GB RAM then you would still build in parallel, and on our build VMs with 8GB they would build with just one job.
The largest issues have already been resolved by splitting the gmic port into several subports.
Replying to mascguy:
Can a port override the timeout at present?
No. The timeout is set in the buildbot configuration. It is unreasonable for any software build to go for over an hour without printing anything, so I would not want to spend any time implementing a way for ports to override this timeout. The timeout should no longer happen for gmic now that it has been split into subports.
If not, are there any alternative mechanisms, such as generating dummy output at regular intervals...?
It's not about printing dummy output, it's about not wanting to use more RAM than the system has available, to avoid swapping which is what slows down the build and, in the case of the Sierra buildbot worker just before I revisited this ticket a few days ago, caused the worker VM to completely crash and require a hard reboot. At least, it is my guess that excessive memory/swap use might have exposed a bug in Sierra that caused the crash. Or the VM crash could have been coincidental and not have had anything to do with the memory use. Some of the VMs do crash like this from time to time.
comment:14 Changed 3 years ago by mascguy (Christopher Nielsen)
Ryan, relative to resource use, gmic
only compiles two source files. So even with parallel builds enabled, there are only two clang instances executing simultaneously.
Granted, those two processes do consume a considerable amount of memory; they peak at 1.3 GB/each, based on local testing. But with only two running - and with 8 GB of RAM - shouldn't we be fine?
comment:15 Changed 3 years ago by ryandesign (Ryan Carsten Schmidt)
See comment:3. Observing the buildbot worker while building this port, they peaked at almost 4GB RAM each.
comment:16 Changed 3 years ago by ryandesign (Ryan Carsten Schmidt)
Resolution: | fixed |
---|---|
Status: | closed → reopened |
With 3.0.2, even with the port split into subports, this problem is happening again, observed on the 10.8, 10.9, and 10.10 workers. The 10.8 and 10.10 workers currently only have 6 GB RAM available due to the total number of VMs we now run. The 10.9 worker currently has 9 GB, more than the usual 8 GB. I can try temporarily giving these workers more memory to get the builds done.
comment:17 Changed 3 years ago by ryandesign (Ryan Carsten Schmidt)
I gave the 10.8 buildbot worker 12 GB RAM and started a gmic build. When the clang process reached around 10 GB RAM usage, it started eating up gigabytes of swap and I canceled the build and rebooted the worker with 18 GB RAM and started a new build. This time the clang process reached 15.74 GB before it started hitting the swap space. I saw the clang process use as much as 15.99 GB RAM and 6 GB swap. The build was canceled after 1 hour 3 minutes because of no output seen for 1 hour.
I gave the 10.9 worker 15 GB RAM and I observed the clang process using 12.47 GB at its peak. The build was canceled after 1 hour 58 minutes because of no output seen for 1 hour.
The gmic developers closed my bug report with the opinion that the clang compiler is flawed and that they will not make any accommodations for it. They say that old versions of g++ had similar problems but that new versions don't. They suggest that compiling without optimization flags may reduce memory usage.
Trying to compile gmic using g++ would require that g++ supports -stdlib=libc++
. Did that make it into gcc11 or is it coming for gcc12? I don't think MacPorts base knows that any g++ supports -stdlib
yet so until then it would have to be manually added by the Portfile.
Our builds on newer OS versions have already completed so I don't know if newer clangs use less memory. Perhaps someone on macOS 12 could check how much memory clang ends up using during the build. If it's considerably less, then maybe making the port use a newer MacPorts clang is the answer. And I think it may be. While build times are not a great indicator, since different VMs are on different servers with different loads, here's the latest build times:
- 10.6 i386: MacPorts clang 11: 42 minutes
- 10.6 x86_64: MacPorts clang 11: 46 minutes
- 10.7: MacPorts clang 12: 51 minutes
- 10.8: Xcode 5.1.1 clang 503.0.40: failed after an hour (the retry with 18 GB RAM failed after 1 hour 3)
- 10.9: Xcode 6.2 clang 600.0.57: failed after an hour (the retry with 15 GB RAM failed after 1 hour 58)
- 10.10: Xcode 7.2.1 clang 700.1.81: failed after an hour
- 10.11: Xcode 8.2.1 clang 800.0.42.1: 1 hour 58 minutes
- 10.12: Xcode 9.2 clang 900.0.39.2: 1 hour 27 minutes
- 10.13: Xcode 9.4.1 clang 902.0.39.2: 42 minutes
- 10.14: Xcode 10.3 clang 1001.0.46.4: 49 minutes
- 10.15: Xcode 11.7 clang 1103.0.32.62: 34 minutes
- 11 x86_64: Xcode 12.5.1 clang 1205.0.22.11: 33 minutes
- 12 x86_64: Xcode 13.1 clang 1300.0.29.3: 46 minutes
So what would you think about blacklisting {clang < 902}
?
comment:18 Changed 3 years ago by mascguy (Christopher Nielsen)
The excessive memory usage is odd, based on monitoring of local gmic
builds on 10.13, 10.14, and 10.15: While I see the Clang process consume plenty of memory - sometimes peaking at a few gigabytes - I've never seen it exceed 2 or 3gb.
Local build testing on older macOS releases may be needed, to better diagnose this.
Presently buried due to my day job, but might be able to find some time over the coming days. But in the interim, if any toolchain-savvy folks have any thoughts, please chime in!
comment:19 follow-up: 20 Changed 3 years ago by ryandesign (Ryan Carsten Schmidt)
Great, then my proposal to blacklist {clang <902}
should fix the problem.
comment:20 Changed 3 years ago by mascguy (Christopher Nielsen)
Replying to ryandesign:
Great, then my proposal to blacklist
{clang <902}
should fix the problem.
Sounds like a good approach Ryan, and hopefully that will resolve this.
But if the issue persists, I'm willing to assist with diagnosis on older macOS releases, when I can find some time.
comment:21 follow-up: 22 Changed 3 years ago by Schamschula (Marius Schamschula)
Will do.
FYI: I build all subports of each new release on my 2013 iMac i7 which has 24 GB of RAM under Big Sur in less total time than some of the above listed time above.
comment:22 Changed 3 years ago by ryandesign (Ryan Carsten Schmidt)
Replying to Schamschula:
Will do.
[5dee8ca8177e6e230f76c77c708f37fb7b75de27/macports-ports]
So far so good. clang memory usage hasn't gone above 2 GB yet.
FYI: I build all subports of each new release on my 2013 iMac i7 which has 24 GB of RAM under Big Sur in less total time than some of the above listed time above.
Sure but your iMac is 4 years newer than our Xserves so it's probably a bit faster. And the buildbot / mpbb infrastructure adds a little overhead.
comment:23 follow-up: 24 Changed 3 years ago by mascguy (Christopher Nielsen)
On a semi-related note...
If the excessive memory usage is fixed by using newer Clangs, we may be able to re-enable parallel builds: While the compilation is extremely memory-intensive, only two Clang processes run simultaneously.
Though along with the blacklisting change, reducing the optimization level probably wouldn't hurt either. And if it's currently set to our default of optimizing for size (-Os
), we might want to switch that to -O1
instead.
Thoughts?
comment:24 Changed 3 years ago by ryandesign (Ryan Carsten Schmidt)
Replying to mascguy:
If the excessive memory usage is fixed by using newer Clangs, we may be able to re-enable parallel builds: While the compilation is extremely memory-intensive, only two Clang processes run simultaneously.
Yes I agree parallel builds could be reenabled. Would be a good idea to add a comment to the Portfile that the blacklisting is to avoid excessive memory use by older clangs, with the URL reference for the upstream bug report that's currently near the disabling of parallel builds.
I would probably add some code to set build.jobs
to half its value (or minimum 1), since build.jobs
is normally set with the assumption that a compiler will use up to 1 GB memory.
reducing the optimization level probably wouldn't hurt either. And if it's currently set to our default of optimizing for size (
-Os
), we might want to switch that to-O1
instead.
I don't know whether that's desirable. If it is, the revision should be increased.
comment:25 Changed 3 years ago by mascguy (Christopher Nielsen)
From what I can see, it looks like aggressive optimization - via -O3
- is being used. (While base is adding our default of -Os
, a subsequent -O3
is being added from somewhere else. It's not being specified by the portfile, so presumably it's coming from the Makefile, and/or CMake?)
/usr/bin/clang++ -Dcimg_appname=\"gmic\" -Dcimg_display=1 -Dcimg_use_curl -Dcimg_use_fftw3 -Dcimg_use_heif -Dcimg_use_jpeg -Dcimg_use_magick -Dcimg_use_openexr -Dcimg_use_png -Dcimg_use_tiff -Dcimg_use_zlib -Dlibgmic_EXPORTS -isystem gmic/work/gmic-3.0.2/src -isystem /opt/local/include -isystem /opt/local/include/GraphicsMagick -isystem /opt/local/include/OpenEXR -pipe -Os -DNDEBUG -I/opt/local/include -stdlib=libc++ -isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX12.sdk -O3 -DNDEBUG -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX12.sdk -mmacosx-version-min=12.0 -fPIC -Dgmic_core -Dcimg_use_vt100 -Dgmic_is_parallel -Dcimg_use_abort -Wno-error=c++11-narrowing -Wc++11-extensions -fpermissive -std=gnu++11 -MD -MT CMakeFiles/libgmic.dir/src/gmic.cpp.o -MF CMakeFiles/libgmic.dir/src/gmic.cpp.o.d -o CMakeFiles/libgmic.dir/src/gmic.cpp.o -c gmic/work/gmic-3.0.2/src/gmic.cpp
Using -O1
might help, both in terms of memory consumption, as well as compilation time. Hard to say without testing, but it can't hurt!
comment:26 follow-up: 27 Changed 3 years ago by ryandesign (Ryan Carsten Schmidt)
Well, it could hurt, in that lower optimization levels might mean slower code.
comment:27 follow-up: 28 Changed 3 years ago by mascguy (Christopher Nielsen)
Replying to ryandesign:
Well, it could hurt, in that lower optimization levels might mean slower code.
True, but hard to say without some before-and-after benchmarking.
Perhaps -O2
might be a good compromise to start with?
comment:28 Changed 3 years ago by mascguy (Christopher Nielsen)
Replying to mascguy:
Perhaps
-O2
might be a good compromise to start with?
It looks like the builds have succeeded, so perhaps there's no need to change the optimization level.
Should we go ahead and try re-enabling parallel builds, now that we've resolved the excessive memory usage issue?
comment:29 Changed 3 years ago by cooljeanius (Eric Gallager)
Cc: | cooljeanius added |
---|
comment:30 follow-up: 31 Changed 3 years ago by cooljeanius (Eric Gallager)
maybe instead of switching the entire -O
level, find the specific -f
pass that's slowing down the build and then switch it off with -fno-
instead? I think there are people in the GCC community that have compile flag bisectors/reducers; not sure if they also work for clang, though...
comment:31 follow-up: 32 Changed 3 years ago by ryandesign (Ryan Carsten Schmidt)
Replying to cooljeanius:
maybe instead of switching the entire
-O
level
It's Marius's port so he can make further changes if he likes but as far as I'm concerned the matter was resolved by preventing the use of Apple Clang < 902.
comment:32 Changed 3 years ago by mascguy (Christopher Nielsen)
Replying to ryandesign:
It's Marius's port so he can make further changes if he likes but as far as I'm concerned the matter was resolved by preventing the use of Apple Clang < 902.
For sure, and we needn't change the optimization level at this point.
However, given that the license precludes publishing binaries - and the fact that this port doesn't build very quickly, unless a user is running on newer hardware - I'd like to suggest that we re-enable parallel builds.
Marius, are you okay with that?
comment:34 Changed 3 years ago by Schamschula (Marius Schamschula)
comment:35 Changed 3 years ago by mascguy (Christopher Nielsen)
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
This is no longer an issue. Closing as fixed.
Unfortunately, there are several places where nothing seems to happen. However, the build succeeds.