#67336 closed defect (fixed)
Filesystem cache bug can cause BSD tar to create corrupted archives on Catalina and later
Reported by: | catap (Kirill A. Korinsky) | Owned by: | jmroot (Joshua Root) |
---|---|---|---|
Priority: | High | Milestone: | MacPorts 2.10.3 |
Component: | base | Version: | 2.8.1 |
Keywords: | catalina, bigsur, monterey, ventura, sonoma, sequoia | Cc: | mascguy (Christopher Nielsen), Dave-Allured (Dave Allured), ryandesign (Ryan Carsten Schmidt), jmroot (Joshua Root) |
Port: |
Description
While I'm working on updating pari to the latest version, I discovered a bug with broken tar on Ventura x86.
Long story short: the port builds a libpari-gmp.dylib that starts as:
0000000 facf feed 0007 0100 0003 0000 0006 0000
0000010 0010 0000 0650 0000 0085 0010 0000 0000
0000020 0019 0000 01d8 0000 5f5f 4554 5458 0000
0000030 0000 0000 0000 0000 0000 0000 0000 0000
0000040 c000 006c 0000 0000 0000 0000 0000 0000
0000050 c000 006c 0000 0000 0005 0000 0005 0000
0000060 0005 0000 0000 0000 5f5f 6574 7478 0000
0000070 0000 0000 0000 0000 5f5f 4554 5458 0000
0000080 0000 0000 0000 0000 4498 0000 0000 0000
0000090 a78c 0065 0000 0000 4498 0000 0002 0000
After it is installed into the system, the same files look like this:
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0600000 49d2 34f7 48cc 0489 48cf c6ff 3948 75f3
0600010 48d5 c1ff 3948 75d9 48c8 458b 49b0 0089
0600020 ff49 49c7 df39 850f feae ffff 8348 02fb
0600030 8b4c a075 8b4c 8865 1b72 bf41 0001 0000
0600040 2949 4bdf 3c8b e8fc faf1 ffa4 894b fc04
0600050 ff49 75c7 48ee 7d8b 4cc0 f689 8148 88c4
0600060 0000 5b00 5c41 5d41 5e41 5f41 e95d fa61
0600070 ffff 4855 e589 5741 5641 5441 4853 ec83
which means the file is corrupt and has tons of zeros at the beginning.
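To check for this symptom without reading a hex dump, a small script can count the leading zero bytes of a file and test whether it still starts with a Mach-O magic number. This is only a minimal sketch in Python; `leading_zero_run` and `has_macho_magic` are hypothetical helper names and not part of MacPorts.

```python
def leading_zero_run(path, chunk_size=1 << 20):
    """Return the number of consecutive zero bytes at the start of the file."""
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return count  # reached EOF: file is empty or entirely zeros
            stripped = chunk.lstrip(b"\x00")
            count += len(chunk) - len(stripped)
            if stripped:
                return count  # found the first non-zero byte


# First four bytes of a Mach-O file as stored on disk.
MACHO_MAGICS = {
    b"\xcf\xfa\xed\xfe",  # 64-bit little-endian (hexdump prints "facf feed")
    b"\xce\xfa\xed\xfe",  # 32-bit little-endian
    b"\xca\xfe\xba\xbe",  # universal (fat) binary
}


def has_macho_magic(path):
    """Return True if the file begins with one of the known Mach-O magics."""
    with open(path, "rb") as f:
        return f.read(4) in MACHO_MAGICS
```

For the installed file dumped above, where the first non-zero data appears at offset 0600000 (hexadecimal), `leading_zero_run` should report 0x600000 bytes and `has_macho_magic` should return False.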
After some digging, I've discovered that adding to the portfile:
depends_build port:gnutar
post-destroot {
    system -W ${destroot} "/usr/bin/tar -cvf - . 2>/dev/null | wc"
    system -W ${destroot} "${prefix}/bin/gtar -cvf - . 2>/dev/null | wc"
}
returns
DEBUG: system -W /opt/local/var/macports/build/_Users_runner_work_macports-ports_macports-ports_ports_math_pari/pari/work/destroot: /usr/bin/tar -cvf - . 2>/dev/null | wc
141230 797436 6000640
DEBUG: system -W /opt/local/var/macports/build/_Users_runner_work_macports-ports_macports-ports_ports_math_pari/pari/work/destroot: /opt/local/bin/gtar -cvf - . 2>/dev/null | wc
156802 971295 13424640
=> the outputs are quite different, and the system tar produces a broken archive.
This bug also exists in the current version of pari. This means that the binary archive is literally broken.
The question is: how many other ports are affected?
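One way to look for affected archives, instead of comparing byte counts from wc, is to compare each archived file against the file it was created from. The following is a rough sketch in Python using the standard tarfile module; `mismatched_members` is a hypothetical helper, and it assumes the archive members are named relative to the destroot directory.

```python
import hashlib
import os
import tarfile


def _digest(fileobj, chunk_size=1 << 20):
    """Return the SHA-256 digest of everything readable from fileobj."""
    h = hashlib.sha256()
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        h.update(chunk)
    return h.digest()


def mismatched_members(archive_path, root):
    """Return names of regular files whose archived bytes differ from root/<name>."""
    bad = []
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, symlinks, and so on
            on_disk = os.path.join(root, member.name)
            with tar.extractfile(member) as archived, open(on_disk, "rb") as original:
                if _digest(archived) != _digest(original):
                    bad.append(member.name)
    return bad
```

An empty result means every regular file in the archive matches its source. Since the on-disk files are read in the ordinary way here, this check does not depend on the hole information that the archiver used.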
Change History (55)
comment:1 follow-up: 2 Changed 19 months ago by ryandesign (Ryan Carsten Schmidt)
comment:2 follow-up: 9 Changed 19 months ago by catap (Kirill A. Korinsky)
Replying to ryandesign:
Surely this is not a general bug in tar on Ventura, else it would have been reported by now.
Unfortunately I have to disagree with you. After learning that the cause is tar on macOS, I discovered that it is a well-known issue.
For example: https://github.com/actions/runner-images/issues/2619
comment:3 follow-up: 4 Changed 19 months ago by jmroot (Joshua Root)
That issue is from early 2021 and references a report from 2020, so it can't possibly be specific to Ventura. Anyway, the tl;dr appears to be that bsdtar handles sparse files differently than gnutar.
comment:4 Changed 19 months ago by catap (Kirill A. Korinsky)
Replying to jmroot:
That issue is from early 2021 and references a report from 2020, so it can't possibly be specific to Ventura. Anyway, the tl;dr appears to be that bsdtar handles sparse files differently than gnutar.
I disagree with you. I've found at least one example of such a binary package. When I use /usr/bin/tar from macOS 13, I can't decompress it either:
catap@Mac-mini pari % curl -O https://packages.macports.org/pari/pari-2.13.4_0.darwin_22.x86_64.tbz2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1220k  100 1220k    0     0  2685k      0 --:--:-- --:--:-- --:--:-- 2706k
catap@Mac-mini pari % /usr/bin/tar --version
bsdtar 3.5.3 - libarchive 3.5.3 zlib/1.2.11 liblzma/5.0.5 bz2lib/1.0.8
catap@Mac-mini pari % /usr/bin/tar xzf pari-2.13.4_0.darwin_22.x86_64.tbz2
catap@Mac-mini pari % hexdump opt/local/lib/libpari.dylib | head
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0600000 0001 5500 5741 5641 5541 5441 5053 8949
0600010 e8f7 da3c ffff 8949 48c6 1d8d 8e28 000a
0600020 2b8b 03c7 0001 0000 20bf 0000 e800 9f7a
0600030 0000 8949 89c4 852b 75ed 4819 058d 8e0a
0600040 000a 3883 7400 8b0d c738 0000 0000 e800
0600050 9fbe 0000 854d 75e4 bf0c 0020 0000 c031
0600060 dfe8 f5ce 49ff 44c7 0824 000b 0000 8b49
0600070 4807 e8c1 8339 02f8 4f74 bd48 0000 0000
catap@Mac-mini pari %
=> It seems that /usr/bin/tar
can't decode what it has encoded :)
comment:5 Changed 19 months ago by catap (Kirill A. Korinsky)
BTW, /usr/bin/tar
is bsdtar and it seems that it may create a file that it can't read.
comment:6 Changed 19 months ago by jmroot (Joshua Root)
I didn't say it didn't affect Ventura, I said it wasn't specific to Ventura.
comment:7 follow-up: 8 Changed 19 months ago by jmroot (Joshua Root)
Does the suggestion to run /usr/sbin/purge
first do anything for you? That would suggest the bug is actually in the OS.
comment:8 follow-up: 15 Changed 19 months ago by catap (Kirill A. Korinsky)
Replying to jmroot:
Does the suggestion to run
/usr/sbin/purge
first do anything for you? That would suggest the bug is actually in the OS.
I can easily reproduce it via MacPorts' GitHub CI, but I have no idea how to call sudo /usr/sbin/purge from a Portfile's system call.
So I can't say whether it helps or not.
comment:9 follow-up: 10 Changed 19 months ago by ryandesign (Ryan Carsten Schmidt)
Replying to catap:
Replying to ryandesign:
Surely this is not a general bug in tar on Ventura, else it would have been reported by now.
Unfortunately I have to disagree with you. After learning that the cause is tar on macOS, I discovered that it is a well-known issue.
For example: https://github.com/actions/runner-images/issues/2619
What I meant is: we have built tens of thousands of ports on the Ventura buildbot workers and this problem either hasn't occurred there or hasn't caused issues. For example, if it were the case that extracting any port's archive, or specific ports' archives, always failed, then we would expect some ports to fail to install on the buildbot because of it, but we either haven't seen that happen, or we haven't realized that some failures may be attributable to this issue. In any case the problem doesn't seem to affect all users all of the time. It is not a 100% reproducible bug.
Thanks for the link. I've read it and some other related links and it does sound like an OS bug and that lack of reproducibility is commonly reported. Seems that Homebrew has applied the workaround that optionally calls purge
around the code that creates the archives:
https://github.com/Homebrew/brew/commit/36dac3d41feafa09f82dedb63e51fabf5f0ea358
Maybe we should do the same in MacPorts base.
I may have encountered a form of this issue myself with the mongodb port. I hadn't reported it because it wasn't consistently reproducible and I could not figure out whether it was a bug in the mongodb build system or something else, and it takes so long to build that it is difficult to debug. But on one of my Macs with Catalina, and on another Mac with Monterey, I have occasionally encountered sections of the mongod
binary being replaced with zeroes, making the file unusable.
comment:10 Changed 19 months ago by catap (Kirill A. Korinsky)
Replying to ryandesign:
Replying to catap:
Replying to ryandesign:
Surely this is not a general bug in tar on Ventura, else it would have been reported by now.
Unfortunately I have to disagree with you. After learning that the cause is tar on macOS, I discovered that it is a well-known issue.
For example: https://github.com/actions/runner-images/issues/2619
What I meant is: we have built tens of thousands of ports on the Ventura buildbot workers and this problem either hasn't occurred there or hasn't caused issues. For example, if it were the case that extracting any port's archive, or specific ports' archives, always failed, then we would expect some ports to fail to install on the buildbot because of it, but we either haven't seen that happen, or we haven't realized that some failures may be attributable to this issue. In any case the problem doesn't seem to affect all users all of the time. It is not a 100% reproducible bug.
I accidentally found this problem in the pari port, and ecgen, which depends on pari, has no artifact for Ventura x86 either: https://ports.macports.org/port/ecgen/details/ :)
My guess is that it happens, but nobody really cares enough to report it. New users facing this problem have probably given up on MacPorts.
Thanks for the link. I've read it and some other related links and it does sound like an OS bug and that lack of reproducibility is commonly reported. Seems that Homebrew has applied the workaround that optionally calls purge around the code that creates the archives:
https://github.com/Homebrew/brew/commit/36dac3d41feafa09f82dedb63e51fabf5f0ea358
Maybe we should do the same in MacPorts base.
I am not sure if it is possible to do this within MacPorts, because this call requires sudo
and MacPorts can be used without it.
Honestly, I haven't seen a good solution for this: switching to gnutar means forcing users to install it, and if we decide to go that way, let's switch to xar or xz, which will save disk space and traffic :)
Anyway, as a workaround, switching some rare ports to xar seems to be the best possible option, to be honest.
I may have encountered a form of this issue myself with the mongodb port. I hadn't reported it because it wasn't consistently reproducible and I could not figure out whether it was a bug in the mongodb build system or something else, and it takes so long to build that it is difficult to debug. But on one of my Macs with Catalina, and on another Mac with Monterey, I have occasionally encountered sections of the
mongod
binary being replaced with zeroes, making the file unusable.
Shall that port also be switched to xar?
comment:11 follow-up: 12 Changed 19 months ago by ryandesign (Ryan Carsten Schmidt)
We do not have a mechanism for switching the compression method used for archives.
comment:12 follow-up: 19 Changed 19 months ago by catap (Kirill A. Korinsky)
Replying to ryandesign:
We do not have a mechanism for switching the compression method used for archives.
Do you mean in general? Because
set portarchivetype xar
switched compression method to xar
for a specified port.
comment:13 Changed 19 months ago by ryandesign (Ryan Carsten Schmidt)
I did mean in general. I had never heard of being able to set portarchivetype
selectively for individual ports and don't know what the consequences of that might be.
comment:14 Changed 19 months ago by ryandesign (Ryan Carsten Schmidt)
Keywords: | catalina bigsur monterey ventura added |
---|---|
Summary: | tar on Ventura is broken, what shall we do? → BSD tar can create corrupted archives on Catalina, Big Sur, Monterey, Ventura |
Version: | → 2.8.1 |
I have filed FB12160429 with Apple about this issue and noted that it is a duplicate of FB7378125 which someone else filed in 2019. If more people report the problem to Apple they are more likely to fix it sooner. If you include one of these feedback IDs in your report that will help Apple group the related reports together.
comment:15 follow-up: 16 Changed 19 months ago by ryandesign (Ryan Carsten Schmidt)
Replying to catap:
Replying to jmroot:
Does the suggestion to run /usr/sbin/purge first do anything for you? That would suggest the bug is actually in the OS.
I can easily reproduce it via MacPorts' GitHub CI, but I have no idea how to call sudo /usr/sbin/purge from a Portfile's system call.
So I can't say whether it helps or not.
To test that, you could add this to the Portfile:
pre-install {
    system "/usr/sbin/purge"
}
The install phase runs as root and is the phase in which the tar archive is created.
comment:16 Changed 19 months ago by catap (Kirill A. Korinsky)
Replying to ryandesign:
Replying to catap:
Replying to jmroot:
Does the suggestion to run /usr/sbin/purge first do anything for you? That would suggest the bug is actually in the OS.
I can easily reproduce it via MacPorts' GitHub CI, but I have no idea how to call sudo /usr/sbin/purge from a Portfile's system call.
So I can't say whether it helps or not.
To test that, you could add this to the Portfile:
pre-install {
    system "/usr/sbin/purge"
}
The install phase runs as root and is the phase in which the tar archive is created.
It fixes the issue.
comment:17 Changed 19 months ago by mascguy (Christopher Nielsen)
Cc: | mascguy added |
---|
comment:18 Changed 19 months ago by mascguy (Christopher Nielsen)
Priority: | High → Normal |
---|
comment:19 follow-up: 20 Changed 19 months ago by l2dy (Zero King)
Replying to catap:
Replying to ryandesign:
We do not have a mechanism for switching the compression method used for archives.
Do you mean in general? Because
set portarchivetype xar
switched compression method to xar for a specified port.
After your commit [4db7b54a39a4321cc9258a53231abcacca4998a6/macports-ports], the port always installs itself from source because the default archive_type is tbz2
and MacPorts base couldn't find .xar archives.
comment:20 Changed 19 months ago by catap (Kirill A. Korinsky)
Replying to l2dy:
Replying to catap:
Replying to ryandesign:
We do not have a mechanism for switching the compression method used for archives.
Do you mean in general? Because
set portarchivetype xar
switched compression method to xar for a specified port.
After your commit [4db7b54a39a4321cc9258a53231abcacca4998a6/macports-ports], the port always installs itself from source because the default archive_type is tbz2 and MacPorts base couldn't find .xar archives.
Well, at least it allows building an archive that can be installed. Before this, the port might be broken at random :(
Do you have a better suggestion?
comment:21 follow-up: 22 Changed 19 months ago by ryandesign (Ryan Carsten Schmidt)
I do not think we should be attempting to switch the archive type of individual ports to avoid this problem. As I said, we never foresaw the need for such a thing so there is no sanctioned way to do that in MacPorts base. portarchivetype is not in the documented list of options you can set in Portfiles; it's documented as something you can set in macports.conf. As I said, I didn't know what the consequences of switching portarchivetype in ports might be, and now we see that a consequence is that the user cannot receive a binary from our servers.
Further research tells me the issue is very intermittent, possibly relates to the creation of APFS sparse files, and that LLVM tools create such sparse files. I'm not sure what was meant by "LLVM tools", but if it means anything that uses LLVM, that would include clang, and we've certainly seen the problem with compiled software created by clang like the mongod binary and the pari library. The issue has also been seen on systems as far back as Catalina. I did not find a report of the problem on Mojave or High Sierra, but they can use APFS too, so the problem might affect them as well. Therefore, the issue has the potential to affect thousands of our ports, at random, on several years' worth of OS versions. And yet you've just brought the problem to our attention now. That means the problem occurs infrequently enough that we didn't notice it before, and that just rebuilding the port, without any further changes, would probably have succeeded. Of course, even if it rebuilt successfully on our buildbot servers, it might still build incorrectly for some users, though hopefully most users receive our binary instead of building from source.
In any case, we need a general solution in MacPorts base so that we don't sprinkle workarounds into thousands of ports, especially not a workaround that removes the ability to receive a binary. Running /usr/sbin/purge to purge the disk cache before creating the archive is one solution that people say works. It requires root, but that's not a problem for most MacPorts installations. Another workaround that doesn't need root is running /bin/sleep 10; this seems to give the disk cache enough time to sort itself out without a purge. MacPorts base could use one or the other of those, depending on whether it's a root installation of MacPorts or not. We could probably limit it only to APFS filesystems, and maybe only those ports where supported_archs is not noarch; however, it sounds like sparse files are increasingly common on macOS, so even noarch ports might be creating them. There is an API on macOS 11 and later for determining if a file is sparse, so at least on macOS 11 and later we could traverse the destroot and only use purge or sleep if there were any sparse files.
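The selection logic proposed in this paragraph could look roughly like the sketch below. It is written in Python purely for illustration (MacPorts base is not written in Python), all function names are hypothetical, and the sparse check is a simple allocated-size heuristic based on st_blocks, which is an assumption and not the macOS 11 API mentioned above.

```python
import os
import stat


def looks_sparse(size_bytes, allocated_blocks, block_unit=512):
    """Heuristic: fewer allocated bytes than the logical size suggests holes."""
    return allocated_blocks * block_unit < size_bytes


def tree_has_sparse_files(root):
    """Walk root and report whether any regular file looks sparse."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            if stat.S_ISREG(st.st_mode) and looks_sparse(st.st_size, st.st_blocks):
                return True
    return False


def pick_workaround(is_root, purge_exists, on_apfs, any_sparse):
    """Return "purge", "sleep", or None, following the proposal above."""
    if not on_apfs or not any_sparse:
        return None  # nothing to work around
    if is_root and purge_exists:
        return "purge"  # purging the disk cache requires root
    return "sleep"  # fallback that does not need root
```

Note that the heuristic can also flag files that are compressed by the filesystem, so it may trigger the workaround more often than strictly needed.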
If we are finding that certain ports are for some reason much more susceptible to the problem, we could add the workaround into those ports until a new release of MacPorts takes place.
As you mentioned in your original report, only BSD tar seems to be affected by this; GNU tar is not. I haven't tried to read the code of either of them, but if there is some code that GNU tar is running that avoids the issue, such as perhaps some preprocessing on all the files that it will add to the archive that coincidentally causes the disk cache to figure itself out, maybe we can use similar code in MacPorts base to avoid the need for either purge
or sleep
.
comment:22 follow-up: 24 Changed 19 months ago by ryandesign (Ryan Carsten Schmidt)
Replying to ryandesign:
LLVM tools create such sparse files.
at least on macOS 11 and later we could traverse the destroot and only use purge or sleep if there were any sparse files.
On Monterey x86_64, reverting your change to portarchivetype
, I tried destrooting pari and, every second or two, scanning the work directory for sparse files with Sparsity. (I scanned frequently since we know the problem fixes itself within 10 seconds.) Aside from a log file in the configure phase, I didn't see any sparse files. I also tried reverting your change that blacklists Xcode clang; no change. I haven't seen the problem with the pari tbz2 archive on my system. If you consistently see the problem on your system, you could try scanning with Sparsity to see if we can confirm the theory that the problem occurs when there are sparse files.
comment:23 follow-up: 25 Changed 19 months ago by ryandesign (Ryan Carsten Schmidt)
Replying to ryandesign:
if there is some code that GNU tar is running that avoids the issue, such as perhaps some preprocessing on all the files that it will add to the archive that coincidentally causes the disk cache to figure itself out, maybe we can use similar code in MacPorts base to avoid the need for either purge or sleep.
If this code consistently shows you the problem:
Replying to catap:
post-destroot {
    system -W ${destroot} "/usr/bin/tar -cvf - . 2>/dev/null | wc"
    system -W ${destroot} "${prefix}/bin/gtar -cvf - . 2>/dev/null | wc"
}
then you can test this theory by reversing the order so that you call GNU tar first, then BSD tar. Is the BSD tar output then correct or is it still wrong?
comment:24 Changed 19 months ago by catap (Kirill A. Korinsky)
Replying to ryandesign:
On Monterey x86_64, reverting your change to
portarchivetype
, I tried destrooting pari and, every second or two, scanning the work directory for sparse files with Sparsity. (I scanned frequently since we know the problem fixes itself within 10 seconds.) Aside from a log file in the configure phase, I didn't see any sparse files. I also tried reverting your change that blacklists Xcode clang; no change. I haven't seen the problem with the pari tbz2 archive on my system. If you consistently see the problem on your system, you could try scanning with Sparsity to see if we can confirm the theory that the problem occurs when there are sparse files.
Unfortunately, I'm able to reproduce it only via GitHub CI; I'm not able to reproduce it on any of my systems. But the same port had the same issue on the same macOS 13 on the buildbots.
I feel that this port, and the ability to reproduce the issue with it, is a good way for us to track the problem down and fix it. Frankly speaking, I don't like the idea of flushing caches. I feel that it covers up the issue.
comment:25 follow-up: 26 Changed 19 months ago by catap (Kirill A. Korinsky)
Replying to ryandesign:
Replying to catap:
post-destroot {
    system -W ${destroot} "/usr/bin/tar -cvf - . 2>/dev/null | wc"
    system -W ${destroot} "${prefix}/bin/gtar -cvf - . 2>/dev/null | wc"
}
then you can test this theory by reversing the order so that you call GNU tar first, then BSD tar. Is the BSD tar output then correct or is it still wrong?
I've added:
depends_build port:gnutar
post-destroot {
    system -W ${destroot} "${prefix}/bin/gtar -cvf - . 2>/dev/null | wc"
    system -W ${destroot} "/usr/bin/tar -cvf - . 2>/dev/null | wc"
}
and have an output:
DEBUG: Executing proc-post-org.macports.destroot-destroot-1
DEBUG: system -W /opt/local/var/macports/build/_Users_runner_work_macports-ports_macports-ports_ports_math_pari/pari/work/destroot: /opt/local/bin/gtar -cvf - . 2>/dev/null | wc
156802 971295 13424640
DEBUG: system -W /opt/local/var/macports/build/_Users_runner_work_macports-ports_macports-ports_ports_math_pari/pari/work/destroot: /usr/bin/tar -cvf - . 2>/dev/null | wc
141230 797436 6000640
=> it doesn't change anything.
Thus, running GNU tar before BSD tar neither fixes nor sidesteps the issue.
comment:26 follow-up: 27 Changed 19 months ago by ryandesign (Ryan Carsten Schmidt)
Replying to catap:
Unfortunately, I'm able to reproduce it only via GitHub CI; I'm not able to reproduce it on any of my systems. But the same port had the same issue on the same macOS 13 on the buildbots.
Yes I know pari-2.13.4_0.darwin_22.x86_64.tbz2 experienced the problem on the buildbot but none of the other builds of that version did, even though we know this issue can also affect earlier OS versions, and no previous pari versions were on the packages server, so we don't have enough data to say whether this was a one-time random occurrence or something that happens more regularly.
The Darwin 22 x86_64 buildbot worker is different from the others in that it is running on a MacBook Pro with an external hard drive (until I can figure out how to get it installed on one of the Xserves). Certainly we know that APFS is not optimal on hard drives and perhaps this bug is more common with hard drives. I don't know what hardware the GitHub Actions CI runners use. However, the problem is not exclusive to hard drives; the two Macs I saw the problem on with my mongodb builds were running off their internal SSDs.
I feel that this port, and the ability to reproduce the issue with it, is a good way for us to track the problem down and fix it.
Agreed, it is!
Frankly speaking, I don't like the idea of flushing caches. I feel that it covers up the issue.
Hard to say at this point since we don't know what the issue is. If the issue is that there is a bug in the OS disk cache code such that it can contain stale data, then flushing the cache on OS versions where that bug exists is a reasonable workaround.
Several years ago Homebrew implemented the feature that formulas could opt in to flushing the cache and they haven't removed that feature which suggests that this workaround must be working sufficiently well.
Replying to catap:
Thus, running GNU tar before BSD tar neither fixes nor sidesteps the issue.
Thanks, that's interesting!
The articles I read about sparse files said that the OS may create them when it thinks it would save space, even if the program creating the file didn't request that. And reading back a sparse file in the normal way would deliver the data in the normal way, even if it was stored sparsely on disk. The program wouldn't ever need to know that a file had been sparse. I assumed that tar (either BSD or GNU) would simply read the original files, not care or know whether they were sparse, and write them to the archive. However, after searching through the libarchive issue tracker, I see that there have been bug reports about its handling of sparse files before, so libarchive must contain code to identify sparse files and handle them in a special way. Perhaps that code is buggy and we should report the problem to libarchive. Or maybe they will have a better idea about how to track down what's happening.
I've also now learned that GNU tar has support for special handling of sparse files too. But whereas libarchive seems to do this by default, GNU tar doesn't unless you use the --sparse
flag. You could test again with ${prefix}/bin/gtar -cvf - --sparse . 2>/dev/null | wc
and see if that also shows the problem.
comment:27 follow-up: 29 Changed 19 months ago by mascguy (Christopher Nielsen)
Replying to ryandesign:
GNU tar has support for special handling of sparse files too. But whereas libarchive seems to do this by default, GNU tar doesn't unless you use the --sparse flag. You could test again with ${prefix}/bin/gtar -cvf - --sparse . 2>/dev/null | wc and see if that also shows the problem.
If that works, my vote would be to use GNU tar, rather than the call to purge. (I suppose it would be less wonky if the latter is baked into base. Even then, it feels a bit... well... hacky...)
comment:28 follow-up: 30 Changed 18 months ago by ryandesign (Ryan Carsten Schmidt)
The command that MacPorts uses to extract an archive is determined by base. It was never intended for ports to override it, and I don't think it's possible for them to do so.
Yes, working around OS bugs sometimes requires hacks.
I wasn't sure how to test for this myself, but I understand now that I can just go to my fork of macports-ports, make a branch, edit the pari portfile to include the testing code, commit it, and GitHub Actions automatically runs builds for me.
Armed with this knowledge, I did some tests of my own. I saw the problem on macOS 12 and 13 but not on macOS 11.
I tested whether using GNU tar's --sparse
flag caused it to experience the bug too. In my test runs, it did. That confirms for me that it is an OS filesystem bug, not a bug in libarchive.
I tested whether running purge
before archiving fixed the problem. In my test runs, it did.
I tested whether using sleep 10
before archiving fixed the problem. In my test runs, it didn't.
comment:29 follow-up: 31 Changed 18 months ago by catap (Kirill A. Korinsky)
Replying to mascguy:
Replying to ryandesign:
GNU tar has support for special handling of sparse files too. But whereas libarchive seems to do this by default, GNU tar doesn't unless you use the --sparse flag. You could test again with ${prefix}/bin/gtar -cvf - --sparse . 2>/dev/null | wc and see if that also shows the problem.
If that works, my vote would be to use GNU tar, rather than the call to purge. (I suppose it would be less wonky if the latter is baked into base. Even then, it feels a bit... well... hacky...)
The issue is that we would need to set up gnutar. And if we're going to ship that, why not switch to xz, which should decrease bandwidth / disk usage? :)
comment:30 Changed 18 months ago by catap (Kirill A. Korinsky)
Replying to ryandesign:
Armed with this knowledge, I did some tests of my own. I saw the problem on macOS 12 and 13 but not on macOS 11.
I tested whether using GNU tar's --sparse flag caused it to experience the bug too. In my test runs, it did. That confirms for me that it is an OS filesystem bug, not a bug in libarchive.
I tested whether running purge before archiving fixed the problem. In my test runs, it did.
I tested whether using sleep 10 before archiving fixed the problem. In my test runs, it didn't.
Seems that the only way to solve it is to purge inside base.
But it won't help if someone runs port without sudo :(
comment:31 Changed 18 months ago by ryandesign (Ryan Carsten Schmidt)
Replying to catap:
The issue is that we would need to set up gnutar. And if we're going to ship that, why not switch to xz, which should decrease bandwidth / disk usage? :)
As I've explained several times already, we don't have a mechanism for switching the archive compression method, at least not within a major OS version. It is specified in the user's macports.conf, in the portarchivetype
option, and all the user's installed ports use archives of this type. There's no code in MacPorts base to, for example, recompress all the user's installed archives with a new compression method if the user changes portarchivetype
after having ports installed, so changing it when ports are installed will just break everything. We also don't have anything in place to convert the vast number of tbz2 archives on the packages server to txz. Obviously a script could be written to do that. But we would either have to keep all archives available in both formats during the transition period, modifying mpbb as well to upload both archive types, or there would be a period of time during the transition where some users would not receive archives, either because we have converted all the archives and they haven't upgraded MacPorts base yet or because we haven't converted all the archives and they have upgraded MacPorts base already. Or we would have to modify MacPorts base so that it would attempt downloading archives in both formats.
Fortunately portarchivetype
is commented out in the default macports.conf, so if we decided to change the default value (for example to txz
) starting from macOS 14, even users who upgraded from an older system and retained their old macports.conf would get the new archive type.
Bundling gnutar with MacPorts base is potentially problematic as its license is GPL while MacPorts' license is BSD.
Bundling xz is potentially problematic because its license is public domain which is legally ambiguous.
See #47255 and #52000 and #56237 for prior discussion of these issues.
bsdtar from libarchive 3.6.0 and later has a --no-read-sparse flag to turn off its handling of sparse files, which should fix the problem. Unfortunately even macOS Ventura 13.3 still has an older version of bsdtar so this flag is not available. I don't know if this functionality can be accessed from the tar command line program in prior versions (perhaps with some environment variable). The functionality is present in earlier libarchive versions so if MacPorts base were to create archives by calling libarchive functions instead of by running the tar executable then we could disable this feature and thereby avoid the problem.
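Whether the installed bsdtar is new enough for that flag can be decided from the first line of its --version output. A minimal sketch in Python follows; the helper names are hypothetical, and it assumes the output format shown earlier in this ticket ("bsdtar 3.5.3 - libarchive 3.5.3 ...").

```python
import re


def bsdtar_version(version_line):
    """Parse "bsdtar X.Y.Z - libarchive ..." into a tuple, or None if not bsdtar."""
    match = re.match(r"bsdtar (\d+)\.(\d+)\.(\d+)", version_line)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def can_disable_sparse_reads(version_line):
    """True if this bsdtar is 3.6.0 or later, per the comment above."""
    version = bsdtar_version(version_line)
    return version is not None and version >= (3, 6, 0)
```

With the version string reported for macOS 13 above, `can_disable_sparse_reads` returns False, so a caller would still need another workaround there.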
Replying to catap:
Seems that the only way to solve it is purge inside base.
But it won't help if someone runs port without sudo :(
Yes, I've mentioned the lack of a workaround for non-root use cases in my Apple feedback report. So far they have not responded to it.
comment:32 Changed 16 months ago by catap (Kirill A. Korinsky)
comment:33 Changed 7 weeks ago by Dave-Allured (Dave Allured)
Cc: | Dave-Allured added |
---|
comment:34 Changed 7 weeks ago by ryandesign (Ryan Carsten Schmidt)
Cc: | ryandesign jmroot added |
---|---|
Keywords: | sonoma sequoia added |
Priority: | Normal → High |
This continues to affect us; see https://lists.macports.org/pipermail/macports-users/2024-October/053091.html
Josh, can we apply the workaround from the pari port globally in MacPorts base?
pre-install {
    if {[getuid] == 0 && [file exists /usr/sbin/purge]} {
        system "/usr/sbin/purge"
    }
}
That will only fix root MacPorts installations but that's better than nothing and at least avoids us creating corrupted archives on our build servers.
We could make the fix more specific and check first if the filesystem where the files to be tarred are located is APFS.
comment:35 Changed 7 weeks ago by jmroot (Joshua Root)
Purging the disk cache for the whole system is a pretty big hammer. Do you know if the ffmpeg file was bad in the port image or only in the archive that was deployed?
Since #69555 we no longer store port images as archives on APFS systems, however we still create the image by running the destroot through tar in order to get HFS compression. Also that means the files go through tar twice on the buildbot, since an archive has to be created from the image to be deployed onto the packages server. For the latter step, we could arrange for gnutar to be installed in a different prefix on all the APFS build workers.
comment:36 Changed 7 weeks ago by jmroot (Joshua Root)
Also, it's not clear to me from the preceding discussion whether the bug happens when creating the archive or when extracting it. If it's on the creation side, we could possibly switch to tcllib's tar module for that. If it's on the extraction side, we would need to add an HFS compression implementation or lose that feature.
comment:37 Changed 7 weeks ago by ryandesign (Ryan Carsten Schmidt)
To reiterate: The files are correct on disk before being archived. For a time after the file is created, the disk cache erroneously tells anyone who asks that the file contains holes where it does not contain holes anymore. BSD tar uses this information to skip those areas of the file when archiving, so the file is corrupted at the time of archiving; nothing can be done at extract time to recover the missing information. BSD tar has no option to tell it not to do this until version 3.6.0 and even macOS 15 does not yet include a version that new. GNU tar can also do this and experiences the same problem if told to do so but does not do so by default. Clearing the disk cache works around the problem but can only be done as root.
I didn't know about tcllib's tar module. The page I found about it doesn't say much. If it can create archives that do not attempt to represent files sparsely, then using it instead of the system's tar to create archives would work around the problem.
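A minimal sketch of the "never represent files sparsely" idea, using Python's standard tarfile module as a stand-in. This is not tcllib's API, which this ticket does not describe. tarfile copies file contents with ordinary reads and only supports sparse entries when reading archives, so it never consults the hole map. The function name `make_dense_archive` is hypothetical.

```python
import os
import tarfile


def make_dense_archive(root, archive_path):
    """Archive everything under root without asking the OS about holes.

    tarfile stores each regular file by reading all of its bytes, so any
    region the filesystem reports as a hole is written out literally.
    archive_path must live outside root, or the archive would include itself.
    """
    with tarfile.open(archive_path, "w") as tar:
        # arcname="." mirrors running "tar -cf archive ." inside root.
        tar.add(root, arcname=".")
    return os.path.getsize(archive_path)
```

The trade-off is that genuinely sparse files are stored at full size, which is the same behavior GNU tar shows by default.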
comment:38 follow-up: 44 Changed 7 weeks ago by jmroot (Joshua Root)
OK, so things to try in no particular order:
- ask the OS if there are holes in the files the same way that bsdtar does, and only apply workarounds if it says yes
- fsync on the files
- fcntl F_FULLFSYNC
- fcntl F_NOCACHE
- add --no-read-sparse when supported
Do we have any simplified test case that can be used to figure out which of these actually helps?
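The first item in the list above could be sketched roughly as follows, assuming Python's `os.lseek` with `os.SEEK_HOLE` (only exposed on platforms that define it). The function name `reports_hole_before_eof` is hypothetical, and this is an illustration of the query rather than the code bsdtar or MacPorts base uses.

```python
import os


def reports_hole_before_eof(path):
    """Return True if the OS claims the file has a hole before its end.

    lseek(SEEK_HOLE) returns the offset of the first hole at or after the
    given position. Every file has an implicit hole at EOF, so only an
    offset smaller than the file size counts as a real report.
    """
    size = os.path.getsize(path)
    if size == 0:
        return False
    fd = os.open(path, os.O_RDONLY)
    try:
        first_hole = os.lseek(fd, 0, os.SEEK_HOLE)
    except OSError:
        # The filesystem rejected the query; treat the file as dense.
        return False
    finally:
        os.close(fd)
    return first_hole < size
```

A caller could run this over the destroot and apply a workaround only to files for which it returns True. Note that a True result does not say whether the hole is real or a stale cache answer.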
comment:39 follow-up: 40 Changed 7 weeks ago by ryandesign (Ryan Carsten Schmidt)
I'm not aware of a simplified test case. Per above comments, Kirill found that the pari port always exhibits the problem on GitHub Actions, so we did test builds in which we added code to the pre-install block, for example tarring the destroot with BSD tar and GNU tar with various options before and after running purge and checking the size of the resulting archives as a way to measure whether the problem had occurred. Additional tests could be run that way.
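The size comparison described above can be approximated with a small check like the one below. It is a sketch under stated assumptions: the archive is an uncompressed tar, the tree contains no hard links or intentionally sparse files, and the helper names `payload_size` and `archive_looks_truncated` are hypothetical.

```python
import os


def payload_size(root):
    """Sum the sizes of all regular files under root (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                total += os.lstat(path).st_size
    return total


def archive_looks_truncated(archive_path, root):
    """True if an uncompressed tar is smaller than the data it should hold.

    A tar archive stores every file's bytes plus headers and padding, so an
    archive smaller than the raw payload suggests regions were skipped.
    """
    return os.path.getsize(archive_path) < payload_size(root)
```

This is only a heuristic. It misses cases where the skipped regions are smaller than the tar overhead.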
--no-read-sparse is exactly the option added to tar in libarchive 3.6.0 that I was talking about, which is not available in any version of macOS since it still uses an older version of libarchive.
comment:40 follow-up: 41 Changed 7 weeks ago by jmroot (Joshua Root)
Replying to ryandesign:
I'm not aware of a simplified test case. Per above comments, Kirill found that the pari port always exhibits the problem on GitHub Actions, so we did test builds in which we added code to the pre-install block, for example tarring the destroot with BSD tar and GNU tar with various options before and after running purge and checking the size of the resulting archives as a way to measure whether the problem had occurred. Additional tests could be run that way.
The macports-ports CI is not terribly suited to testing modifications to base. Ideally we'd want something that can be added to base's test suite.
--no-read-sparse is exactly the option added to tar in libarchive 3.6.0 that I was talking about, which is not available in any version of macOS since it still uses an older version of libarchive.
Yes. We could use it if the libarchive port is installed and on future OS versions.
comment:41 Changed 7 weeks ago by ryandesign (Ryan Carsten Schmidt)
Replying to jmroot:
The macports-ports CI is not terribly suited to testing modifications to base. Ideally we'd want something that can be added to base's test suite.
Indeed. But I don't know how we'd write a test for what seems to be an intermittent issue in most cases.
Yes. We could use it if the libarchive port is installed and on future OS versions.
But we need a fix that works all the time. If we can use tcllib to make the tar file, as you suggest, then we should just do that all the time; it's more confusing and error-prone when some users get one code path and other users get a different code path. Using tcllib might potentially also solve our problem of newer versions of tar creating archives that emit errors when untarred on Tiger (#70622).
comment:42 Changed 7 weeks ago by jmroot (Joshua Root)
One unusual thing about pari that may make it more susceptible to this is that it links in the destroot phase and creates the dylib directly in the destroot, rather than the more usual process of copying from the build dir to the destroot.
comment:43 Changed 7 weeks ago by ryandesign (Ryan Carsten Schmidt)
Ah! Good catch. The mongodb port where I often observed this does that too.
comment:44 Changed 6 weeks ago by jmroot (Joshua Root)
Replying to jmroot:
- fsync on the files
- fcntl F_FULLFSYNC
- fcntl F_NOCACHE
Well, those don't help, but it seems that cloning the file does. You don't even have to use the clone, just creating it also fixes the original.
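For illustration only, a clone-and-discard step might look like the sketch below. It is not the code committed to MacPorts base. It assumes macOS's `clonefile(2)` is reachable through ctypes, and the function name `clone_and_discard` and the `.clonetmp` suffix are hypothetical. On other platforms it simply does nothing.

```python
import ctypes
import ctypes.util
import os
import sys


def clone_and_discard(path):
    """Clone path with clonefile(2), then delete the clone.

    Returns True if a clone was created, False if cloning is unavailable
    (non-macOS platform, volume without clone support, or any other failure).
    """
    if sys.platform != "darwin":
        return False
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    clonefile = libc.clonefile
    clonefile.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_uint32]
    clonefile.restype = ctypes.c_int
    clone_path = path + ".clonetmp"  # hypothetical scratch name; must not exist
    if clonefile(os.fsencode(path), os.fsencode(clone_path), 0) != 0:
        return False
    # The clone itself is not needed; creating it is what matters here.
    os.unlink(clone_path)
    return True
```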
The way libarchive finds holes is by using lseek with SEEK_HOLE and SEEK_DATA, and those are giving wrong results initially, saying that there is a hole starting at 0 and the first data is at position 6291456 out of 7865736. After cloning, it changes to say that the first hole is at position 4096 and the first data is at 0.
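The reported first data offset, 6291456, is 0x600000, which appears to match the point where the run of zeros ends in the hexdump from the ticket description. A rough way to list what the OS reports, and to flag reported holes that actually contain data, is sketched below. It assumes Python's `os.SEEK_DATA` and `os.SEEK_HOLE` are available, and that ordinary reads return the true contents (as the GNU tar results in this ticket suggest). Both function names are hypothetical.

```python
import errno
import os


def data_segments(path):
    """Return [(start, end), ...] for regions the OS reports as data."""
    segments = []
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    try:
        pos = 0
        while pos < size:
            try:
                start = os.lseek(fd, pos, os.SEEK_DATA)
            except OSError as exc:
                if exc.errno == errno.ENXIO:
                    break  # no more data at or after pos
                raise
            end = os.lseek(fd, start, os.SEEK_HOLE)
            if end <= start:
                break  # defensive: never loop without making progress
            segments.append((start, end))
            pos = end
    finally:
        os.close(fd)
    return segments


def holes_with_real_data(path):
    """Return reported hole ranges that contain at least one non-zero byte."""
    size = os.path.getsize(path)
    bogus = []
    prev_end = 0
    with open(path, "rb") as f:
        # The (size, size) sentinel makes the loop check a trailing hole too.
        for start, end in data_segments(path) + [(size, size)]:
            if start > prev_end:
                f.seek(prev_end)
                remaining = start - prev_end
                while remaining > 0:
                    chunk = f.read(min(remaining, 1 << 20))
                    if not chunk:
                        break
                    if chunk.count(0) != len(chunk):
                        bogus.append((prev_end, start))
                        break
                    remaining -= len(chunk)
            prev_end = max(prev_end, end)
    return bogus
```

A non-empty result from `holes_with_real_data` would mean the hole map disagrees with the file contents, which is the condition that leads a sparse-aware tar to drop data.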
comment:45 Changed 6 weeks ago by jmroot (Joshua Root)
BTW, I was able to reproduce the problem with pari on the macos-12 and macos-13 runners, but not on macos-14. I don't know if there's any difference due to the OS version, or just faster hardware masking the problem, or something else.
comment:46 Changed 6 weeks ago by jmroot (Joshua Root)
Owner: | set to jmroot |
---|---|
Resolution: | → fixed |
Status: | new → closed |
comment:47 Changed 6 weeks ago by jmroot (Joshua Root)
Please test to the extent that you are able. I'm still completely unable to repro with pari locally, and it would be good for many reasons to have a minimal reproducer.
comment:48 Changed 6 weeks ago by ryandesign (Ryan Carsten Schmidt)
Thanks for working on this and for uncovering these new specifics!
Do you think your fix would have prevented the ffmpeg instance of the problem as well? As far as I can tell, ffmpeg uses the normal build strategy of compiling, then copying into the destdir using install. The ffmpeg instance of the problem did occur on the old macOS 15 x86_64 build machine using the external USB hard drive, so slow storage could have played a role.
comment:49 Changed 6 weeks ago by jmroot (Joshua Root)
It's impossible to say for sure since we still don't know the details of how the bug happens, but I'd say there's a good chance. My working theory is that cloning the inode causes the information used by lseek to be refreshed internally.
comment:50 Changed 6 weeks ago by jmroot (Joshua Root)
Milestone: | → MacPorts Future |
---|
comment:51 follow-up: 52 Changed 5 weeks ago by ryandesign (Ryan Carsten Schmidt)
Summary: | BSD tar can create corrupted archives on Catalina, Big Sur, Monterey, Ventura → APFS cache bug can cause BSD tar to create corrupted archives on Catalina and later |
---|
comment:52 Changed 4 weeks ago by jmroot (Joshua Root)
Replying to ryandesign: Do we know for sure that this can only happen with APFS volumes? Those may just happen to be the default on the affected OS versions.
comment:53 Changed 4 weeks ago by jmroot (Joshua Root)
Milestone: | MacPorts Future → MacPorts 2.10.3 |
---|
comment:54 Changed 4 weeks ago by ryandesign (Ryan Carsten Schmidt)
I do not know for sure. I assumed it only affects APFS because it relates to sparse files and HFS Plus doesn't have any support for sparse files. And I haven't seen any reports of the problem on HFS Plus.
comment:55 Changed 4 weeks ago by jmroot (Joshua Root)
Summary: | APFS cache bug can cause BSD tar to create corrupted archives on Catalina and later → Filesystem cache bug can cause BSD tar to create corrupted archives on Catalina and later |
---|
OK. Given the way the kernel code is structured with the VFS and UBC at higher levels, it seems unlikely that the bug is in the actual APFS layer. It does seem likely that it could have been introduced when adding generic support in higher layers for features that APFS implements. But you'd probably be hard pressed to find anyone using any other filesystem for the MacPorts build dir on these OS versions, and we've seen few enough instances of the problem on APFS.
Given that, for better or worse, this ticket is now one of the few sources of information on the issue, it's probably better to use a more generic term to avoid misleading anyone.
Surely this is not a general bug in tar on Ventura, else it would have been reported by now.