Opened 9 years ago
Closed 8 years ago
#48486 closed defect (fixed)
lion buildslave registry damaged
Reported by: | nerdling (Jeremy Lavergne) | Owned by: | admin@… |
---|---|---|---|
Priority: | High | Milestone: | |
Component: | server/hosting | Version: | |
Keywords: | Cc: | ryandesign (Ryan Carsten Schmidt), neverpanic (Clemens Lang), mojca (Mojca Miklavec), pixilla (Bradley Giesbrecht), daniel@… | |
Port: |
Description
I received build failures due to registry corruption. It seems all builds for Lion are failing: https://build.macports.org/builders/buildports-lion-x86_64
sqlite error: database disk image is malformed (11) while executing query: SELECT dependencies.id FROM ports port INNER JOIN dependencies USING(name) INNER JOIN ports dependent USING(id) WHERE port.id=27240 ORDER BY dependent.name,dependent.epoch, dependent.version, dependent.revision,dependent.variants
    while executing
"$port dependents"
    (procedure "receipt_sqlite::list_dependents" line 16)
    invoked from within
"${macports::registry.format}::list_dependents $name $version $revision $variants"
    (procedure "registry::list_dependents" line 3)
    invoked from within
"registry::list_dependents $pvals(name)"
    (procedure "portlist_sortdependents" line 7)
    invoked from within
"portlist_sortdependents $portlist"
    (procedure "action_deactivate" line 6)
    invoked from within
"$action_proc $action $portlist [array get global_options]"
    (procedure "process_cmd" line 103)
    invoked from within
"process_cmd $remaining_args"
    invoked from within
"if { [llength $remaining_args] > 0 } {
    # If there are remaining arguments, process those as a command
    set exit_status [process_cmd $remaining..."
    (file "/opt/local/bin/port" line 5268)
Change History (25)
comment:1 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)
Priority: | Normal → High |
---|
comment:2 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)
Josh suggested that what we probably need to do is restore the tenseven-slave virtual machine to a snapshot taken before the MacPorts registry corruption began. I'm not sure exactly when that was, but this ticket was filed on August 1, 2015 so we would need to go earlier than that.
After getting the vm back up and running with the old snapshot, its disk size should be increased so that the corruption doesn't happen again.
comment:4 Changed 9 years ago by keith_dart@…
The registry.db file seems to be corrupted. There is plenty of disk space on the disk. The Disk Utility says the file system is OK. So something else corrupted the registry.db file.
sqlite> SELECT dependencies.id FROM ports port INNER JOIN dependencies USING(name) INNER JOIN ports dependent USING(id) WHERE port.id=87723;
71874
71896
84522
87394
87622
87724
Error: database disk image is malformed
I can't find any information on whether or not this can be rebuilt. Anyone know?
comment:5 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)
How much disk space is free? Some ports, like clang, need 10-20GB free, and the free space on the builders does decrease over time as more and more ports get added to MacPorts (since all ports stay installed (though inactive) on the builders).
I don't know how one would rebuild a corrupted sqlite database. Looking around on the internet, one suggestion is to try to dump the data to a new database, like:
sqlite3 registry.db ".dump" | sqlite3 registry.db.new
mv registry.db registry.db.old
mv registry.db.new registry.db
But this might not work, depending on the nature of the corruption.
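As a rough illustration of the dump-and-reload idea, here is a minimal Python sketch using the standard `sqlite3` module. The function names `integrity_errors` and `dump_to_new` are hypothetical, and nothing here is part of MacPorts. It first asks SQLite for an integrity report, then replays a SQL dump into a fresh file. On a badly damaged database the dump step may itself raise an error, which matches the caveat above.

```python
import sqlite3


def integrity_errors(db_path):
    """Return SQLite's integrity report; ['ok'] means no damage was detected."""
    con = sqlite3.connect(db_path)
    try:
        return [row[0] for row in con.execute("PRAGMA integrity_check")]
    except sqlite3.DatabaseError as exc:
        # Severe corruption can make the check itself fail.
        return [str(exc)]
    finally:
        con.close()


def dump_to_new(src_path, dst_path):
    """Replay the SQL dump of src_path into a fresh database at dst_path."""
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        # iterdump() yields the same kind of SQL text as the CLI's ".dump".
        dst.executescript("\n".join(src.iterdump()))
    finally:
        src.close()
        dst.close()
```

Running `integrity_errors("registry.db")` before and after the copy would show whether the new file is actually any healthier than the old one.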
The next best solution would be to restore the virtual machine to a state before the registry corruption occurred (before August 1, 2015).
If we can't repair the registry and can't restore the machine, the next idea would be to delete the registry.db file and the contents of the /opt/local/var/macports/software directory, which should make MacPorts think no ports are installed. If no ports were active when the corruption occurred, I think this should work. If the corruption occurred in the middle of a build, when ports were active, we will soon learn about it through activation failure messages.
comment:6 follow-up: 9 Changed 9 years ago by keith_dart@…
I was wondering if ports had some admin command that could scan the ports directory and make a new registry.db. Usually these kinds of DBs are a kind of cache. Maybe it's automatic? If I just remove it would it be rebuilt?
There are no vSphere level snapshots for this machine. There is currently 47 GB available on this disk:
Filesystem    Size   Used  Avail  Capacity  Mounted on
/dev/disk0s1  100Gi  53Gi  47Gi   54%       /
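Since free space keeps coming up, a builder could check it before starting a large build. The following is only a sketch: the helper names are made up, and the 20 GiB default is an assumption taken from the "10-20GB" figure mentioned earlier for ports like clang, not a MacPorts setting.

```python
import shutil


def free_gib(path="/"):
    """Free space on the filesystem containing path, in GiB."""
    return shutil.disk_usage(path).free / 2**30


def has_headroom(path="/", needed_gib=20.0):
    """True if at least needed_gib GiB are free (threshold is an assumption)."""
    return free_gib(path) >= needed_gib
```

With the numbers reported here (47Gi available), `has_headroom("/")` would return True, which supports the observation that the disk was not full at the time of the check.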
comment:7 Changed 9 years ago by keith_dart@…
I went ahead and moved the registry.db aside and ran the ports tool. It does indeed create a new one, but it's mostly empty. This may be sufficient to fix this problem.
comment:8 follow-up: 11 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)
Thanks. It looks like there are, unfortunately, activation failures, at least for ncurses, but that's not likely to be the only one.
Could you attach a list of all files in /opt/local/bin and /opt/local/lib on this build server? I can try to match them up to a list of ports that will need to be force-activated.
comment:9 follow-up: 10 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)
Replying to keith_dart@…:
I was wondering if ports had some admin command that could scan the ports directory and make a new registry.db. Usually these kinds of DBs are a kind of cache. Maybe it's automatic? If I just remove it would it be rebuilt?
No, the registry is not a cache; it is the authoritative record of what MacPorts has installed. There is no MacPorts command to rebuild it.
There are no vSphere level snapshots for this machine.
I think we can get the current setup working again with a few forced activations and deactivations. But just so I know: are there any backups of any part of this server? I just want to know that if the server's disk were to become unrecoverably corrupted, we'd have a way to bring a working system back online. The same would apply to the other servers that MacPorts uses.
comment:10 Changed 9 years ago by keith_dart@…
Replying to ryandesign@…:
I think we can get the current setup working again with a few forced activations and deactivations. But just so I know: are there any backups of any part of this server? I just want to know that if the server's disk were to become unrecoverably corrupted, we'd have a way to bring a working system back online. The same would apply to the other servers that MacPorts uses.
As far as I know all we have are the virtual disks backed by a Netapp filer (that, of course, has RAID5). So the storage shouldn't completely go away, but there are no offline backups. If we had time we might possibly set up another VM as a Time Machine server.
comment:11 Changed 9 years ago by keith_dart@…
Replying to ryandesign@…:
Thanks. It looks like there are, unfortunately, activation failures, at least for ncurses, but that's not likely to be the only one.
Could you attach a list of all files in /opt/local/bin and /opt/local/lib on this build server? I can try to match them up to a list of ports that will need to be force-activated.
tenseven-slave:~ buildbot$ ls /opt/local/bin/ 7z hist2workspace ncursesw5-config.mp_1421686163 rttmSmooth.pl 7za hubscr.pl ncursesw6-config rttmSort.pl 7zr hwloc-annotate netpbm-config rttmValidator.pl acomp.pl hwloc-assembler npth-config sbcl align2html.pl hwloc-assembler-remote nspr-config sc_stats asclite hwloc-bind optabgen sclite asn1Coding hwloc-calc par2 sctkUnit asn1Decoding hwloc-compress-dir par2create sendmail2snpp asn1Parser hwloc-diff par2repair sendpage bon_csv2html hwloc-distances par2verify sendpage-db byacc hwloc-distrib pdcp setxrd.csh captoinfo hwloc-info pdsh setxrd.sh chfilt.pl hwloc-ls port slatreport.pl chromedriver hwloc-patch port-tclsh snpp clear hwloc-ps portf spkr2sad.pl cm2rem.tcl idgen portindex spring compile-et.pl impcnvgen portmirror srm cs2cs infobot prerr.properties ssdeep csrfilt.sh infocmp proj ssh2rpd ctmValidator.pl infotocap proofserv stm2rttm.pl daemondo invgeod proofserv.exe stmValidator.pl def_art.pl invproj qd-config svd diffstat ld-xcode red tabs dmd logwatch rem tanweenFilt.pl dshbak lstopo rem2ps tbbvars.csh ecl lstopo-no-graphics remind tbbvars.sh ecl-config lzip reset tcping ed md-eval.pl rfilter1 thisroot.csh email2page memprobe rmkdepend thisroot.sh ffe mergectm2rttm.pl root tic flawfinder mkpwd root-config tkremind g2root mp3cat root.exe toe generate_randfile mp3dirclean rootcling tput geod mp3http rootn.exe tset geos-config mp3log roots uriparse glewinfo mp3log-conf roots.exe utf_filt.pl h2root mp3stream-conf rover visualinfo hadd nad2bin rpdcp hamzaNorm.pl ncurses6-config rttm2ctm.pl tenseven-slave:~ buildbot$ tenseven-slave:~ buildbot$ ls /opt/local/lib ao libmenu.6.dylib libpnm.dylib argus-monitor libmenu.a libppm.dylib ckport libmenu.dylib libproj.9.dylib ecl-16.0.0 libmenu.dylib.mp_1421686163 libproj.a gdk-pixbuf-2.0 libmenuw.5.dylib.mp_1421686163 libproj.dylib gimp libmenuw.6.dylib libproj.la gio libmenuw.a libqd.a gtk-2.0 libmenuw.a.mp_1421686163 libqd.la libGLEW.1.13.0.dylib libmenuw.dylib 
libshp.1.3.0.dylib libGLEW.1.13.dylib libmenuw.dylib.mp_1421686163 libshp.1.dylib libGLEW.a libncurses++.6.dylib libshp.dylib libGLEW.dylib libncurses++.a libsvd.a libGLEWmx.1.13.0.dylib libncurses++.dylib libtasn1.6.dylib libGLEWmx.1.13.dylib libncurses++w.6.dylib libtasn1.a libGLEWmx.a libncurses++w.a libtasn1.dylib libGLEWmx.dylib libncurses++w.a.mp_1421686163 libtasn1.la libao.4.dylib libncurses++w.dylib libtbb.dylib libao.dylib libncurses.6.dylib libtbbmalloc.dylib libao.la libncurses.a libtbbmalloc_proxy.dylib libcurses.a libncurses.dylib libtermcap.dylib libcurses.dylib libncurses.dylib.mp_1421686163 libtermcap.dylib.mp_1421686163 libecl.16.0.0.dylib libncursesw.5.dylib.mp_1421686163 libtermkey.1.11.0.dylib libecl.16.0.dylib libncursesw.6.dylib libtermkey.1.dylib libecl.16.dylib libncursesw.a libtermkey.a libecl.dylib libncursesw.a.mp_1421686163 libtermkey.dylib libform.6.dylib libncursesw.dylib libtermkey.la libform.a libncursesw.dylib.mp_1421686163 liburiparser.1.dylib libform.dylib libnetpbm.11.72.dylib liburiparser.a libform.dylib.mp_1421686163 libnetpbm.11.dylib liburiparser.dylib libformw.5.dylib.mp_1421686163 libnetpbm.a liburiparser.la libformw.6.dylib libnetpbm.dylib nspr libformw.a libnpth.0.dylib ocaml libformw.a.mp_1421686163 libnpth.dylib octave libformw.dylib libnpth.la omake libformw.dylib.mp_1421686163 libpanel.6.dylib omnixmp libfuzzy.2.dylib libpanel.a p7zip libfuzzy.a libpanel.dylib pdsh libfuzzy.dylib libpanel.dylib.mp_1421686163 perl5 libfuzzy.la libpanelw.5.dylib.mp_1421686163 php4 libgeos-3.5.0.dylib libpanelw.6.dylib pkgconfig libgeos.a libpanelw.a python2.4 libgeos.dylib libpanelw.a.mp_1421686163 python2.5 libgeos.la libpanelw.dylib ruby libgeos_c.1.dylib libpanelw.dylib.mp_1421686163 sbcl libgeos_c.a libpbm.dylib terminfo libgeos_c.dylib libpgm.dylib terminfo.mp_1421686163 libgeos_c.la libpixman-1.0.dylib xymon libhwloc.5.dylib libpixman-1.a yorick libhwloc.dylib libpixman-1.dylib libhwloc.la libpixman-1.la tenseven-slave:~ buildbot$
comment:12 Changed 9 years ago by pixilla (Bradley Giesbrecht)
This is where having 'port contents' in the sqlite registry would be a treat.
comment:13 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)
Well... we may need to do this in several steps. Can you start by running:
sudo port -f activate ncurses && sudo port -f deactivate ncurses
Then I'll kick off some more builds and give you a list of some more ports to do that to.
comment:14 follow-up: 15 Changed 9 years ago by jmroot (Joshua Root)
Replying to keith_dart@…:
I went ahead and moved the registry.db aside and ran the ports tool. It does indeed create a new one, but it's mostly empty. This may be sufficient to fix this problem.
This makes MacPorts forget all the ports that are installed, but that's OK, this is just a build server. As well as removing the now-unregistered files that Ryan is having you do, you can also
rm -r /opt/local/var/macports/software/*
to remove the archives of the installed ports. They will be rebuilt or re-downloaded as needed.
It would be best to shut down the buildbot-slave program while doing all this, as it will be installing ports every time someone commits a change.
comment:15 follow-up: 16 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)
Replying to jmr@…:
Replying to keith_dart@…:
I went ahead and moved the registry.db aside and ran the ports tool. It does indeed create a new one, but it's mostly empty. This may be sufficient to fix this problem.
This makes MacPorts forget all the ports that are installed, but that's OK, this is just a build server. As well as removing the now-unregistered files that Ryan is having you do, you can also
rm -r /opt/local/var/macports/software/*
to remove the archives of the installed ports. They will be rebuilt or re-downloaded as needed.
That should have been done when registry.db was removed. Removing it now, without also again removing registry.db, will cause problems. We could remove registry.db again, and the contents of this folder.
It would be best to shut down the buildbot-slave program while doing all this, as it will be installing ports every time someone commits a change.
I was planning to use the buildbot web site to invoke builds of certain ports to see what activation failures occurred next. An alternative would be to shut down the buildbot-slave, and have Keith install ports on the command line. Keith, we can chat over iMessage or something else if you like which might be quicker.
comment:16 Changed 9 years ago by jmroot (Joshua Root)
Replying to ryandesign@…:
That should have been done when registry.db was removed. Removing it now, without also again removing registry.db, will cause problems. We could remove registry.db again, and the contents of this folder.
Sure. The problem is just that those archives will never go away if the corresponding port versions are never installed again.
Take care of the unregistered files first, then the archives and registry should be removed at the same time. EDIT: And of course that needs to be done at a time when all installed ports are inactive, to avoid leaving behind more unregistered files. Which is why the slave should be disconnected from the master while doing that part.
comment:17 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)
Josh, what are your thoughts on the best way to identify and remove the unregistered files? My plan was going to be to first remove the registry and old archives. Then for each unregistered file we find, identify the port it came from (by looking on my system or elsewhere), then force-activate the port that provided it, then deactivate it. Then later remove the .mp_* files.
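For the last step of that plan, the leftover files can be found by their suffix. The listing earlier in this ticket shows names such as `libmenu.dylib.mp_1421686163`, which is the pattern left behind when a forced activation moves an existing file aside. A minimal Python sketch follows; the function name is hypothetical and the regular expression is inferred from that listing. It only reports candidates and leaves deletion to the operator.

```python
import os
import re

# Suffix pattern inferred from the listing above, e.g. ".mp_1421686163".
MP_SUFFIX = re.compile(r"\.mp_\d+$")


def find_mp_backups(prefix):
    """Return sorted paths under prefix whose names end in .mp_<digits>."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(prefix):
        # Directories count too: the listing includes terminfo.mp_1421686163.
        for name in dirnames + filenames:
            if MP_SUFFIX.search(name):
                hits.append(os.path.join(dirpath, name))
    return sorted(hits)
```

Calling `find_mp_backups("/opt/local")` and reviewing the output before removing anything keeps the cleanup auditable.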
comment:18 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)
Keith and I are chatting about this problem. The Lion builder has been taken offline while we work.
comment:19 follow-up: 20 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)
We removed registry.db and the software directory. Based on the list of files in /opt/local/bin/ and /opt/local/lib/ shown above, we force-installed these ports:
ncurses dmd geos libnetpbm glew libao libpixman lzip nspr libtasn1 ssdeep shapelib proj p7zip tbb uriparser npth mp3cat par2 ecl hwloc flawfinder byacc root5
We then deactivated active ports and deleted the .mp_* files. Several unregistered files remained. Noting that Lion build 30101 had an exception shortly before the problem began, and was for root6, we then force-installed root6. While this runs, we are concluding our efforts for today and will resume later.
If there are still unregistered files after installing root6 and its dependencies, I think I'll write a script to go through all the ports modified since the problem began around revision r139082 and force install each one, one by one, deactivating ports in between.
We also have the option of just nuking MacPorts and reinstalling it, but I know that sources.conf was customized, and I don't know what other customizations there might be, though I suppose we could just diff the conf files. I'm also wary of nuking because there always seem to be some ports that install files in unusual locations, which won't get nuked.
comment:20 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)
Replying to ryandesign@…:
If there are still unregistered files after installing root6 and its dependencies, I think I'll write a script to go through all the ports modified since the problem began around revision r139082 and force install each one, one by one, deactivating ports in between.
I've written this script and it's been running since Thursday 11/19/2015. It will take some time to work through all ~2200 revisions in question. I'm checking on it from time to time to make sure it's making progress.
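The script itself is not attached to this ticket. Purely as an illustration of the loop described, a Python sketch might look like the following. It assumes the list of affected port names has already been gathered by some other means, and it uses `port -f install` followed by `port -f deactivate active` as the force-install and deactivate steps. The function names and the log format are hypothetical.

```python
import subprocess


def plan_commands(ports):
    """Build the command sequence: force-install each port, then deactivate."""
    plan = []
    for name in ports:
        plan.append(["port", "-f", "install", name])
        # "active" selects every currently active port.
        plan.append(["port", "-f", "deactivate", "active"])
    return plan


def run_plan(plan, log_path, runner=subprocess.run):
    """Run each command, appending its output and exit status to a log file."""
    with open(log_path, "a") as log:
        for argv in plan:
            log.flush()  # keep our own lines ordered before the child's output
            result = runner(argv, stdout=log, stderr=subprocess.STDOUT)
            # Record failures and keep going rather than aborting the run.
            log.write(f"exit {result.returncode}: {' '.join(argv)}\n")
```

Keeping the plan separate from its execution makes it possible to print and review the commands first, for example `plan_commands(["ncurses", "dmd"])`, before running them as root on the builder.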
comment:21 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)
Today, 25% of the way through, although my script was still running, it was failing to write to its logfile, repeating the error "Result too large". The logfile was not too large, and there was enough disk space. This, combined with the still unexplained corruption of the MacPorts registry that began this ticket, makes me wonder about the integrity of the filesystem. Multiple runs of Disk Utility showed no errors. I have stopped my script for now and will run other disk verification programs later.
comment:22 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)
I ran Disk Warrior on the Lion builder's disk a week or two ago. It reported 239,306 files with an ID that was repaired; 27,277 folders with an incorrect item count; and 61,059 folders that couldn't be found. I wasn't sure if I should believe these errors, so I ran Disk Utility and Disk Warrior on the Snow Leopard builder's disk today. As with the Lion builder's disk, Disk Utility showed no problems, while Disk Warrior showed 2 files with a duplicate ID; 237,465 files with an ID that was repaired; 14,714 folders with an incorrect item count; and 32,975 folders that couldn't be found; and also that there was not enough contiguous free disk space to safely install a fixed replacement directory.
Rather than try to use Disk Warrior to fix this, since we are low on disk space anyway (#48173), I think I'll clone the old disk to a new larger disk, which should result in a new fixed directory in the process. If the old disk directory really is damaged to the point that some files can't be read, I guess I'll find out during the cloning process.
comment:25 Changed 8 years ago by ryandesign (Ryan Carsten Schmidt)
Resolution: | → fixed |
---|---|
Status: | new → closed |
All build workers were rebuilt for the new build system, so we once again have a working 10.7 builder.
Similar/same problem as we have had for two months on the Snow Leopard buildslave; see #47976. We are still waiting on the Mac OS Forge admin to react to this issue.
Joshua suggested that the problem might have been caused by a lack of disk space, which we know has been a problem; see #48173.