Opened 9 years ago

Closed 8 years ago

#48486 closed defect (fixed)

lion buildslave registry damaged

Reported by: nerdling (Jeremy Lavergne)
Owned by: admin@…
Priority: High
Milestone:
Component: server/hosting
Version:
Keywords:
Cc: ryandesign (Ryan Carsten Schmidt), neverpanic (Clemens Lang), mojca (Mojca Miklavec), pixilla (Bradley Giesbrecht), daniel@…
Port:

Description

I received build failures due to registry corruption. It seems all builds for Lion are failing: https://build.macports.org/builders/buildports-lion-x86_64

    sqlite error: database disk image is malformed (11) while executing query: SELECT dependencies.id FROM ports port INNER JOIN dependencies USING(name) INNER JOIN ports dependent USING(id) WHERE port.id=27240 ORDER BY dependent.name,dependent.epoch, dependent.version, dependent.revision,dependent.variants
        while executing
    "$port dependents"
        (procedure "receipt_sqlite::list_dependents" line 16)
        invoked from within
    "${macports::registry.format}::list_dependents $name $version $revision $variants"
        (procedure "registry::list_dependents" line 3)
        invoked from within
    "registry::list_dependents $pvals(name)"
        (procedure "portlist_sortdependents" line 7)
        invoked from within
    "portlist_sortdependents $portlist"
        (procedure "action_deactivate" line 6)
        invoked from within
    "$action_proc $action $portlist [array get global_options]"
        (procedure "process_cmd" line 103)
        invoked from within
    "process_cmd $remaining_args"
        invoked from within
    "if { [llength $remaining_args] > 0 } { # If there are remaining arguments, process those as a command set exit_status [process_cmd $remaining..."
        (file "/opt/local/bin/port" line 5268)

Change History (25)

comment:1 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)

Priority: Normal → High

Similar/same problem as we have had for two months on the Snow Leopard buildslave; see #47976. We are still waiting on the Mac OS Forge admin to react to this issue.

Joshua suggested that the problem might have been caused by a lack of disk space, which we know has been a problem; see #48173.

comment:2 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)

Josh suggested that what we probably need to do is restore the tenseven-slave virtual machine to a snapshot taken before the MacPorts registry corruption began. I'm not sure exactly when that was, but this ticket was filed on August 1, 2015 so we would need to go earlier than that.

After getting the vm back up and running with the old snapshot, its disk size should be increased so that the corruption doesn't happen again.

comment:3 Changed 9 years ago by mojca (Mojca Miklavec)

Cc: mojca@… added

Cc Me!

comment:4 Changed 9 years ago by keith_dart@…

The registry.db file seems to be corrupted. There is plenty of disk space on the disk. The Disk Utility says the file system is OK. So something else corrupted the registry.db file.

    sqlite> SELECT dependencies.id FROM ports port INNER JOIN dependencies USING(name) INNER JOIN ports dependent USING(id) WHERE port.id=87723;
    71874
    71896  
    84522
    87394
    87622
    87724
    Error: database disk image is malformed

I can't find any information on whether or not this can be rebuilt. Anyone know?

comment:5 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)

How much disk space is free? Some ports, like clang, need 10-20GB free, and the free space on the builders decreases over time as more ports are added to MacPorts, since all ports stay installed (though inactive) on the builders.

I don't know how one would rebuild a corrupted sqlite database. Looking around on the internet, one suggestion is to try to dump the data to a new database, like:

    sqlite3 registry.db ".dump" | sqlite3 registry.db.new
    mv registry.db registry.db.old
    mv registry.db.new registry.db

But this might not work, depending on the nature of the corruption.
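A slightly more cautious variant of the dump-and-reload idea, as a minimal sketch only: it writes the rebuilt database to a separate file, leaves the original untouched, and runs SQLite's `PRAGMA integrity_check` on the result. The helper name `rebuild_sqlite_db` is hypothetical and is not part of MacPorts.

```sh
#!/bin/sh
# Sketch: rebuild an SQLite database from a dump and verify the result.
# rebuild_sqlite_db is a hypothetical helper name, not a MacPorts command.
rebuild_sqlite_db() {
    src=$1
    dst=$2
    # Refuse to overwrite an existing output file.
    if [ -e "$dst" ]; then
        echo "refusing to overwrite $dst" >&2
        return 1
    fi
    # .dump emits SQL that recreates whatever sqlite3 can still read.
    sqlite3 "$src" ".dump" | sqlite3 "$dst" || return 1
    # integrity_check prints the single word "ok" for a healthy database.
    result=$(sqlite3 "$dst" "PRAGMA integrity_check;")
    [ "$result" = "ok" ]
}
```

It could be called as `rebuild_sqlite_db registry.db registry.db.new`. A passing check only says the new file is internally consistent. If the dump hit errors partway through, the new database can be empty or incomplete, so it would be worth comparing row counts of the main tables before swapping files.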

The next best solution would be to restore the virtual machine to a state before the registry corruption occurred (before August 1, 2015).

If we can't repair the registry and can't restore the machine, the next idea would be to delete the registry.db file and the contents of the /opt/local/var/macports/software directory, which should make MacPorts think no ports are installed. If no ports were active when the corruption occurred, I think this should work. If the corruption occurred in the middle of a build, when ports were active, we will soon learn about it through activation failure messages.

comment:6 Changed 9 years ago by keith_dart@…

I was wondering if ports had some admin command that could scan the ports directory and make a new registry.db. Usually these kinds of DBs are a kind of cache. Maybe it's automatic? If I just remove it would it be rebuilt?

There are no vSphere level snapshots for this machine. There is currently 47 GB available on this disk:

    Filesystem      Size   Used  Avail Capacity  Mounted on
    /dev/disk0s1   100Gi   53Gi   47Gi    54%    /
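For checking this from a script rather than by eye, a minimal sketch follows. The helper name `has_free_gib` and the 20 GiB default are assumptions; the default only echoes the "10-20GB" figure mentioned earlier in this ticket for large ports.

```sh
#!/bin/sh
# Sketch: succeed if a filesystem has at least N GiB available.
# has_free_gib is a hypothetical helper name.
has_free_gib() {
    path=$1
    need_gib=${2:-20}
    # df -Pk prints POSIX-format output in 1024-byte blocks;
    # field 4 of the second line is the available space.
    avail_kib=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
    need_kib=$((need_gib * 1024 * 1024))
    [ "$avail_kib" -ge "$need_kib" ]
}
```

For example, `has_free_gib / 20 || echo "low on space"` would warn when less than 20 GiB is available on the root filesystem.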

comment:7 Changed 9 years ago by keith_dart@…

I went ahead and moved the registry.db aside and ran the ports tool. It does indeed create a new one, but it's mostly empty. This may be sufficient to fix this problem.

Last edited 9 years ago by keith_dart@…

comment:8 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)

Thanks. It looks like there are, unfortunately, activation failures, at least for ncurses, but that's not likely to be the only one.

Could you attach a list of all files in /opt/local/bin and /opt/local/lib on this build server? I can try to match them up to a list of ports that will need to be force-activated.
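One way such a matching could be done, sketched under the assumption that a second machine with an intact registry has the same ports installed: feed the file paths to `port provides` there. The helper name `which_port_provides` and the `PORT_CMD` override are hypothetical and exist only to allow a dry run.

```sh
#!/bin/sh
# Sketch: ask MacPorts which port provides each path read from stdin.
# Intended for a reference machine with a working registry.
# which_port_provides and PORT_CMD are hypothetical names.
which_port_provides() {
    port_cmd=${PORT_CMD:-port}
    while IFS= read -r file; do
        # Skip blank lines in the input list.
        [ -n "$file" ] || continue
        # Keep the command from consuming the rest of the list on stdin.
        "$port_cmd" provides "$file" </dev/null
    done
}
```

Usage could look like `ls -d /opt/local/bin/* | which_port_provides`. Files that no installed port on the reference machine owns would still need to be identified by hand.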

comment:9 in reply to:  6 ; Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)

Replying to keith_dart@…:

I was wondering if ports had some admin command that could scan the ports directory and make a new registry.db. Usually these kinds of DBs are a kind of cache. Maybe it's automatic? If I just remove it would it be rebuilt?

No, the registry is not a cache; it is the authoritative record of what MacPorts has installed. There is no MacPorts command to rebuild it.

There are no vSphere level snapshots for this machine.

I think we can get the current setup working again with a few forced activations and deactivations. But just so I know: are there any backups of any part of this server? I just want to know that if the server's disk were to become unrecoverably corrupted, we'd have a way to bring a working system back online. The same would apply to the other servers that MacPorts uses.

comment:10 in reply to:  9 Changed 9 years ago by keith_dart@…

Replying to ryandesign@…:

I think we can get the current setup working again with a few forced activations and deactivations. But just so I know: are there any backups of any part of this server? I just want to know that if the server's disk were to become unrecoverably corrupted, we'd have a way to bring a working system back online. The same would apply to the other servers that MacPorts uses.

As far as I know all we have are the virtual disks backed by a Netapp filer (that, of course, has RAID5). So the storage shouldn't completely go away, but there are no offline backups. If we had time we might possibly set up another VM as a Time Machine server.

comment:11 in reply to:  8 Changed 9 years ago by keith_dart@…

Replying to ryandesign@…:

Thanks. It looks like there are, unfortunately, activation failures, at least for ncurses, but that's not likely to be the only one.

Could you attach a list of all files in /opt/local/bin and /opt/local/lib on this build server? I can try to match them up to a list of ports that will need to be force-activated.

tenseven-slave:~ buildbot$ ls /opt/local/bin/
7z				hist2workspace			ncursesw5-config.mp_1421686163	rttmSmooth.pl
7za				hubscr.pl			ncursesw6-config		rttmSort.pl
7zr				hwloc-annotate			netpbm-config			rttmValidator.pl
acomp.pl			hwloc-assembler			npth-config			sbcl
align2html.pl			hwloc-assembler-remote		nspr-config			sc_stats
asclite				hwloc-bind			optabgen			sclite
asn1Coding			hwloc-calc			par2				sctkUnit
asn1Decoding			hwloc-compress-dir		par2create			sendmail2snpp
asn1Parser			hwloc-diff			par2repair			sendpage
bon_csv2html			hwloc-distances			par2verify			sendpage-db
byacc				hwloc-distrib			pdcp				setxrd.csh
captoinfo			hwloc-info			pdsh				setxrd.sh
chfilt.pl			hwloc-ls			port				slatreport.pl
chromedriver			hwloc-patch			port-tclsh			snpp
clear				hwloc-ps			portf				spkr2sad.pl
cm2rem.tcl			idgen				portindex			spring
compile-et.pl			impcnvgen			portmirror			srm
cs2cs				infobot				prerr.properties		ssdeep
csrfilt.sh			infocmp				proj				ssh2rpd
ctmValidator.pl			infotocap			proofserv			stm2rttm.pl
daemondo			invgeod				proofserv.exe			stmValidator.pl
def_art.pl			invproj				qd-config			svd
diffstat			ld-xcode			red				tabs
dmd				logwatch			rem				tanweenFilt.pl
dshbak				lstopo				rem2ps				tbbvars.csh
ecl				lstopo-no-graphics		remind				tbbvars.sh
ecl-config			lzip				reset				tcping
ed				md-eval.pl			rfilter1			thisroot.csh
email2page			memprobe			rmkdepend			thisroot.sh
ffe				mergectm2rttm.pl		root				tic
flawfinder			mkpwd				root-config			tkremind
g2root				mp3cat				root.exe			toe
generate_randfile		mp3dirclean			rootcling			tput
geod				mp3http				rootn.exe			tset
geos-config			mp3log				roots				uriparse
glewinfo			mp3log-conf			roots.exe			utf_filt.pl
h2root				mp3stream-conf			rover				visualinfo
hadd				nad2bin				rpdcp
hamzaNorm.pl			ncurses6-config			rttm2ctm.pl
tenseven-slave:~ buildbot$
tenseven-slave:~ buildbot$ ls /opt/local/lib
ao					libmenu.6.dylib				libpnm.dylib
argus-monitor				libmenu.a				libppm.dylib
ckport					libmenu.dylib				libproj.9.dylib
ecl-16.0.0				libmenu.dylib.mp_1421686163		libproj.a
gdk-pixbuf-2.0				libmenuw.5.dylib.mp_1421686163		libproj.dylib
gimp					libmenuw.6.dylib			libproj.la
gio					libmenuw.a				libqd.a
gtk-2.0					libmenuw.a.mp_1421686163		libqd.la
libGLEW.1.13.0.dylib			libmenuw.dylib				libshp.1.3.0.dylib
libGLEW.1.13.dylib			libmenuw.dylib.mp_1421686163		libshp.1.dylib
libGLEW.a				libncurses++.6.dylib			libshp.dylib
libGLEW.dylib				libncurses++.a				libsvd.a
libGLEWmx.1.13.0.dylib			libncurses++.dylib			libtasn1.6.dylib
libGLEWmx.1.13.dylib			libncurses++w.6.dylib			libtasn1.a
libGLEWmx.a				libncurses++w.a				libtasn1.dylib
libGLEWmx.dylib				libncurses++w.a.mp_1421686163		libtasn1.la
libao.4.dylib				libncurses++w.dylib			libtbb.dylib
libao.dylib				libncurses.6.dylib			libtbbmalloc.dylib
libao.la				libncurses.a				libtbbmalloc_proxy.dylib
libcurses.a				libncurses.dylib			libtermcap.dylib
libcurses.dylib				libncurses.dylib.mp_1421686163		libtermcap.dylib.mp_1421686163
libecl.16.0.0.dylib			libncursesw.5.dylib.mp_1421686163	libtermkey.1.11.0.dylib
libecl.16.0.dylib			libncursesw.6.dylib			libtermkey.1.dylib
libecl.16.dylib				libncursesw.a				libtermkey.a
libecl.dylib				libncursesw.a.mp_1421686163		libtermkey.dylib
libform.6.dylib				libncursesw.dylib			libtermkey.la
libform.a				libncursesw.dylib.mp_1421686163		liburiparser.1.dylib
libform.dylib				libnetpbm.11.72.dylib			liburiparser.a
libform.dylib.mp_1421686163		libnetpbm.11.dylib			liburiparser.dylib
libformw.5.dylib.mp_1421686163		libnetpbm.a				liburiparser.la
libformw.6.dylib			libnetpbm.dylib				nspr
libformw.a				libnpth.0.dylib				ocaml
libformw.a.mp_1421686163		libnpth.dylib				octave
libformw.dylib				libnpth.la				omake
libformw.dylib.mp_1421686163		libpanel.6.dylib			omnixmp
libfuzzy.2.dylib			libpanel.a				p7zip
libfuzzy.a				libpanel.dylib				pdsh
libfuzzy.dylib				libpanel.dylib.mp_1421686163		perl5
libfuzzy.la				libpanelw.5.dylib.mp_1421686163		php4
libgeos-3.5.0.dylib			libpanelw.6.dylib			pkgconfig
libgeos.a				libpanelw.a				python2.4
libgeos.dylib				libpanelw.a.mp_1421686163		python2.5
libgeos.la				libpanelw.dylib				ruby
libgeos_c.1.dylib			libpanelw.dylib.mp_1421686163		sbcl
libgeos_c.a				libpbm.dylib				terminfo
libgeos_c.dylib				libpgm.dylib				terminfo.mp_1421686163
libgeos_c.la				libpixman-1.0.dylib			xymon
libhwloc.5.dylib			libpixman-1.a				yorick
libhwloc.dylib				libpixman-1.dylib
libhwloc.la				libpixman-1.la
tenseven-slave:~ buildbot$

comment:12 Changed 9 years ago by pixilla (Bradley Giesbrecht)

This is where having 'port contents' in the sqlite registry would be a treat.

comment:13 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)

Well... we may need to do this in several steps. Can you start by running:

    sudo port -f activate ncurses && sudo port -f deactivate ncurses

Then I'll kick off some more builds and give you a list of some more ports to do that to.

comment:14 Changed 9 years ago by jmroot (Joshua Root)

Replying to keith_dart@…:

I went ahead and moved the registry.db aside and ran the ports tool. It does indeed create a new one, but it's mostly empty. This may be sufficient to fix this problem.

This makes MacPorts forget all the ports that are installed, but that's OK, this is just a build server. As well as removing the now-unregistered files that Ryan is having you do, you can also rm -r /opt/local/var/macports/software/* to remove the archives of the installed ports. They will be rebuilt or re-downloaded as needed.

It would be best to shut down the buildbot-slave program while doing all this, as it will be installing ports every time someone commits a change.

comment:15 in reply to:  14 ; Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)

Replying to jmr@…:

Replying to keith_dart@…:

I went ahead and moved the registry.db aside and ran the ports tool. It does indeed create a new one, but it's mostly empty. This may be sufficient to fix this problem.

This makes MacPorts forget all the ports that are installed, but that's OK, this is just a build server. As well as removing the now-unregistered files that Ryan is having you do, you can also rm -r /opt/local/var/macports/software/* to remove the archives of the installed ports. They will be rebuilt or re-downloaded as needed.

That should have been done when registry.db was removed. Removing it now, without also again removing registry.db, will cause problems. We could remove registry.db again, and the contents of this folder.

It would be best to shut down the buildbot-slave program while doing all this, as it will be installing ports every time someone commits a change.

I was planning to use the buildbot web site to invoke builds of certain ports to see what activation failures occurred next. An alternative would be to shut down the buildbot-slave and have Keith install ports on the command line. Keith, we can chat over iMessage or something else if you like, which might be quicker.

comment:16 in reply to:  15 Changed 9 years ago by jmroot (Joshua Root)

Replying to ryandesign@…:

That should have been done when registry.db was removed. Removing it now, without also again removing registry.db, will cause problems. We could remove registry.db again, and the contents of this folder.

Sure. The problem is just that those archives will never go away if the corresponding port versions are never installed again.

Take care of the unregistered files first, then the archives and registry should be removed at the same time. EDIT: And of course that needs to be done at a time when all installed ports are inactive, to avoid leaving behind more unregistered files. Which is why the slave should be disconnected from the master while doing that part.
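The "remove archives and registry at the same time" step could be sketched as below. This is an illustration, not a tested procedure: the registry path `var/macports/registry/registry.db` under the prefix is an assumption (only the software directory is named in this ticket), `reset_registry` is a hypothetical name, and the function only prints the commands unless it is called with `apply`. It also assumes the slave is stopped and no ports are active.

```sh
#!/bin/sh
# Sketch: move the registry aside and empty the archive directory together.
# reset_registry is a hypothetical helper; the registry path is an assumption.
# Run only while the buildslave is stopped and all ports are inactive.
reset_registry() {
    prefix=${1:-/opt/local}
    mode=${2:-dry-run}
    # In dry-run mode every command is echoed instead of executed.
    if [ "$mode" = "apply" ]; then run=""; else run="echo"; fi
    registry="$prefix/var/macports/registry/registry.db"
    software="$prefix/var/macports/software"
    stamp=$(date +%Y%m%d%H%M%S)
    # Keep the damaged registry for later inspection rather than deleting it.
    $run mv "$registry" "$registry.corrupt-$stamp"
    # Remove the archive contents but keep the directory itself.
    $run find "$software" -mindepth 1 -maxdepth 1 -exec rm -rf {} +
}
```

Running `reset_registry /opt/local` first shows what would happen; `reset_registry /opt/local apply` (as root) performs it.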

Last edited 9 years ago by jmroot (Joshua Root)

comment:17 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)

Josh, what are your thoughts on the best way to identify and remove the unregistered files? My plan was to first remove the registry and old archives. Then, for each unregistered file we find, identify the port it came from (by looking on my system or elsewhere), force-activate the port that provided it, and then deactivate it. Later we would remove the .mp_* files.
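To see which .mp_* files are lying around before removing anything, a minimal sketch: the pattern is inferred from names in the listing above such as libmenu.dylib.mp_1421686163, and `list_mp_backups` is a hypothetical name. It only lists; deletion is left as a separate, deliberate step.

```sh
#!/bin/sh
# Sketch: list files and directories that were moved aside with a
# ".mp_<digits>" suffix, as seen in the directory listing in this ticket.
# list_mp_backups is a hypothetical helper name. It deletes nothing.
list_mp_backups() {
    prefix=${1:-/opt/local}
    find "$prefix" -name '*.mp_[0-9]*' -print
}
```

For example, `list_mp_backups /opt/local | sort > mp_files.txt` would save the list for review.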

comment:18 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)

Keith and I are chatting about this problem. The Lion builder has been taken offline while we work.

comment:19 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)

We removed registry.db and the software directory. Based on the list of files in /opt/local/bin/ and /opt/local/lib/ shown above, we force-installed these ports:

ncurses dmd geos libnetpbm glew libao libpixman lzip nspr libtasn1 ssdeep shapelib proj p7zip tbb uriparser npth mp3cat par2 ecl hwloc flawfinder byacc root5

We then deactivated active ports and deleted the .mp_* files. Several unregistered files remained. Noting that Lion build 30101 had an exception shortly before the problem began, and was for root6, we then force-installed root6. While this runs, we are concluding our efforts for today and will resume later.

If there are still unregistered files after installing root6 and its dependencies, I think I'll write a script to go through all the ports modified since the problem began around revision r139082 and force install each one, one by one, deactivating ports in between.
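The "force install each one, deactivating in between" loop might look roughly like this. It is a sketch of the idea only, not the script actually used; `force_install_cycle` and the `PORT_CMD` override are hypothetical names, and the list of ports to process has to come from elsewhere.

```sh
#!/bin/sh
# Sketch: force-install each named port, then deactivate everything that is
# active before moving on to the next one. Meant to be run as root.
# force_install_cycle and PORT_CMD are hypothetical names.
force_install_cycle() {
    port_cmd=${PORT_CMD:-port}
    for name in "$@"; do
        # -f lets the install proceed even if unregistered files are in the way.
        if ! "$port_cmd" -f install "$name"; then
            echo "failed: $name" >&2
        fi
        # "active" is the pseudo-portname for all currently active ports.
        "$port_cmd" -f deactivate active
    done
}
```

Called as `force_install_cycle ncurses lzip`, it would process the two ports in order and report failures on stderr without stopping.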

We also have the option of just nuking MacPorts and reinstalling it, but I know that sources.conf was customized, and I don't know what other customizations there might be, though I suppose we could just diff the conf files. I'm also wary of nuking because there always seem to be some ports that install files in unusual locations, which won't get nuked.

comment:20 in reply to:  19 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)

Replying to ryandesign@…:

If there are still unregistered files after installing root6 and its dependencies, I think I'll write a script to go through all the ports modified since the problem began around revision r139082 and force install each one, one by one, deactivating ports in between.

I've written this script and it's been running since Thursday 11/19/2015. It will take some time to work through all ~2200 revisions in question. I'm checking on it from time to time to make sure it's making progress.

comment:21 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)

Today, 25% of the way through, although my script was still running, it was failing to write to its logfile, repeating the error "Result too large". The logfile was not too large, and there was enough disk space. This, combined with the still unexplained corruption of the MacPorts registry that began this ticket, makes me wonder about the integrity of the filesystem. Multiple runs of Disk Utility showed no errors. I have stopped my script for now and will run other disk verification programs later.

comment:22 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)

I ran Disk Warrior on the Lion builder's disk a week or two ago. It reported 239,306 files with an ID that was repaired; 27,277 folders with an incorrect item count; and 61,059 folders that couldn't be found. I wasn't sure if I should believe these errors, so I ran Disk Utility and Disk Warrior on the Snow Leopard builder's disk today. As with the Lion builder's disk, Disk Utility showed no problems, while Disk Warrior showed 2 files with a duplicate ID; 237,465 files with an ID that was repaired; 14,714 folders with an incorrect item count; and 32,975 folders that couldn't be found; and also that there was not enough contiguous free disk space to safely install a fixed replacement directory.

Rather than try to use Disk Warrior to fix this, since we are low on disk space anyway (#48173), I think I'll clone the old disk to a new larger disk, which should result in a new fixed directory in the process. If the old disk directory really is damaged to the point that some files can't be read, I guess I'll find out during the cloning process.

comment:23 Changed 9 years ago by pixilla (Bradley Giesbrecht)

Cc: pixilla@… added

Cc Me!

comment:24 Changed 9 years ago by daniel@…

Cc: daniel@… added

Cc Me!

comment:25 Changed 8 years ago by ryandesign (Ryan Carsten Schmidt)

Resolution: fixed
Status: new → closed

All build workers were rebuilt for the new build system, so we once again have a working 10.7 builder.
