Opened 5 years ago
Closed 4 years ago
#60178 closed defect (wontfix)
Don't use SSDs for buildbot workers
Reported by:  ryandesign (Ryan Carsten Schmidt)
Owned by:     admin@…
Priority:     Normal
Milestone:
Component:    server/hosting
Version:
Keywords:
Cc:           fhgwright (Fred Wright)
Port:
Description
In comment:ticket:60112:4 Fred Wright mentioned something that I'd like to explore in a separate ticket:
BTW, one should never rely on SSDs for primary storage. Not only is flash storage inherently less reliable than magnetic storage, but flash reliability is one of the few technological parameters that's actually been getting *worse* with "advances", rather than better. And a system whose primary purpose is to run builds is probably going to stress the flash write limits a lot.
Change History (7)
comment:1 Changed 5 years ago by danielluke (Daniel J. Luke)
Is that critique based on experience or data? I've had significantly better experience with the reliability of SSDs vs. traditional disks, both at home and at work. (SSDs are generally overprovisioned enough to make up for the lower number of P/E cycles as they've moved from SLC -> MLC -> TLC -> QLC.) It should be possible to monitor the 'wear' on the SSD in the build server and determine whether it's going to 'use up' whatever drive is chosen too quickly.
comment:2 Changed 5 years ago by fhgwright (Fred Wright)
At the basic level, flash devices (as well as similar devices such as EPROMs and EEPROMs) rely on very well-insulated static charges to store bits, whereas HDDs use magnetic domains, which are fundamentally more stable. Both have been squeezing the bits more and more to get higher densities, but the HDDs have done a better job of improving the ECC to match than the SSDs.
Part of my time at Google was in the flash storage area, so we had a lot of data on this issue, both operationally and from vendors. We never used flash devices for primary storage, but only as performance-improving caches for data stored on HDDs.
Assuming wear leveling, excess capacity directly translates into a lower P/E rate, but that's a rather expensive approach to take with devices that are already substantially more expensive per bit than HDDs. Whether "SMART" monitoring can provide adequate early warning of failures remains to be seen, but it obviously didn't in the case of the failure that prompted the original ticket. And of course, it's especially important to ensure that an SSD-based system never swaps.
Granted, the performance of HDD-based Macs leaves a lot to be desired, but I suspect that a lot of this is due to poor scheduling and/or poor prioritization. Apple would probably rather sell SSDs than improve the software. And their "Fusion Drive" approach seems to be a much dumber way to use SSDs to improve HDD performance than the cache approach.
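To put numbers on the over-provisioning point above, here is a back-of-the-envelope sketch. The P/E rating and the write-amplification factors are illustrative assumptions, not vendor data; the point is only that more spare area means fewer flash erases per host write, so the same rated P/E cycles stretch over more host writes.

#!/usr/bin/env python3
# Rough endurance arithmetic: more spare (over-provisioned) capacity lets the
# controller garbage-collect with less write amplification, so the rated P/E
# cycles cover more host writes. All numbers below are assumptions for
# illustration only, not measurements from any particular drive.

def host_writes_before_wearout_tb(usable_tb, rated_pe_cycles, write_amp):
    """Total host writes (TB) the drive can absorb before the rated P/E limit."""
    return usable_tb * rated_pe_cycles / write_amp

if __name__ == "__main__":
    raw_capacity_tb = 1.0      # hypothetical 1 TB of raw TLC flash
    rated_pe_cycles = 1000     # assumed rating for consumer TLC
    # (spare fraction, assumed write-amplification factor at that spare level)
    for spare, waf in [(0.07, 4.0), (0.28, 2.0)]:
        usable = raw_capacity_tb * (1 - spare)
        tbw = host_writes_before_wearout_tb(usable, rated_pe_cycles, waf)
        print(f"{spare:.0%} spare, WAF {waf}: ~{tbw:.0f} TB of host writes, "
              f"{usable * 1000:.0f} GB usable")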
comment:3 Changed 5 years ago by kencu (Ken)
We have a nice port called gsmartcontrol that I've been using. It generates output that looks interesting, although I don't know exactly how to tell whether everything is ideal. Here's the output from my current 960 GB SSD that I use all day, every day. Does everything look OK? I thought it did... but I don't know all about what I'm looking at...
smartctl 7.0 2018-12-30 r4883 [Darwin 10.8.0 i386] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Crucial/Micron BX/MX1/2/3/500, M5/600, 1100 SSDs
Device Model:     Micron_M500_MTFDDAK960MAV
Serial Number:    14200C21D111
LU WWN Device Id: 5 00a075 10c21d111
Firmware Version: MU03
User Capacity:    960,197,124,096 bytes [960 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Mar 11 16:39:55 2020 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM level is:     254 (maximum performance)
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed without error
                                        or no self-test has ever been run.
Total time to complete Offline data collection: ( 4470) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine recommended polling time:      (  2) minutes.
Extended self-test routine recommended polling time:   ( 74) minutes.
Conveyance self-test routine recommended polling time: (  3) minutes.
SCT capabilities:              (0x0035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   100   100   000    -    117
  5 Reallocate_NAND_Blk_Cnt PO--CK   100   100   000    -    0
  9 Power_On_Hours          -O--CK   100   100   000    -    14650
 12 Power_Cycle_Count       -O--CK   100   100   000    -    31263
171 Program_Fail_Count      -O--CK   100   100   000    -    0
172 Erase_Fail_Count        -O--CK   100   100   000    -    0
173 Ave_Block-Erase_Count   -O--CK   095   095   000    -    150
174 Unexpect_Power_Loss_Ct  -O--CK   100   100   000    -    240
180 Unused_Reserve_NAND_Blk PO--CK   000   000   000    -    16523
183 SATA_Interfac_Downshift -O--CK   100   100   000    -    0
184 Error_Correction_Count  -O--CK   100   100   000    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
194 Temperature_Celsius     -O---K   061   050   000    -    39 (Min/Max -10/50)
196 Reallocated_Event_Count -O--CK   100   100   000    -    16
197 Current_Pending_Sector  -O--CK   100   100   000    -    0
198 Offline_Uncorrectable   ----CK   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   100   100   000    -    0
202 Percent_Lifetime_Remain P---CK   095   095   000    -    5
206 Write_Error_Rate        -OSR--   100   100   000    -    0
210 Success_RAIN_Recov_Cnt  -O--CK   100   100   000    -    0
246 Total_Host_Sector_Write -O--CK   100   100   ---    -    134002525540
247 Host_Program_Page_Count -O--CK   100   100   ---    -    4200826760
248 FTL_Program_Page_Count  -O--CK   100   100   ---    -    3462708285
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

ATA_READ_LOG_EXT (addr=0x00:0x00, page=0, n=1) failed: 48-bit ATA commands not implemented
Read GP Log Directory failed

SMART Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       SL      R/O      1  Log Directory
0x01       SL      R/O      1  Summary SMART error log
0x02       SL      R/O     51  Comprehensive SMART error log
0x04       SL      R/O    255  Device Statistics log
0x06       SL      R/O      1  SMART self-test log
0x09       SL      R/W      1  Selective self-test log
0x30       SL      R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  SL      R/W     16  Host vendor specific log
0xa0       SL      VS     208  Device vendor specific log
0xa1-0xbf  SL      VS       1  Device vendor specific log
0xc0       SL      VS      80  Device vendor specific log
0xc1-0xdf  SL      VS       1  Device vendor specific log
0xe0       SL      R/W      1  SCT Command/Status
0xe1       SL      R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log (GP Log 0x03) not supported

SMART Error Log Version: 1
No Errors Logged

SMART Extended Self-test Log (GP Log 0x07) not supported

SMART Self-test log structure revision number 1
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Vendor (0xff)     Completed without error        00%            14135  -
# 2  Vendor (0xff)     Completed without error        00%             7305  -
# 3  Vendor (0xff)     Completed without error        00%             2921  -
# 4  Vendor (0xff)     Completed without error        00%             2286  -
# 5  Vendor (0xff)     Completed without error        00%               36  -
# 6  Vendor (0xff)     Completed without error        00%               33  -
# 7  Vendor (0xff)     Completed without error        00%               26  -
# 8  Vendor (0xff)     Completed without error        00%               17  -
# 9  Vendor (0xff)     Completed without error        00%               16  -
#10  Vendor (0xff)     Completed without error        00%               16  -
#11  Vendor (0xff)     Completed without error        00%               16  -
#12  Vendor (0xff)     Completed without error        00%               15  -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       1 (0x0001)
Device State:                        Active (0)
Current Temperature:                    39 Celsius
Power Cycle Min/Max Temperature:     28/39 Celsius
Lifetime    Min/Max Temperature:    -10/50 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/70 Celsius
Min/Max Temperature Limit:           -5/75 Celsius
Temperature History Size (Index):    478 (473)

Index  Estimated Time  Temperature Celsius
[temperature history chart omitted: 478 one-minute samples from 2020-03-11 08:42 through 16:39, ranging 28 to 39 Celsius, with two short gaps where no reading was recorded]

SCT Error Recovery Control command not supported

Device Statistics (SMART Log 0x04)
Page  Offset  Size         Value  Flags  Description
0x01  =====   =                =  ===    == General Statistics (rev 2) ==
0x01  0x008   4            31263  ---    Lifetime Power-On Resets
0x01  0x010   4            14650  ---    Power-on Hours
0x01  0x018   6     134002525540  ---    Logical Sectors Written
0x01  0x020   6       1366798069  ---    Number of Write Commands
0x01  0x028   6      60333822136  ---    Logical Sectors Read
0x01  0x030   6       1593816513  ---    Number of Read Commands
0x04  =====   =                =  ===    == General Errors Statistics (rev 1) ==
0x04  0x008   4                0  ---    Number of Reported Uncorrectable Errors
0x04  0x010   4            31468  ---    Resets Between Cmd Acceptance and Completion
0x05  =====   =                =  ===    == Temperature Statistics (rev 1) ==
0x05  0x008   1               39  ---    Current Temperature
0x05  0x010   1               35  ---    Average Short Term Temperature
0x05  0x018   1               34  ---    Average Long Term Temperature
0x05  0x020   1               50  ---    Highest Temperature
0x05  0x028   1              -10  ---    Lowest Temperature
0x05  0x030   1               47  ---    Highest Average Short Term Temperature
0x05  0x038   1               26  ---    Lowest Average Short Term Temperature
0x05  0x040   1               38  ---    Highest Average Long Term Temperature
0x05  0x048   1               32  ---    Lowest Average Long Term Temperature
0x05  0x050   4                -  ---    Time in Over-Temperature
0x05  0x058   1               70  ---    Specified Maximum Operating Temperature
0x05  0x060   4                1  ---    Time in Under-Temperature
0x05  0x068   1                0  ---    Specified Minimum Operating Temperature
0x06  =====   =                =  ===    == Transport Statistics (rev 1) ==
0x06  0x008   4                0  ---    Number of Hardware Resets
0x06  0x010   4                0  ---    Number of ASR Events
0x06  0x018   4                0  ---    Number of Interface CRC Errors
0x07  =====   =                =  ===    == Solid State Device Statistics (rev 1) ==
0x07  0x008   1                2  N--    Percentage Used Endurance Indicator
                                  |||_ C monitored condition met
                                  ||__ D supports DSN
                                  |___ N normalized value

ATA_READ_LOG_EXT (addr=0x11:0x00, page=0, n=1) failed: 48-bit ATA commands not implemented
Read SATA Phy Event Counters failed
comment:4 follow-up: 5 Changed 5 years ago by danielluke (Daniel J. Luke)
Fred - do you know of any published data? (This 2016 USENIX paper is nice, but I'd be interested in newer data: https://www.usenix.org/system/files/conference/fast16/fast16-papers-schroeder.pdf)
Ken - That drive is telling you it has 95% of its life left (whether that translates into something useful depends on a lot of other factors).
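One way to keep an eye on those numbers without the GUI is to parse smartctl's attribute table directly. A minimal sketch, assuming smartctl (from the smartmontools port) is installed and run with sufficient privileges; the default device path and the attribute names are assumptions that match the Micron output above and will differ for other vendors:

#!/usr/bin/env python3
# Print the wear-related SMART attributes from `smartctl -A` output.
# Device path and attribute names are assumptions matching the Micron M500
# output quoted above; other vendors label these attributes differently.
import subprocess
import sys

WEAR_ATTRS = ("Percent_Lifetime_Remain", "Ave_Block-Erase_Count",
              "Reallocate_NAND_Blk_Cnt", "Total_Host_Sector_Write")

def wear_report(device):
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        fields = line.split()
        # attribute rows look like: ID# NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
        if len(fields) >= 8 and fields[1] in WEAR_ATTRS:
            print(f"{fields[1]:28s} normalized={fields[3]}  raw={fields[-1]}")

if __name__ == "__main__":
    wear_report(sys.argv[1] if len(sys.argv) > 1 else "/dev/disk0")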
comment:5 Changed 5 years ago by fhgwright (Fred Wright)
Replying to danielluke:
Fred - do you know of any published data? (This 2016 USENIX paper is nice, but I'd be interested in newer data: https://www.usenix.org/system/files/conference/fast16/fast16-papers-schroeder.pdf)
No, my information is only internal, and somewhat older since I left Google in 2013.
I suspect that the replacement rate for HDDs would have gone down if anyone had implemented my suggestion to add surface scans to the automated repair workflow. A simple reinstall without a surface scan may miss many bad blocks.
It's also worth noting that, in the Google server context, almost all data is part of a higher-level redundant storage system, where bad blocks (or even entire failed machines) don't result in any real data loss. Home and small-business systems don't usually have that luxury.
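For the surface-scan point, the drive's built-in SMART extended self-test is one approximation that can be scripted from the host as part of a repair workflow. A minimal sketch, assuming smartctl is available and the device path is known; the polling interval is arbitrary:

#!/usr/bin/env python3
# Start a SMART extended self-test (a full read scan of the media) and poll the
# self-test log until it finishes. Device path and polling interval are assumptions.
import subprocess
import sys
import time

def extended_selftest(device, poll_minutes=10):
    subprocess.run(["smartctl", "-t", "long", device], check=True)
    while True:
        time.sleep(poll_minutes * 60)
        log = subprocess.run(["smartctl", "-l", "selftest", device],
                             capture_output=True, text=True).stdout
        for line in log.splitlines():
            if line.lstrip().startswith("# 1"):   # most recent test is entry # 1
                print(line.strip())
                if "in progress" not in line.lower():
                    return "without error" in line.lower()

if __name__ == "__main__":
    ok = extended_selftest(sys.argv[1] if len(sys.argv) > 1 else "/dev/disk0")
    sys.exit(0 if ok else 1)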
comment:6 Changed 4 years ago by ryandesign (Ryan Carsten Schmidt)
I have been meaning to reply to this ticket since creating it.
Replying to fhgwright:
one should never rely on SSDs for primary storage
That assertion needs at least some qualification. Apple has been shipping Macs with SSDs as primary storage for years, so clearly they think SSDs can be relied upon. VMware likewise specifically supports using SSDs as primary VM storage, so they evidently consider it acceptable in at least some cases. We can certainly discuss whether it was the right choice in our case.
Our previous buildbot setup was at Mac OS Forge, where two Xserves (for buildbot workers) and six HP servers (for everything else) connected via 10-gigabit Ethernet to two expensive NetApp filers, each with dozens of hard disks in a RAID.
In mid-2016, when we needed to set up our own hardware to replace what we were losing due to the closure of Mac OS Forge, I lent MacPorts my four Xserves to use as buildbot machines. I initially tried to find a RAID shared storage solution, some kind of low-end version of what the NetApp filers had offered us, but I could find nothing that was remotely affordable. MacPorts does not have a revenue stream, is not a legal entity, and is not set up to accept donations. We do have some funds from our participation in Google Summer of Code, but at the time we did not consider using those funds to purchase a RAID; I don't think the funds would have been sufficient in any case. Instead I decided to forgo shared storage and use local storage on each Xserve. I believed that trying to host three heavy-duty VMs off a single hard disk would be a tremendous reduction in performance compared to what we were used to from the NetApp filers, so I purchased SSDs for the three Xserves we would use as workers. (The fourth Xserve already had an Apple SSD and an Apple hard disk RAID and is used as the buildmaster and fileserver.) I have been very pleased with the performance of the worker VMs running on these SSDs.
It is likely that our buildbot workers incur more writes to these SSDs than an average workload would. It is also possible that the VMs do not see their virtual disks as SSDs; that, even if they do, macOS may not be issuing TRIM commands to them (it has been Apple's custom to issue TRIM commands only to original Apple SSDs); or that any TRIM commands the VMs do issue to those virtual disks are not passed through to the physical SSDs. Any of these would cause write amplification, further increasing the wear on the SSDs and reducing their life expectancy.
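A quick check of what the guest (or host) believes about TRIM support, using system_profiler's standard SATA and NVMe sections, is sketched below. It cannot tell you whether the hypervisor actually forwards TRIM/UNMAP to the physical SSD; that remains something to verify against the hypervisor's documentation:

#!/usr/bin/env python3
# Report whether macOS thinks each SATA/NVMe device supports TRIM. This only
# shows the local view of the (possibly virtual) disk; whether the hypervisor
# passes TRIM/UNMAP through to the physical SSD is a separate question.
import subprocess

def trim_status():
    for data_type in ("SPSerialATADataType", "SPNVMeDataType"):
        out = subprocess.run(["system_profiler", data_type],
                             capture_output=True, text=True).stdout
        for line in out.splitlines():
            line = line.strip()
            if line.startswith("Model:") or line.startswith("TRIM Support:"):
                print(f"[{data_type}] {line}")

if __name__ == "__main__":
    trim_status()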
After losing one of our VM hosts' SSDs in February 2020, I recreated the VMs on hard disks, using a separate hard disk for each VM. I suppose the performance has not been terrible, but I haven't actually compared build times with the VMs that are still on SSDs. I have definitely noticed slower startup times.
In the meantime, in May 2020, we lost a second SSD in a manner very similar to the first. I haven't rebuilt those VMs yet.
I was using Samsung SM951 AHCI 512 GB SSDs in M.2 PCIe adapters. Their endurance is rated at 150 TBW (terabytes written). They were put into service in mid-2016, so they had 3½–4 years of use. Maybe, with the way that we are using them, they have simply been used up. Or maybe they failed for unrelated reasons.
Despite the trouble we had with these SSDs, I intend to get replacement SSDs and to transition back off of the temporary hard disks. Many current 512 GB NVMe SSDs are rated for 300 TBW, so on that basis they should last twice as long as the old ones. Unlike last time, I intend to overprovision the new SSDs to extend their life even further. I will also investigate using trimforce or similar options to force macOS to issue TRIM commands and reduce write amplification. And maybe we can find ways to modify our buildbot / mpbb setup to reduce writes.
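As a sanity check on those endurance ratings, here is the arithmetic implied by the numbers above. It assumes the first SSDs really were worn out rather than failing for unrelated reasons, and that the write load stays the same; the actual write volume on these workers was never measured:

#!/usr/bin/env python3
# If a 150 TBW drive really was worn out after 3.5-4 years of builds, what
# average daily write volume does that imply, and how long would a 300 TBW
# replacement last at the same rate? Arithmetic only, on the figures quoted
# in this ticket; the real write volume was never measured.

def implied_daily_writes_gb(tbw, years):
    return tbw * 1000 / (years * 365)

def expected_life_years(tbw, daily_writes_gb):
    return tbw * 1000 / (daily_writes_gb * 365)

if __name__ == "__main__":
    for years in (3.5, 4.0):
        daily = implied_daily_writes_gb(150, years)
        print(f"150 TBW over {years} years -> ~{daily:.0f} GB of host writes per day")
        print(f"  a 300 TBW drive at that rate -> ~{expected_life_years(300, daily):.1f} years")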
I'm not expecting NVMe to give us much of a speed boost over AHCI PCIe, especially since we're limited to the Xserve's PCIe 2.0 bus; it's just that AHCI PCIe SSDs are impossible to find now: they're all either AHCI SATA, which is slow, or NVMe PCIe. The Xserve's EFI firmware is not compatible with NVMe, but I've found instructions for modifying the firmware to make it NVMe-compatible, so that's what I'm intending to do at the moment.
comment:7 Changed 4 years ago by ryandesign (Ryan Carsten Schmidt)
Resolution:  → wontfix
Status:      new → closed
Let me follow up on this and close it out. While I appreciate the advice and perspective, Fred, I've decided to continue to use SSDs for the build machines. You mentioned that your experience is from 2013 or earlier, and despite your claim to the contrary, my hope and assumption is that SSD reliability and durability improve over time. Hard disks, on the other hand, are definitely destined to fail as they get older, and the Xserves that we use as build servers can only accommodate certain specific hard disk models in their internal drive bays, models that are now over eleven years old. Even though I have several compatible hard disks and most of them still work, I imagine that they will fail eventually and that compatible replacement hard disks will get harder to find.
One of the servers has been running off such hard disks (one per VM) since its SSD died last year. This does work, but is definitely slower. In addition, hard disks use more power and run hot, and I'd like to reduce my servers' power consumption and heat generation where I can, so I intend to buy a replacement SSD for this server eventually, after other current server issues are addressed.
I did already purchase ADATA XPG SX8200 Pro NVMe PCIe SSDs for the other three servers. It turns out the Xserve firmware was not incompatible with NVMe in general; it was only incompatible with the NVMe version of the Samsung SM951. I have overprovisioned the new SSDs to hopefully increase their life. Each VM now has its full amount of RAM reserved from the host. (When I first set up these servers in 2016 I had overbooked the RAM somewhat, not considering the impact that would have on the SSDs.) It's still possible that macOS within each VM may need to swap from time to time for extremely large builds (e.g. tensorflow). I could purchase more RAM.
These new SSDs seem to be working fine. They cost a fraction of what the SM951s cost in 2016. If they last as long as the SM951s did, then by the time they need replacing again (if we're even still using the Xserves then) the monetary cost will be negligible. The biggest cost is really the downtime and the effort of restoring each system onto a new disk. Maybe I can improve my backup process to make restoration easier in the future.
The last of the old SSDs died in February 2021. The third and final SM951 failed but is still readable, and two of its three VMs have been restored onto the new SSD so far. The old Apple SSD that shipped with the Xserve in 2009 and housed the buildmaster also died; the buildmaster is temporarily being served from a hard disk until it can be moved to the new SSD.