Opened 5 years ago
Closed 4 years ago
#60178 closed defect (wontfix)
Don't use SSDs for buildbot workers
Reported by:  ryandesign (Ryan Carsten Schmidt)
Owned by:     admin@…
Priority:     Normal
Milestone:
Component:    server/hosting
Version:
Keywords:
Cc:           fhgwright (Fred Wright)
Port:
Description
In comment:ticket:60112:4 Fred Wright mentioned something that I'd like to explore in a separate ticket:
BTW, one should never rely on SSDs for primary storage. Not only is flash storage inherently less reliable than magnetic storage, but flash reliability is one of the few technological parameters that's actually been getting *worse* with "advances", rather than better. And a system whose primary purpose is to run builds is probably going to stress the flash write limits a lot.
Change History (7)
comment:1 Changed 5 years ago by danielluke (Daniel J. Luke)
Is that critique based on experience or data? I've had significantly better experience with the reliability of SSDs vs. traditional disks, both at home and at work. (SSDs are generally overprovisioned enough to make up for the lower number of P/E cycles as they've moved from SLC -> MLC -> TLC -> QLC.) It should be possible to monitor the 'wear' on the SSD in the build server and determine whether it's going to 'use up' whatever drive is chosen too quickly.
comment:2 Changed 5 years ago by fhgwright (Fred Wright)
At the basic level, flash devices (as well as similar devices such as EPROMs and EEPROMs) rely on very well-insulated static charges to store bits, whereas HDDs use magnetic domains, which are fundamentally more stable. Both have been squeezing the bits more and more to get higher densities, but the HDDs have done a better job of improving the ECC to match than the SSDs.
Part of my time at Google was in the flash storage area, so we had a lot of data on this issue, both operationally and from vendors. We never used flash devices for primary storage, but only as performance-improving caches for data stored on HDDs.
Assuming wear leveling, excess capacity directly translates into a lower P/E rate, but that's a rather expensive approach to take with devices that are already substantially more expensive per bit than HDDs. Whether "SMART" monitoring can provide adequate early warning of failures remains to be seen, but it obviously didn't in the case of the failure that prompted the original ticket. And of course, it's especially important to ensure that an SSD-based system never swaps.
Granted, the performance of HDD-based Macs leaves a lot to be desired, but I suspect that a lot of this is due to poor scheduling and/or poor prioritization. Apple would probably rather sell SSDs than improve the software. And their "Fusion Drive" approach seems to be a much dumber way to use SSDs to improve HDD performance than the cache approach.
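To put numbers on the over-provisioning point above, here is a back-of-the-envelope sketch. The P/E rating and the write-amplification factors are illustrative assumptions, not vendor data; the point is only that more spare area means fewer flash erases per host write, so the same rated P/E cycles stretch over more host writes.

#!/usr/bin/env python3
# Rough endurance arithmetic: more spare (over-provisioned) capacity lets the
# controller garbage-collect with less write amplification, so the rated P/E
# cycles cover more host writes. All numbers below are assumptions for
# illustration only, not measurements from any particular drive.

def host_writes_before_wearout_tb(usable_tb, rated_pe_cycles, write_amp):
    """Total host writes (TB) the drive can absorb before the rated P/E limit."""
    return usable_tb * rated_pe_cycles / write_amp

if __name__ == "__main__":
    raw_capacity_tb = 1.0      # hypothetical 1 TB of raw TLC flash
    rated_pe_cycles = 1000     # assumed rating for consumer TLC
    # (spare fraction, assumed write-amplification factor at that spare level)
    for spare, waf in [(0.07, 4.0), (0.28, 2.0)]:
        usable = raw_capacity_tb * (1 - spare)
        tbw = host_writes_before_wearout_tb(usable, rated_pe_cycles, waf)
        print(f"{spare:.0%} spare, WAF {waf}: ~{tbw:.0f} TB of host writes, "
              f"{usable * 1000:.0f} GB usable")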
comment:3 Changed 5 years ago by kencu (Ken)
We have a nice port called gsmartcontrol that I've been using. It generates output that looks interesting, although I don't know exactly how to tell whether everything is ideal. Here's the output from my current 960 GB SSD that I use all day, every day. Does everything look OK? I thought it did... but I don't know all about what I'm looking at...
smartctl 7.0 2018-12-30 r4883 [Darwin 10.8.0 i386] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Crucial/Micron BX/MX1/2/3/500, M5/600, 1100 SSDs
Device Model:     Micron_M500_MTFDDAK960MAV
Serial Number:    14200C21D111
LU WWN Device Id: 5 00a075 10c21d111
Firmware Version: MU03
User Capacity:    960,197,124,096 bytes [960 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Mar 11 16:39:55 2020 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM level is:     254 (maximum performance)
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed without error
                                        or no self-test has ever been run.
Total time to complete Offline data collection: ( 4470) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine recommended polling time:      (  2) minutes.
Extended self-test routine recommended polling time:   ( 74) minutes.
Conveyance self-test routine recommended polling time: (  3) minutes.
SCT capabilities:              (0x0035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   100   100   000    -    117
  5 Reallocate_NAND_Blk_Cnt PO--CK   100   100   000    -    0
  9 Power_On_Hours          -O--CK   100   100   000    -    14650
 12 Power_Cycle_Count       -O--CK   100   100   000    -    31263
171 Program_Fail_Count      -O--CK   100   100   000    -    0
172 Erase_Fail_Count        -O--CK   100   100   000    -    0
173 Ave_Block-Erase_Count   -O--CK   095   095   000    -    150
174 Unexpect_Power_Loss_Ct  -O--CK   100   100   000    -    240
180 Unused_Reserve_NAND_Blk PO--CK   000   000   000    -    16523
183 SATA_Interfac_Downshift -O--CK   100   100   000    -    0
184 Error_Correction_Count  -O--CK   100   100   000    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
194 Temperature_Celsius     -O---K   061   050   000    -    39 (Min/Max -10/50)
196 Reallocated_Event_Count -O--CK   100   100   000    -    16
197 Current_Pending_Sector  -O--CK   100   100   000    -    0
198 Offline_Uncorrectable   ----CK   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   100   100   000    -    0
202 Percent_Lifetime_Remain P---CK   095   095   000    -    5
206 Write_Error_Rate        -OSR--   100   100   000    -    0
210 Success_RAIN_Recov_Cnt  -O--CK   100   100   000    -    0
246 Total_Host_Sector_Write -O--CK   100   100   ---    -    134002525540
247 Host_Program_Page_Count -O--CK   100   100   ---    -    4200826760
248 FTL_Program_Page_Count  -O--CK   100   100   ---    -    3462708285
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

ATA_READ_LOG_EXT (addr=0x00:0x00, page=0, n=1) failed: 48-bit ATA commands not implemented
Read GP Log Directory failed

SMART Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       SL      R/O      1  Log Directory
0x01       SL      R/O      1  Summary SMART error log
0x02       SL      R/O     51  Comprehensive SMART error log
0x04       SL      R/O    255  Device Statistics log
0x06       SL      R/O      1  SMART self-test log
0x09       SL      R/W      1  Selective self-test log
0x30       SL      R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  SL      R/W     16  Host vendor specific log
0xa0       SL      VS     208  Device vendor specific log
0xa1-0xbf  SL      VS       1  Device vendor specific log
0xc0       SL      VS      80  Device vendor specific log
0xc1-0xdf  SL      VS       1  Device vendor specific log
0xe0       SL      R/W      1  SCT Command/Status
0xe1       SL      R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log (GP Log 0x03) not supported

SMART Error Log Version: 1
No Errors Logged

SMART Extended Self-test Log (GP Log 0x07) not supported

SMART Self-test log structure revision number 1
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Vendor (0xff)     Completed without error        00%            14135  -
# 2  Vendor (0xff)     Completed without error        00%             7305  -
# 3  Vendor (0xff)     Completed without error        00%             2921  -
# 4  Vendor (0xff)     Completed without error        00%             2286  -
# 5  Vendor (0xff)     Completed without error        00%               36  -
# 6  Vendor (0xff)     Completed without error        00%               33  -
# 7  Vendor (0xff)     Completed without error        00%               26  -
# 8  Vendor (0xff)     Completed without error        00%               17  -
# 9  Vendor (0xff)     Completed without error        00%               16  -
#10  Vendor (0xff)     Completed without error        00%               16  -
#11  Vendor (0xff)     Completed without error        00%               16  -
#12  Vendor (0xff)     Completed without error        00%               15  -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       1 (0x0001)
Device State:                        Active (0)
Current Temperature:                    39 Celsius
Power Cycle Min/Max Temperature:     28/39 Celsius
Lifetime    Min/Max Temperature:    -10/50 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/70 Celsius
Min/Max Temperature Limit:           -5/75 Celsius
Temperature History Size (Index):    478 (473)

Index  Estimated Time  Temperature Celsius
[temperature history chart omitted: 478 one-minute samples from 2020-03-11 08:42 through 16:39, ranging 28 to 39 Celsius, with two short gaps where no reading was recorded]

SCT Error Recovery Control command not supported

Device Statistics (SMART Log 0x04)
Page  Offset  Size         Value  Flags  Description
0x01  =====   =                =  ===    == General Statistics (rev 2) ==
0x01  0x008   4            31263  ---    Lifetime Power-On Resets
0x01  0x010   4            14650  ---    Power-on Hours
0x01  0x018   6     134002525540  ---    Logical Sectors Written
0x01  0x020   6       1366798069  ---    Number of Write Commands
0x01  0x028   6      60333822136  ---    Logical Sectors Read
0x01  0x030   6       1593816513  ---    Number of Read Commands
0x04  =====   =                =  ===    == General Errors Statistics (rev 1) ==
0x04  0x008   4                0  ---    Number of Reported Uncorrectable Errors
0x04  0x010   4            31468  ---    Resets Between Cmd Acceptance and Completion
0x05  =====   =                =  ===    == Temperature Statistics (rev 1) ==
0x05  0x008   1               39  ---    Current Temperature
0x05  0x010   1               35  ---    Average Short Term Temperature
0x05  0x018   1               34  ---    Average Long Term Temperature
0x05  0x020   1               50  ---    Highest Temperature
0x05  0x028   1              -10  ---    Lowest Temperature
0x05  0x030   1               47  ---    Highest Average Short Term Temperature
0x05  0x038   1               26  ---    Lowest Average Short Term Temperature
0x05  0x040   1               38  ---    Highest Average Long Term Temperature
0x05  0x048   1               32  ---    Lowest Average Long Term Temperature
0x05  0x050   4                -  ---    Time in Over-Temperature
0x05  0x058   1               70  ---    Specified Maximum Operating Temperature
0x05  0x060   4                1  ---    Time in Under-Temperature
0x05  0x068   1                0  ---    Specified Minimum Operating Temperature
0x06  =====   =                =  ===    == Transport Statistics (rev 1) ==
0x06  0x008   4                0  ---    Number of Hardware Resets
0x06  0x010   4                0  ---    Number of ASR Events
0x06  0x018   4                0  ---    Number of Interface CRC Errors
0x07  =====   =                =  ===    == Solid State Device Statistics (rev 1) ==
0x07  0x008   1                2  N--    Percentage Used Endurance Indicator
                                  |||_ C monitored condition met
                                  ||__ D supports DSN
                                  |___ N normalized value

ATA_READ_LOG_EXT (addr=0x11:0x00, page=0, n=1) failed: 48-bit ATA commands not implemented
Read SATA Phy Event Counters failed
comment:4 follow-up: 5 Changed 5 years ago by danielluke (Daniel J. Luke)
Fred - do you know of any published data? (This 2016 USENIX paper is nice, but I'd be interested in newer data: https://www.usenix.org/system/files/conference/fast16/fast16-papers-schroeder.pdf)
Ken - That drive is telling you it has 95% of its life left (whether that translates into something useful depends on a lot of other factors).
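One way to keep an eye on those numbers without the GUI is to parse smartctl's attribute table directly. A minimal sketch, assuming smartctl (from the smartmontools port) is installed and run with sufficient privileges; the default device path and the attribute names are assumptions that match the Micron output above and will differ for other vendors:

#!/usr/bin/env python3
# Print the wear-related SMART attributes from `smartctl -A` output.
# Device path and attribute names are assumptions matching the Micron M500
# output quoted above; other vendors label these attributes differently.
import subprocess
import sys

WEAR_ATTRS = ("Percent_Lifetime_Remain", "Ave_Block-Erase_Count",
              "Reallocate_NAND_Blk_Cnt", "Total_Host_Sector_Write")

def wear_report(device):
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        fields = line.split()
        # attribute rows look like: ID# NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
        if len(fields) >= 8 and fields[1] in WEAR_ATTRS:
            print(f"{fields[1]:28s} normalized={fields[3]}  raw={fields[-1]}")

if __name__ == "__main__":
    wear_report(sys.argv[1] if len(sys.argv) > 1 else "/dev/disk0")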
comment:5 Changed 5 years ago by fhgwright (Fred Wright)
Replying to danielluke:
Fred - do you know of any published data? (This 2016 USENIX paper is nice, but I'd be interested in newer data: https://www.usenix.org/system/files/conference/fast16/fast16-papers-schroeder.pdf)
No, my information is only internal, and somewhat older since I left Google in 2013.
I suspect that the replacement rate for HDDs would have gone down if anyone had implemented my suggestion to add surface scans to the automated repair workflow. A simple reinstall without a surface scan may miss many bad blocks.
It's also worth noting that, in the Google server context, almost all data is part of a higher-level redundant storage system, where bad blocks (or even entire failed machines) don't result in any real data loss. Home and small-business systems don't usually have that luxury.
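For the surface-scan point, the drive's built-in SMART extended self-test is one approximation that can be scripted from the host as part of a repair workflow. A minimal sketch, assuming smartctl is available and the device path is known; the polling interval is arbitrary:

#!/usr/bin/env python3
# Start a SMART extended self-test (a full read scan of the media) and poll the
# self-test log until it finishes. Device path and polling interval are assumptions.
import subprocess
import sys
import time

def extended_selftest(device, poll_minutes=10):
    subprocess.run(["smartctl", "-t", "long", device], check=True)
    while True:
        time.sleep(poll_minutes * 60)
        log = subprocess.run(["smartctl", "-l", "selftest", device],
                             capture_output=True, text=True).stdout
        for line in log.splitlines():
            if line.lstrip().startswith("# 1"):   # most recent test is entry # 1
                print(line.strip())
                if "in progress" not in line.lower():
                    return "without error" in line.lower()

if __name__ == "__main__":
    ok = extended_selftest(sys.argv[1] if len(sys.argv) > 1 else "/dev/disk0")
    sys.exit(0 if ok else 1)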
comment:6 Changed 4 years ago by ryandesign (Ryan Carsten Schmidt)
I have been meaning to reply to this ticket since creating it.
Replying to fhgwright:
one should never rely on SSDs for primary storage
That assertion needs at least some qualification. Apple has been shipping Macs with SSDs as primary storage for years, so clearly they think SSDs can be relied upon. VMware likewise specifically supports using SSDs as primary VM storage, so they evidently consider it acceptable in at least some cases. We can certainly discuss whether it was the right choice in our case.
Our previous buildbot setup was at Mac OS Forge, where two Xserves (for buildbot workers) and six HP servers (for everything else) connected via 10-gigabit Ethernet to two expensive NetApp filers, each with dozens of hard disks in a RAID.
In mid-2016, when we needed to set up our own hardware to replace what we were losing due to the closure of Mac OS Forge, I lent MacPorts my four Xserves to use as buildbot machines. I initially tried to find a RAID shared storage solution, some kind of low-end version of what the NetApp filers had offered us, but I could find nothing that was remotely affordable. MacPorts does not have a revenue stream, is not a legal entity, and is not set up to accept donations. We do have some funds from our participation in Google Summer of Code, but at the time we did not consider using those funds to purchase a RAID; I don't think the funds would have been sufficient in any case. Instead I decided to forgo shared storage and use local storage on each Xserve. I believed that trying to host three heavy-duty VMs off a single hard disk would be a tremendous reduction in performance compared to what we were used to from the NetApp filers, so I purchased SSDs for the three Xserves we would use as workers. (The fourth Xserve already had an Apple SSD and an Apple hard disk RAID and is used as the buildmaster and fileserver.) I have been very pleased with the performance of the worker VMs running on these SSDs.
It is likely that our buildbot workers incur more writes to these SSDs than an average workload would. It is also possible that the VMs do not see their virtual disks as SSDs; that, even if they do, macOS may not be issuing TRIM commands to them (it has been Apple's custom to issue TRIM commands only to original Apple SSDs); or that any TRIM commands the VMs do issue to those virtual disks are not passed through to the physical SSDs. Any of these would cause write amplification, further increasing the wear on the SSDs and reducing their life expectancy.
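A quick check of what the guest (or host) believes about TRIM support, using system_profiler's standard SATA and NVMe sections, is sketched below. It cannot tell you whether the hypervisor actually forwards TRIM/UNMAP to the physical SSD; that remains something to verify against the hypervisor's documentation:

#!/usr/bin/env python3
# Report whether macOS thinks each SATA/NVMe device supports TRIM. This only
# shows the local view of the (possibly virtual) disk; whether the hypervisor
# passes TRIM/UNMAP through to the physical SSD is a separate question.
import subprocess

def trim_status():
    for data_type in ("SPSerialATADataType", "SPNVMeDataType"):
        out = subprocess.run(["system_profiler", data_type],
                             capture_output=True, text=True).stdout
        for line in out.splitlines():
            line = line.strip()
            if line.startswith("Model:") or line.startswith("TRIM Support:"):
                print(f"[{data_type}] {line}")

if __name__ == "__main__":
    trim_status()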
After losing one of our VM hosts' SSDs in February 2020, I recreated the VMs on hard disks, using a separate hard disk for each VM. I suppose the performance has not been terrible, but I haven't actually compared build times with the VMs that are still on SSDs. I have definitely noticed slower startup times.
In the meantime, in May 2020, we lost a second SSD in a manner very similar to the first. I haven't rebuilt those VMs yet.
I was using Samsung SM951 AHCI 512 GB SSDs in M.2 PCIe adapters. Their endurance is rated at 150 TBW (terabytes written). They were put into service in mid-2016, so they had 3½–4 years of use. Maybe, with the way that we are using them, they have simply been used up. Or maybe they failed for unrelated reasons.
Despite the trouble we had with these SSDs, I intend to get replacement SSDs and to transition back off of the temporary hard disks. Many current 512 GB NVMe SSDs are rated for 300 TBW, so on that basis they should last twice as long as the old ones. Unlike last time, I intend to overprovision the new SSDs to extend their life even further. I will also investigate using trimforce or similar options to force macOS to issue TRIM commands and reduce write amplification. And maybe we can find ways to modify our buildbot / mpbb setup to reduce writes.
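As a sanity check on those endurance ratings, here is the arithmetic implied by the numbers above. It assumes the first SSDs really were worn out rather than failing for unrelated reasons, and that the write load stays the same; the actual write volume on these workers was never measured:

#!/usr/bin/env python3
# If a 150 TBW drive really was worn out after 3.5-4 years of builds, what
# average daily write volume does that imply, and how long would a 300 TBW
# replacement last at the same rate? Arithmetic only, on the figures quoted
# in this ticket; the real write volume was never measured.

def implied_daily_writes_gb(tbw, years):
    return tbw * 1000 / (years * 365)

def expected_life_years(tbw, daily_writes_gb):
    return tbw * 1000 / (daily_writes_gb * 365)

if __name__ == "__main__":
    for years in (3.5, 4.0):
        daily = implied_daily_writes_gb(150, years)
        print(f"150 TBW over {years} years -> ~{daily:.0f} GB of host writes per day")
        print(f"  a 300 TBW drive at that rate -> ~{expected_life_years(300, daily):.1f} years")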
I'm not expecting NVMe to give us much of a speed boost over AHCI PCIe, especially since we're limited to the Xserve's PCIe 2.0 bus; it's just that AHCI PCIe SSDs are impossible to find now: they're all either AHCI SATA, which is slow, or NVMe PCIe. The Xserve's EFI firmware is not compatible with NVMe, but I've found instructions for modifying the firmware to make it NVMe-compatible, so that's what I'm intending to do at the moment.
comment:7 Changed 4 years ago by ryandesign (Ryan Carsten Schmidt)
Resolution:  → wontfix
Status:      new → closed
Let me follow up on this and close it out. While I appreciate the advice and perspective, Fred, I've decided to continue to use SSDs for the build machines. You mentioned that your experience is from 2013 or earlier, and despite your claim to the contrary, my hope and assumption is that SSD reliability and durability improve over time. Hard disks, on the other hand, are definitely destined to fail as they get older, and the Xserves that we use as build servers can only accommodate certain specific hard disk models in their internal drive bays, models that are now over eleven years old. Even though I have several compatible hard disks and most of them still work, I imagine that they will fail eventually and that compatible replacement hard disks will get harder to find.
One of the servers has been running off such hard disks (one per VM) since its SSD died last year. This does work, but is definitely slower. In addition, hard disks use more power and run hot, and I'd like to reduce my servers' power consumption and heat generation where I can, so I intend to buy a replacement SSD for this server eventually, after other current server issues are addressed.
I did already purchase ADATA XPG SX8200 Pro NVMe PCIe SSDs for the other three servers. It turns out the Xserve firmware was not incompatible with NVMe in general; it was only incompatible with the NVMe version of the Samsung SM951. I have overprovisioned the new SSDs to hopefully increase their life. Each VM now has its full amount of RAM reserved from the host. (When I first set up these servers in 2016 I had overbooked the RAM somewhat, not considering the impact that would have on the SSDs.) It's still possible that macOS within each VM may need to swap from time to time for extremely large builds (e.g. tensorflow). I could purchase more RAM.
These new SSDs seem to be working fine. They cost a fraction of what the SM951s cost in 2016. If they last as long as the SM951s did, then by the time they need replacing again (if we're even still using the Xserves then) the monetary cost will be negligible. The biggest cost is really the downtime and the effort of restoring each system onto a new disk. Maybe I can improve my backup process to make restoration easier in the future.
The last of the old SSDs died in February 2021. The third and final SM951 failed but is still readable, and two of its three VMs have been restored onto the new SSD so far. The old Apple SSD that shipped with the Xserve in 2009 and housed the buildmaster also died; the buildmaster is temporarily being served from a hard disk until it can be moved to the new SSD.