Opened 5 years ago
Closed 5 years ago
#59497 closed defect (fixed)
openssh @8.1p1: sshd only works in debug mode
Reported by: | davidfavor (David Favor) | Owned by: | Mihai Moldovan <ionic@…> |
---|---|---|---|
Priority: | Normal | Milestone: | |
Component: | ports | Version: | 2.6.2 |
Keywords: | Cc: | Ionic (Mihai Moldovan) | |
Port: | openssh |
Description
Recent upgrade of openssh began producing odd sshd behavior...
This works as expected...
/opt/local/sbin/sshd -p 22 -f /opt/local/etc/ssh/sshd_config -d -E /var/log/sshd.log
This fails...
/opt/local/sbin/sshd -p 22 -f /opt/local/etc/ssh/sshd_config -E /var/log/sshd.log
The log file shows only...
reseed_prngs: RAND_bytes failed [preauth]
The sshd process continues to run, just refuses any connections with the reseed_prngs error message.
Be great if someone can mention how to fix this.
Thanks!
Change History (17)
comment:1 Changed 5 years ago by davidfavor (David Favor)
comment:2 Changed 5 years ago by davidfavor (David Favor)
Maybe this is the problem...
imac> /opt/local/sbin/sshd -v
/opt/local/sbin/sshd: illegal option -- v
OpenSSH_8.1p1, OpenSSL 1.1.1d 10 Sep 2019
imac> /opt/local/sbin/sshd -T -f /opt/local/etc/ssh/sshd_config
sshd: no hostkeys available -- exiting.
comment:3 Changed 5 years ago by jmroot (Joshua Root)
Cc: | Ionic added |
---|---|
Port: | openssh added |
Summary: | sshd only works in debug mode → openssh @8.1p1: sshd only works in debug mode |
comment:4 Changed 5 years ago by davidfavor (David Favor)
Output using DEBUG3 in config file...
debug1: fd 8 clearing O_NONBLOCK
debug1: Forked child 12919.
debug3: send_rexec_state: entering fd = 11 config len 378
debug3: ssh_msg_send: type 0
debug3: send_rexec_state: done
debug1: rexec start in 8 out 8 newsock 8 pipe 10 sock 11
debug1: inetd sockets after dupping: 5, 5
debug3: BSM audit: connection from 192.168.1.226 port 58625
debug3: BSM audit: iptype 4 machine ID e201a8c0 00000000 00000000 00000000
Connection from 192.168.1.226 port 58625 on 192.168.1.226 port 22
debug1: Local version string SSH-2.0-OpenSSH_8.1
debug1: Remote protocol version 2.0, remote software version OpenSSH_8.1
debug1: match: OpenSSH_8.1 pat OpenSSH* compat 0x04000000
debug2: fd 5 setting O_NONBLOCK
debug3: ssh_sandbox_init: preparing Darwin sandbox
debug2: Network child is on pid 12920
debug3: preauth child monitor started
debug3: ssh_sandbox_child: starting Darwin sandbox [preauth]
reseed_prngs: RAND_bytes failed [preauth]
debug1: do_cleanup [preauth]
debug1: monitor_read_log: child log fd closed
debug3: mm_request_receive entering
debug1: do_cleanup
debug1: Killing privsep child 12920
debug1: audit_event: unhandled event 12
comment:5 Changed 5 years ago by davidfavor (David Favor)
Completely nuked openssh + removed /opt/local/etc/ssh + reinstalled.
Same problem occurs.
comment:6 Changed 5 years ago by Ionic (Mihai Moldovan)
Resolution: | → invalid |
---|---|
Status: | new → closed |
This does not seem to be a packaging issue.
The openssh port does not ship default config files - only example files. If needed, you are supposed to copy and edit them.
It also does not generate host keys after installation - which is why your instance of sshd seems to fail.
You will need to generate host keys, e.g., via /opt/local/bin/ssh-keygen -A.
Closing as invalid.
comment:7 Changed 5 years ago by davidfavor (David Favor)
Did a key generation + unload + load.
Same problem exists.
Note: I've been using MacPorts for years. Installed on many machines. I've always installed openssh + did an initial port load openssh, then sshd simply worked. Maybe this has changed.
If there are required openssh setup steps, could someone point me to the related URL? I can't seem to find any sshd setup conversation anywhere.
Thanks!
comment:8 Changed 5 years ago by Ionic (Mihai Moldovan)
Resolution: | invalid |
---|---|
Status: | closed → reopened |
Hm, or maybe something is broken.
I've generated the host keys, synced the sshd_config file with sshd_config.example and loaded the service.
That did start up, and sshd is listening for connections, but I'm likewise seeing errors on reseed_prngs when connecting to the machine on port 2222.
What's your OS version?
comment:9 Changed 5 years ago by Ionic (Mihai Moldovan)
This issue is weird...
OpenSSL seems to report a seeding error when calling RAND_bytes(), but checking RAND_status() right after the RAND_seed() call returns 1, indicating that the (default) DRBG has been seeded with enough data.
Specifically, this:
error:2406E06E:random number generator:RAND_DRBG_reseed:error retrieving entropy
I guess I'll have to dig deeper.
comment:10 Changed 5 years ago by Ionic (Mihai Moldovan)
Yep, weird indeed.
I cleared the (OpenSSL) error stack before calling RAND_seed() and checked it afterwards - the seeding operation really does seem to fail as well. I don't understand why RAND_status() would return 1 in this case, but aside from that, something seems to be really messed up.
I can only guess at this point, but my best guess would be that this is related to sandboxing or privilege dropping.
It really doesn't happen in debug mode - the OpenSSL error stack remains empty.
I'll test disabling the sandbox and privilege separation tomorrow (if that's even possible).
comment:11 Changed 5 years ago by Ionic (Mihai Moldovan)
Disabling the sandbox or privilege separation is not possible since 7.5 - the option to toggle this was removed in that version.
Regardless, I tested a build without the hpn, gsskex and Apple Keychain integration patches. Same symptom.
I then went ahead and disabled the two sandbox patches, but left the launchd and pam patches applied. Got a password prompt from a detached sshd.
Whatever is breaking OpenSSL, it must be something within these patches.
comment:12 Changed 5 years ago by Ionic (Mihai Moldovan)
Okay, here we go:
OpenSSH already has support for the Apple sandbox, although its default setup seems to be too restrictive for Apple itself for some reason. Hence, they patch it to include and read a custom-crafted profile, and so do we.
The debug mode disables any child forking, and hence also privilege separation, which explains why that worked.
With privilege separation enabled, the child spawned by sshd chroots into some specific directory.
In vanilla OpenSSH, sshd enables the sandbox in the child process after reseeding the OpenSSL RNG and chrooting to that directory.
However, since Apple (and we) use a special profile file, they (and we) enable the sandbox first, then do all the other things. It's mostly just a code move, but an important one, because a chrooted child would never be able to read the special profile file residing outside of the chroot.
All of this has been done in exactly the same fashion for years (a decade or longer) and it never failed.
I'm still clueless why it started to fail. There aren't any obvious code changes that would cause this (at least not within OpenSSH) and my experiments also didn't shed light on this.
For instance, I essentially turned the sandbox into "transparency" mode by just blindly allowing everything. Didn't change a thing. I enabled debugging within the sandbox so that each violation would be logged. Not a thing.
It doesn't look like the sandbox is prohibiting anything, but yet OpenSSL gets into some confused state it can't recover from once the sandbox is turned on. And not even that is true.
As previously explained, the vanilla work flow is like this:
spawn child -> do a lot of other work -> reseed -> chroot -> enable sandbox
Contrast this to the (recently breaking) Apple-patched work flow:
spawn child -> do a lot of other work -> enable sandbox -> reseed -> chroot
If I modify this slightly like this:
spawn child -> do a lot of other work -> reseed -> enable sandbox -> reseed -> chroot
everything seems to work just fine.
This doesn't make sense to me. I understand that a reseeding operation before enabling the sandbox works just fine... essentially because it also does so in vanilla OpenSSH. What I cannot wrap my head around is that subsequent reseeding operations also work just fine after enabling the sandbox.
So, I have a workaround, but I don't want to blindly commit this to a security-critical package until I really understand what is going on.
So far, I have only briefly skimmed the OpenSSL (not -SSH) source code and didn't get into the nitty-gritty details of reseeding, including fetching random data from the system, but it looks like I have to in order to understand what it's doing and why it thinks that it can't gather system entropy.
To that end, I wondered whether sandboxing could change access to (already opened) file descriptors or would be ignorant of them, but that (changing access) doesn't seem to be the case. Hence, should OpenSSL already have an open file descriptor to, say, /dev/random, that FD shouldn't be affected by enabling the sandbox retrospectively. This is a commonly used technique, cf. Chromium.
OpenSSH 8.1p1 introduced a set of more complex IPC between master and child processes by means of not only opening up pipes between the processes, but also sending some data over them. This more complex handling is really the only actual change from 7.9p1 to 8.1p1, but at the same time doesn't explain any of the things experienced.
Further down into the rabbit hole of OpenSSL-debugging it is, then, I guess.
comment:13 Changed 5 years ago by Ionic (Mihai Moldovan)
I finally understood what is going on, hooray. Leaving this here for future generations.
The sandbox never really had a role to play in this issue. Rather, it was a combination of OpenSSH castrating itself and the OpenSSL crypto core being rewritten in 1.1.1* and functioning completely differently compared to older releases (such as 1.1.0*). The sandbox would have affected it, but it never came to that.
What OpenSSL 1.1.1 uses, compared to older versions, is an "AES-CTR DRBG according to NIST standard SP 800-90Ar1". It also introduced crypto objects chaining, such that each random number generator object can be hooked up to another via parenting. They also introduced two global instances of this DRBG - one used for generating random numbers for use with public keys, the other one for generating random data for use with private keys. This makes the code more complicated, but trust me, that's actually a good thing!
Each DRBG has a specific state it is in (uninitialized, ready, error) and a few pools with random data - for seeding, additional data and getting actual randomness out of it.
When a DRBG is created (internally or externally, though for OpenSSH it's really an internal implementation detail in OpenSSL), the code is creating a seed pool - initially comprised of seeding data the application provides - and then tries to get more entropy from the system to add to this pool. This means that a bad seed does not necessarily compromise the random number generator used by OpenSSL, which sounds good!
When it's reseeded or the application requests random data, the internal state is checked. If it's not READY but ERROR, the DRBG is restarted (uninitialized and initialized again) in order to clear the error state - including, if applicable, its parent DRBG instances.
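The restart-on-error behaviour can be sketched as a tiny state machine. This is a toy model only, written from the description above: all names (toy_drbg, toy_generate, toy_entropy_ok and so on) are made up for illustration and do not correspond to OpenSSL's internal API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model: hypothetical names, not OpenSSL's real data structures. */
enum toy_state { TOY_UNINITIALISED, TOY_READY, TOY_ERROR };

struct toy_drbg {
    enum toy_state state;
    struct toy_drbg *parent;     /* optional chained parent, may be NULL */
    bool (*get_entropy)(void);   /* true if system entropy could be fetched */
};

/* Stand-ins for "system entropy available" and "system entropy unavailable". */
bool toy_entropy_ok(void)   { return true; }
bool toy_entropy_fail(void) { return false; }

/* (Re)instantiate: READY if entropy can be fetched, ERROR otherwise. */
bool toy_instantiate(struct toy_drbg *d)
{
    /* A parent that is not READY has to be brought up first. */
    if (d->parent != NULL && d->parent->state != TOY_READY
        && !toy_instantiate(d->parent)) {
        d->state = TOY_ERROR;
        return false;
    }
    d->state = d->get_entropy() ? TOY_READY : TOY_ERROR;
    return d->state == TOY_READY;
}

/* Request random data: a generator that is not READY is restarted first. */
bool toy_generate(struct toy_drbg *d)
{
    assert(d != NULL);
    if (d->state != TOY_READY) {
        d->state = TOY_UNINITIALISED;   /* "uninitialize" ...             */
        if (!toy_instantiate(d))        /* ... and try to initialize again */
            return false;               /* still no entropy: request fails */
    }
    return true;
}
```

In this model a generator whose entropy callback keeps failing stays in the error state and every request fails, which mirrors the repeated RAND_bytes failures in the log; once entropy becomes available, the next request recovers on its own.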
So... why does this fail in a forked OpenSSH child?
As already explained, during initialization, system entropy is fetched through different means. These means, on OS X/macOS, consist of:
- using the getentropy system call to fill a buffer with random bytes (but THAT one is only available on 10.12 and higher!) XOR
- reading random data from system devices like /dev/urandom, /dev/random, /dev/hwrng, /dev/srandom and something else I've forgotten IFF they exist and can be opened successfully. Crucially, they are only opened once and the file descriptor left open for additional, later access if reading from the device actually returned useful data. XOR
- generating entropy via the RDTSC method that reads a high-resolution timer within the CPU XOR
- generating entropy via the RDSEED/RDRAND CPU instruction(s).
There is no other entropy source defined in OpenSSL 1.1.1. For OS X/macOS this list is shortened further, because:
- the RDTSC method is forcefully disabled within OpenSSL (quote: "IMPORTANT NOTE: It is not currently possible to use this code because we are not sure about the amount of randomness it provides. Some SP900 tests have been run, but there is internal skepticism. So for now this code is not used.")
- the RDSEED/RDRAND functions are implemented, but not enabled by default and we don't enable them. That's probably fine, because using a default-disabled function set in a security-related application feels weird.
Additionally, both these methods would only be usable on x86_64 (or maybe also x86) CPUs, which would leave out ppc ones for good.
To recap, on 10.11 and below, the only entropy sources usable by OpenSSL are the system devices /dev/urandom and /dev/random.
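The "open once, keep the descriptor if the device returned data" behaviour described for the device-based source can be sketched like this. It is a simplified illustration using plain POSIX calls, not OpenSSL's code; the function name open_first_entropy_device is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Try each device path in order and return the first descriptor that both
 * opens and yields data. The descriptor is deliberately left open so that
 * later reads can reuse it. Returns -1 if no device is usable.
 */
int open_first_entropy_device(const char *const paths[], size_t count)
{
    assert(paths != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        int fd = open(paths[i], O_RDONLY);
        if (fd == -1)
            continue;        /* missing or not permitted: try the next one */

        unsigned char probe;
        if (read(fd, &probe, 1) == 1)
            return fd;       /* usable: keep it open for later access */

        close(fd);           /* opened, but returned nothing useful */
    }
    return -1;
}
```

Called with a list such as /dev/urandom followed by /dev/random, this returns a descriptor for the first device that works. Note that every attempt goes through open(), which matters for what follows.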
These would work fine, but OpenSSH pulls an additional trigger after enabling the sandbox:
/*
 * The kSBXProfilePureComputation still allows sockets, so
 * we must disable these using rlimit.
 */
rl_zero.rlim_cur = rl_zero.rlim_max = 0;
if (setrlimit(RLIMIT_FSIZE, &rl_zero) == -1)
    fatal("%s: setrlimit(RLIMIT_FSIZE, { 0, 0 }): %s", __func__, strerror(errno));
if (setrlimit(RLIMIT_NOFILE, &rl_zero) == -1)
    fatal("%s: setrlimit(RLIMIT_NOFILE, { 0, 0 }): %s", __func__, strerror(errno));
if (setrlimit(RLIMIT_NPROC, &rl_zero) == -1)
    fatal("%s: setrlimit(RLIMIT_NPROC, { 0, 0 }): %s", __func__, strerror(errno));
This code has been in there for longer than a decade as well and what it does is:
- disabling creating new files with a file size greater than zero (so essentially writing any data to files... and sockets(?))
- disabling OPENING any files or sockets to begin with
- disabling spawning additional processes
That's generally fine, because the forked child is only used for authentication and gets all its internal state from the parent instance it was forked from. It doesn't need to create additional files or network sockets and this makes the process more robust to outside tinkering by buffer overflows or the like. The sandbox also plays a big role in that hardening, of course.
However, you might have noticed a conflict here: thusly spawned processes may not open any new files, but OpenSSL 1.1.1* might need to (and, on older systems, must) open system crypto devices to garner entropy. Boom.
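The conflict can be reproduced outside of sshd with a few lines of POSIX C. This is a minimal sketch, not OpenSSH or OpenSSL code: it forks a throwaway child, applies the same RLIMIT_NOFILE restriction and then attempts the open() that a late reseed would need. The function name is made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child, drop RLIMIT_NOFILE to zero, then try to open /dev/urandom.
 * Returns the errno the child saw from open() (0 if open() unexpectedly
 * succeeded), or -1 if the experiment itself could not be run.
 */
int errno_of_open_after_nofile_zero(void)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        struct rlimit rl_zero;
        rl_zero.rlim_cur = rl_zero.rlim_max = 0;
        if (setrlimit(RLIMIT_NOFILE, &rl_zero) == -1)
            _exit(255);

        int fd = open("/dev/urandom", O_RDONLY);
        /* Report errno through the 8-bit exit status. */
        _exit(fd == -1 ? errno : 0);
    }

    int status;
    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 255 ? -1 : WEXITSTATUS(status);
}
```

On the systems I would expect this to run on, the result is EMFILE ("too many open files"): the device is there and readable, but the process is no longer allowed to obtain a new descriptor for it.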
This also explains why reseeding the DRBG(s) prior to enabling the sandbox works and continues to work afterwards: the operation succeeds, opens the crypto device and leaves it open, keeping the file descriptor around. Subsequent reseeding operations can then continue to use it.
But... why did this work for such a long time without generating errors?
Previous OpenSSL versions (1.1.0 and older) are scary. They also initialize a random number generator, if it wasn't previously initialized, when random data is requested, and that operation would generally also pull in system entropy via system crypto devices on OS X/macOS, but... failures to do so are non-fatal. That state is never recorded properly. Additionally, the random seed and random data in general seem to be hashed in previous versions in order to fill the pool. Also, failures to fill the pool with system entropy do not necessarily lead to failures when fetching random data within the application, since previous OpenSSL versions also unconditionally mix some "pseudo-random" data like the PID, user ID and current timestamp into the pool. And the random pool data also seems to be hashed when the application requests it...
Since OpenSSH only ever requests one byte of random data, that might be just enough to satisfy the condition.
As far as I can tell, the error condition was just masked by OpenSSL's previous implementation.
Now that we know what is going wrong, the remaining question is how to fix it.
Calling the reseed function prior to enabling the sandbox is a valid workaround. By doing this, OpenSSL will open a file descriptor to some crypto device (typically /dev/urandom) and cache it. As long as the device returns data, it shouldn't get closed, so we can continue to use it in the process. The caveat with that approach is that, should the device block at some point and NOT return more data, the file descriptor will be closed and OpenSSL will not be able to reopen it. That normally shouldn't be the case for /dev/urandom, so I don't see this as a huge drawback.
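The idea behind the workaround, reduced to plain POSIX calls, looks roughly like this. It is a sketch under the assumption that an early "reseed" amounts to opening the device and a later one to a read() on the cached descriptor; it is not the actual patch, and the function name is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * In a throwaway child: open /dev/urandom first, then drop RLIMIT_NOFILE to
 * zero, then read from the already-open descriptor.
 * Returns 1 if the late read worked, 0 if it did not, -1 on setup failure.
 */
int cached_fd_survives_nofile_zero(void)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* Step 1: "reseed" early, which opens the device and keeps the fd. */
        int fd = open("/dev/urandom", O_RDONLY);
        if (fd == -1)
            _exit(2);

        /* Step 2: apply the restriction, as the sandbox setup does. */
        struct rlimit rl_zero;
        rl_zero.rlim_cur = rl_zero.rlim_max = 0;
        if (setrlimit(RLIMIT_NOFILE, &rl_zero) == -1)
            _exit(2);

        /* Step 3: a later "reseed" only needs read() on the cached fd. */
        unsigned char buf[16];
        ssize_t n = read(fd, buf, sizeof buf);
        _exit(n == (ssize_t)sizeof buf ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    switch (WEXITSTATUS(status)) {
    case 0:  return 1;
    case 1:  return 0;
    default: return -1;
    }
}
```

The limit only governs obtaining new descriptors, so the read in step 3 goes through. The caveat from above still applies: if the descriptor is ever closed, there is no way to get a new one.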
Alternatively, I could relax the number of open files limitation (to what level, though?) and add a sandbox exception for the crypto devices. That would probably also work, but relax the security limitations a bit too much - i.e., the process could suddenly open and read other files as well. For this reason, I don't like that solution.
I'll probably commit a fix with the first implementation tomorrow.
comment:14 Changed 5 years ago by mouse07410 (Mouse)
Who disabled RDSEED/RDRAND, and why? I understand that some people don't trust it, and the world stock of lithium is limited, so not everybody who needs it may get it. But still, the strength of an RNG is equal to the strength of its strongest component. Meaning - if you combine output from several generators, your resulting randomness will be as good as the best of them.
Do yourself a favor and re-enable it.
comment:15 Changed 5 years ago by Ionic (Mihai Moldovan)
As far as I've seen, it's disabled by default in the OpenSSL upstream configuration. I didn't find a configure option to even enable it while quickly grepping the source code, but it looks like passing --with-rand-seed=os,rdcpu (or something similar) would do that. However, the upstream default is just os. Plus, like I said, it would only work on Intel CPUs AFAIK, but we also have to care for the PowerPC faction. It wouldn't even help universally, but admittedly in most cases.
I'm also pretty sure that mixing entropy of different qualities actually degrades the overall quality, but don't quote me on that. :)
And lastly... OpenSSL doesn't really mix them all together. It picks the first method available and working. The other methods are only tried in case of errors or if no entropy is coming out any longer.
I'm not saying that you don't have a point, but you'd have to discuss that with the OpenSSL port maintainers.
comment:16 Changed 5 years ago by mouse07410 (Mouse)
As a cryptographer I assure you that mixing randomness from different sources only improves it.
comment:17 Changed 5 years ago by Mihai Moldovan <ionic@…>
Owner: | set to Mihai Moldovan <ionic@…> |
---|---|
Resolution: | → fixed |
Status: | reopened → closed |
Looks like detaching sshd into the background is the problem.
Never seen this before.