* Sandboxing strategy
From: Demi Marie Obenour @ 2025-09-09 7:57 UTC
To: Spectrum OS Development

I was thinking about how to sandbox the various per-VM daemons
and came up with the following strategy:

- Each VM gets its own PID and mount namespace and set of user IDs.

- Mount namespace includes /proc, /sys, /dev, and the host rootfs.

- Each service gets its own /tmp and /dev/shm if they are needed at all.

- virtiofsd gets r/w access to the VM private storage.

- IPC namespaces are irrelevant because the kernel is
  built without System V IPC or POSIX message queues.

- Sending signals between services in the namespace is blocked
  by Landlock. Landlock also blocks ptrace() and other nastiness,
  as well as communication via abstract AF_UNIX sockets.

- Since AF_UNIX abstract sockets between services are blocked by
  Landlock and Spectrum builds without IP or even Ethernet on the
  host there is no need for network namespacing.

- The sandbox manager is PID 1 in the VM's PID namespace.
  When s6 tells it to shut down, it tries to gracefully shut
  down the VM. After a timeout or once the VM has shut down,
  it exits, and Linux automatically kills all the processes
  and cleans up the mount namespace.

- The sandbox manager uses prctl(PR_SET_PDEATHSIG) to ensure it
  dies if the parent s6 process dies. This requires s6 to provide
  its own PID to avoid races, but that is easy to implement.

All of this behavior will be hard-coded into C and Rust source code,
so it will be vastly simpler than a generic program that must support
many use-cases.
--
Sincerely,
Demi Marie Obenour (she/her/hers)
* Re: Sandboxing strategy
From: Alyssa Ross @ 2025-09-10 15:11 UTC
To: Demi Marie Obenour
Cc: Spectrum OS Development

Demi Marie Obenour <demiobenour@gmail.com> writes:

> I was thinking about how to sandbox the various per-VM daemons
> and came up with the following strategy:
>
> - Each VM gets its own PID and mount namespace and set of user IDs.

Didn't you say to me we couldn't do PID namespaces without support from
s6?

> - Mount namespace includes /proc, /sys, /dev, and the host rootfs.
>
> - Each service gets its own /tmp and /dev/shm if they are needed at all.

Just a question: if we put services into cgroups, does use of tmpfs get
charged to the appropriate cgroup?

> - virtiofsd gets r/w access to the VM private storage.
>
> - IPC namespaces are irrelevant because the kernel is
>   built without System V IPC or POSIX message queues.
>
> - Sending signals between services in the namespace is blocked
>   by Landlock. Landlock also blocks ptrace() and other nastiness,
>   as well as communication via abstract AF_UNIX sockets.
>
> - Since AF_UNIX abstract sockets between services are blocked by
>   Landlock and Spectrum builds without IP or even Ethernet on the
>   host there is no need for network namespacing.

It doesn't currently, just to be clear. (I'm still putting off using a
custom kernel config on the host until we have better tooling for
keeping up with Nixpkgs.)

> - The sandbox manager is PID 1 in the VM's PID namespace.
>   When s6 tells it to shut down, it tries to gracefully shut
>   down the VM. After a timeout or once the VM has shut down,
>   it exits, and Linux automatically kills all the processes
>   and cleans up the mount namespace.
>
> - The sandbox manager uses prctl(PR_SET_PDEATHSIG) to ensure it
>   dies if the parent s6 process dies. This requires s6 to provide
>   its own PID to avoid races, but that is easy to implement.
>
> All of this behavior will be hard-coded into C and Rust source code,
> so it will be vastly simpler than a generic program that must support
> many use-cases.

This all sounds fine, BUT there are a couple of important things to bear
in mind:

 • This needs to be maintainable. I don't know how much code this is
   going to be or how complex it's going to be, but that this will be
   totally custom does make me a bit concerned.

 • These services are part of our TCB anyway. Sandboxing only gets us
   defense in depth. With that in mind, it's basically never going to
   be worth adding sandboxing if it adds any amount of attack surface.
   One example of that would be user namespaces. They've been a
   consistent source of kernel security issues, and it might be better
   to turn them off entirely than to use them for sandboxing stuff
   that's trusted anyway.
* Re: Sandboxing strategy
From: Alyssa Ross @ 2025-09-10 15:14 UTC
To: Demi Marie Obenour
Cc: Spectrum OS Development

Alyssa Ross <hi@alyssa.is> writes:

> This all sounds fine, BUT there are a couple of important things to bear
> in mind:
>
>  • This needs to be maintainable. I don't know how much code this is
>    going to be or how complex it's going to be, but that this will be
>    totally custom does make me a bit concerned.

When you submit this, it might be helpful if you can structure it as
adding one sandboxing feature at a time (and ideally ordered by your
expectation of least to most controversial), so we can start getting it
in gradually. A small program that adds landlock rules sounds fine.
Once we start getting into namespaces I get a little scared. (Not
saying no, just that I'd expect we'll have to discuss it more.)
* Re: Sandboxing strategy
From: Demi Marie Obenour @ 2025-09-10 20:35 UTC
To: Alyssa Ross
Cc: Spectrum OS Development

On 9/10/25 11:11, Alyssa Ross wrote:
> Demi Marie Obenour <demiobenour@gmail.com> writes:
>
>> I was thinking about how to sandbox the various per-VM daemons
>> and came up with the following strategy:
>>
>> - Each VM gets its own PID and mount namespace and set of user IDs.
>
> Didn't you say to me we couldn't do PID namespaces without support from
> s6?

I was mistaken about this. Without direct support in s6, there is no
way to avoid having a persistent process outside the PID namespace as
s6's direct child, but that is harmless.

>> - Mount namespace includes /proc, /sys, /dev, and the host rootfs.
>>
>> - Each service gets its own /tmp and /dev/shm if they are needed at all.
>
> Just a question: if we put services into cgroups, does use of tmpfs get
> charged to the appropriate cgroup?

It definitely should, especially if the tmpfs is mounted from inside
the cgroup. Whether it actually does I don't know.

>> - virtiofsd gets r/w access to the VM private storage.
>>
>> - IPC namespaces are irrelevant because the kernel is
>>   built without System V IPC or POSIX message queues.
>>
>> - Sending signals between services in the namespace is blocked
>>   by Landlock. Landlock also blocks ptrace() and other nastiness,
>>   as well as communication via abstract AF_UNIX sockets.
>>
>> - Since AF_UNIX abstract sockets between services are blocked by
>>   Landlock and Spectrum builds without IP or even Ethernet on the
>>   host there is no need for network namespacing.
>
> It doesn't currently, just to be clear. (I'm still putting off using a
> custom kernel config on the host until we have better tooling for
> keeping up with Nixpkgs.)

Makes sense.

>> - The sandbox manager is PID 1 in the VM's PID namespace.
>>   When s6 tells it to shut down, it tries to gracefully shut
>>   down the VM. After a timeout or once the VM has shut down,
>>   it exits, and Linux automatically kills all the processes
>>   and cleans up the mount namespace.
>>
>> - The sandbox manager uses prctl(PR_SET_PDEATHSIG) to ensure it
>>   dies if the parent s6 process dies. This requires s6 to provide
>>   its own PID to avoid races, but that is easy to implement.
>>
>> All of this behavior will be hard-coded into C and Rust source code,
>> so it will be vastly simpler than a generic program that must support
>> many use-cases.
>
> This all sounds fine, BUT there are a couple of important things to bear
> in mind:
>
>  • This needs to be maintainable. I don't know how much code this is
>    going to be or how complex it's going to be, but that this will be
>    totally custom does make me a bit concerned.

This should not be too difficult. It's the same system calls used by
container managers, so if there is a problem it should be possible to
get help fairly easily. bubblewrap

>  • These services are part of our TCB anyway. Sandboxing only gets us
>    defense in depth. With that in mind, it's basically never going to
>    be worth adding sandboxing if it adds any amount of attack surface.
>    One example of that would be user namespaces. They've been a
>    consistent source of kernel security issues, and it might be better
>    to turn them off entirely than to use them for sandboxing stuff
>    that's trusted anyway.

Sandboxing virtiofsd is going to be really annoying and will definitely
come at a performance cost. The most efficient way to use virtiofsd
is to give it CAP_DAC_READ_SEARCH in the initial user namespace and
delegate _all_ access control to it. This allows virtiofsd to use
open_by_handle_at() for all filesystem access. Unfortunately,
this also allows virtiofsd to open any file on the filesystem, ignoring
all discretionary access control checks. I don't think Landlock would
work either. SELinux or SMACK might work, but using them is
significantly more complicated.

If one wants to sandbox virtiofsd, one either needs to
use --cache=never or run into an effective resource leak
(https://gitlab.com/virtio-fs/virtiofsd/-/issues/194).
My hope is that in the future the problem will be solved
by DAX and an in-kernel shrinker that is aware of the host
resources it is using. Denial of service would be prevented
by cgroups on the host, addressing the objection mentioned
in the issue comments.

--
Sincerely,
Demi Marie Obenour (she/her/hers)
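
The point above about a persistent process staying outside the PID namespace
follows from how unshare(CLONE_NEWPID) behaves: the caller itself does not
move, and only its next child becomes PID 1 of the new namespace. A minimal
sketch, assuming the caller has CAP_SYS_ADMIN (needed when no user namespace
is involved); the function name run_in_new_pid_namespace() is hypothetical.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper.  Creates a new PID namespace, forks one child into
 * it and waits for that child.  The caller remains in its original PID
 * namespace and plays the role of the process that stays behind as the
 * supervisor's direct child.
 *
 * Returns 1 if the child saw itself as PID 1, -1 if the namespace could
 * not be created (typically EPERM without CAP_SYS_ADMIN) or on any other
 * failure.
 */
int run_in_new_pid_namespace(void)
{
    if (unshare(CLONE_NEWPID) != 0)
        return -1;

    pid_t child = fork();   /* first child after unshare() is PID 1 */
    if (child < 0)
        return -1;
    if (child == 0) {
        /* A real sandbox manager would start the per-VM services here. */
        _exit(getpid() == 1 ? 0 : 1);
    }

    int status = 0;
    if (waitpid(child, &status, 0) != child)
        return -1;
    /* Once PID 1 has exited, the kernel kills whatever is left inside. */
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 1 : -1;
}
```

After the namespace's PID 1 has exited, further fork() calls from the same
caller fail, so a process like this would be used once per VM and then exit.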
* Re: Sandboxing strategy
From: Alyssa Ross @ 2025-09-17 11:27 UTC
To: Demi Marie Obenour
Cc: Spectrum OS Development

Demi Marie Obenour <demiobenour@gmail.com> writes:

> On 9/10/25 11:11, Alyssa Ross wrote:
>> This all sounds fine, BUT there are a couple of important things to bear
>> in mind:
>>
>>  • This needs to be maintainable. I don't know how much code this is
>>    going to be or how complex it's going to be, but that this will be
>>    totally custom does make me a bit concerned.
>
> This should not be too difficult. It's the same system calls used by
> container managers, so if there is a problem it should be possible to
> get help fairly easily. bubblewrap

bubblewrap? :)

>>  • These services are part of our TCB anyway. Sandboxing only gets us
>>    defense in depth. With that in mind, it's basically never going to
>>    be worth adding sandboxing if it adds any amount of attack surface.
>>    One example of that would be user namespaces. They've been a
>>    consistent source of kernel security issues, and it might be better
>>    to turn them off entirely than to use them for sandboxing stuff
>>    that's trusted anyway.
>
> Sandboxing virtiofsd is going to be really annoying and will definitely
> come at a performance cost. The most efficient way to use virtiofsd
> is to give it CAP_DAC_READ_SEARCH in the initial user namespace and
> delegate _all_ access control to it. This allows virtiofsd to use
> open_by_handle_at() for all filesystem access. Unfortunately,
> this also allows virtiofsd to open any file on the filesystem, ignoring
> all discretionary access control checks. I don't think Landlock would
> work either. SELinux or SMACK might work, but using them is
> significantly more complicated.
>
> If one wants to sandbox virtiofsd, one either needs to
> use --cache=never or run into an effective resource leak
> (https://gitlab.com/virtio-fs/virtiofsd/-/issues/194).
> My hope is that in the future the problem will be solved
> by DAX and an in-kernel shrinker that is aware of the host
> resources it is using. Denial of service would be prevented
> by cgroups on the host, addressing the objection mentioned
> in the issue comments.

Do we not trust virtiofsd's built-in sandboxing?
* Re: Sandboxing strategy
From: Demi Marie Obenour @ 2025-09-18 2:34 UTC
To: Alyssa Ross
Cc: Spectrum OS Development

On 9/17/25 07:27, Alyssa Ross wrote:
> Demi Marie Obenour <demiobenour@gmail.com> writes:
>
>> On 9/10/25 11:11, Alyssa Ross wrote:
>>> This all sounds fine, BUT there are a couple of important things to bear
>>> in mind:
>>>
>>>  • This needs to be maintainable. I don't know how much code this is
>>>    going to be or how complex it's going to be, but that this will be
>>>    totally custom does make me a bit concerned.
>>
>> This should not be too difficult. It's the same system calls used by
>> container managers, so if there is a problem it should be possible to
>> get help fairly easily. bubblewrap
>
> bubblewrap? :)

Bubblewrap is a bit more complex than I would like, and doesn't support
useful features like non-recursive bind mounts. I don't know if
minijail supports them, but it might well.

>>>  • These services are part of our TCB anyway. Sandboxing only gets us
>>>    defense in depth. With that in mind, it's basically never going to
>>>    be worth adding sandboxing if it adds any amount of attack surface.
>>>    One example of that would be user namespaces. They've been a
>>>    consistent source of kernel security issues, and it might be better
>>>    to turn them off entirely than to use them for sandboxing stuff
>>>    that's trusted anyway.
>>
>> Sandboxing virtiofsd is going to be really annoying and will definitely
>> come at a performance cost. The most efficient way to use virtiofsd
>> is to give it CAP_DAC_READ_SEARCH in the initial user namespace and
>> delegate _all_ access control to it. This allows virtiofsd to use
>> open_by_handle_at() for all filesystem access. Unfortunately,
>> this also allows virtiofsd to open any file on the filesystem, ignoring
>> all discretionary access control checks. I don't think Landlock would
>> work either. SELinux or SMACK might work, but using them is
>> significantly more complicated.
>>
>> If one wants to sandbox virtiofsd, one either needs to
>> use --cache=never or run into an effective resource leak
>> (https://gitlab.com/virtio-fs/virtiofsd/-/issues/194).
>> My hope is that in the future the problem will be solved
>> by DAX and an in-kernel shrinker that is aware of the host
>> resources it is using. Denial of service would be prevented
>> by cgroups on the host, addressing the objection mentioned
>> in the issue comments.
>
> Do we not trust virtiofsd's built-in sandboxing?

I do trust it, provided that it is verifiable (by dumping the state
of the process at runtime). However, allowing unrestricted
open_by_handle_at() allows opening any file on the system, conditioned
only on the filesystem supporting open_by_handle_at(). Therefore,
sandboxing and using handles for all filesystem access are incompatible.

--
Sincerely,
Demi Marie Obenour (she/her/hers)
* Re: Sandboxing strategy
From: Alyssa Ross @ 2025-09-19 13:17 UTC
To: Demi Marie Obenour
Cc: Spectrum OS Development

Demi Marie Obenour <demiobenour@gmail.com> writes:

> On 9/17/25 07:27, Alyssa Ross wrote:
>> Demi Marie Obenour <demiobenour@gmail.com> writes:
>>
>>> On 9/10/25 11:11, Alyssa Ross wrote:
>>>>  • These services are part of our TCB anyway. Sandboxing only gets us
>>>>    defense in depth. With that in mind, it's basically never going to
>>>>    be worth adding sandboxing if it adds any amount of attack surface.
>>>>    One example of that would be user namespaces. They've been a
>>>>    consistent source of kernel security issues, and it might be better
>>>>    to turn them off entirely than to use them for sandboxing stuff
>>>>    that's trusted anyway.
>>>
>>> Sandboxing virtiofsd is going to be really annoying and will definitely
>>> come at a performance cost. The most efficient way to use virtiofsd
>>> is to give it CAP_DAC_READ_SEARCH in the initial user namespace and
>>> delegate _all_ access control to it. This allows virtiofsd to use
>>> open_by_handle_at() for all filesystem access. Unfortunately,
>>> this also allows virtiofsd to open any file on the filesystem, ignoring
>>> all discretionary access control checks. I don't think Landlock would
>>> work either. SELinux or SMACK might work, but using them is
>>> significantly more complicated.
>>>
>>> If one wants to sandbox virtiofsd, one either needs to
>>> use --cache=never or run into an effective resource leak
>>> (https://gitlab.com/virtio-fs/virtiofsd/-/issues/194).
>>> My hope is that in the future the problem will be solved
>>> by DAX and an in-kernel shrinker that is aware of the host
>>> resources it is using. Denial of service would be prevented
>>> by cgroups on the host, addressing the objection mentioned
>>> in the issue comments.
>>
>> Do we not trust virtiofsd's built-in sandboxing?
>
> I do trust it, provided that it is verifiable (by dumping the state
> of the process at runtime). However, allowing unrestricted
> open_by_handle_at() allows opening any file on the system, conditioned
> only on the filesystem supporting open_by_handle_at(). Therefore,
> sandboxing and using handles for all filesystem access are incompatible.

Wouldn't it be limited to only files on the same filesystem, since you
have to pass a mount FD to open_by_handle_at()?

That's still bad though. So then to start with we just want to make
sure it doesn't have CAP_DAC_READ_SEARCH, and then we hope that
something comes along to address the limitations of that?
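
Making sure a process does not hold CAP_DAC_READ_SEARCH is the kind of
runtime state that can be checked from the outside or from the process
itself. A minimal sketch that reads the effective capability mask from
/proc/self/status; the helper names cap_in_mask() and
effective_cap_present() are hypothetical, and the sketch assumes /proc is
mounted, as the proposal's mount namespace layout says it is.

```c
#include <linux/capability.h>
#include <stdio.h>

/* Returns 1 if capability number cap is set in mask, 0 otherwise. */
int cap_in_mask(unsigned long long mask, int cap)
{
    return (int)((mask >> cap) & 1ULL);
}

/*
 * Returns 1 if the calling process has cap in its effective set, 0 if it
 * does not, and -1 if /proc/self/status could not be read or parsed.
 */
int effective_cap_present(int cap)
{
    FILE *f = fopen("/proc/self/status", "r");
    if (f == NULL)
        return -1;

    char line[256];
    int result = -1;
    while (fgets(line, sizeof(line), f) != NULL) {
        unsigned long long mask;
        /* The line looks like "CapEff:\t000001ffffffffff" (hex mask). */
        if (sscanf(line, "CapEff: %llx", &mask) == 1) {
            result = cap_in_mask(mask, cap);
            break;
        }
    }
    fclose(f);
    return result;
}
```

A wrapper could call effective_cap_present(CAP_DAC_READ_SEARCH) just before
exec'ing virtiofsd and refuse to continue unless the result is 0. The same
CapEff line can be read from /proc/<pid>/status for an already running
process.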
* Re: Sandboxing strategy
From: Demi Marie Obenour @ 2025-09-19 19:37 UTC
To: Alyssa Ross
Cc: Spectrum OS Development

On 9/19/25 09:17, Alyssa Ross wrote:
> Demi Marie Obenour <demiobenour@gmail.com> writes:
>
>> On 9/17/25 07:27, Alyssa Ross wrote:
>>> Demi Marie Obenour <demiobenour@gmail.com> writes:
>>>
>>>> On 9/10/25 11:11, Alyssa Ross wrote:
>>>>>  • These services are part of our TCB anyway. Sandboxing only gets us
>>>>>    defense in depth. With that in mind, it's basically never going to
>>>>>    be worth adding sandboxing if it adds any amount of attack surface.
>>>>>    One example of that would be user namespaces. They've been a
>>>>>    consistent source of kernel security issues, and it might be better
>>>>>    to turn them off entirely than to use them for sandboxing stuff
>>>>>    that's trusted anyway.
>>>>
>>>> Sandboxing virtiofsd is going to be really annoying and will definitely
>>>> come at a performance cost. The most efficient way to use virtiofsd
>>>> is to give it CAP_DAC_READ_SEARCH in the initial user namespace and
>>>> delegate _all_ access control to it. This allows virtiofsd to use
>>>> open_by_handle_at() for all filesystem access. Unfortunately,
>>>> this also allows virtiofsd to open any file on the filesystem, ignoring
>>>> all discretionary access control checks. I don't think Landlock would
>>>> work either. SELinux or SMACK might work, but using them is
>>>> significantly more complicated.
>>>>
>>>> If one wants to sandbox virtiofsd, one either needs to
>>>> use --cache=never or run into an effective resource leak
>>>> (https://gitlab.com/virtio-fs/virtiofsd/-/issues/194).
>>>> My hope is that in the future the problem will be solved
>>>> by DAX and an in-kernel shrinker that is aware of the host
>>>> resources it is using. Denial of service would be prevented
>>>> by cgroups on the host, addressing the objection mentioned
>>>> in the issue comments.
>>>
>>> Do we not trust virtiofsd's built-in sandboxing?
>>
>> I do trust it, provided that it is verifiable (by dumping the state
>> of the process at runtime). However, allowing unrestricted
>> open_by_handle_at() allows opening any file on the system, conditioned
>> only on the filesystem supporting open_by_handle_at(). Therefore,
>> sandboxing and using handles for all filesystem access are incompatible.
>
> Wouldn't it be limited to only files on the same filesystem, since you
> have to pass a mount FD to open_by_handle_at()?

It would, but I think that different mounts count as the same filesystem
for this purpose. open_by_handle_at() in privileged mode bypasses the
VFS layer and goes straight to the underlying filesystem driver. File
handles have low enough entropy that they can be guessed.

> That's still bad though. So then to start with we just want to make
> sure it doesn't have CAP_DAC_READ_SEARCH, and then we hope that
> something comes along to address the limitations of that?

This is correct. There is already one idea for that, which is to
cryptographically sign (and possibly encrypt) file handles so that one
cannot guess them. This would ensure that one cannot get a file handle
without using name_to_handle_at(), which already does access checks.

In the future, it might make sense for virtiofsd to talk to a userspace
filesystem implementation. Depending on how this is implemented, it
might or might not be possible to sandbox virtiofsd in this case.

--
Sincerely,
Demi Marie Obenour (she/her/hers)
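
To make the remark about file handle entropy concrete, the following sketch
asks the kernel for a handle with name_to_handle_at() and reports how many
opaque bytes it contains. It is only an illustration under stated
assumptions: the function name file_handle_size() is hypothetical, the path
passed in is arbitrary, and the call can fail on filesystems that do not
support handles. On many filesystems the handle is only a few bytes long,
which is what makes guessing plausible for a process that is allowed to call
open_by_handle_at().

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>

/*
 * Returns the number of opaque bytes in the kernel's file handle for
 * path, or -1 if the handle could not be obtained (for example because
 * the filesystem does not support file handles).
 */
int file_handle_size(const char *path)
{
    struct file_handle *fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
    if (fh == NULL)
        return -1;

    fh->handle_bytes = MAX_HANDLE_SZ;   /* capacity of the f_handle buffer */
    int mount_id = 0;
    int rc = name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0);

    int size = (rc == 0) ? (int)fh->handle_bytes : -1;
    free(fh);
    return size;
}
```

Unlike open_by_handle_at(), name_to_handle_at() needs no special capability,
which is why signing the handles it returns would leave the unprivileged path
intact.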