Note

AI Translation Notice

This document was automatically translated by hunyuan-turbos-latest model, for reference only.

Source document: kernel/syscall/sys_capget_capset.md
Translation time: 2025-09-25 09:18:48
Translation model: hunyuan-turbos-latest

Please report issues via Community Channel

Design Documentation for sys_capget / sys_capset

This document briefly describes the key design and implementation points of sys_capget and sys_capset in DragonOS, covering version negotiation, user-space data structures, capability bitset rules, and call flows.

Source Code:

kernel/src/process/syscall/sys_cap_get_set.rs
kernel/src/process/cred.rs

Overview

DragonOS follows Linux’s capability interface: user space reads or sets a process’s capability sets via capget/capset.
Capability sets include:
- cap_effective (pE): The capabilities currently in effect for the process
- cap_permitted (pP): The upper limit of capabilities granted to the process
- cap_inheritable (pI): Capabilities that can be inherited by child processes
- cap_bset: Bounding set, limiting the upper bound of obtainable capabilities (used only for rule constraints, not directly read/written in this interface)
- cap_ambient: Ambient set (not modified by capset)
Capability bit width: DragonOS uses 64-bit storage but currently only supports the lower 41 bits (CAP_FULL_SET = (1<<41)-1), with higher bits truncated.

User-Space Data Structures and Versions

Aligned with Linux’s user-space structures:

#include <stdint.h>

// header: cap_user_header_t
struct CapUserHeader {
    uint32_t version; // version number
    int32_t  pid;     // target process: 0 = current process, otherwise the given pid
};

// data: element of the cap_user_data_t array
struct CapUserData {
    uint32_t effective;
    uint32_t permitted;
    uint32_t inheritable;
};
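For readers working on the Rust side, the same layout can be mirrored with `#[repr(C)]`. This is a minimal sketch: the type and field names below simply copy the C definitions above and are not taken from the DragonOS source.

```rust
use std::mem::size_of;

/// Hypothetical Rust mirror of cap_user_header_t (names are assumptions).
#[repr(C)]
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct CapUserHeader {
    pub version: u32, // capability ABI version
    pub pid: i32,     // target process: 0 = current process
}

/// Hypothetical Rust mirror of one cap_user_data_t array element.
#[repr(C)]
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct CapUserData {
    pub effective: u32,
    pub permitted: u32,
    pub inheritable: u32,
}

// Compile-time layout checks: sizes must match the C structures.
const _: () = assert!(size_of::<CapUserHeader>() == 8);
const _: () = assert!(size_of::<CapUserData>() == 12);
```

The compile-time assertions catch accidental layout drift, which matters because these structures are copied byte-for-byte across the user/kernel boundary.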

Version constants:
- _LINUX_CAPABILITY_VERSION_1 = 0x19980330
- _LINUX_CAPABILITY_VERSION_2 = 0x20071026 (deprecated)
- _LINUX_CAPABILITY_VERSION_3 = 0x20080522
Kernel-supported version in DragonOS: _KERNEL_CAPABILITY_VERSION = v3
Number of u32 groups copied per version:
- v1: 1 group (lower 32 bits only)
- v2/v3: 2 groups (lower 32 bits + upper 32 bits)
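The version-to-group mapping above can be sketched as a small lookup. The constant values come from the list above; the function name `groups_for_version` is a hypothetical helper chosen for illustration.

```rust
pub const LINUX_CAPABILITY_VERSION_1: u32 = 0x1998_0330;
pub const LINUX_CAPABILITY_VERSION_2: u32 = 0x2007_1026; // deprecated
pub const LINUX_CAPABILITY_VERSION_3: u32 = 0x2008_0522;

/// Number of CapUserData groups copied for a version, or None if the
/// version is unknown. (Hypothetical helper, not the DragonOS function.)
pub fn groups_for_version(version: u32) -> Option<usize> {
    match version {
        LINUX_CAPABILITY_VERSION_1 => Some(1),
        LINUX_CAPABILITY_VERSION_2 | LINUX_CAPABILITY_VERSION_3 => Some(2),
        _ => None,
    }
}
```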

Aggregation/Splitting Rules:

capset: Aggregates CapUserData[0…tocopy) from user input into one u64 per set; bits above the lower 41 are truncated.
capget: Splits each u64 set into lower and upper 32-bit halves and writes back the number of u32 groups for the requested version (v1: 1 group; v2/v3: 2 groups). When data==NULL, nothing is written and 0 is returned.
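A minimal sketch of the aggregation and splitting for a single set (effective, permitted, or inheritable) follows. The function names `aggregate` and `split` are hypothetical, and the sketch assumes group 0 carries the lower 32 bits and group 1 the upper 32 bits, as described above.

```rust
/// Supported capability bits, as stated in the text: (1 << 41) - 1.
pub const CAP_FULL_SET: u64 = (1u64 << 41) - 1;

/// Join up to two 32-bit groups (low half first) into one u64 and
/// truncate it to the supported bits. A missing group counts as 0.
pub fn aggregate(groups: &[u32]) -> u64 {
    let low = groups.first().copied().unwrap_or(0) as u64;
    let high = groups.get(1).copied().unwrap_or(0) as u64;
    (low | (high << 32)) & CAP_FULL_SET
}

/// Split a u64 set into `tocopy` 32-bit groups (1 for v1, 2 for v2/v3).
pub fn split(set: u64, tocopy: usize) -> Vec<u32> {
    let mut groups = vec![set as u32, (set >> 32) as u32];
    groups.truncate(tocopy);
    groups
}
```

With a v1 request only the lower half survives the round trip, which is why v1 callers cannot observe capabilities numbered 32 and above.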

Version Negotiation and Probe Behavior

capget:
- If version is unknown: Writes back header.version as the kernel-supported version (v3) and returns:
  - If data==NULL: Returns 0 (for probing)
  - If data!=NULL: Returns EINVAL
- If version is valid: Writes back the number of u32 groups for the requested version (v1: 1 group; v2/v3: 2 groups) and returns 0. When data==NULL, returns 0 without writing any data.
capset:
- If version is unknown: Returns EINVAL immediately (capset does not act as a probe), which is closer to Linux behavior.
- data cannot be empty (NULL returns EFAULT).
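The negotiation rules for both calls can be sketched as two pure functions. Everything here is illustrative: the `Errno` enum, the function names, and the `data_is_null` flag are assumptions standing in for the real kernel types and user-pointer checks.

```rust
pub const V1: u32 = 0x1998_0330;
pub const V2: u32 = 0x2007_1026;
pub const V3: u32 = 0x2008_0522;
pub const KERNEL_CAPABILITY_VERSION: u32 = V3;

#[derive(Debug, PartialEq, Eq)]
pub enum Errno {
    EINVAL,
    EFAULT,
}

fn groups_for_version(version: u32) -> Option<usize> {
    match version {
        V1 => Some(1),
        V2 | V3 => Some(2),
        _ => None,
    }
}

/// capget: Ok(Some(n)) means "write n groups, return 0";
/// Ok(None) means "return 0 without writing data".
pub fn capget_negotiate(
    header_version: &mut u32,
    data_is_null: bool,
) -> Result<Option<usize>, Errno> {
    match groups_for_version(*header_version) {
        None => {
            // Unknown version: report the supported one back to the caller.
            *header_version = KERNEL_CAPABILITY_VERSION;
            if data_is_null { Ok(None) } else { Err(Errno::EINVAL) }
        }
        Some(_) if data_is_null => Ok(None),
        Some(n) => Ok(Some(n)),
    }
}

/// capset: the version is checked first, then the data pointer.
pub fn capset_negotiate(header_version: u32, data_is_null: bool) -> Result<usize, Errno> {
    let n = groups_for_version(header_version).ok_or(Errno::EINVAL)?;
    if data_is_null {
        return Err(Errno::EFAULT);
    }
    Ok(n)
}
```

A user-space probe would therefore call capget with version 0 and data==NULL, then read the supported version back from the header.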

Target Process Selection and pid Semantics

capget:
- pid < 0: EINVAL
- pid == 0: Uses the current process
- pid != 0: Looks up the target process (returns ESRCH if not found)
capset:
- pid < 0: EPERM (negative pid targets not allowed)
- pid == 0 or pid == current process pid: Allowed
- pid != 0 and pid != current process pid: EPERM (only self-modification allowed)
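The pid rules above can be sketched as follows. The `exists` callback is a hypothetical stand-in for the kernel's process lookup, and `Errno` is an illustrative enum rather than the kernel's error type.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Errno {
    EINVAL,
    ESRCH,
    EPERM,
}

/// capget: resolve header.pid to the pid whose cred is read.
/// `exists` is a hypothetical lookup callback for pid != 0.
pub fn capget_target<F: Fn(i32) -> bool>(
    pid: i32,
    current: i32,
    exists: F,
) -> Result<i32, Errno> {
    if pid < 0 {
        return Err(Errno::EINVAL);
    }
    if pid == 0 {
        return Ok(current);
    }
    if exists(pid) { Ok(pid) } else { Err(Errno::ESRCH) }
}

/// capset: only the calling process may be modified.
pub fn capset_target(pid: i32, current: i32) -> Result<i32, Errno> {
    if pid < 0 {
        return Err(Errno::EPERM);
    }
    if pid == 0 || pid == current {
        Ok(current)
    } else {
        Err(Errno::EPERM)
    }
}
```

Note the asymmetry: a negative pid is an invalid argument for capget but a permission error for capset.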

Capability Set Rules (capset)

Let:

pE_old = old effective
pP_old = old permitted
pI_old = old inheritable
bset = bounding set
pE_new, pP_new, pI_new derived from user data (already truncated to 41-bit mask)

Constraints:

pE_new ⊆ pP_new
If any bit in pE_new is not in pP_new: EPERM
pP_new ⊆ pP_old (not allowed to elevate permitted)
If pP_new contains any bits not in pP_old: EPERM
pI_new limitation (aligned with Linux’s CAP_SETPCAP and bset constraints)
- If the current process has CAP_SETPCAP_BIT (in the pE_old effective set): pI_new ⊆ (pI_old ∪ pP_old) ∩ bset
  If exceeded: EPERM
- If not: pI_new ⊆ (pI_old ∪ pP_old) and pI_new ⊆ (pI_old ∪ bset)
  Any exceedance: EPERM
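The three constraints can be sketched with plain bitmask arithmetic, where "A ⊆ B" becomes `a & !b == 0`. The `CapSets` struct and `check_capset` function are hypothetical names; `CAP_SETPCAP_BIT` assumes the Linux numbering, in which CAP_SETPCAP is capability 8.

```rust
/// CAP_SETPCAP is capability number 8 in the Linux numbering (assumed here).
pub const CAP_SETPCAP_BIT: u64 = 1 << 8;

#[derive(Debug, PartialEq, Eq)]
pub enum Errno {
    EPERM,
}

/// Hypothetical snapshot of the old credential sets.
#[derive(Debug, Clone, Copy)]
pub struct CapSets {
    pub effective: u64,
    pub permitted: u64,
    pub inheritable: u64,
    pub bset: u64,
}

/// True when every bit of `a` is also set in `b`.
fn is_subset(a: u64, b: u64) -> bool {
    (a & !b) == 0
}

/// Check the new sets against the rules described in the text.
pub fn check_capset(old: &CapSets, e_new: u64, p_new: u64, i_new: u64) -> Result<(), Errno> {
    // Rule 1: effective must stay within the new permitted set.
    if !is_subset(e_new, p_new) {
        return Err(Errno::EPERM);
    }
    // Rule 2: permitted may only shrink.
    if !is_subset(p_new, old.permitted) {
        return Err(Errno::EPERM);
    }
    // Rule 3: inheritable, depending on CAP_SETPCAP in the old effective set.
    let from_old = old.inheritable | old.permitted;
    let inheritable_ok = if (old.effective & CAP_SETPCAP_BIT) != 0 {
        is_subset(i_new, from_old & old.bset)
    } else {
        is_subset(i_new, from_old) && is_subset(i_new, old.inheritable | old.bset)
    };
    if inheritable_ok { Ok(()) } else { Err(Errno::EPERM) }
}
```

Because every rule is a subset test against the old state, a call that passes its current sets back unchanged succeeds as long as effective and inheritable already lie within permitted and the bounding set.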

Note:

Ambient capabilities are not modified by capset and remain unchanged.
The update is applied by cloning the old cred, updating pE/pP/pI on the clone, and then atomically replacing the cred in the PCB (pcb.set_cred).
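The clone-then-replace pattern can be sketched in user-space Rust as below. Only the name `set_cred` comes from the text; the `Cred` and `Pcb` shapes, the `RwLock<Arc<Cred>>` storage, and `apply_capset` are assumptions made to keep the example self-contained, and the real kernel types may differ.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical credential record holding only the capability sets.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Cred {
    pub effective: u64,
    pub permitted: u64,
    pub inheritable: u64,
    pub ambient: u64,
}

/// Hypothetical process control block holding a shared, replaceable cred.
pub struct Pcb {
    cred: RwLock<Arc<Cred>>,
}

impl Pcb {
    pub fn new(cred: Cred) -> Self {
        Pcb { cred: RwLock::new(Arc::new(cred)) }
    }

    /// Take a snapshot of the current cred.
    pub fn cred(&self) -> Arc<Cred> {
        Arc::clone(&self.cred.read().unwrap())
    }

    /// Swap in a new cred; existing snapshots keep the old values.
    pub fn set_cred(&self, new: Cred) {
        *self.cred.write().unwrap() = Arc::new(new);
    }
}

/// Clone the old cred, update pE/pP/pI, leave ambient untouched, replace.
pub fn apply_capset(pcb: &Pcb, e_new: u64, p_new: u64, i_new: u64) {
    let old = pcb.cred();
    let mut new_cred = (*old).clone();
    new_cred.effective = e_new;
    new_cred.permitted = p_new;
    new_cred.inheritable = i_new;
    pcb.set_cred(new_cred);
}
```

Readers holding an earlier snapshot never observe a half-updated cred, since the three sets change together in a single replacement.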

Flowchart

Main flow of capget:

[Read header(version,pid)]
        |
   [Version valid?]
      /     \
    No       Yes
    |         |
[Write back header.version=v3]     [Select pid]
        |                           |-- pid<0 -> EINVAL
   [data==NULL?]                    |-- pid==0 -> current process cred
      /     \                       |-- pid!=0 -> look up target task
    Yes      No                     |              |- not found -> ESRCH
    |         |                     |              |- found -> target cred
 return 0   EINVAL                  |
                                    [Split e/p/i into low/high 32 bits]
                                    [data==NULL?]
                                      /       \
                                    Yes        No
                                     |          |
                                  return 0   write back to user buffer, return 0

Main flow of capset:

[Read header(version,pid)]
        |
   [Version valid?]
      /     \
    No       Yes
    |         |
  EINVAL   [data==NULL?]
              /      \
            Yes       No
             |         |
           EFAULT    [Check pid]
                      |- pid<0 -> EPERM
                      |- pid!=0 and pid!=self -> EPERM
                      |- pid==0 or pid==self -> [Read user data and aggregate pE/pP/pI]
                                      [Rule 1: pE_new ⊆ pP_new?]  No -> EPERM
                                      [Rule 2: pP_new ⊆ pP_old?]  No -> EPERM
                                      [Rule 3: pI_new within CAP_SETPCAP/bset limits?]  No -> EPERM
                                      [Clone cred, update pE/pP/pI]
                                      [Atomically replace via pcb.set_cred]
                                      return 0

Capability Bit Width and Masks

Apply masks to e/p/i during aggregation:

mask = CAPFlags::CAP_FULL_SET.bits() = (1<<41)-1
Higher bits are truncated to ensure cross-version compatibility and consistency with the current implementation.
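The value 41 corresponds to capability numbers 0 through 40. As a sketch, the mask can be derived from the highest supported capability number; `CAP_LAST_CAP = 40` assumes the Linux numbering (where 40 is CAP_CHECKPOINT_RESTORE), and the helper names are hypothetical.

```rust
/// Highest supported capability number (assumed: Linux numbering, 40).
pub const CAP_LAST_CAP: u32 = 40;

/// All supported bits set: (1 << 41) - 1.
pub const CAP_FULL_SET: u64 = (1u64 << (CAP_LAST_CAP + 1)) - 1;

/// Keep only the supported capability bits.
pub fn truncate_caps(raw: u64) -> u64 {
    raw & CAP_FULL_SET
}

/// Bits of `raw` that the mask would silently drop.
pub fn dropped_bits(raw: u64) -> u64 {
    raw & !CAP_FULL_SET
}
```

A helper like `dropped_bits` is useful in tests to confirm that a request carrying unsupported high bits is truncated rather than rejected.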

Design Trade-offs and Alignment

capget supports “probe” semantics for unknown versions: writes back the supported version and returns 0 when data==NULL.
capset does not take on probing: unknown versions directly return EINVAL, more closely aligned with Linux behavior.
pid constraints are stricter: capset only allows modification of the current process to avoid cross-process permission modifications.
Rules follow the Linux capability model: permitted cannot be elevated; effective must be a subset of permitted; inheritable is constrained by CAP_SETPCAP and bset.

Future Work

Add further interfaces for ambient capabilities and the bounding set (currently capset does not modify ambient).
Introduce more complete capability bit definitions and permission check interfaces.
Extend documentation and test cases to cover more boundary conditions (such as the impact of user namespaces).