备注

AI Translation Notice

This document was automatically translated by hunyuan-turbos-latest model, for reference only.

  • Source document: kernel/memory_management/extable_safe_copy_design.md

  • Translation time: 2025-11-21 06:28:18

  • Translation model: hunyuan-turbos-latest

Please report issues via Community Channel

Secure Memory Copy Scheme Based on Exception Table

备注

Author: Long Jin longjin@dragonos.org

Overview

This document describes the core design of a secure memory copy scheme in DragonOS based on the Exception Table mechanism. This solution addresses the issue of safely accessing user-space memory in system call contexts, preventing kernel panics caused by accessing invalid user addresses.

Design Background and Motivation

Problem Definition

During system call processing, the kernel needs to access pointers passed from user space (such as path strings, parameter structures, etc.). These accesses may fail:

  1. Unmapped address: The user-provided address has no corresponding VMA (Virtual Memory Area)

  2. Insufficient permissions: The page exists but lacks required permissions

  3. Malicious input: The user intentionally provides illegal addresses

Limitations of Traditional Solutions

TOCTTOU issues with pre-checking solutions:

  • An address may be valid during check but modified by other threads when used

  • Race condition window exists

Challenges with direct access:

  • Cannot distinguish between “normal page fault” and “illegal access”

  • Page fault handler cannot determine whether it’s a kernel bug or user error

Exception Table Mechanism Principles

Core Idea

The exception table mechanism achieves secure user-space access through compile-time marking + runtime lookup:

  1. Compile-time: Generate exception table entries for instructions that may trigger page faults

  2. Runtime: When a page fault occurs, search the exception table and jump to fix-up code

  3. Zero overhead: No performance loss on the normal path

Architectural Diagram

┌─────────────────────────────────────────────────────────────┐
│                      系统调用执行流程                          │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   用户空间                  内核空间                           │
│  ┌──────┐         ┌──────────────────────────────┐          │
│  │0x1000│         │ 1. 系统调用入口                │          │
│  │(未映射)─────────→ 2. 拷贝用户数据(带标记)         │          │
│  └──────┘         │    ├─ 正常完成 ──→ 返回成功     │          │
│                   │    └─ 触发#PF                 │          │
│                   │         ↓                    │          │
│                   │ 3. 页错误处理器                │          │
│                   │    ├─ 查找异常表               │          │
│                   │    └─ 找到修复代码地址          │          │
│                   │         ↓                    │          │
│                   │ 4. 修改指令指针(RIP)           │          │
│                   │         ↓                    │          │
│                   │ 5. 执行修复代码                │          │
│                   │    └─ 返回剩余未拷贝字节数      │          │
│                   │         ↓                    │          │
│                   │ 6. 返回EFAULT给用户            │          │
│                   └──────────────────────────────┘          │
└─────────────────────────────────────────────────────────────┘

Core Data Structures

Exception table entry (8-byte aligned):

┌─────────────────┬──────────────────┐
│  指令相对偏移     │  修复代码相对偏移   │
│    (4 bytes)    │    (4 bytes)     │
└─────────────────┴──────────────────┘

Design highlights:

  • Uses relative offsets to support ASLR (Address Space Layout Randomization)

  • 8-byte alignment improves cache performance

  • Stored in read-only segments to prevent tampering

Workflow

编译期:
    源码 ──→ 带标记的指令 ──→ 生成异常表条目 ──→ 链接到内核镜像
              (rep movsb)       (insn→fixup)

运行期:
    执行拷贝 ──→ 触发页错误? ─否→ 正常返回
                     │
                    是
                     ↓
              查找异常表 ──→ 找到? ─否→ 内核panic
                              │
                             是
                              ↓
                    修改RIP到修复代码 ──→ 返回剩余未拷贝字节数

Typical Execution Scenarios

Scenario: System call with invalid address

Taking the open() system call as an example, demonstrating the exception table’s operation:

用户程序: open(0x1000, O_RDONLY)  // 0x1000未映射
         │
         ↓
    ┌────────────────────────────────┐
    │ 1. 进入系统调用                  │
    │    ├─ 解析路径字符串             │
    │    └─ 逐字节拷贝直到'\0'         │
    └────────────────────────────────┘
         │
         ↓
    ┌────────────────────────────────┐
    │ 2. 拷贝第一个字节时触发页错误      │
    │    (地址0x1000不在VMA中)         │
    └────────────────────────────────┘
         │
         ↓
    ┌────────────────────────────────┐
    │ 3. 页错误处理器                  │
    │    ├─ 检测到访问用户地址          │
    │    ├─ 查找异常表                 │
    │    └─ 找到对应的修复代码          │
    └────────────────────────────────┘
         │
         ↓
    ┌────────────────────────────────┐
    │ 4. 修改指令指针到修复代码          │
    │    └─ 设置返回值为剩余未拷贝字节数  │
    └────────────────────────────────┘
         │
         ↓
    ┌────────────────────────────────┐
    │ 5. 系统调用返回EFAULT            │
    └────────────────────────────────┘
         │
         ↓
用户程序: fd = -1, errno = EFAULT

Key points:

  • No need for pre-checking address validity

  • Page faults are automatically converted to returning the number of remaining uncopied bytes

  • The kernel won’t panic, and user programs receive clear error information

Usage Scenario Analysis

✅ Scenarios Suitable for Exception Table Protection

1. Small system call parameter data

Characteristics:

  • Small data volume (typically < 4KB)

  • One-time copy

  • Unknown data length (e.g., strings)

Typical applications:

  • Path strings: open(), stat(), execve(), etc.

  • Fixed-size structures: sigaction, timespec, stat, etc.

  • Small arrays: iovec[], pollfd[], etc.

Advantages:

  • Avoids TOCTTOU races: No pre-checking needed

  • High robustness: User errors won’t cause kernel panics

  • Acceptable performance: Small data volume; even multiple copies have minimal impact

2. Scenarios with uncertain address validity

When address validity cannot be verified by other means, the exception table is the safest choice:

  • Raw pointers directly provided by users

  • Addresses that may be concurrently modified in multi-threaded environments

  • Operations requiring atomicity guarantees

❌ Scenarios Unsuitable for Exception Table Protection

1. Large data transfers

Anti-pattern: Double buffering in read/write system calls

用户缓冲区 → 内核临时缓冲区 → 用户缓冲区  ❌

Issues:

  • Memory waste: Requires additional kernel buffers

  • Double copying: Data is copied twice

  • OOM risk: Large concurrent reads/writes exhaust memory

Correct approach: Zero-copy

  • Pre-validate addresses within valid VMAs

  • Operate directly on user buffers

  • Page faults trigger normal page fault handling (not errors)

2. Addresses already validated within VMAs

If addresses have been verified through VMA checks, the exception table is redundant:

  • Immediate access after mmap()

  • DMA buffers

  • Shared memory regions

In these scenarios, page faults are normal page fault handling (e.g., COW), not errors.

3. Performance-critical hot paths

Avoid frequently calling functions protected by exception tables in loops:

  • Batch processing: Copy entire arrays at once rather than element by element

  • Pre-validation: Validate addresses outside loops and access directly within loops

Decision Matrix

Scenario Characteristics

Data Volume

Recommended Solution

Core Considerations

Small system call parameters

< 4KB

Exception table protection

Avoid TOCTTOU, improve robustness

File read/write

Variable (MB-level)

Zero-copy

Performance priority, avoid double buffering

Access after mmap

Arbitrary

Direct access

VMA already validated, normal page fault

Batch small data

Cumulative KB-level

Batch copy

Reduce system call count

String parsing

Unknown

Exception table protection

Byte-by-byte scanning, requires robustness

Security Analysis

Defensive Capabilities

The exception table mechanism can defend against:

  1. Null pointer dereferencing: Returns EFAULT instead of segmentation fault

  2. Kernel address injection: User-provided kernel addresses are safely rejected

  3. Race attacks: TOCTTOU windows are eliminated

  4. Out-of-bounds access: Accesses outside VMAs are captured

Security Boundaries

The exception table cannot defend against:

  1. Kernel bugs: Such as wild pointer dereferencing

  2. Hardware failures: Physical memory damage

  3. Other exception types: Only handles page faults

Multi-layer Defense

The exception table is part of defense in depth:

┌─────────────────────────────────────┐
│  用户空间权限检查 (SELinux/AppArmor)   │  ← 权限层
├─────────────────────────────────────┤
│  系统调用参数验证                      │  ← 逻辑层
├─────────────────────────────────────┤
│  异常表机制                           │  ← 内存安全层
├─────────────────────────────────────┤
│  硬件页保护 (MMU)                     │  ← 硬件层
└─────────────────────────────────────┘

Implementation Essentials

Key Technologies

  1. Relative offset encoding: Supports address randomization (ASLR)

  2. Binary search: O(log n) time complexity for fast location

  3. Inline assembly: Precise control over instruction and exception table generation

  4. Zero-overhead abstraction: No performance loss on the normal path

Architecture Portability

The exception table mechanism can be ported to other architectures:

  • x86_64: Uses rep movsb instruction

  • ARM64: Uses ldp/stp instruction sequence

  • RISC-V: Uses ld/sd instruction sequence

The core idea remains unchanged, requiring only adjustments to assembly syntax.