备注
AI Translation Notice
This document was automatically translated by hunyuan-turbos-latest model, for reference only.
Source document: kernel/memory_management/extable_safe_copy_design.md
Translation time: 2025-11-21 06:28:18
Translation model:
hunyuan-turbos-latest
Please report issues via Community Channel
Secure Memory Copy Scheme Based on Exception Table
备注
Author: Long Jin longjin@dragonos.org
Overview
This document describes the core design of a secure memory copy scheme in DragonOS based on the Exception Table mechanism. This solution addresses the issue of safely accessing user-space memory in system call contexts, preventing kernel panics caused by accessing invalid user addresses.
Design Background and Motivation
Problem Definition
During system call processing, the kernel needs to access pointers passed from user space (such as path strings, parameter structures, etc.). These accesses may fail:
Unmapped address: The user-provided address has no corresponding VMA (Virtual Memory Area)
Insufficient permissions: The page exists but lacks required permissions
Malicious input: The user intentionally provides illegal addresses
Limitations of Traditional Solutions
TOCTTOU issues with pre-checking solutions:
An address may be valid during check but modified by other threads when used
Race condition window exists
Challenges with direct access:
Cannot distinguish between “normal page fault” and “illegal access”
Page fault handler cannot determine whether it’s a kernel bug or user error
Exception Table Mechanism Principles
Core Idea
The exception table mechanism achieves secure user-space access through compile-time marking + runtime lookup:
Compile-time: Generate exception table entries for instructions that may trigger page faults
Runtime: When a page fault occurs, search the exception table and jump to fix-up code
Zero overhead: No performance loss on the normal path
Architectural Diagram
┌─────────────────────────────────────────────────────────────┐
│ 系统调用执行流程 │
├─────────────────────────────────────────────────────────────┤
│ │
│ 用户空间 内核空间 │
│ ┌──────┐ ┌──────────────────────────────┐ │
│ │0x1000│ │ 1. 系统调用入口 │ │
│ │(未映射)─────────→ 2. 拷贝用户数据(带标记) │ │
│ └──────┘ │ ├─ 正常完成 ──→ 返回成功 │ │
│ │ └─ 触发#PF │ │
│ │ ↓ │ │
│ │ 3. 页错误处理器 │ │
│ │ ├─ 查找异常表 │ │
│ │ └─ 找到修复代码地址 │ │
│ │ ↓ │ │
│ │ 4. 修改指令指针(RIP) │ │
│ │ ↓ │ │
│ │ 5. 执行修复代码 │ │
│ │ └─ 返回剩余未拷贝字节数 │ │
│ │ ↓ │ │
│ │ 6. 返回EFAULT给用户 │ │
│ └──────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Core Data Structures
Exception table entry (8-byte aligned):
┌─────────────────┬──────────────────┐
│ 指令相对偏移 │ 修复代码相对偏移 │
│ (4 bytes) │ (4 bytes) │
└─────────────────┴──────────────────┘
Design highlights:
Uses relative offsets to support ASLR (Address Space Layout Randomization)
8-byte alignment improves cache performance
Stored in read-only segments to prevent tampering
Workflow
编译期:
源码 ──→ 带标记的指令 ──→ 生成异常表条目 ──→ 链接到内核镜像
(rep movsb) (insn→fixup)
运行期:
执行拷贝 ──→ 触发页错误? ─否→ 正常返回
│
是
↓
查找异常表 ──→ 找到? ─否→ 内核panic
│
是
↓
修改RIP到修复代码 ──→ 返回剩余未拷贝字节数
Typical Execution Scenarios
Scenario: System call with invalid address
Taking the open() system call as an example, demonstrating the exception table’s operation:
用户程序: open(0x1000, O_RDONLY) // 0x1000未映射
│
↓
┌────────────────────────────────┐
│ 1. 进入系统调用 │
│ ├─ 解析路径字符串 │
│ └─ 逐字节拷贝直到'\0' │
└────────────────────────────────┘
│
↓
┌────────────────────────────────┐
│ 2. 拷贝第一个字节时触发页错误 │
│ (地址0x1000不在VMA中) │
└────────────────────────────────┘
│
↓
┌────────────────────────────────┐
│ 3. 页错误处理器 │
│ ├─ 检测到访问用户地址 │
│ ├─ 查找异常表 │
│ └─ 找到对应的修复代码 │
└────────────────────────────────┘
│
↓
┌────────────────────────────────┐
│ 4. 修改指令指针到修复代码 │
│ └─ 设置返回值为剩余未拷贝字节数 │
└────────────────────────────────┘
│
↓
┌────────────────────────────────┐
│ 5. 系统调用返回EFAULT │
└────────────────────────────────┘
│
↓
用户程序: fd = -1, errno = EFAULT
Key points:
No need for pre-checking address validity
Page faults are automatically converted to returning the number of remaining uncopied bytes
The kernel won’t panic, and user programs receive clear error information
Usage Scenario Analysis
✅ Scenarios Suitable for Exception Table Protection
1. Small system call parameter data
Characteristics:
Small data volume (typically < 4KB)
One-time copy
Unknown data length (e.g., strings)
Typical applications:
Path strings:
open(),stat(),execve(), etc.Fixed-size structures:
sigaction,timespec,stat, etc.Small arrays:
iovec[],pollfd[], etc.
Advantages:
Avoids TOCTTOU races: No pre-checking needed
High robustness: User errors won’t cause kernel panics
Acceptable performance: Small data volume; even multiple copies have minimal impact
2. Scenarios with uncertain address validity
When address validity cannot be verified by other means, the exception table is the safest choice:
Raw pointers directly provided by users
Addresses that may be concurrently modified in multi-threaded environments
Operations requiring atomicity guarantees
❌ Scenarios Unsuitable for Exception Table Protection
1. Large data transfers
Anti-pattern: Double buffering in read/write system calls
用户缓冲区 → 内核临时缓冲区 → 用户缓冲区 ❌
Issues:
Memory waste: Requires additional kernel buffers
Double copying: Data is copied twice
OOM risk: Large concurrent reads/writes exhaust memory
Correct approach: Zero-copy
Pre-validate addresses within valid VMAs
Operate directly on user buffers
Page faults trigger normal page fault handling (not errors)
2. Addresses already validated within VMAs
If addresses have been verified through VMA checks, the exception table is redundant:
Immediate access after
mmap()DMA buffers
Shared memory regions
In these scenarios, page faults are normal page fault handling (e.g., COW), not errors.
3. Performance-critical hot paths
Avoid frequently calling functions protected by exception tables in loops:
Batch processing: Copy entire arrays at once rather than element by element
Pre-validation: Validate addresses outside loops and access directly within loops
Decision Matrix
Scenario Characteristics |
Data Volume |
Recommended Solution |
Core Considerations |
|---|---|---|---|
Small system call parameters |
< 4KB |
Exception table protection |
Avoid TOCTTOU, improve robustness |
File read/write |
Variable (MB-level) |
Zero-copy |
Performance priority, avoid double buffering |
Access after mmap |
Arbitrary |
Direct access |
VMA already validated, normal page fault |
Batch small data |
Cumulative KB-level |
Batch copy |
Reduce system call count |
String parsing |
Unknown |
Exception table protection |
Byte-by-byte scanning, requires robustness |
Security Analysis
Defensive Capabilities
The exception table mechanism can defend against:
Null pointer dereferencing: Returns EFAULT instead of segmentation fault
Kernel address injection: User-provided kernel addresses are safely rejected
Race attacks: TOCTTOU windows are eliminated
Out-of-bounds access: Accesses outside VMAs are captured
Security Boundaries
The exception table cannot defend against:
Kernel bugs: Such as wild pointer dereferencing
Hardware failures: Physical memory damage
Other exception types: Only handles page faults
Multi-layer Defense
The exception table is part of defense in depth:
┌─────────────────────────────────────┐
│ 用户空间权限检查 (SELinux/AppArmor) │ ← 权限层
├─────────────────────────────────────┤
│ 系统调用参数验证 │ ← 逻辑层
├─────────────────────────────────────┤
│ 异常表机制 │ ← 内存安全层
├─────────────────────────────────────┤
│ 硬件页保护 (MMU) │ ← 硬件层
└─────────────────────────────────────┘
Implementation Essentials
Key Technologies
Relative offset encoding: Supports address randomization (ASLR)
Binary search: O(log n) time complexity for fast location
Inline assembly: Precise control over instruction and exception table generation
Zero-overhead abstraction: No performance loss on the normal path
Architecture Portability
The exception table mechanism can be ported to other architectures:
x86_64: Uses
rep movsbinstructionARM64: Uses
ldp/stpinstruction sequenceRISC-V: Uses
ld/sdinstruction sequence
The core idea remains unchanged, requiring only adjustments to assembly syntax.