备注
AI Translation Notice
This document was automatically translated by hunyuan-turbos-latest model, for reference only.
Source document: kernel/memory_management/extable_safe_copy_design.md
Translation time: 2025-11-18 13:03:29
Translation model:
hunyuan-turbos-latest
Please report issues via Community Channel
Design of Secure Memory Copy Scheme Based on Exception Table
备注
Author: Long Jin longjin@dragonos.org
Overview
This document describes the core design concept of a secure memory copy scheme in DragonOS based on the Exception Table mechanism. This solution addresses the issue of safely accessing user-space memory in system call contexts, preventing kernel panics caused by accessing invalid user addresses.
Design Background and Motivation
Problem Definition
During system call processing, the kernel needs to access pointers passed from user space (such as path strings, parameter structures, etc.). These accesses may fail due to:
Unmapped address: The user-provided address has no corresponding VMA (Virtual Memory Area)
Insufficient permissions: The page exists but lacks required permissions
Malicious input: The user intentionally provides illegal addresses
Limitations of Traditional Solutions
TOCTTOU issues with pre-checking solutions:
Addresses may be valid during check but modified by other threads when used
Race condition windows exist
Dilemma of direct access:
Cannot distinguish between “normal page fault” and “illegal access”
Page fault handlers cannot determine whether it’s a kernel bug or user error
Principles of Exception Table Mechanism
Core Idea
The Exception Table mechanism achieves secure user-space access through compile-time marking + runtime lookup:
Compile-time: Generate exception table entries at instructions that may trigger page faults
Runtime: When a page fault occurs, search the exception table and jump to fix-up code
Zero overhead: No performance loss on normal paths
Architectural Diagram
┌─────────────────────────────────────────────────────────────┐
│ 系统调用执行流程 │
├─────────────────────────────────────────────────────────────┤
│ │
│ 用户空间 内核空间 │
│ ┌──────┐ ┌──────────────────────────────┐ │
│ │0x1000│ │ 1. 系统调用入口 │ │
│ │(未映射)─────────→ 2. 拷贝用户数据(带标记) │ │
│ └──────┘ │ ├─ 正常完成 ──→ 返回成功 │ │
│ │ └─ 触发#PF │ │
│ │ ↓ │ │
│ │ 3. 页错误处理器 │ │
│ │ ├─ 查找异常表 │ │
│ │ └─ 找到修复代码地址 │ │
│ │ ↓ │ │
│ │ 4. 修改指令指针(RIP) │ │
│ │ ↓ │ │
│ │ 5. 执行修复代码 │ │
│ │ └─ 设置错误码(-1) │ │
│ │ ↓ │ │
│ │ 6. 返回EFAULT给用户 │ │
│ └──────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Core Data Structures
Exception Table Entry (8-byte aligned):
┌─────────────────┬──────────────────┐
│ 指令相对偏移 │ 修复代码相对偏移 │
│ (4 bytes) │ (4 bytes) │
└─────────────────┴──────────────────┘
Design Highlights:
Uses relative offsets to support ASLR (Address Space Layout Randomization)
8-byte alignment improves cache performance
Stored in read-only segments to prevent tampering
Workflow
编译期:
源码 ──→ 带标记的指令 ──→ 生成异常表条目 ──→ 链接到内核镜像
(rep movsb) (insn→fixup)
运行期:
执行拷贝 ──→ 触发页错误? ─否→ 正常返回
│
是
↓
查找异常表 ──→ 找到? ─否→ 内核panic
│
是
↓
修改RIP到修复代码 ──→ 返回错误码
Typical Execution Scenarios
Scenario: System Call with Invalid Address
Taking the open() system call as an example, demonstrating the operation of the exception table:
用户程序: open(0x1000, O_RDONLY) // 0x1000未映射
│
↓
┌────────────────────────────────┐
│ 1. 进入系统调用 │
│ ├─ 解析路径字符串 │
│ └─ 逐字节拷贝直到'\0' │
└────────────────────────────────┘
│
↓
┌────────────────────────────────┐
│ 2. 拷贝第一个字节时触发页错误 │
│ (地址0x1000不在VMA中) │
└────────────────────────────────┘
│
↓
┌────────────────────────────────┐
│ 3. 页错误处理器 │
│ ├─ 检测到访问用户地址 │
│ ├─ 查找异常表 │
│ └─ 找到对应的修复代码 │
└────────────────────────────────┘
│
↓
┌────────────────────────────────┐
│ 4. 修改指令指针到修复代码 │
│ └─ 设置返回值为错误码 │
└────────────────────────────────┘
│
↓
┌────────────────────────────────┐
│ 5. 系统调用返回EFAULT │
└────────────────────────────────┘
│
↓
用户程序: fd = -1, errno = EFAULT
Key Points:
No need for pre-checking address validity
Page faults are automatically converted to error codes
Kernel won’t panic, user programs receive clear error information
Usage Scenario Analysis
✅ Suitable Scenarios for Exception Table Protection
1. Small Data System Call Parameters
Characteristics:
Small data volume (typically < 4KB)
Single copy operation
Unknown data length (e.g., strings)
Typical Applications:
Path strings:
open(),stat(),execve(), etc.Fixed-size structures:
sigaction,timespec,stat, etc.Small arrays:
iovec[],pollfd[], etc.
Advantages:
Avoids TOCTTOU races: No pre-checking needed
High robustness: User errors won’t cause kernel panics
Acceptable performance: Small data volume, minimal impact even with extra copies
2. Scenarios with Uncertain Address Validity
When address validity cannot be verified by other means, the exception table is the safest choice:
Raw pointers directly provided by users
Addresses that may be concurrently modified in multi-threaded environments
Operations requiring atomicity guarantees
❌ Unsuitable Scenarios for Exception Table Protection
1. Large Data Transfers
Anti-pattern: Double buffering in read/write system calls
用户缓冲区 → 内核临时缓冲区 → 用户缓冲区 ❌
Issues:
Memory waste: Requires additional kernel buffers
Double copying: Data is copied twice
OOM risk: Concurrent large reads/writes may exhaust memory
Correct Approach: Zero-copy
Pre-validate addresses within valid VMAs
Operate directly on user buffers
Page faults trigger normal page fault handling (not errors)
2. Addresses Already Validated in VMAs
If addresses have been validated through VMA checks, the exception table is redundant:
Immediate access after
mmap()DMA buffers
Shared memory regions
In these scenarios, page faults are normal page fault handling (like COW), not errors.
3. Performance-Critical Hot Paths
Avoid frequently calling functions protected by exception tables in loops:
Batch processing: Copy entire arrays at once rather than element-by-element
Pre-validation: Validate addresses outside loops, direct access inside loops
Decision Matrix
Scenario Characteristics |
Data Volume |
Recommended Solution |
Key Considerations |
|---|---|---|---|
Small system call parameters |
< 4KB |
Exception table protection |
Avoid TOCTTOU, improve robustness |
File I/O |
Variable (MB-level) |
Zero-copy |
Performance priority, avoid double buffering |
Access after mmap |
Any |
Direct access |
VMA already validated, normal page fault |
Batch small data |
Cumulative KB-level |
Batch copy |
Reduce system call count |
String parsing |
Unknown |
Exception table protection |
Byte-by-byte scanning, requires robustness |
Security Analysis
Defensive Capabilities
The Exception Table mechanism can defend against:
Null pointer dereferencing: Returns EFAULT instead of segmentation fault
Kernel address injection: User-provided kernel addresses are safely rejected
Race attacks: TOCTTOU windows are eliminated
Out-of-bounds access: Accesses outside VMAs are captured
Security Boundaries
The Exception Table cannot defend against:
Kernel bugs: Such as wild pointer dereferencing
Hardware failures: Physical memory damage
Other exception types: Only handles page faults
Multi-layer Defense
The Exception Table is part of defense in depth:
┌─────────────────────────────────────┐
│ 用户空间权限检查 (SELinux/AppArmor) │ ← 权限层
├─────────────────────────────────────┤
│ 系统调用参数验证 │ ← 逻辑层
├─────────────────────────────────────┤
│ 异常表机制 │ ← 内存安全层
├─────────────────────────────────────┤
│ 硬件页保护 (MMU) │ ← 硬件层
└─────────────────────────────────────┘
Implementation Essentials
Key Technologies
Relative offset encoding: Supports address randomization (ASLR)
Binary search: O(log n) time complexity for fast location
Inline assembly: Precise control over instruction generation and exception table creation
Zero-overhead abstraction: No performance loss on normal paths
Architecture Portability
The Exception Table mechanism can be ported to other architectures:
x86_64: Uses
rep movsbinstructionARM64: Uses
ldp/stpinstruction sequenceRISC-V: Uses
ld/sdinstruction sequence
The core concept remains unchanged, only assembly syntax needs adjustment.