Files
ChuXun c882a7a216 1
2025-12-27 14:36:56 +08:00

106 lines
4.8 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# 阶段性存档Problem 2 完成,准备进入 Problem 3
**当前状态**:已完成基于 BMI 的单变量分组与时点优化。建立了基于 GPR 的概率预测模型和基于期望风险的决策模型。发现了高 BMI 组的反直觉规律。
---
### 1. 符号与定义 (Notations & Definitions)
| 符号 | 变量名 | 物理含义 | 设定值/单位 |
| :--- | :--- | :--- | :--- |
| $Y(B,t)$ | `Y_conc` | 胎儿 Y 染色体浓度 | 阈值 $0.04$ (4%) |
| $t$ | `GA` | 检测孕周 (Gestational Age) | weeks |
| $B$ | `BMI` | 孕妇身体质量指数 | $kg/m^2$ |
| $C(t)$ | `get_time_cost(t)` | 时间成本函数 | 早期(≤12): 1; 中期(13-27): 10; 晚期(≥28): 100 |
| $P_{fail}$ | `p_fail` | 检测失败概率 ($Y < 4\%$) | $P(Y < 0.04 | B, t)$ |
| $\lambda$ | `penalty` | 重测惩罚系数 | $20$ |
| $\Delta t$ | - | 重测时间推迟 | $2$ 周 |
| $\sigma_{err}$ | `sigma_err` | 仪器检测误差 | 默认 $0.005$ (0.5%) |
### 2. 核心发现 (Key Conclusions)
1. **分组界限 (Grouping)**
* **Group A (Low BMI)**: $BMI < 27.5$
* **Group B (Mid BMI)**: $27.5 \le BMI < 35.5$
* **Group C (High BMI)**: $BMI \ge 35.5$
2. **最佳时点 (Optimal Timing)**
* **Group A**: **10-11 周** (浓度高,早测风险低)。
* **Group B**: **13 周** (推迟以确保浓度达标,避免重测)。
* **Group C**: **10 周** (反直觉结论)。
3. **重要洞察 (Critical Insight for Group C)**
* 对于重度肥胖孕妇 ($BMI > 35$)GPR 模型显示其 Y 染色体浓度随孕周增长**极其缓慢甚至停滞**,且预测方差极大。
* 等待至 20 周并不能显著降低失败率 (仅降低约 3%),但时间成本会增加 10 倍。
* 因此,模型给出的最优解是**“早期止损” (Early Stop)**:尽早检测,若失败则直接转诊,而非盲目等待。**在后续建模中需保留此结论。**
### 3. 数学模型 (The Model)
**A. 浓度预测 (GPR)**
$$ Y(B, t) \sim \mathcal{N}(\mu(B,t), \sigma_{model}^2(B,t) + \sigma_{err}^2) $$
**B. 期望风险最小化 (Optimization)**
$$ \min_{t} E(Risk)_t = C(t) + \Phi\left(\frac{0.04 - \mu(B,t)}{\sqrt{\sigma_{model}^2 + \sigma_{err}^2}}\right) \times [ C(t+2) + \lambda ] $$
### 4. Python 复现代码 (Minimal Snippet)
```python
import pandas as pd
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel as C
from sklearn.preprocessing import StandardScaler
from scipy.stats import norm
# 1. 辅助函数与配置
def get_time_cost(t):
if t < 13.0: return 1.0
elif t < 28.0: return 10.0
else: return 100.0
def parse_weeks(s):
try:
s = str(s).lower().replace(' ', '')
if 'w' in s:
parts = s.split('w')
weeks = float(parts[0])
days = float(parts[1].replace('+', '').replace('d', '')) if len(parts)>1 and parts[1] else 0
return weeks + days / 7.0
return float(s)
except: return np.nan
# 2. 数据加载与 GPR 训练 (快速恢复现场)
df = pd.concat([pd.read_csv('input_file_0.csv'), pd.read_csv('input_file_1.csv')], ignore_index=True)
y_col = [c for c in df.columns if 'Y染色体浓度' in c][0]
data = df.dropna(subset=[y_col]).copy()
data['Gestational_Age'] = data[[c for c in df.columns if '检测孕周' in c][0]].apply(parse_weeks)
data = data.rename(columns={y_col: 'Y_conc', '孕妇BMI': 'BMI', '体重': 'Weight', '身高': 'Height', '年龄': 'Age'})
cols = ['Y_conc', 'BMI', 'Gestational_Age', 'Weight', 'Height', 'Age']
for c in cols: data[c] = pd.to_numeric(data[c], errors='coerce')
data = data.dropna(subset=cols)
data = data[(data['Gestational_Age'] > 0) & (data['BMI'] > 10) & (data['Y_conc'] > 0)]
# 剔除异常值
data_clean = data[np.abs(data['Y_conc'] - data['Y_conc'].mean()) <= 3 * data['Y_conc'].std()]
# 训练 GPR (采样以加速)
sample_data = data_clean.sample(n=min(1000, len(data_clean)), random_state=42)
X_train = sample_data[['BMI', 'Gestational_Age']]
y_train = sample_data['Y_conc']
scaler_X = StandardScaler()
X_scaled = scaler_X.fit_transform(X_train)
y_mean, y_std = y_train.mean(), y_train.std()
y_scaled = (y_train - y_mean) / y_std
kernel = C(1.0) * RBF([1.0, 1.0]) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=0, random_state=42)
gpr.fit(X_scaled, y_scaled)
# 3. 风险计算函数 (供 Problem 3 调用)
def calculate_risk_p2(bmi, t, sigma_err=0.005):
X_q = scaler_X.transform([[bmi, t]])
y_pred, y_std_pred = gpr.predict(X_q, return_std=True)
mu = y_pred[0] * y_std + y_mean
sigma_total = np.sqrt((y_std_pred[0] * y_std)**2 + sigma_err**2)
p_fail = norm.cdf((0.04 - mu) / sigma_total)
risk = get_time_cost(t) + p_fail * (get_time_cost(t + 2) + 20)
return risk, p_fail