mirror of
https://github.com/ChuXunYu/OfficeFileHandle.git
synced 2026-01-31 10:11:25 +00:00
1
This commit is contained in:
106
xlsx2csv/2.md
Normal file
106
xlsx2csv/2.md
Normal file
@@ -0,0 +1,106 @@
|
||||
# 阶段性存档:Problem 2 完成,准备进入 Problem 3
|
||||
|
||||
**当前状态**:已完成基于 BMI 的单变量分组与时点优化。建立了基于 GPR 的概率预测模型和基于期望风险的决策模型。发现了高 BMI 组的反直觉规律。
|
||||
|
||||
---
|
||||
|
||||
### 1. 符号与定义 (Notations & Definitions)
|
||||
|
||||
| 符号 | 变量名 | 物理含义 | 设定值/单位 |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| $Y(B,t)$ | `Y_conc` | 胎儿 Y 染色体浓度 | 阈值 $0.04$ (4%) |
|
||||
| $t$ | `GA` | 检测孕周 (Gestational Age) | weeks |
|
||||
| $B$ | `BMI` | 孕妇身体质量指数 | $kg/m^2$ |
|
||||
| $C(t)$ | `get_time_cost(t)` | 时间成本函数 | 早期(≤12): 1; 中期(13-27): 10; 晚期(≥28): 100 |
|
||||
| $P_{fail}$ | `p_fail` | 检测失败概率 ($Y < 4\%$) | $P(Y < 0.04 | B, t)$ |
|
||||
| $\lambda$ | `penalty` | 重测惩罚系数 | $20$ |
|
||||
| $\Delta t$ | - | 重测时间推迟 | $2$ 周 |
|
||||
| $\sigma_{err}$ | `sigma_err` | 仪器检测误差 | 默认 $0.005$ (0.5%) |
|
||||
|
||||
### 2. 核心发现 (Key Conclusions)
|
||||
|
||||
1. **分组界限 (Grouping)**:
|
||||
* **Group A (Low BMI)**: $BMI < 27.5$
|
||||
* **Group B (Mid BMI)**: $27.5 \le BMI < 35.5$
|
||||
* **Group C (High BMI)**: $BMI \ge 35.5$
|
||||
|
||||
2. **最佳时点 (Optimal Timing)**:
|
||||
* **Group A**: **10-11 周** (浓度高,早测风险低)。
|
||||
* **Group B**: **13 周** (推迟以确保浓度达标,避免重测)。
|
||||
* **Group C**: **10 周** (反直觉结论)。
|
||||
|
||||
3. **重要洞察 (Critical Insight for Group C)**:
|
||||
* 对于重度肥胖孕妇 ($BMI > 35$),GPR 模型显示其 Y 染色体浓度随孕周增长**极其缓慢甚至停滞**,且预测方差极大。
|
||||
* 等待至 20 周并不能显著降低失败率 (仅降低约 3%),但时间成本会增加 10 倍。
|
||||
* 因此,模型给出的最优解是**“早期止损” (Early Stop)**:尽早检测,若失败则直接转诊,而非盲目等待。**在后续建模中需保留此结论。**
|
||||
|
||||
### 3. 数学模型 (The Model)
|
||||
|
||||
**A. 浓度预测 (GPR)**
|
||||
$$ Y(B, t) \sim \mathcal{N}(\mu(B,t), \sigma_{model}^2(B,t) + \sigma_{err}^2) $$
|
||||
|
||||
**B. 期望风险最小化 (Optimization)**
|
||||
$$ \min_{t} E(Risk)_t = C(t) + \Phi\left(\frac{0.04 - \mu(B,t)}{\sqrt{\sigma_{model}^2 + \sigma_{err}^2}}\right) \times [ C(t+2) + \lambda ] $$
|
||||
|
||||
### 4. Python 复现代码 (Minimal Snippet)
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
import numpy as np
|
||||
from sklearn.gaussian_process import GaussianProcessRegressor
|
||||
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel as C
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from scipy.stats import norm
|
||||
|
||||
# 1. 辅助函数与配置
|
||||
def get_time_cost(t):
|
||||
if t < 13.0: return 1.0
|
||||
elif t < 28.0: return 10.0
|
||||
else: return 100.0
|
||||
|
||||
def parse_weeks(s):
|
||||
try:
|
||||
s = str(s).lower().replace(' ', '')
|
||||
if 'w' in s:
|
||||
parts = s.split('w')
|
||||
weeks = float(parts[0])
|
||||
days = float(parts[1].replace('+', '').replace('d', '')) if len(parts)>1 and parts[1] else 0
|
||||
return weeks + days / 7.0
|
||||
return float(s)
|
||||
except: return np.nan
|
||||
|
||||
# 2. 数据加载与 GPR 训练 (快速恢复现场)
|
||||
df = pd.concat([pd.read_csv('input_file_0.csv'), pd.read_csv('input_file_1.csv')], ignore_index=True)
|
||||
y_col = [c for c in df.columns if 'Y染色体浓度' in c][0]
|
||||
data = df.dropna(subset=[y_col]).copy()
|
||||
data['Gestational_Age'] = data[[c for c in df.columns if '检测孕周' in c][0]].apply(parse_weeks)
|
||||
data = data.rename(columns={y_col: 'Y_conc', '孕妇BMI': 'BMI', '体重': 'Weight', '身高': 'Height', '年龄': 'Age'})
|
||||
cols = ['Y_conc', 'BMI', 'Gestational_Age', 'Weight', 'Height', 'Age']
|
||||
for c in cols: data[c] = pd.to_numeric(data[c], errors='coerce')
|
||||
data = data.dropna(subset=cols)
|
||||
data = data[(data['Gestational_Age'] > 0) & (data['BMI'] > 10) & (data['Y_conc'] > 0)]
|
||||
# 剔除异常值
|
||||
data_clean = data[np.abs(data['Y_conc'] - data['Y_conc'].mean()) <= 3 * data['Y_conc'].std()]
|
||||
|
||||
# 训练 GPR (采样以加速)
|
||||
sample_data = data_clean.sample(n=min(1000, len(data_clean)), random_state=42)
|
||||
X_train = sample_data[['BMI', 'Gestational_Age']]
|
||||
y_train = sample_data['Y_conc']
|
||||
scaler_X = StandardScaler()
|
||||
X_scaled = scaler_X.fit_transform(X_train)
|
||||
y_mean, y_std = y_train.mean(), y_train.std()
|
||||
y_scaled = (y_train - y_mean) / y_std
|
||||
|
||||
kernel = C(1.0) * RBF([1.0, 1.0]) + WhiteKernel(noise_level=0.1)
|
||||
gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=0, random_state=42)
|
||||
gpr.fit(X_scaled, y_scaled)
|
||||
|
||||
# 3. 风险计算函数 (供 Problem 3 调用)
|
||||
def calculate_risk_p2(bmi, t, sigma_err=0.005):
|
||||
X_q = scaler_X.transform([[bmi, t]])
|
||||
y_pred, y_std_pred = gpr.predict(X_q, return_std=True)
|
||||
mu = y_pred[0] * y_std + y_mean
|
||||
sigma_total = np.sqrt((y_std_pred[0] * y_std)**2 + sigma_err**2)
|
||||
p_fail = norm.cdf((0.04 - mu) / sigma_total)
|
||||
risk = get_time_cost(t) + p_fail * (get_time_cost(t + 2) + 20)
|
||||
return risk, p_fail
|
||||
Reference in New Issue
Block a user