mirror of
https://github.com/ChuXunYu/OfficeFileHandle.git
synced 2026-01-31 09:01:25 +00:00
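OK, here is the full state archive for **Problem 3 (an optimization model based on a monotonically constrained GBDT and a smooth cost function)**, which we have just resolved.

You can copy the Markdown code block below and paste it into a new conversation window to continue seamlessly with the discussion of Problem 4.
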
```markdown
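# State Archive: Problem 3 Complete, Ready for Problem 4

**Background**: We are working on Problem C of the 2025 mathematical modeling contest. We have completed Problem 1 (correlation analysis), Problem 2 (preliminary optimization), and **Problem 3 (fine-grained grouping and test-timing selection based on multiple factors)**. The model's dilution-effect logic has now been corrected, and the results are consistent with biological expectations.

---
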
### 1. Notation and Definitions

| Symbol | Variable | Meaning | Value / Unit |
| :--- | :--- | :--- | :--- |
| $Y$ | `Y_conc` | Fetal Y-chromosome concentration | threshold $0.04$ (4%) |
| $t$ | `GA` | Gestational age at testing | weeks ($10 \sim 30$) |
| $B$ | `BMI` | Maternal body mass index | $kg/m^2$ |
| $W, H, A$ | `Weight`, `Height`, `Age` | Weight, height, age | $kg, cm, years$ |
| $C(t)$ | `get_smooth_cost(t)` | Test-timing cost function | smooth piecewise function (see formula) |
| $P_{fail}$ | `p_fail` | Test failure probability ($Y < 4\%$) | $P(Y < 0.04 \mid \mathbf{x})$ |
| $\lambda$ | `penalty` | Retest penalty coefficient | $20$ |
| $\sigma_{err}$ | `sigma_err` | Systematic measurement error of the instrument | $0.005$ (0.5%) |
| $\sigma_{model}$ | `sigma_model` | Standard deviation of the model's prediction residuals | approx. $0.0223$ |

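### 2. Key Assumptions

1. **Strict monotonicity**: During training, the model is constrained so that $Y$ decreases monotonically with $B$ and $W$ (dilution effect) and increases monotonically with $t$ (accumulation effect).
2. **Smooth cost**: The timing cost is no longer a step function; it grows linearly between weeks 12 and 18 to model the gradually rising cost of missing the optimal early window.
3. **Joint-distribution sampling**: At prediction time, KNN on BMI is used to sample the real joint distribution of Height/Age/Weight, rather than synthetic data.
4. **Failure handling**: If the test at time $t$ fails, a retest is assumed to take place 2 weeks later.
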
### 3. Models and Formulas

**A. Concentration prediction model (Monotonic GBDT)**
$$ \hat{Y} = f_{HGBR}(B, t, W, H, A) \quad \text{s.t.} \quad \frac{\partial \hat{Y}}{\partial B} < 0, \; \frac{\partial \hat{Y}}{\partial W} < 0, \; \frac{\partial \hat{Y}}{\partial t} > 0 $$

**B. Failure probability**
$$ P_{fail}(t) = \Phi\left( \frac{0.04 - \hat{Y}}{\sqrt{\sigma_{model}^2 + \sigma_{err}^2}} \right) $$

**C. Smooth cost function**
$$
C(t) = \begin{cases}
1.0 & t \le 12.0 \\
1.0 + 0.5 \times (t - 12.0) & 12.0 < t \le 18.0 \\
10.0 & t > 18.0
\end{cases}
$$

**D. Expected-risk objective function**
$$ \min_{t} E[Risk]_t = C(t) + P_{fail}(t) \times [ C(t+2) + \lambda ] $$

### 4. Key Numerical Results

Optimization results on the real data (`input_file_2.csv`):

* **Optimal test time**: **13.5 weeks** (applies to the vast majority of pregnant women with BMI < 35).
* **Group boundaries**:
    * **Low Risk ($B < 28$)**: $P_{fail} \approx 5.8\%$, Risk $\approx 3.1$. Strategy: **routine testing**.
    * **Mid Risk ($28 \le B < 35$)**: $P_{fail} \approx 7\%$ to $15\%$, Risk $\approx 3.4$. Strategy: **testing recommended**.
    * **High Risk ($B \ge 35$)**: $P_{fail} > 45\%$, Risk $> 12.0$. Strategy: **high-risk alert / referral** (the failure rate is so high that NIPT is poorly cost-effective).

### 5. Core Python Code for Reproduction (Minimal Snippet)

```python
import numpy as np
import pandas as pd
from scipy.stats import norm
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.neighbors import NearestNeighbors

SIGMA_ERR = 0.005   # instrument measurement error (0.5%)
PENALTY = 20        # retest penalty coefficient (lambda)
THRESHOLD = 0.04    # minimum usable Y-chromosome concentration (4%)

# 1. Load & Preprocess
def parse_weeks(s):
    """Parse strings such as '11w+6' into weeks; the '+days' suffix is dropped."""
    try:
        s = str(s).lower().replace('w', '.')
        return float(s.split('+')[0]) if '+' in s else float(s)
    except ValueError:
        return np.nan

df = pd.read_csv('input_file_2.csv')  # Ensure file exists
col_map = {'Y染色体浓度': 'Y_conc', '孕妇BMI': 'BMI', '检测孕周': 'Gestational_Age',
           '体重': 'Weight', '身高': 'Height', '年龄': 'Age'}
data = df.rename(columns={k: v for k, v in col_map.items() if k in df.columns})
data['Gestational_Age'] = data['Gestational_Age'].apply(parse_weeks)
cols = ['Y_conc', 'BMI', 'Gestational_Age', 'Weight', 'Height', 'Age']
data = data.dropna(subset=cols)
data = data[(data['Y_conc'] > 0.001) & (data['BMI'] < 60) & (data['Gestational_Age'] >= 10)]
# Outlier removal: keep Y_conc within 3 standard deviations of the mean
data = data[np.abs(data['Y_conc'] - data['Y_conc'].mean()) <= 3 * data['Y_conc'].std()]

# 2. Monotonic Model Training
features = ['BMI', 'Gestational_Age', 'Weight', 'Height', 'Age']
# Constraints: BMI(-1), GA(+1), Weight(-1), Height(0), Age(0)
monotonic_cst = [-1, 1, -1, 0, 0]
model = HistGradientBoostingRegressor(monotonic_cst=monotonic_cst, learning_rate=0.05,
                                      max_iter=500, random_state=2025)
model.fit(data[features], data['Y_conc'])

# Sigma calculation: in-sample residual spread combined with instrument error
residuals = data['Y_conc'] - model.predict(data[features])
sigma_total = np.sqrt(np.std(residuals)**2 + SIGMA_ERR**2)

# 3. Risk Calculation Function
knn = NearestNeighbors(n_neighbors=50).fit(data[['BMI']])

def get_smooth_cost(t):
    """Timing cost C(t): flat up to week 12, linear up to week 18, then 10."""
    if t <= 12:
        return 1.0
    return 1.0 + 0.5 * (t - 12) if t <= 18 else 10.0

def get_strategy(bmi):
    time_grid = np.linspace(10, 18, 17)  # 10.0 to 18.0 in steps of 0.5
    best_t, best_risk, best_p = -1, np.inf, np.nan

    # KNN sampling: real Height/Age values of the 50 records closest in BMI
    _, idxs = knn.kneighbors(pd.DataFrame({'BMI': [bmi]}))
    neighbors = data.iloc[idxs[0]]
    n = len(neighbors)

    for t in time_grid:
        # Cost now and at the retest two weeks later
        cost_now = get_smooth_cost(t)
        cost_fut = get_smooth_cost(t + 2)

        # Predict failure probability, averaged over the sampled neighbors
        X_sim = pd.DataFrame({'BMI': [bmi] * n, 'Gestational_Age': [t] * n,
                              'Weight': bmi * (neighbors['Height'].values / 100)**2,
                              'Height': neighbors['Height'].values,
                              'Age': neighbors['Age'].values})
        p_fail = np.mean(norm.cdf((THRESHOLD - model.predict(X_sim[features])) / sigma_total))

        risk = cost_now + p_fail * (cost_fut + PENALTY)
        if risk < best_risk:
            # Store p_fail together with the best time so the returned value matches best_t
            best_t, best_risk, best_p = t, risk, p_fail

    return best_t, best_risk, best_p  # Return best_t, risk, and p_fail at best_t

# Example Usage
# t, r, p = get_strategy(25)  # Low Risk
# t, r, p = get_strategy(40)  # High Risk
```

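**Task request**:
Based on the archive above, please begin **Problem 4** (abnormality determination for female fetuses).
We need to use `input_file_3.csv` (female-fetus data) together with the labels in `input_file_2.csv` to build a classification model that identifies trisomy of chromosomes 13, 18, and 21. Note that female fetuses have no Y-chromosome data, so other features such as Z-scores and GC content must be used.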
```
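
The failure-probability formula in section 3B of the archive can be tried out without any trained model. The following is a minimal sketch, assuming the two standard deviations from the notation table (0.0223 and 0.005); the predicted concentrations passed in are hypothetical values chosen only for illustration, and `normal_cdf` and `failure_probability` are illustrative names that do not appear in the archive's code.

```python
import math

SIGMA_MODEL = 0.0223  # residual standard deviation quoted in the notation table
SIGMA_ERR = 0.005     # instrument error quoted in the notation table
THRESHOLD = 0.04      # 4% Y-chromosome concentration threshold


def normal_cdf(z):
    """Standard normal CDF, written with math.erf so no SciPy is needed."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def failure_probability(y_hat):
    """P(Y < 0.04) for a predicted concentration y_hat under Gaussian noise."""
    sigma_total = math.sqrt(SIGMA_MODEL**2 + SIGMA_ERR**2)
    return normal_cdf((THRESHOLD - y_hat) / sigma_total)


# Hypothetical predicted concentrations, not values from the data set
for y_hat in (0.04, 0.06, 0.08):
    print(y_hat, round(failure_probability(y_hat), 3))
```

A prediction sitting exactly on the threshold gives a failure probability of 0.5, and the probability falls as the predicted concentration moves above 4%.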
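
The expected-risk objective in section 3D can be checked by hand against the figures reported in section 4. This sketch re-implements the cost function from section 3C and plugs in the reported optimum of 13.5 weeks with the quoted failure probabilities; `smooth_cost` and `expected_risk` are illustrative names, and the 2-week retest delay and the penalty of 20 are taken from the archive's assumptions.

```python
def smooth_cost(t):
    """Piecewise timing cost C(t) from section 3C."""
    if t <= 12.0:
        return 1.0
    if t <= 18.0:
        return 1.0 + 0.5 * (t - 12.0)
    return 10.0


def expected_risk(t, p_fail, penalty=20.0, delay=2.0):
    """E[Risk] = C(t) + P_fail * (C(t + delay) + penalty)."""
    return smooth_cost(t) + p_fail * (smooth_cost(t + delay) + penalty)


# Low Risk group: P_fail of about 5.8% at the reported optimum of 13.5 weeks
print(round(expected_risk(13.5, 0.058), 2))  # 3.07
# High Risk group boundary: P_fail of 45% at the same test time
print(round(expected_risk(13.5, 0.45), 2))   # 11.99
```

Both values line up with the reported Risk of about 3.1 for the Low Risk group and the Risk above 12.0 once the failure probability exceeds 45%.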
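
The `parse_weeks` helper in the archive keeps only the whole-week part of a gestational-age string. If finer resolution is wanted, one possible variant is sketched below. It assumes the strings have the form `<weeks>w+<days>` (for example `11w+6`) and that the number after `+` is a day count; that reading of the format, and the name `parse_weeks_with_days`, are assumptions rather than anything stated in the archive. Switching to it would change the training data, so the reported results would need to be recomputed.

```python
import math
import re

# Assumed format: "<weeks>w+<days>", e.g. "11w+6"; "w" and "+<days>" are optional
_WEEKS_PATTERN = re.compile(r'\s*(\d+(?:\.\d+)?)\s*w?\s*(?:\+\s*(\d+))?\s*')


def parse_weeks_with_days(s):
    """Convert a gestational-age string to fractional weeks; NaN if unparseable."""
    match = _WEEKS_PATTERN.fullmatch(str(s).lower())
    if match is None:
        return math.nan
    weeks = float(match.group(1))
    days = int(match.group(2)) if match.group(2) else 0
    return weeks + days / 7.0


print(parse_weeks_with_days('13w'))    # 13.0
print(parse_weeks_with_days('11w+6'))  # 11 weeks plus 6/7 of a week
```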