
Sure. Here is the full state archive for **Problem 3 (optimization model based on a monotonically constrained GBDT and a smooth cost)**, which we have just solved.


You can copy the Markdown code block below and paste it into a new conversation window to continue seamlessly with the Problem 4 discussion.


```markdown

# State archive: Problem 3 complete, ready for Problem 4


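**Background**: We are working on Problem C of the 2025 mathematical modeling contest. Problem 1 (correlation analysis), Problem 2 (preliminary optimization), and **Problem 3 (multi-factor fine-grained grouping and test-time selection)** are complete. The model's dilution-effect logic has been corrected, and the results are now consistent with the biology.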


---


### 1. Notation

| Symbol | Variable name | Meaning | Value / unit |
| :--- | :--- | :--- | :--- |
| $Y$ | `Y_conc` | Fetal Y-chromosome concentration | Threshold $0.04$ (4%) |
| $t$ | `Gestational_Age` | Gestational age at testing | weeks ($10 \sim 30$) |
| $B$ | `BMI` | Maternal body mass index | $kg/m^2$ |
| $W, H, A$ | `Weight`, `Height`, `Age` | Weight, height, age | $kg, cm, years$ |
| $C(t)$ | `get_smooth_cost(t)` | Time cost of testing at $t$ | Smooth piecewise function (see formula C) |
| $P_{fail}$ | `p_fail` | Probability that the test fails ($Y < 4\%$) | $P(Y < 0.04 \mid \mathbf{x})$ |
| $\lambda$ | `penalty` | Retest penalty coefficient | $20$ |
| $\sigma_{err}$ | `sigma_err` | Systematic error of the instrument | $0.005$ (0.5%) |
| $\sigma_{model}$ | `sigma_model` | Standard deviation of the model's prediction residuals | approx. $0.0223$ |


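### 2. Key Assumptions

1.  **Monotonicity constraints**: training enforces hard constraints: $Y$ is non-increasing in $B$ and $W$ (dilution effect) and non-decreasing in $t$ (accumulation effect).
2.  **Smooth cost**: the time cost is no longer purely stepwise. It grows linearly between weeks 12 and 18 to model the gradually rising cost of missing the best early window; a jump to the late-stage cost remains after week 18.
3.  **Joint-distribution sampling**: at prediction time, KNN on BMI samples the real joint distribution of Height/Age/Weight instead of using synthetic data (in the snippet, Weight is then derived from BMI and the sampled Height).
4.  **Failure handling**: if the test at time $t$ fails, the retest is assumed to take place 2 weeks later.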


### 3. Model and Formulas

**A. Concentration prediction model (monotonic GBDT)**

$$ \hat{Y} = f_{HGBR}(B, t, W, H, A) \quad \text{s.t.} \quad \frac{\partial \hat{Y}}{\partial B} \le 0,\ \frac{\partial \hat{Y}}{\partial W} \le 0,\ \frac{\partial \hat{Y}}{\partial t} \ge 0 $$


**B. Failure probability**

$$ P_{fail}(t) = \Phi\left( \frac{0.04 - \hat{Y}}{\sqrt{\sigma_{model}^2 + \sigma_{err}^2}} \right) $$
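
Formula B can be sketched in a few lines. This is a minimal illustration that assumes the values from the notation table ($\sigma_{model} \approx 0.0223$, $\sigma_{err} = 0.005$); `normal_cdf` and `fail_probability` are hypothetical helper names, and `math.erf` stands in for `scipy.stats.norm.cdf` so the sketch runs on the standard library alone.

```python
import math

SIGMA_MODEL = 0.0223  # approximate residual std, taken from the notation table
SIGMA_ERR = 0.005     # instrument error, taken from the notation table
THRESHOLD = 0.04      # Y concentration below this counts as a failed test


def normal_cdf(x):
    """Standard normal CDF, Phi(x), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def fail_probability(y_hat, sigma_model=SIGMA_MODEL, sigma_err=SIGMA_ERR):
    """P(Y < THRESHOLD) when Y is normal around y_hat with the combined variance."""
    sigma_total = math.sqrt(sigma_model ** 2 + sigma_err ** 2)
    return normal_cdf((THRESHOLD - y_hat) / sigma_total)
```

A prediction exactly at the threshold gives a failure probability of 0.5, and the probability falls as $\hat{Y}$ rises above 0.04.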


**C. Smooth cost function**

$$
C(t) = \begin{cases}
1.0 & t \le 12.0 \\
1.0 + 0.5 \times (t - 12.0) & 12.0 < t \le 18.0 \\
10.0 & t > 18.0
\end{cases}
$$


**D. Expected-risk objective**

$$ \min_{t} E[\text{Risk}]_t = C(t) + P_{fail}(t) \times [ C(t+2) + \lambda ] $$
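
Formulas C and D can be checked by hand with a short sketch. `get_smooth_cost` follows the name in the notation table; `expected_risk` and the constant names are hypothetical, and the failure probability is passed in as a plain number rather than computed by the model.

```python
PENALTY = 20.0       # retest penalty coefficient (lambda)
RETEST_DELAY = 2.0   # weeks until the retest after a failed test


def get_smooth_cost(t):
    """Time cost C(t): flat up to week 12, linear up to week 18, then 10."""
    if t <= 12.0:
        return 1.0
    if t <= 18.0:
        return 1.0 + 0.5 * (t - 12.0)
    return 10.0


def expected_risk(t, p_fail, penalty=PENALTY, delay=RETEST_DELAY):
    """E[Risk]_t = C(t) + P_fail * (C(t + delay) + penalty)."""
    return get_smooth_cost(t) + p_fail * (get_smooth_cost(t + delay) + penalty)


risk_example = expected_risk(13.5, 0.058)  # 1.75 + 0.058 * 22.75, about 3.07
```

With $t = 13.5$ and $P_{fail} = 0.058$ the risk is $1.75 + 0.058 \times (2.75 + 20) \approx 3.07$. Note that $C(t)$ still jumps from 4.0 to 10.0 just after week 18, so any test time whose retest lands beyond week 18 carries the full late-stage cost.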


### 4. Key Numerical Results

Optimization results on the real data (`input_file_2.csv`):

*   **Optimal test time**: **week 13.5** (applies to the vast majority of pregnant women with BMI < 35).
*   **Grouping thresholds**:
    *   **Low Risk ($B < 28$)**: $P_{fail} \approx 5.8\%$, Risk $\approx 3.1$. Strategy: **routine testing**.
    *   **Mid Risk ($28 \le B < 35$)**: $P_{fail} \approx 7\% \sim 15\%$, Risk $\approx 3.4$. Strategy: **testing recommended**.
    *   **High Risk ($B \ge 35$)**: $P_{fail} > 45\%$, Risk $> 12.0$. Strategy: **high-risk alert / referral** (the failure rate is too high for NIPT to be cost-effective).


### 5. Core Python Code for Reproduction (Minimal Snippet)


```python

import pandas as pd
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.neighbors import NearestNeighbors
from scipy.stats import norm


# 1. Load & Preprocess
def parse_weeks(s):
    """Parse gestational age, e.g. '11w+6' -> 11.0 (days after '+' are dropped)."""
    try:
        s = str(s).lower().replace('w', '.')
        return float(s.split('+')[0]) if '+' in s else float(s)
    except ValueError:
        return np.nan

df = pd.read_csv('input_file_2.csv')  # Ensure file exists
col_map = {'Y染色体浓度': 'Y_conc', '孕妇BMI': 'BMI', '检测孕周': 'Gestational_Age',
           '体重': 'Weight', '身高': 'Height', '年龄': 'Age'}
data = df.rename(columns={k: v for k, v in col_map.items() if k in df.columns})
data['Gestational_Age'] = data['Gestational_Age'].apply(parse_weeks)
cols = ['Y_conc', 'BMI', 'Gestational_Age', 'Weight', 'Height', 'Age']
data = data.dropna(subset=cols)
data = data[(data['Y_conc'] > 0.001) & (data['BMI'] < 60) & (data['Gestational_Age'] >= 10)]
# Outlier removal: keep Y_conc within 3 standard deviations of the mean
data = data[np.abs(data['Y_conc'] - data['Y_conc'].mean()) <= 3 * data['Y_conc'].std()]


# 2. Monotonic Model Training
features = ['BMI', 'Gestational_Age', 'Weight', 'Height', 'Age']
# Constraints: BMI(-1), GA(+1), Weight(-1), Height(0), Age(0)
monotonic_cst = [-1, 1, -1, 0, 0]
model = HistGradientBoostingRegressor(monotonic_cst=monotonic_cst, learning_rate=0.05,
                                      max_iter=500, random_state=2025)
model.fit(data[features], data['Y_conc'])

# Sigma calculation: combine in-sample residual spread with instrument error
residuals = data['Y_conc'] - model.predict(data[features])
sigma_model = np.std(residuals)
sigma_err = 0.005
sigma_total = np.sqrt(sigma_model**2 + sigma_err**2)


# 3. Risk Calculation Function
knn = NearestNeighbors(n_neighbors=50).fit(data[['BMI']])

def get_smooth_cost(t):
    """Time cost C(t): flat up to week 12, linear up to week 18, then 10."""
    if t <= 12:
        return 1.0
    if t <= 18:
        return 1.0 + 0.5 * (t - 12)
    return 10.0

def get_strategy(bmi):
    """Return (best_t, best_risk, p_fail at best_t) for a given BMI."""
    time_grid = np.linspace(10, 18, 17)  # 10.0 to 18.0 in 0.5-week steps
    best_t, best_risk, best_p = -1, np.inf, np.nan

    # KNN sampling: real Height/Age values from the nearest BMI neighbours
    _, idxs = knn.kneighbors(pd.DataFrame({'BMI': [bmi]}))
    neighbors = data.iloc[idxs[0]]
    heights = neighbors['Height'].values
    n = len(neighbors)

    for t in time_grid:
        # Cost now, and cost of a retest 2 weeks later
        cost_now = get_smooth_cost(t)
        cost_fut = get_smooth_cost(t + 2)

        # Predict failure probability, averaged over the sampled neighbours
        X_sim = pd.DataFrame({'BMI': [bmi] * n, 'Gestational_Age': [t] * n,
                              'Weight': bmi * (heights / 100) ** 2,  # implied by BMI and height
                              'Height': heights, 'Age': neighbors['Age'].values})
        p_fail = np.mean(norm.cdf((0.04 - model.predict(X_sim)) / sigma_total))

        risk = cost_now + p_fail * (cost_fut + 20)
        if risk < best_risk:
            best_t, best_risk, best_p = t, risk, p_fail

    # best_p is tracked so the returned p_fail belongs to best_t, not the last grid point
    return best_t, best_risk, best_p


# Example Usage
# t, r, p = get_strategy(25)  # Low Risk
# t, r, p = get_strategy(40)  # High Risk

```


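**Task request**:

Based on the archive above, please start **Problem 4** (abnormality determination for female fetuses).

We need to use `input_file_3.csv` (female-fetus data) and the labels in `input_file_2.csv` to build a classification model that detects trisomy of chromosomes 13, 18, and 21. Note that female fetuses have no Y-chromosome data, so other features such as Z-scores and GC content must be used.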

```