1

2026-01-31 09:01:25 +00:00 · 2025-12-27 14:36:56 +08:00
parent 95028f8070
commit c882a7a216
11 changed files with 2455 additions and 4 deletions
--- a/xlsx2csv/3.md
+++ b/xlsx2csv/3.md
@@ -0,0 +1,280 @@
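+Here is the full state archive for **Problem 3 (optimization model based on a monotonically constrained GBDT and a smooth cost function)**, which we have just solved.
+
+Copy the Markdown code block below into a new conversation window to continue seamlessly with the Problem 4 discussion.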
+
+
+
+```markdown
+
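+# State archive: Problem 3 complete, ready to start Problem 4
+
+**Background**: We are working on Problem C of the 2025 mathematical modeling contest. Problem 1 (correlation analysis), Problem 2 (preliminary optimization), and **Problem 3 (fine-grained grouping and test-timing selection based on multiple factors)** are done. The model's dilution-effect logic has been corrected, and the results are now consistent with the expected biology.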
+
+
+
+---
+
+
+
+### 1. Symbols and definitions (Notations)
+
+| Symbol | Variable | Meaning | Value / unit |
+| :--- | :--- | :--- | :--- |
+| $Y$ | `Y_conc` | Fetal Y-chromosome concentration | Threshold $0.04$ (4%) |
+| $t$ | `Gestational_Age` | Gestational age at testing | weeks ($10 \sim 30$) |
+| $B$ | `BMI` | Maternal body mass index | $kg/m^2$ |
+| $W, H, A$ | `Weight`, `Height`, `Age` | Weight, height, age | $kg, cm, years$ |
+| $C(t)$ | `get_smooth_cost(t)` | Time cost of testing | Smooth piecewise function (see formula C) |
+| $P_{fail}$ | `p_fail` | Probability of test failure ($Y < 4\%$) | $P(Y < 0.04 \mid \mathbf{x})$ |
+| $\lambda$ | `penalty` | Retest penalty coefficient | $20$ |
+| $\sigma_{err}$ | `sigma_err` | Instrument measurement error | $0.005$ (0.5%) |
+| $\sigma_{model}$ | `sigma_model` | Standard deviation of model prediction residuals | approx. $0.0223$ |
+
+
+
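+### 2. Key assumptions
+
+1.  **Enforced monotonicity**: the model is trained under hard constraints: $Y$ decreases monotonically with $B$ and $W$ (dilution effect) and increases monotonically with $t$ (accumulation effect).
+2.  **Smooth cost**: the time cost is no longer a pure step function; it grows linearly between weeks 12 and 18 to model the gradually rising price of missing the best early window.
+3.  **Joint-distribution sampling**: at prediction time, KNN on BMI samples the real joint distribution of Height/Age/Weight from the data instead of using synthetic data.
+4.  **Failure handling**: if the test at time $t$ fails, a retest is assumed 2 weeks later.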
+
+
+
+### 3. Model and formulas
+
+**A. Concentration prediction model (monotonic GBDT)**
+
+$$ \hat{Y} = f_{HGBR}(B, t, W, H, A) \quad \text{s.t.} \quad \frac{\partial \hat{Y}}{\partial B} \le 0, \ \frac{\partial \hat{Y}}{\partial W} \le 0, \ \frac{\partial \hat{Y}}{\partial t} \ge 0 $$
+
+
+
+**B. Failure probability**
+
+$$ P_{fail}(t) = \Phi\left( \frac{0.04 - \hat{Y}}{\sqrt{\sigma_{model}^2 + \sigma_{err}^2}} \right) $$
+
+
+
+**C. Smooth cost function**
+
+$$
+C(t) = \begin{cases}
+1.0 & t \le 12.0 \\
+1.0 + 0.5 \times (t - 12.0) & 12.0 < t \le 18.0 \\
+10.0 & t > 18.0
+\end{cases}
+$$
+
+
+
+**D. Expected-risk objective**
+
+$$ \min_{t} E[\text{Risk}]_t = C(t) + P_{fail}(t) \times [ C(t+2) + \lambda ] $$
+
+
+
+### 4. Key numerical results
+
+Optimization results on the real data (`input_file_2.csv`):
+
+*   **Optimal test time**: **week 13.5** (applies to the vast majority of pregnant women with BMI < 35).
+*   **Group boundaries**:
+    *   **Low Risk ($B < 28$)**: $P_{fail} \approx 5.8\%$, Risk $\approx 3.1$. Strategy: **routine testing**.
+    *   **Mid Risk ($28 \le B < 35$)**: $P_{fail} \approx 7\%$ to $15\%$, Risk $\approx 3.4$. Strategy: **testing recommended**.
+    *   **High Risk ($B \ge 35$)**: $P_{fail} > 45\%$, Risk $> 12.0$. Strategy: **high-risk alert / referral** (the failure rate is too high for NIPT to be cost-effective).
+
+### 5. Core Python code for reproduction (Minimal Snippet)
+
+
+
+```python
+
+import numpy as np
+import pandas as pd
+from scipy.stats import norm
+from sklearn.ensemble import HistGradientBoostingRegressor
+from sklearn.neighbors import NearestNeighbors
+
+
+
+# 1. Load & preprocess
+def parse_weeks(s):
+    """Parse a gestational-age string: 'w' becomes '.', anything after '+' is discarded."""
+    try:
+        s = str(s).lower().replace('w', '.')
+        return float(s.split('+')[0]) if '+' in s else float(s)
+    except ValueError:
+        return np.nan
+
+df = pd.read_csv('input_file_2.csv')  # the file must exist in the working directory
+col_map = {'Y染色体浓度': 'Y_conc', '孕妇BMI': 'BMI', '检测孕周': 'Gestational_Age',
+           '体重': 'Weight', '身高': 'Height', '年龄': 'Age'}
+data = df.rename(columns={k: v for k, v in col_map.items() if k in df.columns})
+data['Gestational_Age'] = data['Gestational_Age'].apply(parse_weeks)
+cols = ['Y_conc', 'BMI', 'Gestational_Age', 'Weight', 'Height', 'Age']
+data = data.dropna(subset=cols)
+data = data[(data['Y_conc'] > 0.001) & (data['BMI'] < 60) & (data['Gestational_Age'] >= 10)]
+# Outlier removal: keep Y_conc within 3 standard deviations of the mean
+data = data[np.abs(data['Y_conc'] - data['Y_conc'].mean()) <= 3 * data['Y_conc'].std()]
+
+
+
+# 2. Monotonic model training
+features = ['BMI', 'Gestational_Age', 'Weight', 'Height', 'Age']
+# Constraints: BMI(-1), GA(+1), Weight(-1), Height(0), Age(0)
+monotonic_cst = [-1, 1, -1, 0, 0]
+model = HistGradientBoostingRegressor(monotonic_cst=monotonic_cst, learning_rate=0.05,
+                                      max_iter=500, random_state=2025)
+model.fit(data[features], data['Y_conc'])
+
+# Sigma calculation: in-sample residual std combined with the instrument error (0.005)
+residuals = data['Y_conc'] - model.predict(data[features])
+sigma_total = np.sqrt(np.std(residuals)**2 + 0.005**2)
+
+
+
+# 3. Risk calculation function
+knn = NearestNeighbors(n_neighbors=50).fit(data[['BMI']])
+
+def get_smooth_cost(t):
+    """Time cost C(t): 1 up to week 12, linear up to week 18, then 10."""
+    if t <= 12:
+        return 1.0
+    return 1.0 + 0.5 * (t - 12) if t <= 18 else 10.0
+
+def get_strategy(bmi, penalty=20):
+    time_grid = np.linspace(10, 18, 17)  # 10.0 to 18.0 in 0.5-week steps
+    best_t, best_risk, best_p = -1, np.inf, np.nan
+
+    # KNN sampling: the 50 records whose BMI is closest to the query
+    _, idxs = knn.kneighbors(pd.DataFrame({'BMI': [bmi]}))
+    neighbors = data.iloc[idxs[0]]
+    heights = neighbors['Height'].values
+
+    for t in time_grid:
+        # Predict the failure probability, averaged over the sampled neighbors
+        X_sim = pd.DataFrame({'BMI': bmi, 'Gestational_Age': t,
+                              'Weight': bmi * (heights / 100)**2,  # weight implied by BMI and height
+                              'Height': heights, 'Age': neighbors['Age'].values})
+        p_fail = np.mean(norm.cdf((0.04 - model.predict(X_sim)) / sigma_total))
+
+        # Expected risk: cost now plus the expected cost of a retest 2 weeks later
+        risk = get_smooth_cost(t) + p_fail * (get_smooth_cost(t + 2) + penalty)
+        if risk < best_risk:
+            best_t, best_risk, best_p = t, risk, p_fail
+
+    return best_t, best_risk, best_p  # p_fail is the value at best_t, not at the last grid point
+
+# Example usage
+# t, r, p = get_strategy(25)  # Low Risk
+# t, r, p = get_strategy(40)  # High Risk
+
+```
+
+
+
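+**Task request**:
+
+Based on the archive above, please start **Problem 4** (abnormality detection for female fetuses).
+
+We need to use `input_file_3.csv` (female-fetus data) and the labels in `input_file_2.csv` to build a classification model that detects trisomies of chromosomes 13, 18, and 21. Note that female fetuses have no Y-chromosome data, so other features such as Z-scores and GC content must be used.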
+
+```
+
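
The archived formulas B, C and D can be checked by hand without the dataset. The sketch below is an illustration only, not part of the committed file: it hard-codes the archived constants (threshold 0.04, $\sigma_{model} \approx 0.0223$, $\sigma_{err} = 0.005$, $\lambda = 20$, 2-week retest delay), and the function names `smooth_cost`, `fail_probability` and `expected_risk` are hypothetical. Plugging the reported failure probabilities into formula D at week 13.5 should land close to the reported risk values.

```python
from math import sqrt
from statistics import NormalDist

# Constants copied from the archive (sections 1 and 2); names are illustrative.
THRESHOLD = 0.04      # Y-concentration threshold
SIGMA_MODEL = 0.0223  # approximate residual std reported in the archive
SIGMA_ERR = 0.005     # instrument error
PENALTY = 20.0        # retest penalty (lambda)
RETEST_DELAY = 2.0    # weeks until the retest after a failure


def smooth_cost(t):
    """Formula C: flat until week 12, linear until week 18, then 10."""
    if t <= 12.0:
        return 1.0
    if t <= 18.0:
        return 1.0 + 0.5 * (t - 12.0)
    return 10.0


def fail_probability(y_hat):
    """Formula B: P(Y < threshold) under a normal error model."""
    sigma_total = sqrt(SIGMA_MODEL**2 + SIGMA_ERR**2)
    return NormalDist().cdf((THRESHOLD - y_hat) / sigma_total)


def expected_risk(t, p_fail):
    """Formula D: cost now plus expected cost of a delayed retest."""
    return smooth_cost(t) + p_fail * (smooth_cost(t + RETEST_DELAY) + PENALTY)


# Reported failure probabilities for the Low Risk and High Risk groups at week 13.5
print(f"{expected_risk(13.5, 0.058):.1f}")  # 3.1
print(f"{expected_risk(13.5, 0.45):.1f}")   # 12.0
```

At week 13.5 the cost is 1.75 and the retest cost at week 15.5 is 2.75, so the risk is $1.75 + 22.75 \, P_{fail}$. That gives about 3.1 for $P_{fail} = 5.8\%$ and about 12.0 for $P_{fail} = 45\%$, matching the Low Risk and High Risk figures in section 4.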