Data analysis skills specifically designed for the financial risk control field (#823)

* upload skill datanalysis-credit-risk

* update skill datanalysis-credit-risk

* fix plugin problem

* re-run npm start

* re-run npm start

* change code descriptions in SKILL.md to English and remove personal path

* try to update readme.md

* Updating readme

---------

Co-authored-by: Aaron Powell <me@aaron-powell.com>
This commit is contained in:
REAL-Madrid01
2026-03-02 16:31:21 +08:00
committed by GitHub
parent 4cf83b0161
commit 0ea5aa1156
5 changed files with 1956 additions and 0 deletions


@@ -0,0 +1,113 @@
---
name: datanalysis-credit-risk
description: Credit risk data cleaning and variable screening pipeline for pre-loan modeling. Use when working with raw credit data that needs quality assessment, missing value analysis, or variable selection before modeling. It covers data loading and formatting, abnormal period filtering, missing rate calculation, high-missing variable removal, low-IV variable filtering, high-PSI variable removal, Null Importance denoising, high-correlation variable removal, and cleaning report generation. Applicable scenarios are credit risk data cleaning, variable screening, and pre-loan modeling preprocessing.
---
# Data Cleaning and Variable Screening
## Quick Start
```bash
# Run the complete data cleaning pipeline
python ".github/skills/datanalysis-credit-risk/scripts/example.py"
```
## Complete Process Description
The data cleaning pipeline consists of the following 11 steps, each executed independently without deleting the original data:
1. **Get Data** - Load and format raw data
2. **Organization Sample Analysis** - Statistics of sample count and bad sample rate for each organization
3. **Separate OOS Data** - Separate out-of-sample (OOS) samples from modeling samples
4. **Filter Abnormal Months** - Remove months with insufficient bad sample count or total sample count
5. **Calculate Missing Rate** - Calculate overall and organization-level missing rates for each feature
6. **Drop High Missing Rate Features** - Remove features with overall missing rate exceeding threshold
7. **Drop Low IV Features** - Remove features with overall IV too low or IV too low in too many organizations
8. **Drop High PSI Features** - Remove features with unstable PSI
9. **Null Importance Denoising** - Remove noise features using label permutation method
10. **Drop High Correlation Features** - Remove high correlation features based on original gain
11. **Export Report** - Generate Excel report containing details and statistics of all steps
## Core Functions
| Function | Purpose | Module |
|------|------|----------|
| `get_dataset()` | Load and format data | references.func |
| `org_analysis()` | Organization sample analysis | references.func |
| `missing_check()` | Calculate missing rate | references.func |
| `drop_abnormal_ym()` | Filter abnormal months | references.analysis |
| `drop_highmiss_features()` | Drop high missing rate features | references.analysis |
| `drop_lowiv_features()` | Drop low IV features | references.analysis |
| `drop_highpsi_features()` | Drop high PSI features | references.analysis |
| `drop_highnoise_features()` | Null Importance denoising | references.analysis |
| `drop_highcorr_features()` | Drop high correlation features | references.analysis |
| `iv_distribution_by_org()` | IV distribution statistics | references.analysis |
| `psi_distribution_by_org()` | PSI distribution statistics | references.analysis |
| `value_ratio_distribution_by_org()` | Value ratio distribution statistics | references.analysis |
| `export_cleaning_report()` | Export cleaning report | references.analysis |
## Parameter Description
### Data Loading Parameters
- `DATA_PATH`: Data file path (Parquet format recommended)
- `DATE_COL`: Date column name
- `Y_COL`: Label column name
- `ORG_COL`: Organization column name
- `KEY_COLS`: Primary key column name list
### OOS Organization Configuration
- `OOS_ORGS`: Out-of-sample organization list
### Abnormal Month Filtering Parameters
- `min_ym_bad_sample`: Minimum bad sample count per month (default 10)
- `min_ym_sample`: Minimum total sample count per month (default 500)
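The abnormal-month filter is a simple groupby threshold check; a minimal sketch (the helper name is illustrative, but the `new_date_ym`/`new_target` column names match those used by the pipeline scripts):

```python
import pandas as pd

def drop_abnormal_months(df, min_bad=10, min_total=500,
                         ym_col="new_date_ym", y_col="new_target"):
    """Keep only months with enough bad samples and enough total samples."""
    stats = df.groupby(ym_col)[y_col].agg(bad="sum", total="count")
    ok = stats[(stats["bad"] >= min_bad) & (stats["total"] >= min_total)].index
    return df[df[ym_col].isin(ok)]

df = pd.DataFrame({
    "new_date_ym": ["202401"] * 600 + ["202402"] * 100,
    "new_target":  [1] * 20 + [0] * 580 + [1] * 5 + [0] * 95,
})
kept = drop_abnormal_months(df)
# 202402 is removed: only 5 bad samples and 100 rows, below both thresholds
```

The shipped `drop_abnormal_ym` additionally returns the removed months with their removal conditions for the report.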
### Missing Rate Parameters
- `missing_ratio`: Overall missing rate threshold (default 0.6)
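The missing-rate definition counts both `NaN` and the sentinel codes `[-1, -999, -1111]` that `get_dataset` replaces; a one-function sketch:

```python
import numpy as np
import pandas as pd

MISS_VALS = [-1, -999, -1111]  # sentinel codes treated as missing

def missing_rate(s: pd.Series) -> float:
    """Share of values that are NaN or a known missing-value sentinel."""
    return float((s.isin(MISS_VALS) | s.isna()).mean())

s = pd.Series([1.0, -999, np.nan, 3.0, -1])
# 3 of 5 values count as missing, so the rate is 0.6
```

Features whose overall rate exceeds `missing_ratio` are what Step 6 removes.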
### IV Parameters
- `overall_iv_threshold`: Overall IV threshold (default 0.1)
- `org_iv_threshold`: Single organization IV threshold (default 0.1)
- `max_org_threshold`: Maximum tolerated low IV organization count (default 2)
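IV sums, over the bins of a feature, the difference between the good and bad population shares weighted by their log-odds: IV = sum((good% - bad%) * ln(good% / bad%)). The skill itself bins with `toad`; this hand-rolled helper is a simplified stand-in for pre-binned data:

```python
import numpy as np
import pandas as pd

def information_value(binned: pd.Series, y: pd.Series, eps=1e-6) -> float:
    """IV over pre-binned values; eps guards against empty bins."""
    df = pd.DataFrame({"bin": binned, "y": y})
    grp = df.groupby("bin")["y"].agg(bad="sum", total="count")
    grp["good"] = grp["total"] - grp["bad"]
    good_pct = grp["good"] / max(grp["good"].sum(), eps) + eps
    bad_pct = grp["bad"] / max(grp["bad"].sum(), eps) + eps
    return float(((good_pct - bad_pct) * np.log(good_pct / bad_pct)).sum())
```

A bin split that fully separates good from bad yields a very large IV, while a split with identical bad rates in every bin yields an IV near zero, which is what the 0.1 threshold screens for.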
### PSI Parameters
- `psi_threshold`: PSI threshold (default 0.1)
- `max_months_ratio`: Maximum unstable month ratio (default 1/3)
- `max_orgs`: Maximum unstable organization count (default 6)
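PSI compares the binned distribution of a feature between a baseline period and a later period: PSI = sum((actual% - expected%) * ln(actual% / expected%)), with values above roughly 0.1 flagged as unstable. A hedged numpy sketch (the bin edges are fixed on the baseline):

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins=10, eps=1e-6) -> float:
    """Population Stability Index between two samples of one feature."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(((a_pct - e_pct) * np.log(a_pct / e_pct)).sum())

rng = np.random.default_rng(0)
base = rng.normal(0, 1, 10_000)
same = rng.normal(0, 1, 10_000)
shift = rng.normal(1, 1, 10_000)
# same distribution gives PSI near 0; a one-sigma mean shift pushes it well past 0.1
```

The pipeline computes this per organization and per month, then drops features unstable in too many months (`max_months_ratio`) or organizations (`max_orgs`).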
### Null Importance Parameters
- `n_estimators`: Number of trees (default 100)
- `max_depth`: Maximum tree depth (default 5)
- `gain_threshold`: Gain difference threshold (default 50)
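Null Importance compares a feature's gain on the real labels against its gain when labels are randomly permuted; a feature that does no better than chance is noise. A toy sketch using absolute correlation as a stand-in for tree gain (the real pipeline trains a tree model with `n_estimators`/`max_depth` and compares gains against `gain_threshold`):

```python
import numpy as np

def null_importance_filter(X, y, n_perm=20, threshold=3.0, seed=0):
    """Keep features whose real gain exceeds `threshold` times their mean
    gain under label permutation."""
    rng = np.random.default_rng(seed)

    def gain(labels):
        # |corr(feature, labels)| per column, a cheap proxy for tree gain
        xc = X - X.mean(axis=0)
        lc = labels - labels.mean()
        return np.abs(xc.T @ lc) / (np.linalg.norm(xc, axis=0) * np.linalg.norm(lc) + 1e-12)

    real = gain(y)
    null = np.mean([gain(rng.permutation(y)) for _ in range(n_perm)], axis=0)
    return real > threshold * null

rng = np.random.default_rng(1)
y = rng.integers(0, 2, 2000).astype(float)
X = np.column_stack([y + rng.normal(0, 1, 2000),   # informative column
                     rng.normal(0, 1, 2000)])      # pure noise column
keep = null_importance_filter(X, y)
# the informative column clears the null baseline; the noise column usually does not
```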
### High Correlation Parameters
- `max_corr`: Correlation threshold (default 0.9)
- `top_n_keep`: Keep top N features by original gain ranking (default 20)
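The high-correlation step keeps the stronger feature of any correlated pair, ranked by original gain. A simplified greedy variant (the shipped `drop_highcorr_features` also honors `top_n_keep`):

```python
import numpy as np
import pandas as pd

def drop_high_corr(df, gain, max_corr=0.9):
    """Walk features from highest to lowest gain; drop any feature whose
    absolute correlation with an already-kept feature exceeds max_corr."""
    order = sorted(df.columns, key=lambda c: gain.get(c, 0.0), reverse=True)
    corr = df.corr().abs()
    kept = []
    for col in order:
        if all(corr.loc[col, k] <= max_corr for k in kept):
            kept.append(col)
    return kept

rng = np.random.default_rng(0)
a = rng.normal(size=1000)
df = pd.DataFrame({"f1": a,
                   "f2": a + rng.normal(scale=0.01, size=1000),  # near-duplicate of f1
                   "f3": rng.normal(size=1000)})
kept = drop_high_corr(df, gain={"f1": 3.0, "f2": 2.0, "f3": 1.0})
# f2 is nearly identical to the higher-gain f1, so only f1 and f3 survive
```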
## Output Report
The generated Excel report contains the following sheets:
1. **汇总** (Summary) - Summary information for all steps, including operation results and conditions
2. **机构样本统计** (Organization Sample Statistics) - Sample count and bad sample rate for each organization
3. **分离OOS数据** (OOS Data Separation) - OOS sample and modeling sample counts
4. **Step4-异常月份处理** (Abnormal Month Handling) - Abnormal months that were removed
5. **缺失率明细** (Missing Rate Details) - Overall and organization-level missing rates for each feature
6. **Step5-有值率分布统计** (Value Ratio Distribution) - Distribution of features across value ratio ranges
7. **Step6-高缺失率处理** (High Missing Rate Handling) - High missing rate features that were removed
8. **Step7-IV明细** (IV Details) - IV values of each feature per organization and overall
9. **Step7-IV处理** (IV Handling) - Features failing the IV conditions and their low-IV organizations
10. **Step7-IV分布统计** (IV Distribution) - Distribution of features across IV ranges
11. **Step8-PSI明细** (PSI Details) - PSI values of each feature per organization per month
12. **Step8-PSI处理** (PSI Handling) - Features failing the PSI conditions and their unstable organizations
13. **Step8-PSI分布统计** (PSI Distribution) - Distribution of features across PSI ranges
14. **Step9-null importance处理** (Null Importance Handling) - Noise features that were removed
15. **Step10-高相关性剔除** (High Correlation Removal) - High correlation features that were removed
## Features
- **Interactive Input**: Parameters can be input before each step execution, with default values supported
- **Independent Execution**: Each step is executed independently without deleting original data, facilitating comparative analysis
- **Complete Report**: Generate complete Excel report containing details, statistics, and distributions
- **Multi-process Support**: IV and PSI calculations support multi-process acceleration
- **Organization-level Analysis**: Support organization-level statistics and modeling/OOS distinction

File diff suppressed because it is too large


@@ -0,0 +1,228 @@
"""Data processing functions module"""
import pandas as pd
import numpy as np
import toad
from typing import List, Dict, Tuple
import tqdm
from datetime import datetime
try:
    from openpyxl import Workbook
    from openpyxl.styles import Font, PatternFill, Alignment
    HAS_OPENPYXL = True
except ImportError:
    HAS_OPENPYXL = False
def get_dataset(data_pth: str, date_colName: str, y_colName: str,
org_colName: str, data_encode: str, key_colNames: List[str],
drop_colNames: List[str] = None,
miss_vals: List[int] = None) -> pd.DataFrame:
"""Load and format data
Args:
data_pth: Data file path
date_colName: Date column name
y_colName: Label column name
org_colName: Organization column name
data_encode: Data encoding
key_colNames: Primary key columns (for deduplication)
drop_colNames: Columns to drop
miss_vals: List of abnormal values to replace with NaN, default [-1, -999, -1111]
"""
if drop_colNames is None:
drop_colNames = []
if miss_vals is None:
miss_vals = [-1, -999, -1111]
    # Multi-format reading: try each supported reader until one succeeds
    data = None
    for fmt, reader in [('parquet', pd.read_parquet), ('csv', pd.read_csv),
                        ('xlsx', pd.read_excel), ('pkl', pd.read_pickle)]:
        try:
            data = reader(data_pth)
            break
        except Exception:
            continue
    if data is None:
        raise ValueError(f"Could not read data file with any supported reader: {data_pth}")
# Replace abnormal values with NaN
data.replace({v: np.nan for v in miss_vals}, inplace=True)
# Deduplication and filtering
data = data[data[y_colName].isin([0, 1])]
data = data.drop_duplicates(subset=key_colNames)
    # Drop user-specified and constant columns (drop() is not in-place; reassign the result)
    data = data.drop(columns=[c for c in drop_colNames if c in data.columns], errors='ignore')
    data = data.drop(columns=[c for c in data.columns if data[c].nunique() <= 1], errors='ignore')
# Rename columns
data.rename(columns={date_colName: 'new_date', y_colName: 'new_target',
org_colName: 'new_org'}, inplace=True)
data['new_date'] = data['new_date'].astype(str).str.replace('-', '', regex=False).str[:8]
data['new_date_ym'] = data['new_date'].str[:6]
return data
def org_analysis(data: pd.DataFrame, oos_orgs: List[str] = None) -> pd.DataFrame:
"""Organization sample statistics analysis
Args:
data: Data
oos_orgs: Out-of-sample organization list, used to identify OOS samples
"""
stat = data.groupby(['new_org', 'new_date_ym']).agg(
单月坏样本数=('new_target', 'sum'),
单月总样本数=('new_target', 'count'),
单月坏样率=('new_target', 'mean')
).reset_index()
# Cumulative statistics
stat['总坏样本数'] = stat.groupby('new_org')['单月坏样本数'].transform('sum')
stat['总样本数'] = stat.groupby('new_org')['单月总样本数'].transform('sum')
stat['总坏样率'] = stat['总坏样本数'] / stat['总样本数']
# Mark whether it is an OOS organization
if oos_orgs and len(oos_orgs) > 0:
stat['样本类型'] = stat['new_org'].apply(lambda x: '贷外' if x in oos_orgs else '建模')
else:
stat['样本类型'] = '建模'
stat = stat.rename(columns={'new_org': '机构', 'new_date_ym': '年月'})
# Sort by sample type (modeling first, OOS last)
stat = stat.sort_values(['样本类型', '机构', '年月'], ascending=[True, True, True])
stat = stat.reset_index(drop=True)
return stat[['机构', '年月', '单月坏样本数', '单月总样本数', '单月坏样率', '总坏样本数', '总样本数', '总坏样率', '样本类型']]
def missing_check(data: pd.DataFrame, channel: Dict[str, List[str]] = None) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Calculate missing rates, overall and per organization (`channel` is accepted but currently unused)
Returns:
miss_detail: Missing rate details (format: variable, overall, org1, org2, ..., orgn)
miss_ch: Overall missing rate (overall missing rate for each variable)
"""
miss_vals = [-1, -999, -1111]
miss_ch = []
# Exclude non-variable columns: record_id, target, org_info, etc.
exclude_cols = ['new_date', 'new_date_ym', 'new_target', 'new_org', 'record_id', 'target', 'org_info']
cols = [c for c in data.columns if c not in exclude_cols]
# Calculate overall missing rate
for col in tqdm.tqdm(cols, desc="Missing rate"):
rate = ((data[col].isin(miss_vals)) | (data[col].isna())).mean()
miss_ch.append({'变量': col, '整体缺失率': round(rate, 4)})
miss_ch = pd.DataFrame(miss_ch)
# Calculate organization-level missing rates and convert to wide format
orgs = sorted(data['new_org'].unique())
miss_detail_dict = {'变量': []}
miss_detail_dict['整体'] = []
for org in orgs:
miss_detail_dict[org] = []
for col in cols:
miss_detail_dict['变量'].append(col)
# Overall missing rate
overall_rate = ((data[col].isin(miss_vals)) | (data[col].isna())).mean()
miss_detail_dict['整体'].append(round(overall_rate, 4))
# Missing rate for each organization
for org in orgs:
org_data = data[data['new_org'] == org]
rate = ((org_data[col].isin(miss_vals)) | (org_data[col].isna())).mean()
miss_detail_dict[org].append(round(rate, 4))
miss_detail = pd.DataFrame(miss_detail_dict)
# Sort by overall missing rate in descending order
miss_detail = miss_detail.sort_values('整体', ascending=False)
miss_detail = miss_detail.reset_index(drop=True)
return miss_detail, miss_ch
def calculate_iv(data: pd.DataFrame, features: List[str], n_jobs: int = 4) -> pd.DataFrame:
"""Calculate IV value - use toad.transform.Combiner for binning, set number of bins to 5, keep NaN values"""
    from joblib import Parallel, delayed
def _calc_iv(f):
try:
# Use toad.transform.Combiner for binning, set number of bins to 5
c = toad.transform.Combiner()
data_temp = data[[f, 'new_target']].copy()
data_temp.columns = ['x', 'y']
data_temp['x_bin'] = c.fit_transform(X=data_temp['x'], y=data_temp['y'], method='dt', n_bins=5, min_samples=0.05/5, empty_separate=True)
# Calculate IV value using binned data
iv_df = toad.quality(data_temp[['x_bin', 'y']], 'y', iv_only=True)
if 'iv' in iv_df.columns and len(iv_df) > 0:
iv_value = iv_df['iv'].iloc[0]
if not np.isnan(iv_value):
return {'变量': f, 'IV': round(iv_value, 4)}
return None
except Exception as e:
print(f" IV calculation error: variable={f}, error={e}")
return None
    # Run per-feature IV calculations in parallel
results = Parallel(n_jobs=n_jobs, verbose=0)(
delayed(_calc_iv)(f) for f in features
)
iv_list = [r for r in results if r is not None]
if len(iv_list) == 0:
print(f" IV calculation result is empty, number of features={len(features)}")
return pd.DataFrame(columns=['变量', 'IV'])
return pd.DataFrame(iv_list).sort_values('IV', ascending=False)
def calculate_corr(data: pd.DataFrame, features: List[str]) -> pd.DataFrame:
"""Calculate correlation matrix"""
corr = data[features].corr().abs()
return corr
def export_report_xlsx(filepath: str, data_name: str, data: pd.DataFrame,
sheet_name: str, description: str = ""):
"""Export xlsx report - supports appending"""
    try:
        # Append a new sheet when the workbook already exists
        from openpyxl import load_workbook
        wb = load_workbook(filepath)
        ws = wb.create_sheet(sheet_name)
    except Exception:
        # Otherwise start a fresh workbook
        wb = Workbook()
        ws = wb.active
        ws.title = sheet_name
# Write description
ws['A1'] = f"Data: {data_name}"
ws['A2'] = f"Time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"
if description:
ws['A3'] = f"Description: {description}"
# Write data
start_row = 5
for i, col in enumerate(data.columns):
ws.cell(start_row, i+1, col)
for i, row in enumerate(data.values):
for j, val in enumerate(row):
ws.cell(start_row+1+i, j+1, val)
# Styles
header_fill = PatternFill(start_color="366092", end_color="366092", fill_type="solid")
header_font = Font(color="FFFFFF", bold=True)
for cell in ws[start_row]:
cell.fill = header_fill
cell.font = header_font
cell.alignment = Alignment(horizontal='center')
wb.save(filepath)
print(f"[{sheet_name}] Saved to {filepath}")


@@ -0,0 +1,391 @@
#!/usr/bin/env python3
"""
Execution script
Version: 1.0.0
Last modified: 02-03-2026
"""
import os, sys
import time
import pandas as pd
from typing import Dict, List, Optional, Any, Callable
import numpy as np
import multiprocessing
# =============================================================================
# System Configuration
# =============================================================================
CPU_COUNT = multiprocessing.cpu_count()
N_JOBS = max(1, CPU_COUNT - 1) # Multi-process parallel count, keep 1 core for system
def _ensure_references_on_path():
script_dir = os.path.dirname(__file__)
cur = script_dir
for _ in range(8):
candidate = os.path.join(cur, 'references')
if os.path.isdir(candidate):
# add parent folder (which contains `references`) to sys.path
sys.path.insert(0, cur)
return
parent = os.path.dirname(cur)
if parent == cur:
break
cur = parent
# fallback: add a reasonable repo-root guess
sys.path.insert(0, os.path.abspath(os.path.join(script_dir, '..', '..', '..')))
_ensure_references_on_path()
from references.func import get_dataset, missing_check, org_analysis
from references.analysis import (drop_abnormal_ym, drop_highmiss_features,
drop_lowiv_features, drop_highcorr_features,
drop_highpsi_features,
drop_highnoise_features,
export_cleaning_report,
iv_distribution_by_org,
psi_distribution_by_org,
value_ratio_distribution_by_org)
# ==================== Path Configuration (Interactive Input) ====================
# Use 50-column test data as default, support interactive modification in command line
default_data_path = ''
default_output_dir = ''
def _get_path_input(prompt, default):
try:
user_val = input(f"{prompt} (default: {default}): ").strip()
except Exception:
user_val = ''
return user_val if user_val else default
DATA_PATH = _get_path_input('Please enter data file path DATA_PATH', default_data_path)
OUTPUT_DIR = _get_path_input('Please enter output directory OUTPUT_DIR', default_output_dir)
REPORT_PATH = os.path.join(OUTPUT_DIR, '数据清洗报告.xlsx')  # "Data Cleaning Report.xlsx"
# Data column name configuration (adjust according to actual data)
DATE_COL = _get_path_input('Please enter date column name in data', 'apply_date')
Y_COL = _get_path_input('Please enter label column name in data', 'target')
ORG_COL = _get_path_input('Please enter organization column name in data', 'org_info')
# Support multiple primary key column names input (comma or space separated)
def _get_list_input(prompt, default):
try:
user_val = input(f"{prompt} (default: {default}): ").strip()
except Exception:
user_val = ''
if not user_val:
user_val = default
# Support comma or space separation
parts = [p.strip() for p in user_val.replace(',', ' ').split() if p.strip()]
return parts
KEY_COLS = _get_list_input('Please enter primary key column names in data (multiple columns separated by comma or space)', 'record_id')
# ==================== Multi-process Configuration Information ====================
print("=" * 60)
print("Multi-process Configuration")
print("=" * 60)
print(f" Local CPU cores: {CPU_COUNT}")
print(f" Current process count: {N_JOBS}")
print("=" * 60)
# ==================== OOS Organization Configuration (Interactive Input) ====================
# Default out-of-sample organization list, users can input custom list in comma-separated format during interaction
default_oos = [
'orgA', 'orgB', 'orgC', 'orgD', 'orgE',
]
try:
oos_input = input('Please enter out-of-sample organization list, comma separated (press Enter to use default list):').strip()
except Exception:
oos_input = ''
if oos_input:
OOS_ORGS = [s.strip() for s in oos_input.split(',') if s.strip()]
else:
OOS_ORGS = default_oos
os.makedirs(OUTPUT_DIR, exist_ok=True)
# ==================== Interactive Hyperparameter Input ====================
def get_user_input(prompt, default, dtype=float):
"""Get user input, support default value and type conversion"""
while True:
try:
user_input = input(f"{prompt} (default: {default}): ").strip()
if not user_input:
return default
return dtype(user_input)
except ValueError:
print(f" Invalid input, please enter {dtype.__name__} type")
# Record cleaning steps
steps = []
# Store parameters for each step
params = {}
# Timer decorator
def timer(step_name):
"""Timer decorator"""
def decorator(func):
def wrapper(*args, **kwargs):
print(f"\nStarting {step_name}...")
start_time = time.time()
result = func(*args, **kwargs)
elapsed = time.time() - start_time
print(f" {step_name} elapsed: {elapsed:.2f} seconds")
return result
return wrapper
return decorator
# ==================== Step 1: Get Data ====================
print("\n" + "=" * 60)
print("Step 1: Get Data")
print("=" * 60)
step_start = time.time()
# Use the interactively collected configuration
data = get_dataset(
data_pth=DATA_PATH,
date_colName=DATE_COL,
y_colName=Y_COL,
org_colName=ORG_COL,
data_encode='utf-8',
key_colNames=KEY_COLS,
drop_colNames=[],
miss_vals=[-1, -999, -1111]
)
print(f" Original data: {data.shape}")
print(f" Abnormal values replaced with NaN: [-1, -999, -1111]")
print(f" Step 1 elapsed: {time.time() - step_start:.2f} seconds")
# ==================== Step 2: Organization Sample Analysis ====================
print("\n" + "=" * 60)
print("Step 2: Organization Sample Analysis")
print("=" * 60)
step_start = time.time()
org_stat = org_analysis(data, oos_orgs=OOS_ORGS)
steps.append(('机构样本统计', org_stat))
print(f" Organization count: {data['new_org'].nunique()}, Month count: {data['new_date_ym'].nunique()}")
print(f" Out-of-sample organizations: {len(OOS_ORGS)}")
print(f" Step 2 elapsed: {time.time() - step_start:.2f} seconds")
# ==================== Step 3: Separate OOS Data ====================
print("\n" + "=" * 60)
print("Step 3: Separate OOS Data")
print("=" * 60)
step_start = time.time()
oos_data = data[data['new_org'].isin(OOS_ORGS)]
data = data[~data['new_org'].isin(OOS_ORGS)]
print(f" OOS samples: {oos_data.shape[0]} rows")
print(f" Modeling samples: {data.shape[0]} rows")
print(f" OOS organizations: {OOS_ORGS}")
print(f" Step 3 elapsed: {time.time() - step_start:.2f} seconds")
# Create separation information DataFrame
oos_info = pd.DataFrame({'变量': ['OOS样本', '建模样本'], '数量': [oos_data.shape[0], data.shape[0]]})
steps.append(('分离OOS数据', oos_info))
# ==================== Step 4: Filter Abnormal Months (Modeling Data Only) ====================
print("\n" + "=" * 60)
print("Step 4: Filter Abnormal Months (Modeling Data Only)")
print("=" * 60)
print(" Press Enter to use default values")
print("=" * 60)
params['min_ym_bad_sample'] = int(get_user_input("Bad sample count threshold", 10, int))
params['min_ym_sample'] = int(get_user_input("Total sample count threshold", 500, int))
step_start = time.time()
data_filtered, abnormal_ym = drop_abnormal_ym(data.copy(), min_ym_bad_sample=params['min_ym_bad_sample'], min_ym_sample=params['min_ym_sample'])
steps.append(('Step4-异常月份处理', abnormal_ym))
print(f" After filtering: {data_filtered.shape}")
print(f" Parameters: min_ym_bad_sample={params['min_ym_bad_sample']}, min_ym_sample={params['min_ym_sample']}")
if len(abnormal_ym) > 0:
print(f" Dropped months: {abnormal_ym['年月'].tolist()}")
print(f" Removal conditions: {abnormal_ym['去除条件'].tolist()}")
print(f" Step 4 elapsed: {time.time() - step_start:.2f} seconds")
# ==================== Step 5: Calculate Missing Rate ====================
print("\n" + "=" * 60)
print("Step 5: Calculate Missing Rate")
print("=" * 60)
step_start = time.time()
orgs = data['new_org'].unique().tolist()
channel = {'整体': orgs}
miss_detail, miss_channel = missing_check(data, channel=channel)
# miss_detail: Missing rate details (format: feature, overall, org1, org2, ..., orgn)
# miss_channel: Overall missing rate
steps.append(('缺失率明细', miss_detail))
print(f" Feature count: {len(miss_detail['变量'].unique())}")
print(f" Organization count: {len(miss_detail.columns) - 2}") # Subtract '变量' and '整体' columns
print(f" Step 5 elapsed: {time.time() - step_start:.2f} seconds")
# ==================== Step 6: Drop High Missing Rate Features ====================
print("\n" + "=" * 60)
print("Step 6: Drop High Missing Rate Features")
print("=" * 60)
print(" Press Enter to use default values")
print("=" * 60)
params['missing_ratio'] = get_user_input("Missing rate threshold", 0.6)
step_start = time.time()
data_miss, dropped_miss = drop_highmiss_features(data.copy(), miss_channel, threshold=params['missing_ratio'])
steps.append(('Step6-高缺失率处理', dropped_miss))
print(f" Dropped: {len(dropped_miss)}")
print(f" Threshold: {params['missing_ratio']}")
if len(dropped_miss) > 0:
print(f" Dropped features: {dropped_miss['变量'].tolist()[:5]}...")
print(f" Removal conditions: {dropped_miss['去除条件'].tolist()[:5]}...")
print(f" Step 6 elapsed: {time.time() - step_start:.2f} seconds")
# ==================== Step 7: Drop Low IV Features ====================
print("\n" + "=" * 60)
print("Step 7: Drop Low IV Features")
print("=" * 60)
print(" Press Enter to use default values")
print("=" * 60)
params['overall_iv_threshold'] = get_user_input("Overall IV threshold", 0.1)
params['org_iv_threshold'] = get_user_input("Single organization IV threshold", 0.1)
params['max_org_threshold'] = int(get_user_input("Maximum tolerated low IV organization count", 2, int))
step_start = time.time()
# Candidate features are identified by the 'i_' column prefix
features = [c for c in data.columns if c.startswith('i_')]
data_iv, iv_detail, iv_process = drop_lowiv_features(
data.copy(), features,
overall_iv_threshold=params['overall_iv_threshold'],
org_iv_threshold=params['org_iv_threshold'],
max_org_threshold=params['max_org_threshold'],
n_jobs=N_JOBS
)
# iv_detail: IV details (IV value of each feature in each organization and overall)
# iv_process: IV processing table (features that do not meet the conditions)
steps.append(('Step7-IV处理', iv_process))
print(f" Dropped: {len(iv_process)}")
print(f" Parameters: overall_iv_threshold={params['overall_iv_threshold']}, org_iv_threshold={params['org_iv_threshold']}, max_org_threshold={params['max_org_threshold']}")
if len(iv_process) > 0:
print(f" Dropped features: {iv_process['变量'].tolist()[:5]}...")
print(f" Processing reasons: {iv_process['处理原因'].tolist()[:5]}...")
print(f" Step 7 elapsed: {time.time() - step_start:.2f} seconds")
# ==================== Step 8: Drop High PSI Features ====================
print("\n" + "=" * 60)
print("Step 8: Drop High PSI Features (By Organization + Month-by-Month)")
print("=" * 60)
print(" Press Enter to use default values")
print("=" * 60)
params['psi_threshold'] = get_user_input("PSI threshold", 0.1)
params['max_months_ratio'] = get_user_input("Maximum unstable month ratio", 1/3)
params['max_orgs'] = int(get_user_input("Maximum unstable organization count", 6, int))
step_start = time.time()
# Get features before PSI calculation (use all features)
features_for_psi = [c for c in data.columns if c.startswith('i_')]
data_psi, psi_detail, psi_process = drop_highpsi_features(
data.copy(), features_for_psi,
psi_threshold=params['psi_threshold'],
max_months_ratio=params['max_months_ratio'],
max_orgs=params['max_orgs'],
min_sample_per_month=100,
n_jobs=N_JOBS
)
# psi_detail: PSI details (PSI value of each feature in each organization each month)
# psi_process: PSI processing table (features that do not meet the conditions)
steps.append(('Step8-PSI处理', psi_process))
print(f" Dropped: {len(psi_process)}")
print(f" Parameters: psi_threshold={params['psi_threshold']}, max_months_ratio={params['max_months_ratio']:.2f}, max_orgs={params['max_orgs']}")
if len(psi_process) > 0:
print(f" Dropped features: {psi_process['变量'].tolist()[:5]}...")
print(f" Processing reasons: {psi_process['处理原因'].tolist()[:5]}...")
print(f" PSI details: {len(psi_detail)} records")
print(f" Step 8 elapsed: {time.time() - step_start:.2f} seconds")
# ==================== Step 9: Null Importance Denoising ====================
print("\n" + "=" * 60)
print("Step 9: Null Importance Remove High Noise Features")
print("=" * 60)
print(" Press Enter to use default values")
print("=" * 60)
params['n_estimators'] = int(get_user_input("Number of trees", 100, int))
params['max_depth'] = int(get_user_input("Maximum tree depth", 5, int))
params['gain_threshold'] = get_user_input("Gain difference threshold", 50)
step_start = time.time()
# Get feature list (use all features)
features = [c for c in data.columns if c.startswith('i_')]
data_noise, dropped_noise = drop_highnoise_features(data.copy(), features, n_estimators=params['n_estimators'], max_depth=params['max_depth'], gain_threshold=params['gain_threshold'])
steps.append(('Step9-null importance处理', dropped_noise))
print(f" Dropped: {len(dropped_noise)}")
print(f" Parameters: n_estimators={params['n_estimators']}, max_depth={params['max_depth']}, gain_threshold={params['gain_threshold']}")
if len(dropped_noise) > 0:
print(f" Dropped features: {dropped_noise['变量'].tolist()}")
print(f" Step 9 elapsed: {time.time() - step_start:.2f} seconds")
# ==================== Step 10: Drop High Correlation Features (Based on Null Importance Original Gain) ====================
print("\n" + "=" * 60)
print("Step 10: Drop High Correlation Features (Based on Null Importance Original Gain)")
print("=" * 60)
print(" Press Enter to use default values")
print("=" * 60)
params['max_corr'] = get_user_input("Correlation threshold", 0.9)
params['top_n_keep'] = int(get_user_input("Keep top N features by original gain ranking", 20, int))
step_start = time.time()
# Get feature list (use all features)
features = [c for c in data.columns if c.startswith('i_')]
# Get original gain from null importance results
if len(dropped_noise) > 0 and '原始gain' in dropped_noise.columns:
gain_dict = dict(zip(dropped_noise['变量'], dropped_noise['原始gain']))
else:
gain_dict = {}
data_corr, dropped_corr = drop_highcorr_features(data.copy(), features, threshold=params['max_corr'], gain_dict=gain_dict, top_n_keep=params['top_n_keep'])
steps.append(('Step10-高相关性剔除', dropped_corr))
print(f" Dropped: {len(dropped_corr)}")
print(f" Threshold: {params['max_corr']}")
if len(dropped_corr) > 0:
print(f" Dropped features: {dropped_corr['变量'].tolist()}")
print(f" Removal conditions: {dropped_corr['去除条件'].tolist()[:5]}...")
print(f" Step 10 elapsed: {time.time() - step_start:.2f} seconds")
# ==================== Step 11: Export Report ====================
print("\n" + "=" * 60)
print("Step 11: Export Report")
print("=" * 60)
step_start = time.time()
# Calculate IV distribution statistics
print(" Calculating IV distribution statistics...")
iv_distribution = iv_distribution_by_org(iv_detail, oos_orgs=OOS_ORGS)
print(f" IV distribution statistics: {len(iv_distribution)} records")
# Calculate PSI distribution statistics
print(" Calculating PSI distribution statistics...")
psi_distribution = psi_distribution_by_org(psi_detail, oos_orgs=OOS_ORGS)
print(f" PSI distribution statistics: {len(psi_distribution)} records")
# Calculate value ratio distribution statistics (use all features)
print(" Calculating value ratio distribution statistics...")
features_for_value_ratio = [c for c in data.columns if c.startswith('i_')]
value_ratio_distribution = value_ratio_distribution_by_org(data, features_for_value_ratio, oos_orgs=OOS_ORGS)
print(f" Value ratio distribution statistics: {len(value_ratio_distribution)} records")
# Add details and distribution statistics to steps list
steps.append(('Step7-IV明细', iv_detail))
steps.append(('Step7-IV分布统计', iv_distribution))
steps.append(('Step8-PSI明细', psi_detail))
steps.append(('Step8-PSI分布统计', psi_distribution))
steps.append(('Step5-有值率分布统计', value_ratio_distribution))
export_cleaning_report(REPORT_PATH, steps,
iv_detail=iv_detail,
iv_process=iv_process,
psi_detail=psi_detail,
psi_process=psi_process,
params=params,
iv_distribution=iv_distribution,
psi_distribution=psi_distribution,
value_ratio_distribution=value_ratio_distribution)
print(f" Report: {REPORT_PATH}")
print(f" Step 11 elapsed: {time.time() - step_start:.2f} seconds")
# ==================== Summary ====================
print("\n" + "=" * 60)
print("Data Cleaning Completed!")
print("=" * 60)
print(f" Modeling data (after OOS split): {data.shape[0]} rows")
print(f" Candidate features: {len([c for c in data.columns if c.startswith('i_')])}")
print(" Cleaning steps (each step executed independently, data not deleted):")
for name, df in steps:
    print(f" - {name}: {df.shape[0] if hasattr(df, 'shape') else len(df)} rows")