
Reproducing a Lending Club Data Analysis with Python and Pandas: The Complete Workflow from Data Cleaning to Visual Insights

news 2026/5/28 19:41:10

Python + Pandas in Practice: A Full Walkthrough of Lending Club Loan Data Analysis

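Financial data analysis has long been one of the most challenging and commercially valuable applications of data science. Lending Club, a well-known P2P lending platform, publishes a dataset that makes excellent practice material for anyone learning data analysis. This article reproduces a complete loan data analysis project with Python and Pandas, from importing the raw data to drawing visual insights. Each step comes with runnable code and fixes for common problems.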

1. Environment Setup and Data Loading

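Before starting, make sure the required Python libraries are installed. An Anaconda environment is recommended because it ships with most of the packages data analysis needs. With a plain Python installation, install them with pip: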

pip install pandas numpy matplotlib seaborn jupyter

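The dataset is available from Kaggle or the Lending Club website. We use the 2007-2015 loan data, which contains more than 880,000 records and 74 fields. Encoding problems that commonly appear at load time can be solved by passing the encoding parameter to read_csv: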

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter magic; remove this line when running as a plain script
%matplotlib inline

# Only needed when plot labels contain Chinese characters
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

# Load the data (assumes the file is in the current directory)
try:
    loan_df = pd.read_csv('loan.csv', low_memory=False)
    print(f"Data loaded, shape: {loan_df.shape}")
except Exception as e:
    print(f"Load failed: {e}")

Right after the first load, check an overview of the data:

loan_df.info(verbose=True, memory_usage='deep')

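Troubleshooting table for common problems:

Symptom	Likely cause	Fix
Memory error	Dataset too large	Load fewer columns with `usecols`, set `dtype` explicitly, or read in pieces with `chunksize` (`low_memory=False` only avoids mixed-dtype inference and does not reduce memory use)
Encoding error	File is not UTF-8 encoded	Try `encoding='latin1'`
Inconsistent column names	Differences between data source versions	Inspect the columns with `loan_df.columns.tolist()`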
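
The memory-related fixes in the table can be sketched as follows. This is a minimal illustration on a tiny in-memory CSV, not the real loan.csv: the rows are made up, and the `desc` column simply stands in for a wide text field that the analysis does not need.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for loan.csv (hypothetical values, for illustration only)
csv_text = (
    "id,loan_amnt,grade,desc\n"
    "1,5000,A,first\n"
    "2,12000,B,second\n"
    "3,35000,C,third\n"
)

# usecols skips columns that are not needed; dtype avoids type inference
# and stores the low-cardinality grade column compactly as a category
small_df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["id", "loan_amnt", "grade"],
    dtype={"id": "int64", "loan_amnt": "float32", "grade": "category"},
)

# chunksize returns an iterator of DataFrames, so aggregates can be built
# without holding the whole file in memory at once
total_amnt = 0.0
row_count = 0
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=["loan_amnt"], chunksize=2):
    total_amnt += chunk["loan_amnt"].sum()
    row_count += len(chunk)

print(small_df.dtypes)
print(row_count, total_amnt)
```

For the real file, the same arguments would be passed to `pd.read_csv('loan.csv', ...)`, with `encoding='latin1'` added if a decoding error appears.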

2. Practical Data Cleaning Techniques

Raw financial data usually contains a lot of "noise" that has to be cleaned up. The key cleaning steps and their implementation follow:

2.1 Column Selection

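Many of the original 74 columns contribute little to the analysis and only add processing overhead. Based on business understanding, we select 25 core fields: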

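essential_cols = [
    'id', 'loan_amnt', 'funded_amnt', 'term', 'int_rate',
    'grade', 'sub_grade', 'emp_length', 'home_ownership', 'annual_inc',
    'issue_d', 'loan_status', 'purpose', 'addr_state', 'dti',
    'delinq_2yrs', 'earliest_cr_line', 'inq_last_6mths', 'open_acc', 'pub_rec',
    'revol_bal', 'revol_util', 'total_acc', 'verification_status', 'fico_range_high'
]
# Keep only the columns present in this release of the data; some versions
# of the file may lack fields such as 'fico_range_high'
available_cols = [c for c in essential_cols if c in loan_df.columns]
# .copy() avoids SettingWithCopyWarning in the cleaning steps that follow
loan_df = loan_df[available_cols].copy()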

2.2 Strategies for Missing Values

Missing values in financial data need especially careful handling, and different fields call for different strategies:

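# Share of missing values per column
missing_stats = loan_df.isnull().sum() / len(loan_df)
print(missing_stats.sort_values(ascending=False))

# Handle each field separately
# Unknown employment length is treated as 0 years (assignment instead of
# inplace=True, which does not work on a column selection in recent pandas)
loan_df['emp_length'] = loan_df['emp_length'].fillna('0 years')
# Annual income is a key field, so rows missing it are dropped
loan_df = loan_df.dropna(subset=['annual_inc'])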

2.3 Data Type Conversion

Date and numeric fields must be converted to the right types before they can be used in the analysis:

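# Convert date fields (assuming values look like 'Dec-2011'; drop format= otherwise)
loan_df['issue_d'] = pd.to_datetime(loan_df['issue_d'], format='%b-%Y')
loan_df['earliest_cr_line'] = pd.to_datetime(loan_df['earliest_cr_line'], format='%b-%Y')

# Percentage fields: strip '%' only when stored as text,
# since some releases already store these columns as floats
for col in ['int_rate', 'revol_util']:
    if not pd.api.types.is_numeric_dtype(loan_df[col]):
        loan_df[col] = loan_df[col].str.strip().str.rstrip('%')
    loan_df[col] = loan_df[col].astype('float')

# Extract the loan term in months, e.g. ' 36 months' -> 36
loan_df['term_months'] = loan_df['term'].str.extract(r'(\d+)', expand=False).astype(int)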
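
Because percentage fields may arrive either as text such as '10.65%' or as plain numbers depending on the data release, the conversion can be wrapped in a small helper. The sketch below is one possible approach; `to_percent_float` is a hypothetical name introduced here, and the sample values are made up.

```python
import pandas as pd

def to_percent_float(series: pd.Series) -> pd.Series:
    """Return floats from values like '10.65%' or from already-numeric input."""
    if pd.api.types.is_numeric_dtype(series):
        return series.astype("float64")
    # Cast to str so .str methods work; unparseable entries (including
    # missing values) become NaN through errors="coerce"
    cleaned = series.astype(str).str.strip().str.rstrip("%")
    return pd.to_numeric(cleaned, errors="coerce").astype("float64")

as_text = pd.Series([" 10.65%", "7.9%", None])   # text form with a missing value
as_number = pd.Series([10.65, 7.9])              # already numeric

parsed_text = to_percent_float(as_text)
parsed_number = to_percent_float(as_number)
print(parsed_text.tolist(), parsed_number.tolist())
```

In the article's pipeline this would be applied as `loan_df['int_rate'] = to_percent_float(loan_df['int_rate'])`, and likewise for `revol_util`.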

3. Exploratory Data Analysis (EDA)

3.1 Analysis over Time

Start by looking at how the platform developed over time:

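# Extract the issue year
loan_df['issue_year'] = loan_df['issue_d'].dt.year

# Plot the number of loans per year
# (a single color avoids the palette-without-hue deprecation in recent seaborn)
plt.figure(figsize=(12, 6))
sns.countplot(x='issue_year', data=loan_df, color='steelblue')
plt.title('Number of Loans per Year')
plt.xlabel('Year')
plt.ylabel('Number of loans')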

Key findings:

Loan volume grows exponentially after 2012
Growth is flat in 2007-2009, affected by the financial crisis
Growth slows somewhat in 2015
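
Findings about growth rates are easier to check with numbers than by eye. The sketch below computes year-over-year growth from yearly counts; the `issue_year` values are made up for illustration and would be replaced by `loan_df['issue_year']` in practice.

```python
import pandas as pd

# Hypothetical issue years for illustration only
issue_year = pd.Series([2012] * 2 + [2013] * 5 + [2014] * 10 + [2015] * 15)

# Loans per year, in chronological order
yearly_counts = issue_year.value_counts().sort_index()

# Year-over-year growth: (this year - last year) / last year
yoy_growth = yearly_counts.pct_change()

print(pd.DataFrame({"loans": yearly_counts, "yoy_growth": yoy_growth}))
```

A roughly constant growth rate from year to year is what exponential growth looks like, while a falling rate, as in the last rows of this toy data, indicates a slowdown.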

3.2 Distribution of Loan Characteristics

Examine the distribution of core characteristics such as loan amount and term:

# Loan amount distribution
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.boxplot(y='loan_amnt', data=loan_df)
plt.title('Loan Amount Box Plot')

plt.subplot(1, 2, 2)
sns.histplot(loan_df['loan_amnt'], bins=30, kde=True)
plt.title('Loan Amount Histogram')

plt.tight_layout()

Characteristics of the amount distribution:

Median: $10,000
75th percentile: $15,000
A small number of large loans sit at the top of the range, around $35,000
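
The figures read off the box plot can be reproduced numerically. The sketch below uses made-up amounts; with the real data, `loan_amnt` would be `loan_df['loan_amnt']`. The upper fence follows the usual box plot convention of Q3 + 1.5 × IQR.

```python
import pandas as pd

# Hypothetical loan amounts for illustration only
loan_amnt = pd.Series([1000, 5000, 8000, 10000, 12000, 15000, 20000, 35000])

# Quartiles behind the box plot
summary = loan_amnt.quantile([0.25, 0.5, 0.75])
iqr = summary.loc[0.75] - summary.loc[0.25]

# Points above Q3 + 1.5 * IQR are drawn as outliers in a standard box plot
upper_fence = summary.loc[0.75] + 1.5 * iqr
n_outliers = int((loan_amnt > upper_fence).sum())

print(summary)
print(f"max: {loan_amnt.max()}, upper fence: {upper_fence}, outliers: {n_outliers}")
```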

3.3 Borrower Profile Analysis

Cross-analysis helps characterize the typical borrower:

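# Employment length vs. loan amount
# (categories not listed in `order`, such as the filled-in '0 years', are not drawn)
emp_order = ['< 1 year', '1 year', '2 years', '3 years', '4 years', '5 years',
             '6 years', '7 years', '8 years', '9 years', '10+ years']
plt.figure(figsize=(12, 6))
sns.boxplot(x='emp_length', y='loan_amnt', data=loan_df, order=emp_order)
plt.title('Loan Amount Distribution by Employment Length')
plt.xticks(rotation=45)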

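Borrower profile characteristics:

Borrowers with 10+ years of employment show the widest spread of loan amounts
Borrowers with less than 1 year of employment cluster in the small-loan range
Borrowers with 3-5 years of employment are the most active group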

4. Risk Analysis

4.1 Defining and Identifying Bad Loans

Risk identification is at the core of financial data analysis. The first step is to define clearly what counts as a bad loan:

# Loan statuses treated as bad loans
bad_status = [
    'Charged Off',
    'Default',
    'Late (31-120 days)',
    'Does not meet the credit policy. Status:Charged Off'
]
# Every other status, including loans that are still current, counts as good
loan_df['loan_condition'] = np.where(
    loan_df['loan_status'].isin(bad_status), 'Bad Loan', 'Good Loan'
)

# Overall bad-loan rate
bad_loan_rate = (loan_df['loan_condition'] == 'Bad Loan').mean()
print(f"Overall bad-loan rate: {bad_loan_rate:.2%}")

4.2 Risk Factor Association Analysis

Explore how strongly different factors are associated with the bad-loan rate:

# Credit grade vs. bad-loan rate
grade_stats = (
    loan_df.groupby('grade')['loan_condition']
    .value_counts(normalize=True)
    .unstack()
)
grade_stats['Bad Loan'].plot(kind='bar', figsize=(10, 6))
plt.title('Bad-Loan Rate by Credit Grade')
plt.ylabel('Share of bad loans')

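Risk insights:

Grade G has the highest bad-loan rate (about 30%)
The bad-loan rate rises roughly linearly across grades A-C
From grade D onward, bad-loan risk increases markedly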
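
A rate per grade is more informative when shown next to the number of loans behind it. The helper below is a minimal sketch of that idea; `bad_rate_by` is a hypothetical name introduced here, and the demo rows are made up.

```python
import pandas as pd

def bad_rate_by(df, by, status_col="loan_condition", bad_label="Bad Loan"):
    """Return the loan count and bad-loan rate for each value of `by`."""
    is_bad = df[status_col].eq(bad_label)
    # The mean of a boolean column is the share of True values
    out = is_bad.groupby(df[by]).agg(loans="count", bad_rate="mean")
    return out.sort_index()

# Hypothetical rows for illustration only
demo = pd.DataFrame({
    "grade": ["A", "A", "A", "A", "B", "B", "G", "G"],
    "loan_condition": ["Good Loan", "Good Loan", "Good Loan", "Bad Loan",
                       "Good Loan", "Bad Loan", "Bad Loan", "Bad Loan"],
})

rates = bad_rate_by(demo, "grade")
print(rates)
```

The same call works for any categorical column, for example `bad_rate_by(loan_df, 'purpose')`.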

4.3 Crossing Multiple Risk Features

Use a cross-tabulation to analyze the combined effect of several risk factors:

# Build the cross-tabulation (each row sums to 1)
risk_crosstab = pd.crosstab(
    index=[loan_df['grade'], loan_df['emp_length']],
    columns=loan_df['loan_condition'],
    normalize='index'
)

# Heatmap: rows are grades, columns are employment lengths
plt.figure(figsize=(12, 8))
sns.heatmap(risk_crosstab['Bad Loan'].unstack(), annot=True, fmt=".1%", cmap='Reds')
plt.title('Bad-Loan Rate by Credit Grade and Employment Length')

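Further insights:

Grade G loans to borrowers with short employment histories carry extremely high bad-loan risk (>40%)
Among grade A loans, borrowers with 10+ years of employment actually show a higher bad-loan rate
Among grade E-F loans, the 3-5 year employment group stands out as risky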
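
Rates in a two-way table can be misleading when a cell contains only a handful of loans, which is most likely in the sparse high-risk grades. One way to guard against this, sketched below on made-up rows, is to compute the cell counts alongside the rates and blank out cells below a minimum size. `MIN_LOANS` is a hypothetical threshold chosen for the toy data; real data would warrant a much larger value.

```python
import pandas as pd

# Hypothetical rows for illustration only
demo = pd.DataFrame({
    "grade": ["A"] * 4 + ["G"] * 3,
    "emp_length": ["1 year", "1 year", "10+ years", "10+ years",
                   "1 year", "1 year", "10+ years"],
    "loan_condition": ["Good Loan", "Bad Loan", "Good Loan", "Good Loan",
                       "Bad Loan", "Bad Loan", "Bad Loan"],
})

is_bad = demo["loan_condition"].eq("Bad Loan")
keys = [demo["grade"], demo["emp_length"]]

# Bad-loan rate and number of loans for every grade x employment-length cell
rate = is_bad.groupby(keys).mean().unstack()
count = is_bad.groupby(keys).size().unstack()

MIN_LOANS = 2  # hypothetical threshold for this toy example
# Cells with too few loans become NaN and show up blank in a heatmap
reliable_rate = rate.where(count >= MIN_LOANS)

print(count)
print(reliable_rate)
```

Passing `reliable_rate` to `sns.heatmap` instead of the raw rates leaves under-populated cells empty rather than coloring them as extreme.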

5. Advanced Visualization Techniques

5.1 Dynamic Trend Visualization

Use Seaborn's relplot to show time trends across several dimensions:

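# Loan amount by year and credit grade
# relplot is a figure-level function that creates its own figure, so the size
# is set with height/aspect rather than with a separate plt.figure call
g = sns.relplot(
    x='issue_year', y='loan_amnt',
    hue='grade', hue_order=list('ABCDEFG'),
    size='loan_amnt', sizes=(40, 400),
    alpha=.5, palette='muted',
    height=6, aspect=2,
    data=loan_df.sample(10000, random_state=42)  # sample for speed
)
g.set(title='Loan Amount by Year and Credit Grade')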

5.2 Interactive Visualization (Optional)

This article mainly uses static charts, but they are easy to convert into interactive ones:

# Requires plotly: pip install plotly
import plotly.express as px

fig = px.scatter(
    loan_df.sample(10000, random_state=42),
    x='annual_inc', y='loan_amnt',
    color='loan_condition',
    hover_data=['grade', 'emp_length'],
    title='Income vs. Loan Amount (by Loan Condition)'
)
fig.show()

5.3 Custom Themes and Styling

Styling techniques that make charts look more professional:

# Set a clean style and palette
sns.set_style("whitegrid")
sns.set_palette("husl")

plt.figure(figsize=(12, 6))
ax = sns.barplot(x='grade', y='loan_amnt', hue='loan_condition',
                 data=loan_df, estimator=np.median)
plt.title('Median Loan Amount by Credit Grade (by Loan Condition)', pad=20)
plt.xlabel('Credit grade', labelpad=10)
plt.ylabel('Median loan amount', labelpad=10)
sns.despine(left=True, bottom=True)
ax.legend(title='Loan condition', frameon=False)

6. Automating the Analysis Report

6.1 Computing Key Metrics Automatically

Wrap the calculation of the core metrics in a function:

def calculate_kpis(df):
    """Compute key performance indicators for a set of loans."""
    kpis = {
        'Total loans': len(df),
        'Total loan amount': df['loan_amnt'].sum(),
        'Average loan amount': df['loan_amnt'].mean(),
        'Bad-loan rate': (df['loan_condition'] == 'Bad Loan').mean(),
        'Average interest rate': df['int_rate'].mean(),
        'Most common purpose': df['purpose'].value_counts().idxmax()
    }
    return pd.Series(kpis)

# Compute the KPIs per year
annual_kpis = loan_df.groupby('issue_year').apply(calculate_kpis)
print(annual_kpis)

6.2 Generating an Analysis Summary Automatically

Use Python to generate the analysis conclusions automatically:

def generate_summary(df):
    first_year = df['issue_year'].min()
    latest_year = df['issue_year'].max()
    is_bad = df['loan_condition'] == 'Bad Loan'
    # Most common purpose among loans issued in the latest year
    top_purpose = df.loc[df['issue_year'] == latest_year, 'purpose'].value_counts().idxmax()
    # (grade, emp_length) pair with the highest bad-loan rate
    riskiest = is_bad.groupby([df['grade'], df['emp_length']]).mean().idxmax()
    summary = f"""
Lending Club analysis summary ({first_year}-{latest_year}):
- Loans analyzed: {len(df):,}, total amount ${df['loan_amnt'].sum() / 1e6:.1f} million
- Average loan amount: ${df['loan_amnt'].mean():,.0f}
- Overall bad-loan rate: {is_bad.mean():.1%}
- Most common loan purpose in {latest_year}: {top_purpose}
- Highest-risk combination (grade, employment length): {riskiest}
"""
    return summary

print(generate_summary(loan_df))

6.3 Exporting the Results

Export the key results to Excel for business reporting:

# Create the Excel writer (writing .xlsx files requires an engine such as openpyxl)
with pd.ExcelWriter('lending_club_analysis.xlsx') as writer:
    # Export a sample of the cleaned data
    loan_df.sample(1000, random_state=42).to_excel(writer, sheet_name='Sample Data', index=False)

    # Export yearly statistics
    annual_stats = loan_df.groupby('issue_year').agg({
        'loan_amnt': ['count', 'sum', 'mean'],
        'int_rate': 'mean',
        'loan_condition': lambda x: (x == 'Bad Loan').mean()
    })
    annual_stats.to_excel(writer, sheet_name='Annual Stats')

    # Export the risk cross-tabulation
    risk_crosstab.to_excel(writer, sheet_name='Risk Analysis')

7. Lessons Learned and Optimization Tips

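In practice, working with 880,000 rows can run into performance problems. One useful technique is to reduce memory use by downcasting numeric columns: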

# Memory optimization: downcast numeric columns
def reduce_mem_usage(df):
    """Downcast numeric columns to the smallest type that holds their values."""
    start_mem = df.memory_usage(deep=True).sum() / 1024**2
    print(f"Initial memory usage: {start_mem:.2f} MB")

    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            # Picks the smallest of int8/int16/int32/int64 that fits
            df[col] = pd.to_numeric(df[col], downcast='integer')
        elif pd.api.types.is_float_dtype(df[col]):
            # Stops at float32; float16 is avoided because it cannot
            # represent typical loan amounts exactly
            df[col] = pd.to_numeric(df[col], downcast='float')
        # Object, datetime and other column types are left unchanged

    end_mem = df.memory_usage(deep=True).sum() / 1024**2
    print(f"Memory usage after optimization: {end_mem:.2f} MB "
          f"(reduced by {100 * (start_mem - end_mem) / start_mem:.1f}%)")
    return df

loan_df = reduce_mem_usage(loan_df)
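
Whatever downcasting routine is used, it is worth checking that the smaller type still represents the values exactly. The sketch below compares float16 and float32 on a few made-up amounts; float16 has an 11-bit significand, so integers above 2,048 are generally rounded.

```python
import numpy as np
import pandas as pd

# Hypothetical loan amounts for illustration only
amounts = pd.Series([35000.0, 12345.0, 9999.0])

as_f16 = amounts.astype(np.float16)
as_f32 = amounts.astype(np.float32)

# Round-trip back to float64 and compare with the original values
f16_exact = bool((as_f16.astype("float64") == amounts).all())
f32_exact = bool((as_f32.astype("float64") == amounts).all())

print("float16 round-trips exactly:", f16_exact)
print("float32 round-trips exactly:", f32_exact)
print(as_f16.astype("float64").tolist())
```

A check like this on the real columns, before and after downcasting, shows whether the memory savings come at the cost of silently altered amounts.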

For truly large datasets, consider the following more advanced options: