
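Spark Weibo Public Opinion Analysis System: Sentiment Analysis, Crawlers, Hadoop and Hive, Weibo and Tieba Dual-Platform Data, with Walkthrough Video (Big Data Project)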

news 2026/6/11 13:06:07



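1. Project Introduction

A Spark-based public opinion analysis system for Weibo and Baidu Tieba, covering crawlers, sentiment analysis, Hadoop and Hive, and a walkthrough video. It is a big data graduation project.

Tech stack:

Forum data (Baidu Tieba, Weibo)
Python, requests-based crawlers, the Django framework, SnowNLP sentiment analysis, a MySQL database, Echarts visualization
Hadoop, Spark, and Hive for big data processing, running on virtual machines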

2. Project Screens

(1) Home page: data overview

(2) Geographic distribution of Tieba users and Weibo users (map of China)
(3) Post analysis

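3. Project Description

Functional modules

I. Data Collection Module

Weibo crawlers
- spiderWeiboNav.py: fetches category information from the Weibo navigation group API and saves it to a local file.
- spiderWeibo.py: crawls the full content of Weibo posts for each category.
- spiderWeiboDetail.py: crawls Weibo comment data and saves it.
- changeData.py: cleans the Weibo data, for example by removing line breaks.
Tieba crawlers
- spiderTieba.py: crawls Baidu Tieba data for a specified topic.
- spiderTiebaDetail.py: crawls the replies to each post.
- hotWordDeal.py: computes word frequencies over post content and extracts hot words.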
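
As a rough illustration of the cleaning step (changeData.py) and the word-frequency step (hotWordDeal.py), the sketch below strips line breaks and counts tokens using only the standard library. The function names `clean_text` and `top_hot_words`, the stop-word set, and the whitespace tokenizer are assumptions made for this example; the project's own scripts are not shown in this article, and real Chinese text would need a word segmenter in place of `str.split`.

```python
import re
from collections import Counter


def clean_text(text):
    """Collapse line breaks and repeated whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def top_hot_words(posts, n=10, stop_words=frozenset()):
    """Return the n most frequent tokens across all posts as (word, count) pairs."""
    counter = Counter()
    for post in posts:
        # Whitespace tokenization is a placeholder for a real segmenter.
        tokens = clean_text(post).lower().split(" ")
        counter.update(t for t in tokens if t and t not in stop_words)
    return counter.most_common(n)


if __name__ == "__main__":
    sample = [
        "spark is fast\nspark scales",
        "hive runs on hadoop",
        "spark reads hive tables",
    ]
    print(top_hot_words(sample, n=2, stop_words={"is", "on"}))
```

Keeping cleaning and counting in separate functions means the tokenizer can be swapped out later without touching the frequency logic.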

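III. Data Storage and Management Module

Tieba data center
- Stores and manages Tieba data.
Weibo data center
- Stores and manages Weibo data.
Comment center
- Stores and manages comments from both Weibo and Tieba.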

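IV. User Interaction Module

Registration and login
- Provides user registration and login so that users can access the system.
Admin console
- Provides back-office management so that administrators can manage data and user permissions.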

V. Technical Architecture

Data collection: crawlers built on Python's requests library.
Sentiment analysis: SnowNLP.
Data storage: MySQL.
Big data processing: Hadoop, Spark, and Hive.
Visualization: Echarts.
Web framework: Django, which serves the web interface.
4. Core Code

#coding:utf8
# Imports (only the names this script uses; avoids shadowing the built-in sum, max and min)
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, col, when, to_date, expr, coalesce, lit
# Build the SparkSession with Hive support (the script runs top to bottom at module level)
spark = (
    SparkSession.builder.appName("sparkSQL").master("local[*]")
    .config("spark.sql.shuffle.partitions", 2)
    .config("spark.sql.warehouse.dir", "hdfs://node1:8020/user/hive/warehouse")
    .config("hive.metastore.uris", "thrift://node1:9083")
    .enableHiveSupport()
    .getOrCreate()
)
# Read the source tables from Hive
tiebadata = spark.read.table('tiebadata')
weibodata = spark.read.table('weibodata')
tiebaComment = spark.read.table('tiebaComment')
weiboComment = spark.read.table('weiboComment')
tiebaHotword = spark.read.table('tiebaHotword')
weiboHotword = spark.read.table('weiboHotword')
# Requirement 1: post counts by date
# Convert the time columns to dates
tiebadata = tiebadata.withColumn('postTime', to_date(col("postTime"), "yyyy-MM-dd HH:mm:ss"))
weibodata = weibodata.withColumn('createdAt', to_date(col("createdAt"), "yyyy-MM-dd"))
# Count posts per date and compute the number of days between each date and today
result1 = tiebadata.groupby("postTime").agg(count('*').alias("count"))
result2 = weibodata.groupby("createdAt").agg(count('*').alias("count"))
result1 = result1.withColumn("days_diff", expr("datediff(current_date(), postTime)"))
result2 = result2.withColumn("days_diff", expr("datediff(current_date(), createdAt)"))
result1 = result1.orderBy("days_diff")
result2 = result2.orderBy("days_diff")
# Rename the count columns and merge the two results
result1 = result1.withColumnRenamed("count", "post_count")
result2 = result2.withColumnRenamed("count", "created_count")
combined_result = (
    result1.join(result2, result1.postTime == result2.createdAt, "outer")
    .select(
        coalesce(result1.postTime, result2.createdAt).alias("date"),
        coalesce(result1.post_count, lit(0)).alias("post_count"),
        coalesce(result2.created_count, lit(0)).alias("created_count"),
    )
)
combined_result = combined_result.orderBy(col("date").desc())
# Save the result to MySQL and Hive
combined_result.write.mode("overwrite"). \
format("jdbc"). \
option("url", "jdbc:mysql://node1:3306/bigdata?useSSL=false&useUnicode=true&charset=utf8"). \
option("dbtable", "dateNum"). \
option("user", "root"). \
option("password", "root"). \
option("encoding", "utf-8"). \
save()
combined_result.write.mode("overwrite").saveAsTable("dateNum", "parquet")
spark.sql("select * from dateNum").show()
# Requirement 2: counts by category
# Count Weibo posts by type and Tieba posts by area
result3 = weibodata.groupby("type").count()
result4 = tiebadata.groupby("area").count()
# Save the results to MySQL and Hive
result3.write.mode("overwrite"). \
format("jdbc"). \
option("url", "jdbc:mysql://node1:3306/bigdata?useSSL=false&useUnicode=true&charset=utf8"). \
option("dbtable", "weiboTypeCount"). \
option("user", "root"). \
option("password", "root"). \
option("encoding", "utf-8"). \
save()
result3.write.mode("overwrite").saveAsTable("weiboTypeCount", "parquet")
spark.sql("select * from weiboTypeCount").show()
result4.write.mode("overwrite"). \
format("jdbc"). \
option("url", "jdbc:mysql://node1:3306/bigdata?useSSL=false&useUnicode=true&charset=utf8"). \
option("dbtable", "tiebaTypeCount"). \
option("user", "root"). \
option("password", "root"). \
option("encoding", "utf-8"). \
save()
result4.write.mode("overwrite").saveAsTable("tiebaTypeCount", "parquet")
spark.sql("select * from tiebaTypeCount").show()
# Requirement 3: top posts by likes
# Sort by like count in descending order and keep the top 10 rows
result5 = weibodata.orderBy(col("likeNum").desc()).limit(10)
result6 = tiebadata.orderBy(col("likeNum").desc()).limit(10)
# Save the results to MySQL and Hive
result5.write.mode("overwrite"). \
format("jdbc"). \
option("url", "jdbc:mysql://node1:3306/bigdata?useSSL=false&useUnicode=true&charset=utf8"). \
option("dbtable", "weiboLikeNum"). \
option("user", "root"). \
option("password", "root"). \
option("encoding", "utf-8"). \
save()
result5.write.mode("overwrite").saveAsTable("weiboLikeNum", "parquet")
spark.sql("select * from weiboLikeNum").show()
result6.write.mode("overwrite"). \
format("jdbc"). \
option("url", "jdbc:mysql://node1:3306/bigdata?useSSL=false&useUnicode=true&charset=utf8"). \
option("dbtable", "tiebaLikeNum"). \
option("user", "root"). \
option("password", "root"). \
option("encoding", "utf-8"). \
save()
result6.write.mode("overwrite").saveAsTable("tiebaLikeNum", "parquet")
spark.sql("select * from tiebaLikeNum").show()
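# Requirement 4: like-count ranges
# Bucket Weibo and Tieba posts by like count
# (between() is inclusive at both ends, so a boundary value lands in the first matching range)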
weibodata_like_category = weibodata.withColumn(
"likeCategory",
when(col("likeNum").between(0, 1000), '0-1000')
.when(col("likeNum").between(1000, 2000), '1000-2000')
.when(col("likeNum").between(2000, 5000), '2000-5000')
.when(col("likeNum").between(5000, 10000), '5000-10000')
.when(col("likeNum").between(10000, 20000), '10000-20000')
.otherwise('20000以上')
)
result7 = weibodata_like_category.groupby("likeCategory").count()
tiebadata_like_category = tiebadata.withColumn(
"likeCategory",
when(col("likeNum").between(0, 50), '0-50')
.when(col("likeNum").between(50, 100), '50-100')
.when(col("likeNum").between(100, 200), '100-200')
.when(col("likeNum").between(200, 500), '200-500')
.when(col("likeNum").between(500, 1000), '500-1000')
.otherwise('1000以上')
)
result8 = tiebadata_like_category.groupby("likeCategory").count()
# Save the results to MySQL and Hive
result7.write.mode("overwrite"). \
format("jdbc"). \
option("url", "jdbc:mysql://node1:3306/bigdata?useSSL=false&useUnicode=true&charset=utf8"). \
option("dbtable", "wLikeCategory"). \
option("user", "root"). \
option("password", "root"). \
option("encoding", "utf-8"). \
save()
result7.write.mode("overwrite").saveAsTable("wLikeCategory", "parquet")
spark.sql("select * from wLikeCategory").show()
# sql
result8.write.mode("overwrite"). \
format("jdbc"). \
option("url", "jdbc:mysql://node1:3306/bigdata?useSSL=false&useUnicode=true&charset=utf8"). \
option("dbtable", "tLikeCategory"). \
option("user", "root"). \
option("password", "root"). \
option("encoding", "utf-8"). \
save()
result8.write.mode("overwrite").saveAsTable("tLikeCategory", "parquet")
spark.sql("select * from tLikeCategory").show()
# Requirement 5: comment-count ranges
# Bucket Weibo posts by comment count and Tieba posts by reply count
weibodata_com_category = weibodata.withColumn(
"comCategory",
when(col("commentsLen").between(0, 10), '0-10')
.when(col("commentsLen").between(10, 50), '10-50')
.when(col("commentsLen").between(50, 100), '50-100')
.when(col("commentsLen").between(100, 500), '100-500')
.when(col("commentsLen").between(500, 1000), '500-1000')
.otherwise('1000以上')
)
tiebadata_com_category = tiebadata.withColumn(
"comCategory",
when(col("replyNum").between(0, 10), '0-10')
.when(col("replyNum").between(10, 50), '10-50')
.when(col("replyNum").between(50, 100), '50-100')
.when(col("replyNum").between(100, 500), '100-500')
.when(col("replyNum").between(500, 1000), '500-1000')
.otherwise('1000以上')
)
result9 = weibodata_com_category.groupby("comCategory").count()
result10 = tiebadata_com_category.groupby("comCategory").count()
# Merge the Weibo and Tieba comment-count results
combined_result2 = result9.join(result10, result9.comCategory == result10.comCategory, "outer") \
.select(
coalesce(result9.comCategory, result10.comCategory).alias("category"),
coalesce(result9['count'], lit(0)).alias("weibo_count"),
coalesce(result10['count'], lit(0)).alias("tieba_count"),
)
# Save the result to MySQL and Hive
combined_result2.write.mode("overwrite"). \
format("jdbc"). \
option("url", "jdbc:mysql://node1:3306/bigdata?useSSL=false&useUnicode=true&charset=utf8"). \
option("dbtable", "ComCategory"). \
option("user", "root"). \
option("password", "root"). \
option("encoding", "utf-8"). \
save()
combined_result2.write.mode("overwrite").saveAsTable("ComCategory", "parquet")
spark.sql("select * from ComCategory").show()
# Requirement 6: comment likes
# Bucket Weibo comments by like count
weibodata_comLike_category = weiboComment.withColumn(
"comLikeCategory",
when(col("likesCounts").between(0, 10), '0-10')
.when(col("likesCounts").between(10, 50), '10-50')
.when(col("likesCounts").between(50, 100), '50-100')
.when(col("likesCounts").between(100, 500), '100-500')
.when(col("likesCounts").between(500, 1000), '500-1000')
.otherwise('1000以上')
)
result11 = weibodata_comLike_category.groupby("comLikeCategory").count()
# Save the result to MySQL and Hive
result11.write.mode("overwrite"). \
format("jdbc"). \
option("url", "jdbc:mysql://node1:3306/bigdata?useSSL=false&useUnicode=true&charset=utf8"). \
option("dbtable", "comLikeCat"). \
option("user", "root"). \
option("password", "root"). \
option("encoding", "utf-8"). \
save()
result11.write.mode("overwrite").saveAsTable("comLikeCat", "parquet")
spark.sql("select * from comLikeCat").show()
# Requirement 7: commenter gender
# Count Weibo comments by author gender
result12 = weiboComment.groupby("authorGender").count()
# Save the result to MySQL and Hive
result12.write.mode("overwrite"). \
format("jdbc"). \
option("url", "jdbc:mysql://node1:3306/bigdata?useSSL=false&useUnicode=true&charset=utf8"). \
option("dbtable", "comGender"). \
option("user", "root"). \
option("password", "root"). \
option("encoding", "utf-8"). \
save()
result12.write.mode("overwrite").saveAsTable("comGender", "parquet")
spark.sql("select * from comGender").show()
# Requirement 8: comment locations
# Count Weibo and Tieba comments by source location
result13 = weiboComment.groupby("authorAddress").count()
result14 = tiebaComment.groupby("comAddress").count()
# Save the results to MySQL and Hive
result13.write.mode("overwrite"). \
format("jdbc"). \
option("url", "jdbc:mysql://node1:3306/bigdata?useSSL=false&useUnicode=true&charset=utf8"). \
option("dbtable", "weiboAddress"). \
option("user", "root"). \
option("password", "root"). \
option("encoding", "utf-8"). \
save()
result13.write.mode("overwrite").saveAsTable("weiboAddress", "parquet")
spark.sql("select * from weiboAddress").show()
result14.write.mode("overwrite"). \
format("jdbc"). \
option("url", "jdbc:mysql://node1:3306/bigdata?useSSL=false&useUnicode=true&charset=utf8"). \
option("dbtable", "tiebaAddress"). \
option("user", "root"). \
option("password", "root"). \
option("encoding", "utf-8"). \
save()
result14.write.mode("overwrite").saveAsTable("tiebaAddress", "parquet")
spark.sql("select * from tiebaAddress").show()
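# Requirement 9: post sentiment scores
# Classify Weibo and Tieba posts by sentiment score (消极 = negative, 中性 = neutral, 积极 = positive, 未知 = unknown)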
weibodata_scores_category = weibodata.withColumn(
"emoCategory",
when(col("scores").between(0, 0.45), '消极')
.when(col("scores").between(0.45, 0.55), '中性')
.when(col("scores").between(0.55, 1), '积极')
.otherwise('未知')
)
tiebadata_scores_category = tiebadata.withColumn(
"emoCategory",
when(col("scores").between(0, 0.45), '消极')
.when(col("scores").between(0.45, 0.55), '中性')
.when(col("scores").between(0.55, 1), '积极')
.otherwise('未知')
)
result15 = weibodata_scores_category.groupby("emoCategory").count()
result16 = tiebadata_scores_category.groupby("emoCategory").count()
# Save the results to MySQL and Hive
result15.write.mode("overwrite"). \
format("jdbc"). \
option("url", "jdbc:mysql://node1:3306/bigdata?useSSL=false&useUnicode=true&charset=utf8"). \
option("dbtable", "weiboEmoCount"). \
option("user", "root"). \
option("password", "root"). \
option("encoding", "utf-8"). \
save()
result15.write.mode("overwrite").saveAsTable("weiboEmoCount", "parquet")
spark.sql("select * from weiboEmoCount").show()
# sql
result16.write.mode("overwrite"). \
format("jdbc"). \
option("url", "jdbc:mysql://node1:3306/bigdata?useSSL=false&useUnicode=true&charset=utf8"). \
option("dbtable", "tiebaEmoCount"). \
option("user", "root"). \
option("password", "root"). \
option("encoding", "utf-8"). \
save()
result16.write.mode("overwrite").saveAsTable("tiebaEmoCount", "parquet")
spark.sql("select * from tiebaEmoCount").show()
# Requirement 10: hot-word sentiment
# Classify Weibo and Tieba hot words by sentiment score
weibodata_Hotscores_category = weiboHotword.withColumn(
"emoCategory",
when(col("scores").between(0, 0.45), '消极')
.when(col("scores").between(0.45, 0.55), '中性')
.when(col("scores").between(0.55, 1), '积极')
.otherwise('未知')
)
tiebadata_Hotscores_category = tiebaHotword.withColumn(
"emoCategory",
when(col("scores").between(0, 0.45), '消极')
.when(col("scores").between(0.45, 0.55), '中性')
.when(col("scores").between(0.55, 1), '积极')
.otherwise('未知')
)
result17 = weibodata_Hotscores_category.groupby("emoCategory").count()
result18 = tiebadata_Hotscores_category.groupby("emoCategory").count()
# Save the results to MySQL and Hive
result17.write.mode("overwrite"). \
format("jdbc"). \
option("url", "jdbc:mysql://node1:3306/bigdata?useSSL=false&useUnicode=true&charset=utf8"). \
option("dbtable", "weiboHotEmoCount"). \
option("user", "root"). \
option("password", "root"). \
option("encoding", "utf-8"). \
save()
result17.write.mode("overwrite").saveAsTable("weiboHotEmoCount", "parquet")
spark.sql("select * from weiboHotEmoCount").show()
result18.write.mode("overwrite"). \
format("jdbc"). \
option("url", "jdbc:mysql://node1:3306/bigdata?useSSL=false&useUnicode=true&charset=utf8"). \
option("dbtable", "tiebaHotEmoCount"). \
option("user", "root"). \
option("password", "root"). \
option("encoding", "utf-8"). \
save()
result18.write.mode("overwrite").saveAsTable("tiebaHotEmoCount", "parquet")
spark.sql("select * from tiebaHotEmoCount").show()
# Requirement 11: hot-word score ranges
# Bucket Weibo and Tieba hot-word sentiment scores into ranges of width 0.1
weiboHotWord_range_category = weiboHotword.withColumn(
"scoreCategory",
when(col("scores").between(0, 0.1), '0-0.1')
.when(col("scores").between(0.1, 0.2), '0.1-0.2')
.when(col("scores").between(0.2, 0.3), '0.2-0.3')
.when(col("scores").between(0.3, 0.4), '0.3-0.4')
.when(col("scores").between(0.4, 0.5), '0.4-0.5')
.when(col("scores").between(0.5, 0.6), '0.5-0.6')
.when(col("scores").between(0.6, 0.7), '0.6-0.7')
.when(col("scores").between(0.7, 0.8), '0.7-0.8')
.when(col("scores").between(0.8, 0.9), '0.8-0.9')
.when(col("scores").between(0.9, 1), '0.9-1')
.otherwise('overRange')
)
tiebaHotWord_range_category = tiebaHotword.withColumn(
"scoreCategory",
when(col("scores").between(0, 0.1), '0-0.1')
.when(col("scores").between(0.1, 0.2), '0.1-0.2')
.when(col("scores").between(0.2, 0.3), '0.2-0.3')
.when(col("scores").between(0.3, 0.4), '0.3-0.4')
.when(col("scores").between(0.4, 0.5), '0.4-0.5')
.when(col("scores").between(0.5, 0.6), '0.5-0.6')
.when(col("scores").between(0.6, 0.7), '0.6-0.7')
.when(col("scores").between(0.7, 0.8), '0.7-0.8')
.when(col("scores").between(0.8, 0.9), '0.8-0.9')
.when(col("scores").between(0.9, 1), '0.9-1')
.otherwise('overRange')
)
result19 = weiboHotWord_range_category.groupby("scoreCategory").count()
result20 = tiebaHotWord_range_category.groupby("scoreCategory").count()
# Save the results to MySQL and Hive
# result19 holds the Weibo hot words and result20 the Tieba hot words,
# so each is written to the table that matches its platform.
result19.write.mode("overwrite"). \
format("jdbc"). \
option("url", "jdbc:mysql://node1:3306/bigdata?useSSL=false&useUnicode=true&charset=utf8"). \
option("dbtable", "weiboScoreCount"). \
option("user", "root"). \
option("password", "root"). \
option("encoding", "utf-8"). \
save()
result19.write.mode("overwrite").saveAsTable("weiboScoreCount", "parquet")
spark.sql("select * from weiboScoreCount").show()
result20.write.mode("overwrite"). \
format("jdbc"). \
option("url", "jdbc:mysql://node1:3306/bigdata?useSSL=false&useUnicode=true&charset=utf8"). \
option("dbtable", "tiebaScoreCount"). \
option("user", "root"). \
option("password", "root"). \
option("encoding", "utf-8"). \
save()
result20.write.mode("overwrite").saveAsTable("tiebaScoreCount", "parquet")
spark.sql("select * from tiebaScoreCount").show()
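
Every result above is written twice with the same boilerplate: once to MySQL over JDBC and once to Hive as Parquet. One possible way to cut the repetition is a small helper, sketched below. The names `jdbc_options` and `save_result` are hypothetical and not part of the original script; the connection values are copied from the code above, and `df` is assumed to be a PySpark DataFrame.

```python
# Connection string copied verbatim from the script above.
JDBC_URL = "jdbc:mysql://node1:3306/bigdata?useSSL=false&useUnicode=true&charset=utf8"


def jdbc_options(table, user="root", password="root"):
    """Build the option dict shared by every JDBC write in the script."""
    return {
        "url": JDBC_URL,
        "dbtable": table,
        "user": user,
        "password": password,
        "encoding": "utf-8",
    }


def save_result(df, table):
    """Overwrite `table` in MySQL (via JDBC) and in Hive (as Parquet) with df."""
    df.write.mode("overwrite").format("jdbc").options(**jdbc_options(table)).save()
    df.write.mode("overwrite").saveAsTable(table, "parquet")
```

With a helper like this, each requirement could end in a single call such as `save_result(result3, "weiboTypeCount")`, which also makes a mismatch between a result and its table name easier to spot.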