PUBG Finish-Placement Prediction: Research Plan Report

Competition link:

https://aistudio.baidu.com/aistudio/competition/detail/155/0/introduction


0x00 Task Description

​ Build a finish-placement prediction model: given each player's statistics, their teammates' statistics, and the statistics of the other players in the same match, predict the final placement. Placement here is per team: players who queue together in a PUBG match all share the same final placement.

  1. Dataset size:
    Training set: 50,000 entries, about 1.5 million rows in total.
    Test set: 5,000 entries, about 500,000 rows in total.
  2. Data files:
    About 150 MB in total; all files are CSV with comma-separated columns.
    In the test set, the label field team_placement is empty and must be predicted by contestants.
  3. The full set of data fields is as follows:
    match_id: ID of the match
    team_id: ID of the team within the match
    game_size: number of teams in the match
    party_size: number of players per team in the match
    player_assists: the player's assist count
    player_dbno: the number of enemies the player knocked down
    player_dist_ride: distance the player traveled by vehicle
    player_dist_walk: distance the player traveled on foot
    player_dmg: damage dealt by the player
    player_kills: the player's kill count
    player_name: player name, globally unique across the training and test sets
    kill_distance_x_min: minimum x-coordinate distance to the victim when killing another player
    kill_distance_x_max: maximum x-coordinate distance to the victim when killing another player
    kill_distance_y_min: minimum y-coordinate distance to the victim when killing another player
    kill_distance_y_max: maximum y-coordinate distance to the victim when killing another player
    team_placement: the team's final placement
  4. Evaluation metric
    Submissions are scored with the mean absolute error (MAE) of the predicted placements; lower is better.
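
Concretely, MAE is the mean of the absolute differences between predicted and true team placements. A minimal sketch with numpy (the arrays are made-up examples, not competition data):

import numpy as np

y_true = np.array([4, 9, 1, 17])  # true team_placement values (made-up)
y_pred = np.array([6, 7, 1, 20])  # model predictions (made-up)

mae = np.abs(y_true - y_pred).mean()
print(mae)  # 1.75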

0x01 Baseline Code Analysis

  1. Data processing
train_df = train_df.drop(['match_id', 'team_id'], axis=1)
test_df = test_df.drop(['match_id', 'team_id'], axis=1)
train_df = train_df.fillna(0)
test_df = test_df.fillna(0)

​ As shown above, the baseline drops the match_id and team_id columns and fills NaN values with 0.

  2. Model definition
class Regressor(paddle.nn.Layer):
    # self refers to the class instance itself
    def __init__(self):
        # Initialize parent-class state
        super(Regressor, self).__init__()

        self.fc1 = paddle.nn.Linear(in_features=13, out_features=40)
        self.fc2 = paddle.nn.Linear(in_features=40, out_features=20)
        self.fc3 = paddle.nn.Linear(in_features=20, out_features=1)

        self.relu = paddle.nn.ReLU()

    # Forward pass of the network
    def forward(self, inputs):
        x = self.fc1(inputs)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.relu(x)
        x = self.fc3(x)
        x = self.relu(x)
        return x

model = Regressor()  # instantiation implied by the baseline; needed by the optimizer below
opt = paddle.optimizer.SGD(learning_rate=0.01, parameters=model.parameters())

​ The baseline uses three fully connected layers with ReLU non-linearities, optimized with SGD at a learning rate of 0.01. Note that the forward pass applies ReLU to the final layer's output as well, which constrains predictions to be non-negative.

  3. Training
EPOCH_NUM = 20    # number of training epochs
BATCH_SIZE = 1000 # batch size
training_data = train_df.iloc[:-10000].values.astype(np.float32)
val_data = train_df.iloc[-10000:].values.astype(np.float32)

​ Training runs for 20 epochs with a batch size of 1000; the last 10,000 rows are held out for validation, leaving 1,490,000 rows for training.
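
The baseline's actual training loop is not reproduced in the excerpt above; a minimal sketch of what it presumably does (shuffle each epoch, slice mini-batches, L1 loss, SGD step — the improved script in 0x02 follows the same pattern), reusing EPOCH_NUM, BATCH_SIZE, training_data, model, and opt from the snippets above:

import numpy as np
import paddle

for epoch_id in range(EPOCH_NUM):
    # Reshuffle the training rows at the start of every epoch
    np.random.shuffle(training_data)
    for k in range(0, len(training_data), BATCH_SIZE):
        batch = training_data[k:k + BATCH_SIZE]
        x = paddle.to_tensor(batch[:, :-1])   # 13 feature columns
        y = paddle.to_tensor(batch[:, -1:])   # team_placement label
        loss = paddle.nn.functional.l1_loss(model(x), y)
        opt.clear_grad()
        loss.backward()
        opt.step()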


0x02 Improvements on the Baseline

Improvements:

  1. Aggregate by team: average the rows of every player on a team into a single row per team (see the sketch after this list).

    Reason: since every player on a team shares the same final placement, averaging the team's rows better reflects how the team's overall performance drives its ranking.

  2. Scale team_placement proportionally by dividing it by the match's game_size, so the processed label always lies in [0, 1].

    Reason: normalizing team_placement is one of the optimization hints offered with the baseline; matches differ in the number of teams, so the same raw placement can mean very different outcomes, and after normalization the model can predict placements more consistently.

  3. Use 80% of the data for training and hold out the remaining 20% for validation.

    Reason: an 8:2 split is the usual convention.

  4. Switch to the Lamb optimizer (which the author found to work best among the optimizers in the Paddle documentation), with a cosine-annealing learning-rate scheduler and an initial learning rate of 0.01.

  5. Use a 14-layer fully connected network, still with ReLU activations.

    Reason: the dataset is large, so a deeper network is affordable; 14 layers was chosen to stay within the available compute-time budget.

  6. Train for 200 epochs.

    Reason: more epochs give the model more time to converge, and the checkpoint with the lowest validation loss is the one kept.

  7. Drop the columns [game_size, player_name, match_id, team_id].

    Reason: once team_placement is normalized, game_size carries little extra signal; inspecting the data shows player_name is just a monotonically increasing index, and match_id and team_id are likewise monotonically increasing, so none of them are useful features.
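
A minimal sketch of the team-level aggregation and label normalization on a toy frame (the column names match the dataset; the full script below does the same thing and then drops the raw per-player columns):

import pandas as pd

df = pd.DataFrame({
    "match_id":       [0, 0, 0, 0],
    "team_id":        [1, 1, 2, 2],
    "game_size":      [25, 25, 25, 25],
    "player_kills":   [3, 1, 0, 2],
    "team_placement": [4, 4, 9, 9],
})

# One row per team: average every numeric column over the team's players
team_df = df.groupby(["match_id", "team_id"], as_index=False).mean()

# Normalize the label into [0, 1] by the number of teams in the match
team_df["team_placement"] /= team_df["game_size"]
print(team_df)  # team 1 -> 0.16, team 2 -> 0.36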

Code:

################
## 95.28% ##
################
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
'''
# @Time : 2022/11/18 23:41:11
# @Author: wd-2711
'''
import pandas as pd
import numpy as np
import paddle
import time

class PUBGRegressor(paddle.nn.Layer):
    """The dataset is large, so a deep network is worth trying"""
    def __init__(self):
        super(PUBGRegressor, self).__init__()
        self.fc1 = paddle.nn.Linear(in_features=11, out_features=64)
        self.fc2 = paddle.nn.Linear(in_features=64, out_features=128)
        self.fc3 = paddle.nn.Linear(in_features=128, out_features=256)
        self.fc4 = paddle.nn.Linear(in_features=256, out_features=512)
        self.fc5 = paddle.nn.Linear(in_features=512, out_features=1024)
        self.fc6 = paddle.nn.Linear(in_features=1024, out_features=2048)
        self.fc7 = paddle.nn.Linear(in_features=2048, out_features=2048)
        self.fc8 = paddle.nn.Linear(in_features=2048, out_features=2048)
        self.fc9 = paddle.nn.Linear(in_features=2048, out_features=1024)
        self.fc10 = paddle.nn.Linear(in_features=1024, out_features=512)
        self.fc11 = paddle.nn.Linear(in_features=512, out_features=256)
        self.fc12 = paddle.nn.Linear(in_features=256, out_features=128)
        self.fc13 = paddle.nn.Linear(in_features=128, out_features=64)
        self.fc14 = paddle.nn.Linear(in_features=64, out_features=1)
        self.relu = paddle.nn.ReLU()

    def forward(self, inputs):
        x = self.relu(self.fc1(inputs))
        x = self.relu(self.fc2(x))
        x = self.relu(self.fc3(x))
        x = self.relu(self.fc4(x))
        x = self.relu(self.fc5(x))
        x = self.relu(self.fc6(x))
        x = self.relu(self.fc7(x))
        x = self.relu(self.fc8(x))
        x = self.relu(self.fc9(x))
        x = self.relu(self.fc10(x))
        x = self.relu(self.fc11(x))
        x = self.relu(self.fc12(x))
        x = self.relu(self.fc13(x))
        x = self.fc14(x)
        return x

print("[+] final v2")
st = time.time()
train_df = pd.read_csv('data/data137263/pubg_train.csv.zip')
test_df = pd.read_csv('data/data137263/pubg_test.csv.zip')

# Fill NaN values so training does not fail
train_df = train_df.fillna(0)
test_df = test_df.fillna(0)

# Drop the player_name column
train_df = train_df.drop(['player_name'], axis=1)
test_df = test_df.drop(['player_name'], axis=1)

# Aggregate by team and normalize the label
# 1. train_df
new_train_df = train_df.groupby(['match_id', 'team_id'], as_index=False).agg(np.mean)
new_train_df['team_placement'] = new_train_df['team_placement'] / new_train_df['game_size']
train_df = pd.merge(train_df, new_train_df, on=['match_id', 'team_id'], how="outer")
train_df = train_df.drop(train_df.columns[2:15], axis=1)
# 2. test_df
new_test_df = test_df.groupby(['match_id', 'team_id'], as_index=False).agg(np.mean)
test_df = pd.merge(test_df, new_test_df, on=['match_id', 'team_id'], how="outer")
test_df = test_df.drop(test_df.columns[2:14], axis=1)

# Normalize each feature column by dividing by its maximum value
for col in train_df.columns[3:-1]:
    train_df[col] /= train_df[col].max()
for col in test_df.columns[3:]:
    test_df[col] /= test_df[col].max()
print("[+] data processing cost {:.2f}s".format(time.time() - st))

# Instantiate the regression model defined above
model = PUBGRegressor()

# Switch the model to training mode
model.train()

# Optimizer: Lamb with a cosine-annealing learning-rate schedule
scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=0.01, T_max=200)
opt = paddle.optimizer.Lamb(learning_rate=scheduler, lamb_weight_decay=0.01, beta1=0.9, beta2=0.999, epsilon=1e-06, parameters=model.parameters(), grad_clip=None, name=None)

EPOCH_NUM = 200   # number of epochs; as many as time allows
BATCH_SIZE = 1000 # batch size; adjust to the hardware
print(f"[+] EPOCH {EPOCH_NUM}, BATCH_SIZE {BATCH_SIZE}")

# 80/20 train/validation split
training_data = train_df.iloc[:-1 * int(train_df.shape[0] * 0.2)].values.astype(np.float32)
val_data = train_df.iloc[-1 * int(train_df.shape[0] * 0.2):].values.astype(np.float32)
min_loss = 100

# Outer training loop
for epoch_id in range(EPOCH_NUM):
    st = time.time()
    # Shuffle the training rows before every epoch
    np.random.shuffle(training_data)

    # Slice the training data into mini-batches of BATCH_SIZE rows
    mini_batches = [training_data[k:min(k+BATCH_SIZE, len(training_data))] for k in range(0, len(training_data), BATCH_SIZE)][:-1]

    train_loss = []
    for iter_id, mini_batch in enumerate(mini_batches):
        # Clear gradients before the next backward pass
        opt.clear_grad()

        x = np.array(mini_batch[:, 3:-1])
        y = np.array(mini_batch[:, -1])

        # Convert numpy data to Paddle tensors
        features = paddle.to_tensor(x)
        y = paddle.to_tensor(y)

        # Forward pass
        predicts = model(features)

        # Compute the L1 loss
        loss = paddle.nn.functional.l1_loss(predicts, label=y)
        avg_loss = paddle.mean(loss)
        train_loss.append(avg_loss.numpy())

        # Backward pass: compute the gradient of every parameter
        avg_loss.backward()

        # Update the parameters one step along the schedule
        opt.step()

    # Validation pass
    mini_batches = [val_data[k:min(k+BATCH_SIZE, len(val_data))] for k in range(0, len(val_data), BATCH_SIZE)][:-1]
    val_loss = []
    for iter_id, mini_batch in enumerate(mini_batches):
        x = np.array(mini_batch[:, 3:-1])
        y = np.array(mini_batch[:, -1])

        features = paddle.to_tensor(x)
        y = paddle.to_tensor(y)

        predicts = model(features)
        loss = paddle.nn.functional.l1_loss(predicts, label=y)
        avg_loss = paddle.mean(loss)
        val_loss.append(avg_loss.numpy())

    print(f'Epoch {epoch_id}, train MAE {np.mean(train_loss) * 50:.3f}, val MAE {np.mean(val_loss) * 50:.3f}, timecost {time.time() - st:.2f}s')
    # Keep the checkpoint with the lowest validation loss
    if min_loss > np.mean(val_loss):
        min_loss = np.mean(val_loss)
        paddle.save(model.state_dict(), 'best-pubg.model')
print("min loss: ", min_loss * 50)

# Inference on the test set
model.eval()
test_data = paddle.to_tensor(test_df.iloc[:, 3:].values.astype(np.float32))
test_predict = model(test_data)
# Undo the label normalization: multiply by game_size, then round to integer placements
test_predict = test_predict.numpy().squeeze() * test_df.iloc[:, 2].to_numpy()
test_predict = test_predict.round().astype(int)

pd.DataFrame({
    'team_placement': test_predict
}).to_csv('submission_PUBG_final.csv', index=None)

# Compress submission.csv to zip format (AI Studio notebook shell command)
!zip submission_PUBG_final.zip submission_PUBG_final.csv
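
One caveat: the script saves the best validation checkpoint to best-pubg.model, but the inference step above uses whatever weights are in memory after the final epoch. To predict with the best checkpoint instead, a hedged sketch (not part of the original script):

# Load the best validation checkpoint before inference (optional improvement)
state_dict = paddle.load('best-pubg.model')
model.set_state_dict(state_dict)
model.eval()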

Result:

[Screenshot: leaderboard result, score 95.28%]


0x03 CART Decision Tree Approach

​ CART (Classification And Regression Tree) handles both classification and regression problems. For classification it uses the Gini index as its splitting criterion; for regression, as here, it splits by minimizing the squared error (variance) of the child nodes.
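
To illustrate the regression splitting rule (this is not part of the competition code), a tiny sketch that finds the best threshold on a single feature by minimizing the summed squared error of the two children:

import numpy as np

def best_split(x: np.ndarray, y: np.ndarray):
    """Best threshold on one feature by squared-error reduction (CART regression)."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = (None, np.inf)
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue
        left, right = y[:i], y[i:]
        sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if sse < best[1]:
            best = ((x[i - 1] + x[i]) / 2, sse)
    return best  # (threshold, total squared error of the two children)

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 6.0, 5.5, 20.0, 21.0, 19.0])
print(best_split(x, y))  # splits at 6.5, separating the two clusters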

Implementation

################
## 95.19% ##
################
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
'''
# @Time : 2022/11/20 20:45:43
# @Author: wd-2711
'''

import pandas as pd
from sklearn import model_selection
from sklearn import tree
from sklearn import metrics
import time

def gridSearch(args: list):
    """
    Search for the best hyperparameter values
    output -> {'max_depth': 11, 'min_samples_leaf': 28, 'min_samples_split': 72}
    """
    [X_train, _, y_train, _, _] = args

    # Candidate values for each hyperparameter
    # max_depth = [i+5 for i in range(20)]
    max_depth = [11]
    min_samples_split = [(i+5)*2 for i in range(31, 32)]
    min_samples_leaf = [(i+5)*2 for i in range(1, 10)]
    parameters = {'max_depth': max_depth, 'min_samples_split': min_samples_split, 'min_samples_leaf': min_samples_leaf}

    # Grid search over the candidate values
    print("[+] searching...")
    grid_dtcateg = model_selection.GridSearchCV(estimator=tree.DecisionTreeRegressor(), param_grid=parameters, cv=10)

    # Fit the model
    print("[+] fitting...")
    grid_dtcateg.fit(X_train, y_train)

    # Report the best parameter combination
    print(grid_dtcateg.best_params_)

def decisionTree(args: list):
    """
    Build the regression tree
    """
    [X_train, X_test, y_train, y_test, test_df] = args

    # Regression tree with the hyperparameters found by grid search
    CART_Reg = tree.DecisionTreeRegressor(max_depth=11, min_samples_leaf=28, min_samples_split=72)

    # Fit the regression tree
    CART_Reg.fit(X_train, y_train)

    # Predict on the held-out split
    pred = CART_Reg.predict(X_test)

    # Report the error (note: this is the *median* absolute error; the leaderboard uses mean absolute error)
    print("[+] MAE loss: {:.3f}".format(metrics.median_absolute_error(y_test, pred)))

    # Predict on the real test set
    pred_2 = CART_Reg.predict(test_df)
    pred_2 = pred_2.round().astype(int)

    pd.DataFrame({
        'team_placement': pred_2
    }).to_csv('submission.csv', index=None)

if __name__ == "__main__":
    train_df = pd.read_csv("./data/pubg_train.csv")
    test_df = pd.read_csv("./data/pubg_test.csv")

    train_df = train_df.fillna(0)
    test_df = test_df.fillna(0)

    # Split into training and held-out sets
    X_train, X_test, y_train, y_test = model_selection.train_test_split(train_df[train_df.columns[:-1]], train_df[train_df.columns[-1]], test_size=0.25, random_state=711)

    args = [X_train, X_test, y_train, y_test, test_df]

    # Search for the best hyperparameter values
    # gridSearch(args)

    # Build the tree with the best hyperparameters
    st = time.time()
    decisionTree(args)
    print("[+] timecost: {:.2f}s".format(time.time() - st))

​ Grid search yields the best hyperparameter values {'max_depth': 11, 'min_samples_leaf': 28, 'min_samples_split': 72}; on the held-out split the script prints:

[+] MAE loss: 3.790
[+] timecost: 13.15s

​ After submitting, the leaderboard score was 95.19%.

[Screenshot: leaderboard result, score 95.19%]
