旺衰 (Strength/Weakness) and Classification Algorithms

桃扇骨 2023-02-25 13:30

Drawing inferences across domains is not easy. Most tutorial examples use two features for classification, because two dimensions give a clear picture on a 2D plot. Here I take the most controversial question in BaZi (八字) numerology, the strength/weakness (旺衰) of the day master, and see how machine-learning algorithms fare on it. Since I have only recently started working with these algorithms, I also explain them along the way.

No weighting scheme is applied here: however you weight things, someone will dispute it. The analysis only records how each heavenly stem in the chart (including the stems hidden in the earthly branches) relates to the day master through the five relations: generating, assisting, restraining, draining, and exhausting (生助克泄耗). Letting a machine-learning algorithm do the rest is comparatively objective. The encoding is 1: generating (生), 2: assisting (助), 3: restraining (克), 4: draining (泄), 5: exhausting (耗). For how the data set is generated, see [八字生助克泄耗数据生成][Link 1].

![1][]

**1 Decision Trees**

**1.1 Random Forest**

The table below shows the mean cross-validation F1 score (`f1_weighted`) and the test-set score under several parameter settings. I casually changed two parameters and the score jumped by about 20 points, which surprised me. Why does that happen? Is there any practical strategy for tuning? See [随机森林(Random Forest)][Random Forest].

<table>
<thead>
<tr>
<th>Parameters</th>
<th>Mean CV score (f1_weighted)</th>
<th>Test score</th>
</tr>
</thead>
<tbody>
<tr>
<td>RandomForestClassifier(max_depth=6, n_estimators=100, random_state=6)</td>
<td>0.730452560922195</td>
<td>0.766831</td>
</tr>
<tr>
<td>RandomForestClassifier(max_features=0.2, n_estimators=100)</td>
<td>0.9554445958928852</td>
<td>0.9572016460905349</td>
</tr>
<tr>
<td>RandomForestClassifier(max_features=0.2, n_estimators=200)</td>
<td>0.9562679021291374</td>
<td>0.9588348765432099</td>
</tr>
<tr>
<td>Decision tree model from 1.2 switched to a random forest, one-hot encoding, n_estimators=100</td>
<td>0.9630319021784692</td>
<td>0.9647087191358025</td>
</tr>
<tr>
<td>Decision tree model from 1.2 switched to a random forest, one-hot encoding, n_estimators=200</td>
<td>0.9630319021784692</td>
<td>0.9647087191358025</td>
</tr>
<tr>
<td>Decision tree model from 1.2 switched to a random forest, one-hot encoding, n_estimators=300</td>
<td>0.9535918844663577</td>
<td>0.9565779320987654</td>
</tr>
</tbody>
</table>

Raising `n_estimators` from 100 to 200 barely changes the score.

I used to think hyperparameters had to be tried out one by one; it turns out they can be searched with grid search, see [sklearn随机森林调参小结][sklearn]. Note that `scoring='roc_auc'` is not appropriate in every scenario (see [机器学习:multiclass format is not supported][multiclass format is not supported]). [roc\_auc][roc_auc] is an evaluation metric: AUC is the area under the ROC curve, and the larger the area, the better the classifier.

The code below is not the best version; [Python机器学习笔记:Grid SearchCV(网格搜索)][Python_Grid SearchCV] is a better reference. Grid search tunes the parameters of one model; to compare which model works best, see [集成学习voting Classifier在sklearn中的实现][voting Classifier_sklearn] and the sketch at the end of this section.

    param_estis = {'n_estimators': range(80, 250, 10)}
    gs = ms.GridSearchCV(estimator=se.RandomForestClassifier(min_samples_split=100,
                                                             min_samples_leaf=20,
                                                             max_depth=8,
                                                             max_features='sqrt',
                                                             random_state=10),
                         param_grid=param_estis, cv=5)
    gs.fit(x_train, y_train)
    # grid_scores_ was removed in scikit-learn 0.20; cv_results_ holds the per-candidate results
    print('CV results: {}, best params: {}, best score: {}, finished at: {}'.format(
        gs.cv_results_, gs.best_params_, gs.best_score_,
        time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())))

A decision tree can be applied to an unfamiliar data set and extract a series of rules from it. See [随机森林过拟合问题][Link 2].

<table>
<thead>
<tr>
<th>Aspect</th>
<th>Description</th>
<th>Remarks</th>
</tr>
</thead>
<tbody>
<tr>
<td>Strengths</td>
<td>Insensitive to missing intermediate values; can handle irrelevant features</td>
<td></td>
</tr>
<tr>
<td>Weaknesses</td>
<td>Prone to overfitting</td>
<td></td>
</tr>
<tr>
<td>Data types</td>
<td>Numeric and nominal</td>
<td>Nominal data take values from a finite set; the generating/restraining relations and the 旺衰 label are nominal, see <a href="https://www.cnblogs.com/cnkai/p/7755097.html" rel="nofollow">标称型特征编码</a></td>
</tr>
</tbody>
</table>

[sklearn随机森林-分类参数详解][sklearn_-]

    import pandas as pd
    import sklearn.model_selection as ms
    import sklearn.ensemble as se
    import time
    import warnings
    warnings.filterwarnings('ignore')

    def get_data():
        '''
        Load the data set: columns from index 2 onward are the relation
        features, column 'ws' is the strength/weakness label.
        '''
        data = pd.read_csv('sample.csv')
        # DataFrame.get_values() is deprecated; use .values instead
        return data.values[:, 2:], data['ws']

    def train():
        X, Y = get_data()
        print('Data set loaded')
        x_train, x_test, y_train, y_test = ms.train_test_split(X, Y, test_size=0.2, random_state=1)
        print('Train/test split done')
        model = se.RandomForestClassifier(max_depth=6, n_estimators=100, random_state=6)
        # scoring='f1_weighted' returns weighted F1 scores across the folds
        cv = ms.cross_val_score(model, x_train, y_train, cv=4, scoring='f1_weighted')
        print('Mean CV score (f1_weighted): {}'.format(cv.mean()))
        print('Training started: {}'.format(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())))
        model.fit(x_train, y_train)
        s = model.score(x_test, y_test)
        print('Test score: {}, finished at: {}'.format(s, time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())))
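The voting-classifier article linked above only describes the idea, so here is a minimal sketch of how the models from this post could be combined with a `VotingClassifier`. It reuses `get_data()` from the code above; the function name `train_voting` and the base-model parameters are illustrative assumptions, not tuned values.

    import sklearn.model_selection as ms
    import sklearn.ensemble as se
    from sklearn import tree
    from sklearn.naive_bayes import GaussianNB

    def train_voting():
        # Reuses get_data() defined above: relation features plus the 'ws' label
        X, Y = get_data()
        x_train, x_test, y_train, y_test = ms.train_test_split(X, Y, test_size=0.2, random_state=1)
        # Hard voting: each base model casts one vote per sample
        # and the majority class wins (parameters here are assumptions)
        model = se.VotingClassifier(estimators=[
            ('rf', se.RandomForestClassifier(n_estimators=100, random_state=6)),
            ('dt', tree.DecisionTreeClassifier(criterion='gini')),
            ('nb', GaussianNB()),
        ], voting='hard')
        model.fit(x_train, y_train)
        print('Voting classifier test score: {}'.format(model.score(x_test, y_test)))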
**1.2 Decision Tree**

See [Python学习教程:决策树算法(三)sklearn决策树实战][Python_sklearn]; for handling null values see [Pandas的数据清洗-填充NaN][Pandas_-_NaN]. Running the example as-is fails with `ValueError: Unknown label type: 'unknown'`, because sklearn does not recognize the dtype of y; the `astype('float')` cast below fixes it.

![1][1 1]

The resulting score is `0.9634934413580247`, which seems even better than the random forest. Per [机器学习:决策树(基尼系数][Link 3], information entropy and the Gini index give almost the same score:

<table>
<thead>
<tr>
<th>Criterion</th>
<th>Score</th>
</tr>
</thead>
<tbody>
<tr>
<td>gini</td>
<td>0.9634934413580247</td>
</tr>
<tr>
<td>entropy</td>
<td>0.9638406635802469</td>
</tr>
</tbody>
</table>

    import pandas as pd
    import sklearn.model_selection as ms
    import sklearn.feature_extraction as fe
    import time
    import warnings
    warnings.filterwarnings('ignore')
    from sklearn import tree

    def train_small():
        data = pd.read_csv('sample1.csv')
        # Fill missing values with the placeholder '无' (none)
        data.fillna('无', inplace=True)
        print(data.sample(20))
        # One-hot encode the nominal features
        vec = fe.DictVectorizer(sparse=False)
        feature = data[['yg', 'mg', 'hg', 'yz_b', 'yz_z', 'yz_y', 'mz_b',
                        'mz_z', 'mz_y', 'dz_b', 'dz_z', 'dz_y', 'hz_b', 'hz_z', 'hz_y']]
        print(data['ws'].unique())
        # orient='records' (not 'record') yields one dict per row for the vectorizer
        X = vec.fit_transform(feature.to_dict(orient='records'))
        # Cast the labels so sklearn recognizes their type
        Y = data['ws'].astype('float')
        print('Data set loaded')
        # criterion is one of 'gini' (Gini index) or 'entropy' (information entropy)
        model = tree.DecisionTreeClassifier(criterion='gini')
        x_train, x_test, y_train, y_test = ms.train_test_split(X, Y, test_size=0.15, random_state=1)
        print('Train/test split done')
        print('Training started: {}'.format(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())))
        model.fit(x_train, y_train)
        s = model.score(x_test, y_test)
        print('Test score: {}, finished at: {}'.format(s, time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())))

![1][1 2]

**2 Naive Bayes**

See [朴素贝叶斯-分类及Sklearn库实现(1)机器学习实战][-_Sklearn_1]. As the table below shows, the naive Bayes scores are not high; the algorithm is probably fairly sensitive to the data, though it trains remarkably fast. A sketch comparing all three variants follows at the end of this section.

<table>
<thead>
<tr>
<th>Model</th>
<th>Score</th>
</tr>
</thead>
<tbody>
<tr>
<td>GaussianNB</td>
<td>0.6287422839506173</td>
</tr>
<tr>
<td>MultinomialNB</td>
<td>0.48348765432098767</td>
</tr>
<tr>
<td>BernoulliNB</td>
<td>0.4255272633744856</td>
</tr>
</tbody>
</table>

**2.1 Gaussian Naive Bayes**

    from sklearn.naive_bayes import GaussianNB

    def train_gauss():
        # get_data(), ms and time come from the imports in section 1.1
        X, Y = get_data()
        print('Data set loaded')
        x_train, x_test, y_train, y_test = ms.train_test_split(X, Y, test_size=0.15, random_state=1)
        model = GaussianNB()
        print('Training started: {}'.format(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())))
        model.fit(x_train, y_train)
        s = model.score(x_test, y_test)
        print('Test score: {}, finished at: {}'.format(s, time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())))

[Link 1]: https://blog.csdn.net/warrah/article/details/107314454
[1]: /images/20230209/946c43ee0fb14d599638dcfdfb2dd2e9.png
[Random Forest]: https://www.cnblogs.com/maybe2030/p/4585705.html
[sklearn]: https://blog.csdn.net/nowfuture/article/details/81745177
[multiclass format is not supported]: https://blog.csdn.net/u013884777/article/details/81169008
[roc_auc]: https://blog.csdn.net/qq_20011607/article/details/81712811
[Python_Grid SearchCV]: https://www.cnblogs.com/wj-1314/p/10422159.html
[voting Classifier_sklearn]: https://blog.csdn.net/m0_37725003/article/details/81095555
[Link 2]: https://blog.csdn.net/u010429286/article/details/100101768
[sklearn_-]: https://blog.csdn.net/R18830287035/article/details/89257857
[Python_sklearn]: https://blog.csdn.net/qq_42992919/article/details/99558710
[Pandas_-_NaN]: http://liao.cpython.org/pandas21/
[1 1]: /images/20230209/1fcff4f337e84715b6ed91831e240ac0.png
[Link 3]: http://www.mamicode.com/info-detail-2412736.html
[1 2]: /images/20230209/522c0636a85649fdaa05f5b6e5c400f4.png
[-_Sklearn_1]: https://blog.csdn.net/BIT_666/article/details/79702066
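The table in section 2 reports scores for all three naive Bayes variants, but only the Gaussian version is shown in code. Here is a minimal sketch of how such a comparison might be run in one loop, reusing `get_data()` from section 1.1; the function name `compare_bayes` is my own, and MultinomialNB only works here because the relation codes 1-5 are non-negative.

    import sklearn.model_selection as ms
    from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

    def compare_bayes():
        # Reuses get_data() from section 1.1: relation features plus the 'ws' label
        X, Y = get_data()
        x_train, x_test, y_train, y_test = ms.train_test_split(X, Y, test_size=0.15, random_state=1)
        for name, model in [('GaussianNB', GaussianNB()),
                            ('MultinomialNB', MultinomialNB()),  # requires non-negative features
                            ('BernoulliNB', BernoulliNB())]:
            model.fit(x_train, y_train)
            print('{}: {}'.format(name, model.score(x_test, y_test)))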