旺衰 (Strength/Weakness) and Classification Algorithms

桃扇骨 2023-02-25 13:30

Drawing inferences across domains is not easy. Most tutorial examples use two features for classification, because two dimensions give a clear picture on a 2D plot. Here I take the most controversial question in BaZi (八字) numerology, the strength/weakness (旺衰) of the day master, and see how machine-learning algorithms fare on it. Since I have only recently started working with these algorithms, I also explain them along the way.

No weighting scheme is applied here: however you weight things, someone will dispute it. The analysis only records how each heavenly stem in the chart (including the stems hidden in the earthly branches) relates to the day master through the five relations: generating, assisting, restraining, draining, and exhausting (生助克泄耗). Letting a machine-learning algorithm do the rest is comparatively objective. The encoding is 1: generating (生), 2: assisting (助), 3: restraining (克), 4: draining (泄), 5: exhausting (耗). For how the data set is generated, see [八字生助克泄耗数据生成][Link 1].

![1][]

**1 Decision Trees**

**1.1 Random Forest**

The table below shows the mean cross-validation F1 score (`f1_weighted`) and the test-set score under several parameter settings. I casually changed two parameters and the score jumped by about 20 points, which surprised me. Why does that happen? Is there any practical strategy for tuning? See [随机森林(Random Forest)][Random Forest].

<table>
<thead>
<tr>
<th>Parameters</th>
<th>Mean CV score (f1_weighted)</th>
<th>Test score</th>
</tr>
</thead>
<tbody>
<tr>
<td>RandomForestClassifier(max_depth=6, n_estimators=100, random_state=6)</td>
<td>0.730452560922195</td>
<td>0.766831</td>
</tr>
<tr>
<td>RandomForestClassifier(max_features=0.2, n_estimators=100)</td>
<td>0.9554445958928852</td>
<td>0.9572016460905349</td>
</tr>
<tr>
<td>RandomForestClassifier(max_features=0.2, n_estimators=200)</td>
<td>0.9562679021291374</td>
<td>0.9588348765432099</td>
</tr>
<tr>
<td>Decision tree model from 1.2 switched to a random forest, one-hot encoding, n_estimators=100</td>
<td>0.9630319021784692</td>
<td>0.9647087191358025</td>
</tr>
<tr>
<td>Decision tree model from 1.2 switched to a random forest, one-hot encoding, n_estimators=200</td>
<td>0.9630319021784692</td>
<td>0.9647087191358025</td>
</tr>
<tr>
<td>Decision tree model from 1.2 switched to a random forest, one-hot encoding, n_estimators=300</td>
<td>0.9535918844663577</td>
<td>0.9565779320987654</td>
</tr>
</tbody>
</table>

Raising `n_estimators` from 100 to 200 barely changes the score.

I used to think hyperparameters had to be tried out one by one; it turns out they can be searched with grid search, see [sklearn随机森林调参小结][sklearn]. Note that `scoring='roc_auc'` is not appropriate in every scenario (see [机器学习:multiclass format is not supported][multiclass format is not supported]). [roc\_auc][roc_auc] is an evaluation metric: AUC is the area under the ROC curve, and the larger the area, the better the classifier.

The code below is not the best version; [Python机器学习笔记:Grid SearchCV(网格搜索)][Python_Grid SearchCV] is a better reference. Grid search tunes the parameters of one model; to compare which model works best, see [集成学习voting Classifier在sklearn中的实现][voting Classifier_sklearn] and the sketch at the end of this section.

    param_estis = {'n_estimators': range(80, 250, 10)}
    gs = ms.GridSearchCV(estimator=se.RandomForestClassifier(min_samples_split=100,
                                                             min_samples_leaf=20,
                                                             max_depth=8,
                                                             max_features='sqrt',
                                                             random_state=10),
                         param_grid=param_estis, cv=5)
    gs.fit(x_train, y_train)
    # grid_scores_ was removed in scikit-learn 0.20; cv_results_ holds the per-candidate results
    print('CV results: {}, best params: {}, best score: {}, finished at: {}'.format(
        gs.cv_results_, gs.best_params_, gs.best_score_,
        time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())))

A decision tree can be applied to an unfamiliar data set and extract a series of rules from it. See [随机森林过拟合问题][Link 2].

<table>
<thead>
<tr>
<th>Aspect</th>
<th>Description</th>
<th>Remarks</th>
</tr>
</thead>
<tbody>
<tr>
<td>Strengths</td>
<td>Insensitive to missing intermediate values; can handle irrelevant features</td>
<td></td>
</tr>
<tr>
<td>Weaknesses</td>
<td>Prone to overfitting</td>
<td></td>
</tr>
<tr>
<td>Data types</td>
<td>Numeric and nominal</td>
<td>Nominal data take values from a finite set; the generating/restraining relations and the 旺衰 label are nominal, see <a href="https://www.cnblogs.com/cnkai/p/7755097.html" rel="nofollow">标称型特征编码</a></td>
</tr>
</tbody>
</table>

[sklearn随机森林-分类参数详解][sklearn_-]

    import pandas as pd
    import sklearn.model_selection as ms
    import sklearn.ensemble as se
    import time
    import warnings
    warnings.filterwarnings('ignore')

    def get_data():
        '''
        Load the data set: columns from index 2 onward are the relation
        features, column 'ws' is the strength/weakness label.
        '''
        data = pd.read_csv('sample.csv')
        # DataFrame.get_values() is deprecated; use .values instead
        return data.values[:, 2:], data['ws']

    def train():
        X, Y = get_data()
        print('Data set loaded')
        x_train, x_test, y_train, y_test = ms.train_test_split(X, Y, test_size=0.2, random_state=1)
        print('Train/test split done')
        model = se.RandomForestClassifier(max_depth=6, n_estimators=100, random_state=6)
        # scoring='f1_weighted' returns weighted F1 scores across the folds
        cv = ms.cross_val_score(model, x_train, y_train, cv=4, scoring='f1_weighted')
        print('Mean CV score (f1_weighted): {}'.format(cv.mean()))
        print('Training started: {}'.format(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())))
        model.fit(x_train, y_train)
        s = model.score(x_test, y_test)
        print('Test score: {}, finished at: {}'.format(s, time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())))
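The voting-classifier article linked above only describes the idea, so here is a minimal sketch of how the models from this post could be combined with a `VotingClassifier`. It reuses `get_data()` from the code above; the function name `train_voting` and the base-model parameters are illustrative assumptions, not tuned values.

    import sklearn.model_selection as ms
    import sklearn.ensemble as se
    from sklearn import tree
    from sklearn.naive_bayes import GaussianNB

    def train_voting():
        # Reuses get_data() defined above: relation features plus the 'ws' label
        X, Y = get_data()
        x_train, x_test, y_train, y_test = ms.train_test_split(X, Y, test_size=0.2, random_state=1)
        # Hard voting: each base model casts one vote per sample
        # and the majority class wins (parameters here are assumptions)
        model = se.VotingClassifier(estimators=[
            ('rf', se.RandomForestClassifier(n_estimators=100, random_state=6)),
            ('dt', tree.DecisionTreeClassifier(criterion='gini')),
            ('nb', GaussianNB()),
        ], voting='hard')
        model.fit(x_train, y_train)
        print('Voting classifier test score: {}'.format(model.score(x_test, y_test)))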
**1.2 Decision Tree**

See [Python学习教程:决策树算法(三)sklearn决策树实战][Python_sklearn]; for handling null values see [Pandas的数据清洗-填充NaN][Pandas_-_NaN]. Running the example as-is fails with `ValueError: Unknown label type: 'unknown'`, because sklearn does not recognize the dtype of y; the `astype('float')` cast below fixes it.

![1][1 1]

The resulting score is `0.9634934413580247`, which seems even better than the random forest. Per [机器学习:决策树(基尼系数][Link 3], information entropy and the Gini index give almost the same score:

<table>
<thead>
<tr>
<th>Criterion</th>
<th>Score</th>
</tr>
</thead>
<tbody>
<tr>
<td>gini</td>
<td>0.9634934413580247</td>
</tr>
<tr>
<td>entropy</td>
<td>0.9638406635802469</td>
</tr>
</tbody>
</table>

    import pandas as pd
    import sklearn.model_selection as ms
    import sklearn.feature_extraction as fe
    import time
    import warnings
    warnings.filterwarnings('ignore')
    from sklearn import tree

    def train_small():
        data = pd.read_csv('sample1.csv')
        # Fill missing values with the placeholder '无' (none)
        data.fillna('无', inplace=True)
        print(data.sample(20))
        # One-hot encode the nominal features
        vec = fe.DictVectorizer(sparse=False)
        feature = data[['yg', 'mg', 'hg', 'yz_b', 'yz_z', 'yz_y', 'mz_b',
                        'mz_z', 'mz_y', 'dz_b', 'dz_z', 'dz_y', 'hz_b', 'hz_z', 'hz_y']]
        print(data['ws'].unique())
        # orient='records' (not 'record') yields one dict per row for the vectorizer
        X = vec.fit_transform(feature.to_dict(orient='records'))
        # Cast the labels so sklearn recognizes their type
        Y = data['ws'].astype('float')
        print('Data set loaded')
        # criterion is one of 'gini' (Gini index) or 'entropy' (information entropy)
        model = tree.DecisionTreeClassifier(criterion='gini')
        x_train, x_test, y_train, y_test = ms.train_test_split(X, Y, test_size=0.15, random_state=1)
        print('Train/test split done')
        print('Training started: {}'.format(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())))
        model.fit(x_train, y_train)
        s = model.score(x_test, y_test)
        print('Test score: {}, finished at: {}'.format(s, time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())))

![1][1 2]

**2 Naive Bayes**

See [朴素贝叶斯-分类及Sklearn库实现(1)机器学习实战][-_Sklearn_1]. As the table below shows, the naive Bayes scores are not high; the algorithm is probably fairly sensitive to the data, though it trains remarkably fast. A sketch comparing all three variants follows at the end of this section.

<table>
<thead>
<tr>
<th>Model</th>
<th>Score</th>
</tr>
</thead>
<tbody>
<tr>
<td>GaussianNB</td>
<td>0.6287422839506173</td>
</tr>
<tr>
<td>MultinomialNB</td>
<td>0.48348765432098767</td>
</tr>
<tr>
<td>BernoulliNB</td>
<td>0.4255272633744856</td>
</tr>
</tbody>
</table>

**2.1 Gaussian Naive Bayes**

    from sklearn.naive_bayes import GaussianNB

    def train_gauss():
        # get_data(), ms and time come from the imports in section 1.1
        X, Y = get_data()
        print('Data set loaded')
        x_train, x_test, y_train, y_test = ms.train_test_split(X, Y, test_size=0.15, random_state=1)
        model = GaussianNB()
        print('Training started: {}'.format(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())))
        model.fit(x_train, y_train)
        s = model.score(x_test, y_test)
        print('Test score: {}, finished at: {}'.format(s, time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())))

[Link 1]: https://blog.csdn.net/warrah/article/details/107314454
[1]: /images/20230209/946c43ee0fb14d599638dcfdfb2dd2e9.png
[Random Forest]: https://www.cnblogs.com/maybe2030/p/4585705.html
[sklearn]: https://blog.csdn.net/nowfuture/article/details/81745177
[multiclass format is not supported]: https://blog.csdn.net/u013884777/article/details/81169008
[roc_auc]: https://blog.csdn.net/qq_20011607/article/details/81712811
[Python_Grid SearchCV]: https://www.cnblogs.com/wj-1314/p/10422159.html
[voting Classifier_sklearn]: https://blog.csdn.net/m0_37725003/article/details/81095555
[Link 2]: https://blog.csdn.net/u010429286/article/details/100101768
[sklearn_-]: https://blog.csdn.net/R18830287035/article/details/89257857
[Python_sklearn]: https://blog.csdn.net/qq_42992919/article/details/99558710
[Pandas_-_NaN]: http://liao.cpython.org/pandas21/
[1 1]: /images/20230209/1fcff4f337e84715b6ed91831e240ac0.png
[Link 3]: http://www.mamicode.com/info-detail-2412736.html
[1 2]: /images/20230209/522c0636a85649fdaa05f5b6e5c400f4.png
[-_Sklearn_1]: https://blog.csdn.net/BIT_666/article/details/79702066
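The table in section 2 reports scores for all three naive Bayes variants, but only the Gaussian version is shown in code. Here is a minimal sketch of how such a comparison might be run in one loop, reusing `get_data()` from section 1.1; the function name `compare_bayes` is my own, and MultinomialNB only works here because the relation codes 1-5 are non-negative.

    import sklearn.model_selection as ms
    from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

    def compare_bayes():
        # Reuses get_data() from section 1.1: relation features plus the 'ws' label
        X, Y = get_data()
        x_train, x_test, y_train, y_test = ms.train_test_split(X, Y, test_size=0.15, random_state=1)
        for name, model in [('GaussianNB', GaussianNB()),
                            ('MultinomialNB', MultinomialNB()),  # requires non-negative features
                            ('BernoulliNB', BernoulliNB())]:
            model.fit(x_train, y_train)
            print('{}: {}'.format(name, model.score(x_test, y_test)))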