"""It is an interface for ranking feature importance."""

from typing import List, TypeVar, Any
from sklearn import ensemble
from sklearn import feature_selection
from sklearn import tree
from sklearn import svm
from sklearn.svm import SVR
from sklearn.linear_model import RandomizedLogisticRegression
import logging

# Project-internal constants (import path assumed from the TCARER package layout).
from Configs.CONSTANTS import CONSTANTS

PandasDataFrame = TypeVar('DataFrame')

__author__ = "Mohsen Mesgarpour"
__copyright__ = "Copyright 2016, https://github.com/mesgarpour"
__credits__ = ["Mohsen Mesgarpour"]
__license__ = "GPL"
__version__ = "1.1"
__email__ = "mohsen.mesgarpour@gmail.com"
__status__ = "Release"


class FeatureSelection:
    def __init__(self):
        """Initialise the objects and constants."""
        self.__logger = logging.getLogger(CONSTANTS.app_name)
        self.__logger.debug(__name__)

    def rank_random_forest_breiman(self,
                                   features_indep_df: PandasDataFrame,
                                   feature_target: List,
                                   n_jobs: int = -1,
                                   **kwargs: Any) -> object:
        """Use the Breiman Random Forest Classifier to rank features.
        Attributes:
            model.estimators_
            model.classes_
            model.n_classes_
            model.n_features_
            model.n_outputs_
            model.feature_importances_
        :param features_indep_df: the independent features, which are inputted into the model.
        :param feature_target: the target feature, which is being estimated.
        :param n_jobs: number of CPUs to use during the resampling. If '-1', use all the CPUs.
        :param kwargs: n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1,
            min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, bootstrap=True,
            oob_score=False, random_state=None, verbose=0, warm_start=False, class_weight=None
        :return: the importance ranking model.
        """
        self.__logger.debug("Run Random Forest Classifier (Breiman).")
        classifier = ensemble.RandomForestClassifier(n_jobs=n_jobs, **kwargs)
        return classifier.fit(features_indep_df, feature_target)

    def rank_random_logistic_regression(self,
                                        features_indep_df: PandasDataFrame,
                                        feature_target: List,
                                        n_jobs: int = -1,
                                        **kwargs: Any) -> object:
        """Use Randomized Logistic Regression to rank features.
        Attributes:
            model.scores_
            model.all_scores_
        :param features_indep_df: the independent features, which are inputted into the model.
        :param feature_target: the target feature, which is being estimated.
        :param n_jobs: number of CPUs to use during the resampling. If '-1', use all the CPUs.
        :param kwargs: C=1, scaling=0.5, sample_fraction=0.75, n_resampling=200, selection_threshold=0.25, tol=0.001,
            fit_intercept=True, verbose=False, normalize=True, random_state=None, pre_dispatch='3*n_jobs'
        :return: the importance ranking model.
        """
        self.__logger.debug("Run Random Logistic Regression.")
        classifier = RandomizedLogisticRegression(n_jobs=n_jobs, **kwargs)
        return classifier.fit(features_indep_df, feature_target)

    def rank_svm_c_support(self,
                           features_indep_df: PandasDataFrame,
                           feature_target: List,
                           **kwargs: Any) -> object:
        """Use a Scalable Linear Support Vector Machine for classification.
        In C-Support Vector Classification (SVC), the C parameter trades off misclassification of training
        examples against simplicity of the decision surface.
        Attributes:
            model.support_
            model.support_vectors_
            model.n_support_
            model.dual_coef_
            model.coef_
            model.intercept_
        :param features_indep_df: the independent features, which are inputted into the model.
        :param feature_target: the target feature, which is being estimated.
        :param kwargs: C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, shrinking=True, probability=False,
            tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1,
            decision_function_shape=None, random_state=None
        :return: the importance ranking model.
        """
        self.__logger.debug("Run C-Support Vector Classification.")
        classifier = svm.SVC(**kwargs)
        return classifier.fit(features_indep_df, feature_target)

    def rank_tree_brieman(self,
                          features_indep_df: PandasDataFrame,
                          feature_target: List,
                          **kwargs: Any) -> object:
        """Use the Breiman decision tree classifier to rank features.
        Attributes:
            model.classes_
            model.feature_importances_
            model.max_features_
            model.n_classes_
            model.n_features_
            model.n_outputs_
            model.tree_
        :param features_indep_df: the independent features, which are inputted into the model.
        :param feature_target: the target feature, which is being estimated.
        :param kwargs: criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1,
            min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, class_weight=None, presort=False
        :return: the importance ranking model.
        """
        self.__logger.debug("Run Decision Tree Classifier (Breiman).")
        classifier = tree.DecisionTreeClassifier(**kwargs)
        return classifier.fit(features_indep_df, feature_target)

    def rank_tree_gbrt(self,
                       features_indep_df: PandasDataFrame,
                       feature_target: List,
                       **kwargs: Any) -> object:
        """Use Gradient Boosted Regression Trees (GBRT) to rank features.
        Attributes:
            model.feature_importances_
            model.train_score_
            model.loss_
            model.init
            model.estimators_
        :param features_indep_df: the independent features, which are inputted into the model.
        :param feature_target: the target feature, which is being estimated.
        :param kwargs: loss='ls', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse',
            min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3,
            min_impurity_split=1e-07, init=None, random_state=None, max_features=None, alpha=0.9, verbose=0,
            max_leaf_nodes=None, warm_start=False, presort='auto'
        :return: the importance ranking model.
        """
        self.__logger.debug("Run Gradient Boosted Regression Trees (GBRT).")
        classifier = ensemble.GradientBoostingRegressor(**kwargs)
        return classifier.fit(features_indep_df, feature_target)

    def selector_logistic_rfe(self,
                              features_indep_df: PandasDataFrame,
                              feature_target: List,
                              kernel: str = "linear",
                              n_jobs: int = -1,
                              **kwargs: Any) -> object:
        """Select top features using recursive feature elimination with cross-validated selection of the best
        number of features, to rank features.
        Attributes:
            model.n_features_
            model.support_
            model.ranking_
            model.grid_scores_
            model.estimator_
        :param features_indep_df: the independent features, which are inputted into the model.
        :param feature_target: the target feature, which is being estimated.
        :param kernel: the kernel type to be used in the algorithm. It must be one of 'linear', 'poly', 'rbf',
            'sigmoid', 'precomputed' or a callable. If none is given, 'linear' will be used.
        :param n_jobs: number of CPUs to use during the resampling. If '-1', use all the CPUs.
        :param kwargs: step=1, cv=None, scoring=None, verbose=0
        :return: the feature selection model.
        """
        self.__logger.debug("Run Feature Ranking with Recursive Feature Elimination.")
        estimator = SVR(kernel=kernel)
        selector = feature_selection.RFECV(estimator=estimator, n_jobs=n_jobs, **kwargs)
        return selector.fit(features_indep_df, feature_target)

    def selector_univarite_selection_kbest_chi2(self,
                                                features_indep_df: PandasDataFrame,
                                                feature_target: List,
                                                kbest: int) -> object:
        """Select features according to the k highest scores, using 'chi2':
        Chi-squared stats of non-negative features for classification tasks.
        Attributes:
            model.scores_
            model.pvalues_
        :param features_indep_df: the independent features, which are inputted into the model.
        :param feature_target: the target feature, which is being estimated.
        :param kbest: number of top features to select. The "all" option bypasses selection, for use in a
            parameter search.
        :return: the feature selection model.
        """
        self.__logger.debug("Select features according to the k highest scores, using 'chi2'.")
        return self.__selector_univarite_selection_kbest(
            features_indep_df, feature_target, feature_selection.chi2, kbest)

    def selector_univarite_selection_kbest_f_classif(self,
                                                     features_indep_df: PandasDataFrame,
                                                     feature_target: List,
                                                     kbest: int) -> object:
        """Select features according to the k highest scores, using 'f_classif':
        ANOVA F-value between label/feature for classification tasks.
        Attributes:
            model.scores_
            model.pvalues_
        :param features_indep_df: the independent features, which are inputted into the model.
        :param feature_target: the target feature, which is being estimated.
        :param kbest: number of top features to select. The "all" option bypasses selection, for use in a
            parameter search.
        :return: the feature selection model.
        """
        self.__logger.debug("Select features according to the k highest scores, using 'f_classif'.")
        return self.__selector_univarite_selection_kbest(
            features_indep_df, feature_target, feature_selection.f_classif, kbest)

    def __selector_univarite_selection_kbest(self,
                                             features_indep_df: PandasDataFrame,
                                             feature_target: List,
                                             score_func: Any,
                                             kbest: int) -> object:
        """Select features according to the k highest scores.
        Attributes:
            model.scores_
            model.pvalues_
        :param features_indep_df: the independent features, which are inputted into the model.
        :param feature_target: the target feature, which is being estimated.
        :param score_func: function taking two arrays X and y, and returning a pair of arrays (scores, pvalues)
            or a single array with scores.
        :param kbest: number of top features to select. The "all" option bypasses selection, for use in a
            parameter search.
        :return: the feature selection model.
        """
        self.__logger.debug("Run Univariate Feature Selection with Configurable Strategy.")
        # Cap 'kbest' at the number of available feature columns (assumed safeguard).
        kbest = min(int(float(kbest)), features_indep_df.shape[1])
        selector = feature_selection.SelectKBest(score_func=score_func, k=kbest)
        return selector.fit(features_indep_df, feature_target)
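

# ---------------------------------------------------------------------------
# Minimal usage sketch (illustrative only). It assumes that the project's
# CONSTANTS configuration has been initialised so that CONSTANTS.app_name
# resolves to a logger name, and that the installed scikit-learn version
# still provides RandomizedLogisticRegression, as the imports above require.
# The column names, target, and estimator parameters below are hypothetical
# and only demonstrate the call pattern.
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    import numpy as np
    import pandas as pd

    logging.basicConfig(level=logging.DEBUG)

    # A small synthetic, non-negative dataset with a binary target.
    rng = np.random.RandomState(0)
    features = pd.DataFrame(rng.rand(100, 5),
                            columns=["f1", "f2", "f3", "f4", "f5"])
    target = rng.randint(0, 2, size=100).tolist()

    feature_selection_runner = FeatureSelection()

    # Rank features with the random-forest ranker; the fitted model exposes
    # one importance value per column via feature_importances_.
    forest_model = feature_selection_runner.rank_random_forest_breiman(
        features, target, n_jobs=1, n_estimators=50, random_state=0)
    print(dict(zip(features.columns, forest_model.feature_importances_)))

    # Rank features with Gradient Boosted Regression Trees (GBRT).
    gbrt_model = feature_selection_runner.rank_tree_gbrt(
        features, target, n_estimators=50, random_state=0)
    print(dict(zip(features.columns, gbrt_model.feature_importances_)))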