[973ab6]: / Stats / __pycache__ / PreProcess.cpython-35.pyc

248 lines (247 with data), 25.4 kB



97÷YäéŃ@sbdZddlmZmZmZmZddlmZddlm	Z	ddl
mZddlm
Z
ddlmZddlZddlZddlZddlZddlZdd	lmZdd
lmZddlmZddlmZddl Z ed
âZ!edâZ"dZ#dZ$dgZ%dZ&dZ'dZ(dZ)dZ*GddädâZ+dS)zĽIt is an interface for the developed pre-processing functions (factoring and near-zero-variance,
high-linear-correlation) and statistical summaries.
Ú)┌Dict┌List┌TypeVar┌Any)┌	CONSTANTS)┌PyConfigParser)┌ReadersWriters)┌FactoringThread)┌TransformThreadN)┌OrderedDict)┌feature_selection)┌stats)┌partial┌	DataFramerzMohsen Mesgarpourz-Copyright 2016, https://github.com/mesgarpour┌GPLz1.1zmohsen.mesgarpour@gmail.com┌Releasec
@sŤeZdZedddÉäZeeeedddÉäZeeeedddÉäZd	ee	e	e	e
ed
ddÉäZee	e	ed
ddÉäZee	e	ed
ddÉäZ
d	ee	e	e	e
ed
ddÉäZee	e	ed
ddÉäZee	e	ed
ddÉäZd	deeee
e	eee	gdddÉäZdeeee	eee	gdddÉäZdeeee	eee	gddd ÉäZd!d"eeeee
eegd#d$d%ÉäZd&d"eeeee
eegd'd(d)ÉäZd*d+d"eeeeee
eegd,d-d.ÉäZeeeeee
e	gd/d0d1ÉäZd2ee	e
eeed3d4d5ÉäZdS)6┌
PreProcess)┌output_pathcCs>tjtjâ|_|jjtâ||_tâ|_	dS)zJInitialise the objects and constants.
        :param output_path:
        N)
┌logging┌	getLoggerr┌app_name┌_PreProcess__logger┌debug┌__name__┌_PreProcess__output_pathr┌_PreProcess__readers_writers)┌selfręr˙HC:\Users\eagle\Documents\GitHub\Analytics_UoW\TCARER\Stats\PreProcess.py┌__init__5s	zPreProcess.__init__)┌df┌includes┌	file_name┌returncCs|jjdâd}|jjd|jd|dgddâx┘|D]Đ}||krE|jjd|jd|dd|gdd	âtj||â}tjd
|ddůdfd|ddůd
fiâ}|j	dddâ}|jjd|jd|d|dd	dd	âqEW|S)a/Calculate the odds ratio for all the features that are included and all the categorical states.
        :param df: the features dataframe.
        :param includes: the name of included features.
        :param file_name: the name of the summary output file.
        :return: the summary output.
        z)Produce statistics for discrete features.N┌path┌title┌data┌appendFzFeature NameT┌valuer┌freqÚ┌	ascending┌header)
rrr┌save_csvrr
┌itemfreq┌pdr┌sort_values)rr r!r"┌	summaries┌f_namerrr┌stats_discrete_df?s*

;zPreProcess.stats_discrete_dfcCs
|jjdâd}|jjd|jd|dgddâx┼|D]Ż}||krE|jjd|jd|dd|gdd	â||jtjâjd
ddd
ddgâj	â}tj
j|âj	â}|jjd|jd|d|dd	dd	âqEW|S)aCalculate the descriptive statistics for all the included continuous features.
        :param df: the features dataframe.
        :param includes: the name of included features.
        :param file_name: the name of the summary output file.
        :return: the summary output.
        z+Produce statistics for continuous features.Nr$r%r&r'FzFeature NameT┌percentilesgÜÖÖÖÖÖę?gđ?gÓ?gŔ?gffffffţ?r,)rrrr-r┌applyr/┌
to_numeric┌describe┌	transpose┌Series┌to_frame)rr r!r"r1r2rrr┌stats_continuous_df`s*

zPreProcess.stats_continuous_dfF)r ┌categories_dic┌
labels_dic┌
dtypes_dic┌threadedr#cCs|jjdât|â}|dk	r@|j|||â}n|j|||â}g}x.|jâD] }|t||jââ7}qhW|jjâ|ľ}	t	j
|	ddgâj}	x|D]
}
|	|
=q├W|j|ddâ}t	j|g|ddâ}|j
|	â}|S)aCategorise groups of features that are selected.
        :param df: the features dataframe.
        :param categories_dic: the dictionary of the categorical states for the included features.
        :param labels_dic: the dictionary of the features names of the categorised features.
        :param dtypes_dic: the dictionary of the dtypes of the categorised features.
        :param threaded: indicates if it is multi-threaded.
        :return: the inputted dataframe with categorised features (if applicable).
        zCategorise groups of features.T┌indexr┌axisr*)rrr┌(_PreProcess__factoring_group_wise_series┌*_PreProcess__factoring_group_wise_threaded┌keys┌list┌dtypes┌to_dictr/r┌drop┌concat┌astype)rr r<r=r>r?┌pool_df_encoded┌labels_encoded┌label_group┌
dtype_orig┌labelrrr┌factoring_group_wiseüs 
zPreProcess.factoring_group_wise)r r<r=r#cCsş|jjdât|||â}g}y1x*|jâD]}|j|j|ââq8WWnMtk
rĘ}z-|jjtdt	|âât
jâWYdd}~XnX|S)abCategorise a group of features that are selected (single-threaded).
        :param df: the features dataframe.
        :param categories_dic: the dictionary of the categorical states for the included features.
        :param labels_dic: the dictionary of the features names of the categorised features.
        :return: the categorised features.
        z0Categorise groups of features (single-threaded).z - Invalid configuration(s): N)rrr	rDr'┌factor_arr_multiple┌
ValueError┌errorr┌str┌sys┌exit)rr r<r=┌factoring_threadrKrM┌	exceptionrrrZ__factoring_group_wise_seriesĘs
z(PreProcess.__factoring_group_wise_seriescCs┴|jjdât|||â}yKtjdtjâdâĆ(}|jt|jâ|j	ââ}WdQRXWnMt
k
r╝}z-|jjtdt
|ââtjâWYdd}~XnX|S)aaCategorise a group of features that are selected (multi-threaded).
        :param df: the features dataframe.
        :param categories_dic: the dictionary of the categorical states for the included features.
        :param labels_dic: the dictionary of the features names of the categorised features.
        :return: the categorised features.
        z/Categorise groups of features (multi-threaded).┌	processesr*Nz - Invalid configuration(s): )rrr	┌mp┌Pool┌	cpu_count┌maprrQrDrRrSrrTrUrV)rr r<r=rW┌poolrKrXrrrZ__factoring_group_wise_threadedżs
&z*PreProcess.__factoring_group_wise_threadedc
Csˇ|jjdât|â}|dk	r@|j|||â}n|j|||â}t|jââ}|jjâ|ľ}t	j
|ddgâj}x|D]
}	||	=q×W|j|ddâ}t	j|g|ddâ}|j
|â}|S)a■Categorise features that are selected.
        :param df: the features dataframe.
        :param categories_dic: the dictionary of the categorical states for the included features.
        :param labels_dic: the dictionary of the features names of the categorised features.
        :param dtypes_dic: the dictionary of the dtypes of the categorised features.
        :param threaded: indicates if it is multi-threaded.
        :return: the inputted dataframe with categorised features (if applicable).
        zCategorise.Tr@rrAr*)rrr┌*_PreProcess__factoring_feature_wise_series┌,_PreProcess__factoring_feature_wise_threadedrErDrFrGr/rrHrIrJ)
rr r<r=r>r?rKrLrNrOrrr┌factoring_feature_wiseËs
z!PreProcess.factoring_feature_wisecCsş|jjdât|||â}g}y1x*|jâD]}|j|j|ââq8WWnMtk
rĘ}z-|jjtdt	|âât
jâWYdd}~XnX|S)aWCategorise features that are selected (single-threaded).
        :param df: the features dataframe.
        :param categories_dic: the dictionary of the categorical states for the included features.
        :param labels_dic: the dictionary of the features names of the categorised features.
        :return: the categorised features.
        zCategorise (single-threaded).z - Invalid configuration(s): N)rrr	rDr'┌
factor_arrrRrSrrTrUrV)rr r<r=rWrKrMrXrrrZ__factoring_feature_wise_series°s
z*PreProcess.__factoring_feature_wise_seriescCs▒|jjdât|||â}y;tjâĆ(}|jt|jâ|jââ}WdQRXWnMt	k
rČ}z-|jj
tdt|âât
jâWYdd}~XnX|S)aVCategorise features that are selected (multi-threaded).
        :param df: the features dataframe.
        :param categories_dic: the dictionary of the categorical states for the included features.
        :param labels_dic: the dictionary of the features names of the categorised features.
        :return: the categorised features.
        zCategorise (multi-threaded).Nz - Invalid configuration(s): )rrr	rZr[r]rrbrDrRrSrrTrUrV)rr r<r=rWr^rKrXrrrZ!__factoring_feature_wise_threadeds
&z,PreProcess.__factoring_feature_wise_threadedN)r ┌excludes┌transform_typer?┌method_args┌kwargsr#c
s|jjdâtłâëçfddć|jjDâ}|dkrPtân|}|jjâ}x|D]}	d||	<qlWtj	|ddgâj}|j
|â}|dkrÎ|j||||Ź\}}n!|j|||||Ź\}}||fS)	a¤Transform the included features, using the selected and configured method.
        :param df: the features dataframe.
        :param excludes: the name of excluded features.
        :param transform_type: the transformation type (options: 'scale', 'robust_scale', 'max_abs_scalar',
        'normalizer', 'kernel_centerer', 'yeo_johnson', 'box_cox')
        :param threaded: indicates if it is multi-threaded.
        :param method_args: the transformation arguments, which need to be preserved if the transformation is applied
        to more than one data set.
        :param kwargs: the input arguments for the selected transformation function.
        :return: the inputted dataframe with transformed features (if applicable).
        zTransform Features.cs"g|]}|łkr|ĹqSrr)┌.0rO)rcrr˙
<listcomp>7s	z+PreProcess.transform_df.<locals>.<listcomp>N┌f8r@rF)
r┌info┌set┌columns┌values┌dictrFrGr/rrJ┌ _PreProcess__transform_df_series┌"_PreProcess__transform_df_threaded)
rr rcrdr?rerfr!rNrOr)rcr┌transform_df#s
!!zPreProcess.transform_df)r r!rdrerfr#c	Ks|jjdât|Ź}|dkr1tân|}yu|dkrmxb|D]}|j|||âqMWn>|dkráx/|D]}|j|||âqÇWn|dkrËxŘ|D]}|j|||âq│Wně|dkrx╔|D]}|j|||âqŠWną|dkr9xľ|D]}|j|||âqWnr|dkrlxc|D]}|j	|||âqLWn?|d	krčx0|D]}|j
|||âqWnt|âéWnMtk
rű}z-|jj
td
t|ââtjâWYdd}~XnX||fS)a|Transform the included features, using the selected and configured method (single-threaded).
        :param df: the features dataframe.
        :param includes: the name of included features.
        :param transform_type: the transformation type (options: 'scale', 'robust_scale', 'max_abs_scalar',
        'normalizer', 'kernel_centerer', 'yeo_johnson', 'box_cox')
        :param method_args: the transformation arguments, which need to be preserved if the transformation is applied
        to more than one data set.
        :param kwargs: the input arguments for the selected transformation function.
        :return: the transformed feature.
        z%Transform features (single-threaded).N┌scale┌robust_scale┌max_abs_scalar┌
normalizer┌kernel_centerer┌yeo_johnson┌box_coxz - Invalid configuration(s): )rrr
rn┌transform_scale_arr┌transform_robust_scale_arr┌transform_max_abs_scalar_arr┌transform_normalizer_arr┌transform_kernel_centerer_arr┌transform_yeo_johnson_arr┌transform_box_cox_arr┌	ExceptionrRrSrrTrUrV)	rr r!rdrerf┌transform_thread┌namerXrrrZ__transform_df_seriesHs<






z PreProcess.__transform_df_seriesc
Ksq|jjdâtjâ}|jtt||j||jj	j
ââââ}t|Ź}|dkrrtân|}yxtjdtj
âdâĆU}	|dkr╚|	jt|j||â|ân |dkr÷|	jt|j||â|ân˛|dkr$|	jt|j||â|ân─|dkrR|	jt|j||â|ânľ|d	krÇ|	jt|j||â|ânh|d
kr«|	jt|j||â|ân:|dkr▄|	jt|j||â|ânt|âéWdQRXWnMtk
r?}
z-|jjtdt|
ââtjâWYdd}
~
XnXx$|jâD]\}}|||<qMW||fS)
aVTransform the included features, using the selected and configured method (multi-threaded).
        :param df: the features dataframe.
        :param includes: the name of included features.
        :param transform_type: the transformation type (options: 'scale', 'robust_scale', 'max_abs_scalar',
        'normalizer', 'kernel_centerer', 'yeo_johnson', 'box_cox')
        :param method_args: the transformation arguments, which need to be preserved if the transformation is applied
        to more than one data set.
        :param kwargs: the input arguments for the selected transformation function.
        :return: the transformed feature.
        z$Transform features (multi-threaded).NrYr*rrrsrtrurvrwrxz - Invalid configuration(s): )rrrZ┌ManagerrnrE┌ziprl┌Trm┌tolistr
r[r\r]rryrzr{r|r}r~rrÇrRrSrrTrUrV┌items)
rr r!rdrerf┌manager┌dtrür^rX┌k┌vrrrZ__transform_df_threadedzs85"""""""z"PreProcess.__transform_df_threadedgffffffţ?T)r rcr"┌thresh_corr_cut┌	to_searchr#c
s|jjdâd}|ł}tłâëg}tâ}	|dkrk|çfddć|jDâjddâ}x§|jjD]š}
t|t||
â|kj	â}t
|âdkrÇy|j|
âWnto┘t
k
rŠYnXtj||â}x-|D]%}||	jâkr|j|âqWt
|âd	krÇ||	|
<|jjd
|
dt|ââqÇW|j||	|tjj|j|dââ}xłD]}
||
||
<qáWttj|j	âârÔ|jd
dâ}|dkr||	d<||	d<||	fS)aśFind and optionally remove the selected highly linearly correlated features.
        The Pearson correlation coefficient is calculated for every pair of variables to measure the linear
        dependence between them.
        :param df: the features dataframe.
        :param excludes: the name of excluded features.
        :param file_name: the name of the summary output file.
        :param thresh_corr_cut: the numeric value for the pair-wise absolute correlation cutoff. e.g. 0.95.
        :param to_search: to search or use the saved configuration.
        :return: the inputted dataframe with exclusion of features that were selected to be removed.
        z=Remove features with high linear correlation (if applicable).NTcs"g|]}|łkr|ĹqSrr)rg┌col)rcrrrhĂs	z9PreProcess.high_linear_correlation_df.<locals>.<listcomp>┌method┌pearsonr*rzHigh Linear Correlation: z ~ z.inirHzFeatures MatcheszCorrelation Matrix)rrrkrrl┌corrrmrE┌absr@┌len┌removerR┌AttributeError┌np┌union1drDrjrT┌_PreProcess__remove┌osr$┌joinr┌any┌isnan┌reset_index)rr rcr"rîrŹrĹ┌df_excludes┌matchesr1rOZmatches_temp┌matchrér)rcr┌high_linear_correlation_dfşs>
	,#

&.


z%PreProcess.high_linear_correlation_dfgÜÖÖÖÖÖę?)r rcr"┌thresh_variancerŹr#csü|jjdâ||}t|â}g}tâëtâ}x-|jjâD]}	|	ł|jj|	â<qNW|dkrŮtj|â}
|
j	ddâ}çfddć|Dâ}x'|D]}
|
|kr╗||
g7}q╗W|j
|dt|âi|tj
j|j|dââ}x|D]}||||<qWttj|jââra|jddâ}|dkrw||d	<||fS)
aëFind and optionally remove the selected near-zero-variance features (Scikit algorithm).
        Feature selector that removes all low-variance features.
        This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be
        used for unsupervised learning.
        :param df: the features dataframe.
        :param excludes: the name of excluded features.
        :param file_name: the name of the summary output file.
        :param thresh_variance: Features with a training-set variance lower than this threshold will be removed.
        The default is to keep all features with non-zero variance, i.e. remove the features that have the same
        value in all samples.
        :param to_search: to search or use the saved configuration.
        :return: the inputted dataframe with exclusion of features that were selected to be removed.
        zPRemove features with near-zero-variance (if applicable), using Scikit algorithm.T┌indicescsg|]}ł|ĹqSrr)rgr@)rúrrrhs	z7PreProcess.near_zero_var_df_sklearn.<locals>.<listcomp>┌NZVz.inirHzFeatures Matches)rrrkrrlrm┌get_locrZVarianceThresholdZget_supportrśrErÖr$rÜrrŤrľrťr@rŁ)rr rcr"rórŹr×rčr1rOZ
variances_Zmatches_indicesZmatches_labelsrárér)rúr┌near_zero_var_df_sklearnŠs.
		
:

z#PreProcess.near_zero_var_df_sklearnÚdiŔ)r rcr"┌thresh_unique_cut┌thresh_freq_cutrŹr#c
Cs¬|jjdâ||}t|â}g}tâ}	|dkrx├|jjD]Á}
t||
jdtt	jt
t	j
fâsŁt	jt	j||
âârş||
g7}qN|j
||
|
|||â\}|	|
<|dkrN||
g7}|jjd|
âqNW|j|dt|âi|tjj|j|dââ}x|D]}||||<qHWtt	j|jâârŐ|jddâ}|dkrá||	d<||	fS)	a╬Find and optionally remove the selected near-zero-variance features (custom algorithm).
        Features whose constant counts are less than or equal to a threshold may be filtered out,
        to exclude highly constant and near-zero-variance features.
        The rules are as follows:
         - Frequency ratio: The frequency of the most prevalent value over the second most frequent value to be
           greater than a threshold;
         - Percent of unique values: The number of unique values divided by the total number of samples to be greater
           than the threshold.
        :param df: the features dataframe.
        :param excludes: the name of excluded features.
        :param file_name: the name of the summary output file.
        :param thresh_unique_cut: the cutoff for the percentage of distinct values out of the number of total samples
        (upper limit). e.g. 10 * 100 / 100.
        :param thresh_freq_cut: the cutoff for the ratio of the most common value to the second most common value
        (lower limit). e.g. 95/5.
        :param to_search: to search or use the saved configuration.
        :return: the inputted dataframe with exclusion of features that were selected to be removed.
        zPRemove features with near-zero-variance (if applicable), using custom algorithm.TrzNear Zero Variance: rĄz.inirHzFeatures Matches)rrrkrrlrm┌
isinstance┌iloc┌intrľ┌floatrť┌sum┌_PreProcess__near_zero_varrjrśrErÖr$rÜrrŤr@rŁ)
rr rcr"rĘręrŹr×rčr1rOrárérrr┌near_zero_var_dfs0
	-
#
:

zPreProcess.near_zero_var_df)┌arrrOrcrĘręr#cCs7|jjdâtj|ddâ\}}t|âdkr_ddt|âdt|âifSt|ddâ}||kr┼t|âdtt|ââ|kr┼ddt|âdt|âifS||kr|d	t|dâ|krddt|âdt|âifSd
dt|âdt|âifSdS)a8Assess a single feature for near-zero-variance (custom algorithm).
        Features whose constant counts are less than or equal to a threshold may be filtered out,
        to exclude highly constant and near-zero-variance features.
        The rules are as follows:
         - Frequency ratio: The frequency of the most prevalent value over the second most frequent value to be
           greater than a threshold;
         - Percent of unique values: The number of unique values divided by the total number of samples to be greater
           than the threshold.

        :param arr: the feature value.
        :param label: the feature name.
        :param excludes: the name of excluded features.
        :param thresh_unique_cut: the cutoff for the percentage of distinct values out of the number of total samples
        (upper limit). e.g. 10 * 100 / 100.
        :param thresh_freq_cut: the cutoff for the ratio of the most common value to the second most common value
        (lower limit). e.g. 95/5.
        :return: indicates if the feature has near-zero-variance.
        z@Find near-zero-variance (if applicable), using custom algorithm.┌
return_countsTr*┌unique┌counts┌reverseržrFN)rrrľr│rôrE┌sortedrş)rr▒rOrcrĘręr│r┤rrrZ__near_zero_varTs"2"*"zPreProcess.__near_zero_var┌features)r ┌dict_matchesrŹr$┌sectionr#c	s°|jjdât|tjâ}|dkrx|jâ|j||â|jjd|â}|dkrx|jâłS|j	â|j
|â}|jjddj|ââçfddć|jâDâ}t
|âd	kr˘łj|d
dâëłS)aŽConfirm removals and if confirmed, then re-read the selected features, then remove
        :param df: the features dataframe.
        :param dict_matches: the matched features.
        :param to_search: to search or use the saved configuration.
        :param path: the file path to the configuration file.
        :param section: the section name in the configuration file.
        :return: the updated features.
        z/Confirm removals and implement removal process.Tz:the features defined in the following file to be removed: FzThe feature removal list: ˙,cs/g|]%}|D]}|łkr|ĹqqSrr)rgrMrO)r rrrhŤs	z'PreProcess.__remove.<locals>.<listcomp>rrAr*)rrrrr┌reset┌
write_dictr┌question_overwrite┌refresh┌	read_dictrÜrmrôrH)	rr rŞrŹr$r╣┌config┌response┌labelsr)r rZ__removeys"
	


zPreProcess.__remove)r┌
__module__┌__qualname__rTr┌PandasDataFramerr3r;r┌boolrPrBrCrar_r`rrqrorprş┌CollectionsOrderedDictrírŽr░rkr»rśrrrrr4sR
#'#'!#12-8-206%r),┌__doc__┌typingrrrr┌Configs.CONSTANTSr┌ReadersWriters.PyConfigParserr┌ReadersWriters.ReadersWritersrZStats.FactoringThreadr	ZStats.TransformThreadr
rÖrU┌numpyrľ┌pandasr/┌multiprocessingrZ┌collectionsrZsklearnr┌scipy.statsr
┌	functoolsrrr┼ră┌
__author__┌
__copyright__┌__credits__┌__license__┌__version__┌__maintainer__┌	__email__┌
__status__rrrrr┌<module>s6"
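
The two near-zero-variance rules named in the docstrings above (frequency ratio and percent of unique values) can be illustrated with a minimal sketch. This is not the module's own implementation, which cannot be recovered reliably from the compiled bytecode. The function name `near_zero_variance`, its default cutoffs, and the direction of the comparisons are assumptions: the sketch follows the common convention of flagging a feature when the frequency ratio exceeds its cutoff and the percentage of unique values is at or below its cutoff.

```python
import numpy as np


def near_zero_variance(values, thresh_unique_cut=10.0, thresh_freq_cut=19.0):
    """Hypothetical helper: flag a single feature as near-zero-variance.

    Assumed convention (not confirmed by the source):
      - frequency ratio  = count of most common value / count of second most common
      - percent unique   = 100 * number of distinct values / number of samples
    A feature is flagged when the frequency ratio is above `thresh_freq_cut`
    and the percent of unique values is at or below `thresh_unique_cut`.
    """
    values = np.asarray(values)
    unique, counts = np.unique(values, return_counts=True)

    # A constant feature has zero variance and is always flagged.
    if len(unique) == 1:
        return True

    # Sort the counts in descending order so the two most frequent come first.
    counts = np.sort(counts)[::-1]
    freq_ratio = counts[0] / counts[1]
    percent_unique = 100.0 * len(unique) / len(values)

    return bool(freq_ratio > thresh_freq_cut and percent_unique <= thresh_unique_cut)


if __name__ == "__main__":
    print(near_zero_variance([0] * 99 + [1]))       # heavily dominated by one value
    print(near_zero_variance(list(range(100))))     # every value distinct
```

Under these assumed cutoffs, a column that is almost entirely one value is flagged, while a column with many distinct values or a balanced split is kept.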