Modeling
The modeling code is largely static and suitable for most use cases.
The primary way to use this code is through a YAML configuration file to run model experiments.
You can optionally create additional model types in model.py, provided they output a list of features for further reporting.
See ndj_pipeline.model.run_model_training for the list of steps undertaken in processing.
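A minimal sketch of what such an experiment config might look like; every key name below is an illustrative assumption rather than the pipeline's actual schema:

```yaml
# Hypothetical experiment config; key names are illustrative only
data: data/processed.parquet
index: [policy_id]
target: loss
simple_features:
  age: mean
  income: median
dummy_features: [region]
split:
  test_size: 0.3
  random_state: 0
model: gbr
```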
ndj_pipeline.model
Contains custom ML model functions and pipeline for running modeling.
- ndj_pipeline.model.baseline(train, test, features, config)
Compares Actual results to a naive baseline.
This will compare “Actual” results to a pre-calculated baseline column, assumed to be prepared earlier in transform.py.
It is always worthwhile to compare ML results to simple models such as an average, or group-by average. This frames any model results as meaningful improvements over a simple rule.
Creates the results DataFrame based on test data, with “Actual” and “Predicted” fields. Calls metric creation.
- Parameters
  - train (DataFrame) – training dataframe containing the config specified target and the numeric features listed in features
  - test (DataFrame) – test dataframe containing the config specified target and the numeric features listed in features
  - features (List[str]) – list of columns to use in model training
  - config (Dict[str, Any]) – loaded model experiment config, for model parameters
- Return type
  List[str]
- Returns
  List of strings indicating important features to use for further reporting
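As a concrete illustration of the comparison, a minimal sketch on synthetic data, assuming a hypothetical pre-calculated baseline column prepared earlier in transform.py:

```python
import pandas as pd

# Illustrative only: the baseline "model" simply reads a pre-calculated
# baseline column (e.g. a group-by average) and compares it to actuals.
test = pd.DataFrame({
    "target": [2.0, 11.0],
    "baseline": [2.0, 12.0],  # hypothetical pre-calculated column
})

results = pd.DataFrame({"Actual": test["target"], "Predicted": test["baseline"]})
mae = (results["Actual"] - results["Predicted"]).abs().mean()
```

Metrics computed on this results frame (such as the mean absolute error above) then serve as the floor that any ML model must beat.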
- ndj_pipeline.model.gbr(train, test, features, config)
Train a Gradient Boosted Regression.
Trains using specified train dataframe and list of simple and dummy features.
Creates the results DataFrame based on test data, with “Actual” and “Predicted” fields. Calls metric creation.
- Parameters
  - train (DataFrame) – training dataframe containing the config specified target and the numeric features listed in features
  - test (DataFrame) – test dataframe containing the config specified target and the numeric features listed in features
  - features (List[str]) – list of columns to use in model training
  - config (Dict[str, Any]) – loaded model experiment config, for model parameters
- Return type
  List[str]
- Returns
  List of strings indicating important features to use for further reporting
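A minimal sketch of this pattern using scikit-learn's GradientBoostingRegressor on synthetic data; the model class, its parameters, and the importance threshold here are assumptions, not necessarily what ndj_pipeline uses:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

features = ["x1", "x2"]
train = pd.DataFrame({"x1": range(20), "x2": [0] * 20})
train["target"] = train["x1"] * 2.0
test = pd.DataFrame({"x1": [5, 10], "x2": [0, 0], "target": [10.0, 20.0]})

# Train on config-specified target and features
model = GradientBoostingRegressor(n_estimators=50, random_state=0)
model.fit(train[features], train["target"])

# Results frame with "Actual" and "Predicted", as described above
results = pd.DataFrame({
    "Actual": test["target"],
    "Predicted": model.predict(test[features]),
})

# Keep features with non-trivial importance for further reporting
important = [f for f, imp in zip(features, model.feature_importances_) if imp > 0.01]
```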
- ndj_pipeline.model.main()
Main command line entry to model training.
Can be run from the command line using: python -m ndj_pipeline.model -p {path_to_experiment.yaml}
- Return type
None
- ndj_pipeline.model.ols(train, test, features, config)
Train an Ordinary Least Squares Regression.
Trains using specified train dataframe and list of simple and dummy features.
Creates the results DataFrame based on test data, with “Actual” and “Predicted” fields. Calls metric creation.
- Parameters
  - train (DataFrame) – training dataframe containing the config specified target and the numeric features listed in features
  - test (DataFrame) – test dataframe containing the config specified target and the numeric features listed in features
  - features (List[str]) – list of columns to use in model training
  - config (Dict[str, Any]) – loaded model experiment config, for model parameters
- Return type
  List[str]
- Returns
  List of strings indicating important features to use for further reporting
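The same train/predict shape, sketched with scikit-learn's LinearRegression (which fits an OLS model); the pipeline's actual implementation may use a different OLS library, and the data here is synthetic:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

features = ["x1"]
train = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0]})
train["target"] = 3.0 * train["x1"] + 1.0  # exactly linear, so OLS recovers it
test = pd.DataFrame({"x1": [5.0, 6.0], "target": [16.0, 19.0]})

ols = LinearRegression().fit(train[features], train["target"])
results = pd.DataFrame({
    "Actual": test["target"],
    "Predicted": ols.predict(test[features]),
})
```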
- ndj_pipeline.model.run_model_training(model_config)
Run all modeling transformations.
Includes the following steps:
Creates model and config folder
Loads data and sets index
Creates config specified dummy features
Filters rows according to config
Splits data into train/test
Filters target variable in train data
Prepares missing data replacement
Optionally saves data
Trains model according to model specifications
Produces metrics and plots
- Parameters
  - model_config (Dict[str, Any]) – Loaded model experiment config
- Return type
  None
ndj_pipeline.prep
Set of operations to transform pandas DataFrames given user specified config.
These operations are ordered according to ndj_pipeline.model.run_model_training.
- ndj_pipeline.prep.apply_feature_aggregates(df, aggregates)
Applies feature aggregates to a DataFrame’s missing values.
Performs validations and raises errors as part of process.
- Parameters
  - df (DataFrame) – Pandas dataframe; must include the same columns as aggregates.
  - aggregates (Series) – Pandas series with labels and aggregate values.
- Return type
  DataFrame
- Returns
  Pandas DataFrame with missing data replaced.
- Raises
  ValueError – If unable to apply an aggregation value to a column, commonly because the datatypes are incompatible, e.g. float into Int64.
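The replacement step can be sketched with pandas fillna, which accepts a Series of per-column values; the data below is synthetic, and the real function adds dtype validation on top:

```python
import pandas as pd

# Illustrative: replace missing values using precomputed (train-set) aggregates
df = pd.DataFrame({"age": [20.0, None, 40.0], "income": [None, 50.0, 70.0]})
aggregates = pd.Series({"age": 30.0, "income": 60.0})  # e.g. train-set means

# fillna with a Series fills each column using the value at its label
filled = df.fillna(aggregates)
```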
- ndj_pipeline.prep.apply_filtering(df, model_config)
Filters dataframe according to config specified labels.
Any row containing a specified label is filtered from the data.
- Parameters
  - df (DataFrame) – Pandas dataframe; must contain a _filter column with string type.
  - model_config (Dict[str, Any]) – Loaded model experiment config, specifically for the list of filter labels.
- Return type
  DataFrame
- Returns
  Filtered Pandas dataframe.
- Raises
  ValueError – If the ‘_filter’ column is missing from the processed data.
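A sketch of the label-based row filtering on synthetic data; the assumption here is that _filter holds string labels and that substring matching is sufficient (the real matching logic may differ):

```python
import pandas as pd

# Illustrative: drop rows whose _filter column contains a configured label
df = pd.DataFrame({
    "_filter": ["", "outlier", "", "test_account"],
    "value": [1, 2, 3, 4],
})
filter_labels = ["outlier", "test_account"]  # hypothetical config entry

mask = df["_filter"].apply(lambda s: any(label in s for label in filter_labels))
filtered = df[~mask]
```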
- ndj_pipeline.prep.collate_features(model_config, dummy_features)
Saves and returns final list of simple and dummy features.
- Return type
  List[str]
- ndj_pipeline.prep.create_compressed_dummies(df, dummy, min_dummy)
Creates enhanced feature dummies for a single dataframe column.
Improves on standard pandas.get_dummies by combining low-incidence dummy columns into a single _other_combined column. Dummy columns are named according to {col_name}_##_{value}
- Parameters
  - df (DataFrame) – Pandas dataframe; must contain the specified dummy column.
  - dummy (str) – string label of the DataFrame column to create dummy features from
  - min_dummy (float) – minimum percentage incidence to create a standalone dummy feature; values below this are grouped into _other_combined.
- Return type
  Tuple[DataFrame, List[str]]
- Returns
  Dummified Pandas DataFrame for a single feature, plus the dummy column names as a list of strings.
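The combining behaviour can be sketched as follows on synthetic data; the thresholding and naming details are assumptions based on the description above:

```python
import pandas as pd

df = pd.DataFrame({"region": ["north"] * 5 + ["south"] * 4 + ["east"]})
min_dummy = 0.2  # values under 20% incidence get combined

# Collapse low-incidence values into a single _other_combined bucket
counts = df["region"].value_counts(normalize=True)
rare = counts[counts < min_dummy].index
collapsed = df["region"].where(~df["region"].isin(rare), "_other_combined")

# Dummy columns named per the {col_name}_##_{value} convention
dummies = pd.get_dummies(collapsed, prefix="region_##")
dummy_cols = list(dummies.columns)
```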
- ndj_pipeline.prep.create_dummy_features(df, model_config)
Create dummy features for each config specified dummy_variable.
Iterates through all specified dummy features and adds to DataFrame.
- Parameters
  - df (DataFrame) – Pandas dataframe; must contain the specified dummy columns.
  - model_config (Dict[str, Any]) – Loaded model experiment config, specifically for the list of dummy features.
- Return type
  Tuple[DataFrame, List[str]]
- Returns
  Pandas DataFrame with the original data plus all new dummy fields, plus the full list of created dummy column names.
- ndj_pipeline.prep.filter_target(df, model_config)
Filters Dataframe to ensure no missing data in target variable.
- Parameters
  - df (DataFrame) – Pandas dataframe; must contain the config specified target column.
  - model_config (Dict[str, Any]) – Loaded model experiment config, specifically for the target column name.
- Return type
  DataFrame
- Returns
  Filtered Pandas DataFrame.
- ndj_pipeline.prep.get_simple_feature_aggregates(df, model_config)
Generates config specified feature aggregates.
These are used to inform missing data replacement strategy. Ideally this is run on training data, and used to replace train and test data.
Performs validations and raises errors as part of process.
- Parameters
  - df (DataFrame) – Pandas dataframe. Must include the columns specified in the simple_features section of config, and these must be numeric columns with no infinite values.
  - model_config (Dict[str, Any]) – Loaded model experiment config, specifically for the simple_features dictionary of column names and aggregation strategy.
- Return type
  Series
- Returns
  Pandas Series with the specified feature columns as labels and the aggregation values.
- Raises
  ValueError – If any features contain infinite values that need fixing.
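A sketch of the aggregate computation and the infinite-value check on synthetic data; the config dictionary shown is a hypothetical example of the simple_features section:

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"age": [20.0, 30.0, np.nan], "income": [10.0, 20.0, 90.0]})
simple_features = {"age": "mean", "income": "median"}  # hypothetical config section

# Validate: infinite values would poison the aggregates
if np.isinf(train[list(simple_features)]).any().any():
    raise ValueError("Features contain infinite values that need fixing")

# One aggregate per column, using the config-specified strategy
aggregates = pd.Series({col: train[col].agg(func) for col, func in simple_features.items()})
```

Computed on training data only, this Series is then passed to apply_feature_aggregates for both train and test, avoiding leakage from the test set.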
- ndj_pipeline.prep.load_data_and_key(model_config)
Uses config to load data and assign key.
- Parameters
  - model_config (Dict[str, Any]) – Loaded model experiment config, specifically for the data path and index column(s)
- Return type
  DataFrame
- Returns
  Pandas dataframe with optionally assigned index
- ndj_pipeline.prep.save_data(train, test, model_config)
Saves train, test and combined datasets to the model folder as parquet.
- Return type
None
- ndj_pipeline.prep.split(df, model_config)
Create train test split using model config.
Config may specify a pre-calculated column present in the DataFrame, or use sklearn style split params. No config results in no split, with the creation of an empty test dataframe.
- Parameters
  - df (DataFrame) – Pandas dataframe; must contain a pre-calculated split column if this is specified in the model_config.
  - model_config (Dict[str, Any]) – Loaded model experiment config, specifically for the split approach, either a field name or sklearn style params.
- Return type
  Tuple[DataFrame, DataFrame]
- Returns
  Two Pandas DataFrames intended as the training and test sets.
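Two of the cases described above can be sketched on synthetic data; the split_params keys are hypothetical config values passed through to scikit-learn:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"x": range(10), "target": range(10)})

# Case 1: sklearn-style params taken from config (hypothetical values)
split_params = {"test_size": 0.3, "random_state": 0}
train, test = train_test_split(df, **split_params)

# Case 2: no split config, so everything goes to train and test is an
# empty frame with the same columns
train_all, test_empty = df, df.iloc[0:0]
```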