Modeling

Modeling is largely static code and suitable for most use cases.

The primary way to use this code is through a YAML configuration file to run model experiments.

You can optionally create additional model types in model.py, provided they output a list of features for further reporting.

See model.run_model_training for the list of processing steps.
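An additional model type added to model.py would follow the same signature as the built-in model functions documented below. The sketch here is hypothetical: the function name, the "target" config key, and the handling of metric creation are assumptions based on the documented parameters and return type.

```python
from typing import Any, Dict, List

import pandas as pd


def mean_baseline(
    train: pd.DataFrame,
    test: pd.DataFrame,
    features: List[str],
    config: Dict[str, Any],
) -> List[str]:
    """Hypothetical custom model: predict the training mean of the target."""
    target = config["target"]
    results = pd.DataFrame(
        {"Actual": test[target], "Predicted": train[target].mean()},
        index=test.index,
    )
    # A real model function would pass `results` to the pipeline's
    # metric creation here.
    return features  # features deemed important for further reporting
```

The key contract is the return value: a list of feature names for further reporting, matching the other model functions.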

ndj_pipeline.model

Contains custom ML model functions and pipeline for running modeling.

ndj_pipeline.model.baseline(train, test, features, config)

Compares Actual results to a naive baseline.

This will compare “Actual” results to a pre-calculated baseline column, assumed to be prepared earlier in transform.py.

It is always worthwhile to compare ML results to simple models such as an average, or group-by average. This frames any model results as meaningful improvements over a simple rule.

Creates the results DataFrame based on test data, with “Actual” and “Predicted” fields. Calls metric creation.

Parameters
  • train (DataFrame) – training dataframe containing config specified target and numeric features from features

  • test (DataFrame) – test dataframe containing config specified target and numeric features from features

  • features (List[str]) – List of columns to use in model training

  • config (Dict[str, Any]) – Loaded model experiment config, for model parameters

Return type

List[str]

Returns

List of strings indicating important features to use for further reporting

ndj_pipeline.model.gbr(train, test, features, config)

Train a Gradient Boosted Regression.

Trains using specified train dataframe and list of simple and dummy features.

Creates the results DataFrame based on test data, with “Actual” and “Predicted” fields. Calls metric creation.

Parameters
  • train (DataFrame) – training dataframe containing config specified target and numeric features from features

  • test (DataFrame) – test dataframe containing config specified target and numeric features from features

  • features (List[str]) – List of columns to use in model training

  • config (Dict[str, Any]) – Loaded model experiment config, for model parameters

Return type

List[str]

Returns

List of strings indicating important features to use for further reporting
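The behaviour described above can be sketched with scikit-learn's GradientBoostingRegressor. This is an illustration only: the "target" and "model_params" config keys, and the use of non-zero feature importances for reporting, are assumptions rather than the pipeline's actual implementation.

```python
from typing import Any, Dict, List

import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor


def gbr_sketch(
    train: pd.DataFrame,
    test: pd.DataFrame,
    features: List[str],
    config: Dict[str, Any],
) -> List[str]:
    target = config["target"]
    model = GradientBoostingRegressor(**config.get("model_params", {}))
    model.fit(train[features], train[target])
    # Results DataFrame with "Actual" and "Predicted", as documented.
    results = pd.DataFrame(
        {"Actual": test[target], "Predicted": model.predict(test[features])},
        index=test.index,
    )
    # Keep features with non-zero importance for further reporting.
    importances = pd.Series(model.feature_importances_, index=features)
    return list(importances[importances > 0].sort_values(ascending=False).index)
```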

ndj_pipeline.model.main()

Main command line entry to model training.

Can be run from the command line using: python -m ndj_pipeline.model -p {path_to_experiment.yaml}

Return type

None

ndj_pipeline.model.ols(train, test, features, config)

Train an Ordinary Least Squares Regression.

Trains using specified train dataframe and list of simple and dummy features.

Creates the results DataFrame based on test data, with “Actual” and “Predicted” fields. Calls metric creation.

Parameters
  • train (DataFrame) – training dataframe containing config specified target and numeric features from features

  • test (DataFrame) – test dataframe containing config specified target and numeric features from features

  • features (List[str]) – List of columns to use in model training

  • config (Dict[str, Any]) – Loaded model experiment config, for model parameters

Return type

List[str]

Returns

List of strings indicating important features to use for further reporting

ndj_pipeline.model.run_model_training(model_config)

Run all modeling transformations.

Includes the following steps:

  • Create model and config folder

  • Loads data and sets index

  • Create config specified dummy features

  • Filters rows according to config

  • Splits data into train/test

  • Filters target variable in train data

  • Prepares missing data replacement

  • Optionally saves data

  • Trains model according to model specifications

  • Produce metrics and plots

Parameters

model_config (Dict[str, Any]) – Loaded model experiment config

Return type

None

ndj_pipeline.prep

Set of operations to transform pandas DataFrames given user specified config.

These operations are ordered according to ndj_pipeline.model.run_model_training.

ndj_pipeline.prep.apply_feature_aggregates(df, aggregates)

Applies feature aggregates to a DataFrame’s missing values.

Performs validations and raises errors as part of process.

Parameters
  • df (DataFrame) – Pandas dataframe. Must include same columns as aggregates.

  • aggregates (Series) – Pandas series with labels and aggregate values.

Return type

DataFrame

Returns

Pandas Dataframe with missing data replaced.

Raises

ValueError – If unable to apply an aggregation value to a column. This is commonly due to incompatible datatypes, e.g. inserting a float into an Int64 column.
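Replacing missing values from a Series of per-column aggregates can be sketched with pandas.DataFrame.fillna, which accepts a Series keyed by column name. This is a minimal illustration of the documented inputs and output, not the pipeline's exact implementation.

```python
import pandas as pd

df = pd.DataFrame({"age": [25.0, None, 40.0], "income": [None, 50.0, 70.0]})
aggregates = pd.Series({"age": 32.5, "income": 60.0})  # e.g. training-set means

# fillna with a Series replaces missing values column by column.
filled = df.fillna(aggregates)
```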

ndj_pipeline.prep.apply_filtering(df, model_config)

Filters dataframe according to config specified labels.

Any row containing a specified label is filtered from the data.

Parameters
  • df (DataFrame) – Pandas dataframe, must contain _filter column with string type.

  • model_config (Dict[str, Any]) – Loaded model experiment config, specifically for list of filter labels.

Return type

DataFrame

Returns

Filtered Pandas DataFrame.

Raises

ValueError – If the ‘_filter’ column is missing from the processed data.
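A minimal sketch of the label-based filtering described above. The encoding of the _filter column as a comma-separated string of labels is an assumption; the documented contract is only that it is a string column and that rows containing a configured label are dropped.

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [1, 2, 3], "_filter": ["", "outlier", "outlier,missing_key"]}
)
filter_labels = ["outlier"]  # as might be listed in the experiment config

# Drop any row whose _filter string contains one of the configured labels.
mask = df["_filter"].apply(lambda s: not any(lbl in s for lbl in filter_labels))
filtered = df[mask]
```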

ndj_pipeline.prep.collate_features(model_config, dummy_features)

Saves and returns final list of simple and dummy features.

Return type

List[str]

ndj_pipeline.prep.create_compressed_dummies(df, dummy, min_dummy)

Creates enhanced feature dummies for a single dataframe column.

Improves on standard pandas.get_dummies by combining low-incidence dummy columns into a single _other_combined column. Dummy columns are named according to the pattern {col_name}_##_{value}.

Parameters
  • df (DataFrame) – Pandas dataframe, must contain specified dummy column.

  • dummy (str) – string label of the DataFrame column to create dummy features from

  • min_dummy (float) – minimum percentage incidence to create standalone dummy feature, otherwise group into _other_combined.

Return type

Tuple[DataFrame, List[str]]

Returns

Dummified Pandas DataFrame for a single feature. Also returns dummy column names as a list of strings.
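The compression described above can be sketched as follows. Column naming follows the documented {col_name}_##_{value} pattern; the exact grouping logic is an assumption about how low-incidence values are collapsed.

```python
import pandas as pd


def compressed_dummies(df, dummy, min_dummy):
    counts = df[dummy].value_counts(normalize=True)
    keep = counts[counts >= min_dummy].index
    # Values below the incidence threshold collapse into one combined label.
    values = df[dummy].where(df[dummy].isin(keep), "_other_combined")
    dummies = pd.get_dummies(values, prefix=f"{dummy}_##")
    return pd.concat([df, dummies], axis=1), list(dummies.columns)
```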

ndj_pipeline.prep.create_dummy_features(df, model_config)

Create dummy features for each config specified dummy_variable.

Iterates through all specified dummy features and adds to DataFrame.

Parameters
  • df (DataFrame) – Pandas dataframe, must contain specified dummy columns.

  • model_config (Dict[str, Any]) – Loaded model experiment config, specifically for list of dummy features.

Return type

Tuple[DataFrame, List[str]]

Returns

Pandas DataFrame with original data plus all new dummy fields. Also returns full list of created dummy column names

ndj_pipeline.prep.filter_target(df, model_config)

Filters Dataframe to ensure no missing data in target variable.

Parameters
  • df (DataFrame) – Pandas dataframe, must contain config specified target column.

  • model_config (Dict[str, Any]) – Loaded model experiment config, specifically for target column name.

Return type

DataFrame

Returns

Filtered Pandas DataFrame.
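The target filtering step amounts to dropping rows where the target is missing, e.g. with pandas.DataFrame.dropna. A minimal sketch; the way the target name is read from the config is an assumption.

```python
import pandas as pd

df = pd.DataFrame({"price": [100.0, None, 250.0], "rooms": [2, 3, 4]})
target = "price"  # as named in model_config

# Keep only rows with a non-missing target value.
filtered = df.dropna(subset=[target])
```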

ndj_pipeline.prep.get_simple_feature_aggregates(df, model_config)

Generates config specified feature aggregates.

These are used to inform missing data replacement strategy. Ideally this is run on training data, and used to replace train and test data.

Performs validations and raises errors as part of process.

Parameters
  • df (DataFrame) – Pandas dataframe. Must include columns specified in the simple_features section of config, and these must be numeric type columns with no infinite values.

  • model_config (Dict[str, Any]) – Loaded model experiment config, specifically for simple_features dictionary of column names and aggregation strategy.

Return type

Series

Returns

Pandas Series indexed by the specified feature columns, holding the aggregation value for each.

Raises

ValueError – If any features contain infinite values that need fixing.
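Generating one aggregate per column from a config-style mapping can be sketched with pandas Series aggregation. The shape of the simple_features mapping (column name to aggregation strategy) follows the parameter description above; the exact config schema is otherwise an assumption.

```python
import pandas as pd

df = pd.DataFrame({"age": [20.0, 30.0, None], "income": [10.0, None, 30.0]})
simple_features = {"age": "mean", "income": "median"}  # column -> strategy

# One aggregate value per column, for later missing-data replacement.
aggregates = pd.Series(
    {col: df[col].agg(strategy) for col, strategy in simple_features.items()}
)
```

Computing these on training data only, then applying them to both train and test, avoids leaking test-set statistics into the model.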

ndj_pipeline.prep.load_data_and_key(model_config)

Uses config to load data and assign key.

Parameters

model_config (Dict[str, Any]) – Loaded model experiment config, specifically for data path and index column(s)

Return type

DataFrame

Returns

Pandas dataframe with optionally assigned index

ndj_pipeline.prep.save_data(train, test, model_config)

Saves train, test and combined datasets to the model folder as parquet.

Return type

None

ndj_pipeline.prep.split(df, model_config)

Create train test split using model config.

Config may specify a pre-calculated column present in the DataFrame, or use sklearn style split params. No config results in no split, with the creation of an empty test dataframe.

Parameters
  • df (DataFrame) – Pandas dataframe, must contain a pre-calculated split column if this is specified in the model_config.

  • model_config (Dict[str, Any]) – Loaded model experiment config, specifically for split approach, either a field or sklearn style params.

Return type

Tuple[DataFrame, DataFrame]

Returns

Two Pandas DataFrames intended for training, test sets.
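The three split behaviours described above can be sketched as follows. The config key names (split_field, split_params) and the boolean encoding of the pre-calculated column are assumptions about the config schema; only the overall behaviour is documented.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_sketch(df, model_config):
    if "split_field" in model_config:
        # Pre-calculated boolean column marks the test rows.
        field = model_config["split_field"]
        return df[~df[field]], df[df[field]]
    if "split_params" in model_config:
        # sklearn-style params, e.g. {"test_size": 0.3, "random_state": 0}.
        return train_test_split(df, **model_config["split_params"])
    # No split config: all rows train, empty test DataFrame.
    return df, df.iloc[0:0]
```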