Modeling

Modeling is largely static code and suitable for most use cases.

The primary way to use this code is through a YAML configuration file to run model experiments.

You can optionally create additional model types in model.py, provided they output a list of features for further reporting.

See model.run_model_training for the list of processing steps.
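An additional model type added to model.py would follow the same signature as the built-in model functions documented below. The sketch here is hypothetical: the function name, the "target" config key, and the handling of metric creation are assumptions based on the documented parameters and return type.

```python
from typing import Any, Dict, List

import pandas as pd


def mean_baseline(
    train: pd.DataFrame,
    test: pd.DataFrame,
    features: List[str],
    config: Dict[str, Any],
) -> List[str]:
    """Hypothetical custom model: predict the training mean of the target."""
    target = config["target"]
    results = pd.DataFrame(
        {"Actual": test[target], "Predicted": train[target].mean()},
        index=test.index,
    )
    # A real model function would pass `results` to the pipeline's
    # metric creation here.
    return features  # features deemed important for further reporting
```

The key contract is the return value: a list of feature names for further reporting, matching the other model functions.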

ndj_pipeline.model

Contains custom ML model functions and pipeline for running modeling.

ndj_pipeline.model.baseline(train, test, features, config)

Compares Actual results to a naive baseline.

This will compare “Actual” results to a pre-calculated baseline column, assumed to be prepared earlier in transform.py.

It is always worthwhile to compare ML results to simple models such as an average, or group-by average. This frames any model results as meaningful improvements over a simple rule.

Creates the results DataFrame based on test data, with “Actual” and “Predicted” fields. Calls metric creation.

Parameters
  • train (DataFrame) – training dataframe containing config specified target and numeric features from features

  • test (DataFrame) – test dataframe containing config specified target and numeric features from features

  • features (List[str]) – List of columns to use in model training

  • config (Dict[str, Any]) – Loaded model experiment config, for model parameters

Return type

List[str]

Returns

List of strings indicating important features to use for further reporting

ndj_pipeline.model.gbr(train, test, features, config)

Train a Gradient Boosted Regression.

Trains using specified train dataframe and list of simple and dummy features.

Creates the results DataFrame based on test data, with “Actual” and “Predicted” fields. Calls metric creation.

Parameters
  • train (DataFrame) – training dataframe containing config specified target and numeric features from features

  • test (DataFrame) – test dataframe containing config specified target and numeric features from features

  • features (List[str]) – List of columns to use in model training

  • config (Dict[str, Any]) – Loaded model experiment config, for model parameters

Return type

List[str]

Returns

List of strings indicating important features to use for further reporting
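The behaviour described above can be sketched with scikit-learn's GradientBoostingRegressor. This is an illustration only: the "target" and "model_params" config keys, and the use of non-zero feature importances for reporting, are assumptions rather than the pipeline's actual implementation.

```python
from typing import Any, Dict, List

import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor


def gbr_sketch(
    train: pd.DataFrame,
    test: pd.DataFrame,
    features: List[str],
    config: Dict[str, Any],
) -> List[str]:
    target = config["target"]
    model = GradientBoostingRegressor(**config.get("model_params", {}))
    model.fit(train[features], train[target])
    # Results DataFrame with "Actual" and "Predicted", as documented.
    results = pd.DataFrame(
        {"Actual": test[target], "Predicted": model.predict(test[features])},
        index=test.index,
    )
    # Keep features with non-zero importance for further reporting.
    importances = pd.Series(model.feature_importances_, index=features)
    return list(importances[importances > 0].sort_values(ascending=False).index)
```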

ndj_pipeline.model.main()

Main command line entry to model training.

Can be run from the command line using: python -m ndj_pipeline.model -p {path_to_experiment.yaml}

Return type

None

ndj_pipeline.model.ols(train, test, features, config)

Train an Ordinary Least Squares Regression.

Trains using specified train dataframe and list of simple and dummy features.

Creates the results DataFrame based on test data, with “Actual” and “Predicted” fields. Calls metric creation.

Parameters
  • train (DataFrame) – training dataframe containing config specified target and numeric features from features

  • test (DataFrame) – test dataframe containing config specified target and numeric features from features

  • features (List[str]) – List of columns to use in model training

  • config (Dict[str, Any]) – Loaded model experiment config, for model parameters

Return type

List[str]

Returns

List of strings indicating important features to use for further reporting

ndj_pipeline.model.run_model_training(model_config)

Run all modeling transformations.

Includes the following steps:

  • Create model and config folder

  • Loads data and sets index

  • Create config specified dummy features

  • Filters rows according to config

  • Splits data into train/test

  • Filters target variable in train data

  • Prepares missing data replacement

  • Optionally saves data

  • Trains model according to model specifications

  • Produce metrics and plots

Parameters

model_config (Dict[str, Any]) – Loaded model experiment config

Return type

None

ndj_pipeline.prep

Set of operations to transform pandas DataFrames given user specified config.

These operations are ordered according to ndj_pipeline.model.run_model_training.

ndj_pipeline.prep.apply_feature_aggregates(df, aggregates)

Applies feature aggregates to a DataFrame’s missing values.

Performs validations and raises errors as part of process.

Parameters
  • df (DataFrame) – Pandas dataframe. Must include same columns as aggregates.

  • aggregates (Series) – Pandas series with labels and aggregate values.

Return type

DataFrame

Returns

Pandas Dataframe with missing data replaced.

Raises

ValueError – If unable to apply an aggregation value to a column. This is commonly due to incompatible datatypes, e.g. inserting a float into an Int64 column.
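Replacing missing values from a Series of per-column aggregates can be sketched with pandas.DataFrame.fillna, which accepts a Series keyed by column name. This is a minimal illustration of the documented inputs and output, not the pipeline's exact implementation.

```python
import pandas as pd

df = pd.DataFrame({"age": [25.0, None, 40.0], "income": [None, 50.0, 70.0]})
aggregates = pd.Series({"age": 32.5, "income": 60.0})  # e.g. training-set means

# fillna with a Series replaces missing values column by column.
filled = df.fillna(aggregates)
```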

ndj_pipeline.prep.apply_filtering(df, model_config)

Filters dataframe according to config specified labels.

Any row containing a specified label is filtered from the data.

Parameters
  • df (DataFrame) – Pandas dataframe, must contain _filter column with string type.

  • model_config (Dict[str, Any]) – Loaded model experiment config, specifically for list of filter labels.

Return type

DataFrame

Returns

Filtered Pandas DataFrame.

Raises

ValueError – If the ‘_filter’ column is missing from the processed data.
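A minimal sketch of the label-based filtering described above. The encoding of the _filter column as a comma-separated string of labels is an assumption; the documented contract is only that it is a string column and that rows containing a configured label are dropped.

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [1, 2, 3], "_filter": ["", "outlier", "outlier,missing_key"]}
)
filter_labels = ["outlier"]  # as might be listed in the experiment config

# Drop any row whose _filter string contains one of the configured labels.
mask = df["_filter"].apply(lambda s: not any(lbl in s for lbl in filter_labels))
filtered = df[mask]
```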

ndj_pipeline.prep.collate_features(model_config, dummy_features)

Saves and returns final list of simple and dummy features.

Return type

List[str]

ndj_pipeline.prep.create_compressed_dummies(df, dummy, min_dummy)

Creates enhanced feature dummies for a single dataframe column.

Improves on standard pandas.get_dummies by combining low-incidence dummy columns into a single _other_combined column. Dummy columns are named according to the pattern {col_name}_##_{value}.

Parameters
  • df (DataFrame) – Pandas dataframe, must contain specified dummy column.

  • dummy (str) – string label of the DataFrame column to create dummy features from

  • min_dummy (float) – minimum percentage incidence to create standalone dummy feature, otherwise group into _other_combined.

Return type

Tuple[DataFrame, List[str]]

Returns

Dummified Pandas DataFrame for a single feature. Also returns dummy column names as a list of strings.
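The compression described above can be sketched as follows. Column naming follows the documented {col_name}_##_{value} pattern; the exact grouping logic is an assumption about how low-incidence values are collapsed.

```python
import pandas as pd


def compressed_dummies(df, dummy, min_dummy):
    counts = df[dummy].value_counts(normalize=True)
    keep = counts[counts >= min_dummy].index
    # Values below the incidence threshold collapse into one combined label.
    values = df[dummy].where(df[dummy].isin(keep), "_other_combined")
    dummies = pd.get_dummies(values, prefix=f"{dummy}_##")
    return pd.concat([df, dummies], axis=1), list(dummies.columns)
```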

ndj_pipeline.prep.create_dummy_features(df, model_config)

Create dummy features for each config specified dummy_variable.

Iterates through all specified dummy features and adds to DataFrame.

Parameters
  • df (DataFrame) – Pandas dataframe, must contain specified dummy columns.

  • model_config (Dict[str, Any]) – Loaded model experiment config, specifically for list of dummy features.

Return type

Tuple[DataFrame, List[str]]

Returns

Pandas DataFrame with original data plus all new dummy fields. Also returns full list of created dummy column names

ndj_pipeline.prep.filter_target(df, model_config)

Filters Dataframe to ensure no missing data in target variable.

Parameters
  • df (DataFrame) – Pandas dataframe, must contain config specified target column.

  • model_config (Dict[str, Any]) – Loaded model experiment config, specifically for target column name.

Return type

DataFrame

Returns

Filtered Pandas DataFrame.
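The target filtering step amounts to dropping rows where the target is missing, e.g. with pandas.DataFrame.dropna. A minimal sketch; the way the target name is read from the config is an assumption.

```python
import pandas as pd

df = pd.DataFrame({"price": [100.0, None, 250.0], "rooms": [2, 3, 4]})
target = "price"  # as named in model_config

# Keep only rows with a non-missing target value.
filtered = df.dropna(subset=[target])
```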

ndj_pipeline.prep.get_simple_feature_aggregates(df, model_config)

Generates config specified feature aggregates.

These are used to inform missing data replacement strategy. Ideally this is run on training data, and used to replace train and test data.

Performs validations and raises errors as part of process.

Parameters
  • df (DataFrame) – Pandas dataframe. Must include columns specified in the simple_features section of config, and these must be numeric type columns with no infinite values.

  • model_config (Dict[str, Any]) – Loaded model experiment config, specifically for simple_features dictionary of column names and aggregation strategy.

Return type

Series

Returns

Pandas Series indexed by the specified feature columns, holding the aggregation value for each.

Raises

ValueError – If any features contain infinite values that need fixing.
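Generating one aggregate per column from a config-style mapping can be sketched with pandas Series aggregation. The shape of the simple_features mapping (column name to aggregation strategy) follows the parameter description above; the exact config schema is otherwise an assumption.

```python
import pandas as pd

df = pd.DataFrame({"age": [20.0, 30.0, None], "income": [10.0, None, 30.0]})
simple_features = {"age": "mean", "income": "median"}  # column -> strategy

# One aggregate value per column, for later missing-data replacement.
aggregates = pd.Series(
    {col: df[col].agg(strategy) for col, strategy in simple_features.items()}
)
```

Computing these on training data only, then applying them to both train and test, avoids leaking test-set statistics into the model.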

ndj_pipeline.prep.load_data_and_key(model_config)

Uses config to load data and assign key.

Parameters

model_config (Dict[str, Any]) – Loaded model experiment config, specifically for data path and index column(s)

Return type

DataFrame

Returns

Pandas dataframe with optionally assigned index

ndj_pipeline.prep.save_data(train, test, model_config)

Saves train, test and combined datasets to the model folder as parquet.

Return type

None

ndj_pipeline.prep.split(df, model_config)

Create train test split using model config.

Config may specify a pre-calculated column present in the DataFrame, or use sklearn style split params. No config results in no split, with the creation of an empty test dataframe.

Parameters
  • df (DataFrame) – Pandas dataframe, must contain a pre-calculated split column if this is specified in the model_config.

  • model_config (Dict[str, Any]) – Loaded model experiment config, specifically for split approach, either a field or sklearn style params.

Return type

Tuple[DataFrame, DataFrame]

Returns

Two Pandas DataFrames intended for training, test sets.
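The three split behaviours described above can be sketched as follows. The config key names (split_field, split_params) and the boolean encoding of the pre-calculated column are assumptions about the config schema; only the overall behaviour is documented.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_sketch(df, model_config):
    if "split_field" in model_config:
        # Pre-calculated boolean column marks the test rows.
        field = model_config["split_field"]
        return df[~df[field]], df[df[field]]
    if "split_params" in model_config:
        # sklearn-style params, e.g. {"test_size": 0.3, "random_state": 0}.
        return train_test_split(df, **model_config["split_params"])
    # No split config: all rows train, empty test DataFrame.
    return df, df.iloc[0:0]
```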