Data preparation

The following provides a reference guide to the data preparation steps.

It includes two Python modules: data checks and transformations.

It is useful to separate these concerns: once you trust the schema, data typing, and consistency of the data, you can proceed to feature engineering with confidence.

In addition, keeping feature engineering separate from the modeling step makes it easier to set up config-driven model experimentation.

While there is some boilerplate code to help with logging and the command-line interface, the functions created here are usually entirely bespoke to the needs of the data analysis. The examples show the types of information to check, and the columns required for the config.
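
As an illustration only (not the project's actual code), the kind of logging and command-line boilerplate referred to here might look like the following minimal sketch; the --verbose flag is a hypothetical example:

    import argparse
    import logging

    logger = logging.getLogger(__name__)


    def main() -> None:
        """Illustrative command-line entry point with basic logging setup."""
        parser = argparse.ArgumentParser(description="Run a data preparation step.")
        parser.add_argument("--verbose", action="store_true", help="Enable debug logging.")
        args = parser.parse_args()
        logging.basicConfig(level=logging.DEBUG if args.verbose else logging.INFO)
        logger.info("Starting data preparation")


    if __name__ == "__main__":
        main()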

ndj_pipeline.data_checks

Schemas and assertions for raw data inputs.

Schema checks are useful for standardizing certain information in raw data, including:

  • Clean column names (lowercase, underscore)

  • Ensure an understanding of primary keys of data (uniqueness)

  • Set data types of columns, e.g. boolean->int, datetime, string, nullable integer


  • Pandera schema checks to lock in expectations about a data file. This helps to quickly assess whether a file has changed, or whether a similar file is the same or different. Some examples (sketched in code after this list):

    • Contains an exact list of columns (no more, no less)

    • Column nullability

    • Correct data types

    • Value range checks (e.g. must be greater than 0, must contain only an allowed set of values)

Schemas can often be re-applied to similar data files, e.g. tabs of an Excel workbook, or train/test data.
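
As a sketch of the kinds of checks listed above (the column names, dtypes, and allowed values here are illustrative assumptions, not the project's actual schema), a pandera schema might look like:

    import pandera as pa

    # Illustrative schema: exact column set, uniqueness, nullability, dtypes and value ranges.
    example_schema = pa.DataFrameSchema(
        {
            "passenger_id": pa.Column(int, unique=True, nullable=False),
            "age": pa.Column(float, checks=pa.Check.gt(0), nullable=True),
            "sex": pa.Column(str, checks=pa.Check.isin(["male", "female"])),
        },
        strict=True,  # no columns beyond those listed (no more, no less)
        coerce=True,  # cast to the declared dtypes where possible
    )


    def check_example(df):
        """Raise a pandera SchemaError if the dataframe drifts from expectations."""
        return example_schema.validate(df)

Because validate returns the checked dataframe, the same schema object can be re-applied to each similar file, such as separate train and test extracts.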

ndj_pipeline.data_checks.check_titanic()

Data schema and typing validations.

Return type

DataFrame

Returns

Loaded pandas DataFrame with typing and schema checks applied.
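
Typical usage in a script or notebook might look like:

    from ndj_pipeline import data_checks

    # Returns the Titanic data as a pandas DataFrame with typing and schema checks applied.
    df = data_checks.check_titanic()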

ndj_pipeline.transform

Data transformations (ETL) from raw (checked) files into a single, feature-rich dataframe.

Can be run from the command line, or as imported functions in other Python scripts and Jupyter notebooks.

ndj_pipeline.transform.create_titanic_features(df)

Feature engineering, including _filter and split columns.

Fully custom pandas code for Titanic data feature engineering.

Parameters

df (DataFrame) – Pre-validated and checked Pandas DataFrame.

Return type

None
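
As an illustration of the kind of bespoke feature engineering this function performs (the column names, the _filter and split naming details, and the feature logic below are assumptions made for the sketch, not the actual implementation), such code might look like:

    import pandas as pd


    def create_example_features(df: pd.DataFrame) -> pd.DataFrame:
        """Illustrative Titanic-style feature engineering on assumed columns."""
        df = df.copy()
        # Derived feature from assumed sibsp/parch columns.
        df["family_size"] = df["sibsp"] + df["parch"] + 1
        # Example "_filter" column: rows flagged True could be excluded downstream.
        df["missing_age_filter"] = df["age"].isna()
        # Example split column to support train/test experimentation.
        df["split"] = "train"
        return df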

ndj_pipeline.transform.main()

Run transformations from the command line using:

python -m ndj_pipeline.transform

Return type

None

ndj_pipeline.transform.run()

Perform all data transformation steps.

Return type

None
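
Equivalently to the command-line invocation above, the transformations can be triggered from another Python script or Jupyter notebook:

    from ndj_pipeline import transform

    # Performs all data transformation steps.
    transform.run()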