Data preparation
This page is a reference guide to the data preparation steps.
It covers two Python modules: data checks and transformations.
Separating these concerns is useful: once you have confidence in the schema, data types, and consistency of the data, you can build feature engineering on top of it with confidence.
In addition, keeping feature engineering separate from the modeling step makes it easier to drive model experiments from config.
While there is some boilerplate code to help with logging and the command line interface, the functions created here are usually bespoke to the needs of the data analysis. The examples show the types of information to check and the columns that the config requires.
ndj_pipeline.data_checks
Schemas and assertions for raw data inputs.
Schema checks are useful to standardize certain information in raw data including:
Clean column names (lowercase, underscore)
Ensure an understanding of primary keys of data (uniqueness)
Set data types of columns, e.g. boolean->int, datetime, string, nullable integer
Pandera schema checks attempt to lock in some expectations about a data file. This helps to quickly assess whether a file has changed, or whether a similar file matches or differs. Some examples:
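The standardization steps listed above can be sketched in plain pandas. This is a minimal illustration, not the project's code: the function name `standardize_raw` and the column names (`passenger_id`, `survived`, `boarded_on`, `name`, `age`) are assumptions made up for the example.

```python
import pandas as pd


def standardize_raw(df: pd.DataFrame, primary_key: str) -> pd.DataFrame:
    """Hypothetical helper: clean names, assert key uniqueness, set dtypes."""
    df = df.copy()

    # Clean column names: strip whitespace, lowercase, spaces -> underscores.
    df.columns = (
        df.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
    )

    # The primary key must identify each row exactly once.
    assert df[primary_key].is_unique, f"{primary_key} is not unique"

    # Explicit typing (column names are assumptions for this example).
    df["survived"] = df["survived"].astype(int)    # boolean -> int
    df["boarded_on"] = pd.to_datetime(df["boarded_on"])  # datetime
    df["name"] = df["name"].astype("string")       # string
    df["age"] = df["age"].astype("Int64")          # nullable integer
    return df


raw = pd.DataFrame(
    {
        "Passenger Id": [1, 2, 3],
        "Survived": [True, False, True],
        "Boarded On": ["1912-04-10", "1912-04-10", "1912-04-11"],
        "Name": ["Allen", "Braund", "Cumings"],
        "Age": [29.0, None, 38.0],
    }
)
clean = standardize_raw(raw, primary_key="passenger_id")
```

Using the nullable `Int64` dtype keeps missing ages as `<NA>` rather than forcing the whole column to float.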
Contains exact list of columns (no more no less)
Nullable column
Data type correct
Data range checks (must not be 0 or less, must contain only these values)
Schemas can often be re-applied to similar data files, e.g. tabs of an Excel workbook, or train/test data.
- ndj_pipeline.data_checks.check_titanic()
Data schema and typing validations.
- Return type
DataFrame
- Returns
Loaded pandas dataframe with typing and schema checks.
ndj_pipeline.transform
Data transformations (ETL) from raw (checked) files into a single, feature-rich dataframe.
Can be run from the command line, or imported as functions in other Python scripts and Jupyter notebooks.
- ndj_pipeline.transform.create_titanic_features(df)
Feature engineering, including _filter and split columns.
Fully custom pandas code for Titanic data feature engineering.
- Parameters
df (DataFrame) – Pre-validated and checked pandas DataFrame.
- Return type
None
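To give a feel for what such a step could contain, here is a loose sketch and not the actual implementation of `create_titanic_features`. The function name `sketch_titanic_features`, the input columns (`sibsp`, `parch`, `fare`), and the meaning given to `_filter` (a label for rows to exclude) and `split` (a train/test assignment) are all assumptions. Unlike the documented function, which returns `None`, the sketch returns the dataframe so it is easy to inspect.

```python
import numpy as np
import pandas as pd


def sketch_titanic_features(
    df: pd.DataFrame, test_fraction: float = 0.2, seed: int = 0
) -> pd.DataFrame:
    """Hypothetical feature step adding features plus _filter and split."""
    out = df.copy()

    # Example engineered features (input column names are assumptions).
    out["family_size"] = out["sibsp"] + out["parch"] + 1
    out["is_alone"] = (out["family_size"] == 1).astype(int)

    # Assumed meaning: a non-empty _filter marks a row to exclude, with a reason.
    out["_filter"] = np.where(out["fare"].isna(), "missing_fare", "")

    # Assumed meaning: split assigns each row to train or test, reproducibly.
    rng = np.random.default_rng(seed)
    out["split"] = np.where(rng.random(len(out)) < test_fraction, "test", "train")
    return out


checked = pd.DataFrame(
    {
        "sibsp": [1, 0, 0, 3],
        "parch": [0, 0, 2, 1],
        "fare": [7.25, None, 8.05, 21.0],
    }
)
features = sketch_titanic_features(checked)
```

Keeping the exclusion reason and the split assignment as columns means downstream modeling code only needs to read the dataframe, not re-derive either decision.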
- ndj_pipeline.transform.main()
Run transformations from command line using…
python -m ndj_pipeline.transform
- Return type
None
- ndj_pipeline.transform.run()
Perform all data transformation steps.
- Return type
None