Automating Data Cleaning for Nonprofits: A Python Journey
Background
ReFED is an organization that estimates the amount of surplus food, measured in tons, produced in each sector where food is grown, processed, and/or sold. These sectors are Farm, Manufacturing, Retail, Food Service, and Residential. ReFED uses a variety of input datasets to calculate how much ‘extra’ food is present in each sector, the reasons (what they call ‘causes’) for that extra food, and where that food typically goes (landfill, compost, food donation, etc.).
As part of a project, I was asked to automate the cleaning and processing of the raw data ReFED uses to calculate the tons of surplus food that was grown on farms but never harvested, along with a breakdown of the reasons (causes) why that food went unharvested.
The sample dataset provided included the reasons food might not have been harvested and the rate for each reason. ReFED uses these breakdowns to help food producers identify interventions that can minimize the production of extra food whenever possible.
Project Requirements
The final submission should be a zipped file directory that includes:
Scripts, Jupyter notebooks, and/or other tools or files necessary to load, process, and analyze data
A readme file that describes your approach to the problem and your solution’s architecture, and gives instructions on how to run the data pipeline and replicate the analysis
A version(s) of the raw data
A version(s) of the ‘clean’ raw data (pre-calculations)
A version(s) of the final or ‘production’ data (i.e. a dataset that contains the row-level raw data plus all of the calculated fields, such as tons_never_harvested, the calculated intermediary features, and tons_never_harvested by cause)
At least two visualizations
Work should predominantly be in Python, although you are welcome to use SQL for storing data, should you so desire (although not necessary)
The file directory should be a git repository and should include at least two ‘versions’ of the input and output datasets (you will be loading some data, and then updating that data, and we need to be able to track how that data and/or code has changed)
The code should be fully executable and your analysis should be replicable (i.e. another person should be able to fork the repo if it were on GitHub and run it).
Code should include inline comments where appropriate
Simply Python
To accommodate these requirements, I created a simple Python data processing pipeline that lets users run a selection of predefined data cleaning and processing steps. Those steps include, but are not limited to:
Load a CSV file from a user-specified directory and remove the index
Create and validate the schema of the dataset
Generate a hash ID that uniquely identifies each row of data, allowing duplicates to be removed (and making rows easier to store in a database); a sketch of this step appears after this list
Optionally merge outputs by passing --merge True, which concatenates all files processed on each run
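The hash-ID and deduplication steps are what keep the dataset clean across repeated runs, so they are worth a closer look. Below is a minimal sketch of how such helpers could work; the columns in HASH_COL_LS are placeholders and the function bodies are illustrative rather than the project's actual implementation (in the real pipeline these live as methods on the pipeline class).

import hashlib

import pandas as pd

# Placeholder columns; the real project defines its own HASH_COL_LS.
HASH_COL_LS = ["state", "commodity", "year", "cause"]


def make_hash_id(df: pd.DataFrame, hash_cols: list) -> pd.DataFrame:
    """Add a hash_id column derived from the values in hash_cols.

    Rows with identical values in hash_cols get identical hashes, which makes
    duplicate removal (and use as a database key) straightforward.
    """
    df = df.copy()
    joined = df[hash_cols].astype(str).agg("|".join, axis=1)
    df["hash_id"] = joined.map(lambda s: hashlib.md5(s.encode("utf-8")).hexdigest())
    return df


def remove_dups(df: pd.DataFrame) -> pd.DataFrame:
    """Drop any row whose hash_id has already appeared."""
    return df.drop_duplicates(subset="hash_id", keep="first")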
Architecture
A basic Python class holds a selection of methods that can be used to clean and process the input data. The methods are then chained together using pandas' DataFrame.pipe method.
pipeline = CleaningPipeline()
cleaned_data = (
    pipeline.open_csv()                              # load a CSV and remove the index
    .pipe(pipeline.validate_farm_data_schema)        # validate the dataset schema
    .pipe(pipeline.lowercase_str_cols)               # lowercase string columns
    .pipe(pipeline.drop_nan_rows_in_required_cols)   # drop rows with NaNs in required columns
    .pipe(pipeline.make_hash_id, HASH_COL_LS)        # generate a unique hash ID per row
    .pipe(pipeline.remove_dups)                      # drop duplicate rows
)
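For context, here is a minimal, hypothetical skeleton of such a class. The method names mirror the chain above, but the paths and method bodies are simplified stand-ins rather than the project's actual code. The key design choice is that every cleaning method takes a DataFrame as its first argument and returns a DataFrame, which is exactly what DataFrame.pipe expects and what keeps the steps freely composable.

import pandas as pd


class CleaningPipeline:
    """Illustrative skeleton; paths and method bodies are simplified stand-ins."""

    def __init__(self, input_dir: str = "data/raw"):
        self.input_dir = input_dir

    def open_csv(self, filename: str = "farm_data.csv") -> pd.DataFrame:
        # Load the raw file without treating the first column as an index.
        return pd.read_csv(f"{self.input_dir}/{filename}", index_col=False)

    def lowercase_str_cols(self, df: pd.DataFrame) -> pd.DataFrame:
        # Normalize string columns so later grouping and joining stay consistent.
        df = df.copy()
        str_cols = df.select_dtypes(include="object").columns
        df[str_cols] = df[str_cols].apply(lambda col: col.str.lower())
        return df

    def drop_nan_rows_in_required_cols(self, df: pd.DataFrame,
                                       required_cols=None) -> pd.DataFrame:
        # Rows missing required values cannot feed the downstream calculations.
        return df.dropna(subset=required_cols) if required_cols else df.dropna()

Because each method both accepts and returns a DataFrame, .pipe can chain them in any order, and new steps can be added without touching the existing ones.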
The project, which can be viewed in full on GitHub here, demonstrates a minimal approach to solving a common data problem faced by many nonprofit organizations with small teams. Although more robust solutions could be applied, it's often the simple, straightforward approach that proves flexible and easy for small teams to build upon.