Template for creating a dataset from an existing dataset using a single function¶
This example creates a dataset from the covid-19-epidemiology dataset created in the notebook that demos how to create a dataset from a single .csv file.
This notebook imports its functionality from the src module. If your project module has a different name, substitute that name wherever src appears.
Basic imports¶
# Basic utility functions
import logging
import pathlib
from functools import partial
from src.log import logger
from src.data import Dataset
from src import paths, helpers
# Optionally set to debug log level
#logger.setLevel(logging.DEBUG)
%load_ext autoreload
%autoreload 2
Load existing dataset¶
ds = Dataset.load('covid-19-epidemiology')
ds.data.shape
print(ds.README)
print(ds.LICENSE)
Create a function that we want to transform by¶
Here, let's do something extremely simple: subselect rows by key, where the key identifies a geographic region.
We will use this function to create a derived dataset, so let's save it in the project module (src in this case), in src/data/transformer_functions.py.
project_path = paths['project_path']
%%writefile -a $project_path/src/data/transformer_functions.py
def subselect_by_key(df, key):
    """
    Filter dataframe by key and return resulting dataframe.
    """
    return df[df.key == key]
from src.data.transformer_functions import subselect_by_key
subselect_by_key.__module__
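The `__module__` check above presumably confirms that the function now resolves to the project module rather than to the notebook's own namespace. A minimal standard-library sketch of the difference, where `json.dumps` stands in for an imported project function and `defined_here` is a hypothetical function defined in place:

```python
import json


def defined_here(x):
    """Hypothetical function defined in the current namespace, as in a notebook cell."""
    return x


# A function imported from a module reports that module's name.
imported_module = json.dumps.__module__

# A function defined in place reports the namespace it was created in
# (typically "__main__" in a notebook or script).
local_module = defined_here.__module__

print(imported_module)
print(local_module)
```

A function whose `__module__` points at an importable module can be located again by name later, which is likely what makes it suitable for a reproducible derived dataset.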
df = ds.data.copy()
For example, CA will give us the numbers for Canada:
key_df = subselect_by_key(df, 'CA')
key_df.shape
Here are some trends:
key_df[['date', 'new_confirmed']].plot();
key_df[['date', 'new_deceased']].plot();
Create a derived dataset¶
Let's create a dataset that contains just the Canadian epidemiological numbers. To do so, we only need to apply a single function to the existing data.
Here is the information we need to create a dataset using helpers.dataset_from_single_function():
source_dataset_name
dataset_name
data_function
added_readme_txt
For reproducibility, data_function should be defined in the project module (in this case src), as we have already done with subselect_by_key above.
key = 'CA'
source_dataset_name = 'covid-19-epidemiology'
dataset_name = f'covid-19-epidemiology-{key}'
data_function = partial(subselect_by_key, key=key)
added_readme_txt = f"""The dataset {dataset_name} is the subset of the \
{source_dataset_name} dataset whose key is {key}."""
# test out the function
data_function(df).shape
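The cell above uses `functools.partial` to fix `key` ahead of time, so that `data_function` takes only the data as its argument. A minimal sketch of that binding follows, using a plain-Python stand-in (`subselect_rows_by_key`, a hypothetical name) and made-up rows instead of the real dataframe:

```python
from functools import partial


def subselect_rows_by_key(rows, key):
    """Plain-Python stand-in for subselect_by_key: keep rows whose 'key' matches."""
    return [row for row in rows if row["key"] == key]


# Made-up rows for illustration only; not real epidemiology data.
rows = [
    {"key": "CA", "new_confirmed": 1},
    {"key": "US", "new_confirmed": 2},
    {"key": "CA", "new_confirmed": 3},
]

# Bind key once; the resulting callable takes only the data argument.
select_ca = partial(subselect_rows_by_key, key="CA")

# Calling the partial is equivalent to calling the original with key="CA".
assert select_ca(rows) == subselect_rows_by_key(rows, "CA")
print(len(select_ca(rows)))  # 2

# The partial remembers which function and arguments it wraps.
print(select_ca.func.__name__, select_ca.keywords)
```

Because the partial keeps a reference to the wrapped function and the bound keyword, the same pattern can be reused for any other key by changing a single value.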
Use the helper function to create the derived dataset¶
ds = helpers.dataset_from_single_function(
    source_dataset_name=source_dataset_name,
    dataset_name=dataset_name,
    data_function=data_function,
    added_readme_txt=added_readme_txt,
    overwrite_catalog=True)
dataset_name
ds = Dataset.load(dataset_name)
ds.data.shape
print(ds.README)
print(ds.LICENSE)
ds.data[['date', 'new_confirmed']].plot();
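Conceptually, the steps above amount to applying one function to the source data and extending the README. The following is only a rough sketch of that idea, not the implementation of `helpers.dataset_from_single_function`: `ToyDataset` and `derive_dataset` are hypothetical names, carrying the license over unchanged is an assumption, and the catalog bookkeeping is not modeled at all.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ToyDataset:
    """Hypothetical stand-in for a dataset; not the project's Dataset class."""
    name: str
    data: Any
    readme: str
    license: str


def derive_dataset(source, dataset_name, data_function, added_readme_txt):
    """Apply data_function to the source data and append text to the README.

    Illustrative only: the license is assumed to carry over unchanged,
    and no catalog entry is written.
    """
    return ToyDataset(
        name=dataset_name,
        data=data_function(source.data),
        readme=source.readme + "\n\n" + added_readme_txt,
        license=source.license,
    )


# Made-up source data for illustration.
source = ToyDataset("toy-source", [1, 2, 3, 4], "Source README.", "Example license.")
derived = derive_dataset(
    source,
    "toy-source-even",
    lambda values: [v for v in values if v % 2 == 0],
    "Only the even values.",
)

print(derived.data)  # [2, 4]
print(derived.readme)
```

The source dataset is left untouched in this sketch; the derived dataset is a new object built from it.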
Check in the new catalog files¶
Finally, check the new catalog files in to version control so that the derived dataset is recorded alongside the code that produces it.