Template for creating a dataset from an existing dataset using a single function¶

This example derives a new dataset from the covid-19-epidemiology dataset, which was created in the notebook that demonstrates how to create a dataset from a single .csv file.

This notebook imports functionality from the src module. If your project module has a different name, use that name wherever src appears.

Basic imports¶

In [ ]:

# Basic utility functions
import logging
import pathlib
from functools import partial

from src.log import logger
from src.data import Dataset
from src import paths, helpers

In [ ]:

# Optionally set to debug log level
#logger.setLevel(logging.DEBUG)

In [ ]:

%load_ext autoreload
%autoreload 2

Load existing dataset¶

In [ ]:

ds = Dataset.load('covid-19-epidemiology')

In [ ]:

ds.data.shape

In [ ]:

print(ds.README)

In [ ]:

print(ds.LICENSE)

Create a transformation function¶

Let's do something extremely simple: subselect rows by key, a column that identifies a geographic region.

We will use this function to create a derived dataset, so let's save it in the project module (src in this case) in data/transformer_functions.py.

In [ ]:

project_path = paths['project_path']

In [ ]:

%%writefile -a $project_path/src/data/transformer_functions.py

def subselect_by_key(df, key):
    """
    Filter dataframe by key and return resulting dataframe.
    """
    return df[df.key == key]
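
The -a flag makes %%writefile append to the file, so re-running the cell above adds another copy of subselect_by_key each time. A minimal sketch of one way to guard against that is shown here. The helper append_function_once is a hypothetical name (it is not part of src), and the demonstration writes to a temporary file rather than the real transformer_functions.py:

```python
import tempfile
from pathlib import Path

FUNCTION_SOURCE = '''
def subselect_by_key(df, key):
    """
    Filter dataframe by key and return resulting dataframe.
    """
    return df[df.key == key]
'''


def append_function_once(module_path, function_name, source):
    """Append `source` to `module_path` unless `function_name` is already defined there.

    Hypothetical helper for illustration; returns True if the file was modified.
    """
    module_path = Path(module_path)
    existing = module_path.read_text() if module_path.exists() else ""
    if f"def {function_name}(" in existing:
        return False
    with module_path.open("a") as f:
        f.write(source)
    return True


# Demonstrate on a temporary file, not on the project module
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "transformer_functions.py"
    first = append_function_once(target, "subselect_by_key", FUNCTION_SOURCE)
    second = append_function_once(target, "subselect_by_key", FUNCTION_SOURCE)
    definitions = target.read_text().count("def subselect_by_key(")
```

The second call leaves the file untouched, so the module ends up with a single definition no matter how often the cell is run.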

In [ ]:

from src.data.transformer_functions import subselect_by_key

In [ ]:

subselect_by_key.__module__

In [ ]:

df = ds.data.copy()

For example, CA will give us the numbers for Canada:

In [ ]:

key_df = subselect_by_key(df, 'CA')
key_df.shape
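
To see what the filter does without loading the full dataset, here is a small self-contained sketch. The rows are made up for illustration and are not taken from covid-19-epidemiology. The function body is copied from the cell above so that the block runs on its own:

```python
import pandas as pd


def subselect_by_key(df, key):
    """Filter dataframe by key and return resulting dataframe."""
    return df[df.key == key]


# Made-up rows for illustration only
toy_df = pd.DataFrame({
    "date": ["2020-03-01", "2020-03-01", "2020-03-02", "2020-03-02"],
    "key": ["CA", "US", "CA", "US"],
    "new_confirmed": [1, 2, 3, 4],
})

ca_df = subselect_by_key(toy_df, "CA")
print(ca_df.shape)        # (2, 3): two CA rows, all three columns
print(list(ca_df.index))  # [0, 2]: the original row labels are kept
```

Because boolean indexing keeps the original row labels, the derived frame's index is not contiguous. Call reset_index(drop=True) on the result if a fresh index is needed.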

Here are some trends:

In [ ]:

key_df[['date', 'new_confirmed']].plot();

In [ ]:

key_df[['date', 'new_deceased']].plot();
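
DataFrame.plot() uses the row index as the x-axis unless x is given, so the plots above are drawn against row labels, not dates. One way to put dates on the x-axis is sketched here with made-up numbers. It assumes the date column holds ISO-formatted strings, which should be checked with ds.data.dtypes before relying on it:

```python
import pandas as pd

# Made-up rows for illustration only
toy_df = pd.DataFrame({
    "date": ["2020-03-01", "2020-03-02", "2020-03-03"],
    "new_confirmed": [5, 8, 13],
})

# Parse the strings into timestamps and use them as the index
series = (
    toy_df.assign(date=pd.to_datetime(toy_df["date"]))
          .set_index("date")["new_confirmed"]
)
# series.plot() would now draw new_confirmed against calendar dates
```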

Create a derived dataset¶

Let's create a dataset that contains only the Canadian epidemiological numbers. To do so, we only need to apply a single function to the existing data.

Here is the information we need to create a dataset using helpers.dataset_from_single_function():

source_dataset_name
dataset_name
data_function
added_readme_txt

For reproducibility, data_function should be defined in the project module (src in this case), which we have already done with subselect_by_key above.

In [ ]:

key = 'CA'

In [ ]:

source_dataset_name = 'covid-19-epidemiology'
dataset_name = f'covid-19-epidemiology-{key}'
data_function = partial(subselect_by_key, key=key)
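
partial fixes the key argument, so data_function takes only the dataframe. The resulting object also keeps a reference to the wrapped function and its bound arguments. That is presumably what allows the transformation to be recorded and replayed, though this is an assumption about how src uses it. A small sketch with made-up rows, with the function body copied from earlier so that the block runs on its own:

```python
from functools import partial

import pandas as pd


def subselect_by_key(df, key):
    """Filter dataframe by key and return resulting dataframe."""
    return df[df.key == key]


key = "CA"
data_function = partial(subselect_by_key, key=key)

# The partial object exposes what it wraps and which arguments are bound
print(data_function.func.__name__)  # subselect_by_key
print(data_function.keywords)       # {'key': 'CA'}

# Made-up rows for illustration only
toy_df = pd.DataFrame({"key": ["CA", "US", "CA"], "new_confirmed": [1, 2, 3]})

# Calling the partial is equivalent to calling the function with key filled in
same = data_function(toy_df).equals(subselect_by_key(toy_df, key))
```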

In [ ]:

added_readme_txt = f"""The dataset {dataset_name} is the subselection \
of the {source_dataset_name} dataset for the key {key}."""

In [ ]:

# test out the function
data_function(df).shape

Use the helper function to create the derived dataset¶

In [ ]:

ds = helpers.dataset_from_single_function(
        source_dataset_name=source_dataset_name,
        dataset_name=dataset_name,
        data_function=data_function,
        added_readme_txt=added_readme_txt,
        overwrite_catalog=True)
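
Conceptually, the call above takes the source dataset, applies data_function to its data, and presumably extends the README with added_readme_txt. The toy below illustrates only that idea using a plain dataclass. ToyDataset and derive_dataset are hypothetical names, and this is not how helpers.dataset_from_single_function is implemented: the real helper also works with the catalog, which is omitted here.

```python
from dataclasses import dataclass
from functools import partial

import pandas as pd


@dataclass
class ToyDataset:
    """Hypothetical stand-in for a dataset: a name, a dataframe, and a README."""
    name: str
    data: pd.DataFrame
    README: str


def derive_dataset(source, dataset_name, data_function, added_readme_txt):
    """Build a new ToyDataset by applying data_function to source.data."""
    return ToyDataset(
        name=dataset_name,
        data=data_function(source.data),
        README=source.README + "\n\n" + added_readme_txt,
    )


def subselect_by_key(df, key):
    """Filter dataframe by key and return resulting dataframe."""
    return df[df.key == key]


# Made-up rows for illustration only
source = ToyDataset(
    name="toy-source",
    data=pd.DataFrame({"key": ["CA", "US", "CA"], "new_confirmed": [1, 2, 3]}),
    README="Made-up rows for illustration only.",
)
derived = derive_dataset(
    source,
    dataset_name="toy-source-CA",
    data_function=partial(subselect_by_key, key="CA"),
    added_readme_txt="Subselection of toy-source for the key CA.",
)
```

The source dataset is left unchanged, and the derived one carries both the filtered data and a README that records how it was produced.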

In [ ]:

dataset_name

In [ ]:

ds = Dataset.load(dataset_name)

In [ ]:

ds.data.shape

In [ ]:

print(ds.README)

In [ ]:

print(ds.LICENSE)

In [ ]:

ds.data[['date', 'new_confirmed']].plot();

Check in the new catalog files¶

Finally, check the new catalog files in to version control.