Easydata

Template for creating a dataset from an existing dataset using a single function

This example derives a new dataset from the covid-19-epidemiology dataset, which is created in the notebook that demonstrates how to create a dataset from a single .csv file.

The code in this notebook imports from the src module. If your project module has a different name, use that name instead.

Basic imports

In [ ]:
# Basic utility functions
import logging
import pathlib
from functools import partial

from src.log import logger
from src.data import Dataset
from src import paths, helpers
In [ ]:
# Optionally set to debug log level
#logger.setLevel(logging.DEBUG)
In [ ]:
%load_ext autoreload
%autoreload 2

Load existing dataset

In [ ]:
ds = Dataset.load('covid-19-epidemiology')
In [ ]:
ds.data.shape
In [ ]:
print(ds.README)
In [ ]:
print(ds.LICENSE)

Create a function that we want to transform by

Here let's do something extremely simple: subselect the rows by key, a column that identifies a geographic region.

We will use this function to create a derived dataset, so let's save it in the project module (src in this case), in src/data/transformer_functions.py.

In [ ]:
project_path = paths['project_path']
In [ ]:
%%writefile -a $project_path/src/data/transformer_functions.py

def subselect_by_key(df, key):
    """
    Filter dataframe by key and return resulting dataframe.
    """
    return df[df.key == key]
In [ ]:
from src.data.transformer_functions import subselect_by_key
In [ ]:
subselect_by_key.__module__
In [ ]:
df = ds.data.copy()

For example, CA will give us the numbers for Canada:

In [ ]:
key_df = subselect_by_key(df, 'CA')
key_df.shape
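
To see what the filter does without loading the full dataset, here is a minimal sketch on a toy frame. The column names mirror the ones used in this notebook, but the rows are made up for illustration, and the function is redefined locally so that the example runs on its own.

```python
import pandas as pd

def subselect_by_key(df, key):
    """Filter dataframe by key and return resulting dataframe."""
    return df[df.key == key]

# Made-up rows for illustration only; these are not real epidemiology numbers.
toy_df = pd.DataFrame({
    "key": ["CA", "US", "CA", "FR"],
    "date": ["2020-03-01", "2020-03-01", "2020-03-02", "2020-03-02"],
    "new_confirmed": [1, 2, 3, 4],
})

toy_ca = subselect_by_key(toy_df, "CA")
print(toy_ca.shape)           # two of the four rows have key == 'CA'
print(toy_ca.index.tolist())  # the original row labels are kept
```

Note that the filter keeps the original row labels, so the index of the result is not contiguous.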

Here are some trends:

In [ ]:
key_df.plot(x='date', y='new_confirmed');
In [ ]:
key_df.plot(x='date', y='new_deceased');

Create a derived dataset

Let's create a dataset that contains just the Canadian epidemiological numbers. To do so, we only need to apply a single function to the existing data.

Here is the information we need to create a dataset using helpers.dataset_from_single_function():

source_dataset_name
dataset_name
data_function
added_readme_txt

We'll want our data_function to be defined in the project module (in this case src) for reproducibility reasons (which we've already done with subselect_by_key above).

In [ ]:
key = 'CA'
In [ ]:
source_dataset_name = 'covid-19-epidemiology'
dataset_name = f'covid-19-epidemiology-{key}'
data_function = partial(subselect_by_key, key=key)
In [ ]:
added_readme_txt = f"""The dataset {dataset_name} is the subset of the \
{source_dataset_name} dataset where key == '{key}'."""
In [ ]:
# test out the function
data_function(df).shape
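
Because data_function is a functools.partial, it carries both the underlying function and the bound key. The sketch below uses a stand-in function over a list of dicts (an assumption made for illustration, so that it runs without the project module) to show what a partial stores. This is presumably part of why the function should live in an importable module: a module path, a function name and the bound arguments are enough to describe the transformation.

```python
from functools import partial

def select_rows(rows, key):
    """Stand-in for subselect_by_key: keep the rows whose 'key' matches."""
    return [row for row in rows if row["key"] == key]

# Made-up rows for illustration only.
rows = [{"key": "CA", "n": 1}, {"key": "US", "n": 2}, {"key": "CA", "n": 3}]

data_function = partial(select_rows, key="CA")

# Calling the partial only needs the remaining argument.
print(data_function(rows))

# The partial remembers the wrapped function and the bound keyword arguments.
print(data_function.func.__name__)
print(data_function.keywords)
```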

Use the helper function to create the derived dataset

In [ ]:
ds = helpers.dataset_from_single_function(
        source_dataset_name=source_dataset_name,
        dataset_name=dataset_name,
        data_function=data_function,
        added_readme_txt=added_readme_txt,
        overwrite_catalog=True)
In [ ]:
dataset_name
In [ ]:
ds = Dataset.load(dataset_name)
In [ ]:
ds.data.shape
In [ ]:
print(ds.README)
In [ ]:
print(ds.LICENSE)
In [ ]:
ds.data.plot(x='date', y='new_confirmed');
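
Conceptually, a single-function derivation takes a source dataset, applies one function to its data, and records what was done in the README. The sketch below illustrates that idea with a hypothetical ToyDataset class and derive helper. Neither is part of Easydata, and the real helpers.dataset_from_single_function() also deals with the catalog and the license, which this sketch ignores. How the added README text is combined with the source README is an assumption here.

```python
from dataclasses import dataclass
from functools import partial

@dataclass
class ToyDataset:
    """Hypothetical stand-in for a dataset: a name, rows, and README text."""
    name: str
    data: list
    readme: str = ""

def derive(source, dataset_name, data_function, added_readme_txt):
    """Apply one function to the source data and extend the README (assumed behavior)."""
    return ToyDataset(
        name=dataset_name,
        data=data_function(source.data),
        readme=f"{source.readme}\n{added_readme_txt}".strip(),
    )

def select_rows(rows, key):
    """Stand-in for subselect_by_key: keep the rows whose 'key' matches."""
    return [row for row in rows if row["key"] == key]

# Made-up source dataset for illustration only.
source = ToyDataset(
    name="toy-source",
    data=[{"key": "CA", "n": 1}, {"key": "US", "n": 2}],
    readme="Made-up rows for illustration.",
)

derived = derive(
    source,
    dataset_name="toy-source-CA",
    data_function=partial(select_rows, key="CA"),
    added_readme_txt="Subset of toy-source where key == 'CA'.",
)
print(derived.name, len(derived.data))
print(derived.readme)
```

The source dataset is left untouched; the derived dataset is a new object built from it.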

Check in the new catalog files

Finally, check the new catalog files in to your repository.

CC BY 4.0