Easydata
General template for creating a derived dataset (a.k.a. adding an edge to the DatasetGraph)

This example creates the same dataset as the Add-csv-template.ipynb example, but does so in full generality, without using the functions in helpers. It builds on the New-Dataset-Template.ipynb example. Any derived dataset can be added this way, as an edge in the DatasetGraph.

Basic imports

In [ ]:
%load_ext autoreload
%autoreload 2
In [ ]:
# Basic utility functions
import logging
import os
import pathlib
from src.log import logger
from src import paths
from src.utils import list_dir
from functools import partial

# data functions
from src.data import DataSource, Dataset, DatasetGraph
from src import helpers
In [ ]:
# Optionally set to debug log level
logger.setLevel(logging.DEBUG)

Source dataset

In [ ]:
source_ds_name = 'covid-19-epidemiology-raw'
In [ ]:
source_ds = Dataset.load(source_ds_name)
In [ ]:
source_ds.FILESET

Create and add your transformer function

Here we'll use a pre-built transformer function csv_to_pandas, but normally you would place your new transformer function in {your_project_module}/data/transformer_functions.py as in the Add-Derived-Dataset.ipynb example.

A transformer function takes a dict of Datasets of the form {ds_name: ds} as input and returns a new dict of Datasets of the same form.
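To make the dict-in, dict-out contract concrete, here is a minimal sketch of a custom transformer function. It does not import Easydata: ToyDataset is a hypothetical stand-in for the real Dataset class, and double_values and its output_name parameter are invented purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyDataset:
    """Hypothetical stand-in for Easydata's Dataset (only the fields used here)."""
    name: str
    data: List[int] = field(default_factory=list)


def double_values(dsdict: Dict[str, ToyDataset], *, output_name: str) -> Dict[str, ToyDataset]:
    """Take {ds_name: ds} and return a new {ds_name: ds} holding one derived dataset."""
    if len(dsdict) != 1:
        raise ValueError("this sketch expects exactly one input dataset")
    (source,) = dsdict.values()
    derived = ToyDataset(name=output_name, data=[2 * x for x in source.data])
    return {output_name: derived}


result = double_values({"raw": ToyDataset("raw", [1, 2, 3])}, output_name="doubled")
print(result["doubled"].data)  # prints [2, 4, 6]
```

A real transformer would follow the same shape, but would build an actual Dataset and live in {your_project_module}/data/transformer_functions.py.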

In [ ]:
from src.data.transformer_functions import csv_to_pandas
from src.data import serialize_transformer_pipeline
In [ ]:
## Fill this in for your dataset
ds_name = 'covid-19-epidemiology'
transformers = [partial(csv_to_pandas,
                        output_map={ds_name:'epidemiology.csv'})]
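The cell above uses functools.partial to bind output_map ahead of time, so that each pipeline step can later be called with only the dict of datasets. The sketch below shows the same pattern in isolation. toy_csv_to_rows is a hypothetical stand-in, not csv_to_pandas itself, and it assumes the input dataset is represented as a plain {filename: csv_text} dict.

```python
import csv
import io
from functools import partial


def toy_csv_to_rows(dsdict, *, output_map):
    """Hypothetical stand-in for csv_to_pandas: parse one named file per output dataset."""
    (files,) = dsdict.values()  # exactly one input dataset, given as {filename: csv_text}
    return {
        new_name: list(csv.DictReader(io.StringIO(files[filename])))
        for new_name, filename in output_map.items()
    }


# Bind the keyword argument now; the pipeline supplies the dataset dict later.
step = partial(toy_csv_to_rows, output_map={"toy-derived": "toy.csv"})

raw = {"toy-raw": {"toy.csv": "key,count\nUS,3\nCA,5\n"}}
out = step(raw)
print(out["toy-derived"][0]["key"])
```

Because a partial keeps its bound keywords in step.keywords, the configuration of each step stays inspectable after it has been built.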

Create the new edge in the transformer graph

In [ ]:
dag = DatasetGraph(catalog_path=paths['catalog_path'])
In [ ]:
dag.add_edge(input_dataset=source_ds_name,
             output_dataset=ds_name,
             transformer_pipeline=serialize_transformer_pipeline(transformers),
             overwrite_catalog=True)
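The pipeline is passed through serialize_transformer_pipeline before being stored on the edge, presumably because the catalog has to be written to disk and function objects cannot be stored there directly. The sketch below shows one way a list of partials could be turned into plain data. It is not Easydata's implementation: describe_partial, rename, and every field name in the resulting entry are assumptions made for illustration.

```python
import json
from functools import partial


def describe_partial(func):
    """Record a (possibly partial) function as JSON-friendly data. Hypothetical helper."""
    if not isinstance(func, partial):
        func = partial(func)
    return {
        "module": func.func.__module__,
        "name": func.func.__name__,
        "args": list(func.args),
        "kwargs": dict(func.keywords),
    }


def rename(dsdict, *, new_name):
    """Toy transformer: re-key the single input dataset under new_name."""
    (ds,) = dsdict.values()
    return {new_name: ds}


pipeline = [partial(rename, new_name="toy-derived")]
serialized = [describe_partial(f) for f in pipeline]

# A toy "edge" entry: input name, output name, and the recorded pipeline.
catalog_entry = json.dumps(
    {"input": "toy-raw", "output": "toy-derived", "transformations": serialized}
)
print(catalog_entry)
```

Storing the module and function name rather than the function itself means the pipeline can be re-imported and replayed later, which is what allows a dataset to be regenerated from its catalog entry.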
In [ ]:
%%time
ds = Dataset.from_catalog(ds_name)
In [ ]:
%%time
ds = Dataset.load(ds_name)
In [ ]:
print(ds.README)
In [ ]:
print(ds.LICENSE)
In [ ]:
ds.data.shape
In [ ]:
ds.data.head()

Check in the new dataset

Finally, check in the new catalog files.

CCBY 4.0