General template for creating a derived dataset (a.k.a. adding an edge to the DatasetGraph)¶
This example creates the same dataset as the Add-csv-template.ipynb example, but does so in full generality, without using the functions in helpers. It builds on the New-Dataset-Template.ipynb example. Any derived dataset can be added in this way as an edge in the DatasetGraph.
Basic imports¶
In [ ]:
%load_ext autoreload
%autoreload 2
In [ ]:
# Basic utility functions
import logging
import os
import pathlib
from src.log import logger
from src import paths
from src.utils import list_dir
from functools import partial
# data functions
from src.data import DataSource, Dataset, DatasetGraph
from src import helpers
In [ ]:
# Optionally set to debug log level
logger.setLevel(logging.DEBUG)
Source dataset¶
In [ ]:
source_ds_name = 'covid-19-epidemiology-raw'
In [ ]:
source_ds = Dataset.load(source_ds_name)
In [ ]:
source_ds.FILESET
Create and add your transformer function¶
Here we'll use a pre-built transformer function csv_to_pandas, but normally you would place your new transformer function in {your_project_module}/data/transformer_functions.py as in the Add-Derived-Dataset.ipynb example.
Transformer functions take a dict of Datasets of the form {ds_name: ds} as input and output a new dict of Datasets of the same form.
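To make the dict-in, dict-out contract concrete, here is a minimal sketch of a transformer function. It is self-contained for illustration only: ToyDataset is a hypothetical stand-in for src.data.Dataset (it is not the real class), and uppercase_rows and its suffix parameter are made-up names.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ToyDataset:
    """Hypothetical stand-in for src.data.Dataset, for illustration only."""
    name: str
    data: Any = None
    metadata: Dict[str, Any] = field(default_factory=dict)


def uppercase_rows(dsdict: Dict[str, ToyDataset],
                   suffix: str = "-upper") -> Dict[str, ToyDataset]:
    """Toy transformer: takes {ds_name: ds} and returns a new {ds_name: ds}."""
    new_dsdict = {}
    for name, ds in dsdict.items():
        new_name = name + suffix
        # Build a new dataset rather than mutating the input in place.
        new_dsdict[new_name] = ToyDataset(
            name=new_name,
            data=[row.upper() for row in ds.data],
            metadata=dict(ds.metadata),
        )
    return new_dsdict


raw = {"demo-raw": ToyDataset(name="demo-raw", data=["a,b", "c,d"])}
derived = uppercase_rows(raw)
print(sorted(derived))
```

The input dict is left untouched and the output dict is keyed by the new dataset name, which is the shape the next transformer in a pipeline would receive.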
In [ ]:
from src.data.transformer_functions import csv_to_pandas
from src.data import serialize_transformer_pipeline
In [ ]:
## Fill this in for your dataset
ds_name = 'covid-19-epidemiology'
transformers = [partial(csv_to_pandas,
                        output_map={ds_name: 'epidemiology.csv'})]
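The transformers list holds partially applied functions: functools.partial fixes keyword arguments such as output_map ahead of time, so each entry can later be called with only the dict of datasets. The sketch below illustrates that idea with plain dicts in place of Dataset objects. The names pick_file, split_lines and apply_pipeline are hypothetical, and running the transformers one after another is an assumption about how such a list is consumed, not a description of the actual src.data implementation.

```python
from functools import partial


def pick_file(dsdict, output_map):
    """Toy transformer: for each {new_name: filename}, pull that file from the inputs."""
    result = {}
    for new_name, filename in output_map.items():
        for files in dsdict.values():
            if filename in files:
                result[new_name] = files[filename]
    return result


def split_lines(dsdict):
    """Toy transformer: split each dataset's text into a list of lines."""
    return {name: text.splitlines() for name, text in dsdict.items()}


def apply_pipeline(dsdict, transformers):
    """Assumed behaviour: feed the output of each transformer into the next."""
    for transform in transformers:
        dsdict = transform(dsdict)
    return dsdict


raw = {"demo-raw": {"epidemiology.csv": "date,cases\n2020-03-01,5"}}
pipeline = [
    partial(pick_file, output_map={"demo": "epidemiology.csv"}),
    split_lines,
]
result = apply_pipeline(raw, pipeline)
print(result)
```

Because every entry has the same single-argument call signature once partial has been applied, steps can be added, removed or reordered without changing the code that runs them.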
Create the new edge in the transformer graph¶
In [ ]:
dag = DatasetGraph(catalog_path=paths['catalog_path'])
In [ ]:
dag.add_edge(input_dataset=source_ds_name,
             output_dataset=ds_name,
             transformer_pipeline=serialize_transformer_pipeline(transformers),
             overwrite_catalog=True)
In [ ]:
%%time
ds = Dataset.from_catalog(ds_name)
In [ ]:
%%time
ds = Dataset.load(ds_name)
In [ ]:
print(ds.README)
In [ ]:
print(ds.LICENSE)
In [ ]:
ds.data.shape
In [ ]:
ds.data.head()
Check in the new dataset¶
Finally, check in the new catalog files.