Easydata
General template for creating a new dataset from scratch

This example creates the same raw dataset as the Add-csv-template.ipynb example, but does so in full generality, without using a function from helpers. Any (non-derived) dataset can be added this way.

We'll use this as an example of a non-manual download.

Basic imports

In [ ]:
%load_ext autoreload
%autoreload 2
In [ ]:
# Basic utility functions
import logging
import os
import pathlib
from pprint import pprint

from src.log import logger
from src import paths
from src.utils import list_dir
from functools import partial

# data functions
from src.data import DataSource, Dataset, DatasetGraph, Catalog
from src import helpers
In [ ]:
# Optionally set to debug log level
logger.setLevel(logging.DEBUG)

Create a DataSource

In [ ]:
ds_name = 'covid-19-epidemiology-raw'
dsrc = DataSource(ds_name)
In [ ]:
url = 'https://storage.googleapis.com/covid19-open-data/v2/epidemiology.csv'
In [ ]:
filename = 'epidemiology.csv' # path relative to paths['raw_data_path'] for the file
In [ ]:
license = """
[CC-BY 4.0](https://github.com/GoogleCloudPlatform/covid-19-open-data/blob/main/output/CC-BY)
"""
In [ ]:
metadata = """
The epidemiology table from Google's [COVID-19 Open-Data dataset](https://github.com/GoogleCloudPlatform/covid-19-open-data). 

The full dataset contains datasets of daily time-series data related to COVID-19 for over 20,000 distinct locations around the world. The data is at the spatial resolution of states/provinces for most regions and at county/municipality resolution for many countries such as Argentina, Brazil, Chile, Colombia, Czech Republic, Mexico, Netherlands, Peru, United Kingdom, and USA. All regions are assigned a unique location key, which resolves discrepancies between ISO / NUTS / FIPS codes, etc. The different aggregation levels are:

    0: Country
    1: Province, state, or local equivalent
    2: Municipality, county, or local equivalent
    3: Locality which may not follow strict hierarchical order, such as "city" or "nursing homes in X location"

There are multiple types of data:

    Outcome data Y(i,t), such as cases, tests, hospitalizations, deaths and recoveries, for region i and time t
    Static covariate data X(i), such as population size, health statistics, economic indicators, geographic boundaries
    Dynamic covariate data X(i,t), such as mobility, search trends, weather, and government interventions

The data is drawn from multiple sources, as listed below, and stored in separate tables as CSV files grouped by context, which can be easily merged due to the use of consistent geographic (and temporal) keys as it is done for the main table.

One of these files is the epidemiology.csv file used here.
"""

This example uses add_url, but there are other options such as add_manual_download and add_google_drive.

In [ ]:
dsrc.add_url(url=url, file_name=filename, unpack_action='copy')
dsrc.add_metadata(contents=metadata, force=True)
dsrc.add_metadata(contents=license, kind='LICENSE', force=True)
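To make the add_url step above more concrete, here is a minimal standard-library sketch of what fetching a URL into a raw-data directory and recording a hash could look like. It is not Easydata's implementation: `fetch_to_raw`, the layout of the returned record, and the sample CSV row are all hypothetical, and a `file://` URL stands in for the real download so the sketch runs offline.

```python
import hashlib
import shutil
import tempfile
import urllib.request
from pathlib import Path


def fetch_to_raw(url: str, file_name: str, raw_dir: Path) -> dict:
    """Download `url` to `raw_dir / file_name` and return a small record.

    Hypothetical helper for illustration only; not part of Easydata.
    """
    raw_dir.mkdir(parents=True, exist_ok=True)
    target = raw_dir / file_name
    # Stream the response to disk rather than holding it all in memory.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    digest = hashlib.sha256(target.read_bytes()).hexdigest()
    return {"url": url, "file_name": file_name,
            "hash_type": "sha256", "hash_value": digest}


# Use a local file:// URL (made-up sample data) so no network access is needed.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    source = tmp_path / "source.csv"
    source.write_text("col_a,col_b\n1,2\n")
    record = fetch_to_raw(source.as_uri(), "epidemiology.csv", tmp_path / "raw")
    fetched_text = (tmp_path / "raw" / "epidemiology.csv").read_text()
    print(record["file_name"], record["hash_value"][:12])
```

Recording a hash next to the URL is one common way to detect later that a remote file has changed; whether and how Easydata does this is best checked in `dsrc.file_dict`.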
In [ ]:
dsrc.file_dict

Create a process function

By default, we recommend using process_fileset_files as the process function and then using a transformer function to create a derived dataset, but you can write your own process function instead.

In [ ]:
from src.data.fileset import process_fileset_files
process_function = process_fileset_files
process_function_kwargs = {'file_glob':'*.csv',
                           'do_copy': True,
                           'fileset_dir': ds_name+'.fileset',
                           'extract_dir': ds_name}
In [ ]:
help(process_function)
In [ ]:
dsrc.process_function = partial(process_function, **process_function_kwargs)
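The cell above uses functools.partial to bind the keyword arguments to the process function ahead of time. As a self-contained illustration of that pattern, the sketch below binds keyword arguments to a hypothetical `copy_matching_files` function. That function, its parameters, and the sample files are assumptions made for this example and do not reflect the real signature of process_fileset_files.

```python
import shutil
import tempfile
from functools import partial
from pathlib import Path


def copy_matching_files(source_dir, *, file_glob="*", extract_dir="out", do_copy=True):
    """Hypothetical process function: select files matching `file_glob`.

    Copies the matches into `extract_dir` when `do_copy` is True and
    returns the matched file names.
    """
    source_dir, extract_dir = Path(source_dir), Path(extract_dir)
    matches = sorted(source_dir.glob(file_glob))
    if do_copy:
        extract_dir.mkdir(parents=True, exist_ok=True)
        for path in matches:
            shutil.copy2(path, extract_dir / path.name)
    return [path.name for path in matches]


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw"
    raw.mkdir()
    (raw / "epidemiology.csv").write_text("a,b\n1,2\n")
    (raw / "notes.txt").write_text("not a csv\n")

    # Bind the keyword arguments once; the result only needs source_dir.
    kwargs = {"file_glob": "*.csv",
              "extract_dir": Path(tmp) / "processed",
              "do_copy": True}
    process = partial(copy_matching_files, **kwargs)

    processed_names = process(raw)
    copied_exists = (Path(tmp) / "processed" / "epidemiology.csv").exists()
    print(processed_names)
```

Storing the function together with its bound arguments means the caller can later invoke it without knowing which options were chosen, which is presumably why the notebook assigns a partial object to dsrc.process_function.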
In [ ]:
dsrc.update_catalog()
In [ ]:
dsc = Catalog.load('datasources')
dsc[ds_name]
In [ ]:
%%time
dsrc.fetch()
In [ ]:
%%time
dsrc.unpack()

Create a Dataset from the DataSource

In [ ]:
from src.data import DatasetGraph
In [ ]:
paths['catalog_path']
In [ ]:
dag = DatasetGraph(catalog_path=paths['catalog_path'])
In [ ]:
dag.sources
In [ ]:
dsc = Catalog.load('datasources'); dsc
In [ ]:
dag.add_source(output_dataset=ds_name, datasource_name=ds_name, overwrite_catalog=True)
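Conceptually, add_source registers a dataset that has no upstream datasets and is produced directly from a DataSource, while derived datasets are added as edges. The toy model below sketches that idea only. `ToyDatasetGraph`, its methods, and the names `covid-19-epidemiology` and `csv_to_dataframe` are hypothetical and are not Easydata's DatasetGraph API or catalog format.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDatasetGraph:
    """Hypothetical stand-in for a dataset graph; not Easydata's DatasetGraph."""
    # Maps dataset name -> {"inputs": [...], "producer": "..."}
    nodes: dict = field(default_factory=dict)

    def add_source(self, output_dataset, datasource_name):
        # A source node has no input datasets; it comes from a data source.
        self.nodes[output_dataset] = {
            "inputs": [],
            "producer": f"datasource:{datasource_name}",
        }

    def add_edge(self, output_dataset, input_datasets, transformer_name):
        # A derived node depends on datasets already present in the graph.
        missing = [name for name in input_datasets if name not in self.nodes]
        if missing:
            raise KeyError(f"unknown input datasets: {missing}")
        self.nodes[output_dataset] = {
            "inputs": list(input_datasets),
            "producer": f"transformer:{transformer_name}",
        }

    def sources(self):
        # Source datasets are exactly the nodes without inputs.
        return sorted(name for name, node in self.nodes.items()
                      if not node["inputs"])


graph = ToyDatasetGraph()
graph.add_source("covid-19-epidemiology-raw", "covid-19-epidemiology-raw")
graph.add_edge("covid-19-epidemiology", ["covid-19-epidemiology-raw"],
               "csv_to_dataframe")
print(graph.sources())
```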
In [ ]:
dc = Catalog.load('datasets'); dc
In [ ]:
%%time
ds = Dataset.from_catalog(ds_name)
In [ ]:
%%time
ds = Dataset.load(ds_name)
In [ ]:
pprint(ds.metadata)
In [ ]:
print(ds.LICENSE)
In [ ]:
ds.FILESET
In [ ]:
ds.fileset_file('epidemiology.csv')
In [ ]:
ds.data is None
In [ ]:
ds.target is None

Check in the new dataset

Finally, don't forget to check in the new catalog files.


CCBY 4.0
