General template for creating a new dataset from scratch¶
This example creates the same raw dataset as the Add-csv-template.ipynb example, but builds it step by step without relying on a function from helpers. Any (non-derived) dataset can be added this way.
We'll use this as an example of a non-manual download.
Basic imports¶
In [ ]:
%load_ext autoreload
%autoreload 2
In [ ]:
# Basic utility functions
import logging
import os
import pathlib
from pprint import pprint
from src.log import logger
from src import paths
from src.utils import list_dir
from functools import partial
# data functions
from src.data import DataSource, Dataset, DatasetGraph, Catalog
from src import helpers
In [ ]:
# Optionally set to debug log level
logger.setLevel(logging.DEBUG)
Create a DataSource¶
In [ ]:
ds_name = 'covid-19-epidemiology-raw'
dsrc = DataSource(ds_name)
In [ ]:
url = 'https://storage.googleapis.com/covid19-open-data/v2/epidemiology.csv'
In [ ]:
filename = 'epidemiology.csv' # path relative to paths['raw_data_path'] for the file
In [ ]:
license = """
[CC-BY 4.0](https://github.com/GoogleCloudPlatform/covid-19-open-data/blob/main/output/CC-BY)
"""
In [ ]:
metadata = """
The epidemiology table from Google's [COVID-19 Open-Data dataset](https://github.com/GoogleCloudPlatform/covid-19-open-data).
The full dataset contains datasets of daily time-series data related to COVID-19 for over 20,000 distinct locations around the world. The data is at the spatial resolution of states/provinces for most regions and at county/municipality resolution for many countries such as Argentina, Brazil, Chile, Colombia, Czech Republic, Mexico, Netherlands, Peru, United Kingdom, and USA. All regions are assigned a unique location key, which resolves discrepancies between ISO / NUTS / FIPS codes, etc. The different aggregation levels are:
0: Country
1: Province, state, or local equivalent
2: Municipality, county, or local equivalent
3: Locality which may not follow strict hierarchical order, such as "city" or "nursing homes in X location"
There are multiple types of data:
Outcome data Y(i,t), such as cases, tests, hospitalizations, deaths and recoveries, for region i and time t
Static covariate data X(i), such as population size, health statistics, economic indicators, geographic boundaries
Dynamic covariate data X(i,t), such as mobility, search trends, weather, and government interventions
The data is drawn from multiple sources, as listed below, and stored in separate tables as CSV files grouped by context, which can be easily merged due to the use of consistent geographic (and temporal) keys as it is done for the main table.
One of these files is the epidemiology.csv file used here.
"""
This example uses add_url, but there are other options such as add_manual_download and add_google_drive.
In [ ]:
dsrc.add_url(url=url, file_name=filename, unpack_action='copy')
dsrc.add_metadata(contents=metadata, force=True)
dsrc.add_metadata(contents=license, kind='LICENSE', force=True)
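For intuition, the sketch below shows roughly what a URL-based fetch followed by a 'copy' unpack step amounts to: download the file under a chosen name, then copy it unchanged into a working directory. It uses only the standard library and is not the code behind add_url; fetch_and_copy, raw_dir, and interim_dir are hypothetical names introduced for illustration.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_and_copy(url, file_name, raw_dir, interim_dir):
    """Download `url` to raw_dir/file_name, then copy it to interim_dir.

    Hypothetical stand-in for a fetch followed by a 'copy' unpack action.
    """
    raw_dir, interim_dir = Path(raw_dir), Path(interim_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)
    interim_dir.mkdir(parents=True, exist_ok=True)

    raw_file = raw_dir / file_name
    if not raw_file.exists():  # skip the download if the file is already cached
        with urllib.request.urlopen(url) as response, open(raw_file, "wb") as out:
            shutil.copyfileobj(response, out)

    # 'copy' here means the file is not an archive, so it is copied as-is
    return Path(shutil.copy2(raw_file, interim_dir / file_name))
```

Caching on the file name keeps repeated runs cheap, at the cost of never noticing when the remote file changes; a real implementation would likely also verify a hash.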
In [ ]:
dsrc.file_dict
Create a process function¶
By default, we recommend using process_fileset_files and then a transformer function to create a derived dataset, but you can optionally write your own process function.
In [ ]:
from src.data.fileset import process_fileset_files
process_function = process_fileset_files
process_function_kwargs = {
    'file_glob': '*.csv',
    'do_copy': True,
    'fileset_dir': ds_name + '.fileset',
    'extract_dir': ds_name,
}
In [ ]:
help(process_function)
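Judging only by the keyword arguments passed to it here, the process function presumably selects the files matching a glob in the unpacked directory and gathers them into a fileset directory. A rough stand-alone analogue is sketched below. It is an assumption about the behavior, not the implementation of process_fileset_files, and collect_fileset is a hypothetical name; the output of help(process_function) is the authoritative description.

```python
import shutil
from pathlib import Path


def collect_fileset(extract_dir, fileset_dir, file_glob="*", do_copy=True):
    """Gather files matching `file_glob` under extract_dir into fileset_dir.

    Hypothetical helper: returns a {file name: path} mapping. With
    do_copy=False the files are left in place and their original paths
    are returned instead.
    """
    extract_dir, fileset_dir = Path(extract_dir), Path(fileset_dir)
    fileset = {}
    # sorted() materializes the matches before any copying starts
    for path in sorted(extract_dir.rglob(file_glob)):
        if not path.is_file():
            continue
        if do_copy:
            fileset_dir.mkdir(parents=True, exist_ok=True)
            path = Path(shutil.copy2(path, fileset_dir / path.name))
        fileset[path.name] = path
    return fileset
```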
In [ ]:
dsrc.process_function = partial(process_function, **process_function_kwargs)
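The partial call above freezes the keyword arguments into a new callable, so whoever invokes the process function later does not need to know them. The toy example below illustrates the mechanism with a hypothetical process function that simply echoes its arguments; it stands in for the real one and says nothing about its signature.

```python
from functools import partial


def process(dataset_name, file_glob="*", do_copy=False):
    """Toy process function that just reports the arguments it received."""
    return {"dataset_name": dataset_name, "file_glob": file_glob, "do_copy": do_copy}


# Bind the keyword arguments now; the caller supplies the rest later.
bound = partial(process, file_glob="*.csv", do_copy=True)

result = bound("covid-19-epidemiology-raw")
print(result)

# The wrapped function and the bound arguments remain inspectable.
print(bound.func.__name__, bound.keywords)
```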
In [ ]:
dsrc.update_catalog()
In [ ]:
dsc = Catalog.load('datasources')
dsc[ds_name]
In [ ]:
%%time
dsrc.fetch()
In [ ]:
%%time
dsrc.unpack()
Create a Dataset from the DataSource¶
In [ ]:
from src.data import DatasetGraph
In [ ]:
paths['catalog_path']
In [ ]:
dag = DatasetGraph(catalog_path=paths['catalog_path'])
In [ ]:
dag.sources
In [ ]:
dsc = Catalog.load('datasources'); dsc
In [ ]:
dag.add_source(output_dataset=ds_name, datasource_name=ds_name, overwrite_catalog=True)
In [ ]:
dc = Catalog.load('datasets'); dc
In [ ]:
%%time
ds = Dataset.from_catalog(ds_name)
In [ ]:
%%time
ds = Dataset.load(ds_name)
In [ ]:
pprint(ds.metadata)
In [ ]:
print(ds.LICENSE)
In [ ]:
ds.FILESET
In [ ]:
ds.fileset_file('epidemiology.csv')
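Assuming fileset_file returns a filesystem path to the named file (an assumption worth checking against its docstring), the CSV can then be inspected with any standard tool. The sketch below uses the standard library's csv module and makes no assumption about column names; preview_csv is a hypothetical helper.

```python
import csv
from pathlib import Path


def preview_csv(csv_path, n_rows=5):
    """Return the header and the first `n_rows` data rows of a CSV file."""
    with Path(csv_path).open(newline="") as f:
        reader = csv.reader(f)
        header = next(reader, [])  # empty list if the file has no rows
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows
```

Under that assumption, a call such as preview_csv(ds.fileset_file('epidemiology.csv')) gives a quick look at the raw file without loading all of it into memory.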
In [ ]:
ds.data is None
In [ ]:
ds.target is None
Check in the new dataset¶
Finally, don't forget to check in the new catalog files.