Template for creating a dataset from a single .csv file¶

This example creates a dataset from a single, manually downloaded .csv file by calling a helper function provided by the workflow.

In the imports below, replace src with the name of your own project module.

As an example, we'll use one of the COVID-19 Open-Data files from Google: https://storage.googleapis.com/covid19-open-data/v2/epidemiology.csv
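If you would rather script the download than fetch the file by hand, a minimal standard-library sketch might look like the following. The fetch_csv function and its arguments are hypothetical illustrations, not part of the workflow's helpers; the workflow itself only expects the file to end up under paths['raw_data_path'].

```python
import pathlib
import urllib.request

URL = "https://storage.googleapis.com/covid19-open-data/v2/epidemiology.csv"


def fetch_csv(url, dest_dir, filename="epidemiology.csv"):
    """Download url to dest_dir/filename unless the file already exists.

    Hypothetical helper for illustration; returns the destination path.
    """
    dest = pathlib.Path(dest_dir) / filename
    if not dest.is_file():
        # Create the target directory if it is missing, then download.
        dest.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(dest))
    return dest


# Example call (needs network access, so it is left commented out):
# fetch_csv(URL, paths['raw_data_path'])
```

Because the function skips the download when the file is already present, it should be safe to re-run in a notebook.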

Basic imports¶

In [ ]:

            
# Basic utility functions
import logging
import pathlib

from src.log import logger
from src.data import Dataset, Catalog
from src import paths, helpers

In [ ]:

            
# Optionally set to debug log level
#logger.setLevel(logging.DEBUG)

In [ ]:

            
%load_ext autoreload
%autoreload 2

For reference, this is the paths['raw_data_path'] currently set for your conda environment.

In [ ]:

            
paths['raw_data_path']

Dataset creation information¶

This is the information that you need to provide to create this dataset:

ds_name: The name to give your dataset in the dataset catalog
csv_path: The path to your .csv file (in this case epidemiology.csv), relative to paths['raw_data_path']
download_message: The message shown to the user explaining how to manually download your .csv file
license_str: Information on the license for the dataset
readme_str: Information on the dataset itself

In [ ]:

            
ds_name = 'covid-19-epidemiology'
csv_path = 'epidemiology.csv' # path relative to paths['raw_data_path'] for the file

In [ ]:

            
download_message = f"""Please retrieve epidemiology.csv from https://storage.googleapis.com/covid19-open-data/v2/epidemiology.csv \
and place it in {paths['raw_data_path']}"""

In [ ]:

            
license_str = """
[CC-BY 4.0](https://github.com/GoogleCloudPlatform/covid-19-open-data/blob/main/output/CC-BY)
"""

In [ ]:

            
readme_str = """
The epidemiology table from Google's [COVID-19 Open-Data dataset](https://github.com/GoogleCloudPlatform/covid-19-open-data). 

The full dataset contains datasets of daily time-series data related to COVID-19 for over 20,000 distinct locations around the world. The data is at the spatial resolution of states/provinces for most regions and at county/municipality resolution for many countries such as Argentina, Brazil, Chile, Colombia, Czech Republic, Mexico, Netherlands, Peru, United Kingdom, and USA. All regions are assigned a unique location key, which resolves discrepancies between ISO / NUTS / FIPS codes, etc. The different aggregation levels are:

    0: Country
    1: Province, state, or local equivalent
    2: Municipality, county, or local equivalent
    3: Locality which may not follow strict hierarchical order, such as "city" or "nursing homes in X location"

There are multiple types of data:

    Outcome data Y(i,t), such as cases, tests, hospitalizations, deaths and recoveries, for region i and time t
    Static covariate data X(i), such as population size, health statistics, economic indicators, geographic boundaries
    Dynamic covariate data X(i,t), such as mobility, search trends, weather, and government interventions

The data is drawn from multiple sources, as listed below, and stored in separate tables as CSV files grouped by context, which can be easily merged due to the use of consistent geographic (and temporal) keys as it is done for the main table.

One of these files is the epidemiology.csv file used here.
"""

If you have not yet placed epidemiology.csv under paths['raw_data_path'], the helpers.dataset_from_csv_manual_download call in the next section will fail with a FileNotFoundError that names the path where it expects the file. Put the file in that location and run the cell again.
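Before running the creation step, a quick existence check can confirm that the file is where the helper will look. This is only a sketch: csv_is_in_place is a hypothetical function, not part of the workflow, and a temporary directory stands in for paths['raw_data_path'] so the example runs on its own.

```python
import pathlib
import tempfile


def csv_is_in_place(raw_data_path, csv_path):
    """Return True if csv_path exists as a regular file under raw_data_path.

    Hypothetical helper for illustration only.
    """
    return (pathlib.Path(raw_data_path) / csv_path).is_file()


# A temporary directory stands in for paths['raw_data_path'] here.
with tempfile.TemporaryDirectory() as tmp:
    before = csv_is_in_place(tmp, 'epidemiology.csv')   # file not there yet
    (pathlib.Path(tmp) / 'epidemiology.csv').write_text('date,key\n')
    after = csv_is_in_place(tmp, 'epidemiology.csv')    # file now present
```

In the notebook itself, the equivalent call would be csv_is_in_place(paths['raw_data_path'], csv_path), using the csv_path defined earlier.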

Create the dataset and explore it¶

In [ ]:

            
%%time
ds = helpers.dataset_from_csv_manual_download(ds_name=ds_name,
                                               csv_path=csv_path,
                                               download_message=download_message,
                                               license_str=license_str,
                                               readme_str=readme_str,
                                               overwrite_catalog=True)

In [ ]:

            
%%time
ds = Dataset.load(ds_name)

In [ ]:

            
ds.data.head()

In [ ]:

            
ds.data.shape

By default, the workflow helper function also creates a covid-19-epidemiology-raw dataset. Its .data is empty, but it keeps a record of the location of the final epidemiology.csv file in its .FILESET.

The .FILESET functionality is covered in other documentation.

In [ ]:

            
%%time
ds_raw = Dataset.from_catalog(ds_name+"-raw")

In [ ]:

            
print(ds_raw.data)

In [ ]:

            
ds_raw.FILESET

In [ ]:

            
# Fully qualified path to the epidemiology.csv file
ds_raw.fileset_file('epidemiology.csv')

Check in the catalog¶

The new dataset will now be in the catalog:

In [ ]:

            
c = Catalog.load('datasets'); c

At this point, you'll need to check in your new catalog files so that they are shared with others. Anyone who has the catalog files can then load the new dataset with Dataset.load(ds_name), as shown above.
