Basic Starter Notebook

Naming Convention

The notebooks are named dd-xyz-title.ipynb where:

dd is an integer indicating the notebook sequence. This is critical when there are dependencies between notebooks.
xyz is the author's initials, which helps avoid namespace clashes when multiple parties commit to the same repo.
title is the name of the notebook, with words separated by hyphens.
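
As a rough illustration of the convention, the sketch below splits a notebook filename into its three parts and sorts notebooks by sequence number. The regular expression and the `parse_notebook_name` helper are assumptions made for this example (in particular, that the sequence is one or more digits and that initials contain letters only); they are not part of the project template.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern: <digits>-<initials>-<hyphenated-title>.ipynb
NOTEBOOK_NAME = re.compile(
    r"^(?P<seq>\d+)-(?P<initials>[A-Za-z]+)-"
    r"(?P<title>[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*)\.ipynb$"
)


class NotebookName(NamedTuple):
    seq: int
    initials: str
    title: str


def parse_notebook_name(filename: str) -> Optional[NotebookName]:
    """Return the parsed parts, or None if the name does not follow the convention."""
    m = NOTEBOOK_NAME.match(filename)
    if m is None:
        return None
    return NotebookName(int(m["seq"]), m["initials"], m["title"])


# Hypothetical filenames, used only to exercise the helper
names = ["10-abc-clean-data.ipynb", "02-xyz-load-raw-data.ipynb", "scratch.ipynb"]
parsed = [p for p in map(parse_notebook_name, names) if p is not None]

# Tuples sort by their first field, so this orders notebooks by sequence number
for nb in sorted(parsed):
    print(nb.seq, nb.initials, nb.title)
```

Sorting on the integer sequence rather than on the raw filename keeps the run order correct even if the zero-padding is inconsistent.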

Useful Header Cells

Make the Jupyter notebook use the full screen width:

In [ ]:

# Widen the notebook container to the full browser width
from IPython.display import HTML, display
display(HTML("<style>.container { width:100% !important; }</style>"))

When developing code in the src module, it's very useful to enable auto-reload:

In [ ]:

%load_ext autoreload
%autoreload 2

Python Libraries

Imports you'll almost always want

In [ ]:

# Python imports, alphabetized
import pathlib

# 3rd-party Python modules, alphabetized
import pandas as pd

# Source module imports
from src import paths
from src.data import DataSource, Dataset, Catalog

Logging

Enable logging and set the log level to DEBUG. This is particularly useful when developing code in your project module and using it from a notebook.

In [ ]:

import logging
from src.log import logger

logger.setLevel(logging.DEBUG)
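
The `src.log` module is project-specific, so its contents are not shown here. As a minimal sketch of the same idea using only the standard library, the following shows how changing a logger's level to DEBUG affects which messages are emitted. The logger name `demo`, the in-memory stream, and the format string are assumptions for illustration.

```python
import io
import logging

# Hypothetical stand-in for the project's logger
logger = logging.getLogger("demo")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep this example's output away from the root logger

# Capture output in memory so the effect of the level change is easy to inspect
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))
logger.addHandler(handler)

logger.debug("hidden at INFO level")    # dropped, because DEBUG is below INFO
logger.setLevel(logging.DEBUG)
logger.debug("visible at DEBUG level")  # emitted after the level change

print(stream.getvalue(), end="")
```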

Working with a Dataset from the catalog

List available datasets

In [ ]:

c = Catalog.load('datasets'); c

Note: The first time you run a load function on a new dataset, it may be slow, because it does all the work to generate and verify the dataset's contents. Subsequent runs use a cached copy of the dataset and are quick.

In [ ]:

%%time
ds = Dataset.load('20_newsgroups')  # replace '20_newsgroups' with the name of a dataset you have a recipe for
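
The slow-first-run, fast-afterwards behavior described in the note is typical of an on-disk cache. The sketch below illustrates that general pattern only. The `load_cached` helper, the temporary cache directory, and the pickle format are all hypothetical and say nothing about how `Dataset.load` is actually implemented.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def load_cached(name: str, build: Callable[[], Any], cache_dir: Path) -> Any:
    """Return the cached object for `name`, building and caching it on first use."""
    cache_file = cache_dir / f"{name}.pkl"
    if cache_file.exists():
        # Fast path: reuse the copy written by an earlier call
        with cache_file.open("rb") as f:
            return pickle.load(f)
    # Slow path: do the expensive work once, then persist the result
    obj = build()
    with cache_file.open("wb") as f:
        pickle.dump(obj, f)
    return obj


build_calls = []


def expensive_build() -> list:
    build_calls.append(1)  # record each time the slow path runs
    return ["doc-1", "doc-2", "doc-3"]


with tempfile.TemporaryDirectory() as tmp:
    first = load_cached("toy", expensive_build, Path(tmp))
    second = load_cached("toy", expensive_build, Path(tmp))

# Both calls return the same data, but the build step ran only once
print(first == second, len(build_calls))
```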

In [ ]:

len(ds.data)

In [ ]:

ds.data[:5]

In [ ]:

print(ds.README)

In [ ]:

print(ds.LICENSE)

If you have data, you're up and running with a working installation.

Some data science libraries built into the base conda environment

In [ ]:

# basic data science and visualization libraries
import sklearn
import matplotlib
import scipy
import pandas
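
To confirm which versions of these libraries the environment provides, one option is the standard-library `importlib.metadata` module. This is a sketch rather than part of the template: the `installed_versions` helper is hypothetical, and the distribution names passed to it (for example `scikit-learn` for the `sklearn` import) are the names the packages are published under.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional


def installed_versions(distributions: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    found: Dict[str, Optional[str]] = {}
    for dist in distributions:
        try:
            found[dist] = version(dist)
        except PackageNotFoundError:
            # The distribution is not installed in this environment
            found[dist] = None
    return found


report = installed_versions(["scikit-learn", "matplotlib", "scipy", "pandas"])
for dist, ver in report.items():
    print(f"{dist}: {ver or 'not installed'}")
```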
