Naming Convention
The notebooks are named dd-xyz-title.ipynb, where:
dd is an integer indicating the notebook sequence. This is critical when there are dependencies between notebooks.
xyz is the author's initials, to help avoid namespace clashes when multiple parties are committing to the same repo.
title is the name of the notebook, with words separated by hyphens.
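The naming convention above can be checked mechanically. The following is a minimal sketch, not part of the project itself: the helper name parse_notebook_name, the two-digit width of dd, and the restriction of initials to lowercase letters are all assumptions made for illustration.

```python
import re
from typing import NamedTuple

# Pattern for dd-xyz-title.ipynb. The two-digit sequence width and the
# lowercase-only initials are assumptions; adjust them to your team's usage.
NOTEBOOK_NAME = re.compile(
    r"^(?P<seq>\d{2})-(?P<initials>[a-z]+)-"
    r"(?P<title>[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*)\.ipynb$"
)


class NotebookName(NamedTuple):
    seq: int        # notebook sequence number (dd)
    initials: str   # author's initials (xyz)
    title: str      # hyphen-separated title


def parse_notebook_name(filename: str) -> NotebookName:
    """Split a notebook filename into its parts (hypothetical helper)."""
    match = NOTEBOOK_NAME.match(filename)
    if match is None:
        raise ValueError(f"{filename!r} does not follow dd-xyz-title.ipynb")
    return NotebookName(int(match["seq"]), match["initials"], match["title"])


# Sorting by the parsed sequence number gives the intended execution order.
names = ["20-abc-train-model.ipynb", "10-abc-explore-raw-data.ipynb"]
ordered = sorted(names, key=lambda n: parse_notebook_name(n).seq)
print(ordered)
```

Sorting on the parsed integer rather than on the raw string keeps the order correct even if a team later relaxes the fixed-width assumption.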
Useful Header Cells
Make the Jupyter notebook use the full screen width:
In [ ]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
When developing code in the src module, it's very useful to enable auto-reload:
In [ ]:
%load_ext autoreload
%autoreload 2
Python Libraries
Imports you'll almost always want
In [ ]:
# Standard library imports, alphabetized
import pathlib
# Third-party Python modules, alphabetized
import pandas as pd
# Source module imports
from src import paths
from src.data import DataSource, Dataset, Catalog
Logging
Enable logging and raise the log level to DEBUG. This is particularly useful when developing code in your project module and using it from a notebook.
In [ ]:
import logging
from src.log import logger
logger.setLevel(logging.DEBUG)
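To see what changing the level does, here is a small self-contained sketch using only the standard library. It assumes that src.log.logger is an ordinary logging.Logger; the logger name "example_project" and the in-memory stream are stand-ins chosen for illustration.

```python
import io
import logging

# Hypothetical stand-in for `src.log.logger`, assumed to be a standard Logger.
logger = logging.getLogger("example_project")

# Capture output in memory so the effect of the level change is easy to inspect.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s:%(message)s"))
logger.addHandler(handler)
logger.propagate = False  # keep the demo output out of the root logger

logger.setLevel(logging.INFO)
logger.debug("hidden while the level is INFO")

logger.setLevel(logging.DEBUG)
logger.debug("visible once the level is DEBUG")

print(stream.getvalue())
```

Only the second message reaches the handler, because a logger discards records below its own level before any handler sees them.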
Working with a Dataset from the catalog
List available datasets:
In [ ]:
c = Catalog.load('datasets'); c
Note: The first call to a load function on a new dataset may be slow, because it does all the work to generate and verify the contents of the dataset. Subsequent runs use a cached copy of the dataset and are quick.
In [ ]:
%%time
ds = Dataset.load('20_newsgroups') # replace 20_newsgroups with the name of a dataset you have a recipe for
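The slow-first-run, fast-afterwards behaviour described in the note is a general build-then-cache pattern. The sketch below illustrates that pattern only; it is not how Dataset.load is implemented, and the names load_cached and build_example, as well as the use of pickle files, are assumptions for illustration.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def load_cached(name: str, build: Callable[[], Any], cache_dir: Path) -> Any:
    """Return the cached object for `name`, building and caching it if absent."""
    cache_file = cache_dir / f"{name}.pkl"
    if cache_file.exists():
        # Fast path: reuse the result of an earlier run.
        with cache_file.open("rb") as f:
            return pickle.load(f)
    # Slow path: do the expensive work once, then store the result.
    data = build()
    cache_dir.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as f:
        pickle.dump(data, f)
    return data


calls = []  # records how many times the expensive build actually runs


def build_example() -> list:
    calls.append(1)
    return ["doc one", "doc two"]


with tempfile.TemporaryDirectory() as tmp:
    first = load_cached("example", build_example, Path(tmp))
    second = load_cached("example", build_example, Path(tmp))

print(len(calls))
```

The second call returns the same data without invoking the build function again, which is why repeated loads are quick.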
In [ ]:
len(ds.data)
In [ ]:
ds.data[:5]
In [ ]:
print(ds.README)
In [ ]:
print(ds.LICENSE)
If the cells above return data, you're up and running with a working installation.
Some data science libraries built into the base conda environment
In [ ]:
# basic data science and visualization libraries
import sklearn
import matplotlib
import scipy
import pandas
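If one of these imports fails, it can help to find out which libraries are missing without importing them all. The following is a minimal sketch; the helper name missing_modules is hypothetical and not part of the project.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found (hypothetical helper)."""
    # find_spec locates a top-level module without importing it and
    # returns None when the module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(["sklearn", "matplotlib", "scipy", "pandas"])
print("missing:", missing or "none")
```

An empty result suggests the environment provides all four libraries; any names listed would need to be added to the conda environment.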