Easydata

Basic Starter Notebook

Naming Convention

The notebooks are named dd-xyz-title.ipynb where:

  • dd is an integer indicating the notebook sequence. This is critical when there are dependencies between notebooks.
  • xyz is the author's initials, which helps avoid namespace clashes when multiple parties are committing to the same repo.
  • title is the name of the notebook, with words separated by hyphens.
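
The convention above can be checked mechanically. The following is a minimal sketch, not part of Easydata: the helper name `parse_notebook_name` and the filename `00-xyz-sample-notebook.ipynb` are made up for illustration, and the pattern assumes that dd may be any run of digits and that initials are letters only, neither of which the convention spells out.

```python
import re

# Pattern for dd-xyz-title.ipynb: sequence digits, author initials, hyphenated title.
# Assumption: initials are letters only and the title uses letters and digits.
NOTEBOOK_NAME = re.compile(
    r"^(?P<seq>\d+)-(?P<initials>[A-Za-z]+)-"
    r"(?P<title>[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*)\.ipynb$"
)


def parse_notebook_name(filename):
    """Return (sequence, initials, title), or None if the name does not match."""
    match = NOTEBOOK_NAME.match(filename)
    if match is None:
        return None
    return int(match["seq"]), match["initials"], match["title"]


# Hypothetical filename, used only to exercise the pattern.
print(parse_notebook_name("00-xyz-sample-notebook.ipynb"))
# → (0, 'xyz', 'sample-notebook')
```

Sorting notebooks by the parsed sequence number, rather than by raw filename, keeps the run order stable even if the number of digits varies.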

Useful Header Cells

Make the Jupyter notebook use the full screen width:

In [ ]:
# display and HTML are exported from IPython.display; importing them
# from IPython.core.display is deprecated.
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

When developing code in the src module, it's very useful to enable auto-reload:

In [ ]:
%load_ext autoreload
%autoreload 2

Python Libraries

Imports you'll almost always want:

In [ ]:
# Python standard library imports, alphabetized
import pathlib

# Third-party Python modules, alphabetized
import pandas as pd

# Source module imports
from src import paths
from src.data import DataSource, Dataset, Catalog

Logging

Enable logging and set the log level to DEBUG. This is particularly useful when developing code in your project module and using it from a notebook.

In [ ]:
import logging
from src.log import logger

logger.setLevel(logging.DEBUG)
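
To see what changing the level does, here is a small self-contained sketch that uses only the standard library. It assumes the project's `logger` behaves like an ordinary `logging.Logger`; the logger name `demo` and the in-memory stream are stand-ins chosen so the example runs outside the project.

```python
import io
import logging

# Stand-in for the project's logger; the name "demo" is hypothetical.
logger = logging.getLogger("demo")
stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))
logger.propagate = False  # keep the demo output out of the root logger

logger.setLevel(logging.INFO)
logger.debug("hidden at INFO")    # dropped: DEBUG is below INFO

logger.setLevel(logging.DEBUG)
logger.debug("visible at DEBUG")  # emitted now that the level is DEBUG

captured = stream.getvalue()
print(captured)
```

Only the second message reaches the handler, which is why lowering the level to DEBUG is helpful while developing.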

Working with a Dataset from the catalog

List the available datasets:

In [ ]:
c = Catalog.load('datasets'); c

Note: The first time you run a load function on a new dataset, it may be slow, because it does all the work of generating and verifying the dataset's contents. Subsequent runs use a cached copy of the dataset and are quick.

In [ ]:
%%time
ds = Dataset.load('20_newsgroups')  # replace 20_newsgroups with the name of a dataset you have a recipe for
In [ ]:
len(ds.data)
In [ ]:
ds.data[:5]
In [ ]:
print(ds.README)
In [ ]:
print(ds.LICENSE)

If these cells return data, you're up and running with a working installation.
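
The checks above can be gathered into one helper. This is only a sketch: `summarize` is a hypothetical function, not part of Easydata, and it relies solely on the `data`, `README` and `LICENSE` attributes used in the cells above. A stand-in object with made-up contents replaces a real dataset so the example runs without a catalog.

```python
from types import SimpleNamespace


def summarize(ds, n=5):
    """Collect the quick sanity checks into a single dict."""
    return {
        "n_items": len(ds.data),
        "preview": ds.data[:n],
        "has_readme": bool(getattr(ds, "README", None)),
        "has_license": bool(getattr(ds, "LICENSE", None)),
    }


# Stand-in for a loaded dataset; the contents are invented for illustration.
fake_ds = SimpleNamespace(data=["a", "b", "c"], README="About this data", LICENSE="")
summary = summarize(fake_ds, n=2)
print(summary)
```

With a real dataset, you would pass the object returned by `Dataset.load` instead of `fake_ds`.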

Some data science libraries built into the base conda environment

In [ ]:
# basic data science and visualization libraries
import sklearn
import matplotlib
import scipy
import pandas
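
If one of these imports fails, it can help to see at a glance which packages are missing. The sketch below uses only the standard library's `importlib.util.find_spec`; the helper name `check_installed` is hypothetical, and the module list simply mirrors the imports above.

```python
import importlib.util


def check_installed(names):
    """Map each top-level module name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_installed(["sklearn", "matplotlib", "scipy", "pandas"])
for name, ok in status.items():
    print(f"{name}: {'ok' if ok else 'MISSING'}")
```

Unlike a bare `import`, this reports every missing package in one pass instead of stopping at the first failure.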

CC BY 4.0
