skdag.DAG

class skdag.DAG(graph, *, memory=None, n_jobs=None, verbose=False)

A Directed Acyclic Graph (DAG) of estimators, that itself implements the estimator interface.

A DAG may consist of a simple chain of estimators (being exactly equivalent to a sklearn.pipeline.Pipeline) or a more complex path of dependencies. But as the name suggests, it may not contain any cyclic dependencies and data may only flow from one or more start points (roots) to one or more endpoints (leaves).
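The structural terms used here (roots, leaves, no cycles) can be illustrated without skdag at all. The following is a minimal sketch using only the Python standard library; the step names and the `has_cycle` helper are hypothetical and are not part of the skdag API.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical step names; each key maps to the set of steps it depends on.
deps = {
    "impute": set(),
    "vitals": {"impute"},
    "blood": {"impute"},
    "lr": {"vitals", "blood"},
}

# Roots have no dependencies; leaves are not a dependency of any other step.
roots = sorted(name for name, parents in deps.items() if not parents)
all_parents = set().union(*deps.values())
leaves = sorted(name for name in deps if name not in all_parents)

# A valid execution order exists only when the graph is acyclic.
order = list(TopologicalSorter(deps).static_order())


def has_cycle(graph):
    """Return True if the dependency mapping contains a cycle."""
    try:
        TopologicalSorter(graph).prepare()
    except CycleError:
        return True
    return False


# Making the root depend on the leaf closes a loop, which a DAG forbids.
cyclic = {**deps, "impute": {"lr"}}
```

Here `order` starts at the root and ends at the leaf, and `has_cycle(cyclic)` reports the loop that would make the graph invalid as a DAG.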

Parameters

graph (networkx.DiGraph): A directed graph with string node IDs indicating the step name. Each node must have a step attribute, which contains a skdag.dag.DAGStep.
memory (str or object with the joblib.Memory interface, default=None): Used to cache the fitted transformers of the DAG. By default, no caching is performed. If a string is given, it is the path to the caching directory. Enabling caching triggers a clone of the transformers before fitting, so the transformer instances given to the DAG cannot be inspected directly. Use the attribute named_steps or steps to inspect estimators within the DAG. Caching the transformers is advantageous when fitting is time consuming.
n_jobs (int, default=None): Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context.
verbose (bool, default=False): If True, the time elapsed while fitting each step will be printed as it is completed.

See also

skdag.DAGBuilder: Convenience utility for simplified DAG construction.

Examples

The simplest DAGs are just a chain of singular dependencies. These DAGs may be created with the skdag.dag.DAG.from_pipeline() method in the same way as a sklearn.pipeline.Pipeline:

>>> from skdag import DAG
>>> from sklearn.decomposition import PCA
>>> from sklearn.impute import SimpleImputer
>>> from sklearn.linear_model import LogisticRegression
>>> dag = DAG.from_pipeline(
...     steps=[
...         ("impute", SimpleImputer()),
...         ("pca", PCA()),
...         ("lr", LogisticRegression())
...     ]
... )
>>> print(dag.draw().strip())
o    impute
|
o    pca
|
o    lr

For more complex DAGs, it is recommended to use a skdag.dag.DAGBuilder, which allows you to define the graph by specifying the dependencies of each new estimator:

>>> from skdag import DAGBuilder
>>> dag = (
...     DAGBuilder()
...     .add_step("impute", SimpleImputer())
...     .add_step("vitals", "passthrough", deps={"impute": slice(0, 4)})
...     .add_step("blood", PCA(n_components=2, random_state=0), deps={"impute": slice(4, 10)})
...     .add_step("lr", LogisticRegression(random_state=0), deps=["blood", "vitals"])
...     .make_dag()
... )
>>> print(dag.draw().strip())
o    impute
|\
o o    blood,vitals
|/
o    lr

In the above example we pass the first four columns directly to a classifier, while the remaining columns have dimensionality reduction applied before being passed to the same classifier. Note that we can define our graph edges in two different ways: as a dict (if we need to select only certain columns from the source node) or as a simple list (if we simply want to grab all columns from all input nodes).
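To make the two edge styles concrete, here is a rough sketch of the data flow they describe, using plain Python lists in place of arrays. It assumes that a dict-style dependency applies its slice to the columns of the source step's output and that list-style dependencies are concatenated column-wise in the order listed; `select_columns`, `hstack` and the stand-in reduction are hypothetical helpers, not skdag functions.

```python
# Toy data: 2 rows, 10 columns (plain lists stand in for arrays).
X = [list(range(10)), list(range(10, 20))]


def select_columns(rows, cols):
    """Apply a column slice to every row."""
    return [row[cols] for row in rows]


def hstack(*blocks):
    """Concatenate blocks column-wise, row by row."""
    return [sum(parts, []) for parts in zip(*blocks)]


# Dict-style deps: each step sees only the selected columns of "impute".
vitals = select_columns(X, slice(0, 4))
blood_in = select_columns(X, slice(4, 10))

# Stand-in for a 2-component reduction; this is NOT real PCA.
blood_out = [row[:2] for row in blood_in]

# List-style deps: take every column of each dependency, side by side.
lr_in = hstack(blood_out, vitals)
```

Under these assumptions the final step receives six columns per row: two from the reduced block and four passed through unchanged.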

The DAG may now be used as an estimator in its own right:

>>> from sklearn import datasets
>>> X, y = datasets.load_diabetes(return_X_y=True)
>>> dag.fit_predict(X, y)
array([...

In an extension to the scikit-learn estimator interface, DAGs also support multiple inputs and multiple outputs. Let’s say we want to compare two different classifiers:

>>> from sklearn.ensemble import RandomForestClassifier
>>> cal = DAG.from_pipeline(
...     [("rf", RandomForestClassifier(random_state=0))]
... )
>>> dag2 = dag.join(cal, edges=[("blood", "rf"), ("vitals", "rf")])
>>> print(dag2.draw().strip())
o    impute
|\
o o    blood,vitals
|x|
o o    lr,rf

Now our DAG will return two outputs: one from each classifier. Multiple outputs are returned as a sklearn.utils.Bunch:

>>> y_pred = dag2.fit_predict(X, y)
>>> y_pred.lr
array([...
>>> y_pred.rf
array([...

Similarly, multiple inputs are also accepted: they can be provided by passing X and y as dict-like objects.

Attributes

graph_ (networkx.DiGraph): A read-only view of the workflow.
classes_ (ndarray of shape (n_classes,)): The class labels.
n_features_in_ (int): Number of features seen during fit. Only defined if all of the underlying root estimators in graph_ expose such an attribute when fit.
feature_names_in_ (ndarray of shape (n_features_in_,)): Names of features seen during fit. Only defined if the underlying estimators expose such an attribute when fit.