skdag.DAG
class skdag.DAG(graph, *, memory=None, n_jobs=None, verbose=False)
A Directed Acyclic Graph (DAG) of estimators, that itself implements the estimator interface.
A DAG may consist of a simple chain of estimators (being exactly equivalent to a sklearn.pipeline.Pipeline) or a more complex path of dependencies. But as the name suggests, it may not contain any cyclic dependencies, and data may only flow from one or more start points (roots) to one or more endpoints (leaves).
- Parameters
- graph : networkx.DiGraph
A directed graph with string node IDs indicating the step name. Each node must have a step attribute, which contains a skdag.dag.DAGStep.
- memory : str or object with the joblib.Memory interface, default=None
Used to cache the fitted transformers of the DAG. By default, no caching is performed. If a string is given, it is the path to the caching directory. Enabling caching triggers a clone of the transformers before fitting. Therefore, the transformer instance given to the DAG cannot be inspected directly. Use the attribute named_steps or steps to inspect estimators within the DAG. Caching the transformers is advantageous when fitting is time-consuming.
- n_jobs : int, default=None
Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context.
- verbose : bool, default=False
If True, the time elapsed while fitting each step will be printed as it is completed.
See also
skdag.DAGBuilder
Convenience utility for simplified DAG construction.
Examples
The simplest DAGs are just a chain of singular dependencies. These DAGs may be created with the skdag.dag.DAG.from_pipeline() method in the same way as a sklearn.pipeline.Pipeline:
>>> from skdag import DAG
>>> from sklearn.decomposition import PCA
>>> from sklearn.impute import SimpleImputer
>>> from sklearn.linear_model import LogisticRegression
>>> dag = DAG.from_pipeline(
...     steps=[
...         ("impute", SimpleImputer()),
...         ("pca", PCA()),
...         ("lr", LogisticRegression())
...     ]
... )
>>> print(dag.draw().strip())
o impute
|
o pca
|
o lr
For more complex DAGs, it is recommended to use a skdag.dag.DAGBuilder, which allows you to define the graph by specifying the dependencies of each new estimator:
>>> from skdag import DAGBuilder
>>> dag = (
...     DAGBuilder()
...     .add_step("impute", SimpleImputer())
...     .add_step("vitals", "passthrough", deps={"impute": slice(0, 4)})
...     .add_step("blood", PCA(n_components=2, random_state=0), deps={"impute": slice(4, 10)})
...     .add_step("lr", LogisticRegression(random_state=0), deps=["blood", "vitals"])
...     .make_dag()
... )
>>> print(dag.draw().strip())
o impute
|\
o o blood,vitals
|/
o lr
In the above example we pass the first four columns directly to the lr step (a LogisticRegression), while the remaining columns have dimensionality reduction applied before being passed to the same step. Note that we can define our graph edges in two different ways: as a dict (if we need to select only certain columns from the source node) or as a simple list (if we want to grab all columns from all input nodes).
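The two ways of declaring edges can be pictured with a small, library-free sketch. It is only an illustration of the data flow described above, not skdag's implementation: the helpers `select_columns` and `concat_columns` are hypothetical names, the two-column summary stands in for PCA, and the assumption that inputs are concatenated in the order the dependencies are listed is ours.

```python
# Hypothetical helpers; these names are illustrative and not part of skdag.
def select_columns(rows, cols):
    """Pick a slice of columns from every row (dict-style dependency)."""
    return [row[cols] for row in rows]


def concat_columns(*blocks):
    """Place the columns of several blocks side by side, row by row."""
    return [sum((list(r) for r in rows), []) for rows in zip(*blocks)]


# Two samples with ten features each, standing in for the output of "impute".
imputed = [list(range(10)), list(range(10, 20))]

# Dict-style deps: each downstream step sees only the selected columns.
vitals = select_columns(imputed, slice(0, 4))
blood_in = select_columns(imputed, slice(4, 10))

# Stand-in for PCA(n_components=2): any transformer mapping six columns to two.
blood = [[sum(row[:3]), sum(row[3:])] for row in blood_in]

# List-style deps: the final step receives all columns from all of its inputs
# (assumed here to follow the order in which the dependencies are listed).
lr_input = concat_columns(blood, vitals)
print(lr_input[0])  # [15, 24, 0, 1, 2, 3]
```

Each row reaching the final step has six columns: two derived from the "blood" branch and four passed through unchanged from "vitals".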
The DAG may now be used as an estimator in its own right:
>>> from sklearn import datasets
>>> X, y = datasets.load_diabetes(return_X_y=True)
>>> dag.fit_predict(X, y)
array([...
In an extension to the scikit-learn estimator interface, DAGs also support multiple inputs and multiple outputs. Let’s say we want to compare two different classifiers:
>>> from sklearn.ensemble import RandomForestClassifier
>>> cal = DAG.from_pipeline(
...     [("rf", RandomForestClassifier(random_state=0))]
... )
>>> dag2 = dag.join(cal, edges=[("blood", "rf"), ("vitals", "rf")])
>>> print(dag2.draw().strip())
o impute
|\
o o blood,vitals
|x|
o o lr,rf
Now our DAG will return two outputs: one from each classifier. Multiple outputs are returned as a sklearn.utils.Bunch:
>>> y_pred = dag2.fit_predict(X, y)
>>> y_pred.lr
array([...
>>> y_pred.rf
array([...
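The joined DAG produces two outputs because it has two leaves. As a rough illustration of that idea, and of the no-cycles rule, the sketch below uses only the standard library (`graphlib`, Python 3.9+) on an edge list that mirrors the drawing above. It is not how skdag works internally; the `analyse` function and the `edges` list are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical edge list mirroring the joined DAG drawn above: (source, target).
edges = [
    ("impute", "blood"), ("impute", "vitals"),
    ("blood", "lr"), ("vitals", "lr"),
    ("blood", "rf"), ("vitals", "rf"),
]


def analyse(edges):
    """Return (roots, leaves, order); raises CycleError if the graph has a cycle."""
    nodes = {node for edge in edges for node in edge}
    sources = {src for src, _ in edges}
    targets = {dst for _, dst in edges}
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    preds = {node: set() for node in nodes}
    for src, dst in edges:
        preds[dst].add(src)
    order = list(TopologicalSorter(preds).static_order())
    roots = sorted(nodes - targets)    # nodes with no incoming edge
    leaves = sorted(nodes - sources)   # nodes with no outgoing edge
    return roots, leaves, order


roots, leaves, order = analyse(edges)
print(roots)   # ['impute']
print(leaves)  # ['lr', 'rf']

try:
    analyse([("a", "b"), ("b", "a")])
except CycleError:
    print("cyclic graphs are rejected")
```

One root means one input is required, and each leaf corresponds to one entry in the returned output.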
Similarly, multiple inputs are also acceptable and inputs can be provided by specifying X and y as a dict-like object.
- Attributes
- graph_ : networkx.DiGraph
A read-only view of the workflow.
- classes_ : ndarray of shape (n_classes,)
The class labels.
- n_features_in_ : int
Number of features seen during fit. Only defined if all of the underlying root estimators in graph_ expose such an attribute when fit.
- feature_names_in_ : ndarray of shape (n_features_in_,)
Names of features seen during fit. Only defined if the underlying estimators expose such an attribute when fit.