Quickstart
The UBayFSpy package implements the framework proposed by [Jenul et al. (2022)](https://link.springer.com/article/10.1007/s10994-022-06221-9). UBayFS is an ensemble feature selection technique embedded in a Bayesian statistical framework. The method combines information from data with user knowledge, where the former is extracted via data-driven ensemble feature selection. The user can steer the selection by assigning prior weights to features and by penalizing specific feature combinations. In particular, the user can define a maximum number of selected features, must-link constraints (features must be selected together), and cannot-link constraints (features must not be selected together). Constraints can be relaxed, in which case a parameter $\rho$ regulates the shape of the penalty: a feature set that violates a relaxed constraint remains valid but receives a lower target value. UBayFS supports both standard feature selection and block feature selection.
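To make the role of $\rho$ concrete, the sketch below encodes a maximum-size, a cannot-link, and a must-link constraint as linear inequalities on a binary selection vector and scores a violating feature set. It is a minimal illustration and not the UBayFSpy API: the function name `admissibility`, the linear encoding, and the logistic-style penalty are assumptions made for this example. The exact penalty used by UBayFS is defined in Jenul et al. (2022).

```python
import numpy as np

def admissibility(delta, a, b, rho):
    """Penalty factor in [0, 1] for one linear constraint a @ delta <= b.

    Hypothetical helper for illustration; not part of UBayFSpy.
    """
    violation = float(a @ delta - b)
    if violation <= 0:
        return 1.0  # constraint satisfied: no penalty
    if np.isinf(rho):
        return 0.0  # hard constraint: a violating set is inadmissible
    xi = np.exp(-rho * violation)
    return float(2 * xi / (1 + xi))  # relaxed: decays smoothly toward 0

n_features = 5
# max_size: select at most 2 features -> sum(delta) <= 2
a_max, b_max = np.ones(n_features), 2
# cannot_link features 0 and 1 -> delta[0] + delta[1] <= 1
a_cl, b_cl = np.array([1, 1, 0, 0, 0]), 1
# must_link features 2 and 3 (one direction) -> delta[2] - delta[3] <= 0
a_ml, b_ml = np.array([0, 0, 1, -1, 0]), 0

delta = np.array([1, 1, 1, 0, 0])  # selects features 0, 1 and 2

hard = admissibility(delta, a_max, b_max, np.inf)  # hard constraint
soft = admissibility(delta, a_max, b_max, 1.0)     # relaxed constraint
softer = admissibility(delta, a_max, b_max, 0.1)   # weaker penalty
print(hard, soft, softer)
```

In this sketch a smaller $\rho$ penalizes the same violation less, while $\rho = \infty$ rules the feature set out entirely.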
Classification example
The following Python example illustrates UBayFS on the Wisconsin breast cancer (classification) dataset, which is available from scikit-learn. First, we load and prepare the data. Then we initialize a UBayFS model, attach constraints, and train it. For more examples, including block feature selection, see the example notebooks in the UBayFS GitHub repository.
import pandas as pd
import numpy as np
import UBayFS
from UBaymodel import UBaymodel
from UBayconstraint import UBayconstraint
# Load the features and the diagnosis labels; encode malignant (M) as 0 and benign (B) as 1.
data = pd.read_csv("./data/data.csv")
labels = pd.read_csv("./data/labels.csv").replace(("M","B"),(0,1)).astype(int)
model = UBaymodel(data=data,
                  target=labels,
                  feat_names=data.columns,
                  M=100,
                  tt_split=0.75,
                  nr_features=10,
                  method=["mrmr"],
                  weights=[0.01],
                  l=1,
                  constraints=None,
                  optim_method="GA",
                  popsize=100,
                  maxiter=100,
                  random_state=0)
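# Four constraints, one rho each: at most 10 features, one must-link group,
# and two cannot-link groups (groups are given as feature indices).
constraints = UBayconstraint(rho=np.array([np.inf, 0.1, 1, 1]),
                             constraint_types=["max_size", "must_link", "cannot_link", "cannot_link"],
                             constraint_vars=[10, [0, 10, 20], [0, 9], [19, 22, 23]],
                             num_elements=data.shape[1])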
model.setConstraints(constraints)
model.train()
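If the CSV files are not at hand, the same dataset can be loaded from scikit-learn directly. The sketch below is one way to do this, under two assumptions: the column name `diagnosis` is chosen only for this example, and `UBaymodel` is assumed to accept these objects in the same way as the CSV-derived ones, which is not confirmed here. scikit-learn already encodes malignant as 0 and benign as 1, so no label replacement is needed.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True returns pandas objects instead of NumPy arrays
bunch = load_breast_cancer(as_frame=True)

data = bunch.data  # DataFrame with 569 samples and 30 features
# "diagnosis" is a hypothetical column name; 0 = malignant, 1 = benign
labels = bunch.target.to_frame(name="diagnosis")

print(data.shape, list(bunch.target_names))
```

With 30 feature columns, the feature indices used in the constraints above (up to 23) remain valid for this DataFrame.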