Block feature selection with UBayFS

Introduction

Block feature selection is important in many application fields, including treatment outcome prediction in healthcare (for diseases such as cancer). Data is commonly available from multiple sources, such as clinical, genetic, and image data, and measurements from a common source are aggregated into a feature block. In many cases, however, not all data sources are relevant for machine learning models. Suppose we suspect that image data contains no information beyond what the other feature blocks provide. In that case, it may be easier, in terms of data acquisition and availability, to favor models that do not depend on all data sources at once. Block feature selection can be used to detect that a feature block provides no additional information. UBayFS covers this scenario by allowing constraints to be specified on block level.

UBayFS example

First, we load the package and the Breast Cancer Wisconsin (BCW) example dataset, which is described in the main notebook (add link).

[1]:
import pandas as pd
import numpy as np
import sys
sys.path.append("../../src/UBayFS")

from UBaymodel import UBaymodel
from UBayconstraint import UBayconstraint

data = pd.read_csv("./data/data.csv")
labels = pd.read_csv("./data/labels.csv").replace(("M","B"),(0,1)).astype(int)

For block feature selection, each feature must be assigned to a block. This assignment can be provided either (a) via a block list or (b) via a block matrix.

Version (a): block list

The first example demonstrates how a list of index arrays can provide the block structure of the dataset. We define three blocks for the BCW dataset using zero-based feature indices: the first block contains the features with indices 0 to 9, the second block the features with indices 10 to 19, and the third block the features with indices 20 to 29.

[2]:
block_list = [np.arange(0,10), np.arange(10,20), np.arange(20,30)]
block_list
[2]:
[array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19]),
 array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29])]

For the UBayFS model, we define a max-size block constraint that restricts the number of selected blocks to at most one. Thus, the constraint_vars parameter is set to the maximum number of blocks to be selected, and num_elements is set to the number of blocks, which equals the number of elements in block_list.

[3]:
block_constraints = UBayconstraint(rho=np.array([1]),
                                   constraint_types=["max_size"],
                                   constraint_vars=[1],
                                   num_elements=len(block_list),
                                   block_list=block_list)
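
To make the meaning of the block-level max-size constraint concrete, the following sketch checks by hand whether a feature selection respects it. It assumes that a block counts as selected as soon as at least one of its features is selected. The helper names `selected_blocks` and `satisfies_block_max_size` are hypothetical and are not part of the UBayFS API.

```python
import numpy as np

# Same block structure as above: three blocks of ten features each.
block_list = [np.arange(0, 10), np.arange(10, 20), np.arange(20, 30)]

def selected_blocks(selection, block_list):
    """Return a binary vector with one entry per block.

    Assumption: a block is 'selected' if any of its features is selected.
    """
    selection = np.asarray(selection).astype(bool)
    return np.array([int(selection[idx].any()) for idx in block_list])

def satisfies_block_max_size(selection, block_list, max_blocks):
    """Check that at most `max_blocks` blocks contain a selected feature."""
    return int(selected_blocks(selection, block_list).sum()) <= max_blocks

# Three features, all from block 0: one block is selected.
within_one_block = np.zeros(30, dtype=int)
within_one_block[[0, 3, 7]] = 1

# Three features spread over blocks 0 and 2: two blocks are selected.
across_two_blocks = np.zeros(30, dtype=int)
across_two_blocks[[0, 3, 25]] = 1

print(satisfies_block_max_size(within_one_block, block_list, max_blocks=1))   # True
print(satisfies_block_max_size(across_two_blocks, block_list, max_blocks=1))  # False
```

Both selections contain three features, so a feature-level max-size constraint of three would accept either; only the block-level constraint distinguishes them.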

Version (b): block matrix

Assuming the same block structure as for the block list, we demonstrate how to specify the block structure in UBayFS via a block matrix. The block matrix is a binary assignment matrix consisting of rows representing the feature blocks and columns representing the features in the dataset. Note that, in general, a feature may be assigned to an arbitrary number of blocks (i.e., the row and column sums are not restricted), but in practice, a partition of the feature set is sufficient in most cases.

[4]:
block_matrix = np.zeros((3, 30))
block_matrix[0,np.arange(0,10)] = 1
block_matrix[1,np.arange(10,20)] = 1
block_matrix[2,np.arange(20,30)] = 1

pd.DataFrame(block_matrix)
[4]:
     0    1    2    3    4    5    6    7    8    9  ...   20   21   22   23   24   25   26   27   28   29
0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0

3 rows × 30 columns
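
Since both representations describe the same structure, one can be derived from the other. The sketch below converts a block list into a block matrix and checks whether the blocks form a partition, that is, whether every column sum equals one. The helpers `block_list_to_matrix` and `is_partition` are hypothetical names introduced for illustration and are not part of UBayFS.

```python
import numpy as np

def block_list_to_matrix(block_list, num_features):
    """Build a (num_blocks x num_features) binary assignment matrix."""
    block_matrix = np.zeros((len(block_list), num_features))
    for row, feature_idx in enumerate(block_list):
        block_matrix[row, feature_idx] = 1
    return block_matrix

def is_partition(block_matrix):
    """True if every feature belongs to exactly one block (all column sums are 1)."""
    return bool(np.all(block_matrix.sum(axis=0) == 1))

block_list = [np.arange(0, 10), np.arange(10, 20), np.arange(20, 30)]
block_matrix = block_list_to_matrix(block_list, num_features=30)

print(block_matrix.shape)          # (3, 30)
print(is_partition(block_matrix))  # True

# Overlapping blocks still give a valid block matrix, but no longer a partition.
overlapping = block_matrix.copy()
overlapping[1, 0] = 1              # feature 0 now belongs to blocks 0 and 1
print(is_partition(overlapping))   # False
```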

The same block feature constraints as for the block list can be produced using the block matrix. However, note that the num_elements parameter, specifying the number of blocks, has to be set to the number of rows in the block matrix:

[5]:
# same max-size block constraint as in version (a): select at most one block
block_constraints = UBayconstraint(rho=np.array([1]),
                                   constraint_types=["max_size"],
                                   constraint_vars=[1],
                                   num_elements=block_matrix.shape[0],
                                   block_matrix=block_matrix)

Block-wise prior weights

In addition to block-wise constraints, prior weights may also be specified on block level rather than on feature level. To this end, we define a helper function that builds the vector of prior weights from the block weights, so that all features from the same block are assigned the same prior weight. In this example, feature weights are set to 0.5 in block 1, to 1 in block 2, and to 2 in block 3.

[6]:
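def build_block_weights(blocks, weights):
    # blocks: block index of each feature; weights: one prior weight per block
    weights_ass = []
    for i in blocks:
        # assign each feature the prior weight of its block
        weights_ass.append(weights[i])
    return np.array(weights_ass)

prior_weights = build_block_weights(blocks=np.repeat([0, 1, 2], 10), weights=np.array([0.5, 1, 2]))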
prior_weights
[6]:
array([0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1. , 1. , 1. ,
       1. , 1. , 1. , 1. , 1. , 1. , 1. , 2. , 2. , 2. , 2. , 2. , 2. ,
       2. , 2. , 2. , 2. ])
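
Because the block labels are integer indices, the same weight vector can also be obtained with NumPy integer-array indexing instead of a loop. The sketch below additionally derives the block label of each feature from a block list, which assumes that the blocks partition the feature set. The variable names `block_weights` and `feature_blocks` are introduced here for illustration only.

```python
import numpy as np

block_list = [np.arange(0, 10), np.arange(10, 20), np.arange(20, 30)]
block_weights = np.array([0.5, 1.0, 2.0])

# Derive the block label of each feature from the block list.
# Assumption: the blocks partition the feature set, so each feature gets exactly one label.
feature_blocks = np.empty(30, dtype=int)
for block_id, feature_idx in enumerate(block_list):
    feature_blocks[feature_idx] = block_id

# Integer-array indexing assigns each feature the weight of its block.
prior_weights = block_weights[feature_blocks]

print(prior_weights[[0, 10, 20]].tolist())  # [0.5, 1.0, 2.0]
```

With overlapping blocks, a feature would have more than one candidate weight, and a rule for combining them would have to be chosen first.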

Evaluation of block feature selection results

After specifying the block constraints, we initialize the UBayFS model. In addition to the block constraints, we require that at most three features are selected in total (max-size constraint).

[7]:
model = UBaymodel(data = data,
                 target = labels,
                 feat_names = data.columns,
                 M = 100,
                 tt_split = 0.75,
                 nr_features = 10,
                 method = ["mrmr"],
                 weights = prior_weights,
                 l = 1,
                 constraints = UBayconstraint(rho=np.array([1]),
                             constraint_types=["max_size"],
                             constraint_vars=[3],
                             num_elements=data.shape[1]),
                 optim_method = "GA",
                 popsize = 100,
                 maxiter = 100,
                 random_state = 0)
[8]:
# add block constraints
model.setConstraints(block_constraints, append=True)

Finally, the model is trained and the selected feature set is evaluated.

[9]:
result = model.train()
[10]:
model.evaluateFS(result[0].values[:,0])
[10]:
{'cardinality': 3,
 'total utility': 0.291,
 'posterior feature utility': 0.291,
 'admissibility': 1.0,
 'number of violated constraints': 0,
 'average feature correlation': 0.847}

Conclusion

Block constraints in the UBayFS model are specified with the same syntax as ordinary feature set constraints. Thus, block constraints can easily be combined with feature-wise constraints. Further, the framework allows setting arbitrary linear constraints for blocks as well as for single features.
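
As an illustration of what a linear constraint on block level can look like, the sketch below writes two constraints as a system A z <= b, where z is a binary vector indicating which blocks contain at least one selected feature. This formulation, the convention for z, and the helper `block_indicator` are assumptions made for this example; the sketch does not reproduce how UBayFS stores or evaluates constraints internally.

```python
import numpy as np

# Block matrix with three blocks of ten features each.
block_matrix = np.zeros((3, 30))
block_matrix[0, 0:10] = 1
block_matrix[1, 10:20] = 1
block_matrix[2, 20:30] = 1

def block_indicator(selection, block_matrix):
    """Binary vector z with z[k] = 1 if any feature of block k is selected (assumed convention)."""
    return (block_matrix @ np.asarray(selection) > 0).astype(int)

# Hypothetical block-level constraint system A z <= b:
#   row 0: z0 + z1 + z2 <= 2   (select at most two blocks)
#   row 1: z0 + z2      <= 1   (blocks 0 and 2 must not be selected together)
A = np.array([[1, 1, 1],
              [1, 0, 1]])
b = np.array([2, 1])

# Features from blocks 0 and 1: both constraints hold.
selection_ok = np.zeros(30, dtype=int)
selection_ok[[2, 12, 15]] = 1
z_ok = block_indicator(selection_ok, block_matrix)
print(z_ok.tolist(), (A @ z_ok > b).tolist())    # [1, 1, 0] [False, False]

# Features from blocks 0 and 2: the second constraint is violated.
selection_bad = np.zeros(30, dtype=int)
selection_bad[[2, 25]] = 1
z_bad = block_indicator(selection_bad, block_matrix)
print(z_bad.tolist(), (A @ z_bad > b).tolist())  # [1, 0, 1] [False, True]
```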