{ "cells": [ { "cell_type": "markdown", "id": "5903f7ac", "metadata": {}, "source": [ "# Block feature selection with UBayFS\n", "\n", "## Introduction\n", "Block feature selection is relevant in many application areas, including treatment outcome prediction in healthcare (for diseases such as cancer). Data is commonly available from multiple sources, such as clinical, genetic, and image data, where measurements from a common source are aggregated into a feature block. In many cases, however, not all data sources are relevant for machine learning models. Suppose we suspect that image data contains no information beyond what the other feature blocks provide. In that case, it might be easier, in terms of data acquisition and availability, to favor models that do not depend on all data sources at once. Block feature selection can be used to detect that a feature block provides no additional information. UBayFS covers this scenario by specifying constraints on the block level.\n", "\n", "## UBayFS example\n", "First, we load the package and the Breast Cancer Wisconsin (BCW) example dataset, which is described in the main notebook (add link)." ] }, { "cell_type": "code", "execution_count": 1, "id": "56dc0b25", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import sys\n", "sys.path.append(\"../../src/UBayFS\")\n", "\n", "from UBaymodel import UBaymodel\n", "from UBayconstraint import UBayconstraint\n", "\n", "data = pd.read_csv(\"./data/data.csv\")\n", "labels = pd.read_csv(\"./data/labels.csv\").replace((\"M\",\"B\"),(0,1)).astype(int)" ] }, { "cell_type": "markdown", "id": "8551c0c1", "metadata": {}, "source": [ "For block feature selection, each feature's block affiliation must be defined, either (a) via a block list or (b) via a block matrix. 
\n", "\n", "### Version (a): block list\n", "The first example shows how a list of index arrays can specify the block structure of the dataset. We define three blocks for the BCW dataset: the first block contains the features with (zero-based) indices 0 to 9, the second block those with indices 10 to 19, and the third block those with indices 20 to 29." ] }, { "cell_type": "code", "execution_count": 2, "id": "3155e5fc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),\n", " array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19]),\n", " array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29])]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "block_list = [np.arange(0,10), np.arange(10,20), np.arange(20,30)]\n", "block_list" ] }, { "cell_type": "markdown", "id": "b333c52b", "metadata": {}, "source": [ "For the UBayFS model, we define a max-size block constraint that restricts the number of selected blocks to at most one. Thus, the `constraint_vars` parameter is set to the maximum number of blocks to be selected, and `num_elements` is set to the number of blocks, which equals the number of elements in `block_list`." ] }, { "cell_type": "code", "execution_count": 3, "id": "05cc4282", "metadata": {}, "outputs": [], "source": [ "block_constraints = UBayconstraint(rho=np.array([1]), \n", " constraint_types=[\"max_size\"], \n", " constraint_vars=[1], \n", " num_elements=len(block_list),\n", " block_list=block_list)" ] }, { "cell_type": "markdown", "id": "4d823675", "metadata": {}, "source": [ "### Version (b): block matrix\n", "Assuming the same block structure as for the block list, we demonstrate how to specify the block structure in UBayFS via a block matrix. The block matrix is a binary assignment matrix whose rows represent the feature blocks and whose columns represent the features in the dataset. 
Note that, in general, a feature may be assigned to an arbitrary number of blocks (i.e., the row and column sums are not restricted), but in practice, a partition of the feature set is sufficient in most cases." ] }, { "cell_type": "code", "execution_count": 4, "id": "74412d59", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th><th>8</th><th>9</th><th>...</th><th>20</th><th>21</th><th>22</th><th>23</th><th>24</th><th>25</th><th>26</th><th>27</th><th>28</th><th>29</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
" <tr><th>1</th><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
" <tr><th>2</th><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>...</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td></tr>\n",
" </tbody>\n",
"</table>\n",
"<p>3 rows × 30 columns</p>\n", "