VCNet examples
On this page, we illustrate how to use VCNet, a model that jointly learns a classifier and a counterfactual generator.
Start by importing the classes of our library. You will need the dataset and model classes.
from vcnet import DataCatalog, VCNet
Preparing the dataset
In this example, we will use the classical Adult dataset, which contains both numerical and categorical features.
The code to prepare the dataset is as follows:
import pandas as pd
df = pd.read_csv("datasets/adult.data")
dataset_settings = {
"target":"income",
"class_size" : 2,
"continuous" : ["age","hours-per-week"],
"categorical": ["workclass","education","marital-status","occupation","race","gender"],
"immutables" : ["race","gender"],
"batch_size": 64,
"scaling_method": "MinMax",
"encoding_method": "OneHot_drop_binary",
"activate_rounding": True,
}
dataset = DataCatalog(dataset_settings)
The dataset is first loaded into a pandas dataframe, which is then given to a DataCatalog instance, named :py:`dataset`.
This object can be seen as a kind of Lightning dataset, and to create the dataset that will be used to train the model, we need to provide some information:
target: the name of the attribute to be predicted by the classifier
batch_size: the classical batch-size parameter of optimization-based machine learning methods
scaling_method: an optional parameter to scale the numerical attributes
test_size / val_size (in [0,1], default 0.33): the proportion of the dataset used for testing/validation (see the sketch after the note below)
stratify (default False): whether the test/validation sets have to be sampled with class balance preserved
Warning
The target attribute has to be categorical (not numerical).
In addition, there are parameters that are specific to VCNet:
continuous and categorical define the features to use (and the type of each attribute)
imputation_method defines the method used to impute missing values (a class from the :py:mod:`sklearn.impute` module)
encoding_method: the attributes declared as categorical will be encoded using this method (by default, a one-hot encoding)
immutables: list here the attributes that must remain unchanged while generating counterfactuals
activate_rounding: if True (default), numerical attributes are rounded as a post-processing step of counterfactual generation; this makes the counterfactuals more realistic
Note
It is worth noticing that only the features that are listed in continuous or categorical will be handled by the model. Other attributes will be ignored.
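The optional parameters above can simply be added to the settings dictionary before the DataCatalog is created. A minimal sketch, assuming they are passed through the same dataset_settings dictionary as the other parameters (the values below are arbitrary):
dataset_settings["test_size"] = 0.2   # 20% of the examples are held out for testing
dataset_settings["val_size"] = 0.1    # 10% are held out for validation
dataset_settings["stratify"] = True   # preserve class balance in the splits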
We now prepare the datasets for training.
dataset.prepare_data(df)
train_loader = dataset.train_dataloader()
test_loader = dataset.test_dataloader()
The data preparation transforms the pandas dataframe into a dataset compatible with VCNet. The settings that we provided earlier are used at this stage.
Then, we collect the two datasets (for training and testing) into Lightning loaders. These loaders will be used by the VCNet module.
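To check what the loaders yield, you can peek at one batch. As in the evaluation loop shown later on this page, each batch is a (data, labels) pair, with the preprocessing (scaling, one-hot encoding) already applied:
# Peek at one training batch: features are already scaled and one-hot encoded
data, labels = next(iter(train_loader))
print(data.shape, labels.shape)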
Training the model
Now that our dataset is prepared, it is time to define and fit the model. Let us first define the model … and again, we have to set up a collection of hyperparameters. Note that part of the model's hyperparameters are directly those of the dataset. The other parameters are related to the architecture of VCNet. We invite you to have a look at the original VCNet article for insight into them. In short:
lambda_… are loss weights
latent_size, latent_size_share, etc. are architecture hyperparameters (layer sizes)
epochs and lr (learning rate) are the classical optimization parameters
hp = {
"dataset": dataset_settings,
"vcnet_params" : {
"lr": 2e-3,
"epochs" : 5,
"lambda_KLD": 0.5,
"lambda_CE": 0.93,
"lambda_BCE": 1,
"latent_size" : 19,
"latent_size_share" : 304,
"mid_reduce_size" : 152
}
}
vcnet = VCNet(hp)
Let us now fit the model that we have defined on the train dataset. For that, we use the Lightning trainer; it can be done in two lines.
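A minimal sketch, assuming VCNet is a standard pytorch_lightning.LightningModule (the Trainer and the fit call are the usual Lightning API; max_epochs is taken from the hyperparameters defined above):
import pytorch_lightning as pl

# Fit both the classifier and the counterfactual generator at once
trainer = pl.Trainer(max_epochs=hp["vcnet_params"]["epochs"])
trainer.fit(vcnet, train_loader)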
That’s it! You trained both the classifier and the counterfactual generator!
Use your model
At this stage, you are expected to test the accuracy of the model, as well as the validity of the counterfactual generator …
The code below illustrates how to generate counterfactuals from the test set and measure their accuracy and validity. It takes examples from the test_loader, applies the model's forward_pred method to get the probabilistic prediction, and also generates counterfactuals with the counterfactuals method.
import torch

vcnet.eval()
for data, labels in test_loader:
    # probabilistic prediction of the classifier, thresholded into a class
    cl = vcnet.forward_pred(data)
    cl = (cl.squeeze() > 0.5).float()
    # counterfactual examples and their predicted classes
    cf, clcf = vcnet.counterfactuals(data)
    acc = torch.sum(cl == labels) / len(data)
    validity = torch.sum(cl != clcf) / len(data)
    print(f"Accuracy: {acc}, validity: {validity}")
It works fine! You can now apply the fitted model to new examples. The code below illustrates how to generate counterfactuals in practice. Again, we use the test set, but it could be any other dataset (with the same dataset_settings):
for data, labels in test_loader:
    cf, clcf = vcnet.counterfactuals(data)
    cfdf = dataset.data_unloader(cf, clcf)
    print(cfdf)
The main difference with the previous example is that we apply the data_unloader method, which post-processes the internal counterfactuals of VCNet to make them user-readable and more realistic. This step reverses the preprocessing (scaling and one-hot encoding) and applies the smart rounding (if activated). Thus, the generated counterfactuals in cfdf look exactly like examples of the original database.
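Since the counterfactuals are now user-readable, they can be exported like any regular data. A small sketch, assuming cfdf is a pandas dataframe:
# Save the readable counterfactuals to a CSV file (assuming cfdf is a pandas dataframe)
cfdf.to_csv("counterfactuals.csv", index=False)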