vcnet package
Submodules
vcnet.classifiers module
Classifier Model for VCNet.
- class vcnet.classifiers.Classifier(hp)
Bases:
LightningModule
Simple fully convolutional classifier that can be used in the VCNet pipeline.
- Args:
hp (Dict): configuration of the classifier (hyperparameters) and the dataset
- configure_optimizers()
Choose what optimizers and learning-rate schedulers to use in your optimization. Normally you’d need one. But in the case of GANs or similar you might have multiple. Optimization with multiple optimizers only works in the manual optimization mode.
- Return:
Any of these 6 options.
- Single optimizer.
- List or Tuple of optimizers.
- Two lists - the first list has multiple optimizers, and the second has multiple LR schedulers (or multiple lr_scheduler_config).
- Dictionary, with an "optimizer" key, and (optionally) a "lr_scheduler" key whose value is a single LR scheduler or lr_scheduler_config.
- None - Fit will run without any optimizer.
The lr_scheduler_config is a dictionary which contains the scheduler and its associated configuration. The default configuration is shown below.
```python
lr_scheduler_config = {
    # REQUIRED: The scheduler instance
    "scheduler": lr_scheduler,
    # The unit of the scheduler's step size, could also be 'step'.
    # 'epoch' updates the scheduler on epoch end whereas 'step'
    # updates it after an optimizer update.
    "interval": "epoch",
    # How many epochs/steps should pass between calls to
    # `scheduler.step()`. 1 corresponds to updating the learning
    # rate after every epoch/step.
    "frequency": 1,
    # Metric to monitor for schedulers like `ReduceLROnPlateau`
    "monitor": "val_loss",
    # If set to `True`, will enforce that the value specified in 'monitor'
    # is available when the scheduler is updated, thus stopping
    # training if not found. If set to `False`, it will only produce a warning.
    "strict": True,
    # If using the `LearningRateMonitor` callback to monitor the
    # learning rate progress, this keyword can be used to specify
    # a custom logged name.
    "name": None,
}
```
When there are schedulers in which the .step() method is conditioned on a value, such as the torch.optim.lr_scheduler.ReduceLROnPlateau scheduler, Lightning requires that the lr_scheduler_config contains the keyword "monitor" set to the metric name that the scheduler should be conditioned on.
```python
# The ReduceLROnPlateau scheduler requires a monitor
def configure_optimizers(self):
    optimizer = Adam(...)
    return {
        "optimizer": optimizer,
        "lr_scheduler": {
            "scheduler": ReduceLROnPlateau(optimizer, ...),
            "monitor": "metric_to_track",
            "frequency": "indicates how often the metric is updated",
            # If "monitor" references validation metrics, then "frequency" should be set to a
            # multiple of "trainer.check_val_every_n_epoch".
        },
    }


# In the case of two optimizers, only one using the ReduceLROnPlateau scheduler
def configure_optimizers(self):
    optimizer1 = Adam(...)
    optimizer2 = SGD(...)
    scheduler1 = ReduceLROnPlateau(optimizer1, ...)
    scheduler2 = LambdaLR(optimizer2, ...)
    return (
        {
            "optimizer": optimizer1,
            "lr_scheduler": {
                "scheduler": scheduler1,
                "monitor": "metric_to_track",
            },
        },
        {"optimizer": optimizer2, "lr_scheduler": scheduler2},
    )
```
Metrics can be made available to monitor by simply logging them using self.log('metric_to_track', metric_val) in your LightningModule.
- Note:
Some things to know:
- Lightning calls .backward() and .step() automatically in case of automatic optimization.
- If a learning rate scheduler is specified in configure_optimizers() with key "interval" (default "epoch") in the scheduler configuration, Lightning will call the scheduler's .step() method automatically in case of automatic optimization.
- If you use 16-bit precision (precision=16), Lightning will automatically handle the optimizer.
- If you use torch.optim.LBFGS, Lightning handles the closure function automatically for you.
- If you use multiple optimizers, you will have to switch to 'manual optimization' mode and step them yourself.
- If you need to control how often the optimizer steps, override the optimizer_step() hook.
- forward(x: tensor)
Same as torch.nn.Module.forward().
- training_step(batch, batch_idx)
Here you compute and return the training loss and some additional metrics for e.g. the progress bar or logger.
- Args:
batch: The output of your data iterable, normally a DataLoader.
batch_idx: The index of this batch.
dataloader_idx: The index of the dataloader that produced this batch (only if multiple dataloaders are used).
- Return:
- Tensor - The loss tensor
- dict - A dictionary which can include any keys, but must include the key 'loss' in the case of automatic optimization.
- None - In automatic optimization, this will skip to the next batch (but is not supported for multi-GPU, TPU, or DeepSpeed). For manual optimization, this has no special meaning, as returning the loss is not required.
In this step you’d normally do the forward pass and calculate the loss for a batch. You can also do fancier things like multiple forward passes or something model specific.
Example:
```python
def training_step(self, batch, batch_idx):
    x, y, z = batch
    out = self.encoder(x)
    loss = self.loss(out, x)
    return loss
```
To use multiple optimizers, you can switch to ‘manual optimization’ and control their stepping:
```python
def __init__(self):
    super().__init__()
    self.automatic_optimization = False


# Multiple optimizers (e.g.: GANs)
def training_step(self, batch, batch_idx):
    opt1, opt2 = self.optimizers()

    # do training_step with encoder
    ...
    opt1.step()

    # do training_step with decoder
    ...
    opt2.step()
```
- Note:
When accumulate_grad_batches > 1, the loss returned here will be automatically normalized by accumulate_grad_batches internally.
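As an end-to-end usage sketch, this classifier trains like any LightningModule. The hp keys below are hypothetical (the documentation only states that hp configures the classifier and the dataset), and a Lightning 2.x import layout is assumed:
```python
from lightning.pytorch import Trainer

from vcnet.classifiers import Classifier
from vcnet.data import DataCatalog

# Hypothetical configuration; adapt the keys to your dataset and vcnet version.
hp = {
    "dataset": {
        "filename": "data/adult.csv",
        "target": "income",
        "continuous": ["age", "hours_per_week"],
        "categorical": ["workclass", "education"],
        "immutables": ["age"],
    },
    "classifier_params": {},  # classifier hyperparameters (schema not documented here)
}

catalog = DataCatalog(hp["dataset"])  # assumption: the catalog takes the dataset settings
catalog.prepare_data()

clf = Classifier(hp)
Trainer(max_epochs=10).fit(clf, datamodule=catalog)
```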
- class vcnet.classifiers.SKLearnClassifier(hp)
Bases:
object
Wrapper for using sklearn classifiers in the VCNet pipeline.
Example of minimal configuration:
```python
hp = {
    "dataset": {
        "target": "income",
    },
    "classifier_params": {
        "skname": "RandomForestClassifier",
        "kwargs": {
            "n_estimators": 50,
        },
    },
}
classifier = SKLearnClassifier(hp)
classifier.fit(dataset.df_train)
```
- Attributes:
hp (Dict): configuration of the classifier (hyperparameters) and the dataset
Remark
This class also allows the use of an XGBoost classifier.
Remark
The kwargs of the classifier have to be checked against the sklearn API or the [XGBoost API](https://xgboost.readthedocs.io/en/stable/python/sklearn_estimator.html).
- fit(X: DataFrame)
Function to fit the model.
- Args:
X (pd.DataFrame): dataset to train the model on
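Following the remarks above, an XGBoost variant of the minimal configuration might look as follows; the skname value "XGBClassifier" and the exact kwargs are assumptions to verify against your installation:
```python
from vcnet.classifiers import SKLearnClassifier

hp = {
    "dataset": {
        "target": "income",
    },
    "classifier_params": {
        # Assumption: XGBoost's sklearn-compatible estimator is selected by its class name.
        "skname": "XGBClassifier",
        # kwargs are forwarded to the estimator; check them against the XGBoost API.
        "kwargs": {"n_estimators": 100, "max_depth": 4},
    },
}

classifier = SKLearnClassifier(hp)
classifier.fit(df_train)  # df_train: a pd.DataFrame that contains the "income" column
```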
vcnet.data module
VCNet data module. This module provides the classes to manage the data for VCNet.
- class vcnet.data.DataCatalog(config: dict)
Bases:
LightningDataModule
Generic framework for datasets, using sklearn processing. This class is implemented by OnlineCatalog and CsvCatalog. OnlineCatalog allows the user to easily load online datasets, while CsvCatalog allows easy use of local datasets.
The preprocessing pipeline is made of the following steps: encoding of categorical attributes, data imputation and scaling. The reverse pipeline can also apply a rounding of numerical attributes.
- Args:
- config: Dict
Configuration dictionary containing the required and optional settings to prepare the dataset for counterfactual generation. The settings are used to set up an internal pipeline (and its reverse pipeline). A configuration sketch is given after this list.
- The settings must at least define the following attributes:
- target: str
Name of the target attribute
- continuous: List[str]
List of continuous attributes of the dataset
- categorical: List[str]
List of categorical attributes of the dataset
- immutables: List[str]
List of immutable attributes (among the continuous or categorical attributes)
If the dataset is stored in a file, it can be loaded into the pipeline by setting the filename attribute
- The following optional settings define the train/test sets:
- test_size/val_size: float
proportions of the dataset dedicated to test and validation
- stratify: bool
Use a stratification strategy to sample the test/train sets
- The following optional settings define the pre-processing pipeline:
- scaling_method: str, default: MinMax
Type of sklearn scaler to use. Can be set with the property setter to any sklearn scaler. Set to “Identity” for no scaling.
- encoding_method: str, default: OneHot_drop_binary
Type of one-hot encoding, one of {OneHot, OneHot_drop_binary}. The drop_binary variant drops one of the two columns produced for binary features. Can be set with the property setter to any sklearn encoder. Set to “Identity” for no encoding.
- imputation_method: str, default: Identity
Type of sklearn imputer (“SimpleImputer” or “Identity”). Set to “Identity” for no imputation.
- activate_rounding: bool, default: False
If True, the continuous attribute values of a generated counterfactual will be rounded to be more realistic.
- Finally, some other optional parameters:
- batch_size: int
default value is 64
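Putting these settings together, a configuration sketch might look as follows (the file name and attribute names are illustrative, not part of the API):
```python
from vcnet.data import DataCatalog

config = {
    # Required settings
    "target": "income",
    "continuous": ["age", "hours_per_week"],
    "categorical": ["workclass", "education"],
    "immutables": ["age"],
    "filename": "data/adult.csv",  # optional: load the dataset from a file
    # Optional train/test settings
    "test_size": 0.2,
    "val_size": 0.1,
    "stratify": True,
    # Optional pre-processing settings
    "scaling_method": "MinMax",
    "encoding_method": "OneHot_drop_binary",
    "imputation_method": "Identity",
    "activate_rounding": False,
    # Other optional parameters
    "batch_size": 64,
}

catalog = DataCatalog(config)
```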
- Attributes:
- data_name: str
What name the dataset should have.
- df: pd.DataFrame
The complete Dataframe. This is equivalent to the combination of df_train and df_test, although not shuffled.
- df_train: pd.DataFrame
Training portion of the complete Dataframe.
- df_test: pd.DataFrame
Testing portion of the complete Dataframe.
- df_val: pd.DataFrame
Validation portion of the complete Dataframe.
Warning
Imputation works only for continuous variables.
Warning
Rounding is applied to all numerical attributes or none; you cannot choose which attributes are rounded. Nonetheless, the rounding setting (number of decimals) is automatically inferred from the training data per attribute, so two attributes may be rounded at different precisions.
- data_unloader(X, y) DataFrame
Recreates a dataframe from the (numpy) arrays of data and labels.
It applies the inverse transformation required internally by VCNet. In particular, it reverses the one-hot encoding to recreate readable categorical features for the user.
- Returns:
DataFrame: Dataframe with the same columns as the input dataframe
Warning
In case the pre-processing included missing value imputation, this step is not reversed and the output dataset contains the imputed values.
- property df_test: DataFrame
Dataframe containing prepared test data
- property df_train: DataFrame
Dataframe containing prepared train data
- property df_val: DataFrame
Dataframe containing prepared validation data
- property encoder: BaseEstimator
Contains a fitted sklearn encoder.
Returns
sklearn.base.BaseEstimator
- get_pipeline_element(key: str) Callable
Returns a specific element of the transformation pipeline.
Parameters
- key: str
Element of the pipeline we want to return
Returns
Pipeline element
- property imputer: BaseEstimator
Contains a fitted sklearn imputer.
Returns
sklearn.base.BaseEstimator
- inverse_transform(df: DataFrame) DataFrame
Transforms output after prediction back into original form. Only possible for DataFrames with preprocessing steps.
Parameters
- df: pd.DataFrame
Contains normalized and encoded data.
Returns
- output: pd.DataFrame
Prediction output denormalized and decoded
- prepare_data(raw_pd: DataFrame = None)
Data preparation
- Args:
- raw_pd (pd.DataFrame, optional): A pandas dataframe containing data to prepare.
Defaults to None.
- Returns: Dict or None
Updated settings ready for use in a VCNet model. If None, this means the data preparation failed.
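A usage sketch, assuming the configuration dictionary sketched above: prepare_data can either load the file named in the configuration or take an already-loaded dataframe.
```python
import pandas as pd

from vcnet.data import DataCatalog

catalog = DataCatalog(config)  # config: the configuration sketched above

# Either let the catalog load the file named in the configuration ...
settings = catalog.prepare_data()

# ... or hand it an already-loaded dataframe.
raw = pd.read_csv("data/adult.csv")
settings = catalog.prepare_data(raw_pd=raw)

if settings is None:
    raise RuntimeError("data preparation failed")
```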
- property raw_df_test: DataFrame
Dataframe containing raw test data
- property raw_df_train: DataFrame
Dataframe containing raw train data
- property raw_df_val: DataFrame
Dataframe containing raw validation data
- property scaler: BaseEstimator
Contains a fitted sklearn scaler.
Returns
sklearn.base.BaseEstimator
- property settings
Settings of the dataset
- test_dataloader() DataLoader
An iterable or collection of iterables specifying test samples.
For more information about multiple dataloaders, see the Lightning documentation.
For data processing use the following pattern:
- download in prepare_data()
- process and split in setup()
However, the above are only necessary for distributed processing.
Warning
do not assign state in prepare_data
- test()
- setup()
- Note:
Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
- Note:
If you don’t need a test dataset and a test_step(), you don’t need to implement this method.
- train_dataloader() DataLoader
An iterable or collection of iterables specifying training samples.
For more information about multiple dataloaders, see the Lightning documentation.
The dataloader you return will not be reloaded unless you set Trainer.reload_dataloaders_every_n_epochs to a positive integer.
For data processing use the following pattern:
- download in prepare_data()
- process and split in setup()
However, the above are only necessary for distributed processing.
Warning
do not assign state in prepare_data
- fit()
- setup()
- Note:
Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
- transform(df: DataFrame) DataFrame
Transforms raw input into the correct form for prediction. Only possible for DataFrames without preprocessing steps.
Using this method is recommended to keep encodings and normalization correct.
Parameters
- df: pd.DataFrame
Contains raw (not normalized and not encoded) data.
Returns
- output: pd.DataFrame
Prediction input normalized and encoded
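A round-trip sketch combining transform and inverse_transform on a prepared catalog (continuing the example above):
```python
# Raw -> prepared: apply the fitted encoder/scaler to raw test data.
encoded = catalog.transform(catalog.raw_df_test)

# Prepared -> raw: undo the normalization and one-hot encoding.
restored = catalog.inverse_transform(encoded)
```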
- val_dataloader() DataLoader
An iterable or collection of iterables specifying validation samples.
For more information about multiple dataloaders, see the Lightning documentation.
The dataloader you return will not be reloaded unless you set Trainer.reload_dataloaders_every_n_epochs to a positive integer.
It’s recommended that all data downloads and preparation happen in prepare_data().
- fit()
- validate()
- setup()
- Note:
Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
- Note:
If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.
- class vcnet.data.NumpyDataset(*arrs)
Bases:
TensorDataset
Dataset with only numerical attributes. A numpy dataset is represented by a tensor with attributes in columns. When it is a training dataset, the last column is the numerical target feature.
- data_loader(batch_size=128, shuffle=True, num_workers=4)
Builder of a torch data loader to be used for training VCNet.
- Args:
batch_size (int, optional): Size of the batch. Defaults to 128.
shuffle (bool, optional): Whether to shuffle the examples before creating batches. Defaults to True.
num_workers (int, optional): Number of threads. Defaults to 4.
- Returns:
torch.utils.data.DataLoader: representation of a dataset for mini-batch optimization
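A minimal sketch, assuming NumpyDataset accepts a torch tensor whose last column holds the target (TensorDataset, its base class, requires tensors rather than raw numpy arrays):
```python
import numpy as np
import torch

from vcnet.data import NumpyDataset

X = np.random.rand(100, 5).astype(np.float32)                  # features
y = np.random.randint(0, 2, size=(100, 1)).astype(np.float32)  # target (last column)

train_ds = NumpyDataset(torch.from_numpy(np.hstack([X, y])))
loader = train_ds.data_loader(batch_size=32, shuffle=True, num_workers=0)
```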
- features(test=False)
Returns the feature part of the tensor
- Args:
- test (bool, optional): indicates whether the dataset contains labels (False) or not (True). Defaults to False.
- target(test=False)
Returns the labels of the dataset (if they exist; otherwise it returns None)
- Args:
test (bool, optional): indicates whether the dataset contains labels (False) or not (True). Defaults to False.
- class vcnet.data.PostHocRounder(precisions)
Bases:
object
A class dedicated to rounding values at a given number of decimals. This is a post-hoc rounder, as it applies a rounding to the numerical values generated by VCNet to provide more realistic values.
The class implements an inverse_transform only, as it applies the transformation to generated counterfactuals.
- Args:
precisions (Dict[str, int]): map that gives the precision (number of decimals) to apply to each attribute name
- inverse_transform(df)
Apply the inverse transformation on the dataframe df.
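A small sketch of the rounder; the attribute names and precisions are illustrative, and the transformed dataframe is assumed to be returned rather than modified in place:
```python
import pandas as pd

from vcnet.data import PostHocRounder

# Round "age" to integers and "hours_per_week" to one decimal.
rounder = PostHocRounder({"age": 0, "hours_per_week": 1})

cf = pd.DataFrame({"age": [37.21], "hours_per_week": [40.53]})
rounded = rounder.inverse_transform(cf)  # expected: age 37.0, hours_per_week 40.5
```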
- vcnet.data.attribute_round(fitted_rounder: PostHocRounder, features: List[str], df: DataFrame) DataFrame
Pipeline function to round the numerical attributes of the data.
Parameters
- fitted_rounder: Rounder
Rounds the attributes of the data at a fitted level of precision
- features: list
List of continuous features
- df: pd.DataFrame
Data we want to round
Returns
- output: pd.DataFrame
Whole DataFrame with rounded values
- vcnet.data.decode(fitted_encoder: BaseEstimator, features: List[str], df: DataFrame) DataFrame
Pipeline function to decode data with fitted sklearn OneHotEncoder.
Parameters
- fitted_encoder: sklearn OneHotEncoder
Encodes input data.
- features: list
List of categorical features.
- df: pd.DataFrame
Data we want to decode
Returns
- output: pd.DataFrame
Whole DataFrame with decoded values
- vcnet.data.descale(fitted_scaler: BaseEstimator, features: List[str], df: DataFrame) DataFrame
Pipeline function to de-normalize data with fitted sklearn scaler.
Parameters
- fitted_scaler: sklearn Scaler
Normalizes input data
- features: list
List of continuous features
- df: pd.DataFrame
Data we want to de-normalize
Returns
- output: pd.DataFrame
Whole DataFrame with de-normalized values
- vcnet.data.encode(fitted_encoder: BaseEstimator, features: List[str], df: DataFrame) DataFrame
Pipeline function to encode data with fitted sklearn OneHotEncoder.
Parameters
- fitted_encoder: sklearn OneHotEncoder
Encodes input data.
- features: list
List of categorical features.
- df: pd.DataFrame
Data we want to encode
Returns
- output: pd.DataFrame
Whole DataFrame with encoded values
- vcnet.data.fit_encoder(encoding_method, df)
Parameters
- encoding_method: {“OneHot”, “OneHot_drop_binary”, “Identity”}
String indicating what encoding method to use or sklearn.preprocessing function.
- df: pd.DataFrame
DataFrame containing only categorical data.
Returns
sklearn.base.BaseEstimator
- vcnet.data.fit_imputer(imputation_method, df)
Parameters
- imputation_method: {“SimpleImputer”,”Identity”}
String indicating what imputation method to use, or an sklearn.impute function.
- df: pd.DataFrame
DataFrame only containing continuous data.
Returns
sklearn.base.BaseEstimator
- vcnet.data.fit_rounder(df)
Function that builds a rounder from a dataframe.
Parameters
- df: pd.DataFrame
DataFrame only containing continuous data.
Returns
Rounder
- vcnet.data.fit_scaler(scaling_method, df)
Parameters
- scaling_method: {“MinMax”, “Standard”, “Identity”}
String indicating what scaling method to use or sklearn.preprocessing function.
- df: pd.DataFrame
DataFrame only containing continuous data.
Returns
sklearn.base.BaseEstimator
- vcnet.data.impute(fitted_imputer: BaseEstimator, features: List[str], df: DataFrame) DataFrame
Pipeline function to impute missing values in the dataset with a fitted sklearn Imputer. This function has to be applied once the imputer has been fitted.
Parameters
- fitted_imputer: sklearn Imputer
Imputes missing values.
- features: list
List of numerical features.
- df: pd.DataFrame
Data we want to modify
Returns
- output: pd.DataFrame
Whole DataFrame without missing values (in the selected features)
- vcnet.data.order_data(feature_order: List[str], df: DataFrame) DataFrame
Restores the correct input feature order for the ML model. Only works for encoded data.
Parameters
- feature_order: list
List of input features in the correct order
- df: pd.DataFrame
Data we want to order
Returns
- output: pd.DataFrame
Whole DataFrame with ordered features
- vcnet.data.scale(fitted_scaler: BaseEstimator, features: List[str], df: DataFrame) DataFrame
Pipeline function to normalize data with fitted sklearn scaler.
Parameters
- fitted_scaler: sklearn Scaler
Normalizes input data
- features: list
List of continuous features
- df: pd.DataFrame
Data we want to normalize
Returns
- output: pd.DataFrame
Whole DataFrame with normalized values
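Based on the signatures above, a round-trip sketch through fit_scaler, scale, and descale (attribute names and values are illustrative):
```python
import pandas as pd

from vcnet.data import descale, fit_scaler, scale

features = ["age", "hours_per_week"]
df = pd.DataFrame({"age": [25, 40, 63], "hours_per_week": [20, 40, 60]})

scaler = fit_scaler("MinMax", df[features])      # fitted sklearn scaler
df_scaled = scale(scaler, features, df)          # normalized values in [0, 1]
df_back = descale(scaler, features, df_scaled)   # back to the original scale
```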
vcnet.models module
Module for the VCNet models
- class vcnet.models.PHVCNet(model_config: Dict, classifier: Module)
Bases:
VCNetBase
Class for the Post-hoc VCNet (immutable version) model architecture. Post-hoc VCNet uses a torch classifier already trained on a classification task and trains only the counterfactual generator.
The classifier provided to this class is assumed to take the examples to classify as input.
- classif(z: tensor, x: tensor, x_mut: tensor, x_immut: tensor) tensor
Forward function of the classification layers. It predicts the class of an example z prepared by the encode_classif function.
- Args:
z (torch.tensor): examples represented in their latent space for classification.
- Returns:
torch.tensor: example classification. Dimension self.class_size.
- training_step(batch, batch_idx)
Training step for lightning
- Args:
batch (torch.tensor): the training batch
batch_idx (int): index of the batch
- Returns:
float: loss measure for the batch
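A post-hoc training sketch, reusing the Classifier and DataCatalog sketches from vcnet.classifiers above; the model_config schema is not fixed by this documentation and is therefore hypothetical:
```python
from lightning.pytorch import Trainer

from vcnet.models import PHVCNet

# `clf` is a torch classifier already trained on the classification task
# (e.g. the vcnet.classifiers.Classifier trained earlier); `catalog` is a
# prepared DataCatalog and `model_config` a hypothetical configuration dict.
model = PHVCNet(model_config, classifier=clf)

# Only the counterfactual generator is trained; the classifier is reused as-is.
Trainer(max_epochs=50).fit(model, datamodule=catalog)
```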
- class vcnet.models.VCNet(model_config: Dict)
Bases:
VCNetBase
Class for the VCNet (immutable version) model architecture. VCNet is a joint learning architecture: during the training phase, both the classifier and the counterfactual generator are fitted.
- classif(z: tensor, x: tensor, x_mut: tensor, x_immut: tensor) tensor
Forward function of the classification layers. It predicts the class of an example z prepared by the encode_classif function.
- Args:
z (torch.tensor): examples represented in their latent space for classification.
- Returns:
torch.tensor: example classification. Dimension self.class_size.
- loss_functions(recon_x, x, mu, sigma, output_class=None, y_true=None)
Evaluation of the VCNet losses
- pre_encode(x_mut: tensor, x_immut: tensor) tensor
Function that prepares the examples (x) with shared pre-coding layers.
The default behavior is to pass x through unchanged.
- training_step(batch, batch_idx)
Training step for lightning
- Args:
batch (torch.tensor): the training batch
batch_idx (int): index of the batch
- Returns:
float: loss measure for the batch
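A joint-training sketch: with VCNet, the classifier and the cVAE are fitted together in one call (catalog and model_config as in the previous sketches; both are assumptions about your configuration):
```python
from lightning.pytorch import Trainer

from vcnet.models import VCNet

# Joint learning: classifier and counterfactual generator are trained together.
model = VCNet(model_config)
Trainer(max_epochs=50).fit(model, datamodule=catalog)
```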
- class vcnet.models.VCNetBase(model_config: Dict)
Bases:
LightningModule, ABC
Class for the general VCNet architecture with handling of immutable features. This class is abstract. It specifies a VCNet model with a classifier and a conditional variational auto-encoder (cVAE), and a training procedure. The training procedure of a VCNet architecture consists in training the cVAE in a classical way. The VCNet trick lies in generating counterfactuals by switching the predicted class of an example to generate a modified example using the cVAE.
The VCNet architecture natively handles immutable features.
Note that this VCNet architecture handles only numerical features. The user of this class has to manage the encoding of categorical features outside of this class.
The current class implements the cVAE and declares the abstract functions.
- abstractmethod classif(z: tensor, x: tensor, x_mut: tensor, x_immut: tensor) tensor
Forward function of the classification layers. It predicts the class of an example z prepared by the encode_classif function.
- Args:
z (torch.tensor): examples represented in their latent space for classification.
- Returns:
torch.tensor: example classification. Dimension self.class_size.
- configure_optimizers()
Setup of the optimizer
- counterfactuals(x: tensor) tensor
Generation of counterfactuals for the example x.
Warning
This function has been tested only for binary classification. The use with a multiclass problem is still to evaluate.
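A generation sketch for a trained model; x is assumed to be a batch of prepared (encoded and scaled) examples, e.g. taken from the catalog's test set:
```python
import torch

model.eval()
with torch.no_grad():
    x_cf = model.counterfactuals(x)   # counterfactual for each example in the batch
    y_cf = model.forward_pred(x_cf)   # predicted class of the counterfactuals

# The counterfactuals can be mapped back to readable features with
# DataCatalog.data_unloader / inverse_transform (see vcnet.data above).
```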
- decode(z_prime: tensor, c: tensor) tensor
C-VAE decoding, computes P(x|z, c)
- Args:
z_prime (torch.tensor): _description_
c (torch.tensor): conditioning of the VAE. For VCNet, the decoding is conditioned by the class and the immutable features [class, x_immutable]. Its dimension is therefore class_size - 1 + len(x_immutable).
- Returns:
torch.tensor: _description_
- encode(z: tensor, x_mut: tensor, x_immut: tensor) tensor
C-VAE encoding
- Args:
- z (torch.tensor): pre-encoded input representation. None or a tensor of size defined by latent_size_share
x_mut (torch.tensor): mutable part of the input tensor
x_immut (torch.tensor): immutable part of the input tensor
- Returns:
- tuple(torch.tensor, torch.tensor): representation of the Gaussian distribution in the latent space (mu, sigma). Tensors of dimension latent_size.
- forward(x: tensor)
Forward function used during the training phase of a VCNet model. It goes through the three parts of the model: the pre-coding, the C-VAE, and the classification. Finally, it returns the reconstructed example, the output class, and the VAE distribution parameters.
- Args:
x (torch.tensor): input examples
- forward_pred(x: tensor) tensor
Forward function for prediction in the test phase (prediction task). It prepares the examples and then classifies them.
- Args:
x (torch.tensor): _description_
- loss_functions(recon_x, x, mu, sigma, output_class=None, y_true=None)
Evaluate the loss of the reconstruction
- pre_encode(x_mut: tensor, x_immut: tensor) tensor
Function that prepares the examples (x) with shared pre-coding layers.
The default behavior is to pass x through unchanged.
- reparameterize(mu: tensor, sigma: tensor) tensor
C-VAE Reparametrization trick
- Args:
mu (torch.tensor): size latent_size
sigma (torch.tensor): size latent_size
- Returns:
torch.tensor: size latent_size
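For reference, a minimal sketch of the standard reparameterization trick this method implements, assuming sigma holds the standard deviation (some implementations pass the log-variance instead):
```python
import torch

def reparameterize(mu: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
    # z = mu + sigma * eps with eps ~ N(0, I): sampling stays differentiable
    # with respect to mu and sigma.
    eps = torch.randn_like(sigma)
    return mu + sigma * eps
```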
Module contents
VCNet Package
- class vcnet.DataCatalog(config: dict)
Bases:
LightningDataModule
Generic framework for datasets, using sklearn processing. This class is implemented by OnlineCatalog and CsvCatalog. OnlineCatalog allows the user to easily load online datasets, while CsvCatalog allows easy use of local datasets.
The preprocessing pipeline is made of the following steps: encoding of categorical attributes, data imputation and scaling. The reverse pipeline can also apply a rounding of numerical attributes.
- Args:
- config: Dict
Configuration dictionary containing the required and optional settings to prepare the dataset for counterfactual generation. The settings are used to set up an internal pipeline (and its reverse pipeline).
- The settings must at least define the following attributes:
- target: str
Name of the target attribute
- continuous: List[str]
List of continuous attributes of the dataset
- categorical: List[str]
List of categorical attributes of the dataset
- immutables: List[str]
List of immutable attributes (among the continuous or categorical attributes)
If the dataset is stored in a file, it can be loaded into the pipeline by setting the filename attribute
- The following optional settings define the train/test sets:
- test_size/val_size: float
proportions of the dataset dedicated to test and validation
- stratify: bool
Use a stratification strategy to sample the test/train sets
- The following optional settings define the pre-processing pipeline:
- scaling_method: str, default: MinMax
Type of sklearn scaler to use. Can be set with the property setter to any sklearn scaler. Set to “Identity” for no scaling.
- encoding_method: str, default: OneHot_drop_binary
Type of one-hot encoding, one of {OneHot, OneHot_drop_binary}. The drop_binary variant drops one of the two columns produced for binary features. Can be set with the property setter to any sklearn encoder. Set to “Identity” for no encoding.
- imputation_method: str, default: Identity
Type of sklearn imputer (“SimpleImputer” or “Identity”). Set to “Identity” for no imputation.
- activate_rounding: bool, default: False
If True, the continuous attribute values of a generated counterfactual will be rounded to be more realistic.
- Finally, some other optional parameters:
- batch_size: int
default value is 64
- Attributes:
- data_name: str
What name the dataset should have.
- df: pd.DataFrame
The complete Dataframe. This is equivalent to the combination of df_train and df_test, although not shuffled.
- df_train: pd.DataFrame
Training portion of the complete Dataframe.
- df_test: pd.DataFrame
Testing portion of the complete Dataframe.
- df_val: pd.DataFrame
Validation portion of the complete Dataframe.
Warning
Imputation works only for continuous variables.
Warning
Rounding is applied to all numerical attributes or none; you cannot choose which attributes are rounded. Nonetheless, the rounding setting (number of decimals) is automatically inferred from the training data per attribute, so two attributes may be rounded at different precisions.
- data_unloader(X, y) DataFrame
Recreates a dataframe from the (numpy) arrays of data and labels.
It applies the inverse transformation required internally by VCNet. In particular, it reverses the one-hot encoding to recreate readable categorical features for the user.
- Returns:
DataFrame: Dataframe with the same columns as the input dataframe
Warning
In case the pre-processing included missing value imputation, this step is not reversed and the output dataset contains the imputed values.
- property df_test: DataFrame
Dataframe containing prepared test data
- property df_train: DataFrame
Dataframe containing prepared train data
- property df_val: DataFrame
Dataframe containing prepared validation data
- property encoder: BaseEstimator
Contains a fitted sklearn encoder.
Returns
sklearn.base.BaseEstimator
- get_pipeline_element(key: str) Callable
Returns a specific element of the transformation pipeline.
Parameters
- key: str
Element of the pipeline we want to return
Returns
Pipeline element
- property imputer: BaseEstimator
Contains a fitted sklearn imputer.
Returns
sklearn.base.BaseEstimator
- inverse_transform(df: DataFrame) DataFrame
Transforms output after prediction back into original form. Only possible for DataFrames with preprocessing steps.
Parameters
- df: pd.DataFrame
Contains normalized and encoded data.
Returns
- output: pd.DataFrame
Prediction output denormalized and decoded
- prepare_data(raw_pd: DataFrame = None)
Data preparation
- Args:
- raw_pd (pd.DataFrame, optional): A pandas dataframe containing data to prepare.
Defaults to None.
- Returns: Dict or None
Updated settings ready for use in a VCNet model. If None, this means the data preparation failed.
- property raw_df_test: DataFrame
Dataframe containing raw test data
- property raw_df_train: DataFrame
Dataframe containing raw train data
- property raw_df_val: DataFrame
Dataframe containing raw validation data
- property scaler: BaseEstimator
Contains a fitted sklearn scaler.
Returns
sklearn.base.BaseEstimator
- property settings
Settings of the dataset
- test_dataloader() DataLoader
An iterable or collection of iterables specifying test samples.
For more information about multiple dataloaders, see the Lightning documentation.
For data processing use the following pattern:
- download in prepare_data()
- process and split in setup()
However, the above are only necessary for distributed processing.
Warning
do not assign state in prepare_data
- test()
- setup()
- Note:
Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
- Note:
If you don’t need a test dataset and a test_step(), you don’t need to implement this method.
- train_dataloader() DataLoader
An iterable or collection of iterables specifying training samples.
For more information about multiple dataloaders, see the Lightning documentation.
The dataloader you return will not be reloaded unless you set Trainer.reload_dataloaders_every_n_epochs to a positive integer.
For data processing use the following pattern:
- download in prepare_data()
- process and split in setup()
However, the above are only necessary for distributed processing.
Warning
do not assign state in prepare_data
- fit()
- setup()
- Note:
Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
- transform(df: DataFrame) DataFrame
Transforms raw input into the correct form for prediction. Only possible for DataFrames without preprocessing steps.
Using this method is recommended to keep encodings and normalization correct.
Parameters
- df: pd.DataFrame
Contains raw (not normalized and not encoded) data.
Returns
- output: pd.DataFrame
Prediction input normalized and encoded
- val_dataloader() DataLoader
An iterable or collection of iterables specifying validation samples.
For more information about multiple dataloaders, see the Lightning documentation.
The dataloader you return will not be reloaded unless you set Trainer.reload_dataloaders_every_n_epochs to a positive integer.
It’s recommended that all data downloads and preparation happen in prepare_data().
- fit()
- validate()
- setup()
- Note:
Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
- Note:
If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.
- class vcnet.PHVCNet(model_config: Dict, classifier: Module)
Bases:
VCNetBase
Class for the Post-hoc VCNet (immutable version) model architecture. Post-hoc VCNet uses a torch classifier already trained on a classification task and trains only the counterfactual generator.
The classifier provided to this class is assumed to take the examples to classify as input.
- classif(z: tensor, x: tensor, x_mut: tensor, x_immut: tensor) tensor
Forward function of the classification layers. It predicts the class of an example z prepared by the encode_classif function.
- Args:
z (torch.tensor): examples represented in their latent space for classification.
- Returns:
torch.tensor: example classification. Dimension self.class_size.
- training_step(batch, batch_idx)
Training step for lightning
- Args:
batch (torch.tensor): the training batch
batch_idx (int): index of the batch
- Returns:
float: loss measure for the batch
- class vcnet.SKLearnClassifier(hp)
Bases:
object
Wrapper for using sklearn classifiers in the VCNet pipeline.
Example of minimal configuration:
```python
hp = {
    "dataset": {
        "target": "income",
    },
    "classifier_params": {
        "skname": "RandomForestClassifier",
        "kwargs": {
            "n_estimators": 50,
        },
    },
}
classifier = SKLearnClassifier(hp)
classifier.fit(dataset.df_train)
```
- Attributes:
hp (Dict): configuration of the classifier (hyperparameters) and the dataset
Remark
This class also allows the use of an XGBoost classifier.
Remark
The kwargs of the classifier have to be checked against the sklearn API or the [XGBoost API](https://xgboost.readthedocs.io/en/stable/python/sklearn_estimator.html).
- fit(X: DataFrame)
Function to fit the model.
- Args:
X (pd.DataFrame): dataset to train the model on
- class vcnet.VCNet(model_config: Dict)
Bases:
VCNetBase
Class for the VCNet (immutable version) model architecture. VCNet is a joint learning architecture: during the training phase, both the classifier and the counterfactual generator are fitted.
- classif(z: tensor, x: tensor, x_mut: tensor, x_immut: tensor) tensor
Forward function of the classification layers. It predicts the class of an example z prepared by the encode_classif function.
- Args:
z (torch.tensor): examples represented in their latent space for classification.
- Returns:
torch.tensor: example classification. Dimension self.class_size.
- loss_functions(recon_x, x, mu, sigma, output_class=None, y_true=None)
Evaluation of the VCNet losses
- pre_encode(x_mut: tensor, x_immut: tensor) tensor
Function that prepares the examples (x) with shared pre-coding layers.
The default behavior is to pass x through unchanged.
- training_step(batch, batch_idx)
Training step for lightning
- Args:
batch (torch.tensor): the training batch
batch_idx (int): index of the batch
- Returns:
float: loss measure for the batch