rfcfit

Syntax

parameters = rfcfit(X,y)

parameters = rfcfit(X,y,options)

Inputs

X

Training data.

Type: double

Dimension: vector | matrix

y

Target values.

Type: double

Dimension: vector | matrix

options

Type: struct

n_estimators: The number of trees in the forest (default: 100).; Type: integer; Dimension: scalar
criterion: Function to measure quality of a split. 'gini' for Gini Impurity (default) and 'entropy' for Information Gain.; Type: char; Dimension: string
max_depth: The maximum depth of the tree. If not assigned, the nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.; Type: integer; Dimension: scalar
min_samples_split: The minimum number of samples required to split an internal node (default: 2). If integer, consider it as the minimum number; if float, (min_samples_split * number of samples) is taken as the minimum number of samples for each split.; Type: double | integer; Dimension: scalar
min_samples_leaf: The minimum number of samples required to be at a leaf node (default: 1). If the number of samples at a node is less than min_samples_leaf, the tree is not built further under that node. If integer, consider it as the minimum number; if float, (min_samples_leaf * number of samples) is taken as the minimum number of samples for each node.; Type: double | integer; Dimension: scalar
min_weight_fraction_leaf: The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node (default: 0).; Type: double; Dimension: scalar
max_features: The number of features to consider when looking for the best split (default: number of features in training data). If integer, max_features features are considered at each split; if float, floor(max_features * n_features) features are considered at each split.; Type: double | integer; Dimension: scalar
max_leaf_nodes: Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined by their reduction in impurity. If not assigned, the number of leaf nodes is unlimited.; Type: integer; Dimension: scalar
min_impurity_decrease: A node is split only if the split reduces the impurity by at least this value (default: 0).; Type: double; Dimension: scalar
bootstrap: Whether bootstrap samples are used when building trees. If false, the whole dataset is used to build each tree (default: true).; Type: Boolean; Dimension: logical
oob_score: Whether to use out-of-bag samples to estimate the generalization accuracy (default: false).; Type: Boolean; Dimension: logical
random_state: Controls the randomness of the model. random_state is the seed used by the random number generator.; Type: integer; Dimension: scalar
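
The criterion and max_features options above can be illustrated with a small sketch. This is written in Python for illustration only and is not rfcfit's implementation: the function names gini, entropy, and resolve_max_features are hypothetical, the impurity formulas are the standard textbook definitions, and the integer/float check is a Python convenience that assumes the caller passes an int for a count and a float for a fraction.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k)) over class proportions."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def resolve_max_features(max_features, n_features):
    """Hypothetical helper mirroring the documented max_features rule.

    Integer: use the value as the feature count.
    Float: use floor(max_features * n_features).
    """
    if isinstance(max_features, int):
        return max_features
    return math.floor(max_features * n_features)


if __name__ == "__main__":
    node = [0, 0, 1, 1]                   # evenly mixed two-class node
    print(gini(node), entropy(node))      # both measures peak for an even mix
    print(resolve_max_features(0.5, 4))   # fraction of 4 features
```

Both measures are zero for a pure node and grow as the classes in a node become more mixed, which is why either can be used to rank candidate splits.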

Outputs

parameters

Contains all the values passed to rfcfit as options. In addition, it contains the following key-value pairs.

Type: struct

scorer: Function handle pointing to 'accuracy' function.; Type: function handle
oob_score: Score of the training dataset obtained using an out-of-bag estimate. It is set only when oob_score = true in options.; Type: double; Dimension: scalar
classes: The class labels (single output problem), or a matrix of class labels (multi-output problem).; Type: double; Dimension: vector | matrix
n_samples: Number of rows in the training data.; Type: integer; Dimension: scalar
n_features: Number of columns in the training data.; Type: integer; Dimension: scalar

Example

Usage of rfcfit

data = dlmread('iris.csv', ',', 1);
X = data(:,1:end-1);
y = data(:,end);

parameters = rfcfit(X, y);

> parameters
parameters = struct [
  bootstrap: 1
  classes: [Matrix] 1 x 3
  0  1  2
  criterion: gini
  min_impurity_decrease: 0
  min_samples_leaf: 1
  min_samples_split: 2
  min_weight_fraction_leaf: 0
  n_estimators: 100
  n_features: 4
  n_samples: 150
  oob_score: oob_score not set to true while training
]

Comments

The sub-sample used to build each tree is always the same size as the original input; when bootstrap is true (default), the samples are drawn with replacement. If parameters such as max_depth and min_samples_leaf are left unassigned (so that their defaults are used), the trees are fully grown and unpruned, and they can be very large on some datasets. To reduce memory consumption, control the complexity and size of the trees by setting those parameters. The features are always randomly permuted at each split, so the best split found may vary even when max_features equals the number of features in the dataset and bootstrap = false. Fix random_state to obtain deterministic behaviour. The output 'parameters' should be passed as input to the rfcpredict function.
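
The bootstrap sampling and seeding behaviour described above can be sketched as follows. This is a Python illustration under stated assumptions, not the code rfcfit runs: bootstrap_indices and out_of_bag are hypothetical helper names, and Python's random module stands in for whatever random number generator the library uses.

```python
import random


def bootstrap_indices(n_samples, seed):
    """Draw n_samples row indices with replacement, using a fixed seed."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def out_of_bag(n_samples, indices):
    """Rows that were never drawn; these are the out-of-bag samples."""
    drawn = set(indices)
    return [i for i in range(n_samples) if i not in drawn]


if __name__ == "__main__":
    idx = bootstrap_indices(150, seed=0)
    oob = out_of_bag(150, idx)
    # The sub-sample has the same size as the input, but some rows repeat
    # and the rows that were left out form the out-of-bag set.
    print(len(idx), len(set(idx)), len(oob))
```

Calling bootstrap_indices twice with the same seed returns the same indices, which is the sense in which fixing random_state makes training repeatable.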