Introduction
The Custom Split Decision Tree (CSDT) is a flexible and customizable decision tree algorithm that allows users to define their own criteria for data splitting and prediction. Unlike traditional decision trees, CSDT gives users the ability to tailor these critical functionalities, enabling optimized solutions for a variety of problem types.
Advantages of CSDT
- Flexibility: Enables users to customize splitting and prediction functions.
- Customization: Adaptable to specific domain requirements.
- Performance Optimization: Achieves better splits and predictions tailored to the dataset.
- Visualization: Provides an intuitive way to visualize the decision tree, enhancing interpretability.
- Support for Diverse Data: Can handle multi-target problems efficiently.
 
Components of CSDT
1. Node Class
A node is the fundamental unit of a decision tree. The tree structure is built upon these nodes.
Node Attributes:
- right and left: Right and left child nodes.
- column and column_name: The index and name of the feature used for splitting.
- threshold: The threshold value for splitting.
- id: A unique identifier for the node.
- depth: The depth of the node in the tree.
- is_terminal: Indicates whether the node is a terminal (leaf) node.
- prediction: Stores the prediction value if the node is terminal.
- count: The number of data points at the node.
- split_details: Additional details about the split.
- class_counts: Class counts at this node (used for classification tasks).
- error: The error value for the node.
- best_score: The best score achieved during the splitting process.
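The attribute list above can be sketched as a minimal Python class. This is an illustrative reconstruction based solely on the attributes listed, not the library's actual definition; defaults and types are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Node:
    """Sketch of a CSDT node; fields mirror the attribute list above."""
    right: Optional["Node"] = None      # right child
    left: Optional["Node"] = None       # left child
    column: Optional[int] = None        # index of the splitting feature
    column_name: Optional[str] = None   # name of the splitting feature
    threshold: Optional[float] = None   # split threshold
    id: Optional[int] = None            # unique node identifier
    depth: int = 0                      # depth in the tree
    is_terminal: bool = False           # True for leaf nodes
    prediction: Any = None              # prediction stored at a leaf
    count: int = 0                      # number of data points at the node
    split_details: Optional[dict] = None
    class_counts: Optional[dict] = None # per-class counts (classification)
    error: Optional[float] = None       # error value at the node
    best_score: Optional[float] = None  # best score found while splitting
```

A leaf node, for example, would carry `is_terminal=True` along with its stored `prediction`, while internal nodes carry `column`, `threshold`, and child links instead.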
 
2. CSDT Class
The CSDT class manages the entire decision tree. It handles training, prediction, and visualization of the tree.
Attributes:
- max_depth: Maximum depth of the tree.
- min_samples_leaf: Minimum number of samples required in a leaf node.
- min_samples_split: Minimum number of samples required to split a node.
- split_criteria: User-defined function for evaluating split quality.
- max_features: Maximum number of features considered for splitting.
- random_state: Seed value for reproducibility.
- Tree: Stores the root node of the tree.
- use_hashmaps: Whether to use hashmaps for faster computations and lookups.
- use_multithreading: Enables parallel computation of feature splits to speed up tree construction.
 
3. User-Defined Split and Prediction Functions
The standout feature of CSDT is the ability for users to define custom split and prediction functions.
    from sklearn.metrics import mean_squared_error

    def calculate_mse(y, predictions, initial_solution):
        # Score a split by the mean squared error of the node's predictions.
        return mean_squared_error(y, predictions)

    def return_mean(y, x):
        # Predict the per-target mean of the labels reaching the node.
        return y.mean(axis=0)
    
CSDT in Practice
    tree = CSDT(
        max_depth=10,
        min_samples_leaf=5,
        min_samples_split=10,
        split_criteria=lambda y, x, initial_solutions: split_criteria_with_methods(
            y, x, pred=return_mean, split_criteria=calculate_mse,
            initial_solutions=initial_solutions
        ),
        use_hashmaps=True,
        use_initial_solution=False
    )
    tree.fit(features_df, labels_df)
    
Conclusion
CSDT stands apart from traditional decision trees by offering users full control over data splitting and prediction processes. This flexibility makes it a powerful tool for specialized tasks, such as multi-target regression, custom error metrics, and domain-specific applications. Beyond being a machine learning model, CSDT serves as a platform for developing custom solutions tailored to unique datasets and problem types.