# Description
Build random forests for large data sets using CUDA.
This is the GPU-enabled version of [brif](https://pypi.org/project/brif/).
The same program is available on [CRAN](https://cran.r-project.org/web/packages/brif/index.html) for R users.
# Build from source
## Prerequisites
An NVIDIA graphics/compute card must be present, and the [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit) must be installed.
For Windows, the Microsoft Visual Studio [Build Tools for C++](https://learn.microsoft.com/en-us/visualstudio/msbuild/msbuild?view=vs-2022) must be installed. For Linux and macOS, a C++ build toolchain (e.g., gcc) is required.
The Python [build](https://pypa-build.readthedocs.io/en/stable/) package is required; it can be installed via
```bash
pip install build
```
The pandas and numpy packages are also required; they can be installed via
```bash
pip install pandas numpy
```
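Before building, it may help to confirm the prerequisites are in place. The following is a minimal sketch (not part of the cubrif build) that checks whether the CUDA compiler `nvcc` is on the PATH and that pandas and numpy can be imported; the exact version requirements are not specified here.
```python
# Quick prerequisite check: nvcc on PATH, pandas/numpy importable.
import shutil
import subprocess

import numpy as np
import pandas as pd

if shutil.which("nvcc") is None:
    print("nvcc not found; install the CUDA Toolkit and make sure it is on the PATH.")
else:
    # Print the CUDA compiler's version banner.
    print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)

print("numpy", np.__version__, "| pandas", pd.__version__)
```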
## Build and install on Windows
Clone (or download as zip and extract) this project to a local directory.
From the Windows search bar, find the "x64 Native Tools Command Prompt for VS 2022" and run it as administrator.
In the command window thus opened, cd into the project root directory and run
```bash
mkdir build
cd build
cmake ../
```
If successful, the file cubrif.sln (among other files) will be generated. Then run
```bash
MSBuild.exe cubrif.sln /p:Configuration=Release
```
If successful, several files will be created in the Release subfolder. The important ones are cubrif.lib, cubrif.dll, and cubrif_main.exe: cubrif.lib is used when building the Python package, cubrif.dll is loaded at runtime, and cubrif_main.exe is a standalone executable.
Copy cubrif.lib to the project root directory:
```bash
copy Release\cubrif.lib ..\
```
Now go back to the project root and build the Python package as follows:
```bash
cd ..
python -m build
```
If successful, the package, e.g., cubrif-1.4.0.tar.gz, will be created in the dist subfolder.
Install the package by
```bash
pip install dist/cubrif-1.4.0.tar.gz
```
To use the package, cubrif.dll must be visible to Python, for example:
```python
import os
os.add_dll_directory("C:/path/to/project/build/Release")
from cubrif import cubrif
```
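If the build directory differs across machines, the DLL path can be taken from an environment variable rather than hard-coded. A minimal sketch, assuming a user-defined CUBRIF_DLL_DIR variable (a hypothetical convention, not part of the package):
```python
import os

# CUBRIF_DLL_DIR is a hypothetical convention: set it to the folder that
# contains cubrif.dll (e.g., the build\Release directory from the steps above).
dll_dir = os.environ.get("CUBRIF_DLL_DIR", r"C:\path\to\project\build\Release")
if not os.path.isdir(dll_dir):
    raise FileNotFoundError(f"cubrif.dll directory not found: {dll_dir}")

os.add_dll_directory(dll_dir)  # Windows-only, Python 3.8+

from cubrif import cubrif
```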
## Build and install on Ubuntu
The build process is similar, but uses make instead of MSBuild.exe, and the shared library produced is libcubrif.so instead of cubrif.dll.
```bash
mkdir build
cd build
cmake ../
make
cp libcubrif.so ../
cd ..
python3 -m build
pip install dist/cubrif-1.4.0.tar.gz
```
If "python3 -m build" does not work in the step above, use the equivalent command
```bash
python3 setup.py sdist bdist_wheel
```
To use the package, *libcubrif.so* must be visible to Python. Note that os.add_dll_directory(), used in the Windows instructions above, is available on Windows only; on Ubuntu, either copy libcubrif.so to /usr/lib or add its directory to the dynamic loader's search path (e.g., via the LD_LIBRARY_PATH environment variable) before starting Python. For example,
```bash
sudo cp libcubrif.so /usr/lib
```
# Usage Examples
```python
from cubrif import cubrif
import pandas as pd
# Create a cubrif object with default parameters.
bf = cubrif.cubrif()
# Display the current parameter values.
bf.get_param()
# To change certain parameter values, e.g.:
bf.set_param({'ntrees':10, 'nthreads':2, 'GPU':1})
# Or simply:
bf.ntrees = 50
# Load input data frame. Data must be a pandas data frame with appropriate headers.
df = pd.read_csv("auto.csv")
# Train the model
bf.fit(df, 'origin') # specify the target column name
# Or equivalently
bf.fit(df, 7) # specify the target column index
# Make predictions
# The target variable column must be excluded, and all other columns should appear in the same order as in training
# Here, predict the first 10 rows of df
pred_labels = bf.predict(df.iloc[0:10, 0:7], type='class') # return a list containing the predicted class labels
pred_scores = bf.predict(df.iloc[0:10, 0:7], type='score') # return a data frame containing predicted probabilities by class
# Note: for a regression problem (i.e., when the response variable is of numeric type), predict always returns a list of predicted values
```
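For a quick check of out-of-sample performance, the same fit/predict API can be applied to a random train/holdout split. A minimal sketch, assuming the auto.csv example above with the target column 'origin' at column index 7; the parameter values are illustrative only:
```python
from cubrif import cubrif
import pandas as pd

df = pd.read_csv("auto.csv")  # example data assumed above; target column is 'origin'

# Simple random train/holdout split (no cross-validation).
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)

bf = cubrif.cubrif()
bf.set_param({'ntrees': 100, 'GPU': 0})  # GPU off: the example data set is small
bf.fit(train, 'origin')

# Predictors only, in the same column order as in training (target excluded).
pred = bf.predict(test.iloc[:, 0:7], type='class')

# Compare as strings in case the predicted labels come back as level names.
accuracy = (pd.Series(pred, index=test.index).astype(str)
            == test['origin'].astype(str)).mean()
print(f"holdout accuracy: {accuracy:.3f}")
```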
# Parameters
**tmp_preddata**
a character string specifying a filename to save the temporary scoring data. Default is "tmp_brif_preddata.txt".
**n_numeric_cuts**
an integer value indicating the maximum number of split points to generate for each numeric variable.
**n_integer_cuts**
an integer value indicating the maximum number of split points to generate for each integer variable.
**max_integer_classes**
an integer value. If the target variable is an integer and has more than max_integer_classes unique values in the training data, then the target variable will be grouped into max_integer_classes bins. If the target variable is numeric, then the number of bins created on the target variable is the smaller of max_integer_classes and the number of unique values, and the regression problem is solved as a classification problem.
**max_depth**
an integer specifying the maximum depth of each tree. Maximum is 40.
**min_node_size**
an integer specifying the minimum number of training cases a leaf node must contain.
**ntrees**
an integer specifying the number of trees in the forest.
**ps**
an integer indicating the number of predictors to sample at each node split. Default is 0, meaning to use sqrt(p), where p is the number of predictors in the input.
**max_factor_levels**
an integer. If any factor variable has more than max_factor_levels levels, the program stops and prompts the user to increase the value of this parameter if the large number of levels is indeed intended.
**seed**
a positive integer, random number generator seed.
**nthreads**
an integer specifying the number of threads used by the program. This parameter takes effect only on systems supporting OpenMP.
**blocksize**
an integer specifying the CUDA thread block size. Must be a multiple of 64, and no more than 1024.
**GPU**
an integer (0, 1, or 2). 0: do not use the GPU (for small datasets, e.g., fewer than 100,000 rows, using the GPU is slower). 1: always use the GPU. 2: use the GPU to evaluate splits only when the node size is greater than or equal to n_lb_GPU. See the configuration sketch after this parameter list.
**n_lb_GPU**
an integer specifying the threshold number of rows in the training data to use GPU for training. This parameter takes effect only when GPU = 2.
**vote_method**
an integer (0 or 1) specifying the voting method used in prediction. 0: each leaf contributes its raw counts, and an average is taken over the sum across all leaves; 1: each leaf contributes an intra-node fraction, which is then averaged over all leaves with equal weight.
**na_numeric**
a numeric value, substitute for 'nan' in numeric variables.
**na_integer**
an integer value, substitute for 'nan' in integer variables.
**na_factor**
a character string, substitute for missing values in factor variables.
**type**
a character string indicating the return content of the predict function. For a classification problem, "score" means the by-class probabilities and "class" means the class labels (i.e., the target variable levels). For regression, the predicted values are returned. This is a parameter for the predict function, not an attribute of the brif object.
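As a concrete illustration of the GPU-related parameters above, the following sketch configures a forest that offloads split evaluation to the GPU only for sufficiently large nodes (GPU = 2). All values are illustrative, not recommendations:
```python
from cubrif import cubrif

bf = cubrif.cubrif()
bf.set_param({
    'ntrees': 200,       # number of trees in the forest
    'max_depth': 20,     # per-tree depth cap (the maximum allowed is 40)
    'nthreads': 8,       # CPU threads (effective only with OpenMP support)
    'GPU': 2,            # use the GPU only for sufficiently large nodes
    'n_lb_GPU': 100000,  # node-size threshold for switching to the GPU
    'blocksize': 128,    # CUDA thread block size: multiple of 64, at most 1024
    'seed': 2023,        # random number generator seed
})
bf.get_param()  # verify the values took effect
```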