AdvancedAnalytics
===================

A collection of Python modules, classes, and methods that simplify the use of machine learning solutions. **AdvancedAnalytics** provides easy access to advanced tools in **Sci-Learn**, **NLTK**, and other machine learning packages. **AdvancedAnalytics** was developed to simplify learning Python from the book *The Art and Science of Data Analytics*.
Description
===========

From a high-level view, building machine learning applications typically proceeds through three stages:

 1. Data Preprocessing
 2. Modeling or Analytics
 3. Postprocessing

The classes and methods in **AdvancedAnalytics** primarily support the first and last stages of machine learning applications.
Data scientists report that they spend 80% of their total effort on the first and last stages. The first stage, *data preprocessing*, is concerned with preparing the data for analysis. This includes:

 1. identifying and correcting outliers,
 2. imputing missing values, and
 3. encoding data.

The last stage, *solution postprocessing*, involves developing graphic summaries of the solution and metrics for evaluating its quality.
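The first two preprocessing steps above can be sketched in plain Python. This is a minimal illustration of the idea, not the package's actual API; the helper functions below are hypothetical, and **AdvancedAnalytics** drives the same behavior from a data map instead.

```python
from statistics import median

def replace_outliers(values, low, high):
    """Treat values outside [low, high] as outliers: replace them with None."""
    return [v if v is not None and low <= v <= high else None
            for v in values]

def impute_median(values):
    """Fill missing (None) entries with the median of the valid entries."""
    valid = [v for v in values if v is not None]
    m = median(valid)
    return [m if v is None else v for v in values]

salaries = [52000.0, 61000.0, None, 9.0e7, 48000.0]   # 9e7 is an outlier
cleaned = replace_outliers(salaries, 20000.0, 2000000.0)
imputed = impute_median(cleaned)
# cleaned -> [52000.0, 61000.0, None, None, 48000.0]
# imputed -> [52000.0, 61000.0, 52000.0, 52000.0, 48000.0]
```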
Documentation and Examples
============================

The API and documentation for all classes, together with examples, are available at https://github.com/tandonneur/AdvancedAnalytics/.
Usage
=====

Currently the most popular usage is supporting solutions developed with these advanced machine learning packages:

 * Sci-Learn
 * StatsModels
 * NLTK

The intention is to expand this list to other packages. The following is a simple example of regression with a decision tree, using the data map structure to preprocess the data:
.. code-block:: python

    from AdvancedAnalytics.ReplaceImputeEncode import DT
    from AdvancedAnalytics.ReplaceImputeEncode import ReplaceImputeEncode
    from AdvancedAnalytics.Tree import tree_regressor
    from sklearn.tree import DecisionTreeRegressor

    # Data Map Using DT, Data Types
    data_map = {
        "Salary":         [DT.Interval, (20000.0, 2000000.0)],
        "Department":     [DT.Nominal,  ("HR", "Sales", "Marketing")],
        "Classification": [DT.Nominal,  (1, 2, 3, 4, 5)],
        "Years":          [DT.Interval, (18, 60)]}

    # Preprocess data from data frame df
    rie = ReplaceImputeEncode(data_map=data_map, interval_scaling=None,
                              nominal_encoding="SAS", drop=True)
    encoded_df = rie.fit_transform(df)

    y = encoded_df["Salary"]
    X = encoded_df.drop("Salary", axis=1)
    # "gini" is a classification criterion; regression trees use
    # "squared_error" (called "mse" in older versions of sklearn)
    dt = DecisionTreeRegressor(criterion="squared_error", max_depth=4,
                               min_samples_split=5, min_samples_leaf=5)
    dt = dt.fit(X, y)

    tree_regressor.display_importance(dt, encoded_df.columns)
    tree_regressor.display_metrics(dt, X, y)
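For intuition about the encoding step, nominal encoding replaces each categorical column with indicator columns. Below is a minimal one-hot sketch in plain Python; the exact columns ReplaceImputeEncode produces (particularly with SAS-style encoding) may differ, so treat this as an illustration only.

```python
def one_hot(values, categories, drop_last=False):
    """Encode a nominal column as 0/1 indicator columns, one per category.
    With drop_last=True the final category is omitted, since it is implied
    by the others (mirroring the drop=True option above)."""
    cats = categories[:-1] if drop_last else categories
    return {c: [1 if v == c else 0 for v in values] for c in cats}

dept = ["HR", "Sales", "HR", "Marketing"]
encoded = one_hot(dept, ("HR", "Sales", "Marketing"), drop_last=True)
# encoded["HR"]    -> [1, 0, 1, 0]
# encoded["Sales"] -> [0, 1, 0, 0]
# "Marketing" is dropped: rows with both indicators 0 imply it
```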
Current Modules and Classes
=============================

ReplaceImputeEncode
    Classes for data preprocessing.

    * DT defines the data types used in the data dictionary
    * ReplaceImputeEncode is a class for data preprocessing

Regression
    Classes for linear and logistic regression.

    * linreg supports linear regression
    * logreg supports logistic regression
    * stepwise is a variable selection class

Tree
    Classes for decision tree solutions.

    * tree_regressor supports regression decision trees
    * tree_classifier supports classification decision trees

Forest
    Classes for random forests.

    * forest_regressor supports regression random forests
    * forest_classifier supports classification random forests

NeuralNetwork
    Classes for neural networks.

    * nn_regressor supports regression neural networks
    * nn_classifier supports classification neural networks

Text
    Classes for text analytics.

    * text_analysis supports topic analysis
    * text_plot generates word clouds
    * sentiment_analysis supports sentiment analysis

Internet
    Classes for internet applications.

    * scrape supports web scraping
    * metrics is a class for solution metrics
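Word clouds such as those produced by *text_plot* are driven by term frequencies. Computing those frequencies is straightforward with the standard library; the sketch below shows the kind of input a word cloud needs, and is not *text_plot*'s actual interface.

```python
import re
from collections import Counter

STOP_WORDS = frozenset({"the", "a", "an", "of", "and", "is"})

def term_frequencies(text):
    """Count how often each word occurs, ignoring case and stop words.
    A frequency table like this is the usual input to a word cloud."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

freq = term_frequencies("Data analytics is the art and the science of data")
# freq["data"] == 2; every other remaining word occurs once
```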
Installation and Dependencies
=============================

**AdvancedAnalytics** is designed to work on any operating system running Python 3. It can be installed using **pip** or **conda**.

.. code-block:: bash

    pip install AdvancedAnalytics
    # or
    conda install -c dr.jones AdvancedAnalytics
General Dependencies
    Most classes import one or more modules from **Sci-Learn**,
    referenced as *sklearn* in module imports, and from **StatsModels**.
    Both are installed with the current version of **anaconda**.

Installed with AdvancedAnalytics
    Most packages used by **AdvancedAnalytics** are automatically
    installed with it. They include the following packages.

    * statsmodels
    * scikit-learn
    * scikit-image
    * nltk
    * pydotplus
Other Dependencies
    The *Tree* and *Forest* modules plot decision trees and importance
    metrics using the **pydotplus** and **graphviz** packages. These
    should also be automatically installed with **AdvancedAnalytics**.

    However, the **graphviz** install is sometimes incomplete
    under conda. It may require an additional pip install.

    .. code-block:: bash

        pip install graphviz
Text Analytics Dependencies
    The *TextAnalytics* module uses the **NLTK**, **Sci-Learn**, and
    **wordcloud** packages. Usually these are also automatically
    installed with **AdvancedAnalytics**. You can verify
    they are installed using the following commands.

    .. code-block:: bash

        conda list nltk
        conda list scikit-learn
        conda list wordcloud
    However, installing the **NLTK** package does not install the data
    used by the package. To load the **NLTK** data, run the following
    code once before using the *TextAnalytics* module.

    .. code-block:: python

        # The following NLTK commands should be run once
        import nltk
        nltk.download("punkt")
        nltk.download("averaged_perceptron_tagger")
        nltk.download("stopwords")
        nltk.download("wordnet")
    The **wordcloud** package also uses a little-known package,
    **tinysegmenter** version 0.3. Run the following code to ensure
    it is installed.

    .. code-block:: bash

        conda install -c conda-forge tinysegmenter==0.3
        # or
        pip install tinysegmenter==0.3
Internet Dependencies
    The *Internet* module contains a class, *scrape*, with
    functions for scraping newsfeeds. Some of these use the
    **newspaper3k** package, which should be automatically installed with
    **AdvancedAnalytics**.

    However, the module also uses the **newsapi-python** package, which is
    not automatically installed. If you intend to use this news scraping
    tool, install the package using the following code:

    .. code-block:: bash

        conda install -c conda-forge newsapi-python
        # or
        pip install newsapi-python
    In addition, the newsapi service is operated by a commercial company,
    www.newsapi.com. You will need to register with them to obtain an
    *API* key, which is required to access the service. This is free of
    charge for developers, but there is a fee if *newsapi* is used to
    broadcast news within an application or at a website.

Code of Conduct
---------------

Everyone interacting in the AdvancedAnalytics project's codebases, issue trackers, chat rooms, and mailing lists is expected to follow the PyPA Code of Conduct: https://www.pypa.io/en/latest/code-of-conduct/.