static-frame

:Name: static-frame
:Version: 2.15.1
:Home page: https://github.com/static-frame/static-frame
:Download: https://files.pythonhosted.org/packages/44/47/13c2cc1fc62c046e06cea82a992d2014b560c10e972b772ad9c1b30fc662/static-frame-2.15.1.tar.gz
:Summary: Immutable and statically-typeable DataFrames with runtime type and data validation.
:Upload time: 2024-11-27 04:52:37
:Author: Christopher Ariza
:Requires Python: >=3.9
:License: MIT
:Keywords: staticframe pandas numpy immutable array
:Requirements: numpy, arraymap, arraykit, typing-extensions

Immutable and statically-typeable DataFrames with runtime type and data validation.

Among the many Python DataFrame libraries, StaticFrame is an alternative that prioritizes correctness, maintainability, and reducing opportunities for error. Key features include:

* 🛡️ Immutable Data: Provides memory efficiency, excellent performance, and prohibits side effects.
* 🗜️ Static Typing: Use Python type-hints to statically type index, columns, and columnar types.
* 🚦 Runtime Validation: Use type hints and specialized validators for runtime type and data checks.
* 🧭 Consistent Interface: An easy-to-learn, hierarchical, and intuitive API that avoids the many inconsistencies of Pandas.
* 🧬 Comprehensive ``dtype`` Support: Full compatibility with all NumPy dtypes and datetime64 units.
* 🔗 Broad Interoperability: Translate between Pandas, DuckDB, Arrow, Parquet, CSV, TSV, JSON, MessagePack, Excel XLSX, SQLite, HDF5, and NumPy; output to xarray, VisiData, HTML, RST, Markdown, LaTeX, and Jupyter notebooks.
* 🚀 Optimized Serialization & Memory Mapping: Fast disk I/O with custom NPZ and NPY encodings.
* 💼 Multi-Table Containers: The ``Bus`` and ``Yarn`` provide interfaces to collections of tables with lazy data loading, well-suited for large datasets.
* ⏳ Deferred Processing: The ``Batch`` provides a common interface for deferred processing of groups, windows, or any iterator.
* 🪶 Lean Dependencies: Core functionality relies only on NumPy and team-maintained C-extensions.
* 📚 Comprehensive Documentation: All API endpoints documented with thousands of easily runnable examples.


Code: https://github.com/static-frame/static-frame

Docs: http://static-frame.readthedocs.io

Packages: https://pypi.org/project/static-frame

API Search: https://staticframe.dev

Jupyter Notebook Tutorial: `Launch Binder <https://mybinder.org/v2/gh/static-frame/static-frame-ftgu/default?urlpath=tree/index.ipynb>`_



Installation via ``pip``
-------------------------------

Install StaticFrame with ``pip``. Note that pre-built wheels are published for all supported Python versions and platforms (including Apple Silicon platforms)::

    pip install static-frame

To install optional dependencies for full support of input and output formats (such as XLSX and HDF5) via ``pip``::

    pip install "static-frame[extras]"



Installation via ``conda``
-------------------------------

StaticFrame can be installed via ``conda`` with the ``conda-forge`` channel. Note that pre-built wheels of StaticFrame and all compiled dependencies are available through ``pip`` and may offer more compatibility than a ``conda``-based installation ::

    conda install -c conda-forge static-frame


Installation via Pyodide
-------------------------------

StaticFrame can be run in the browser via Pyodide with the ``static_frame_pyodide`` package: https://github.com/static-frame/static-frame-pyodide


Dependencies
--------------

Core StaticFrame requires the following:

- Python>=3.9
- numpy>=1.23.5 (numpy>=2 is supported)
- arraymap==0.4.0
- arraykit==0.10.0
- typing-extensions>=4.12.0

For extended input and output, the following packages are required:

- pandas>=1.1.5
- duckdb>=1.0.0
- xlsxwriter>=1.1.2
- openpyxl>=3.0.9
- xarray>=0.13.0
- tables>=3.9.1
- pyarrow>=3.0.0
- visidata>=2.4
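
To see which of the packages listed above are present in an environment, the standard library's ``importlib.metadata`` can be queried. This is a minimal sketch: the helper name ``installed_version`` is hypothetical and not part of StaticFrame, and it only reports installed versions rather than checking them against the minimums above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the two lists above.
CORE = ('numpy', 'arraymap', 'arraykit', 'typing-extensions')
EXTRAS = ('pandas', 'duckdb', 'xlsxwriter', 'openpyxl', 'xarray', 'tables', 'pyarrow', 'visidata')


def installed_version(name):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


report = {name: installed_version(name) for name in CORE + EXTRAS}
for name, found in report.items():
    print(f'{name}: {found or "not installed"}')
```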


Quick-Start Guide
---------------------

To get started quickly, let's download the classic iris (flower) characteristics data set and build a simple naive Bayes classifier that can predict species from iris sepal characteristics.

While StaticFrame's API has over 7,500 endpoints, much will be familiar to users of Pandas or other DataFrame libraries. Rather than offering fewer interfaces with greater configurability, StaticFrame favors more numerous interfaces with more narrow parameters and functionality. This design leads to more maintainable code. (Read more about differences between Pandas and StaticFrame `here <https://static-frame.readthedocs.io/en/latest/articles/upgrade.html>`__.)


We can download the data set from the UCI Machine Learning Repository and create a ``Frame``. StaticFrame exposes all constructors on the class: here, we will use the ``Frame.from_csv()`` constructor. To download a file from the internet and provide it to a constructor, we can use StaticFrame's ``WWW.from_file()`` interface::

    >>> import numpy as np
    >>> import static_frame as sf
    >>> data = sf.Frame.from_csv(sf.WWW.from_file('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'), columns_depth=0)


Each record (or row) in this dataset describes observations of an iris flower, including its sepal and petal characteristics, as well as its species (of which there are three). To display just the first few rows, we can use the ``head()`` method. Notice that StaticFrame's default display makes it very clear what type of ``Frame``, ``Index``, and NumPy datatypes are present::

    >>> data.head()
    <Frame>
    <Index> 0         1         2         3         4           <int64>
    <Index>
    0       5.1       3.5       1.4       0.2       Iris-setosa
    1       4.9       3.0       1.4       0.2       Iris-setosa
    2       4.7       3.2       1.3       0.2       Iris-setosa
    3       4.6       3.1       1.5       0.2       Iris-setosa
    4       5.0       3.6       1.4       0.2       Iris-setosa
    <int64> <float64> <float64> <float64> <float64> <<U15>
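
The integer column labels in this display arise because the file has no header row, which is what ``columns_depth=0`` communicates to the constructor. As a rough illustration that does not use StaticFrame, the same situation can be sketched with the standard library's ``csv`` module; the two records below are copied from the display above, written as comma-separated lines::

```python
import csv
import io

# Two records copied from the display above; there is no header row.
raw = '5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n'

rows = list(csv.reader(io.StringIO(raw)))

# With no header to draw on, columns are labelled by position: 0, 1, 2, ...
columns = list(range(len(rows[0])))

# Convert the four measurements to float and keep the species as a string.
records = [[float(v) for v in row[:4]] + [row[4]] for row in rows]

print(columns)
print(records[0])
```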


As the columns are unlabelled, let's next add column labels. StaticFrame supports reindexing (conforming existing axis labels to new labels, potentially changing the size and ordering) and relabeling (simply applying new labels without regard to existing labels). As we can ignore the default column labels (auto-incremented integers), the ``relabel()`` method is used to provide new labels.

Note that while ``relabel()`` creates a new ``Frame``, underlying NumPy data is not copied. As all NumPy data is immutable in StaticFrame, we can reuse it in our new container, making such operations very efficient::

    >>> data = data.relabel(columns=('sepal_l', 'sepal_w', 'petal_l', 'petal_w', 'species'))
    >>> data.head()
    <Frame>
    <Index> sepal_l   sepal_w   petal_l   petal_w   species     <<U7>
    <Index>
    0       5.1       3.5       1.4       0.2       Iris-setosa
    1       4.9       3.0       1.4       0.2       Iris-setosa
    2       4.7       3.2       1.3       0.2       Iris-setosa
    3       4.6       3.1       1.5       0.2       Iris-setosa
    4       5.0       3.6       1.4       0.2       Iris-setosa
    <int64> <float64> <float64> <float64> <float64> <<U15>


(Read more about no-copy operations `here <https://static-frame.readthedocs.io/en/latest/articles/no_copy.html>`__.)
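
The difference between relabeling and reindexing can be sketched with a toy table stored as a plain dictionary. The functions ``relabel`` and ``reindex`` below are hypothetical stand-ins written for this illustration, not StaticFrame's implementation; note also that Python lists are mutable, so reusing them as done here is only safe when, as in StaticFrame, the underlying data cannot change.

```python
# A toy table stored as {column label: column values}; values copied from the display above.
table = {0: [5.1, 4.9], 1: [3.5, 3.0]}


def relabel(columns, new_labels):
    """Apply new labels by position, ignoring old labels. Value lists are reused, not copied."""
    return dict(zip(new_labels, columns.values()))


def reindex(columns, new_labels, fill=None):
    """Conform to new labels by matching old labels; unmatched labels get a fill column."""
    size = len(next(iter(columns.values())))
    return {label: columns.get(label, [fill] * size) for label in new_labels}


relabeled = relabel(table, ('sepal_l', 'sepal_w'))
reindexed = reindex(table, (1, 0, 2))

print(relabeled)   # same values under new names
print(reindexed)   # reordered, with a filled column for the unknown label 2
```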

For this example, eighty percent of the data will be used to train the classifier; the remaining twenty percent will be used to test the classifier. As all records are labelled with the known species, we can conclude by measuring the effectiveness of the classifier on the test data.

To divide the data into two groups, we create a ``Series`` of contiguous integers and then extract a random selection of 80% of the values into a new ``Series``, here named ``sel_train``. This will be used to select our training data. As the ``sample()`` method, given a count, randomly samples that many values, your results will be different unless you use the same ``seed`` argument::

    >>> sel = sf.Series(np.arange(len(data)))
    >>> sel_train = sel.sample(round(len(data) * .8), seed=42)
    >>> sel_train.head()
    <Series>
    <Index>
    0        0
    2        2
    3        3
    4        4
    5        5
    <int64>  <int64>


We will create another ``Series`` to select the test data. The ``drop[]`` interface can be used to create a new ``Series`` that excludes the training selections, leaving just the testing selections. As with many interfaces in StaticFrame (such as ``astype`` and ``assign``), brackets can be used to do ``loc[]`` style selections::

    >>> sel_test = sel.drop[sel_train]
    >>> sel_test.head()
    <Series>
    <Index>
    1        1
    14       14
    20       20
    21       21
    37       37
    <int64>  <int64>
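
The logic of this split (sample 80% of the positions, then keep whatever was not sampled) can be sketched with the standard library alone. This is an illustration under stated assumptions: it uses Python's ``random`` module rather than StaticFrame, so even with the same seed value the chosen positions will not match ``sel_train`` above.

```python
import random

n_rows = 150                      # the iris data set has 150 records
rng = random.Random(42)           # fixed seed so the split is repeatable

positions = range(n_rows)
train = sorted(rng.sample(positions, round(n_rows * .8)))
train_set = set(train)
test = [p for p in positions if p not in train_set]   # everything not chosen for training

print(len(train), len(test))
```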


To select a subset of the data for training, the ``sel_train`` ``Series`` can be passed to ``loc[]`` to select just those rows::

    >>> data_train = data.loc[sel_train]
    >>> data_train.head()
    <Frame>
    <Index> sepal_l   sepal_w   petal_l   petal_w   species     <<U7>
    <Index>
    0       5.1       3.5       1.4       0.2       Iris-setosa
    2       4.7       3.2       1.3       0.2       Iris-setosa
    3       4.6       3.1       1.5       0.2       Iris-setosa
    4       5.0       3.6       1.4       0.2       Iris-setosa
    5       5.4       3.9       1.7       0.4       Iris-setosa
    <int64> <float64> <float64> <float64> <float64> <<U15>


With our data divided into two randomly-selected, non-overlapping groups, we can proceed to implement the naive Bayes classifier. We will compute the ``posterior`` of the test data by multiplying the ``prior`` and the ``likelihood``. With the ``posterior``, we can determine which species the classifier has calculated is most likely. (More on naive Bayes classifiers can be found `here <https://en.wikipedia.org/wiki/Naive_Bayes_classifier>`__.)

The ``prior`` is calculated as the percentage of samples of each species in the training data. This is the "normalized" count per species. To get a ``Series`` of counts per species, we can select the species column, iterate over groups based on species name, and count the size of each group.

In StaticFrame, this can be done by calling ``Series.iter_group_items()`` to get an iterator of pairs of group label, group (where the group is a ``Series``). This iterator (or any similar iterator) can be given to a ``Batch``, a chaining processor of ``Frame`` or ``Series``, to perform operations on each group. (For more on the ``Batch`` and other higher-order containers in StaticFrame, see `here <https://static-frame.readthedocs.io/en/latest/articles/uhoc.html>`__.)

Once the ``Batch`` is created, selections, method calls, and operator expressions can be chained as if they were being called on a single container. Processing is deferred: operations are applied to each contained container, and a container is returned, only when a finalizer method, such as ``to_series()``, is called::

    >>> counts = sf.Batch(data_train['species'].iter_group_items()).count().to_series()
    >>> counts
    <Series>
    <Index>
    Iris-setosa     43
    Iris-versicolor 39
    Iris-virginica  38
    <<U15>          <int64>


As with NumPy, StaticFrame containers can be used in expressions with binary operators. The ``prior`` can be derived by dividing ``counts`` by the size of the training data. This returns a ``Series`` of the percentage of records per species::

    >>> prior = counts / len(data_train)
    >>> prior
    <Series>
    <Index>
    Iris-setosa     0.35833333333333334
    Iris-versicolor 0.325
    Iris-virginica  0.31666666666666665
    <<U15>          <float64>
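
The arithmetic behind ``counts`` and ``prior`` can be checked in plain Python. The sketch below rebuilds a list of training labels from the counts shown above (43, 39, and 38) and normalizes by the total of 120; it uses ``collections.Counter`` in place of ``Batch`` and is only meant to make the calculation explicit.

```python
from collections import Counter

# Rebuild the training labels from the counts shown above (43, 39, 38).
labels = ['Iris-setosa'] * 43 + ['Iris-versicolor'] * 39 + ['Iris-virginica'] * 38

counts = Counter(labels)                                   # group sizes per species
prior = {species: n / len(labels) for species, n in counts.items()}

for species, p in prior.items():
    print(species, round(p, 4))
```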


Having calculated the ``prior``, we can calculate ``likelihood`` next. To calculate ``likelihood``, we will call a probability density function (imported from SciPy) with the test data, once for each species, given the characteristics (mean and standard deviation) observed in the training data for that species.

The ``Batch`` can again be used to calculate the mean and standard deviation, per species, from the training data. With the ``Frame`` of training data, we call ``iter_group_items()`` to group by species and, passing that iterator to ``Batch``, call ``mean()`` (assigned to ``mu``) or ``std()`` (assigned to ``sigma``). Note that ``iter_group_items()`` has an optional ``drop`` parameter to remove the column used for grouping from subsequent operations::


    >>> mu = sf.Batch(data_train[['sepal_l', 'sepal_w', 'species']].iter_group_items('species', drop=True)).mean().to_frame()
    >>> mu
    <Frame>
    <Index>         sepal_l            sepal_w            <<U7>
    <Index>
    Iris-setosa     4.986046511627907  3.434883720930233
    Iris-versicolor 5.920512820512819  2.771794871794872
    Iris-virginica  6.6078947368421055 2.9763157894736842
    <<U15>          <float64>          <float64>

    >>> sigma = sf.Batch(data_train[['sepal_l', 'sepal_w', 'species']].iter_group_items('species', drop=True)).std(ddof=1).to_frame()
    >>> sigma
    <Frame>
    <Index>         sepal_l            sepal_w             <<U7>
    <Index>
    Iris-setosa     0.3419700595003668 0.3477024733400345
    Iris-versicolor 0.508444214804487  0.33082728674826684
    Iris-virginica  0.6055516042229233 0.3513942965328924
    <<U15>          <float64>          <float64>
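
To make the group-then-aggregate step concrete, here is a plain-Python sketch using the standard library's ``statistics`` module. The ten ``sepal_l`` values are copied from displays in this guide and are used purely for illustration; they are far fewer rows than the training data, so the results will not match ``mu`` and ``sigma`` above.

```python
from collections import defaultdict
from statistics import mean, stdev

# (sepal_l, species) pairs copied from displays in this guide; illustration only.
rows = [
    (5.1, 'Iris-setosa'), (4.7, 'Iris-setosa'), (4.6, 'Iris-setosa'),
    (5.0, 'Iris-setosa'), (5.4, 'Iris-setosa'),
    (7.2, 'Iris-virginica'), (7.4, 'Iris-virginica'), (6.7, 'Iris-virginica'),
    (6.7, 'Iris-virginica'), (5.9, 'Iris-virginica'),
]

groups = defaultdict(list)
for value, species in rows:
    groups[species].append(value)      # collect values under their species label

mu = {species: mean(values) for species, values in groups.items()}
# statistics.stdev divides by n - 1, the same convention as std(ddof=1).
sigma = {species: stdev(values) for species, values in groups.items()}

print(mu)
print(sigma)
```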


For a unified display of these characteristics, we can build a hierarchical index on each ``Frame`` with ``relabel_level_add()`` (adding the "mu" or "sigma" labels), then vertically concatenate the tables. As StaticFrame always requires unique labels in indices, adding an additional label is required before concatenation. The built-in ``round`` function can be used for more tidy display::

    >>> stats = sf.Frame.from_concat((mu.relabel_level_add('mu'), sigma.relabel_level_add('sigma')))
    >>> round(stats, 2)
    <Frame>
    <Index>                          sepal_l   sepal_w   <<U7>
    <IndexHierarchy>
    mu               Iris-setosa     4.99      3.43
    mu               Iris-versicolor 5.92      2.77
    mu               Iris-virginica  6.61      2.98
    sigma            Iris-setosa     0.34      0.35
    sigma            Iris-versicolor 0.51      0.33
    sigma            Iris-virginica  0.61      0.35
    <<U5>            <<U15>          <float64> <float64>


We can now move on to processing the test data with the characteristics derived from the training data. To do that, we will extract our previously selected test records with ``sel_test`` into a new ``Frame``, to which we can add our ``posterior`` predictions and final species classifications.

It is common to process data in a table by adding columns from left to right. StaticFrame permits this limited form of mutability with the grow-only ``FrameGO``. While underlying NumPy arrays are still always immutable, columns can be added to a ``FrameGO`` with bracket-style assignments. A ``FrameGO`` can be created from a ``Frame`` with the ``to_frame_go()`` method. As mentioned elsewhere, underlying immutable NumPy arrays are not copied: this is an efficient, no-copy operation.

Passing two arguments to ``loc[]``, we can select rows with the values from ``sel_test``, and we can select columns with a list of labels for the sepal length and sepal width::

    >>> data_test = data.loc[sel_test.values, ['sepal_l', 'sepal_w']].to_frame_go()
    >>> data_test.head()
    <FrameGO>
    <IndexGO> sepal_l   sepal_w   <<U7>
    <Index>
    1         4.9       3.0
    14        5.8       4.0
    20        5.4       3.4
    21        5.1       3.7
    37        4.9       3.1
    <int64>   <float64> <float64>
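
The idea of a grow-only container can be sketched in a few lines of plain Python. The class ``GrowOnlyColumns`` is hypothetical and written only for this illustration; it is not how ``FrameGO`` is implemented, and the choice of ``KeyError`` for a rejected assignment is an assumption of the sketch. Values are copied from displays in this guide.

```python
class GrowOnlyColumns:
    """Toy grow-only table: columns can be added, never replaced or removed."""

    def __init__(self, columns):
        # Store each column as a tuple so its values cannot be changed in place.
        self._columns = {label: tuple(values) for label, values in columns.items()}

    def __setitem__(self, label, values):
        if label in self._columns:
            raise KeyError(f'column {label!r} already exists')
        self._columns[label] = tuple(values)

    def __getitem__(self, label):
        return self._columns[label]

    def labels(self):
        return tuple(self._columns)


table = GrowOnlyColumns({'sepal_l': (4.9, 5.8), 'sepal_w': (3.0, 4.0)})
table['predict'] = ('Iris-setosa', 'Iris-setosa')   # adding a new column is allowed

try:
    table['sepal_l'] = (0.0, 0.0)                   # replacing an existing column is not
except KeyError as error:
    print('rejected:', error)
```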


StaticFrame interfaces make extensive use of iterators and generators. As used below, the ``Frame.from_fields()`` constructor will create a ``Frame`` from any iterable (or generator) of column arrays.

The ``likelihood_of_species()`` function (defined below), for each index label in ``mu`` (which provides each unique iris species), evaluates the normal probability density function for the test data, given the ``mu`` (mean) and ``sigma`` (standard deviation) for the species. For each species, an array holding the per-record sum of the log densities is yielded::

    >>> from scipy.stats import norm
    >>> def likelihood_of_species():
    ...     for label in mu.index:
    ...         pdf = norm.pdf(data_test.values, mu.loc[label], sigma.loc[label])
    ...         yield np.log(pdf).sum(axis=1)


While the generator function above is easy to read, it is hard to copy and paste into an interactive session. If you are following along, the one-line generator expression below will be easier to enter. It produces the same arrays as calling ``likelihood_of_species()``, and it rebinds the name ``likelihood_of_species`` to the generator itself::

    >>> likelihood_of_species = (np.log(norm.pdf(data_test.values, mu.loc[label], sigma.loc[label])).sum(axis=1) for label in mu.index)


With this generator expression defined, we call the ``from_fields`` constructor to produce the ``likelihood`` table, providing column labels from ``mu.index`` and index labels from ``data_test.index``. For each test record row we now have a likelihood per species::

    >>> likelihood = sf.Frame.from_fields(likelihood_of_species, columns=mu.index, index=data_test.index)
    >>> round(likelihood.head(), 2)
    <Frame>
    <Index> Iris-setosa Iris-versicolor Iris-virginica <<U15>
    <Index>
    1       -0.52       -2.31           -4.27
    14      -3.86       -6.97           -5.42
    20      -0.45       -2.38           -3.01
    21      -0.05       -5.29           -5.51
    37      -0.2        -2.56           -4.33
    <int64> <float64>   <float64>       <float64>
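
The first row of this table can be reproduced by hand. The sketch below writes out the log of the normal density with the standard library's ``math`` module (in place of ``scipy.stats.norm``) and applies it to the test record with index label 1, using the ``mu`` and ``sigma`` values displayed earlier. The helper name ``log_norm_pdf`` is introduced here for illustration.

```python
import math


def log_norm_pdf(x, mu, sigma):
    """Log of the normal probability density at x."""
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - (x - mu) ** 2 / (2 * sigma ** 2)


# Means and standard deviations (sepal_l, sepal_w) copied from mu and sigma above.
mu = {
    'Iris-setosa': (4.986046511627907, 3.434883720930233),
    'Iris-versicolor': (5.920512820512819, 2.771794871794872),
    'Iris-virginica': (6.6078947368421055, 2.9763157894736842),
}
sigma = {
    'Iris-setosa': (0.3419700595003668, 0.3477024733400345),
    'Iris-versicolor': (0.508444214804487, 0.33082728674826684),
    'Iris-virginica': (0.6055516042229233, 0.3513942965328924),
}

record = (4.9, 3.0)   # sepal_l and sepal_w of the test record with index label 1

# Sum the per-feature log densities, treating the two features as independent.
likelihood = {
    species: sum(log_norm_pdf(x, m, s) for x, m, s in zip(record, mu[species], sigma[species]))
    for species in mu
}

for species, value in likelihood.items():
    print(species, round(value, 2))
```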


We can calculate the ``posterior`` by multiplying ``likelihood`` by ``prior``. Whenever performing binary operations on ``Frame`` and ``Series``, indices will be aligned and, if necessary, reindexed before processing::

    >>> posterior = likelihood * prior
    >>> round(posterior.head(), 2)
    <Frame>
    <Index> Iris-setosa Iris-versicolor Iris-virginica <<U15>
    <Index>
    1       -0.19       -0.75           -1.35
    14      -1.38       -2.27           -1.72
    20      -0.16       -0.77           -0.95
    21      -0.02       -1.72           -1.75
    37      -0.07       -0.83           -1.37
    <int64> <float64>   <float64>       <float64>


We can now add columns to our ``data_test`` ``FrameGO``. To determine our best prediction of species for each row of the test data, the column label (the species) of the maximum a posteriori estimate is selected with ``loc_max()``::

    >>> data_test['predict'] = posterior.loc_max(axis=1)
    >>> data_test.head()
    <FrameGO>
    <IndexGO> sepal_l   sepal_w   predict     <<U7>
    <Index>
    1         4.9       3.0       Iris-setosa
    14        5.8       4.0       Iris-setosa
    20        5.4       3.4       Iris-setosa
    21        5.1       3.7       Iris-setosa
    37        4.9       3.1       Iris-setosa
    <int64>   <float64> <float64> <<U15>
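
The label-aligned multiplication and the row-wise selection of the maximum can be traced in plain Python. This sketch uses dictionaries in place of ``Frame`` and ``Series`` and takes the rounded ``likelihood`` values for index labels 1 and 14 from the display above, so intermediate values agree with the ``posterior`` table only to about two decimal places.

```python
# Prior per species, from counts / len(data_train) above.
prior = {'Iris-setosa': 43 / 120, 'Iris-versicolor': 39 / 120, 'Iris-virginica': 38 / 120}

# Rounded likelihood rows copied from the display above, keyed by index label.
likelihood = {
    1: {'Iris-setosa': -0.52, 'Iris-versicolor': -2.31, 'Iris-virginica': -4.27},
    14: {'Iris-setosa': -3.86, 'Iris-versicolor': -6.97, 'Iris-virginica': -5.42},
}

posterior = {}
predict = {}
for label, row in likelihood.items():
    # Multiply each column by the prior that carries the matching species label.
    posterior[label] = {species: value * prior[species] for species, value in row.items()}
    # Keep the column label that holds the row maximum.
    predict[label] = max(posterior[label], key=posterior[label].get)

print(predict)
```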


We can add two additional columns to evaluate the effectiveness of the classifier. First, we can add an "observed" column by adding the original "species" column from the original ``data`` ``Frame``. In assigning a ``Series`` to a ``Frame``, only values found in the intersection of the indices will be added as a column::

    >>> data_test['observed'] = data['species']
    >>> data_test.head()
    <FrameGO>
    <IndexGO> sepal_l   sepal_w   predict     observed    <<U8>
    <Index>
    1         4.9       3.0       Iris-setosa Iris-setosa
    14        5.8       4.0       Iris-setosa Iris-setosa
    20        5.4       3.4       Iris-setosa Iris-setosa
    21        5.1       3.7       Iris-setosa Iris-setosa
    37        4.9       3.1       Iris-setosa Iris-setosa
    <int64>   <float64> <float64> <<U15>      <<U15>


Having populated a column of predicted and observed values, we can compare the two to get a Boolean column indicating when the classifier calculated a correct prediction::

    >>> data_test['correct'] = data_test['predict'] == data_test['observed']
    >>> data_test.tail()
    <FrameGO>
    <IndexGO> sepal_l   sepal_w   predict         observed       correct <<U8>
    <Index>
    129       7.2       3.0       Iris-virginica  Iris-virginica True
    130       7.4       2.8       Iris-virginica  Iris-virginica True
    140       6.7       3.1       Iris-virginica  Iris-virginica True
    144       6.7       3.3       Iris-virginica  Iris-virginica True
    149       5.9       3.0       Iris-versicolor Iris-virginica False
    <int64>   <float64> <float64> <<U15>          <<U15>         <bool>


To find the percentage of correct classifications among the test data, we can sum the ``correct`` Boolean column and divide that by the size of the test data::

    >>> data_test["correct"].sum() / len(data_test)
    0.7333333333333333

This simple naive Bayes classifier can predict iris species correctly about 73% of the time.
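
Summing a Boolean column works because ``True`` counts as 1 and ``False`` as 0. As a small check of the arithmetic, the sketch below assumes 22 correct predictions among the 30 test records; that count is inferred from the reported value of 0.7333 and the 150 - 120 = 30 held-out rows, and the ordering of the values is illustrative only.

```python
# 22 correct out of 30 is inferred from the reported accuracy; the ordering is illustrative.
correct = [True] * 22 + [False] * 8

# True counts as 1 and False as 0, so the sum is the number of correct predictions.
accuracy = sum(correct) / len(correct)

print(round(accuracy, 4))
```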

For further introduction to StaticFrame, including links to articles, videos, and documentation, see `here <https://static-frame.readthedocs.io/en/latest/intro.html>`__.

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/static-frame/static-frame",
    "name": "static-frame",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "staticframe pandas numpy immutable array",
    "author": "Christopher Ariza",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/44/47/13c2cc1fc62c046e06cea82a992d2014b560c10e972b772ad9c1b30fc662/static-frame-2.15.1.tar.gz",
    "platform": null,
    "description": "Immutable and statically-typeable DataFrames with runtime type and data validation.\n\nAmong the many Python DataFrame libraries, StaticFrame is an alternative that prioritizes correctness, maintainability, and reducing opportunities for error. Key features include:\n\n* \ud83d\udee1\ufe0f Immutable Data: Provides memory efficiency, excellent performance, and prohibits side effects.\n* \ud83d\udddc\ufe0f Static Typing: Use Python type-hints to statically type index, columns, and columnar types.\n* \ud83d\udea6 Runtime Validation: Use type hints and specialized validators for runtime type and data checks.\n* \ud83e\udded Consistent Interface: An easy-to-learn, hierarchical, and intuitive API that avoids the many inconsistencies of Pandas.\n* \ud83e\uddec Comprehensive ``dtype`` Support: Full compatibility with all NumPy dtypes and datetime64 units.\n* \ud83d\udd17 Broad Interoperability: Translate between Pandas, DuckDB, Arrow, Parquet, CSV, TSV, JSON, MessagePack, Excel XLSX, SQLite, HDF5, and NumPy; output to xarray, VisiData, HTML, RST, Markdown, LaTeX, and Jupyter notebooks.\n* \ud83d\ude80 Optimized Serialization & Memory Mapping: Fast disk I/O with custom NPZ and NPY encodings.\n* \ud83d\udcbc Multi-Table Containers: The ``Bus`` and ``Yarn`` provide interfaces to collections of tables with lazy data loading, well-suited for large datasets.\n* \u23f3 Deferred Processing: The ``Batch`` provides a common interface for deferred processing of groups, windows, or any iterator.\n* \ud83e\udeb6 Lean Dependencies: Core functionality relies only on NumPy and team-maintained C-extensions.\n* \ud83d\udcda Comprehensive Documentation: All API endpoints documented with thousands of easily runnable examples.\n\n\nCode: https://github.com/static-frame/static-frame\n\nDocs: http://static-frame.readthedocs.io\n\nPackages: https://pypi.org/project/static-frame\n\nAPI Search: https://staticframe.dev\n\nJupyter Notebook Tutorial: `Launch Binder 
<https://mybinder.org/v2/gh/static-frame/static-frame-ftgu/default?urlpath=tree/index.ipynb>`_\n\n\n\nInstallation via ``pip``\n-------------------------------\n\nInstall StaticFrame with ``pip``. Note that pre-built wheels are published for all supported Python versions and platforms (including Apple Silicon platforms)::\n\n    pip install static-frame\n\nTo install optional dependencies for full support of input and output formats (such as XLSX and HDF5) via ``pip``::\n\n    pip install static-frame [extras]\n\n\n\nInstallation via ``conda``\n-------------------------------\n\nStaticFrame can be installed via ``conda`` with the ``conda-forge`` channel. Note that pre-built wheels of StaticFrame and all compiled dependencies are available through ``pip`` and may offer more compatibility than a ``conda``-based installation ::\n\n    conda install -c conda-forge static-frame\n\n\nInstallation via Pyodide\n-------------------------------\n\nStaticFrame can be run in the browser via Pyodide with the ``static_frame_pyodide`` package: https://github.com/static-frame/static-frame-pyodide\n\n\nDependencies\n--------------\n\nCore StaticFrame requires the following:\n\n- Python>=3.9\n- numpy>=1.23.5 (numpy>=2 is supported)\n- arraymap==0.4.0\n- arraykit==0.10.0\n- typing-extensions>=4.12.0\n\nFor extended input and output, the following packages are required:\n\n- pandas>=1.1.5\n- duckdb>=1.0.0\n- xlsxwriter>=1.1.2\n- openpyxl>=3.0.9\n- xarray>=0.13.0\n- tables>=3.9.1\n- pyarrow>=3.0.0\n- visidata>=2.4\n\n\nQuick-Start Guide\n---------------------\n\nTo get startred quickly, let's download the classic iris (flower) characteristics data set and build a simple naive Bayes classifier that can predict species from iris petal characteristics.\n\nWhile StaticFrame's API has over 7,500 endpoints, much will be familiar to users of Pandas or other DataFrame libraries. 
Rather than offering fewer interfaces with greater configurability, StaticFrame favors more numerous interfaces with more narrow parameters and functionality. This design leads to more maintainable code. (Read more about differences between Pandas and StaticFrame `here <https://static-frame.readthedocs.io/en/latest/articles/upgrade.html>`__.)\n\n\nWe can download the data set from the UCI Machine Learning Repository and create a ``Frame``. StaticFrame exposes all constructors on the class: here, we will use the ``Frame.from_csv()`` constructor. To download a file from the internet and provide it to a constructor, we can use StaticFrame's ``WWW.from_file()`` interface::\n\n    >>> import static_frame as sf\n    >>> data = sf.Frame.from_csv(sf.WWW.from_file('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'), columns_depth=0)\n\n\nEach record (or row) in this dataset describes observations of an iris flower, including its sepal and petal characteristics, as well as its species (of which there are three). To display just the first few rows, we can use the ``head()`` method. Notice that StaticFrame's default display makes it very clear what type of ``Frame``, ``Index``, and NumPy datatypes are present::\n\n    >>> data.head()\n    <Frame>\n    <Index> 0         1         2         3         4           <int64>\n    <Index>\n    0       5.1       3.5       1.4       0.2       Iris-setosa\n    1       4.9       3.0       1.4       0.2       Iris-setosa\n    2       4.7       3.2       1.3       0.2       Iris-setosa\n    3       4.6       3.1       1.5       0.2       Iris-setosa\n    4       5.0       3.6       1.4       0.2       Iris-setosa\n    <int64> <float64> <float64> <float64> <float64> <<U15>\n\n\nAs the columns are unlabelled, let's next add column labels. 
StaticFrame supports reindexing (conforming existing axis labels to new labels, potentially changing the size and ordering) and relabeling (simply applying new labels without regard to existing labels). As we can ignore the default column labels (auto-incremented integers), the ``relabel()`` method is used to provide new labels.\n\nNote that while ``relabel()`` creates a new ``Frame``, underlying NumPy data is not copied. As all NumPy data is immutable in StaticFrame, we can reuse it in our new container, making such operations very efficient::\n\n    >>> data = data.relabel(columns=('sepal_l', 'sepal_w', 'petal_l', 'petal_w', 'species'))\n    >>> data.head()\n    <Frame>\n    <Index> sepal_l   sepal_w   petal_l   petal_w   species     <<U7>\n    <Index>\n    0       5.1       3.5       1.4       0.2       Iris-setosa\n    1       4.9       3.0       1.4       0.2       Iris-setosa\n    2       4.7       3.2       1.3       0.2       Iris-setosa\n    3       4.6       3.1       1.5       0.2       Iris-setosa\n    4       5.0       3.6       1.4       0.2       Iris-setosa\n    <int64> <float64> <float64> <float64> <float64> <<U15>\n\n\n(Read more about no-copy operations `here <https://static-frame.readthedocs.io/en/latest/articles/no_copy.html>`__.)\n\nFor this example, eighty percent of the data will be used to train the classifier; the remaining twenty percent will be used to test the classifier. As all records are labelled with the known species, we can conclude by measuring the effectiveness of the classifier on the test data.\n\nTo divide the data into two groups, we create a ``Series`` of contiguous integers and then extract a random selection of 80% of the values into a new ``Series``, here named ``sel_train``. This will be used to select our traning data. 
As the ``sample()`` method, given a count, randomly samples that many values, your results will be different unless use the same ``seed`` argument::\n\n    >>> sel = sf.Series(np.arange(len(data)))\n    >>> sel_train = sel.sample(round(len(data) * .8), seed=42)\n    >>> sel_train.head()\n    <Series>\n    <Index>\n    0        0\n    2        2\n    3        3\n    4        4\n    5        5\n    <int64>  <int64>\n\n\nWe will create another ``Series`` to select the test data. The ``drop[]`` interface can be used to create a new ``Series`` that excludes the training selections, leaving just the testing selections. As with many interfaces in StaticFrame (such as ``astype`` and ``assign``), brackets can be used to do ``loc[]`` style selections::\n\n    >>> sel_test = sel.drop[sel_train]\n    >>> sel_test.head()\n    <Series>\n    <Index>\n    1        1\n    14       14\n    20       20\n    21       21\n    37       37\n    <int64>  <int64>\n\n\nTo select a subset of the data for training, the ``sel_train`` ``Series`` can be passed to ``loc[]`` to select just those rows::\n\n    >>> data_train = data.loc[sel_train]\n    >>> data_train.head()\n    <Frame>\n    <Index> sepal_l   sepal_w   petal_l   petal_w   species     <<U7>\n    <Index>\n    0       5.1       3.5       1.4       0.2       Iris-setosa\n    2       4.7       3.2       1.3       0.2       Iris-setosa\n    3       4.6       3.1       1.5       0.2       Iris-setosa\n    4       5.0       3.6       1.4       0.2       Iris-setosa\n    5       5.4       3.9       1.7       0.4       Iris-setosa\n    <int64> <float64> <float64> <float64> <float64> <<U15>\n\n\nWith our data divided into two randomly-selected, non-overlapping groups, we can proceed to implement the naive Bayes classifier. We will compute the ``posterior`` of the test data by multiplying the ``prior`` and the ``likelihood``. With the ``posterior``, we can determine which species the classifier has calculated is most likely. 
(More on naive Bayes classifiers can be found `here <https://en.wikipedia.org/wiki/Naive_Bayes_classifier>`__.)\n\nThe ``prior`` is calculated as the percentage of samples of each species in the training data. This is the \"normalized\" count per species. To get a ``Series`` of counts per species, we can select the species column, iterate over groups based on species name, and count the size of each group.\n\nIn StaticFrame, this can be done by calling ``Series.iter_group_items()`` to get an iterator of pairs of group label, group (where the group is a ``Series``). This iterator (or any similar iterator) can be given to a ``Batch``, a chaining processor of ``Frame`` or ``Series``, to perform operations on each group. (For more on the ``Batch`` and other higher-order containers in StaticFrame, see `here <https://static-frame.readthedocs.io/en/latest/articles/uhoc.html>`__.)\n\nOnce the ``Batch`` is created, selections, method calls, and operator expressions can be chained as if they were being called on a single container. Processing happens to every contained container, and a container is returned, only when a finalizer method, such as ``to_series()``, is called::\n\n    >>> counts = sf.Batch(data_train['species'].iter_group_items()).count().to_series()\n    >>> counts\n    <Series>\n    <Index>\n    Iris-setosa     43\n    Iris-versicolor 39\n    Iris-virginica  38\n    <<U15>          <int64>\n\n\nAs with NumPy, StaticFrame containers can be used in expressions with binary operators. The ``prior`` can be derived by dividing ``counts`` by the size of the training data. This returns a ``Series`` of the percentage of records per species::\n\n    >>> prior = counts / len(data_train)\n    >>> prior\n    <Series>\n    <Index>\n    Iris-setosa     0.35833333333333334\n    Iris-versicolor 0.325\n    Iris-virginica  0.31666666666666665\n    <<U15>          <float64>\n\n\nHaving calculated the ``prior``, we can calculate ``likelihood`` next. 
To calculate ``likelihood``, we will call a probability density function (imported from SciPy) with the test data, once for each species, given the characteristics (mean and standard deviation) observed in the training data for that species.\n\nThe ``Batch`` can again be used to calculate the mean and standard deviation, per species, from the training data. With the ``Frame`` of training data, we call ``iter_group_items()`` to group by species and, passing that iterator to ``Batch``, call ``mean()`` (assigned to ``mu``) or ``std()`` (assigned to ``sigma``). Note that ``iter_group_items()`` has an optional ``drop`` parameter to remove the column used for grouping from subsequent operations::\n\n\n    >>> mu = sf.Batch(data_train[['sepal_l', 'sepal_w', 'species']].iter_group_items('species', drop=True)).mean().to_frame()\n    >>> mu\n    <Frame>\n    <Index>         sepal_l            sepal_w            <<U7>\n    <Index>\n    Iris-setosa     4.986046511627907  3.434883720930233\n    Iris-versicolor 5.920512820512819  2.771794871794872\n    Iris-virginica  6.6078947368421055 2.9763157894736842\n    <<U15>          <float64>          <float64>\n\n    >>> sigma = sf.Batch(data_train[['sepal_l', 'sepal_w', 'species']].iter_group_items('species', drop=True)).std(ddof=1).to_frame()\n    >>> sigma\n    <Frame>\n    <Index>         sepal_l            sepal_w             <<U7>\n    <Index>\n    Iris-setosa     0.3419700595003668 0.3477024733400345\n    Iris-versicolor 0.508444214804487  0.33082728674826684\n    Iris-virginica  0.6055516042229233 0.3513942965328924\n    <<U15>          <float64>          <float64>\n\n\nFor a unified display of these characteristics, we can build a hierarchical index on each ``Frame`` with ``relabel_level_add()`` (adding the \"mu\" or \"sigma\" labels), then vertically concatenate the tables. As StaticFrame always requires unique labels in indices, adding an additional label is required before concatenation. 
The built-in ``round`` function can be used for a tidier display::\n\n    >>> stats = sf.Frame.from_concat((mu.relabel_level_add('mu'), sigma.relabel_level_add('sigma')))\n    >>> round(stats, 2)\n    <Frame>\n    <Index>                          sepal_l   sepal_w   <<U7>\n    <IndexHierarchy>\n    mu               Iris-setosa     4.99      3.43\n    mu               Iris-versicolor 5.92      2.77\n    mu               Iris-virginica  6.61      2.98\n    sigma            Iris-setosa     0.34      0.35\n    sigma            Iris-versicolor 0.51      0.33\n    sigma            Iris-virginica  0.61      0.35\n    <<U5>            <<U15>          <float64> <float64>\n\n\nWe can now move on to processing the test data with the characteristics derived from the training data. To do that, we will extract our previously selected test records with ``sel_test`` into a new ``Frame``, to which we can add our ``posterior`` predictions and final species classifications.\n\nIt is common to process data in a table by adding columns from left to right. StaticFrame permits this limited form of mutability with the grow-only ``FrameGO``. While underlying NumPy arrays are still always immutable, columns can be added to a ``FrameGO`` with bracket-style assignments. A ``FrameGO`` can be created from a ``Frame`` with the ``to_frame_go()`` method. 
As mentioned elsewhere, underlying immutable NumPy arrays are not copied: this is an efficient, no-copy operation.\n\nPassing two arguments to ``loc[]``, we can select rows with the values from ``sel_test``, and we can select columns with a list of labels for the sepal length and sepal width::\n\n    >>> data_test = data.loc[sel_test.values, ['sepal_l', 'sepal_w']].to_frame_go()\n    >>> data_test.head()\n    <FrameGO>\n    <IndexGO> sepal_l   sepal_w   <<U7>\n    <Index>\n    1         4.9       3.0\n    14        5.8       4.0\n    20        5.4       3.4\n    21        5.1       3.7\n    37        4.9       3.1\n    <int64>   <float64> <float64>\n\n\nStaticFrame interfaces make extensive use of iterators and generators. As used below, the ``Frame.from_fields()`` constructor will create a ``Frame`` from any iterable (or generator) of column arrays.\n\nThe ``likelihood_of_species()`` function (defined below), for each index label in ``mu`` (which provides each unique iris species), calculates a probability density function for the test data, given the ``mu`` (mean) and ``sigma`` (standard deviation) for the species. An array of the sum of the log is yielded::\n\n    >>> from scipy.stats import norm\n    >>> def likelihood_of_species():\n    ...     for label in mu.index:\n    ...         pdf = norm.pdf(data_test.values, mu.loc[label], sigma.loc[label])\n    ...         yield np.log(pdf).sum(axis=1)\n\n\nWhile the generator function above is easy to read, it is hard to copy and paste. If you are following along, using the one-line generator expression, below, will be easier. It yields the same arrays as calling the function above::\n\n    >>> likelihood_of_species = (np.log(norm.pdf(data_test.values, mu.loc[label], sigma.loc[label])).sum(axis=1) for label in mu.index)\n\n\nWith this generator expression defined, we call the ``from_fields`` constructor to produce the ``likelihood`` table, providing column labels from ``mu.index`` and index labels from ``data_test.index``. 
For each test record row we now have a likelihood per species::\n\n    >>> likelihood = sf.Frame.from_fields(likelihood_of_species, columns=mu.index, index=data_test.index)\n    >>> round(likelihood.head(), 2)\n    <Frame>\n    <Index> Iris-setosa Iris-versicolor Iris-virginica <<U15>\n    <Index>\n    1       -0.52       -2.31           -4.27\n    14      -3.86       -6.97           -5.42\n    20      -0.45       -2.38           -3.01\n    21      -0.05       -5.29           -5.51\n    37      -0.2        -2.56           -4.33\n    <int64> <float64>   <float64>       <float64>\n\n\nWe can calculate the ``posterior`` by multiplying ``likelihood`` by ``prior``. Whenever performing binary operations on ``Frame`` and ``Series``, indices will be aligned and, if necessary, reindexed before processing::\n\n    >>> posterior = likelihood * prior\n    >>> round(posterior.head(), 2)\n    <Frame>\n    <Index> Iris-setosa Iris-versicolor Iris-virginica <<U15>\n    <Index>\n    1       -0.19       -0.75           -1.35\n    14      -1.38       -2.27           -1.72\n    20      -0.16       -0.77           -0.95\n    21      -0.02       -1.72           -1.75\n    37      -0.07       -0.83           -1.37\n    <int64> <float64>   <float64>       <float64>\n\n\nWe can now add columns to our ``data_test`` ``FrameGO``. 
To determine our best prediction of species for each row of the test data, the column label (the species) of the maximum a posteriori estimate is selected with ``loc_max()``::\n\n    >>> data_test['predict'] = posterior.loc_max(axis=1)\n    >>> data_test.head()\n    <FrameGO>\n    <IndexGO> sepal_l   sepal_w   predict     <<U7>\n    <Index>\n    1         4.9       3.0       Iris-setosa\n    14        5.8       4.0       Iris-setosa\n    20        5.4       3.4       Iris-setosa\n    21        5.1       3.7       Iris-setosa\n    37        4.9       3.1       Iris-setosa\n    <int64>   <float64> <float64> <<U15>\n\n\nWe can add two additional columns to evaluate the effectiveness of the classifier. First, we can add an \"observed\" column by adding the original \"species\" column from the original ``data`` ``Frame``. In assigning a ``Series`` to a ``Frame``, only values found in the intersection of the indices will be added as a column::\n\n    >>> data_test['observed'] = data['species']\n    >>> data_test.head()\n    <FrameGO>\n    <IndexGO> sepal_l   sepal_w   predict     observed    <<U8>\n    <Index>\n    1         4.9       3.0       Iris-setosa Iris-setosa\n    14        5.8       4.0       Iris-setosa Iris-setosa\n    20        5.4       3.4       Iris-setosa Iris-setosa\n    21        5.1       3.7       Iris-setosa Iris-setosa\n    37        4.9       3.1       Iris-setosa Iris-setosa\n    <int64>   <float64> <float64> <<U15>      <<U15>\n\n\nHaving populated a column of predicted and observed values, we can compare the two to get a Boolean column indicating when the classifier calculated a correct prediction::\n\n    >>> data_test['correct'] = data_test['predict'] == data_test['observed']\n    >>> data_test.tail()\n    <FrameGO>\n    <IndexGO> sepal_l   sepal_w   predict         observed       correct <<U8>\n    <Index>\n    129       7.2       3.0       Iris-virginica  Iris-virginica True\n    130       7.4       2.8       Iris-virginica  Iris-virginica 
True\n    140       6.7       3.1       Iris-virginica  Iris-virginica True\n    144       6.7       3.3       Iris-virginica  Iris-virginica True\n    149       5.9       3.0       Iris-versicolor Iris-virginica False\n    <int64>   <float64> <float64> <<U15>          <<U15>         <bool>\n\n\nTo find the percentage of correct classifications among the test data, we can sum the ``correct`` Boolean column and divide that by the size of the test data::\n\n    >>> data_test[\"correct\"].sum() / len(data_test)\n    0.7333333333333333\n\nThis simple naive Bayes classifier can predict iris species correctly about 73% of the time.\n\nFor further introduction to StaticFrame, including links to articles, videos, and documentation, see `here <https://static-frame.readthedocs.io/en/latest/intro.html>`__.\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Immutable and statically-typeable DataFrames with runtime type and data validation.",
    "version": "2.15.1",
    "project_urls": {
        "Homepage": "https://github.com/static-frame/static-frame"
    },
    "split_keywords": [
        "staticframe",
        "pandas",
        "numpy",
        "immutable",
        "array"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "8cb0db1d597859cea3eb7c537eca8cec59e329777775184cd9ef50fba53e3b78",
                "md5": "386157f3eab7e9babe04f0dcbee3e778",
                "sha256": "9394e1985ebdb9a95554f52beac9d29edb09631298dcc17cd2eafb83c42922e6"
            },
            "downloads": -1,
            "filename": "static_frame-2.15.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "386157f3eab7e9babe04f0dcbee3e778",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 779675,
            "upload_time": "2024-11-27T04:52:33",
            "upload_time_iso_8601": "2024-11-27T04:52:33.523866Z",
            "url": "https://files.pythonhosted.org/packages/8c/b0/db1d597859cea3eb7c537eca8cec59e329777775184cd9ef50fba53e3b78/static_frame-2.15.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "444713c2cc1fc62c046e06cea82a992d2014b560c10e972b772ad9c1b30fc662",
                "md5": "31f533f3b27095af9231e9a051b24e33",
                "sha256": "e12cd83c723cbc85659e0c67012b9ce0bce351c6a26ab796765182e46be5c349"
            },
            "downloads": -1,
            "filename": "static-frame-2.15.1.tar.gz",
            "has_sig": false,
            "md5_digest": "31f533f3b27095af9231e9a051b24e33",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 723638,
            "upload_time": "2024-11-27T04:52:37",
            "upload_time_iso_8601": "2024-11-27T04:52:37.215084Z",
            "url": "https://files.pythonhosted.org/packages/44/47/13c2cc1fc62c046e06cea82a992d2014b560c10e972b772ad9c1b30fc662/static-frame-2.15.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-11-27 04:52:37",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "static-frame",
    "github_project": "static-frame",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "numpy",
            "specs": [
                [
                    ">=",
                    "1.23.5"
                ]
            ]
        },
        {
            "name": "arraymap",
            "specs": [
                [
                    "==",
                    "0.4.0"
                ]
            ]
        },
        {
            "name": "arraykit",
            "specs": [
                [
                    "==",
                    "0.10.0"
                ]
            ]
        },
        {
            "name": "typing-extensions",
            "specs": [
                [
                    ">=",
                    "4.12.0"
                ]
            ]
        }
    ],
    "lcname": "static-frame"
}
        