![greylock logo](images/diversity_logo.png)

# *greylock*: A Python package for measuring the composition of complex datasets

[![Python version](https://img.shields.io/badge/python-3.8%20%7C%203.9%20%7C%203.10-blue)](https://www.python.org/downloads/release/python-380/)
[![Tests](https://github.com/ArnaoutLab/diversity/actions/workflows/tests.yml/badge.svg)](https://github.com/ArnaoutLab/diversity/actions/workflows/tests.yml)

- [About](#about)
  - [Definitions](#definitions)
  - [Partitioned diversity](#partitioned-diversity)
  - [Frequency-sensitive diversity](#frequency-sensitive-diversity)
  - [Similarity-sensitive diversity](#similarity-sensitive-diversity)
  - [Rescaled diversity indices](#rescaled-diversity-indices)
  - [One package to rule them all](#one-package-to-rule-them-all)
- [Basic usage](#basic-usage)
  - [Alpha diversities](#alpha-diversities)
  - [Beta diversities](#beta-diversities)
- [Advanced usage](#advanced-usage)
- [Command-line usage](#command-line-usage)
- [Applications](#applications)
- [Alternatives](#alternatives)

# About

`greylock` calculates effective numbers in an extended version of the Hill framework, with extensions due to Leinster and Cobbold and to Reeve et al. “Extending” a hill makes a mountain: at 3,489 feet (1,063 meters), Mount Greylock is Massachusetts’ tallest mountain. It is named for Gray Lock (c. 1670–1750), a historical figure of the Abnaki, an indigenous people of New England.

## Availability and installation
The package is available on GitHub at https://github.com/ArnaoutLab/diversity. It can be installed by running

`pip install greylock`

from the command-line interface. The test suite runs successfully on macOS, Windows, and Unix systems. The unit tests (including a coverage report) can be run after installation with:

```
pip install 'greylock[tests]'
pytest --pyargs greylock --cov greylock
```

## How to cite this work

If you use this package, please cite it as:

Nguyen et al., *greylock*. <https://github.com/ArnaoutLab/diversity>

## Definitions

A ***community*** is a collection of elements called ***individuals***, each of which is assigned a label called its ***species***, where multiple individuals may have the same species. An example of a community is all the animals and plants living in a lake. A ***metacommunity*** consists of several communities. An example of a metacommunity is all the animals in a lake split into different depths. Each community that makes up a metacommunity is called a ***subcommunity***.

Even though the terms metacommunity and subcommunity originate in ecology, we use them in a broader sense. If one is interested in analyzing a subset of a dataset, then the subset is a subcommunity and the entire dataset is the metacommunity. Alternatively, if one is interested in how individual datasets (e.g. from individual research subjects) compare to all datasets used in a study, the individual datasets are subcommunities and the set of all datasets is the metacommunity. (When there is only a single dataset under study, we use “subcommunity” and “metacommunity” interchangeably as convenient.)

A ***diversity index*** is a statistic associated with a community that describes how much the species of its individuals vary. For example, a community of many individuals of the same species has a very low diversity, whereas a community with multiple species and the same number of individuals per species has a high diversity.

## Partitioned diversity

Some diversity indices compare the diversities of the subcommunities with respect to the overall metacommunity. For example, two subcommunities with the same frequency distribution but no shared species each comprise half of the combined metacommunity diversity.

## Frequency-sensitive diversity

[In 1973, Hill introduced a framework](https://doi.org/10.2307/1934352) which unifies commonly used diversity indices into a single parameterized family of diversity measures. The so-called ***viewpoint parameter*** can be thought of as the sensitivity to rare species. At one end of the spectrum, when the viewpoint parameter is set to 0, species frequency is ignored entirely, and only the number of distinct species matters, while at the other end of the spectrum, when the viewpoint parameter is set to $\infty$, only the highest frequency species in a community is considered by the corresponding diversity measure. Common diversity measures such as ***species richness***, ***Shannon entropy***, the ***Gini-Simpson index***, and the ***Berger-Parker index*** have simple and natural relationships with Hill's indices at different values for the viewpoint parameter ($0$, $1$, $2$, $\infty$, respectively).
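
For concreteness, the frequency-only Hill number of order $q$ for species frequencies $p_i$ is $^qD = \left(\sum_i p_i^q\right)^{1/(1-q)}$, with the $q \to 1$ limit equal to the exponential of Shannon entropy. The following standalone NumPy sketch (an illustration of the formula, not part of the `greylock` API) reproduces the values reported for Dataset 1a in the Basic usage section:

```python
import numpy as np

def hill_number(counts, q):
    """Frequency-only Hill number of order q (standalone illustration,
    not part of the greylock API)."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    if q == 1:                       # limit q -> 1: exponential of Shannon entropy
        return np.exp(-np.sum(p * np.log(p)))
    if np.isinf(q):                  # limit q -> inf: inverse of the Berger-Parker index
        return 1 / p.max()
    return np.sum(p ** q) ** (1 / (1 - q))

# Dataset 1a from Basic usage: 30 apples plus one each of five other fruits
counts = [30, 1, 1, 1, 1, 1]
print(hill_number(counts, 0))        # 6.0   (species richness)
print(hill_number(counts, 1))        # ~1.90 (exp of Shannon entropy)
print(hill_number(counts, np.inf))   # ~1.17 (1 / frequency of most common species)
```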

## Similarity-sensitive diversity

In addition to being sensitive to frequency, it often makes sense to account for similarity in a diversity measure. For example, a community of two different types of rodents may be considered less diverse than a community where one of the rodent species was replaced by the same number of individuals of a bird species. [Reeve et al.](https://arxiv.org/abs/1404.6520) and [Leinster and Cobbold](https://doi.org/10.1890/10-2402.1) present a general mathematically rigorous way of incorporating similarity measures into Hill's framework. The result is a family of similarity-sensitive diversity indices parameterized by the same viewpoint parameter as well as the similarity function used for the species in the meta- or subcommunities of interest. These similarity-sensitive diversity measures account for both the pairwise similarity between all species and their frequencies.

## Rescaled diversity indices

In addition to the diversity measures introduced by Reeve et al., we have also included two new rescaled measures, $\hat{\rho}$ and $\hat{\beta}$, as well as their metacommunity counterparts. The motivation for introducing these measures is that $\rho$ can become very large when the number of subcommunities is large; similarly, $\beta$ can become very small. The rescaled versions are designed to remain of order unity even when there are many subcommunities.

## One package to rule them all

The `greylock` package is able to calculate all of the similarity- and frequency-sensitive subcommunity and metacommunity diversity measures described in [Reeve et al.](https://arxiv.org/abs/1404.6520). See the paper for more in-depth information on their derivation and interpretation.


**Supported subcommunity diversity measures**:

  - $\alpha$ - diversity of subcommunity $j$ in isolation, per individual
  - $\bar{\alpha}$ - diversity of subcommunity $j$ in isolation
  - $\rho$ - redundancy of subcommunity $j$
  - $\bar{\rho}$ - representativeness of subcommunity $j$
  - $\hat{\rho}$ - rescaled version of redundancy ($\rho$)
  - $\beta$ - distinctiveness of subcommunity $j$
  - $\bar{\beta}$ - effective number of distinct subcommunities
  - $\hat{\beta}$ - rescaled version of distinctiveness ($\beta$) 
  - $\gamma$ - contribution of subcommunity $j$ toward metacommunity diversity


**Supported metacommunity diversity measures**:
  - $A$ - naive-community metacommunity diversity
  - $\bar{A}$ - average diversity of subcommunities
  - $R$ - average redundancy of subcommunities
  - $\bar{R}$ - average representativeness of subcommunities
  - $\hat{R}$ - average rescaled redundancy of subcommunities
  - $B$ - average distinctiveness of subcommunities
  - $\bar{B}$ - effective number of distinct subcommunities
  - $\hat{B}$ - average rescaled distinctiveness of subcommunities
  - $G$ - metacommunity diversity
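
As a map from notation to code: the `measure` strings accepted by `subcommunity_diversity` and `metacommunity_diversity` appear to match the column names of the `to_dataframe` output shown in the Basic usage section, with the `normalized_` prefix corresponding to the barred measures and the `_hat` suffix to the rescaled ones. A minimal sketch under that assumption:

```python
import pandas as pd
from greylock import Metacommunity

# A toy metacommunity with two subcommunities and two species
counts = pd.DataFrame({"subcommunity_1": [2, 1], "subcommunity_2": [0, 3]},
                      index=["species_a", "species_b"])
mc = Metacommunity(counts)

# Measure strings assumed to match the to_dataframe column names below
for measure in ("alpha", "normalized_alpha", "rho", "normalized_rho",
                "beta", "normalized_beta", "rho_hat", "beta_hat", "gamma"):
    print(measure, mc.subcommunity_diversity(viewpoint=1, measure=measure))
```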


# Basic usage
## Alpha diversities 

We illustrate the basic usage of `greylock` on simple, field-of-study-agnostic datasets of fruits and animals. First, consider two datasets of size $n=35$, each containing counts of six types of fruit: apples, oranges, bananas, pears, blueberries, and grapes.

<img src='images/fruits-1.png' width='350'>

Dataset 1a is mostly apples; in Dataset 1b, all fruits are represented at almost identical frequencies. The frequencies of the fruits in each dataset are tabulated below:

|           | Dataset 1a | Dataset 1b | 
| :-------- | ---------: | ---------: | 
| apple     |         30 |          6 | 
| orange    |          1 |          6 |
| banana    |          1 |          6 |
| pear      |          1 |          6 |
| blueberry |          1 |          6 |
| grape     |          1 |          5 |
| total     |         35 |         35 | 

A frequency-sensitive metacommunity can be created in Python by passing a `counts` DataFrame to a `Metacommunity` object:

```python
import pandas as pd
import numpy as np
from greylock import Metacommunity

counts_1a = pd.DataFrame({"Dataset 1a": [30, 1, 1, 1, 1, 1]}, 
   index=["apple", "orange", "banana", "pear", "blueberry", "grape"])

metacommunity_1a = Metacommunity(counts_1a)
```

Once a metacommunity has been created, diversity measures can be calculated. For example, to calculate $D_1$, we type:

```python
metacommunity_1a.subcommunity_diversity(viewpoint=1, measure='alpha')
```

The output shows that $D_1=1.90$. To calculate the corresponding metacommunity diversity index:

```python
metacommunity_1a.metacommunity_diversity(viewpoint=1, measure='alpha')
```

In this example, the metacommunity indices are the same as the subcommunity ones, since there is only one subcommunity. To calculate multiple diversity measures at once and store them in a DataFrame, we type:

```python 
metacommunity_1a.to_dataframe(viewpoint=[0, 1, np.inf])
```

which produces the following output:

|      | community     | viewpoint | alpha |  rho | beta | gamma | normalized_alpha | normalized_rho | normalized_beta | rho_hat | beta_hat |
| ---: | :------------ | --------: | ----: | ---: | ---: | ----: | ---------------: | -------------: | --------------: | ------: | -------: |
|    0 | metacommunity |      0.00 |  6.00 | 1.00 | 1.00 |  6.00 |             6.00 |           1.00 |            1.00 |    1.00 |     1.00 |
|    1 | Dataset 1a    |      0.00 |  6.00 | 1.00 | 1.00 |  6.00 |             6.00 |           1.00 |            1.00 |    1.00 |     1.00 |
|    2 | metacommunity |      1.00 |  1.90 | 1.00 | 1.00 |  1.90 |             1.90 |           1.00 |            1.00 |    1.00 |     1.00 |
|    3 | Dataset 1a    |      1.00 |  1.90 | 1.00 | 1.00 |  1.90 |             1.90 |           1.00 |            1.00 |    1.00 |     1.00 |
|    4 | metacommunity |       inf |  1.17 | 1.00 | 1.00 |  1.17 |             1.17 |           1.00 |            1.00 |    1.00 |     1.00 |
|    5 | Dataset 1a    |       inf |  1.17 | 1.00 | 1.00 |  1.17 |             1.17 |           1.00 |            1.00 |    1.00 |     1.00 |


Next, let us repeat this for Dataset 1b. Again, we make the `counts` DataFrame and a `Metacommunity` object:

```python
counts_1b = pd.DataFrame({"Dataset 1b": [6, 6, 6, 6, 6, 5]},
    index=["apple", "orange", "banana", "pear", "blueberry", "grape"])

metacommunity_1b = Metacommunity(counts_1b)
```

To obtain $D_1$, we run:

```python
metacommunity_1b.subcommunity_diversity(viewpoint=1, measure='alpha')
```

We find that $D_1 \approx 5.99$ for Dataset 1b. The larger value of $D_1$ for Dataset 1b aligns with the intuitive sense that more balance in the frequencies of unique elements means a more diverse dataset. To output multiple diversity measures at once, we run:

```python
metacommunity_1b.to_dataframe(viewpoint=[0, 1, np.inf])
```

which produces the output:

|      | community     | viewpoint | alpha |  rho | beta | gamma | normalized_alpha | normalized_rho | normalized_beta | rho_hat | beta_hat |
| ---: | :------------ | --------: | ----: | ---: | ---: | ----: | ---------------: | -------------: | --------------: | ------: | -------: |
|    0 | metacommunity |      0.00 |  6.00 | 1.00 | 1.00 |  6.00 |             6.00 |           1.00 |            1.00 |    1.00 |     1.00 |
|    1 | Dataset 1b    |      0.00 |  6.00 | 1.00 | 1.00 |  6.00 |             6.00 |           1.00 |            1.00 |    1.00 |     1.00 |
|    2 | metacommunity |      1.00 |  5.99 | 1.00 | 1.00 |  5.99 |             5.99 |           1.00 |            1.00 |    1.00 |     1.00 |
|    3 | Dataset 1b    |      1.00 |  5.99 | 1.00 | 1.00 |  5.99 |             5.99 |           1.00 |            1.00 |    1.00 |     1.00 |
|    4 | metacommunity |       inf |  5.83 | 1.00 | 1.00 |  5.83 |             5.83 |           1.00 |            1.00 |    1.00 |     1.00 |
|    5 | Dataset 1b    |       inf |  5.83 | 1.00 | 1.00 |  5.83 |             5.83 |           1.00 |            1.00 |    1.00 |     1.00 |

The `greylock` package can also calculate similarity-sensitive diversity measures for any user-supplied definition of similarity. To illustrate, we now consider a second example in which the dataset elements are all unique. Uniqueness means element frequencies are identical, so similarity is the only factor that influences diversity calculations.

<img src='images/fig2_thumbnail.png' width='350'>

The datasets now each contain a set of animals in which each animal appears only once. We consider phylogenetic similarity (approximated roughly, for purposes of this example). Dataset 2a consists entirely of birds, so all entries in the similarity matrix are close to $1$:

```python
labels_2a = ["owl", "eagle", "flamingo", "swan", "duck", "chicken", "turkey", "dodo", "dove"]
no_species_2a = len(labels_2a)
S_2a = np.identity(n=no_species_2a)


S_2a[0][1:9] = (0.91, 0.88, 0.88, 0.88, 0.88, 0.88, 0.88, 0.88) # owl
S_2a[1][2:9] = (      0.88, 0.89, 0.88, 0.88, 0.88, 0.89, 0.88) # eagle
S_2a[2][3:9] = (            0.90, 0.89, 0.88, 0.88, 0.88, 0.89) # flamingo
S_2a[3][4:9] = (                  0.92, 0.90, 0.89, 0.88, 0.88) # swan
S_2a[4][5:9] = (                        0.91, 0.89, 0.88, 0.88) # duck
S_2a[5][6:9] = (                              0.92, 0.88, 0.88) # chicken
S_2a[6][7:9] = (                                    0.89, 0.88) # turkey
S_2a[7][8:9] = (                                          0.88) # dodo
                                                                # dove


S_2a = np.maximum(S_2a, S_2a.transpose())
```
We may optionally convert this to a DataFrame for inspection:
```python
S_2a_df = pd.DataFrame({labels_2a[i]: S_2a[i] for i in range(no_species_2a)}, index=labels_2a)
```

which corresponds to the following table:

|           |      owl |     eagle | flamingo |      swan |    duck |   chicken |    turkey |     dodo |       dove |
| :-------- | -------: | --------: | -------: | --------: | ------: | --------: | --------: | -------: | ---------: |
|       owl |        1 |      0.91 |     0.88 |      0.88 |    0.88 |      0.88 |      0.88 |     0.88 |       0.88 |
|     eagle |     0.91 |         1 |     0.88 |      0.89 |    0.88 |      0.88 |      0.88 |     0.89 |       0.88 |
|  flamingo |     0.88 |      0.88 |        1 |      0.90 |    0.89 |      0.88 |      0.88 |     0.88 |       0.89 |
|      swan |     0.88 |      0.89 |     0.90 |         1 |    0.92 |      0.90 |      0.89 |     0.88 |       0.88 |
|      duck |     0.88 |      0.88 |     0.89 |      0.92 |       1 |      0.91 |      0.89 |     0.88 |       0.88 |
|   chicken |     0.88 |      0.88 |     0.88 |      0.90 |    0.91 |         1 |      0.92 |     0.88 |       0.88 |
|    turkey |     0.88 |      0.88 |     0.88 |      0.89 |    0.89 |      0.92 |         1 |     0.89 |       0.88 |
|      dodo |     0.88 |      0.89 |     0.88 |      0.88 |    0.88 |      0.88 |      0.89 |        1 |       0.88 |
|      dove |     0.88 |      0.88 |     0.89 |      0.88 |    0.88 |      0.88 |      0.88 |     0.88 |          1 |


We make a DataFrame of counts in the same way as in the previous example:

```python
counts_2a = pd.DataFrame({"Community 2a": [1, 1, 1, 1, 1, 1, 1, 1, 1]}, index=labels_2a)
```

To compute similarity-sensitive diversity indices, we now pass the similarity matrix to the `similarity` argument of the `Metacommunity` constructor.
In this example, we pass the similarity matrix as a NumPy array:

```python
metacommunity_2a = Metacommunity(counts_2a, similarity=S_2a)
```

(If we want to pass the similarity matrix as a DataFrame instead, we use the `SimilarityFromDataFrame` subclass:

```python
from greylock.similarity import SimilarityFromDataFrame
metacommunity_2a = Metacommunity(counts_2a, similarity=SimilarityFromDataFrame(S_2a_df))
```

Note that even though the code looks a little different, the calculation will be exactly the same.)

We can find $D_0^Z$ similarly to the above:

```python
metacommunity_2a.subcommunity_diversity(viewpoint=0, measure='alpha')
```

The output tells us that $D_0^Z=1.11$. That this number is close to 1 reflects the fact that all individuals in this community are very similar to each other (all birds).
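
As a sanity check, for $q=0$ the Leinster and Cobbold definition reduces to $D_0^Z = \sum_i p_i/(Zp)_i$, where $(Zp)_i$ is the "ordinariness" of species $i$. Computing this directly, independent of `greylock`, reproduces the value:

```python
# Direct check of D_0^Z = sum_i p_i / (Zp)_i (Leinster-Cobbold definition)
p = counts_2a["Community 2a"].to_numpy() / counts_2a["Community 2a"].sum()
Zp = S_2a @ p            # ordinariness of each species
print(np.sum(p / Zp))    # ~1.11, matching the package output
```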

In contrast, Dataset 2b consists of members from two different phyla: vertebrates and invertebrates. As above, we define a similarity matrix:

```python
labels_2b = ("ladybug", "bee", "butterfly", "lobster", "fish", "turtle", "parrot", "llama", "orangutan")
no_species_2b = len(labels_2b)
S_2b = np.identity(n=no_species_2b)
S_2b[0][1:9] = (0.60, 0.55, 0.45, 0.25, 0.22, 0.23, 0.18, 0.16) # ladybug
S_2b[1][2:9] = (      0.60, 0.48, 0.22, 0.23, 0.21, 0.16, 0.14) # bee
S_2b[2][3:9] = (            0.42, 0.27, 0.20, 0.22, 0.17, 0.15) # butterfly
S_2b[3][4:9] = (                  0.28, 0.26, 0.26, 0.20, 0.18) # lobster
S_2b[4][5:9] = (                        0.75, 0.70, 0.66, 0.63) # fish
S_2b[5][6:9] = (                              0.85, 0.70, 0.70) # turtle
S_2b[6][7:9] = (                                    0.75, 0.72) # parrot
S_2b[7][8:9] = (                                          0.85) # llama
                                                                # orangutan

S_2b = np.maximum(S_2b, S_2b.transpose())
# optional, convert to DataFrame for inspection:
S_2b_df = pd.DataFrame({labels_2b[i]: S_2b[i] for i in range(no_species_2b)}, index=labels_2b)
```

which corresponds to the following table:
|           |  ladybug |       bee |    b'fly |   lobster |    fish |    turtle |    parrot |    llama |  orangutan |
| :-------- | -------: | --------: | -------: | --------: | ------: | --------: | --------: | -------: | ---------: |
| ladybug   |        1 |      0.60 |     0.55 |      0.45 |    0.25 |      0.22 |      0.23 |     0.18 |       0.16 |
| bee       |     0.60 |         1 |     0.60 |      0.48 |    0.22 |      0.23 |      0.21 |     0.16 |       0.14 |
| b'fly     |     0.55 |      0.60 |        1 |      0.42 |    0.27 |      0.20 |      0.22 |     0.17 |       0.15 |
| lobster   |     0.45 |      0.48 |     0.42 |         1 |    0.28 |      0.26 |      0.26 |     0.20 |       0.18 |
| fish      |     0.25 |      0.22 |     0.27 |      0.28 |       1 |      0.75 |      0.70 |     0.66 |       0.63 |
| turtle    |     0.22 |      0.23 |     0.20 |      0.26 |    0.75 |         1 |      0.85 |     0.70 |       0.70 |
| parrot    |     0.23 |      0.21 |     0.22 |      0.26 |    0.70 |      0.85 |         1 |     0.75 |       0.72 |
| llama     |     0.18 |      0.16 |     0.17 |      0.20 |    0.66 |      0.70 |      0.75 |        1 |       0.85 |
| orangutan |     0.16 |      0.14 |     0.15 |      0.18 |    0.63 |      0.70 |      0.72 |     0.85 |          1 |

The values of the similarity matrix indicate high similarity among the vertebrates, high similarity among the invertebrates, and low similarity between vertebrates and invertebrates.

To calculate the alpha diversity (with $q=0$ as above), we proceed as before: defining counts, creating a `Metacommunity` object, and calling its `subcommunity_diversity` method with the desired settings:

```python
counts_2b = pd.DataFrame({"Community 2b": [1, 1, 1, 1, 1, 1, 1, 1, 1]}, index=labels_2b)
metacommunity_2b = Metacommunity(counts_2b, similarity=S_2b)
metacommunity_2b.subcommunity_diversity(viewpoint=0, measure='alpha')
```

which outputs $D_0^Z=2.16$. That this number is close to 2 reflects the fact that members in this community belong to two broad classes of animals: vertebrates and invertebrates. The remaining $0.16$ above $2$ is interpreted as reflecting the diversity within each phylum.

## Beta diversities
Recall that beta diversity is between-group diversity. To illustrate, we will re-imagine Dataset 2b as a metacommunity made up of two subcommunities, the invertebrates and the vertebrates, defined as follows:

```python
counts_2b_1 = pd.DataFrame(
    {
        "Subcommunity_2b_1": [1, 1, 1, 1, 0, 0, 0, 0, 0],  # invertebrates
        "Subcommunity_2b_2": [0, 0, 0, 0, 1, 1, 1, 1, 1],  # vertebrates
    },
    index=labels_2b,
)
```

We can obtain the rescaled redundancy $\hat{\rho}$ (“rho-hat”) of each subcommunity, here at $q=0$, as follows:

```python
metacommunity_2b_1 = Metacommunity(counts_2b_1, similarity=S_2b)
metacommunity_2b_1.subcommunity_diversity(viewpoint=0, measure='rho_hat')
```

with the output $[0.41, 0.21]$. Recall that $\hat{\rho}$ indicates how well a subcommunity represents the metacommunity. The $\hat{\rho}$ values of the two subcommunities are rather low ($0.41$ for the invertebrates and $0.21$ for the vertebrates), reflecting the low similarity between these groups.
Note that the invertebrates are more diverse than the vertebrates, which we can see by calculating the $q=0$ $\alpha$ diversity of these subcommunities:

```python
metacommunity_2b_1.subcommunity_diversity(viewpoint=0, measure='alpha')
```

which outputs $[3.54, 2.30]$. In contrast, suppose we split Dataset 2b into two subsets at random, without regard to phylum:

```python
counts_2b_2 = pd.DataFrame(
    {
        "Subcommunity_2b_3": [1, 0, 1, 0, 1, 0, 1, 0, 1],
        "Subcommunity_2b_4": [0, 1, 0, 1, 0, 1, 0, 1, 0],
    },
    index=labels_2b,
)
```

Proceeding again as above,

```python
metacommunity_2b_2 = Metacommunity(counts_2b_2, similarity=S_2b)
metacommunity_2b_2.subcommunity_diversity(viewpoint=0, measure='rho_hat')
```

This yields $[0.68, 1.07]$: the $\hat{\rho}$ values of the two subsets are now $0.68$ and $1.07$, respectively. These high values reflect the fact that the vertebrates and the invertebrates are roughly equally represented in each subset.

# Advanced usage

In the examples above, the entire similarity matrix has been created in RAM (as a `numpy.ndarray` or `pandas.DataFrame`) before being passed to the `Metacommunity` constructor. However, this may not be the best tactic for large datasets, and the `greylock` package offers better options in these cases. Given that the similarity matrix is of size $O(n^2)$ (where $n$ is the number of species), the creation, storage, and use of the similarity matrix are the most computationally resource-intensive aspects of calculating diversity. Careful consideration of how to handle the similarity matrix can extend the range of tractable problems by many orders of magnitude.

Any large similarity matrix that is created in Python as a `numpy.ndarray` benefits from being memory-mapped, as NumPy can then use the data without requiring it all to be in memory. See the NumPy [memmap documentation](https://numpy.org/doc/stable/reference/generated/numpy.memmap.html) for guidance. Because `memmap` is a subclass of `ndarray`, using this type of file storage for the similarity matrix requires no modification to your use of the `Metacommunity` API. This conversion, and the resulting storage of the data on disk, has the advantage that if you revise the downstream analysis, or perform additional analyses, recalculation of the similarity matrix may be skipped.
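
For example, here is a minimal sketch (with `S_2b.dat` a filename chosen for this illustration) that stores `S_2b` from the examples above in a memory-mapped file and passes it to `Metacommunity`:

```python
# Build the similarity matrix in a disk-backed array instead of RAM
# ("S_2b.dat" is a filename chosen for this example):
n = no_species_2b
S = np.memmap("S_2b.dat", dtype=np.float64, mode="w+", shape=(n, n))
S[:] = S_2b               # for a truly large matrix, fill chunk by chunk
S.flush()

# Later -- even in a separate session -- reopen read-only; since memmap is
# an ndarray subclass, Metacommunity accepts it like any other array:
S_again = np.memmap("S_2b.dat", dtype=np.float64, mode="r", shape=(n, n))
metacommunity = Metacommunity(counts_2b_1, similarity=S_again)
```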

The calculate-once, use-many-times strategy afforded by storing the similarity matrix in a file also allows you to do the work of calculating it in an entirely separate process. You may choose to calculate the similarity matrix in a more performant language, such as C++, and/or inspect the matrix in Excel. In these cases, it is convenient to store the similarity matrix in a non-Python-specific format, such as a .csv or .tsv file. The entire csv/tsv file need not be read into memory before invoking the `Metacommunity` constructor.
Rather, one may instantiate `SimilarityFromFile`, whose constructor is given the path of a csv or tsv file. The `SimilarityFromFile` object will use the file's contents in a memory-efficient way, reading chunks as they are needed.

To illustrate passing a csv file, we re-use `counts_2b_1` and `S_2b_df` from above and save the latter as a .csv file (note `index=False`, since the csv file should *not* contain row labels):
```python
S_2b_df.to_csv("S_2b.csv", index=False)
```
Then we can build a metacommunity as follows:
```python
from greylock.similarity import SimilarityFromFile
metacommunity_2b_1 = Metacommunity(counts_2b_1,
                                   similarity=SimilarityFromFile('S_2b.csv', chunk_size=5))
```
The optional `chunk_size` argument to `SimilarityFromFile`'s constructor specifies how many rows of the similarity matrix are read from the file at a time.

Alternatively, to avoid a large footprint on either RAM or disk, the similarity matrix can be constructed and processed on the fly.
A `SimilarityFromFunction` object generates a similarity matrix from a similarity function and an array or `DataFrame` of features passed as `X`. Each row of `X` represents the feature values of one species.
For example, given numeric features all of the same type:

```python
from greylock.similarity import SimilarityFromFunction

X = np.array([
  [1, 2], 
  [3, 4], 
  [5, 6]
])

def similarity_function(species_i, species_j):
  return 1 / (1 + np.linalg.norm(species_i - species_j))

metacommunity = Metacommunity(np.array([[1, 1], [1, 0], [0, 1]]),
                              similarity=SimilarityFromFunction(similarity_function,
                                                               X=X, chunk_size=10))
```

(The optional `chunk_size` parameter specifies how many rows of the similarity matrix to generate at once; larger values should be faster, as long as the chunks are not too large compared to available RAM.)

If there are features of various types, and it would be convenient to address features by name, features can be supplied in a DataFrame. (Note that, because of the use of named tuples to represent species in the similarity function, it is helpful if the column names are valid Python identifiers.)

```python
X = pd.DataFrame(
    {
        "breathes": [
            "water",
            "air",
            "air",
        ],
        "covering": [
            "scales",
            "scales",
            "fur",
        ],
        "n_legs": [
            0,
            0,
            4,
        ],
    },
    index=[
        "tuna",
        "snake",
        "rabbit",
    ],
)

def feature_similarity(animal_i, animal_j):
    if animal_i.breathes != animal_j.breathes:
        return 0.0
    if animal_i.covering == animal_j.covering:
        result = 1
    else:
        result = 0.5
    if animal_i.n_legs != animal_j.n_legs:
        result *= 0.5
    return result

metacommunity = Metacommunity(np.array([[1, 1], [1, 0], [0, 1]]),
                              similarity=SimilarityFromFunction(feature_similarity, X=X))
```

A two-fold speed-up is possible when the following (typical) conditions hold:

* The similarity matrix is symmetric (i.e. `similarity[i, j] == similarity[j, i]` for all `i` and `j`).
* The similarity of each species with itself is 1.0.
* The number of subcommunities is much smaller than the number of species.

In this case, we don't really need to call the similarity function twice for each pair to calculate both `similarity[i, j]` and `similarity[j, i]`.
Use the `SimilarityFromSymmetricFunction` class to get the same results in half the time:

```python
from greylock.similarity import SimilarityFromSymmetricFunction

metacommunity = Metacommunity(np.array([[1, 1], [1, 0], [0, 1]]),
                              similarity=SimilarityFromSymmetricFunction(feature_similarity, X=X))
```

The similarity function will only be called for pairs of rows `species[i], species[j]` where `i < j`, and the similarity of species $i$ to species $j$ will be re-used for the similarity of species $j$ to species $i$. Thus, a nearly two-fold speed-up is possible if the similarity function is computationally expensive. (For a discussion of _nonsymmetric_ similarity, see [Leinster and Cobbold](https://doi.org/10.1890/10-2402.1).)

## Parallelization using the ray package

For very large datasets, the computation of the similarity matrix can be completed in a fraction of the time by parallelizing it over many cores or even over a Kubernetes cluster.
Support for parallelizing this computation using the [`ray` package](https://pypi.org/project/ray/) is built into `greylock`. However, `ray` is an optional dependency, as it is not required for small datasets, and installing it into your environment may entail dependency conflicts. Thus, before trying to use `ray`, be sure to install the extra:

```
pip install 'greylock[ray]'
```

To actually use Ray, replace `SimilarityFromFunction` and `SimilarityFromSymmetricFunction` with `SimilarityFromRayFunction` and `SimilarityFromSymmetricRayFunction`, respectively. Each chunk of `chunk_size` rows of the similarity matrix is processed as a separate job. Thanks to this parallelization, up to an N-fold speedup is possible (where N is the number of cores or nodes).
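
For example, re-using `feature_similarity` and `X` from above, and assuming the Ray variants take the same constructor arguments as their serial counterparts:

```python
from greylock.similarity import SimilarityFromSymmetricRayFunction

# Same example as above, but with the similarity matrix computed in
# parallel Ray jobs of chunk_size rows each:
metacommunity = Metacommunity(
    np.array([[1, 1], [1, 0], [0, 1]]),
    similarity=SimilarityFromSymmetricRayFunction(feature_similarity, X=X,
                                                  chunk_size=100),
)
```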

# Command-line usage
The `greylock` package can also be used from the command line as a module (via `python -m`). To illustrate using `greylock` this way, we once more re-use the example with `counts_2b_1` and `S_2b`, now with `counts_2b_1` also saved as a csv file (note again `index=False`):
```python
counts_2b_1.to_csv("counts_2b_1.csv", index=False)
```

Then from the command line: 

`python -m greylock -i counts_2b_1.csv -s S_2b.csv -v 0 1 inf`

The output is a table with all the diversity indices for q=0, 1, and ∞. Note that while .csv or .tsv are acceptable as input, the output is always tab-delimited. The input filepath (`-i`) and the similarity matrix filepath (`-s`) can be URLs to data files hosted on the web. Also note that values of $q>100$ are all calculated as $q=\infty$.

For further options, consult the help:

`python -m greylock -h`

# Applications

For applications of the `greylock` package to various fields (immunomics, metagenomics, medical imaging, and pathology), we refer to the Jupyter notebooks below:

- [Immunomics](https://github.com/ArnaoutLab/diversity_notebooks_and_data/blob/main/immunomics/immunomics_fig3.ipynb)
- [Metagenomics](https://github.com/ArnaoutLab/diversity_notebooks_and_data/blob/main/metagenomics/metagenomics_figs4-5.ipynb)
- [Medical imaging](https://github.com/ArnaoutLab/diversity_notebooks_and_data/blob/main/medical_imaging/medical_imaging_fig6-7.ipynb)
- [Pathology](https://github.com/ArnaoutLab/diversity_notebooks_and_data/blob/main/pathology/pathology_fig8.ipynb)

The examples in the Basic usage section are also made available as a notebook [here](https://github.com/ArnaoutLab/diversity_notebooks_and_data/blob/main/fruits_and_animals/fruits_and_animals_fig1_2.ipynb). For more information, please see our [preprint](https://arxiv.org/abs/2401.00102).

# Alternatives

To date, we know of no other Python package that implements the partitioned frequency- and similarity-sensitive diversity measures defined by [Reeve et al.](https://arxiv.org/abs/1404.6520). However, there is an [R package](https://github.com/boydorr/rdiversity) and a [Julia package](https://github.com/EcoJulia/Diversity.jl).



            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "greylock",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": null,
    "author": null,
    "author_email": "Elliot Hill <elliot.douglas.hill@gmail.com>, Alex Morgan <amorgan2@bidmc.harvard.edu>, Phuc Nguyen <pnguye10@bidmc.harvard.edu>, Jasper Braun <jasperbraun90@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/5c/a1/14e60507573f2d17447e161c8b855226cbe97513778c6398253da4a93dfa/greylock-1.0.0.tar.gz",
    "platform": null,
    "description": "![alt text](images/diversity_logo.png)\n\n# <h1> <i>greylock</i>: A Python package for measuring the composition of complex datasets</h1>\n\n[![Python version](https://img.shields.io/badge/python-3.8%20%7C%203.9%20%7C%203.10-blue)](https://www.python.org/downloads/release/python-380/)\n[![Tests](https://github.com/ArnaoutLab/diversity/actions/workflows/tests.yml/badge.svg)](https://github.com/ArnaoutLab/diversity/actions/workflows/tests.yml)\n\n- [About](#about)\n  - [Definitions](#definitions)\n  - [Partitioned diversity](#partitioned-diversity)\n  - [Frequency-sensitive diversity](#frequency-sensitive-diversity)\n  - [Similarity-sensitive diversity](#similarity-sensitive-diversity)\n  - [Rescaled diversity indices](#rescaled-diversity-indices)\n  - [One package to rule them all](#one-package-to-rule-them-all)\n- [Basic usage](#basic-usage)\n  - [alpha diversities](#alpha-diversities)\n  - [beta diversities](#beta-diversities)\n- [Advanced usage](#advanced-usage)\n- [Command-line usage](#command-line-usage)\n- [Applications](#applications)\n- [Alternatives](#alternatives)\n\n# About\n\n`greylock` calculates effective numbers in an extended version of the Hill framework, with extensions due to Leinster and Cobbold and Reeve et al. \u201cExtending\u201d a hill makes a mountain. At 3,489 feet (1,063 meters), Mount Greylock is Massachusetts\u2019 tallest mountain. It is named for Gray Lock (c. 1670\u20131750),  a historical figure of the Abnaki, an indigenous people of New England.\n\n## Availability and installation\nThe package is available on GitHub at https://github.com/ArnaoutLab/diversity. It can be installed by running\n\n`pip install greylock`\n\nfrom the command-line interface. The test suite runs successfully on Macintosh, Windows, and Unix systems. The unit tests (including a coverage report) can be run after installation by\n\n```\npip install 'greylock[tests]'\npytest --pyargs greylock --cov greylock\n```\n\n## How to cite this work\n\nIf you use this package, please cite it as:\n\nNguyen et al., <i>greylock</i>. <https://github.com/ArnaoutLab/diversity>\n\n## Definitions\n\nA ***community*** is a collection of elements called ***individuals***, each of which is assigned a label called its ***species***, where multiple individuals may have the same species. An example of a community is all the animals and plants living in a lake. A ***metacommunity*** consists of several communities. An example of a metacommunity is all the animals in a lake split into different depths. Each community that makes up a metacommunity is called a ***subcommunity***.\n\nEven though the terms metacommunity and subcommunity originate in ecology, we use them in a broader sense. If one is interested in analyzing a subset of a dataset, then the subset is a subcommunity and the entire dataset is the metacommunity. Alternatively, if one is interested in how individual datasets (e.g. from individual research subjects) compare to all datasets used in a study, the individual datasets are subcommunities and the set of all datasets is the metacommunity. (When there is only a single dataset under study, we use \u201csubcommunity\u201d and \u201cmetacommunity\u201d interchangeably as convenient.)\n\nA ***diversity index*** is a statistic associated with a community, which describes how much the species of its individuals vary. 
For example, a community of many individuals of the same species has a very low diversity whereas a community with multiple species and the same amount of individuals per species has a high diversity.\n\n## Partitioned diversity\n\nSome diversity indices compare the diversities of the subcommunities with respect to the overall metacommunity. For example, two subcommunities with the same frequency distribution but no shared species each comprise half of the combined metacommunity diversity.\n\n## Frequency-sensitive diversity\n\n[In 1973, Hill introduced a framework](https://doi.org/10.2307/1934352) which unifies commonly used diversity indices into a single parameterized family of diversity measures. The so-called ***viewpoint parameter*** can be thought of as the sensitivity to rare species. At one end of the spectrum, when the viewpoint parameter is set to 0, species frequency is ignored entirely, and only the number of distinct species matters, while at the other end of the spectrum, when the viewpoint parameter is set to $\\infty$, only the highest frequency species in a community is considered by the corresponding diversity measure. Common diversity measures such as ***species richness***, ***Shannon entropy***, the ***Gini-Simpson index***, and the ***Berger-Parker index*** have simple and natural relationships with Hill's indices at different values for the viewpoint parameter ($0$, $1$, $2$, $\\infty$, respectively).\n\n## Similarity-sensitive diversity\n\nIn addition to being sensitive to frequency, it often makes sense to account for similarity in a diversity measure. For example, a community of two different types of rodents may be considered less diverse than a community where one of the rodent species was replaced by the same number of individuals of a bird species. [Reeve et al.](https://arxiv.org/abs/1404.6520) and [Leinster and Cobbold](https://doi.org/10.1890/10-2402.1) present a general mathematically rigorous way of incorporating similarity measures into Hill's framework. The result is a family of similarity-sensitive diversity indices parameterized by the same viewpoint parameter as well as the similarity function used for the species in the meta- or subcommunities of interest. These similarity-sensitive diversity measures account for both the pairwise similarity between all species and their frequencies.\n\n## Rescaled diversity indices\n\nIn addition to the diversity measures introduced by Reeve et al, we also included two new rescaled measures $\\hat{\\rho}$ and $\\hat{\\beta}$, as well as their metacommunity counterparts. The motivation for introducing these measures is that $\\rho$ can become very large if the number of subcommunities is large. Similarly, $\\beta$ can become very small in this case. The rescaled versions are designed so that they remain of order unity even when there are lots of subcommunities.\n\n## One package to rule them all\n\nThe `greylock` package is able to calculate all of the similarity- and frequency-sensitive subcommunity and metacommunity diversity measures described in [Reeve et al.](https://arxiv.org/abs/1404.6520). 
See the paper for more in-depth information on their derivation and interpretation.\n\n\n**Supported subcommunity diversity measures**:\n\n  - $\\alpha$ - diversity of subcommunity $j$ in isolation, per individual\n  - $\\bar{\\alpha}$ - diversity of subcommunity $j$ in isolation\n  - $\\rho$ - redundancy of subcommunity $j$\n  - $\\bar{\\rho}$ - representativeness of subcommunity $j$\n  - $\\hat{\\rho}$ - rescaled version of redundancy ($\\rho$)\n  - $\\beta$ - distinctiveness of subcommunity $j$\n  - $\\bar{\\beta}$ - effective number of distinct subcommunities\n  - $\\hat{\\beta}$ - rescaled version of distinctiveness ($\\beta$) \n  - $\\gamma$ - contribution of subcommunity $j$ toward metacommunity diversity\n\n\n**Supported metacommunity diversity measures**:\n  - $A$ - naive-community metacommunity diversity\n  - $\\bar{A}$ - average diversity of subcommunities\n  - $R$ - average redundancy of subcommunities\n  - $\\bar{R}$ - average representativeness of subcommunities\n  - $\\hat{R}$ - average rescaled redundancy of subcommunities\n  - $B$ - average distinctiveness of subcommunities\n  - $\\bar{B}$ - effective number of distinct subcommunities\n  - $\\hat{B}$ - average rescaled distinctiveness of subcommunities\n  - $G$ - metacommunity diversity\n\n\n# Basic usage\n## Alpha diversities \n\nWe illustrate the basic usage of `greylock` on simple, field-of-study-agnostic datasets of fruits and animals. First, consider two datasets of size $n=35$ that each contains counts of six types of fruit: apples, oranges, bananas, pears, blueberries, and grapes.\n\n<img src='images/fruits-1.png' width='350'>\n\nDataset 1a is mostly apples; in dataset 1b, all fruits are represented at almost identical frequencies. The frequencies of the fruits in each dataset is tabulated below:\n\n|           | Dataset 1a | Dataset 1b | \n| :-------- | ---------: | ---------: | \n| apple     |         30 |          6 | \n| orange    |          1 |          6 |\n| banana    |          1 |          6 |\n| pear      |          1 |          6 |\n| blueberry |          1 |          6 |\n| grape     |          1 |          5 |\n| total     |         35 |         35 | \n\nA frequency-sensitive metacommunity can be created in Python by passing a `counts` DataFrame to a `Metacommunity` object:\n\n```python\nimport pandas as pd\nimport numpy as np\nfrom greylock import Metacommunity\n\ncounts_1a = pd.DataFrame({\"Dataset 1a\": [30, 1, 1, 1, 1, 1]}, \n   index=[\"apple\", \"orange\", \"banana\", \"pear\", \"blueberry\", \"grape\"])\n\nmetacommunity_1a = Metacommunity(counts_1a)\n```\n\nOnce a metacommunity has been created, diversity measures can be calculated. For example, to calculate $D_1$, we type:\n\n```python\nmetacommunity_1a.subcommunity_diversity(viewpoint=1, measure='alpha')\n```\n\nThe output shows that $D_1=1.90$. To calculate the corresponding metacommunity diversity index:\n\n```python\nmetacommunity_1a.metacommunity_diversity(viewpoint=1, measure='alpha')\n```\n\nIn this example, the metacommunity indices are the same as the subcommunity ones, since there is only one subcommunity. 
To calculated multiple diversity measures at once and store them in a DataFrame, we type:\n\n```python \nmetacommunity_1a.to_dataframe(viewpoint=[0, 1, np.inf])\n```\n\nwhich produces the following output:\n\n|      | community     | viewpoint | alpha |  rho | beta | gamma | normalized_alpha | normalized_rho | normalized_beta | rho_hat | beta_hat |\n| ---: | :------------ | --------: | ----: | ---: | ---: | ----: | ---------------: | -------------: | --------------: | ------: | -------: |\n|    0 | metacommunity |      0.00 |  6.00 | 1.00 | 1.00 |  6.00 |             6.00 |           1.00 |            1.00 |    1.00 |     1.00 |\n|    1 | Dataset 1a    |      0.00 |  6.00 | 1.00 | 1.00 |  6.00 |             6.00 |           1.00 |            1.00 |    1.00 |     1.00 |\n|    2 | metacommunity |      1.00 |  1.90 | 1.00 | 1.00 |  1.90 |             1.90 |           1.00 |            1.00 |    1.00 |     1.00 |\n|    3 | Dataset 1a    |      1.00 |  1.90 | 1.00 | 1.00 |  1.90 |             1.90 |           1.00 |            1.00 |    1.00 |     1.00 |\n|    4 | metacommunity |       inf |  1.17 | 1.00 | 1.00 |  1.17 |             1.17 |           1.00 |            1.00 |    1.00 |     1.00 |\n|    5 | Dataset 1a    |       inf |  1.17 | 1.00 | 1.00 |  1.17 |             1.17 |           1.00 |            1.00 |    1.00 |     1.00 |\n\n\nNext, let us repeat for Dataset 1b. Again, we make the `counts` dataframe and a `Metacommunity` object:\n\n```python\ncounts_1b = pd.DataFrame({\"Community 1b\": [6, 6, 6, 6, 6, 5]},\n    index=[\"apple\", \"orange\", \"banana\", \"pear\", \"blueberry\", \"grape\"])\n\nmetacommunity_1b = Metacommunity(counts_1b)\n```\n\nTo obtain $D_1$, we run:\n\n```python\nmetacommunity_1b.subcommunity_diversity(viewpoint=1, measure='alpha')\n```\n\nWe find that $D_1 \\approx 5.99$ for Dataset 1b. The larger value of $D_1$ for Dataset 1b aligns with the intuitive sense that more balance in the frequencies of unique elements means a more diverse dataset. To output multiple diversity measures at once, we run:\n\n```python\nmetacommunity_1b.to_dataframe(viewpoint=[0, 1, np.inf])\n```\n\nwhich produces the output:\n\n|      | community     | viewpoint | alpha |  rho | beta | gamma | normalized_alpha | normalized_rho | normalized_beta | rho_hat | beta_hat |\n| ---: | :------------ | --------: | ----: | ---: | ---: | ----: | ---------------: | -------------: | --------------: | ------: | -------: |\n|    0 | metacommunity |      0.00 |  6.00 | 1.00 | 1.00 |  6.00 |             6.00 |           1.00 |            1.00 |    1.00 |     1.00 |\n|    1 | Dataset 1b    |      0.00 |  6.00 | 1.00 | 1.00 |  6.00 |             6.00 |           1.00 |            1.00 |    1.00 |     1.00 |\n|    2 | metacommunity |      1.00 |  5.99 | 1.00 | 1.00 |  5.99 |             5.99 |           1.00 |            1.00 |    1.00 |     1.00 |\n|    3 | Dataset 1b    |      1.00 |  5.99 | 1.00 | 1.00 |  5.99 |             5.99 |           1.00 |            1.00 |    1.00 |     1.00 |\n|    4 | metacommunity |       inf |  5.83 | 1.00 | 1.00 |  5.83 |             5.83 |           1.00 |            1.00 |    1.00 |     1.00 |\n|    5 | Dataset 1b    |       inf |  5.83 | 1.00 | 1.00 |  5.83 |             5.83 |           1.00 |            1.00 |    1.00 |     1.00 |\n\nThe `greylock` package can also calculate similarity-sensitive diversity measures for any user-supplied definition of similarity. To illustrate, we now consider a second example in which the dataset elements are all unique. 
Uniqueness means element frequencies are identical, so similarity is the only factor that influences diversity calculations.\n\n<img src='images/fig2_thumbnail.png' width='350'>\n\nThe datasets now each contain a set of animals in which each animal appears only once. We consider phylogenetic similarity (approximated roughly, for purposes of this example). Dataset 2a consists entirely of birds, so all entries in the similarity matrix are close to $1$:\n\n```python\nlabels_2a = [\"owl\", \"eagle\", \"flamingo\", \"swan\", \"duck\", \"chicken\", \"turkey\", \"dodo\", \"dove\"]\nno_species_2a = len(labels_2a)\nS_2a = np.identity(n=no_species_2a)\n\n\nS_2a[0][1:9] = (0.91, 0.88, 0.88, 0.88, 0.88, 0.88, 0.88, 0.88) # owl\nS_2a[1][2:9] = (      0.88, 0.89, 0.88, 0.88, 0.88, 0.89, 0.88) # eagle\nS_2a[2][3:9] = (            0.90, 0.89, 0.88, 0.88, 0.88, 0.89) # flamingo\nS_2a[3][4:9] = (                  0.92, 0.90, 0.89, 0.88, 0.88) # swan\nS_2a[4][5:9] = (                        0.91, 0.89, 0.88, 0.88) # duck\nS_2a[5][6:9] = (                              0.92, 0.88, 0.88) # chicken\nS_2a[6][7:9] = (                                    0.89, 0.88) # turkey\nS_2a[7][8:9] = (                                          0.88) # dodo\n                                                                # dove\n\n\nS_2a = np.maximum( S_2a, S_2a.transpose() )\n```\nWe may optionally convert this to a DataFrame for inspection:\n```python\nS_2a_df = pd.DataFrame({labels_2a[i]: S_2a[i] for i in range(no_species_2a)}, index=labels_2a)\n```\n\nwhich corresponds to the following table:\n\n|           |      owl |     eagle | flamingo |      swan |    duck |   chicken |    turkey |     dodo |       dove |\n| :-------- | -------: | --------: | -------: | --------: | ------: | --------: | --------: | -------: | ---------: |\n|       owl |        1 |      0.91 |     0.88 |      0.88 |    0.88 |      0.88 |      0.88 |     0.88 |       0.88 |\n|     eagle |     0.91 |         1 |     0.88 |      0.89 |    0.88 |      0.88 |      0.88 |     0.89 |       0.88 |\n|  flamingo |     0.88 |      0.88 |        1 |      0.90 |    0.89 |      0.88 |      0.88 |     0.88 |       0.89 |\n|      swan |     0.88 |      0.89 |     0.90 |         1 |    0.92 |      0.90 |      0.89 |     0.88 |       0.88 |\n|      duck |     0.88 |      0.88 |     0.89 |      0.92 |       1 |      0.91 |      0.89 |     0.88 |       0.88 |\n|   chicken |     0.88 |      0.88 |     0.88 |      0.90 |    0.91 |         1 |      0.92 |     0.88 |       0.88 |\n|    turkey |     0.88 |      0.88 |     0.88 |      0.89 |    0.89 |      0.92 |         1 |     0.89 |       0.88 |\n|      dodo |     0.88 |      0.89 |     0.88 |      0.88 |    0.88 |      0.88 |      0.89 |        1 |       0.88 |\n|      dove |     0.88 |      0.88 |     0.89 |      0.88 |    0.88 |      0.88 |      0.88 |     0.88 |          1 |\n\n\nWe make a DataFrame of counts in the same way as in the previous example:\n\n```python\ncounts_2a = pd.DataFrame({\"Community 2a\": [1, 1, 1, 1, 1, 1, 1, 1, 1]}, index=labels_2a)\n```\n\nTo compute the similarity-sensitive diversity indices, we now pass the similarity matrix to the similarity argument of the metacommunity object.\nIn this example we pass the similarity matrix in the form of a numpy array:\n\n```python\nmetacommunity_2a = Metacommunity(counts_2a, similarity=S_2a)\n```\n\n(If we wanted to use the similarity matrix in DataFrame format, we use a custom Similarity subclass.\n\n```python\nfrom greylock.similarity import 
SimilarityFromDataFrame\nmetacommunity_2a = Metacommunity(counts_2a, similarity=SimilarityFromDataFrame(S_2a_df))\n```\n\nNote that even though the code looks a little different, the calculation will be exactly the same.)\n\nWe can find $D_0^Z$ similarly to the above:\n\n```python\nmetacommunity_2a.subcommunity_diversity(viewpoint=0, measure='alpha')\n```\n\nThe output tells us that $D_0^Z=1.11$. The fact that this number is close to 1 reflects the fact that all individuals in this community are very similar to each other (all birds).\n\nIn contrast, Dataset 2b consists of members from two different phyla: vertebrates and invertebrates. As above, we define a similarity matrix:\n\n```python\nlabels_2b = (\"ladybug\", \"bee\", \"butterfly\", \"lobster\", \"fish\", \"turtle\", \"parrot\", \"llama\", \"orangutan\")\nno_species_2b = len(labels_2b)\nS_2b = np.identity(n=no_species_2b)\nS_2b[0][1:9] = (0.60, 0.55, 0.45, 0.25, 0.22, 0.23, 0.18, 0.16) # ladybug\nS_2b[1][2:9] = (      0.60, 0.48, 0.22, 0.23, 0.21, 0.16, 0.14) # bee\nS_2b[2][3:9] = (            0.42, 0.27, 0.20, 0.22, 0.17, 0.15) # bu\u2019fly\nS_2b[3][4:9] = (                  0.28, 0.26, 0.26, 0.20, 0.18) # lobster\nS_2b[4][5:9] = (                        0.75, 0.70, 0.66, 0.63) # fish\nS_2b[5][6:9] = (                              0.85, 0.70, 0.70) # turtle\nS_2b[6][7:9] = (                                    0.75, 0.72) # parrot\nS_2b[7][8:9] = (                                          0.85) # llama\n                                                                #orangutan\n\nS_2b = np.maximum( S_2b, S_2b.transpose() )\n# optional, convert to DataFrame for inspection:\nS_2b_df = pd.DataFrame({labels_2b[i]: S_2b[i] for i in range(no_species_2b)}, index=labels_2b)\n```\n\nwhich corresponds to the following table:\n|           |  ladybug |       bee |    b'fly |   lobster |    fish |    turtle |    parrot |    llama |  orangutan |\n| :-------- | -------: | --------: | -------: | --------: | ------: | --------: | --------: | -------: | ---------: |\n| ladybug   |        1 |      0.60 |     0.55 |      0.45 |    0.25 |      0.22 |      0.23 |     0.18 |       0.16 |\n| bee       |     0.60 |         1 |     0.60 |      0.48 |    0.22 |      0.23 |      0.21 |     0.16 |       0.14 |\n| b'fly     |     0.55 |      0.60 |        1 |      0.42 |    0.27 |      0.20 |      0.22 |     0.17 |       0.15 |\n| lobster   |     0.45 |      0.48 |     0.42 |         1 |    0.28 |      0.26 |      0.26 |     0.20 |       0.18 |\n| fish      |     0.25 |      0.22 |     0.27 |      0.28 |       1 |      0.75 |      0.70 |     0.66 |       0.63 |\n| turtle    |     0.22 |      0.23 |     0.20 |      0.26 |    0.75 |         1 |      0.85 |     0.70 |       0.70 |\n| parrot    |     0.23 |      0.21 |     0.22 |      0.26 |    0.70 |      0.85 |         1 |     0.75 |       0.72 |\n| llama     |     0.18 |      0.16 |     0.17 |      0.20 |    0.66 |      0.70 |      0.75 |        1 |       0.85 |\n| orangutan |     0.16 |      0.14 |      0.15|      0.18 |     0.63|      0.70 |      0.72 |     0.85 |          1 |\n\nThe values of the similarity matrix indicate high similarity among the vertebrates, high similarity among the invertebrates and low similarity between vertebrates and invertebrates.\n\nTo calculate the alpha diversity (with $q=0$ as above), we proceed as before, defining counts, creating a Metacommunity object, and calling its `subcommunity_diversity` method with the desired settings:\n\n```python\ncounts_2b = 
pd.DataFrame({\"Community 2b\": [1, 1, 1, 1, 1, 1, 1, 1, 1]}, index=labels_2b)\nmetacommunity_2b = Metacommunity(counts_2b, similarity=S_2b)\nmetacommunity_2b.subcommunity_diversity(viewpoint=0, measure='alpha')\n```\n\nwhich outputs $D_0^Z=2.16$. That this number is close to 2 reflects the fact that members in this community belong to two broad classes of animals: vertebrates and invertebrates. The remaining $0.16$ above $2$ is interpreted as reflecting the diversity within each phylum.\n\n## Beta diversities\nRecall beta diversity is between-group diversity. To illustrate, we will re-imagine Dataset 2b as a metacommunity made up of 2 subcommunities\u2014the invertebrates and the vertebrates\u2014defined as follows:\n\n```python\ncounts_2b_1 = pd.DataFrame(\n{\n   \"Subcommunity_2b_1\": [1, 1, 1, 1, 0, 0, 0, 0, 0], # invertebrates\n      \"Subcommunity_2b_2\": [0, 0, 0, 0, 1, 1, 1, 1, 1], #   vertebrates\n},\nindex=labels_2b\n)\n```\n\nWe can obtain the representativeness $\\bar{\\rho}$ (\u201crho-bar\u201d) of each subcommunity, here at $q=0$, as follows:\n\n```python\nmetacommunity_2b_1 = Metacommunity(counts_2b_1, similarity=S_2b)\nmetacommunity_2b_1.subcommunity_diversity(viewpoint=0, \nmeasure='rho_hat')\n```\n\nwith the output $[0.41, 0.21]$. Recall $\\hat{\\rho}$ indicates how well a subcommunity represents the metacommunity. We find that $\\hat{\\rho}$ of the two subcommunities are rather low\u2014 $0.41$ and $0.21$ for the invertebrates and the vertebrates, respectively\u2014reflecting the low similarity between these groups. \nNote the invertebrates are more diverse than the vertebrates, which we can see by calculating $q=0$ $\\alpha$ diversity of these subcommunities:\n\n```python\nmetacommunity_2b_1.subcommunity_diversity(viewpoint=0, measure='alpha')\n```\n\nwhich outputs $[3.54, 2.30]$. In contrast, suppose we split Dataset 2b into two subsets at random, without regard to phylum:\n\n```python\ncounts_2b_2 = pd.DataFrame(\n{\n   \"Subcommunity_2b_3\": [1, 0, 1, 0, 1, 0, 1, 0, 1],\n   \"Subcommunity_2b_4\": [0, 1, 0, 1, 0, 1, 0, 1, 0],\n},\nindex=labels_2b\n)\n```\n\nProceeding again as above,\n\n```python\nmetacommunity_2b_2 = Metacommunity(counts_2b_2, similarity=S_2b)\nmetacommunity_2b_2.subcommunity_diversity(viewpoint=0, measure='rho_hat')\n```\n\nyielding $[0.68, 1.07]$. We find that the $\\hat{\\rho}$ of the two subsets are now, respectively, $0.68$ and $1.07$. These high values reflect the fact that the vertebrates and the invertebrates are roughly equally represented.\n\n# Advanced usage\n\nIn the examples above, the entire similarity matrix has been created in RAM (as a `numpy.ndarray` or `pandas.DataFrame`) before being passed to the `Metacommunity` constructor. However, this may not be the best tactic for large datasets. The `greylock` package offers better options in these cases. Given that the\nsimillarity matrix is of complexity $O(n^2)$ (where $n$ is the number of species), the creation, storage, and use of the similarity matrix are the most computationally resource-intense aspects of calculating diversity. Careful consideration of how to handle the similarity matrix can extend the range of problems that are tractable by many orders of magnitude.\n\nAny large similarity matrix that is created in Python as a `numpy.ndarray` benefits from being memory-mapped, as NumPy can then use the data without requiring it all to be in memory. See the NumPy [memmap documentation](https://numpy.org/doc/stable/reference/generated/numpy.memmap.html) for guidance. 
Because `memmap` is a subclass of `ndarray`, using this type of file storage for the similarity matrix requires no modification to your use of the Metacommunity API. This conversion, and the resulting storage of the data on disk, has the advantage that if you revise the downstream analysis, or perform additional analyses, re-calculation of the similarity matrix may be skipped.\n\nThe strategy of calculate-once, use-many-times afforded by storage of the similarity matrix to a file allows you to do the work of calculating the similarity matrix in an entirely separate process. You may choose to calculate the similarity matrix in a more performant language, such as C++, and/or inspect the matrix in Excel. In these cases, it is  convenient to store the similarity matrix in a non-Python-specific format, such as a .csv or .tsv file. The entire csv/tsv file need not be read into memory before invoking the `Metacommunity` constructor.\nRather, one may instantiate `SimilarityFromFile`, whose constructor is given the path of a cvs or tsv file. The `SimilarityFromFile` object will use the file's contents in a memory-efficient way, reading in chunks as they are used. \n\nTo illustrate passing a csv file, we re-use the counts_2b_1 and S_2b from above and save the latter as .csv files (note `index=False`, since the csv files should *not* contain row labels):\n```python\nS_2b_df.to_csv(\"S_2b.csv\", index=False)\n```\nthen we can build a metacommunity as follows\n```python\nfrom greylock.similarity import SimilarityFromFile\nmetacommunity_2b_1 = Metacommunity(counts_2b_1,\n                                   similarity=SimilarityFromFile('S_2b.csv', chunk_size=5))\n```\nThe optional `chunk_size` argument to `SimilarityFromFile`'s constructor specifies how many rows of the similarity matrix are read from the file at a time.\n\nAlternatively, to avoid a large footprint on either RAM or disk, the similarity matrix can be constructed and processed on the fly. \nA `SimilarityFromFunction` object generates a similarity matrix from a similarity function, and an array or `DataFrame` of features to `X`. Each row of X represents the feature values of a species. \nFor example, given numeric features all of the same type:\n\n```python\nfrom greylock.similarity import SimilarityFromFunction\n\nX = np.array([\n  [1, 2], \n  [3, 4], \n  [5, 6]\n])\n\ndef similarity_function(species_i, species_j):\n  return 1 / (1 + np.linalg.norm(species_i - species_j))\n\nmetacommunity = Metacommunity(np.array([[1, 1], [1, 0], [0, 1]]),\n                              similarity=SimilarityFromFunction(similarity_function,\n                                                               X=X, chunk_size=10))\n```\n\n(The optional `chunk_size` parameter specifies how many rows of the similarity matrix to generate at once; larger values should be faster, as long as the chunks are not too large\ncompared to available RAM.)\n\nIf there are features of various types, and it would be convenient to address features by name, features can be supplied in a DataFrame. 
If there are features of various types, and it would be convenient to address features by name, the features can be supplied in a `DataFrame`. (Note that, because named tuples are used to represent species in the similarity function, it is helpful if the column names are valid Python identifiers.)

```python
X = pd.DataFrame(
    {
        "breathes": ["water", "air", "air"],
        "covering": ["scales", "scales", "fur"],
        "n_legs": [0, 0, 4],
    },
    index=["tuna", "snake", "rabbit"],
)

def feature_similarity(animal_i, animal_j):
    if animal_i.breathes != animal_j.breathes:
        return 0.0
    if animal_i.covering == animal_j.covering:
        result = 1
    else:
        result = 0.5
    if animal_i.n_legs != animal_j.n_legs:
        result *= 0.5
    return result

metacommunity = Metacommunity(np.array([[1, 1], [1, 0], [0, 1]]),
                              similarity=SimilarityFromFunction(feature_similarity, X=X))
```

A two-fold speed-up is possible when the following (typical) conditions hold:

* The similarity matrix is symmetric (i.e. `similarity[i, j] == similarity[j, i]` for all `i` and `j`).
* The similarity of each species with itself is 1.0.
* The number of subcommunities is much smaller than the number of species.

In this case, there is no need to call the similarity function twice for each pair to calculate both `similarity[i, j]` and `similarity[j, i]`. Use the `SimilarityFromSymmetricFunction` class to get the same results in half the time:

```python
from greylock.similarity import SimilarityFromSymmetricFunction

metacommunity = Metacommunity(np.array([[1, 1], [1, 0], [0, 1]]),
                              similarity=SimilarityFromSymmetricFunction(feature_similarity, X=X))
```

The similarity function is only called for pairs of rows `species[i], species[j]` with $i < j$, and the similarity of $species_i$ to $species_j$ is re-used as the similarity of $species_j$ to $species_i$. Thus a nearly two-fold speed-up is possible if the similarity function is computationally expensive. (For a discussion of _nonsymmetric_ similarity, see [Leinster and Cobbold](https://doi.org/10.1890/10-2402.1).)

## Parallelization using the ray package

For very large datasets, the computation of the similarity matrix can be completed in a fraction of the time by parallelizing it over many cores or even over a Kubernetes cluster. Support for parallelizing this computation using the [`ray` package](https://pypi.org/project/ray/) is built into `greylock`. However, `ray` is an optional dependency, as it is not required for small datasets, and installing it into your environment may entail some conflicting-dependency issues. Thus, before trying to use `ray`, be sure to install the extra:

```
pip install 'greylock[ray]'
```

To actually use Ray, replace `SimilarityFromFunction` and `SimilarityFromSymmetricFunction` with `SimilarityFromRayFunction` and `SimilarityFromSymmetricRayFunction`, respectively. Each `chunk_size` rows of the similarity matrix are processed as a separate job. Thanks to this parallelization, up to an N-fold speedup is possible (where N is the number of cores or nodes).
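For instance, a minimal sketch re-using `feature_similarity` and `X` from above (this assumes, per the substitution just described, that the Ray-backed class takes the same arguments as `SimilarityFromSymmetricFunction`):

```python
from greylock.similarity import SimilarityFromSymmetricRayFunction

# Rows of the similarity matrix are computed by ray workers, chunk_size rows per job
metacommunity = Metacommunity(np.array([[1, 1], [1, 0], [0, 1]]),
                              similarity=SimilarityFromSymmetricRayFunction(
                                  feature_similarity, X=X, chunk_size=100))
```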
# Command-line usage

The `greylock` package can also be used from the command line as a module (via `python -m`). To illustrate using `greylock` this way, we re-use again the example with `counts_2b_1` and `S_2b`, now with `counts_2b_1` also saved as a csv file (note again `index=False`):

```python
counts_2b_1.to_csv("counts_2b_1.csv", index=False)
```

Then, from the command line:

`python -m greylock -i counts_2b_1.csv -s S_2b.csv -v 0 1 inf`

The output is a table with all the diversity indices for $q=0$, $1$, and $\infty$. Note that while .csv or .tsv files are acceptable as input, the output is always tab-delimited. The input filepath (`-i`) and the similarity-matrix filepath (`-s`) can be URLs to data files hosted on the web. Also note that values of $q>100$ are all calculated as $q=\infty$.

For further options, consult the help:

`python -m greylock -h`

# Applications

For applications of the `greylock` package to various fields (immunomics, metagenomics, medical imaging, and pathology), we refer to the Jupyter notebooks below:

- [Immunomics](https://github.com/ArnaoutLab/diversity_notebooks_and_data/blob/main/immunomics/immunomics_fig3.ipynb)
- [Metagenomics](https://github.com/ArnaoutLab/diversity_notebooks_and_data/blob/main/metagenomics/metagenomics_figs4-5.ipynb)
- [Medical imaging](https://github.com/ArnaoutLab/diversity_notebooks_and_data/blob/main/medical_imaging/medical_imaging_fig6-7.ipynb)
- [Pathology](https://github.com/ArnaoutLab/diversity_notebooks_and_data/blob/main/pathology/pathology_fig8.ipynb)

The examples in the Basic usage section are also available as a notebook [here](https://github.com/ArnaoutLab/diversity_notebooks_and_data/blob/main/fruits_and_animals/fruits_and_animals_fig1_2.ipynb). For more information, please see our [preprint](https://arxiv.org/abs/2401.00102).

# Alternatives

To date, we know of no other Python package that implements the partitioned frequency- and similarity-sensitive diversity measures defined by [Reeve et al.](https://arxiv.org/abs/1404.6520). However, there are an [R package](https://github.com/boydorr/rdiversity) and a [Julia package](https://github.com/EcoJulia/Diversity.jl).
    "bugtrack_url": null,
    "license": null,
    "summary": "A Python package for measuring the composition of complex datasets",
    "version": "1.0.0",
    "project_urls": {
        "Bug Tracker": "https://github.com/ArnaoutLab/diversity/issues",
        "Homepage": "https://github.com/ArnaoutLab/diversity"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "dd26e65c000e420cd8b4ebe324a8d8d9bff4a4d7b603b55e14fd2e171d0f9a47",
                "md5": "a05ba26da79a1bd59ebead08003ba327",
                "sha256": "2d443224954542c695b8c9839c47102bb42df7c687da867d3817a3bfd7f20faa"
            },
            "downloads": -1,
            "filename": "greylock-1.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "a05ba26da79a1bd59ebead08003ba327",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 41305,
            "upload_time": "2024-11-14T15:40:42",
            "upload_time_iso_8601": "2024-11-14T15:40:42.705723Z",
            "url": "https://files.pythonhosted.org/packages/dd/26/e65c000e420cd8b4ebe324a8d8d9bff4a4d7b603b55e14fd2e171d0f9a47/greylock-1.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5ca114e60507573f2d17447e161c8b855226cbe97513778c6398253da4a93dfa",
                "md5": "a86e1bf4ca9094d7567aa182def1c12e",
                "sha256": "6a2017807e300e6ddc065043ba1b218cf719abded0022dd55ea777f591fd6762"
            },
            "downloads": -1,
            "filename": "greylock-1.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "a86e1bf4ca9094d7567aa182def1c12e",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 833599,
            "upload_time": "2024-11-14T15:40:44",
            "upload_time_iso_8601": "2024-11-14T15:40:44.360741Z",
            "url": "https://files.pythonhosted.org/packages/5c/a1/14e60507573f2d17447e161c8b855226cbe97513778c6398253da4a93dfa/greylock-1.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-11-14 15:40:44",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ArnaoutLab",
    "github_project": "diversity",
    "travis_ci": false,
    "coveralls": true,
    "github_actions": true,
    "requirements": [],
    "tox": true,
    "lcname": "greylock"
}
        
Elapsed time: 0.36940s