texture-viz

Name	texture-viz
Version	0.0.5
home_page	https://github.com/cmudig/Texture
Summary	Process and profile text datasets interactively
upload_time	2024-12-03 16:34:59
maintainer	None
docs_url	None
author	Will Epperson
requires_python	<4.0,>=3.10
license	None
keywords	text nlp data profiling llm
requirements	No requirements were recorded.

# Texture: Structured Text Analytics

[![PyPi](https://img.shields.io/pypi/v/texture-viz.svg)](https://pypi.org/project/texture-viz/)

Texture is a system for exploring text datasets and deriving structured insights from them.

1. **Interactive attribute profiles**: Texture visualizes structured attributes alongside your text data in interactive, cross-filterable charts.
2. **Flexible attribute definitions**: Attribute charts can draw on different tables and on any level of a document, such as words, sentences, or whole documents.
3. **Embedding-based operations**: Texture uses vector embeddings to help you search for similar text and summarize your data.

![screenshot of Texture interface](.github/screenshots/texture_sc.png)

## Install and run

Install Texture with pip:

```bash
pip install texture-viz
```

Then run Texture from a Python script or notebook by passing it a DataFrame that contains your text data and attributes.

```python
import texture

# `df` is a pandas DataFrame holding your text column(s) and attributes
texture.run(df)
```
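If you do not have a dataset at hand, a small in-memory DataFrame is enough to try the call above. The sketch below is only an illustration: the column names and values are made up, and `launch` is a hypothetical wrapper (not part of Texture) that imports the package lazily so the DataFrame can be built and inspected first.

```python
import pandas as pd

# A tiny, made-up dataset: one text column plus a few structured attributes
reviews = pd.DataFrame(
    {
        "id": [0, 1, 2],
        "review": [
            "The charts are easy to read.",
            "Loading the data took too long.",
            "Filtering by rating works well.",
        ],
        "rating": [5, 2, 4],
        "source": ["web", "app", "web"],
    }
)


def launch(frame: pd.DataFrame) -> None:
    """Open the Texture interface for `frame` (requires texture-viz)."""
    import texture  # imported lazily so the DataFrame can be built without it

    texture.run(frame)


# launch(reviews)  # uncomment to start the app
```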

## Texture Configuration

You can optionally pass arguments to the [`run`](./texture/runner.py) function to configure the interface. The configuration options are:

- `data: pd.DataFrame`: The DataFrame to parse and visualize.
- `schema`: A dataset schema describing the columns, types, and tables. It is calculated automatically if none is provided.
- `load_tables: Dict[str, pd.DataFrame]`: A dictionary of tables to load into the schema. Each key is a table name and each value is the corresponding DataFrame.
- `create_new_embedding_func`: A function that takes a string and returns a vector embedding (see the example below).
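To make the `create_new_embedding_func` contract concrete (a string goes in, a vector comes out), here is a minimal stand-in that needs no model download. It is a sketch for wiring things up, not a real embedding: the vectors are deterministic but carry no semantic meaning. The name `toy_embedding`, the `dim` parameter, and the NumPy array return type are assumptions of this example.

```python
import hashlib

import numpy as np


def toy_embedding(value: str, dim: int = 16) -> np.ndarray:
    """Map a string to a deterministic unit-length vector (no semantics)."""
    # Derive a reproducible seed from the text itself
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "little")
    # Draw a pseudo-random vector from that seed and normalize it
    rng = np.random.default_rng(seed)
    vec = rng.standard_normal(dim)
    return vec / np.linalg.norm(vec)


vec = toy_embedding("visual analytics")
```

The same string always maps to the same vector, which makes this convenient for quick tests. Swap in a real model before relying on similarity search.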

There are several reserved column names in the main table that are used in the interface:

- `id`: A unique identifier for each row.
- `vector`: A column containing embeddings for the text data.
- `umap_x` and `umap_y`: Columns containing 2D projections of the embeddings.
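Before launching, it can help to check which of these reserved names your main table already uses. The helper below is a hypothetical convenience written for this example, not a Texture function, and it makes no claim about which of the columns Texture requires.

```python
import pandas as pd

# Reserved column names listed above
RESERVED = ("id", "vector", "umap_x", "umap_y")


def describe_reserved_columns(frame: pd.DataFrame) -> dict:
    """Report which reserved columns exist and whether `id` is unique."""
    present = {name: name in frame.columns for name in RESERVED}
    id_unique = bool(frame["id"].is_unique) if present["id"] else None
    return {"present": present, "id_unique": id_unique}


# Made-up sample table with an id and a vector column but no projection
sample = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "text": ["a", "b", "c"],
        "vector": [[0.1, 0.2]] * 3,
    }
)
report = describe_reserved_columns(sample)
```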

Texture provides preprocessing functions to calculate embeddings, projections, and word tables. Use them to prepare your data before launching the Texture app.

```python
import pandas as pd
import texture
from texture.models import DatasetSchema, Column, DerivedSchema

P = "https://raw.githubusercontent.com/cmudig/Texture/main/examples/vis_papers/"

df_main = pd.read_parquet(P + "1_main.parquet")
df_words = pd.read_parquet(P + "2_words.parquet")
df_authors = pd.read_parquet(P + "3_authors.parquet")
df_keywords = pd.read_parquet(P + "4_keywords.parquet")

load_tables = {
    "main_table": df_main,
    "words_table": df_words,
    "authors_table": df_authors,
    "keywords_table": df_keywords,
}

# Create schema for the dataset that decides how the data will be visualized
schema = DatasetSchema(
    name="main_table",
    columns=[
        Column(name="Title", type="text"),
        Column(name="Abstract", type="text"),
        Column(
            name="word",
            type="categorical",
            derivedSchema=DerivedSchema(
                is_segment=True,
                table_name="words_table",
                derived_from="Abstract",
                derived_how=None,
            ),
        ),
        Column(
            name="pos",
            type="categorical",
            derivedSchema=DerivedSchema(
                is_segment=True,
                table_name="words_table",
                derived_from="Abstract",
                derived_how=None,
            ),
        ),
        Column(
            name="author",
            type="categorical",
            derivedSchema=DerivedSchema(
                is_segment=False,
                table_name="authors_table",
                derived_from=None,
                derived_how=None,
            ),
        ),
        Column(
            name="keyword",
            type="categorical",
            derivedSchema=DerivedSchema(
                is_segment=False,
                table_name="keywords_table",
                derived_from=None,
                derived_how=None,
            ),
        ),
        Column(name="Year", type="number"),
        Column(name="Conference", type="categorical"),
        Column(name="PaperType", type="categorical"),
        Column(name="CitationCount_CrossRef", type="number"),
        Column(name="Award", type="categorical"),
    ],
    primary_key=Column(name="id", type="number"),
    origin="uploaded",
    has_embeddings=True,
    has_projection=True,
)

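from functools import lru_cache

@lru_cache(maxsize=1)
def load_model():
    # Load the sentence-transformers model once and reuse it across calls
    import sentence_transformers

    return sentence_transformers.SentenceTransformer("all-mpnet-base-v2")

def get_embedding(value: str):
    # Encode a single string into a vector embedding
    return load_model().encode(value)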

texture.run(
    schema=schema, load_tables=load_tables, create_new_embedding_func=get_embedding
)
```
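In the schema above, `word` and `pos` live in a separate `words_table` derived from `Abstract`. As a rough illustration of what a segment-level table could look like, the sketch below splits a text column into one row per word with plain pandas. The table layout (an `id` pointing back to the main row plus a `word` column) and the whitespace tokenization are assumptions made for this example. They are not necessarily what Texture's own preprocessing produces.

```python
import pandas as pd

# Made-up main table with an id and a text column
toy_main = pd.DataFrame(
    {
        "id": [0, 1],
        "Abstract": ["Charts help people", "People read text"],
    }
)

# One row per word, keeping the id of the document each word came from
toy_words = (
    toy_main.assign(word=toy_main["Abstract"].str.lower().str.split())
    .explode("word")[["id", "word"]]
    .reset_index(drop=True)
)
```

Each row of `toy_words` can be joined back to its source document through `id`, which is what lets a word-level chart filter the document-level view.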

## Dev install

See [DEV.md](DEV.md) for dev workflows and setup.

Raw data

{
    "_id": null,
    "home_page": "https://github.com/cmudig/Texture",
    "name": "texture-viz",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4.0,>=3.10",
    "maintainer_email": null,
    "keywords": "text, nlp, data profiling, llm",
    "author": "Will Epperson",
    "author_email": "willepp@live.com",
    "download_url": "https://files.pythonhosted.org/packages/df/e5/a7dfe1d876e307332a060fd685d4586c190753edf8ce59a661bd8fd6e316/texture_viz-0.0.5.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "Process and profile text datasets interactively",
    "version": "0.0.5",
    "project_urls": {
        "Homepage": "https://github.com/cmudig/Texture",
        "Repository": "https://github.com/cmudig/Texture"
    },
    "split_keywords": [
        "text",
        " nlp",
        " data profiling",
        " llm"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "790a823f57c15fe4884c153b3ce75252576eb1e001ab7034ce6a9b01b443d593",
                "md5": "da669798c3a38823caa27059d0c96376",
                "sha256": "f15bbd6c302166de0c69a4f75079bb01ea4aa2329779f1ca83b4ea47ec40d947"
            },
            "downloads": -1,
            "filename": "texture_viz-0.0.5-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "da669798c3a38823caa27059d0c96376",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.10",
            "size": 3200893,
            "upload_time": "2024-12-03T16:34:53",
            "upload_time_iso_8601": "2024-12-03T16:34:53.942911Z",
            "url": "https://files.pythonhosted.org/packages/79/0a/823f57c15fe4884c153b3ce75252576eb1e001ab7034ce6a9b01b443d593/texture_viz-0.0.5-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "dfe5a7dfe1d876e307332a060fd685d4586c190753edf8ce59a661bd8fd6e316",
                "md5": "1a43b63312aabc6dbb35fbc1554871f8",
                "sha256": "172b36d11924444e73236cc5b1bb6618c691a52e1d8985a225f149b290a530b6"
            },
            "downloads": -1,
            "filename": "texture_viz-0.0.5.tar.gz",
            "has_sig": false,
            "md5_digest": "1a43b63312aabc6dbb35fbc1554871f8",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.10",
            "size": 3128725,
            "upload_time": "2024-12-03T16:34:59",
            "upload_time_iso_8601": "2024-12-03T16:34:59.362463Z",
            "url": "https://files.pythonhosted.org/packages/df/e5/a7dfe1d876e307332a060fd685d4586c190753edf8ce59a661bd8fd6e316/texture_viz-0.0.5.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-12-03 16:34:59",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "cmudig",
    "github_project": "Texture",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "texture-viz"
}
