sketch


Name: sketch
Version: 0.5.2
Summary: Compute, store and operate on data sketches
Upload time: 2024-02-08 15:17:33
Requires Python: >=3.8
License: MIT License Copyright (c) 2023 Justin Waugh, Mike Biven Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Keywords: data, sketch, model, etl, automatic, join, ai, embedding, profiling
[![](https://dcbadge.vercel.app/api/server/kW9nBQErGe?compact=true&style=flat)](https://discord.gg/kW9nBQErGe)

# sketch

Sketch is an AI code-writing assistant for pandas users that understands the context of your data, greatly improving the relevance of suggestions. Sketch is usable in seconds and doesn't require adding a plugin to your IDE.

```bash
pip install sketch
```

## Demo 

Here we follow a "standard" (hypothetical) data-analysis workflow, showing a natural-language interface that handles tasks across the data stack:

- Data Cataloging:
  - General tagging (e.g. PII identification)
  - Metadata generation (names and descriptions)
- Data Engineering:
  - Data cleaning and masking (compliance)
  - Derived feature creation and extraction
- Data Analysis:
  - Data questions
  - Data visualization

https://user-images.githubusercontent.com/916073/212602281-4ebd090f-09c4-495d-b48d-0b4c37b9f665.mp4

Try it out in Colab: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/bluecoconut/410a979d94613ea2aaf29987cf0233bc/sketch-demo.ipynb)

## How to use

It's as simple as importing sketch, and then using the `.sketch` extension on any pandas dataframe.

```python
import sketch
```

Now every pandas dataframe has the extension registered on it. Access it by appending `.sketch` to the dataframe's name (for example, `df.sketch`).

### `.sketch.ask`

`ask` is a basic question-answering interface in sketch. It returns a text answer based on the summary statistics and description of the data.

Use ask to get an understanding of the data, get better column names, ask hypotheticals (how would I go about doing X with this data), and more.

```python
df.sketch.ask("Which columns are integer type?")
```

### `.sketch.howto`

`howto` is the basic "code-writing" prompt in sketch. It returns a code block that you should be able to copy and paste as a starting point (or possibly the final answer!) for any question you have about the data. Ask it how to clean the data, normalize it, create new features, plot, and even build models!

```python
df.sketch.howto("Plot the sales versus time")
```

### `.sketch.apply`

`apply` is a more advanced prompt, best suited to data generation. Use it to parse fields, generate new features, and more. It is built directly on [lambdaprompt](https://github.com/approximatelabs/lambdaprompt). To use it, you will need to set up a free account with OpenAI and set an environment variable with your API key: `OPENAI_API_KEY=YOUR_API_KEY`

```python
df['review_keywords'] = df.sketch.apply("Keywords for the review [{{ review_text }}] of product [{{ product_name }}] (comma separated):")
```

```python
import pandas as pd

states = pd.DataFrame({'State': ['Colorado', 'Kansas', 'California', 'New York']})
states['capital'] = states.sketch.apply("What is the capital of [{{ State }}]?")
```
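To make the `{{ column }}` placeholders concrete, here is a minimal sketch of how one prompt per row could be built from a template. It is an illustration only: the helper `render_row` and the sample row are hypothetical, and a plain regular expression stands in for whatever templating lambdaprompt actually performs.

```python
import re

# Matches placeholders such as {{ review_text }} (whitespace inside the braces is optional).
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_row(template, row):
    """Fill each {{ column }} placeholder with that column's value from one row."""
    return PLACEHOLDER.sub(lambda match: str(row[match.group(1)]), template)


# Hypothetical sample data: one dict per dataframe row.
rows = [
    {"review_text": "Great battery life", "product_name": "Phone X"},
    {"review_text": "Stopped working after a week", "product_name": "Kettle Y"},
]
template = "Keywords for the review [{{ review_text }}] of product [{{ product_name }}] (comma separated):"

# One fully rendered prompt per row; each would be sent to the language model separately.
prompts = [render_row(template, row) for row in rows]
for prompt in prompts:
    print(prompt)
```

Under this reading, `apply` produces one model call per row, so cost and runtime grow with the number of rows.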

## Sketch currently uses `prompts.approx.dev` to help run with minimal setup

You can also use a few pre-built Hugging Face models directly (right now `MPT-7B` and `StarCoder`), which run entirely locally once you have downloaded the model weights from Hugging Face.
Do this by setting three environment variables:

```python
import os

os.environ['LAMBDAPROMPT_BACKEND'] = 'StarCoder'
os.environ['SKETCH_USE_REMOTE_LAMBDAPROMPT'] = 'False'
os.environ['HF_ACCESS_TOKEN'] = 'your_hugging_face_token'
```

You can also call OpenAI directly (bypassing our endpoint) by using your own API key. To do this, set two environment variables:

1. `SKETCH_USE_REMOTE_LAMBDAPROMPT=False`
2. `OPENAI_API_KEY=YOUR_API_KEY`
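In a notebook, these two variables can be set from Python in the same way as the Hugging Face settings. This is a minimal sketch: the variable names come from the text above, `YOUR_API_KEY` is a placeholder, and setting them before the first prompt is issued is an assumption about when they are read.

```python
import os

# Use your own OpenAI key instead of the hosted endpoint.
os.environ["SKETCH_USE_REMOTE_LAMBDAPROMPT"] = "False"

# setdefault keeps a key that is already exported in the shell;
# "YOUR_API_KEY" is only a placeholder and must be replaced with a real key.
os.environ.setdefault("OPENAI_API_KEY", "YOUR_API_KEY")
```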

## How it works

Sketch uses efficient approximation algorithms (data sketches) to quickly summarize your data, and feeds those summaries into language models. Right now it does this by summarizing each column and writing the summary statistics as additional context for the code-writing prompt. In the future we hope to feed these sketches directly into custom-made "data + language" foundation models to get more accurate results.
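As a rough illustration of this idea (not Sketch's actual implementation), the snippet below summarizes each column with a small data sketch and renders the result as text that could be prepended to a prompt. Everything here is an assumption made for the example: the function names `kmv_distinct_estimate`, `summarize_column`, and `build_context` are hypothetical, and the k-minimum-values estimator stands in for whichever sketches the library really computes.

```python
import hashlib
import heapq


def _hash_to_unit(value):
    """Map a value to a pseudo-random float in [0, 1) using a stable hash."""
    digest = hashlib.sha1(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_distinct_estimate(values, k=64):
    """Estimate the distinct count while remembering only the k smallest hashes."""
    heap = []     # max-heap (stored negated) holding the k smallest hashes seen
    kept = set()  # the same hashes, for constant-time duplicate checks
    for value in values:
        h = _hash_to_unit(value)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # Replace the largest retained hash with the new, smaller one.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct hashes: the count is exact
    # Standard KMV estimator: (k - 1) divided by the k-th smallest hash.
    return int((k - 1) / -heap[0])


def summarize_column(name, values, k=64):
    """Collect a few summary statistics for one column."""
    non_null = [v for v in values if v is not None]
    summary = {
        "column": name,
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "approx_distinct": kmv_distinct_estimate(non_null, k),
    }
    if non_null and all(isinstance(v, (int, float)) for v in non_null):
        summary["min"] = min(non_null)
        summary["max"] = max(non_null)
    return summary


def build_context(table):
    """Render column summaries as plain text to place ahead of a question."""
    lines = []
    for name, values in table.items():
        summary = summarize_column(name, values)
        lines.append(", ".join(f"{key}={val}" for key, val in summary.items()))
    return "Column summaries:\n" + "\n".join(lines)


# Hypothetical table, stored as a dict of column name to list of values.
table = {"price": [1.5, 2.0, None], "city": ["Denver", "Topeka", "Denver"]}
print(build_context(table))
```

The point of the sketch-based approach is that the summary stays a fixed size no matter how many rows the column has, so the text added to the prompt does not grow with the data.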


            

Raw data

{
    "_id": null,
    "home_page": "",
    "name": "sketch",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "",
    "keywords": "data,sketch,model,etl,automatic,join,ai,embedding,profiling",
    "author": "",
    "author_email": "",
    "download_url": "https://files.pythonhosted.org/packages/a1/f7/3100729c4ef68b534a2e68d42f1fdec5e5770af6e6053e094b6b45f45bca/sketch-0.5.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT License  Copyright (c) 2023 Justin Waugh, Mike Biven  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
    "summary": "Compute, store and operate on data sketches",
    "version": "0.5.2",
    "project_urls": {
        "homepage": "https://github.com/approximatelabs/sketch"
    },
    "split_keywords": [
        "data",
        "sketch",
        "model",
        "etl",
        "automatic",
        "join",
        "ai",
        "embedding",
        "profiling"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "863cbb45a67be6d3272d3712b67a854c012c5ca495c0d4ddd0f7345944081dd6",
                "md5": "d18f6c30f29f340b9bff86e826fb083f",
                "sha256": "41d2bf14575a5cf5446b6ef1bc787cd7b8a4a6453aac339215c5e5ca1715cf47"
            },
            "downloads": -1,
            "filename": "sketch-0.5.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "d18f6c30f29f340b9bff86e826fb083f",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 16977,
            "upload_time": "2024-02-08T15:17:31",
            "upload_time_iso_8601": "2024-02-08T15:17:31.593104Z",
            "url": "https://files.pythonhosted.org/packages/86/3c/bb45a67be6d3272d3712b67a854c012c5ca495c0d4ddd0f7345944081dd6/sketch-0.5.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a1f73100729c4ef68b534a2e68d42f1fdec5e5770af6e6053e094b6b45f45bca",
                "md5": "852958ff3bfc23c235d81c35f291a351",
                "sha256": "45bfd2d41ad4939a0c8ad8ad86cef00ea1709a5e3c4c32e2ed20d255c0a09b9f"
            },
            "downloads": -1,
            "filename": "sketch-0.5.2.tar.gz",
            "has_sig": false,
            "md5_digest": "852958ff3bfc23c235d81c35f291a351",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 20158,
            "upload_time": "2024-02-08T15:17:33",
            "upload_time_iso_8601": "2024-02-08T15:17:33.382413Z",
            "url": "https://files.pythonhosted.org/packages/a1/f7/3100729c4ef68b534a2e68d42f1fdec5e5770af6e6053e094b6b45f45bca/sketch-0.5.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-02-08 15:17:33",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "approximatelabs",
    "github_project": "sketch",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "sketch"
}
        