palimpzest

- Name: palimpzest
- Version: 0.2.0
- Summary: Palimpzest is a system which enables anyone to process AI-powered analytical queries simply by defining them in a declarative language
- Upload time: 2024-05-30 03:03:32
- Requires Python: >=3.8
- Keywords: relational, optimization, llm, AI programming, extraction, tools, document, search, integration
- Requirements: no requirements were recorded.

![pz-banner](logos/palimpzest-cropped.png)

# Palimpzest (PZ)
- **Read our (pre-print) paper:** [**read the paper**](https://arxiv.org/pdf/2405.14696)
- Read our short blog post: [read the blog post](https://dsg.csail.mit.edu/projects/palimpzest/)
- Check out our Colab Demo: [colab demo](https://colab.research.google.com/drive/1zqOxnh_G6eZ8_xax6PvDr-EjMt7hp4R5?usp=sharing)

# Getting started
You can install the Palimpzest package and CLI on your machine by cloning this repository and running:
```bash
$ git clone git@github.com:mikecafarella/palimpzest.git
$ cd palimpzest
$ pip install .
```
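
The package metadata declares `requires_python >=3.8`, so checking the interpreter before installing can save a failed install. A minimal sketch; the helper name `meets_minimum` is illustrative and not part of Palimpzest:

```python
import sys

# Minimum interpreter version, taken from the package metadata (requires_python >=3.8).
MINIMUM_PYTHON = (3, 8)


def meets_minimum(version_info=None, minimum=MINIMUM_PYTHON):
    """Return True if a (major, minor, ...) version tuple satisfies the minimum.

    `meets_minimum` is a hypothetical helper, not a Palimpzest function.
    """
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor); patch releases do not matter here.
    return tuple(version_info[:2]) >= tuple(minimum)
```

Calling `meets_minimum()` with no arguments checks the interpreter that is running the script.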


## Palimpzest CLI
Installing Palimpzest also installs its CLI tool, `pz`, which provides users with basic utilities for creating and managing their own Palimpzest system. Running `pz --help` displays an overview of the CLI's commands:
```bash
$ pz --help
Usage: pz [OPTIONS] COMMAND [ARGS]...

  The CLI tool for Palimpzest.

Options:
  --help  Show this message and exit.

Commands:
  help (h)                        Print the help message for PZ.
  init (i)                        Initialize data directory for PZ.
  ls-data (ls,lsdata)             Print a table listing the datasets
                                  registered with PZ.
  register-data (r,reg,register)  Register a data file or data directory with
                                  PZ.
  rm-data (rm,rmdata)             Remove a dataset that was registered with
                                  PZ.
```

Users can initialize their own system by running `pz init`. This will create Palimpzest's working directory in `~/.palimpzest`:
```bash
$ pz init
Palimpzest system initialized in: /Users/matthewrusso/.palimpzest
```

If we list the set of datasets registered with Palimpzest, we'll see there currently are none:
```bash
$ pz ls
+------+------+------+
| Name | Type | Path |
+------+------+------+
+------+------+------+

Total datasets: 0
```

### Registering Datasets
To add (or "register") a dataset with Palimpzest, we can use the `pz register-data` command (also aliased as `pz reg`) to specify that a file or directory at a given `--path` should be registered as a dataset with the specified `--name`:
```bash
$ pz reg --path README.md --name rdme
Registered rdme
```

If we list Palimpzest's datasets again we will see that `README.md` has been registered under the dataset named `rdme`:
```bash
$ pz ls
+------+------+------------------------------------------+
| Name | Type |                   Path                   |
+------+------+------------------------------------------+
| rdme | file | /Users/matthewrusso/palimpzest/README.md |
+------+------+------------------------------------------+

Total datasets: 1
```

To remove a dataset from Palimpzest, simply use the `pz rm-data` command (also aliased as `pz rm`) and specify the `--name` of the dataset you would like to remove:
```bash
$ pz rm --name rdme
Deleted rdme
```

Finally, listing our datasets once more will show that the dataset has been deleted:
```bash
$ pz ls
+------+------+------+
| Name | Type | Path |
+------+------+------+
+------+------+------+

Total datasets: 0
```
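
The register, list, and remove steps above can also be driven from a script. The sketch below only builds the argument lists for the `pz` commands shown in this section and hands them to `subprocess.run`. The helper names are hypothetical, and it assumes `pz` is on your `PATH`:

```python
import subprocess


def register_cmd(path, name):
    """Argument list for `pz reg --path PATH --name NAME`."""
    return ["pz", "reg", "--path", str(path), "--name", name]


def remove_cmd(name):
    """Argument list for `pz rm --name NAME`."""
    return ["pz", "rm", "--name", name]


def list_cmd():
    """Argument list for `pz ls`."""
    return ["pz", "ls"]


def run(cmd, runner=subprocess.run):
    """Run a command; with subprocess.run, a non-zero exit raises CalledProcessError."""
    return runner(cmd, check=True)
```

For example, `run(register_cmd("README.md", "rdme"))` followed by `run(list_cmd())` corresponds to the first two steps above.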

### Cache Management
Palimpzest will cache intermediate results by default. It can be useful to remove them from the cache when trying to evaluate the performance improvement(s) of code changes. We provide a utility command `pz clear-cache` (also aliased as `pz clr`) to clear the cache:
```bash
$ pz clr
Cache cleared
```
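
When comparing timings before and after a code change, clearing the cache before each measurement keeps cached intermediate results from skewing the numbers. A minimal sketch, assuming `pz` is on your `PATH`; `time_cold_run` and the `workload` callable are hypothetical names, not part of Palimpzest:

```python
import subprocess
import time


def clear_cache(runner=subprocess.run):
    """Invoke `pz clr`, the alias for `pz clear-cache` shown above."""
    runner(["pz", "clr"], check=True)


def time_cold_run(workload, clear=clear_cache, clock=time.perf_counter):
    """Clear the cache, run `workload()`, and return (result, elapsed_seconds).

    `workload` is any zero-argument callable that runs your Palimpzest job.
    """
    clear()
    start = clock()
    result = workload()
    return result, clock() - start
```

Running `time_cold_run` once on the old code and once on the new code gives two measurements that both start from an empty cache.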

### Config Management
You may wish to work with multiple configurations of Palimpzest in order to, e.g., evaluate the difference in performance between various LLM services for your data extraction task. To see the config Palimpzest is currently using, you can run the `pz print-config` command (also aliased as `pz config`):
```bash
$ pz config
--- default ---
filecachedir: /some/local/filepath
llmservice: openai
name: default
parallel: false
```
By default, Palimpzest uses the configuration named `default`. As shown above, if you run a script using Palimpzest out-of-the-box, it will use OpenAI endpoints for all of its API calls.

Now, let's say you wanted to try using [together.ai](https://www.together.ai/) for your API calls. You could do this by creating a new config with the `pz create-config` command (also aliased as `pz cc`):
```bash
$ pz cc --name together-conf --llmservice together --parallel True --set
Created and set config: together-conf
```
The `--name` parameter is required and specifies the unique name for your config. The `--llmservice` and `--parallel` options specify the service to use and whether or not to process files in parallel. Finally, if the `--set` flag is present, Palimpzest will update its current config to point to the newly created config.

We can confirm that Palimpzest checked out our new config by running `pz config`:
```bash
$ pz config
--- together-conf ---
filecachedir: /some/local/filepath
llmservice: together
name: together-conf
parallel: true
```

You can switch which config you are using at any time by using the `pz set-config` command (also aliased as `pz set`):
```bash
$ pz set --name default
Set config: default

$ pz config
--- default ---
filecachedir: /some/local/filepath
llmservice: openai
name: default
parallel: false

$ pz set --name together-conf
Set config: together-conf

$ pz config
--- together-conf ---
filecachedir: /some/local/filepath
llmservice: together
name: together-conf
parallel: true
```

Finally, you can delete a config with the `pz rm-config` command (also aliased as `pz rmc`):
```bash
$ pz rmc --name together-conf
Deleted config: together-conf
```
Note that you cannot delete the `default` config, and if you delete the config that you currently have set, Palimpzest will set the current config to be `default`.
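
The rules in this section (a `default` config always exists, one config is current at a time, and removing the current config falls back to `default`) can be summarized with a toy in-memory model. This illustrates the described behavior only and is not Palimpzest's implementation; the `ConfigStore` class, its method names, and its error handling for duplicate or unknown names are assumptions:

```python
class ConfigStore:
    """Toy model of the config rules described above (not Palimpzest code)."""

    def __init__(self):
        # A `default` config always exists and is initially the current one.
        self.configs = {"default": {"llmservice": "openai", "parallel": False}}
        self.current = "default"

    def create(self, name, llmservice, parallel, set_current=False):
        """Model `pz cc --name ... [--set]`."""
        if name in self.configs:
            # Assumption: a duplicate name is rejected.
            raise ValueError(f"config {name!r} already exists")
        self.configs[name] = {"llmservice": llmservice, "parallel": parallel}
        if set_current:
            self.current = name

    def set(self, name):
        """Model `pz set --name ...`."""
        if name not in self.configs:
            # Assumption: an unknown name is rejected.
            raise KeyError(name)
        self.current = name

    def remove(self, name):
        """Model `pz rmc --name ...`: `default` cannot be deleted."""
        if name == "default":
            raise ValueError("the default config cannot be deleted")
        del self.configs[name]
        if self.current == name:
            # Deleting the current config falls back to `default`.
            self.current = "default"
```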

## Configuring for Parallel Execution

There are a few things you need to do in order to use remote parallel services.

If you want to use parallel LLM execution on together.ai, you have to modify the config.yaml (by default, Palimpzest uses `~/.palimpzest/config_default.yaml`) so that `llmservice: together` and `parallel: True` are set.

If you want to use parallel PDF processing at modal.com, you have to:
1. Set `pdfprocessing: modal` in the config.yaml file.
2. Run `modal deploy src/palimpzest/tools/allenpdf.py`. This will remotely install the Modal function so you can run it. (It is probably already installed there, but do this just in case, and do it again whenever the server-side function inside that file changes.)
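
The config edits described in this section can also be scripted. The sketch below rewrites a flat `key: value` file using only the standard library. It assumes the config file contains only simple top-level keys like those printed by `pz config`, and the function name `update_flat_config` is hypothetical. For anything more complex, a real YAML parser would be the safer choice.

```python
def update_flat_config(text, updates):
    """Return `text` with top-level `key: value` lines set from `updates`.

    Existing keys are rewritten in place; missing keys are appended.
    Assumes a flat file with one `key: value` pair per line.
    """
    remaining = dict(updates)
    lines = []
    for line in text.splitlines():
        key, sep, _ = line.partition(":")
        key = key.strip()
        if sep and key in remaining:
            # Replace the value of a key that is being updated.
            lines.append(f"{key}: {remaining.pop(key)}")
        else:
            lines.append(line)
    # Append any keys that were not already present in the file.
    lines.extend(f"{key}: {value}" for key, value in remaining.items())
    return "\n".join(lines) + "\n"
```

Passing the text of `~/.palimpzest/config_default.yaml` through `update_flat_config(text, {"llmservice": "together", "parallel": "true", "pdfprocessing": "modal"})` and writing the result back would apply the settings named above; keeping a backup of the original file first is a sensible precaution.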


## Python Demo

Below are simple instructions to run `pz` on a test data set of Enron emails that is included with the system:

- Initialize the configuration by running `pz init`.

- Add the Enron data set with:
  `pz reg --path testdata/enron-tiny --name enron-tiny`
  then run it through the test program with:
  `tests/simpleDemo.py --task enron --datasetid enron-tiny`

- Add the test paper set with:
  `pz reg --path testdata/pdfs-tiny --name pdfs-tiny`
  then run it through the test program with:
  `tests/simpleDemo.py --task paper --datasetid pdfs-tiny`

- Palimpzest defaults to using OpenAI, so you'll need to export the `OPENAI_API_KEY` environment variable before running the demos.
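
The demo steps above can be sketched as a small driver script. It checks that `OPENAI_API_KEY` is set, then builds the same commands listed above. The function names are hypothetical, and launching the demo script through the current Python interpreter (`sys.executable`) is an assumption; the instructions above list only `tests/simpleDemo.py` and its flags.

```python
import os
import subprocess
import sys

# Dataset path and dataset name for the two demo tasks listed above.
DEMOS = {
    "enron": ("testdata/enron-tiny", "enron-tiny"),
    "paper": ("testdata/pdfs-tiny", "pdfs-tiny"),
}


def demo_commands(task):
    """Return the register command and the demo command for one task."""
    path, name = DEMOS[task]
    register = ["pz", "reg", "--path", path, "--name", name]
    # Assumption: the demo script is launched with the current interpreter.
    launch = [sys.executable, "tests/simpleDemo.py", "--task", task, "--datasetid", name]
    return [register, launch]


def run_demo(task, environ=os.environ, runner=subprocess.run):
    """Fail early if the API key is missing, then run each command in order."""
    if not environ.get("OPENAI_API_KEY"):
        raise RuntimeError("export OPENAI_API_KEY before running the demo")
    for cmd in demo_commands(task):
        runner(cmd, check=True)
```

Run from the repository root after `pz init`, `run_demo("enron")` would register the Enron data set and then start the demo on it.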



            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "palimpzest",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "relational, optimization, llm, AI programming, extraction, tools, document, search, integration",
    "author": null,
    "author_email": "MIT DSG Semantic Management Lab <michjc@csail.mit.edu>",
    "download_url": "https://files.pythonhosted.org/packages/49/e1/2ef4e61f904a0523acf53b053a50f91ec13d425d9f70eb0867fe6976f244/palimpzest-0.2.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "Palimpzest is a system which enables anyone to process AI-powered analytical queries simply by defining them in a declarative language",
    "version": "0.2.0",
    "project_urls": {
        "homepage": "https://github.com/mitdbg/palimpzest/",
        "repository": "https://https://github.com/mitdbg/palimpzest/"
    },
    "split_keywords": [
        "relational",
        " optimization",
        " llm",
        " ai programming",
        " extraction",
        " tools",
        " document",
        " search",
        " integration"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "516c842bac1e37953f5a2c828ba700d65a571516746244a6b363162738f43008",
                "md5": "b585138b9bc80aa666f648026e0e1064",
                "sha256": "f2ef980a960add9e95fd9b64870d1dfa46c9bf53ff2e946f59eb59a355dccca4"
            },
            "downloads": -1,
            "filename": "palimpzest-0.2.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "b585138b9bc80aa666f648026e0e1064",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 106569,
            "upload_time": "2024-05-30T03:03:30",
            "upload_time_iso_8601": "2024-05-30T03:03:30.954705Z",
            "url": "https://files.pythonhosted.org/packages/51/6c/842bac1e37953f5a2c828ba700d65a571516746244a6b363162738f43008/palimpzest-0.2.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "49e12ef4e61f904a0523acf53b053a50f91ec13d425d9f70eb0867fe6976f244",
                "md5": "bea12636c68fa18785e04407c81a7a29",
                "sha256": "0c0692e2580dae4d2da5a1c5f08dab7e3e4f7e8b7fe8d08c38d567363fc97d30"
            },
            "downloads": -1,
            "filename": "palimpzest-0.2.0.tar.gz",
            "has_sig": false,
            "md5_digest": "bea12636c68fa18785e04407c81a7a29",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 95341,
            "upload_time": "2024-05-30T03:03:32",
            "upload_time_iso_8601": "2024-05-30T03:03:32.915936Z",
            "url": "https://files.pythonhosted.org/packages/49/e1/2ef4e61f904a0523acf53b053a50f91ec13d425d9f70eb0867fe6976f244/palimpzest-0.2.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-05-30 03:03:32",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "mitdbg",
    "github_project": "palimpzest",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "palimpzest"
}
        