evalgen


Name: evalgen
Version: 0.1.3
Home page: https://github.com/scribbledata/evalgen
Summary: Generate eval datasets from arbitrary sources
Upload time: 2024-08-12 17:05:41
Author: Scribble Data, Inc
Requires Python: >=3.10
License: MIT

# EvalGen

EvalGen is a Python package designed to generate evaluation datasets from various sources. It includes modules for database access, specification generation, and integration with OpenAI's language models.

The idea is to point the tool at a file or database and generate a
transformation specification that can be applied repeatedly to
generate updated datasets as new data comes in.

---

**NOTE**

Not ready for production use. Work has just started on this. If this work is of interest to you, drop a note to [pingali@scribbledata.io](mailto:pingali@scribbledata.io)

---

## Features
- Connect to multiple sources, including databases and files
- Apply custom transformations to these sources
- Generate a transformation specification once using an LLM and apply it repeatedly
- Multiple LLMs supported
- Most of this repo's code is written by an LLM :)


## Installation
1. Clone the repository:
   ```bash
   git clone https://github.com/scribbledata/evalgen.git
   cd evalgen
   ```

## Execution
```bash
$ evalgen
Usage: evalgen [OPTIONS] COMMAND [ARGS]...

  EvalGen CLI: A command-line interface for generating and applying data
  transformation specifications.

Options:
  --help  Show this message and exit.

Commands:
  apply-spec     Apply a specification to transform data.
  generate-spec  Generate a specification by interacting with the user to..
```
## Example 1

First, generate the code snippet for the transformation specification and store it in `spec.py`:

```
$ evalgen generate-spec --loader-param .../data.csv
Available columns:
-------  -------  ------------------------------------------------------------------------
dt       object   ['2024-06-01 06:33:18.', '2024-06-01 07:13:22.', '2024-06-02 03:01:08.']
xid      object   ['XL000093954855', 'XY000093954855', 'MY000093954855']
status   object   ['R2', 'D2']
source   object   ['alpha', 'beta', 'theta']
content  object   ['After removing used ', 'End connection', '[Alpha] St']
-------  -------  ------------------------------------------------------------------------
Enter comma-separated column names to include [dt,xid,status,source,content]: source, content
Describe the transformation you want to apply
select rows that have transaction mentioned in them. Select both the source and content columns

Generated Code Snippet:

from evalgen import Specification

class GeneratedSpecification(Specification):

    def transform(self, df):
        transformed_df = df[df['content'].str.contains('transaction')][['source', 'content']]
        return transformed_df

```

Now apply the specification:

```
$ evalgen apply-spec  --spec-class spec --loader-param .../data.csv
{"source":"alpha","content":"Transaction to Chile : amount 1000 "}
{"source":"alpha","content":"checking the route availability"}
...
```
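
The generated snippet above relies on `str.contains`, which in pandas is case-sensitive by default and yields missing values for rows whose `content` is null. The sketch below is one hedged variant that matches case-insensitively and skips nulls. The `Specification` stand-in class and the sample rows are assumptions made so the example runs without `evalgen` installed; they do not reflect the package's actual implementation.

```python
import pandas as pd


class Specification:
    """Hypothetical stand-in for evalgen's base class (assumption)."""

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        raise NotImplementedError


class TransactionSpecification(Specification):
    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        # case=False matches "Transaction" as well as "transaction";
        # na=False treats missing content as a non-match.
        mask = df["content"].str.contains("transaction", case=False, na=False)
        return df.loc[mask, ["source", "content"]]


# Made-up rows for illustration only.
sample = pd.DataFrame(
    {
        "source": ["alpha", "beta", "theta"],
        "content": ["Transaction to Chile", "End connection", None],
        "status": ["R2", "D2", "R2"],
    }
)

result = TransactionSpecification().transform(sample)
print(result.to_json(orient="records", lines=True))
```

Whether case-insensitive matching is what you want depends on the data; the point is only that the generated code is ordinary pandas and can be adjusted by hand before it is applied repeatedly.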

## Example 2

```
# Set the env variable
$ export DBURL="sqlite:////home/.../cars.sqlite"

# Pass the env variable or pass the full path
$ evalgen generate-spec --loader-param DBURL
Available tables:
- cars1
- cars1_anonymized
- cars2
Enter the name of the table you want to extract: cars1
Available data:
-----------------------------------------------  -------  ------------------------------------------------------------------------
Height                                           float64  ['61.0', '96.0', '104.0']
Dimensions Length                                float64  ['19.0', '93.0', '28.0']
Dimensions Width                                 float64  ['189.0', '143.0', '85.0']
Engine Information Driveline                     object   ['Rear-wheel drive', 'All-wheel drive', 'Front-wheel drive']
Engine Information Engine Type                   object   ['Nissan 3.7L 6 Cylind', 'Volkswagen 2.5L 5 Cy', 'Hyundai 3.5L 6 Cylin']
...
-----------------------------------------------  -------  ------------------------------------------------------------------------
Describe the transformation you want to apply
Select all cars with horsepower > 150.
For these cars multiply the mpg by 1.5
select identification year, mpg columns
...

from evalgen import Specification
import pandas as pd

class GeneratedSpecification(Specification):

    def get_query_params(self):
        '''
        Query parameters used to select data
        '''
        return {"table": "cars1", "limit": 1000}

    def transform(self, df):
        # Select all cars with horsepower > 150
        df = df[df['Engine Information Engine Statistics Horsepower'] > 150]

        # Multiply mpg by 1.5
        df['Fuel Information City mpg'] = df['Fuel Information City mpg'] * 1.5
        df['Fuel Information Highway mpg'] = df['Fuel Information Highway mpg'] * 1.5

        # Select identification year and mpg columns
        df = df[['Identification Year', 'Fuel Information City mpg', 'Fuel Information Highway mpg']]

        # Rename columns
        df = df.rename(columns={'Identification Year': 'Year', 'Fuel Information City mpg': 'City mpg', 'Fuel Information Highway mpg': 'Highway mpg'})

        return df

```
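
The generated `transform` above assigns to columns of a filtered DataFrame, which can trigger pandas' `SettingWithCopyWarning` in some versions. A minimal sketch of a variant that takes an explicit copy first is shown below. The column names come from the generated code, while the function name `transform_cars` and the sample rows are illustrative assumptions.

```python
import pandas as pd

HP = "Engine Information Engine Statistics Horsepower"
CITY = "Fuel Information City mpg"
HIGHWAY = "Fuel Information Highway mpg"
YEAR = "Identification Year"


def transform_cars(df: pd.DataFrame) -> pd.DataFrame:
    # Filter rows and select columns, then copy so that the assignments
    # below operate on an independent frame rather than a view.
    out = df.loc[df[HP] > 150, [YEAR, CITY, HIGHWAY]].copy()
    # Scale both mpg columns by 1.5.
    out[CITY] = out[CITY] * 1.5
    out[HIGHWAY] = out[HIGHWAY] * 1.5
    return out.rename(columns={YEAR: "Year", CITY: "City mpg", HIGHWAY: "Highway mpg"})


# Made-up rows for illustration only.
cars = pd.DataFrame(
    {
        YEAR: [2009, 2010],
        HP: [250, 120],
        CITY: [18, 30],
        HIGHWAY: [25, 40],
    }
)
transformed = transform_cars(cars)
print(transformed)
```

The input frame is left unmodified, which matters if the same loaded data is passed to more than one specification.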

Store the above transformation specification in a location where the
script can find it, and add that directory to `module_paths` in `evalgen.yaml`:

```
$ ls modules/
cars.py
$ cat evalgen.yaml
module_paths:
  - modules
```
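
One way a tool could resolve a name such as `cars` against the configured `module_paths` is sketched below using only the standard library. This is an assumption about the mechanism, not evalgen's actual loader: `load_spec_class` is a hypothetical helper, and the base class is passed in explicitly so the sketch does not depend on `evalgen` being installed.

```python
import importlib.util
import inspect
from pathlib import Path


def load_spec_class(module_name: str, module_paths: list[str], base: type) -> type:
    """Return the first subclass of `base` found in <module_name>.py (hypothetical helper)."""
    for directory in module_paths:
        candidate = Path(directory) / f"{module_name}.py"
        if not candidate.is_file():
            continue
        # Import the file as a module without modifying sys.path.
        spec = importlib.util.spec_from_file_location(module_name, candidate)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        # Look for a class that derives from the requested base class.
        for _, obj in inspect.getmembers(module, inspect.isclass):
            if issubclass(obj, base) and obj is not base:
                return obj
    raise LookupError(f"No subclass of {base.__name__} found for {module_name!r}")


# Hypothetical usage, assuming modules/cars.py exists:
#   spec_cls = load_spec_class("cars", ["modules"], Specification)
```

Directories are searched in the order listed, so an earlier entry in `module_paths` would shadow a later one under this scheme.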

Now you can run `apply-spec`:
```
$ evalgen apply-spec --spec-class cars --loader-param DBURL  --output-file eval-dataset.jsonl
Data successfully transformed and saved to eval-dataset.jsonl
$ head eval-dataset.jsonl
{"Year":2009.0,"City mpg":27.0,"Highway mpg":37.5}
{"Year":2009.0,"City mpg":33.0,"Highway mpg":42.0}
{"Year":2009.0,"City mpg":31.5,"Highway mpg":45.0}
{"Year":2009.0,"City mpg":31.5,"Highway mpg":42.0}
...
```
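
Each line of the output file is a standalone JSON object, so it can be read back with the standard library. The sketch below parses an in-memory copy of the first two records shown above; `read_jsonl` is an illustrative helper, not part of evalgen.

```python
import io
import json
from typing import Iterable, Iterator


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line of JSON Lines input."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# In-memory stand-in for eval-dataset.jsonl, using records from the output above.
sample = io.StringIO(
    '{"Year":2009.0,"City mpg":27.0,"Highway mpg":37.5}\n'
    '{"Year":2009.0,"City mpg":33.0,"Highway mpg":42.0}\n'
)
records = list(read_jsonl(sample))
print(len(records), records[0]["City mpg"])
```

For a real file, the same function accepts an open file handle, for example `with open("eval-dataset.jsonl") as fh: records = list(read_jsonl(fh))`.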

## Setup

Set up environment:

1. Create a `.env` file in the project root:
   a. Add `DB_URL=your_database_url_here` to the file
   b. Add `OPENAI_API_KEY=your_openai_api_key_here` to the file
2. Create `evalgen.yaml` in the local directory:
    ```yaml
    module_paths:
      - /path/to/your/modules
      - /another/path/to/modules
    ```

    The directories listed under `module_paths` are where modules defining subclasses of `Specification` are loaded from.
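
The `.env` file from step 1 is a plain list of `KEY=VALUE` lines. As a rough illustration of that format only, the sketch below parses such lines with the standard library. `parse_env_lines` is a hypothetical helper and a simplification; how evalgen itself reads the file is not shown in this README.

```python
def parse_env_lines(lines):
    """Parse KEY=VALUE lines, ignoring blanks and # comments (simplified sketch)."""
    values = {}
    for raw in lines:
        line = raw.strip()
        # Skip empty lines, comments, and anything without an '=' sign.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values


# Placeholder values copied from the setup steps above.
example = [
    "# local settings",
    "DB_URL=your_database_url_here",
    "OPENAI_API_KEY=your_openai_api_key_here",
]
settings = parse_env_lines(example)
print(sorted(settings))
```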

## Todo

1. Test multiple sources
2. Specification templates
3. Test API usage

            
