scholaretl

Name	scholaretl
Version	0.0.6
home_page	None
Summary	ETL for parsing scientific papers.
upload_time	2024-10-30 14:08:28
maintainer	None
docs_url	None
author	Blue Brain Project, EPFL
requires_python	>=3.10
license	None

# Scholaretl

An Extract, Transform and Load (ETL) API made to parse scientific papers. This package is meant to be used with scholarag, our Retrieval Augmented Generation (RAG) tool. It is mainly used to parse scientific papers coming from different sources and make them compatible with usual databases.

0. [Quickstart](#quickstart)
1. [List of endpoints](#list-of-endpoints)
2. [Docker Image](#docker-image)
3. [Grobid parsing](#grobid-parsing)
4. [Funding and Acknowledgement](#funding-and-acknowledgement)

## Quickstart

#### Step 1 : Install the package.

Simply install the package from PyPI.

```bash
pip install scholaretl
```

You can also clone the GitHub repo and install the package yourself.

#### Step 2 : Run the FastAPI app.

A simple script is installed with the package and lets you run the app locally. By default, the API listens on port 8000.

```bash
scholaretl-api
```

See the `-h` flag for non-default arguments.

#### Step 3 : Test the app.

Now that the server is running, you can either query it with curl to get information:

```bash
curl http://localhost:8000/settings
```

Or open a browser at `http://localhost:8000/docs` and try some of the endpoints. For example, use the `/parse/pypdf` endpoint to parse a local PDF file. Parsing XML files works out of the box. Keep in mind that the XML parsing endpoints are meant to be used with files coming from specific scientific journals (see [List of endpoints](#list-of-endpoints)).
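
The same upload can be scripted instead of using the browser UI. The following is a minimal sketch using only the Python standard library, not an official client. It assumes that the endpoint accepts a `multipart/form-data` POST, that the file field is named `inp` (an assumption, check the `/docs` page for the real name), and that the response body is JSON. `build_upload_request` and `parse_pdf` are hypothetical helper names.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def build_upload_request(
    base_url: str, endpoint: str, filename: str, content: bytes, field: str = "inp"
) -> urllib.request.Request:
    """Build (without sending) a multipart/form-data POST that carries one file.

    The default field name "inp" is an assumption, not taken from this README.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return urllib.request.Request(
        url=base_url.rstrip("/") + endpoint,
        data=head + content + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def parse_pdf(path: str, base_url: str = "http://localhost:8000") -> dict:
    """Send a local PDF to /parse/pypdf and decode the response, assumed to be JSON."""
    pdf = Path(path)
    request = build_upload_request(base_url, "/parse/pypdf", pdf.name, pdf.read_bytes())
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server from Step 2 running, calling `parse_pdf("paper.pdf")` on a hypothetical local file would return the parsed document. If the server rejects the upload, the expected field name is listed in the `/docs` UI.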

## List of endpoints

Once the app is deployed, the following endpoints are available:
* `/parse/pubmed_xml`: Parses XMLs coming from PubMed.
* `/parse/jats_xml`: Parses XMLs coming from PMC.
* `/parse/tei_xml`: Parses XMLs produced by Grobid.
* `/parse/xocs_xml`: Parses XMLs coming from Scopus (Elsevier).
* `/parse/pypdf`: Parses PDFs without keeping the structure of the document.
* `/parse/grobidpdf`: Parses PDFs keeping the structure of the document (REQUIRES Grobid, see [Grobid parsing](#grobid-parsing)).
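
When calling the API from a script, it can help to keep the choice of endpoint in one place. The sketch below maps a source label to the matching path from the list above. The labels (`pubmed`, `pmc`, and so on) and the function name `endpoint_for` are hypothetical and only serve the illustration; the paths are the ones listed above.

```python
# Source labels are illustrative; only the paths come from the endpoint list.
ENDPOINTS = {
    "pubmed": "/parse/pubmed_xml",        # XMLs coming from PubMed
    "pmc": "/parse/jats_xml",             # XMLs coming from PMC
    "grobid_tei": "/parse/tei_xml",       # XMLs produced by Grobid
    "scopus": "/parse/xocs_xml",          # XMLs coming from Scopus (Elsevier)
    "pdf": "/parse/pypdf",                # PDFs, document structure not kept
    "pdf_structured": "/parse/grobidpdf", # PDFs, needs a Grobid server
}


def endpoint_for(source: str) -> str:
    """Return the API path for a source label, ignoring case."""
    try:
        return ENDPOINTS[source.lower()]
    except KeyError:
        raise ValueError(
            f"Unknown source {source!r}; expected one of {sorted(ENDPOINTS)}"
        ) from None
```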

## Docker image

If a Docker container is required, it can be built using the provided Dockerfile. Make sure you have Docker installed.

```bash
docker build -t scholaretl:latest . --platform linux/amd64
```
The `--platform linux/amd64` flag depends on the target deployment and should be changed accordingly. The tag `scholaretl:latest` can be customized at will.
The image can then be tested by running the container locally:
```bash
docker run -d -p 8080:8080 scholaretl:latest
```
The API will accept requests on port `8080`, i.e. you can access the UI at `http://localhost:8080/docs`.
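
Because the container is started in detached mode (`-d`), the API may need a moment before it answers. A small polling helper can wait for it. This is a sketch under assumptions: `http_probe` and `wait_for_api` are hypothetical names, and the retry count and delay are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True if a GET on `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, reset, timeout or HTTP error: not ready yet.
        return False


def wait_for_api(
    url: str,
    attempts: int = 10,
    delay: float = 1.0,
    probe: Callable[[str], bool] = http_probe,
) -> bool:
    """Poll `url` until the probe succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # no sleep after the final attempt
    return False
```

For the container above, `wait_for_api("http://localhost:8080/docs")` returns `True` as soon as the UI is reachable. The `probe` parameter exists so the retry logic can be exercised without a running server.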

## Grobid parsing

Parsing documents with the Grobid endpoint requires a running Grobid server. To deploy one, simply run:

```bash
docker run -p 8070:8070 -d lfoppiano/grobid:0.7.3
```

Then pass the server's URL to the script in a `.env` file:

```bash
echo SCHOLARETL__GROBID__URL=http://localhost:8070 > .env
scholaretl-api
```
You can also add the server's URL to the `.env` file manually. See the `env.example` file for more information.
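
The variable name suggests a nested layout: a `SCHOLARETL__` prefix, then a `grobid` section with a `url` field, separated by double underscores. The sketch below illustrates that reading with plain Python. It is an assumption about the naming convention, not the package's actual settings loader, and `nest_settings` is a hypothetical helper.

```python
def nest_settings(
    env: dict[str, str], prefix: str = "SCHOLARETL__", delimiter: str = "__"
) -> dict:
    """Turn prefixed, delimiter-separated variables into a nested dict.

    Assumed convention: PREFIX + SECTION + delimiter + FIELD, case-insensitive.
    Variables without the prefix are ignored.
    """
    nested: dict = {}
    for name, value in env.items():
        if not name.startswith(prefix):
            continue
        *parents, leaf = name[len(prefix):].lower().split(delimiter)
        node = nested
        for key in parents:
            node = node.setdefault(key, {})  # create intermediate sections
        node[leaf] = value
    return nested
```

Under this reading, `SCHOLARETL__GROBID__URL=http://localhost:8070` corresponds to a `grobid` section whose `url` is `http://localhost:8070`.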

If using Docker, pass the server's URL as an environment variable.

```bash
docker run -p 8080:8080 -d -e SCHOLARETL__GROBID__URL=http://host.docker.internal:8070 scholaretl:latest
```

## Funding and Acknowledgement

The development of this software was supported by funding to the Blue Brain Project, a research center of the École polytechnique fédérale de Lausanne (EPFL), from the Swiss government’s ETH Board of the Swiss Federal Institutes of Technology.

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "scholaretl",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": null,
    "author": "Blue Brain Project, EPFL",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/f0/09/116e63404a37e0d0f9e1e0e3c478dafe1058f388e7173cfaa31d3a9f568b/scholaretl-0.0.6.tar.gz",
    "platform": null,
    "description": "# Scholaretl\n\nAn Extract, Transfrom and Load (ETL) API made to parse scientific papers. This package is meant to be used with scholarag, our Retreival Augmented Generation (RAG) tool. It is mainly used to parse scientific paper coming from different sources, to make it compatible with ususal databases.\n\n0. [Quickstart](#quickstart)\n1. [List of endpoints](#list-of-endpoints)\n2. [Docker Image](#docker-image)\n3. [Grobid parsing](#grobid-parsing)\n4. [Funding and Acknowledgement](#funding-and-acknowledgement)\n\n\n## Quickstart\n\n#### Step 1 : Install the package.\n\nSimply install the package with PyPi.\n\n```bash\npip install scholaretl\n```\n\nYou can also clone the GitHub repo and install the package yourself.\n\n#### Step 2 : Run the FastApi app.\n\nA simple script is installed with the package, and allows to run the app locally. By default the API is open on port 8000.\n\n```bash\nscholaretl-api\n```\n\nSee the `-h` flag for non default arguments.\n\n#### Step 3 : Test the app.\n\nNow that the server is running, you can either curl it to get information.\n\n```bash\ncurl http://localhost:8000/settings\n```\n\nOr open a browser at : `http://localhost:8000/docs` and try some of the endpoints. For example, use the `parse/pypdf` endpoint to parse a local pdf file. Parsing xml files works out of the box. Keep in mind that the xml parsing endpoints are meant to be used with files comming from specific scientific journals. 
(see [List of endpoints](#list-of-endpoints))\n\n\n## List of endpoints\n\nOnce the app is deployed, all these endpoints will be available to use :\n* `/parse/pubmed_xml`: parses XMLs coming from PubMed.\n* `/parse/jats_xml`: Parses XMLs coming from PMC.\n* `/parse/tei_xml`: Parses XMLs produced by Grobid.\n* `/parse/xocs_xml`: Parses XMLs coming from Scopus (Elsevier)\n* `/parse/pypdf`: Parses PDFs without keeping the structure of the document.\n* `/parse/grobidpdf`: Parses PDFs keeping the structure of the document (REQUIRES grobid, see [Grobid parsing](#grobid-parsing)).\n\n## Docker image\n\nIf a docker container is required, it can be build using the provided Dockerfile. Make sure you have Docker installed.\n\n```bash\ndocker build -t scholaretl:latest . --platform linux/amd64\n```\nIt can then be tested by runing the container locally. The flag `--platform linux/amd64` depends on the desired deployement and should be changed accordingly. `Scholaretl:latest` can be sutomized at will.\nThe image can then be activated using :\n```bash\ndocker run -d -p 8080:8080 scholaretl:latest\n```\nThe Api will accept requests on port `8080`, ie you can acces the UI at : `http://localhost:8080/docs`.\n\n## Grobid parsing\n\n\nTo parse documents with the Grobid enpoint, It requires a Grobid server to be running. To deploy it, simply run\n\n```bash\ndocker run -p 8070:8070 -d lfoppiano/grobid:0.7.3\n```\n\nThen pass the server's url to the script in a .env file:\n\n```bash\necho SCHOLARETL__GROBID__URL=http://localhost:8070 > .env\nscholaretl-api\n```\nYou can also add the server's url in the `.env` manually. 
See the `env.example` file for more information.\n\nIf using docker, pass the server's URL as an environment variable.\n\n```bash\ndocker run -p 8080:8080 -d -e SCHOLARETL__GROBID__URL=http://host.docker.internal:8070 scholaretl:latest\n```\n\n## Funding and Acknowledgement\n\nThe development of this software was supported by funding to the Blue Brain Project, a research center of the \u00c9cole polytechnique f\u00e9d\u00e9rale de Lausanne (EPFL), from the Swiss government\u2019s ETH Board of the Swiss Federal Institutes of Technology.\n\nCopyright (c) 2024 Blue Brain Project/EPFL\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "ETL for parsing scientific papers.",
    "version": "0.0.6",
    "project_urls": null,
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "fa150053c30fc1a01b8397d25d356bc2936e5cc3fdc3c756cc074dc7e732d837",
                "md5": "5218d3289378aef87b9aa0f67147dbc8",
                "sha256": "1a123bae59d0ccc239db4adb0bd2d98fb971d1c6cea4416e2be715e74c9cd479"
            },
            "downloads": -1,
            "filename": "scholaretl-0.0.6-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "5218d3289378aef87b9aa0f67147dbc8",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 26524,
            "upload_time": "2024-10-30T14:08:27",
            "upload_time_iso_8601": "2024-10-30T14:08:27.178373Z",
            "url": "https://files.pythonhosted.org/packages/fa/15/0053c30fc1a01b8397d25d356bc2936e5cc3fdc3c756cc074dc7e732d837/scholaretl-0.0.6-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "f009116e63404a37e0d0f9e1e0e3c478dafe1058f388e7173cfaa31d3a9f568b",
                "md5": "e502fb1099991822d95fb9bfb843e982",
                "sha256": "447b0c25685ad842ecdc56780f10d8bba44713ea0372c37b67e401d5af0a6862"
            },
            "downloads": -1,
            "filename": "scholaretl-0.0.6.tar.gz",
            "has_sig": false,
            "md5_digest": "e502fb1099991822d95fb9bfb843e982",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 25161,
            "upload_time": "2024-10-30T14:08:28",
            "upload_time_iso_8601": "2024-10-30T14:08:28.371132Z",
            "url": "https://files.pythonhosted.org/packages/f0/09/116e63404a37e0d0f9e1e0e3c478dafe1058f388e7173cfaa31d3a9f568b/scholaretl-0.0.6.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-10-30 14:08:28",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "scholaretl"
}
