act-scio


- Name: act-scio
- Version: 0.0.61
- Home page: https://github.com/mnemonic-no/act-scio2
- Summary: ACT SCIO
- Upload time: 2023-06-08 08:10:01
- Author: mnemonic AS
- Requires Python: >=3.6, <4
- License: ISC
- Keywords: act, mnemonic

# act-scio2
Scio v2 is a reimplementation of [Scio](https://github.com/mnemonic-no/act-scio) in Python3.

Scio uses [tika](https://tika.apache.org) to extract text from documents (PDF, HTML, DOC, etc).

The result is sent to the Scio Analyzer that extracts information using a combination of NLP
(Natural Language Processing) and pattern matching.

## Changelog

### 0.0.42

SCIO now supports setting TLP on data upload, which annotates documents with a `tlp` tag. Documents downloaded by feeds default to TLP white, but this can be changed in the feeds configuration.

## Source code

The source code for the workers is available on [GitHub](https://github.com/mnemonic-no/act-scio2).

## Setup

To set up, first install from PyPI:

```bash
sudo pip3 install act-scio
```

You will also need to install [beanstalkd](https://beanstalkd.github.io/). On Debian/Ubuntu you can run:

```bash
sudo apt install beanstalkd
```

Configure beanstalkd to accept larger payloads with the `-z` option. For Red Hat-derived setups this can be configured in `/etc/sysconfig/beanstalkd`:

```bash
MAX_JOB_SIZE=-z 524288
```
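The `-z` value is the maximum job size in bytes (beanstalkd's default is 65535). As a minimal sketch, assuming you want to check a document against the 524288-byte limit from the example before queueing it, a pre-flight check could look like this. The function name and constant are illustrative and not part of act-scio.

```python
# Illustrative pre-flight check; not part of act-scio.
MAX_JOB_SIZE = 524288  # bytes, matching the `-z 524288` example above


def fits_in_job(payload: bytes, max_job_size: int = MAX_JOB_SIZE) -> bool:
    """Return True if the payload is small enough for a single beanstalkd job."""
    return len(payload) <= max_job_size


if __name__ == "__main__":
    small = b"x" * 1024
    large = b"x" * (MAX_JOB_SIZE + 1)
    print("small fits:", fits_in_job(small))
    print("large fits:", fits_in_job(large))
```

If the job body wraps the document in an envelope or encoding, the effective limit for the raw document is lower than the configured value.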

You then need to install NLTK data files. A helper utility to do this is included:

```bash
scio-nltk-download
```

You will also need to create a default configuration:

```bash
scio-config user
```

## API

To run the API, execute:


```bash
scio-api
```

This sets up the API on 127.0.0.1:3000. Use `--port <PORT>` and `--host <IP>` to listen on another port and/or another interface.

For documentation of the API endpoint see [API.md](API.md).
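To confirm that `scio-api` is actually listening before submitting documents, a plain TCP connection test is enough. The sketch below uses only the standard library and the default host and port mentioned above. The function name is illustrative, and it does not call any Scio endpoint.

```python
import socket


def api_is_listening(host: str = "127.0.0.1", port: int = 3000,
                     timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    print("scio-api reachable:", api_is_listening())
```

Pass the same values you gave to `--host` and `--port` if you changed the defaults.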

## Configuration

You can create a default configuration using this command (should be run as the user running scio):

```bash
scio-config user
```

Common configuration can be found under `~/.config/scio/etc/scio.ini`.
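To see which sections and options the generated file contains, you can inspect it with Python's `configparser`. This is a minimal sketch that assumes `scio.ini` is a standard INI file. The helper names are illustrative, and no particular section or option names are assumed.

```python
import configparser
from pathlib import Path
from typing import Dict, List

# Path taken from the text above.
DEFAULT_INI = Path.home() / ".config" / "scio" / "etc" / "scio.ini"


def load_scio_config(path: Path = DEFAULT_INI) -> configparser.ConfigParser:
    """Parse the INI file; raises FileNotFoundError if it does not exist."""
    # interpolation=None keeps literal '%' characters in values untouched.
    parser = configparser.ConfigParser(interpolation=None)
    with open(path, encoding="utf-8") as handle:
        parser.read_file(handle)
    return parser


def describe(parser: configparser.ConfigParser) -> Dict[str, List[str]]:
    """Map each section name to a sorted list of its option names."""
    return {section: sorted(parser[section]) for section in parser.sections()}


if __name__ == "__main__":
    for section, options in describe(load_scio_config()).items():
        print(f"[{section}] {', '.join(options)}")
```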

## Running Manually

### Scio Tika Server

The Scio Tika server reads jobs from the beanstalk tube `scio_doc` and the extracted text will be sent to the tube `scio_analyze`.

The first time the server runs, it will download Tika using Maven. It will use a proxy if `$https_proxy` is set.

```bash
scio-tika-server
```

`scio-tika-server` uses [tika-python](https://github.com/chrismattmann/tika-python), which depends on `tika-server.jar`. If your server has internet access, the jar is downloaded automatically. If it does not, or if you need a proxy to connect to the internet, follow the instructions under "Airgap Environment Setup" in the [tika-python README](https://github.com/chrismattmann/tika-python). This has currently only been tested with tika-server version 2.7.0.

### Scio Analyze Server

Scio Analyze Server reads (by default) jobs from the beanstalk tube `scio_analyze`.

```bash
scio-analyze
```

You can also read directly from stdin like this:

```bash
echo "The companies in the Bus; Financial, Aviation and Automobile industry are large." | scio-analyze --beanstalk= --elasticsearch=
```
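The same stdin invocation can be driven from a script. The sketch below wraps the command line shown above with `subprocess`. The flags are copied verbatim from the example, the function names are illustrative, and the output format of `scio-analyze` is not assumed, so the result is returned as raw text.

```python
import subprocess
from typing import List


def build_analyze_command() -> List[str]:
    """Command line copied from the stdin example above."""
    return ["scio-analyze", "--beanstalk=", "--elasticsearch="]


def analyze_text(text: str) -> str:
    """Pipe text to scio-analyze on stdin and return its raw stdout."""
    result = subprocess.run(
        build_analyze_command(),
        input=text,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # text mode; also works on Python 3.6
        check=True,               # raise CalledProcessError on failure
    )
    return result.stdout


if __name__ == "__main__":
    print(analyze_text("The companies in the Aviation industry are large."))
```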

### Scio Submit

Submit a document (from a file or URI) to the Scio API (`scio-api`).

Example:

```bash
scio-submit \
   --uri https://www2.fireeye.com/rs/848-DID-242/images/rpt-apt29-hammertoss.pdf \
   --scio-baseuri http://localhost:3000/submit \
   --tlp white
```

## Running as a service

Systemd-compatible service scripts can be found under `examples/systemd`.

To install:

```bash
sudo cp examples/systemd/*.service /usr/lib/systemd/system
sudo systemctl enable scio-tika-server
sudo systemctl enable scio-analyze
sudo systemctl start scio-tika-server
sudo systemctl start scio-analyze
```

## scio-feed cron job

To continuously fetch new content from feeds, you can add `scio-feeds` to cron like this (make sure the directory `$HOME/logs` exists):

```
# Fetch scio feeds every hour
0 * * * * /usr/local/bin/scio-feeds >> $HOME/logs/scio-feed.log.$(date +\%s) 2>&1

# Delete logs from scio-feeds older than 7 days
0 * * * * find $HOME/logs/ -name 'scio-feed.log.*' -mmin +10080 -exec rm {} \;
```
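The retention value `10080` is 7 × 24 × 60 minutes. To preview which log files the cleanup job would remove, the following sketch approximates the `find` expression in Python. The function name is illustrative. Unlike `find`, it only looks at the top level of the directory, and it does not delete anything.

```python
import time
from pathlib import Path
from typing import List, Optional

SEVEN_DAYS_IN_MINUTES = 7 * 24 * 60  # 10080, the value passed to -mmin


def stale_logs(log_dir: Path,
               max_age_minutes: int = SEVEN_DAYS_IN_MINUTES,
               now: Optional[float] = None) -> List[Path]:
    """List scio-feed logs last modified more than max_age_minutes ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_minutes * 60
    return sorted(
        path
        for path in log_dir.glob("scio-feed.log.*")
        if path.is_file() and path.stat().st_mtime < cutoff
    )


if __name__ == "__main__":
    for path in stale_logs(Path.home() / "logs"):
        print(path)
```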

## Local development

Use pip to install in [local development mode](https://pip.pypa.io/en/stable/reference/pip_install/#editable-installs). act-scio uses namespacing, so it is not compatible with using `setup.py install` or `setup.py develop`.

In the repository, run:

```bash
pip3 install --user -e .
```

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/mnemonic-no/act-scio2",
    "name": "act-scio",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.6, <4",
    "maintainer_email": "",
    "keywords": "ACT,mnemonic",
    "author": "mnemonic AS",
    "author_email": "opensource@mnemonic.no",
    "download_url": "https://files.pythonhosted.org/packages/d6/5b/11e0c07061a840377bebd11665db21a295428e429a6d2b14da4d3662da1e/act-scio-0.0.61.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "ISC",
    "summary": "ACT SCIO",
    "version": "0.0.61",
    "project_urls": {
        "Homepage": "https://github.com/mnemonic-no/act-scio2"
    },
    "split_keywords": [
        "act",
        "mnemonic"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "d65b11e0c07061a840377bebd11665db21a295428e429a6d2b14da4d3662da1e",
                "md5": "45eb328c825375213cfe4e062fe096c3",
                "sha256": "f9649c104b2cd633764b88584e97833965e93c7bc02038043b6a7a29aec60344"
            },
            "downloads": -1,
            "filename": "act-scio-0.0.61.tar.gz",
            "has_sig": false,
            "md5_digest": "45eb328c825375213cfe4e062fe096c3",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6, <4",
            "size": 2421541,
            "upload_time": "2023-06-08T08:10:01",
            "upload_time_iso_8601": "2023-06-08T08:10:01.277384Z",
            "url": "https://files.pythonhosted.org/packages/d6/5b/11e0c07061a840377bebd11665db21a295428e429a6d2b14da4d3662da1e/act-scio-0.0.61.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-06-08 08:10:01",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "mnemonic-no",
    "github_project": "act-scio2",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "act-scio"
}