papersorter


| Field | Value |
|---|---|
| Name | papersorter |
| Version | 0.2 |
| Home page | https://github.com/ChangLabSNU/papersorter |
| Summary | Filters RSS feeds, predicts interest, and notifies Slack with top academic articles. |
| Upload time | 2024-06-05 14:40:11 |
| Maintainer | None |
| Docs URL | None |
| Author | Hyeshik Chang |
| Requires Python | None |
| License | MIT |
| Keywords | article alerts, RSS feed, personalized content |
| Requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| Coveralls test coverage | No coveralls. |

# PaperSorter

PaperSorter is an academic paper recommendation system that utilizes
machine learning techniques to match users' interests. The system
retrieves article alerts from RSS feeds and processes the title,
author, journal name, and abstract of each article using
[Upstage's Solar LLM](https://www.upstage.ai/solar-llm) to generate
embedding vectors. These vectors serve as input for a regression
model that predicts the user's level of interest in each paper.
PaperSorter sends notifications about high-scoring articles to a
designated Slack channel, enabling timely discussion of relevant
publications among colleagues. The prediction model can be trained
incrementally with additional labels that the user provides for new
articles.

<img src="https://github.com/ChangLabSNU/PaperSorter/assets/1702891/5ef2df1f-610b-4272-b496-ecf2a480dda2" width="660px">
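
The flow described above (embedding vector in, interest score out, high scorers forwarded) can be illustrated with a toy sketch. This is not PaperSorter's actual model: the logistic-over-linear scorer, the `weights` values, the `threshold`, and the three-dimensional embeddings are all assumptions made up for illustration, and the function names are hypothetical.

```python
import math


def interest_score(embedding, weights, bias=0.0):
    """Map an embedding vector to a score in (0, 1) with a toy linear model."""
    z = bias + sum(w * x for w, x in zip(weights, embedding))
    return 1.0 / (1.0 + math.exp(-z))


def select_for_notification(articles, weights, threshold=0.5):
    """Return (title, score) pairs at or above the threshold, best first."""
    scored = [(a["title"], interest_score(a["embedding"], weights)) for a in articles]
    picked = [pair for pair in scored if pair[1] >= threshold]
    return sorted(picked, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional embeddings; real embedding vectors are much longer.
articles = [
    {"title": "Paper A", "embedding": [1.0, 0.0, 2.0]},
    {"title": "Paper B", "embedding": [-1.0, 0.5, -2.0]},
]
weights = [0.5, 0.0, 1.0]  # hypothetical values, not learned from data

selected = select_for_notification(articles, weights)
print(selected)
```

Only "Paper A" clears the threshold here, which mirrors the idea that only high-scoring articles are sent to Slack.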

## Installing

To install PaperSorter, use pip:

```
pip install papersorter
```

## Preparing

### TheOldReader

PaperSorter uses [TheOldReader](https://theoldreader.com) as its
feed source. After signing up for TheOldReader, you can access its
API with your account email and password. Before running PaperSorter,
set the `TOR_EMAIL` and `TOR_PASSWORD` environment variables to your
TheOldReader email and password, respectively. PaperSorter uses them
to authenticate and retrieve data from your feeds.

### Upstage Solar LLM

Solar LLM's embedding API converts article titles and contents into
numerical vectors. Sign up on the [Upstage console](https://console.upstage.ai/)
and create an API key as per the
[documentation](https://developers.upstage.ai/docs/getting-started/quick-start#get-an-api-key).
Store the key securely and set the `UPSTAGE_API_KEY` environment
variable before running PaperSorter.

### Slack Incoming WebHook

To send notifications to a Slack channel, create an incoming webhook
address as described in the [Slack documentation](https://api.slack.com/messaging/webhooks).
Store the address securely and set the `PAPERSORTER_WEBHOOK_URL` environment
variable before running PaperSorter.
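
Since four environment variables are needed across the three services, a small pre-flight check can catch a missing one before a run. The sketch below is a standalone helper, not part of PaperSorter; the variable names come from the sections above, while `REQUIRED_VARS` and `missing_settings` are hypothetical names.

```python
import os

# Variable names as listed in the preparation steps above.
REQUIRED_VARS = (
    "TOR_EMAIL",
    "TOR_PASSWORD",
    "UPSTAGE_API_KEY",
    "PAPERSORTER_WEBHOOK_URL",
)


def missing_settings(env=os.environ):
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_settings()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to try without touching the real environment.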


## Initialization and Training

To train a predictor for your article interests, ensure your
TheOldReader account contains at least 1000 articles, including at
least 100 positively labeled articles marked with stars. Ideally,
aim for around 5000 articles with 500 starred items for optimal
performance.
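
The thresholds above can be written down as a quick check. This is only a sketch: how you count articles and stars in your TheOldReader account is up to you, the function name is hypothetical, and the "around 5000 / 500" guidance is treated here as hard cut-offs purely for simplicity.

```python
def training_readiness(total_articles, starred_articles):
    """Classify a labeled collection against the sizes suggested above."""
    if total_articles < 1000 or starred_articles < 100:
        return "insufficient"
    if total_articles >= 5000 and starred_articles >= 500:
        return "recommended"
    return "minimum"


for total, starred in [(800, 150), (1200, 100), (5000, 500)]:
    print(total, starred, training_readiness(total, starred))
```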

After populating your TheOldReader account, initialize the feed and
embedding databases using:

```
papersorter init
```

Next, train your first model with:

```
papersorter train
```

If the ROC AUC performance metric meets your expectations, you're
ready to send notifications about interesting new articles.
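
As background on the metric: ROC AUC is the probability that a randomly chosen positive (starred) article receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it by direct pairwise comparison on toy data; it illustrates the definition only and is not how PaperSorter evaluates its model.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy labels (1 = starred) and predicted scores.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 3 of 4 pairs ordered correctly: 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which gives a scale for judging whether the trained model is good enough.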

## Getting Updates and Sending Notifications

For regular updates, run the following command, which retrieves new
feed items, converts them to embeddings, and finds interesting articles:

```
papersorter update
```

To send notifications for interesting new articles, run:

```
papersorter broadcast
```

You will receive formatted notifications in your Slack channel.
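
For context, a Slack incoming webhook accepts an HTTP POST with a JSON body such as `{"text": "..."}`. The sketch below builds such a request with the standard library; it is not PaperSorter's own message formatting, and the URL and `build_webhook_request` name are placeholders. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_webhook_request(webhook_url, text):
    """Build a POST request carrying a plain-text Slack message as JSON."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL; a real one would come from PAPERSORTER_WEBHOOK_URL.
request = build_webhook_request(
    "https://hooks.example.invalid/services/PLACEHOLDER",
    "New article: Example title",
)
print(request.get_method(), request.data.decode("utf-8"))
# To actually deliver the message: urllib.request.urlopen(request)
```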

## Running as a Cron Job

Here is an example of a shell script that runs PaperSorter's `update`
and `broadcast` jobs in the background. It updates on every run but
sends notifications about interesting new articles only when the hour
is 7 through 21 (7:00 am to 9:59 pm), so nights are update-only.

```
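#!/bin/bash
# Paths are placeholders; adjust them for your installation.
PAPERSORTER_CMD=/path/to/papersorter
PAPERSORTER_DATADIR=/path/to/data
LOGFILE=background-updates.log
CURRENT_HOUR=$(date +%H)

# Run from the data directory; stop if it cannot be entered.
cd "$PAPERSORTER_DATADIR" || exit 1

# Always retrieve updates, day or night.
"$PAPERSORTER_CMD" update -q --log-file "$LOGFILE"

# Send notifications only when the hour is 07 through 21.
if [ "$CURRENT_HOUR" -ge 7 ] && [ "$CURRENT_HOUR" -le 21 ]; then
    "$PAPERSORTER_CMD" broadcast -q --log-file "$LOGFILE"
fi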
```

Here is an example line for the crontab. It runs the update script
every hour, at ten minutes past the hour.

```
10 * * * * /bin/bash /path/to/run-update.sh
```
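
The hour window used by the shell script can be checked in isolation. The following sketch mirrors the script's condition in Python and lists which subcommands would run at a given hour; `commands_for_hour` is a hypothetical helper, the paths are the same placeholders as in the script, and nothing is executed.

```python
def commands_for_hour(hour, cmd="/path/to/papersorter", logfile="background-updates.log"):
    """Return the argument lists the script would run at the given hour (0-23)."""
    common = ["-q", "--log-file", logfile]
    commands = [[cmd, "update", *common]]  # update runs every time
    if 7 <= hour <= 21:  # same bounds as the shell test: -ge 7 and -le 21
        commands.append([cmd, "broadcast", *common])
    return commands


for hour in (6, 7, 21, 22):
    print(hour, [argv[1] for argv in commands_for_hour(hour)])
```

With the crontab entry firing at ten past each hour, the first broadcast of the day happens at 7:10 and the last at 21:10.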

## Feedback and Updating the Model

To improve the model, provide more labels for the articles. First,
extract the list of articles with the following command:

```
papersorter train -o model-temporary.pkl -f feedback.xlsx
```

This generates an Excel file, `feedback.xlsx`, containing titles,
authors, prediction scores, and other details. Review each row and
fill in the `label` column with `1` (interesting) or `0` (not interesting).
Leave it blank if unsure. Once you've labeled some articles, update
the feed database with:

```
papersorter feedback -i feedback.xlsx
```

Retrain the predictor with the updated labels using:

```
papersorter train
```

The new predictor is stored as `model.pkl`, and subsequent feed items
will be scored with the updated model.
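
Before running `papersorter feedback`, it can help to sanity-check the `label` column. The sketch below assumes the spreadsheet rows have already been loaded as dictionaries by a tool of your choice; it follows the convention described above (`1`, `0`, or blank) and rejects anything else. How PaperSorter itself parses the file is not shown in this README, and `parse_label` and `label_summary` are hypothetical names.

```python
import math


def parse_label(value):
    """Return 1, 0, or None (blank or unsure) for one cell of the label column."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None  # some spreadsheet loaders represent blank cells as NaN
    text = str(value).strip()
    if text == "":
        return None
    number = float(text)  # raises ValueError for non-numeric text
    if number not in (0.0, 1.0):
        raise ValueError(f"label must be 1, 0, or blank, got {value!r}")
    return int(number)


def label_summary(rows):
    """Count interesting, not interesting, and unlabeled rows."""
    counts = {"interesting": 0, "not_interesting": 0, "unlabeled": 0}
    for row in rows:
        label = parse_label(row.get("label"))
        if label is None:
            counts["unlabeled"] += 1
        elif label == 1:
            counts["interesting"] += 1
        else:
            counts["not_interesting"] += 1
    return counts


rows = [{"label": 1}, {"label": "0"}, {"label": ""}, {"label": None}, {"label": 1.0}]
print(label_summary(rows))
```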

## Author

Hyeshik Chang <hyeshik@snu.ac.kr>

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/ChangLabSNU/papersorter",
    "name": "papersorter",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": "article alerts, RSS feed, personalized content",
    "author": "Hyeshik Chang",
    "author_email": "hyeshik@snu.ac.kr",
    "download_url": "https://files.pythonhosted.org/packages/7c/6f/bf7468bfddb71ed7e0ac61d275f9d3ef9e2fd54eda0d6f2ce76385e44d8b/papersorter-0.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Filters RSS feeds, predicts interest, and notifies Slack with top academic articles.",
    "version": "0.2",
    "project_urls": {
        "Download": "https://github.com/ChangLabSNU/papersorter/releases",
        "Homepage": "https://github.com/ChangLabSNU/papersorter"
    },
    "split_keywords": [
        "article alerts",
        " rss feed",
        " personalized content"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "1ed3f2a5eaf3e81a6d9bf9f278ca24b859ee1f0599d23b3f00d82596f5fb0372",
                "md5": "c05479efe7a4a484b6a7d63bce736c33",
                "sha256": "41a16fa39c3ad7b246208aefd07c6a48b35f90e4d792324d407c4f0ff9e53a03"
            },
            "downloads": -1,
            "filename": "papersorter-0.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "c05479efe7a4a484b6a7d63bce736c33",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 26948,
            "upload_time": "2024-06-05T14:40:05",
            "upload_time_iso_8601": "2024-06-05T14:40:05.413077Z",
            "url": "https://files.pythonhosted.org/packages/1e/d3/f2a5eaf3e81a6d9bf9f278ca24b859ee1f0599d23b3f00d82596f5fb0372/papersorter-0.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "7c6fbf7468bfddb71ed7e0ac61d275f9d3ef9e2fd54eda0d6f2ce76385e44d8b",
                "md5": "d0dab988c2e2b1c58f0c77f34ee9bb3a",
                "sha256": "dae2bb1098b06550255394a12e70c871c21b1892ad9e56ce91658e5f8d39b80a"
            },
            "downloads": -1,
            "filename": "papersorter-0.2.tar.gz",
            "has_sig": false,
            "md5_digest": "d0dab988c2e2b1c58f0c77f34ee9bb3a",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 19294,
            "upload_time": "2024-06-05T14:40:11",
            "upload_time_iso_8601": "2024-06-05T14:40:11.097431Z",
            "url": "https://files.pythonhosted.org/packages/7c/6f/bf7468bfddb71ed7e0ac61d275f9d3ef9e2fd54eda0d6f2ce76385e44d8b/papersorter-0.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-06-05 14:40:11",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ChangLabSNU",
    "github_project": "papersorter",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "papersorter"
}
        