igcs

- Name: igcs
- Version: 0.0.2
- Summary: Instruction‑Guided Content Selection toolkit and datasets
- Homepage: https://github.com/shmuelamar/igcs
- Author / maintainer: Shmuel Amar
- Upload time: 2025-07-23 15:20:08
- Requires Python: >=3.9
- Keywords: instruction‑guided, content‑selection, extractive‑summarisation, evidence‑extraction, benchmark, llm, transfer‑learning, nlp
- Requirements: fuzzysearch, pydantic, spacy
            # 📝 Instruction Guided Content Selection (IGCS)

This repo contains the code and dataset for the TACL paper **"A Unifying Scheme for Extractive Content Selection Tasks"**.

* **How to use the library?** Follow the [getting started](#-getting-started) section to use the igcs library for extractive content selection.
* **How to reproduce?** To use or reproduce our work, follow the [datasets](#-datasets), [trained models](#-trained-models) and [training](#-training) sections below.


## 💡 Motivation

Many NLP tasks require selecting relevant text spans from given source texts. 
Despite this shared objective, such content selection tasks have traditionally been studied in isolation, 
each with its own modeling approaches, datasets, and evaluations.

Instruction Guided Content Selection (IGCS) unifies many tasks such as extractive summarization, 
evidence retrieval and argument mining under the same scheme of selecting extractive spans in given sources.

## 📊 Key Findings

1. Training with a diverse mix of content selection tasks helps boost LLM performance even on new extractive tasks. Generic transfer learning at its best!
![Figure 1](./images/fig1.png)


2. For tasks requiring longer selections, LLMs consistently perform better when processing one document at a time instead of the entire set at once; this effect is much weaker for tasks with short selections (see the sketch below).
![Figure 2](./images/fig2-small.png)
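
To make the per-document protocol concrete, here is a minimal sketch built on the `igcs.grounding` helpers introduced in [getting started](#-getting-started) below; `ask_llm` is a hypothetical stand-in for any LLM call and not part of the library:

```python
# A sketch of the per-document protocol, not the library's predict.py.
# `ask_llm` is a hypothetical callable: prompt string in, model response out.
from igcs import grounding

def select_per_document(instruction, docs, ask_llm):
    """Prompt the model once per document and merge the grounded selections."""
    selections = []
    for doc in docs:
        prompt = (
            f"Given the following document, {instruction}. "
            "Output the exact text phrases from the given document "
            "as a valid json array of strings. Do not change the copied text.\n\n"
            f"Document #{doc.id}:\n{doc.text.strip()}\n"
        )
        spans = grounding.parse_selection(ask_llm(prompt))
        selections += grounding.ground_selections(spans, docs=[doc])
    return selections
```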

Check out the paper for more info!

# 🚀 Getting Started

## 📦 Installing igcs Python package

First install the igcs library (requires Python >= 3.11):

```bash
pip install -U igcs
```

Or install with full dependencies for training, inference and reproduction:

```bash
pip install 'igcs[train]'
```

To develop this library, install with the `develop` extras:

```bash
pip install 'igcs[develop]'
```
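
To verify the install, the core imports used in the full example below should succeed:

```python
# Quick sanity check - these are the same imports used in the full example below.
from igcs import grounding
from igcs.entities import Doc
```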

## 🛠️ Full Example Usage

Additional examples, including the [demo space]() hosted on Hugging Face Spaces, are in the [./examples/](./examples) dir.

Prepare the prompt, call a model, parse the response and ground the selections:

```python
from igcs import grounding
from igcs.entities import Doc
from openai import OpenAI

selection_instruction = "Select content related to Obama's non-presidential roles"
docs = [
    Doc(id=0,
        text="Barack Hussein Obama II[a] (born August 4, 1961) is an American politician who was the 44th president of the "
        "United States from 2009 to 2017. A member of the Democratic Party, he was the first African American president. "
        "Obama previously served as a U.S. senator representing Illinois from 2005 to 2008 and as an Illinois state senator "
        "from 1997 to 2004.",
        ),
    Doc(id=1,
        text="In 1991, Obama accepted a two-year position as Visiting Law and Government Fellow at the University of Chicago "
        "Law School to work on his first book.[63][65] He then taught constitutional law at the University of Chicago Law School "
        "for twelve years, first as a lecturer from 1992 to 1996, and then as a senior lecturer from 1996 to 2004.[66]",
    ),
]

# 1. Prepare the input prompt and documents

prompt = (
    f"Given the following document(s), {selection_instruction}. "
    f"Output the exact text phrases from the given document(s) as a valid json array of strings. " 
    f"Do not change the copied text.\n\n"
    + "\n\n".join([f"Document #{doc.id}:\n{doc.text.strip()}\n" for doc in docs])
)

# 2. Generate selection with any model (see trained models below). This example uses the free OpenRouter API:
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="<OPENROUTER_API_KEY>")

completion = client.chat.completions.create(
    model="moonshotai/kimi-k2:free",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
)
resp = completion.choices[0].message.content

# 3. Ground the selected spans

selected_spans = grounding.parse_selection(resp)
selections = grounding.ground_selections(
    selected_spans,
    docs=docs,
    # controls fuzzy matching sensitivity
    max_dist_rel=0.15,
    max_dist_abs=10,
)
print(selections)
```


Expected output (`mode` can be one of exact match, normalized match, fuzzy match, or hallucination):

```python
[
    Selection(doc_id=0, start_pos=250, end_pos=302, content='U.S. senator representing Illinois from 2005 to 2008', metadata={'total_count': 1, 'mode': 'exact_match'}),
    Selection(doc_id=0, start_pos=313, end_pos=353, content='Illinois state senator from 1997 to 2004', metadata={'total_count': 1, 'mode': 'exact_match'}),
    Selection(
        doc_id=1,
        start_pos=47,
        end_pos=121,
        content='Visiting Law and Government Fellow at the University of Chicago Law School',
        metadata={'total_count': 1, 'mode': 'exact_match'}
    ),
    Selection(
        doc_id=1,
        start_pos=165,
        end_pos=335,
        content='taught constitutional law at the University of Chicago Law School for twelve years, first as a lecturer from 1992 to 1996, and then as a senior lecturer from 1996 to 2004',
        metadata={'total_count': 1, 'mode': 'exact_match'}
    )
]
```
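
Continuing the example above, it can be useful to drop spans that could not be grounded in the sources. A minimal sketch, assuming hallucinated spans carry `mode == "hallucination"` in their metadata (mirroring the `exact_match` value shown above):

```python
# Keep only selections grounded in the source documents.
# Assumption: ungrounded spans are marked with mode == "hallucination",
# mirroring the "exact_match" metadata value shown in the expected output.
grounded = [s for s in selections if s.metadata["mode"] != "hallucination"]
for s in grounded:
    print(f"doc {s.doc_id} [{s.start_pos}:{s.end_pos}]: {s.content!r}")
```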



# 🤖 Trained Models

Trained models from the paper can be found on the [Hugging Face Hub](https://huggingface.co/collections/shmuelamar/igcs-instruction-guided-content-selection-687c92705699bb4a7ae0045e).
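
A checkpoint from the collection can be loaded with `transformers`; a minimal sketch, where `shmuelamar/<model-id>` is a placeholder to replace with an actual model ID from the collection:

```python
# A minimal loading sketch. "shmuelamar/<model-id>" is a placeholder -
# substitute an actual model ID from the Hugging Face collection above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "shmuelamar/<model-id>"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```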


# 🗂️ Datasets

All dataset files are found under the [./igcs-dataset](./igcs-dataset) dir, split into train, dev and test.
To prepare the OpenAsp data, which currently contains only placeholders (as it is built on the licensed DUC datasets),
please run the following script after obtaining the OpenAsp dataset from the [OpenAsp GitHub repo](https://github.com/liatschiff/OpenAsp).

The script takes the OpenAsp directory (by default called `openasp-v1`) and replaces files in place; use `--help` for more info.

Note: don't forget to install the extra requirements with `pip install -U 'igcs[train]'` to build the datasets.

```bash
python scripts/prepare_openasp_files.py ./OpenAsp/openasp-v1
```

## 📈 Predicting with IGCSBench and GenCS

To predict on the IGCSBench or GenCS (called ReverseInstructions here) datasets, use the following command.
Use `--help` for more details on the different prediction modes.

```bash
python src/igcs/predict.py \
  -i 'OpenAsp/test' 'AspectNews/test' 'SciFact/test' 'DebateSum/test' 'SaliencyDetection/test' 'EvidenceDetection/test' 'ReverseInstructions/test' \
  --model "GPT4" 
```

Full help output for `predict.py`:

```text
usage: predict.py [-h] [--mode MODE] -i INFILE [INFILE ...] -m MODEL [-o OUTFILE] [-n NUM_SAMPLES] [--skip-eval] [--shuffle] [--dry-run] [--icl-num-samples ICL_NUM_SAMPLES]
                  [--icl-samples-from-eval] [--randomize-icl-samples] [--prompt-variant PROMPT_VARIANT]

options:
  -h, --help            show this help message and exit
  --mode MODE           One of zeroshot (default), icl (in-context learning), single_doc (one source document at a time), or icl_single_doc for the last two modes combined.
  -i INFILE [INFILE ...], --infile INFILE [INFILE ...]
                        Input prompts file in JSON-Lines format for prediction.file can be also a predefined dataset such as OpenAsp/test
  -m MODEL, --model MODEL
                        Model to predict results on
  -o OUTFILE, --outfile OUTFILE
                        Output predictions file in JSON-Lines format. The scripts adds `selection` key to every row in the input file, keeping other keys intact.
  -n NUM_SAMPLES, --num-samples NUM_SAMPLES
                        Predict only on the first n samples specified. Defaults to all samples.
  --skip-eval           If set, disable evaluation step at end.
  --shuffle             If set, shuffles predicted samples. Can be combined with --num-samples.
  --dry-run             If set, does not predict but only prints prompts.
  --icl-num-samples ICL_NUM_SAMPLES
                        Number of samples from the train set to include in in-context learning mode. Only relevant if mode is icl
  --icl-samples-from-eval
                        When set uses ICL samples from the eval set (required for datasets without train set)
  --randomize-icl-samples
                        Whether to randomize per eval sample the ICL samples or use the same samples for all the eval set
  --prompt-variant PROMPT_VARIANT
                        The index of the prompt template variant to use
```
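
Unless `--skip-eval` is set, the script runs an evaluation step at the end. As a rough reference for what span-overlap evaluation measures (a sketch, not the library's built-in metric), here is a character-level F1 over `(doc_id, start, end)` spans:

```python
# A rough reference sketch - NOT the library's built-in metric.
# Computes character-level F1 between predicted and gold (doc_id, start, end) spans.
def char_f1(pred_spans, gold_spans):
    def chars(spans):
        # Expand each span into its set of (doc_id, char_offset) pairs.
        return {(doc_id, i) for doc_id, start, end in spans for i in range(start, end)}

    pred, gold = chars(pred_spans), chars(gold_spans)
    if not pred or not gold:
        return 1.0 if pred == gold else 0.0
    overlap = len(pred & gold)
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall) if overlap else 0.0

# Example: full overlap on doc 0, a spurious prediction on doc 1.
print(char_f1([(0, 250, 302), (1, 0, 10)], [(0, 250, 302)]))  # ~0.91
```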

## 🔄 Recreating IGCSBench and GenCS

Please follow the code in [./src/igcs/datasets](src/igcs/datasets) for recreating IGCSBench and GenCS.


# 🏋️‍♂️ Training

Example training of Llama-3-8B-Instruct on the GenCS-Union dataset.

Note: don't forget to install the extra requirements with `pip install -U 'igcs[train]'` before training.


```bash
export CUDA_VISIBLE_DEVICES=4,5,7
MODEL_CODENAME="llama3-8b-gencs"  # placeholder, used only to name the log file
nohup accelerate launch --main_process_port 31337 src/igcs/train_model/__init__.py \
    --output_dir "my-trained-model" \
    --train_dataset 'ReverseInstructions' \
    --dataset_dir './igcs-dataset/prompts' \
    --model_name 'meta-llama/Meta-Llama-3-8B-Instruct' \
    --gradient_accumulation_steps 1 \
    --batch_size 4 \
    --seq_length 4096 \
    --num_train_epochs 3 \
    --evaluation_strategy "no" \
    --neftune_noise_alpha 5.0 \
    --warmup_ratio 0.06 > "train_${MODEL_CODENAME}.log" 2>&1 &
```

# ⚖️ License

The code in this repo is dual-licensed under the MIT and Apache 2.0 licenses.
All datasets are provided under their original licenses and can be used accordingly - please verify them before use.

## 🤝 Contributing
Contributions and pull requests are welcome - feel free to open a PR.
Found a bug or have a suggestion? Please file an issue.

            
