| Field | Value |
| --- | --- |
| Name | fast-psq |
| Version | 0.1.0 |
| home_page | https://github.com/hltcoe/PSQ |
| Summary | Efficient Implementation of Probabilistic Structured Queries |
| upload_time | 2024-04-29 20:55:38 |
| maintainer | None |
| docs_url | None |
| author | Eugene Yang |
| requires_python | >=3.8 |
| license | None |
| keywords | None |
| bugtrack_url | None |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
# Efficient Implementation of Probabilistic Structured Queries
This package is an implementation of the Probabilistic Structured Queries (PSQ) algorithm for
cross-language information retrieval.
It uses alignment tables from statistical machine translation to translate each document's
bag of tokens into the query language.
Raw translation tables are available on Huggingface Models: [`hltcoe/psq_translation_tables`](https://huggingface.co/hltcoe/psq_translation_tables).
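To give a sense of the core idea, here is a minimal, self-contained sketch of how an alignment table projects a document's bag of tokens into query-language term weights. This is not the package's internal API, and the table and tokens below are made up.

```python
from collections import Counter

# Hypothetical alignment table: document-language token ->
# {query-language token: translation probability}.
alignment_table = {
    "猫": {"cat": 0.9, "kitten": 0.1},
    "黑": {"black": 0.8, "dark": 0.2},
}

def psq_project(doc_tokens, table):
    """Turn document token counts into query-language term weights by
    summing count * translation probability (a simplified view of PSQ)."""
    weights = Counter()
    for token, count in Counter(doc_tokens).items():
        for query_term, prob in table.get(token, {}).items():
            weights[query_term] += count * prob
    return weights

# A tiny "document" containing 黑 once and 猫 twice.
print(psq_project(["黑", "猫", "猫"], alignment_table))
# cat: 1.8, black: 0.8, kitten: 0.2, dark: 0.2
```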
## Get started
`fast_psq` is available on PyPI.
```bash
pip install fast_psq
```
Alternatively, you can install directly from the GitHub main branch with the following
command.
```bash
pip install git+https://github.com/hltcoe/PSQ
```
`fast_psq` works well with `ir_datasets` and `ir_measures` for accessing IR evaluation collections
and evaluating results. You can install both packages with the following command.
```bash
pip install ir_datasets ir_measures
```
## Indexing
The indexing script takes a translation table (i.e., an alignment matrix) and a document `jsonl` file.
We release a number of translation tables on Huggingface Models; the script downloads them automatically
when you pass the path to the `--psq_file` flag in the format `{repo_id}:{file_path}`.
Alternatively, you can pass a local `.json.gz` file containing a dictionary of dictionaries, mapping from
source tokens (strings) to target tokens (strings) to alignment probabilities.
Note that the default tokenizer in the script is `mosestokenizer`, which may not match the one used to build your own
alignment matrix. Either use `mosestokenizer` when aligning the bitext or replace the tokenizer with your own.
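For illustration, a local table in that format could be produced as follows (the tokens and probabilities are made up; the resulting file would then be passed via the same `--psq_file` flag):

```python
import gzip
import json

# Toy dictionary of dictionaries: source token -> {target token: probability}.
# A real table would come from a statistical aligner run over bitext.
translation_table = {
    "gato": {"cat": 0.92, "kitty": 0.08},
    "negro": {"black": 0.85, "dark": 0.15},
}

with gzip.open("my_table.json.gz", "wt", encoding="utf-8") as f:
    json.dump(translation_table, f, ensure_ascii=False)
```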
The document file should be a `jsonl` file with one document per line.
You can specify the fields for the document id, title, and body text by passing the field names
through `--docid`, `--title`, and `--body`, respectively.
Alternatively, you can use `--doc_source` with the `irds:` prefix to use a dataset from `ir_datasets`.
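As an illustration, a compatible document file could be written like this (the field names match the `--docid`, `--title`, and `--body` values used in the command below; the documents themselves are made up):

```python
import json

docs = [
    {"doc_id": "d1", "title": "First title", "text": "Body text of the first document."},
    {"doc_id": "d2", "title": "Second title", "text": "Body text of the second document."},
]

# One JSON object per line, i.e. the expected jsonl layout.
with open("docs.jsonl", "w", encoding="utf-8") as f:
    for doc in docs:
        f.write(json.dumps(doc, ensure_ascii=False) + "\n")
```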
The following is an example indexing command.
```bash
python -m fast_psq.index \
--doc_file irds:neuclir/1/zh/trec-2022 \
--lang zh \
--psq_file hltcoe/psq_translation_tables:zh.table.dict.gz \
--min_translation_prob 0.00010 \
--max_translation_alternatives 64 \
--max_translation_cdf 0.99 \
--docid doc_id \
--title title \
--body text \
--output_dir ./indexes/neuclir-zh.f32/ \
--compression \
--nworkers 64
```
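The three pruning flags interact roughly as follows: translation alternatives below `--min_translation_prob` are dropped, at most `--max_translation_alternatives` entries are kept per source token, and the list is cut off once the cumulative probability reaches `--max_translation_cdf`. The sketch below is a simplified illustration of that idea, not the package's actual code.

```python
def prune_alternatives(alternatives, min_prob=1e-4, max_alts=64, max_cdf=0.99):
    """Prune a {target token: probability} dict the way the flags suggest:
    drop low-probability entries, keep at most max_alts, and stop once the
    kept probability mass reaches max_cdf."""
    ranked = sorted(alternatives.items(), key=lambda kv: kv[1], reverse=True)
    kept, cdf = [], 0.0
    for token, prob in ranked[:max_alts]:
        if prob < min_prob or cdf >= max_cdf:
            break  # entries are sorted, so everything after this is pruned too
        kept.append((token, prob))
        cdf += prob
    return dict(kept)

alts = {"cat": 0.7, "kitten": 0.2, "feline": 0.08, "dog": 0.00005}
print(prune_alternatives(alts))  # "dog" falls below the probability floor
```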
Please use `python -m fast_psq.index --help` for more information about the arguments.
## Searching
The search script takes the index and a `tsv` query file and outputs a TREC-style result file.
As with indexing, `ir_datasets` is supported via the `irds:` prefix in both the `--query_source` and `--qrels` arguments.
The following is an example search command.
```bash
python -m fast_psq.search \
--query_source irds:neuclir/1/zh/trec-2022 \
--query_field title \
--index_dir ./indexes/neuclir-zh.f32/ \
--qrels irds:neuclir/1/zh/trec-2022 \
--query_lang en \
--output_file ./neuclir-zh.en.title.f32.trec
```
Please use `python -m fast_psq.search --help` for more information about the arguments.
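The resulting run file can then be scored with `ir_measures` and `ir_datasets`; for example (the measures below are just an illustration):

```python
import ir_datasets
import ir_measures
from ir_measures import nDCG, RR

# Qrels for the same collection that was searched above.
dataset = ir_datasets.load("neuclir/1/zh/trec-2022")

# The TREC-style run produced by fast_psq.search.
run = ir_measures.read_trec_run("./neuclir-zh.en.title.f32.trec")

# Aggregate scores over all queries in the run.
print(ir_measures.calc_aggregate([nDCG@20, RR], dataset.qrels_iter(), run))
```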
## Citation
```bibtex
@article{psq-repro,
title = {Efficiency-Effectiveness Tradeoff of Probabilistic Structured Queries for Cross-Language Information Retrieval},
author = {Eugene Yang and Suraj Nair and Dawn Lawrie and James Mayfield and Douglas W. Oard and Kevin Duh},
  journal = {arXiv preprint},
year = {2024}
}
```