# SoFair Filter
Simple command line tool for identifying candidate documents for software mention extraction.
## Installation
```bash
pip install sofairfilter
```
The default configuration uses flash attention (https://github.com/Dao-AILab/flash-attention), which must be installed separately. You can install it with:
```bash
pip install flash-attn --no-build-isolation
```
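If flash-attn is not available on your system (for example, when running without a compatible GPU), you can switch to a different attention implementation through a custom configuration (see the Custom Configuration section below). A minimal sketch, assuming a partial configuration file is merged with the defaults; if it is not, copy the full default configuration and change just this key:

```yaml
# Hypothetical config.yaml: fall back to PyTorch's built-in SDPA attention
# instead of flash_attention_2 (SDPA requires torch>=2.1.1).
model_factory:
  attn_implementation: sdpa
```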
## Usage
To process a folder containing text documents and filter them based on the presence of software mentions, you can use the following command:
```bash
sofairfilter folder_with_txt_documents
```
It will print paths to the documents that contain software mentions.
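Because the result goes to standard output, it can be combined with ordinary shell tools. A hypothetical example, assuming one file path per line, that copies the candidate documents into a separate directory for downstream mention extraction:

```bash
# Collect candidate documents into candidates/
# (assumes sofairfilter prints one file path per line).
mkdir -p candidates
sofairfilter folder_with_txt_documents | xargs -I{} cp "{}" candidates/
```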
### Custom Configuration
You can run it with a custom configuration file using the `--config` option:
```bash
sofairfilter folder_with_txt_documents --config path/to/config.yaml
```
The default configuration is:
```yaml
model_factory: # Model configuration.
  model_path: SoFairOA/sofair-modernBERT-base-filter # Name or path to the model.
  attn_implementation: flash_attention_2 # The attention implementation to use in the model (if relevant). Can be any of "eager" (manual implementation of the attention), "sdpa" (using F.scaled_dot_product_attention), or "flash_attention_2" (using Dao-AILab/flash-attention). By default, if available, SDPA will be used for torch>=2.1.1. The default is otherwise the manual "eager" implementation.
  cache_dir: # Path to Hugging Face cache directory.
  quantization: # Configuration for bits and bytes quantization.
    load_in_8bit: false # This flag is used to enable 8-bit quantization with LLM.int8().
    load_in_4bit: false # This flag is used to enable 4-bit quantization by replacing the Linear layers with FP4/NF4 layers from `bitsandbytes`.
    llm_int8_threshold: 6.0 # This corresponds to the outlier threshold for outlier detection as described in the `LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale` paper (https://arxiv.org/abs/2208.07339). Any hidden state value that is above this threshold will be considered an outlier and the operation on those values will be done in fp16. Values are usually normally distributed, that is, most values are in the range [-3.5, 3.5], but there are some exceptional systematic outliers that are very differently distributed for large models. These outliers are often in the interval [-60, -6] or [6, 60]. Int8 quantization works well for values of magnitude ~5, but beyond that, there is a significant performance penalty. A good default threshold is 6, but a lower threshold might be needed for more unstable models (small models, fine-tuning).
    llm_int8_skip_modules: # An explicit list of the modules that we do not want to convert to 8-bit. This is useful for models such as Jukebox that have several heads in different places and not necessarily at the last position. For example, for `CausalLM` models, the last `lm_head` is kept in its original `dtype`.
    llm_int8_enable_fp32_cpu_offload: false # This flag is used for advanced use cases and users who are aware of this feature. If you want to split your model into different parts and run some parts in int8 on GPU and some parts in fp32 on CPU, you can use this flag. This is useful for offloading large models such as `google/flan-t5-xxl`. Note that the int8 operations will not be run on CPU.
    llm_int8_has_fp16_weight: false # This flag runs LLM.int8() with 16-bit main weights. This is useful for fine-tuning as the weights do not have to be converted back and forth for the backward pass.
    bnb_4bit_compute_dtype: # This sets the computational type, which might be different from the input type. For example, inputs might be fp32, but computation can be set to bf16 for speedups.
    bnb_4bit_quant_type: fp4 # This sets the quantization data type in the bnb.nn.Linear4Bit layers. Options are FP4 and NF4 data types, which are specified by `fp4` or `nf4`.
    bnb_4bit_use_double_quant: false # This flag is used for nested quantization where the quantization constants from the first quantization are quantized again.
    bnb_4bit_quant_storage: # This sets the storage type used to pack the quantized 4-bit params.
  torch_dtype: bfloat16 # Override the default torch.dtype and load the model under a specific dtype.
  trust_remote_code: false # Whether to trust remote code.
  config: # Configuration for the model.
  device: cuda # Device map for the model. If not specified, the model will be loaded on the CPU. Defaults to auto.
  labels: # Classification labels; the position specifies the label id. Leave empty for automatic detection of labels from the dataset or to use labels from the model configuration.
tokenizer: # Hugging Face tokenizer for the model. Leave empty if you wish to initialize it from the model.
threshold: # The threshold for the model's confidence probability. Documents with a probability below this threshold will be filtered out. By default, no threshold is applied and the class with the highest probability is selected.
batch_size: 32 # Batch size for processing documents.
```
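As an illustration, a custom configuration can override just the fields you care about. The following sketch is hypothetical (the key names follow the default configuration above; the values are only examples): it runs the model on CPU with the eager attention implementation, applies a confidence threshold, and reduces the batch size.

```yaml
# Hypothetical config.yaml for a CPU-only setup.
model_factory:
  model_path: SoFairOA/sofair-modernBERT-base-filter
  attn_implementation: eager   # no flash-attn or SDPA required
  torch_dtype: float32         # bfloat16 may be slow or unsupported on CPU
  device: cpu
threshold: 0.7   # documents with a predicted probability below 0.7 are filtered out
batch_size: 8
```

If the tool requires a complete configuration file rather than a partial override, start from the full default configuration above and edit these values.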
See help for more options:
```bash
sofairfilter --help
```
## Evaluation
We evaluated the default model on the test set of the [SoFairOA/sofair_softcite_somesci](https://huggingface.co/datasets/SoFairOA/sofair_softcite_somesci) dataset (sofair_softcite_somesci_documents configuration):
| Metric    | Score              |
|-----------|--------------------|
| precision | 0.8625730994152047 |
| recall    | 0.9104938271604939 |
| f1        | 0.8858858858858859 |
| accuracy  | 0.9268527430221367 |
Scripts used for evaluation are available in the `experiments/sofair_softcite_somesci` folder.