<!--- Top of README Badges (automated) --->
[![PyPI](https://img.shields.io/pypi/v/wipac-file-catalog-indexer)](https://pypi.org/project/wipac-file-catalog-indexer/) [![GitHub release (latest by date including pre-releases)](https://img.shields.io/github/v/release/WIPACrepo/file-catalog-indexer?include_prereleases)](https://github.com/WIPACrepo/file-catalog-indexer/) [![PyPI - License](https://img.shields.io/pypi/l/wipac-file-catalog-indexer)](https://github.com/WIPACrepo/file-catalog-indexer/blob/master/LICENSE) [![Lines of code](https://img.shields.io/tokei/lines/github/WIPACrepo/file-catalog-indexer)](https://github.com/WIPACrepo/file-catalog-indexer/) [![GitHub issues](https://img.shields.io/github/issues/WIPACrepo/file-catalog-indexer)](https://github.com/WIPACrepo/file-catalog-indexer/issues?q=is%3Aissue+sort%3Aupdated-desc+is%3Aopen) [![GitHub pull requests](https://img.shields.io/github/issues-pr/WIPACrepo/file-catalog-indexer)](https://github.com/WIPACrepo/file-catalog-indexer/pulls?q=is%3Apr+sort%3Aupdated-desc+is%3Aopen)
<!--- End of README Badges (automated) --->
# file-catalog-indexer
Indexing package and scripts for the File Catalog
## How To
### API
#### `from indexer.index import index`
- The flagship indexing function
- Find files rooted at given path(s), compute their metadata, and upload it to File Catalog
- Configurable for multi-processing (default: 1 process) and recursive file-traversing (default: on)
- Internally communicates asynchronously with File Catalog
- Note: Symbolic links are never followed.
- Note: `index()` runs the current event loop (`asyncio.get_event_loop().run_until_complete()`)
- Ex:
```python
index(
    index_config,  # see config.py for a description of the fields in these typed dictionaries
    oauth_config,
    rest_config,
)
```
#### `from indexer.index import index_file`
- Compute the metadata of a single file and upload it to File Catalog (i.e., index one file)
- Single-processed, single-threaded
```python
await index_file(
    filepath='/data/exp/IceCube/2018/filtered/level2/0820/Run00131410_74/Level2_IC86.2018_data_Run00131410_Subrun00000000_00000172.i3.zst',
    manager=MetadataManager(...),
    fc_rc=RestClient(...),
)
```
#### `from indexer.index import index_paths`
- A wrapper around `index_file()` which indexes multiple files, and returns any nested sub-directories
- Single-processed, single-threaded
- Note: Symbolic links are never followed.
```python
sub_dirs = await index_paths(
    paths=[
        '/data/exp/IceCube/2018/filtered/level2/0820',
        '/data/exp/IceCube/2018/filtered/level2/0825',
    ],
    manager=MetadataManager(...),
    fc_rc=RestClient(...),
)
```
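
The "returns any nested sub-directories" and "symbolic links are never followed" behavior can be pictured with a small, standalone sketch. This is not the package's implementation: `scan_path()` and `walk()` are hypothetical helpers written only with the standard library, and they assume that one call looks at a single directory level and hands sub-directories back to the caller.

```python
import os
from typing import List, Tuple


def scan_path(path: str) -> Tuple[List[str], List[str]]:
    """Return (files, sub_dirs) found directly at/under `path`, ignoring symlinks.

    Hypothetical helper for illustration; not part of the indexer package.
    """
    if os.path.islink(path):
        return [], []  # symbolic links are never followed
    if os.path.isfile(path):
        return [path], []
    files: List[str] = []
    sub_dirs: List[str] = []
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_symlink():
                continue  # skip links to files and links to directories alike
            if entry.is_file(follow_symlinks=False):
                files.append(entry.path)
            elif entry.is_dir(follow_symlinks=False):
                sub_dirs.append(entry.path)
    return sorted(files), sorted(sub_dirs)


def walk(paths: List[str]) -> List[str]:
    """Breadth-first traversal: keep feeding returned sub-directories back in."""
    found: List[str] = []
    queue = list(paths)
    while queue:
        files, sub_dirs = scan_path(queue.pop(0))
        found.extend(files)
        queue.extend(sub_dirs)
    return found
```

Returning sub-directories instead of recursing lets the caller decide whether to descend at all, and how to distribute the remaining directories across workers.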
#### `from indexer.metadata_manager import MetadataManager`
- The internal brain of the Indexer. This has minimal guardrails, does not communicate with File Catalog, and does not traverse the directory tree.
- Metadata is produced for one file at a time.
- Ex:
```python
manager = MetadataManager(...) # caches connections & directory info, manages metadata collection
metadata_file = manager.new_file(filepath) # returns an instance (computationally light)
metadata = metadata_file.generate() # returns a dict (computationally intense)
```
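
To give a feel for why `generate()` is the expensive step, here is a minimal sketch of collecting the basic fields this README names elsewhere (`logical_name`, `checksum`, `file_size`, `locations`, `create_date`). It is an illustration only: `basic_metadata()` and `sha512sum()` are hypothetical names, and the shape of each `locations` entry and the use of the file's modification time for `create_date` are assumptions, not the package's documented behavior.

```python
import hashlib
import os
from datetime import datetime, timezone
from typing import Any, Dict


def sha512sum(filepath: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash the file in chunks so large files never sit fully in memory."""
    digest = hashlib.sha512()
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def basic_metadata(filepath: str, site: str) -> Dict[str, Any]:
    """Build a dict of basic fields (hypothetical layout, for illustration)."""
    stat = os.stat(filepath)
    return {
        "logical_name": filepath,
        # Reading every byte of the file is what makes this step costly.
        "checksum": {"sha512": sha512sum(filepath)},
        "file_size": stat.st_size,
        # Assumed entry shape: one dict per site holding a copy of the file.
        "locations": [{"site": site, "path": filepath}],
        # Assumption: derive the date from the file's modification time.
        "create_date": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
        .date()
        .isoformat(),
    }
```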
### Scripts
##### `python -m indexer.index`
- A command-line alternative to using `from indexer.index import index`
- Use with `-h` to see usage.
- Note: Symbolic links are never followed.
##### `python -m indexer.generate`
- Like `python -m indexer.index`, but prints (using `pprint`) the metadata instead of posting to File Catalog.
- In short, it wraps file-traversing logic around calls to `indexer.metadata_manager.MetadataManager`
- Note: Symbolic links are never followed.
##### `python -m indexer.delocate`
- Find files rooted at given path(s); for each, remove the matching location entry from its File Catalog record.
- Note: Symbolic links are never followed.
## .i3 File Processing-Level Detection and Embedded Filename-Metadata Extraction
Regular expressions are used heavily to detect the processing level of a `.i3` file and to extract any metadata embedded in its filename. The exact process depends on the type of data:
### Real Data (`/data/exp/*`)
This is a two-stage process (see `MetadataManager._new_file_real()`):
1. Processing-Level Detection (Base Pattern Screening)
    - The filename is matched against multiple generic patterns to detect whether it is L2, PFFilt, PFDST, or PFRaw
    - If the filename does not trigger a match, *only basic metadata is collected* (`logical_name`, `checksum`, `file_size`, `locations`, and `create_date`)
2. Embedded Filename-Metadata Extraction
    - After the processing level is known, the filename is parsed using one of (possibly) several tokenizing regex patterns for the best match (greedy matching)
    - If the filename does not trigger a match, *the function will raise an exception (script will exit).* This probably indicates that a new pattern needs to be added to the list.
        + see `indexer.metadata.real.filename_patterns`
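
The two stages above can be sketched roughly as follows. The patterns here are hypothetical stand-ins, not the ones in `indexer.metadata.real.filename_patterns`, and `detect_level()` / `parse_real_filename()` are illustrative names. The sketch also simply takes the first tokenizing pattern that matches, which may differ from how the package picks its "best match".

```python
import re
from typing import Dict, Optional

# Stage 1: hypothetical generic patterns, one per processing level.
BASE_PATTERNS = {
    "L2": re.compile(r"Level2"),
    "PFFilt": re.compile(r"PFFilt"),
    "PFDST": re.compile(r"PFDST"),
    "PFRaw": re.compile(r"PFRaw"),
}

# Stage 2: hypothetical tokenizing patterns, grouped by processing level.
TOKEN_PATTERNS = {
    "L2": [
        re.compile(
            r"Level2_IC86\.(?P<year>\d{4})_data_Run(?P<run>\d+)"
            r"_Subrun(?P<subrun>\d+)_(?P<part>\d+)\.i3(?:\.\w+)?$"
        ),
    ],
}


def detect_level(filename: str) -> Optional[str]:
    """Stage 1: return the processing level, or None if nothing matches."""
    for level, pattern in BASE_PATTERNS.items():
        if pattern.search(filename):
            return level
    return None


def parse_real_filename(filename: str) -> Optional[Dict[str, str]]:
    """Return the embedded tokens, or None when only basic metadata applies."""
    level = detect_level(filename)
    if level is None:
        return None  # caller falls back to basic metadata only
    for pattern in TOKEN_PATTERNS.get(level, []):
        match = pattern.match(filename)
        if match:
            return {"processing_level": level, **match.groupdict()}
    # The level is known but no tokenizing pattern fits: a pattern is missing.
    raise ValueError(f"no filename pattern for {level} file: {filename}")
```

For example, `parse_real_filename("Level2_IC86.2018_data_Run00131410_Subrun00000000_00000172.i3.zst")` yields the run, subrun, and part tokens as strings, while an unrelated name such as `notes.txt` returns `None`.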
### Simulation Data (`/data/sim/*`)
This is a three-stage process (see `MetadataManager._new_file_simulation()`):
1. Base Pattern Screening
    - The filename is checked for `.i3` file extensions: `.i3`, `.i3.gz`, `.i3.bz2`, `.i3.zst`
    - If the filename does not trigger a match, *only basic metadata is collected* (`logical_name`, `checksum`, `file_size`, `locations`, and `create_date`)
        + there are a couple of hard-coded "anti-patterns" used for rejecting known false positives (see code)
2. Embedded Filename-Metadata Extraction
    - The filename is parsed using one of MANY (around a thousand) tokenizing regex patterns for the best match (greedy matching)
    - If the filename does not trigger a match, *the function will raise an exception (script will exit).* This probably indicates that a new pattern needs to be added to the list.
        + see `indexer.metadata.sim.filename_patterns`
3. Processing-Level Detection
    - The filename is parsed for substrings corresponding to a processing level
        + see `DataSimI3FileMetadata.figure_processing_level()`
    - If there is no match, `processing_level` will be set to `None`, since the processing level is less important for simulation data.
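
A compact sketch of the three stages, under stated assumptions: the single tokenizing pattern and the substring table below are invented for illustration (the real list lives in `indexer.metadata.sim.filename_patterns`), `parse_sim_filename()` is a hypothetical name, and the anti-pattern rejection step is left out.

```python
import re
from typing import Dict, Optional

I3_EXTENSIONS = (".i3", ".i3.gz", ".i3.bz2", ".i3.zst")

# Hypothetical tokenizing pattern; the package has around a thousand of these.
TOKEN_PATTERNS = [
    re.compile(
        r"^(?P<prefix>.+?)_(?P<dataset>\d{6})_(?P<job>\d{6})"
        r"\.i3(?:\.(?:gz|bz2|zst))?$"
    ),
]

# Hypothetical substring -> processing-level table.
LEVEL_SUBSTRINGS = {"level2": "L2", "level3": "L3"}


def parse_sim_filename(filename: str) -> Optional[Dict[str, Optional[str]]]:
    # Stage 1: base pattern screening on the file extension.
    if not filename.endswith(I3_EXTENSIONS):
        return None  # caller collects only basic metadata

    # Stage 2: embedded filename-metadata extraction (first match wins here).
    for pattern in TOKEN_PATTERNS:
        match = pattern.match(filename)
        if match:
            tokens: Dict[str, Optional[str]] = dict(match.groupdict())
            break
    else:
        raise ValueError(f"no filename pattern matches: {filename}")

    # Stage 3: processing-level detection by substring; None is acceptable.
    lowered = filename.lower()
    tokens["processing_level"] = next(
        (level for sub, level in LEVEL_SUBSTRINGS.items() if sub in lowered),
        None,
    )
    return tokens
```

Note the ordering difference from real data: here the tokens are extracted first and the processing level is a best-effort afterthought, so a missing level never raises.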
## Metadata Schema
See:
- [Google Doc](https://docs.google.com/document/d/14SanUWiYEbgarElt0YXSn_2We-rwT-ePO5Fg7rrM9lw/edit?usp=sharing)
- [File Catalog Types](https://github.com/WIPACrepo/file_catalog/blob/master/file_catalog/schema/types.py)
## Warnings
### Re-indexing Files is Tricky (Two Scenarios)
1. Indexing files that have not changed locations is okay--this probably means that the rest of the metadata has also not changed. A guardrail query will check if the file exists in the FC with that `locations` entry, and will not process the file further.
2. HOWEVER, don't point the indexer at restored files (of the same file-version), i.e. those whose initial `locations` entry was removed (e.g. removed from WIPAC, then moved back). Unlike an unchanged file, such a file will be *fully locally processed* (opened, read, and checksummed) before the checksum conflict is detected and indexing aborts. These files will be skipped (not sent to FC) unless you use `--patch` *(replaces the `locations` list wholesale)*, which is **DANGEROUS**.
    - Example Conflict: a file-version can still exist in FC even after the initial guardrail query finds no match
        1. file was at WIPAC & indexed
        2. then moved to NERSC (`locations` entry added) & deleted from WIPAC (`locations` entry removed)
        3. file was brought back to WIPAC
        4. now is being re-indexed at WIPAC
        5. CONFLICT -> has the same `logical_name`+`checksum.sha512` but differing `locations`
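
The two scenarios can be modeled with a small in-memory sketch. Everything here is an assumption made for illustration: `decide()` is a hypothetical function, a plain list of dicts stands in for File Catalog, and the returned labels are invented. It only mirrors the order of events described above: a cheap guardrail check first, the expensive checksum second, the conflict check last.

```python
from typing import Any, Callable, Dict, List


def decide(
    records: List[Dict[str, Any]],
    logical_name: str,
    location: Dict[str, str],
    compute_sha512: Callable[[], str],
    patch: bool = False,
) -> str:
    """Return what a re-indexing attempt would do (illustrative labels only)."""
    # Scenario 1: the guardrail finds this exact location, so stop early.
    for rec in records:
        if location in rec["locations"]:
            return "already-indexed"

    # Past the guardrail: the file is now fully read and checksummed.
    sha512 = compute_sha512()

    # Scenario 2: same file-version is known, but under different locations.
    for rec in records:
        same_name = rec["logical_name"] == logical_name
        if same_name and rec["checksum"]["sha512"] == sha512:
            if patch:
                # DANGEROUS: the whole locations list is replaced.
                rec["locations"] = [location]
                return "patched"
            return "conflict-skipped"

    return "new"
```

In the example conflict, the catalog record lists only the NERSC location, so the WIPAC guardrail check misses, the checksum is computed, and the result is `"conflict-skipped"`. With `patch=True` the NERSC entry would be dropped from the record, which is why patching is flagged as dangerous.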
Raw data
{
"_id": null,
"home_page": "https://github.com/WIPACrepo/file-catalog-indexer",
"name": "wipac-file-catalog-indexer",
"maintainer": "",
"docs_url": null,
"requires_python": "<3.12,>=3.8",
"maintainer_email": "",
"keywords": "indexer,metadata,data,warehouse,archive,L2,PFDST,PFFilt,PFRaw,i3,simulation,File,Catalog,iceprod,WIPAC,IceCube",
"author": "WIPAC Developers",
"author_email": "developers@icecube.wisc.edu",
"download_url": "https://files.pythonhosted.org/packages/ea/22/9701550828a67eabb88774c9b708e9d1751291f37a7cf9889abb5c65ba2b/wipac-file-catalog-indexer-2.0.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Indexing package and scripts for the File Catalog",
"version": "2.0.1",
"project_urls": {
"Download": "https://pypi.org/project/wipac-file-catalog-indexer/",
"Homepage": "https://github.com/WIPACrepo/file-catalog-indexer",
"Source": "https://github.com/WIPACrepo/file-catalog-indexer",
"Tracker": "https://github.com/WIPACrepo/file-catalog-indexer/issues"
},
"split_keywords": [
"indexer",
"metadata",
"data",
"warehouse",
"archive",
"l2",
"pfdst",
"pffilt",
"pfraw",
"i3",
"simulation",
"file",
"catalog",
"iceprod",
"wipac",
"icecube"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "5b28318ec6516c12f67cd0d1579d8fec4f64e7ae5c158b1854f6f6c004f34fc8",
"md5": "d1f580bc9cef3a662a6c02f28679fb45",
"sha256": "19ae166bb64118ce81adbcecf4c08e6f37be4e6755e4f42392d28d6ad6c18998"
},
"downloads": -1,
"filename": "wipac_file_catalog_indexer-2.0.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "d1f580bc9cef3a662a6c02f28679fb45",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<3.12,>=3.8",
"size": 92202,
"upload_time": "2023-05-05T18:02:18",
"upload_time_iso_8601": "2023-05-05T18:02:18.398871Z",
"url": "https://files.pythonhosted.org/packages/5b/28/318ec6516c12f67cd0d1579d8fec4f64e7ae5c158b1854f6f6c004f34fc8/wipac_file_catalog_indexer-2.0.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "ea229701550828a67eabb88774c9b708e9d1751291f37a7cf9889abb5c65ba2b",
"md5": "090cca444006254603703aeb6dfdfcfc",
"sha256": "47d474a6ffc64de05725129077e650e0f6c9effb308a4595f9a5ab4be1cb379d"
},
"downloads": -1,
"filename": "wipac-file-catalog-indexer-2.0.1.tar.gz",
"has_sig": false,
"md5_digest": "090cca444006254603703aeb6dfdfcfc",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<3.12,>=3.8",
"size": 75717,
"upload_time": "2023-05-05T18:02:20",
"upload_time_iso_8601": "2023-05-05T18:02:20.350980Z",
"url": "https://files.pythonhosted.org/packages/ea/22/9701550828a67eabb88774c9b708e9d1751291f37a7cf9889abb5c65ba2b/wipac-file-catalog-indexer-2.0.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-05-05 18:02:20",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "WIPACrepo",
"github_project": "file-catalog-indexer",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [],
"lcname": "wipac-file-catalog-indexer"
}