# ServiceX DataBinder
<p align="right"> Release v0.5.0 </p>
[![PyPI version](https://badge.fury.io/py/servicex-databinder.svg)](https://badge.fury.io/py/servicex-databinder)
`servicex-databinder` is a user-analysis data management package using a single configuration file.
Samples with external data sources (e.g. `RucioDID` or `XRootDFiles`) utilize ServiceX to deliver user-selected columns with optional row filtering.
<!-- to interact with ServiceX instance to make ServiceX request(s) and manage ServiceX delivered data from a single configuration file. -->
The following table shows the ServiceX transformers supported by DataBinder:
| Input format | Code generator | Transformer | Output format |
| :--- | :---: | :---: | :---: |
| ROOT Ntuple | func-adl | `uproot` | `root` or `parquet` |
| ATLAS Release 21 xAOD | func-adl | `atlasr21`| `root` |
| ROOT Ntuple | python function | `python`| `root` or `parquet` |
<!-- [`ServiceX`](https://github.com/ssl-hep/ServiceX) is a scalable HEP event data extraction, transformation and delivery system.
['ServiceX Client library'](https://github.com/ssl-hep/ServiceX_frontend) provides -->
## Prerequisite
- [Access to a ServiceX instance](https://servicex.readthedocs.io/en/latest/user/getting-started/)
- Python 3.7+
## Installation
```shell
pip install servicex-databinder
```
## Configuration file
The configuration file is a YAML file that contains all the information needed to make ServiceX requests and manage the delivered data.
The [following example configuration file](config_minimum.yaml) contains the minimal set of fields. You can also download the [`servicex-opendata.yaml`](servicex-opendata.yaml) file (renamed to `servicex.yaml`) to your working directory and run DataBinder on OpenData without an access token.
```yaml
General:
  ServiceXName: servicex-opendata
  OutputFormat: parquet

Sample:
  - Name: ggH125_ZZ4lep
    XRootDFiles: "root://eospublic.cern.ch//eos/opendata/atlas/OutreachDatasets\
                  /2020-01-22/4lep/MC/mc_345060.ggH125_ZZ4lep.4lep.root"
    Tree: mini
    Columns: lep_pt, lep_eta
```
The `General` block requires two mandatory options (`ServiceXName` and `OutputFormat`), as in the example above.
The input dataset for each Sample can be defined by `RucioDID`, `XRootDFiles`, or `LocalPath`.
A ServiceX query can be constructed with either TCut syntax or func-adl:
- Options for TCut syntax: `Filter`<sup>1</sup> and `Columns`
- Option for a func-adl expression: `FuncADL`
<sup>1</sup> `Filter` works only for scalar-type `TBranch`es.
The output format can be either Apache Parquet or ROOT ntuple for the `uproot` backend. Only the ROOT ntuple format is supported for the `xAOD` backend.
The following options are available:
<!-- `General` block: -->
| Option for `General` block | Description | DataType |
|:--------:|:------|:------|
| `ServiceXName`* | ServiceX backend name in your `servicex.yaml` file | `String` |
| `OutputFormat`* | Output file format of ServiceX delivered data (`parquet` or `root` for `uproot` / `root` for `xaod`) | `String` |
| `Transformer` | Set transformer for all Samples. Overwrites the default transformer in the `servicex.yaml` file. | `String`|
| `Delivery` | Delivery option; `LocalPath` (default) or `LocalCache` or `ObjectStore` | `String` |
| `OutputDirectory` | Path to a directory for ServiceX delivered files | `String` |
| `WriteOutputDict` | Name of an output yaml file containing a Python nested dictionary of output file paths (located in the `OutputDirectory`) | `String` |
| `IgnoreServiceXCache` | Ignore the existing ServiceX cache and force to make ServiceX requests | `Boolean` |
<p align="right"> *Mandatory options</p>
| Option for `Sample` block | Description |DataType |
|:--------:|:------|:------|
| `Name` | Sample name defined by a user |`String` |
| `Transformer` | Transformer for the given sample | `String`|
| `RucioDID` | Rucio dataset identifier (DID) for a given sample; <br> can be multiple DIDs separated by commas |`String` |
| `XRootDFiles` | XRootD files (e.g. `root://`) for a given sample; <br> can be multiple files separated by commas |`String` |
| `Tree` | Name of the input ROOT `TTree`; <br> can be multiple `TTree`s separated by commas (`uproot` ONLY) |`String` |
| `Filter` | Selection in TCut syntax, e.g. `jet_pt > 10e3 && jet_eta < 2.0` (TCut ONLY) |`String` |
| `Columns` | List of columns (or branches) to be delivered; multiple columns separated by commas (TCut ONLY) |`String` |
| `FuncADL` | Func-adl expression for a given sample |`String` |
| `LocalPath` | Path to local files (NO ServiceX transformation) | `String` |
<!-- Options exclusively for TCut syntax (CANNOT combine with the option `FuncADL`) -->
<!-- Option for func-adl expression (CANNOT combine with the option `Fitler` and `Columns`) -->
A config file can be simplified by using the `Definition` block. Placeholders defined there replace all matching occurrences in the values of the `Sample` block. Note that placeholder names must start with `DEF_`.
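The substitution can be pictured as a plain string replacement over `Sample` values — an illustrative sketch only, not DataBinder's actual implementation:

```python
# Illustrative sketch of DEF_ placeholder substitution (not DataBinder's
# actual code): every string value in a Sample entry has its DEF_ keys
# replaced by the corresponding Definition entries.
def resolve_placeholders(sample, definitions):
    resolved = {}
    for key, value in sample.items():
        if isinstance(value, str):
            for name, replacement in definitions.items():
                value = value.replace(name, replacement)
        resolved[key] = value
    return resolved

definitions = {"DEF_ggH_input": "root://host//path/to/file.root"}
sample = {"Name": "Background1", "XRootDFiles": "DEF_ggH_input", "Tree": "mini"}
print(resolve_placeholders(sample, definitions)["XRootDFiles"])
```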
You can source each Sample with a different ServiceX transformer.
The default transformer is set by the `type` field of `servicex.yaml`; `Transformer` in the `General` block overrides it if present, and `Transformer` in each `Sample` overrides any previous transformer selection.
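This precedence rule can be summarized in one line (a sketch of the rule described above, not DataBinder's code):

```python
# Sketch of the transformer-precedence rule: a Sample-level setting wins over
# a General-level setting, which wins over the `type` set in servicex.yaml.
def effective_transformer(servicex_yaml_type, general_transformer=None,
                          sample_transformer=None):
    return sample_transformer or general_transformer or servicex_yaml_type

print(effective_transformer("uproot"))                        # servicex.yaml default
print(effective_transformer("uproot", "python"))              # General overrides
print(effective_transformer("uproot", "python", "atlasr21"))  # Sample wins
```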
The [following example configuration](config_maximum.yaml) shows how to use each option.
```yaml
General:
  ServiceXName: servicex-uc-af
  Transformer: uproot
  OutputFormat: root
  OutputDirectory: /Users/kchoi/data_for_MLstudy
  WriteOutputDict: fileset_ml_study
  IgnoreServiceXCache: False

Sample:
  - Name: Signal
    RucioDID: user.kchoi:user.kchoi.signalA,
              user.kchoi:user.kchoi.signalB,
              user.kchoi:user.kchoi.signalC
    Tree: nominal
    FuncADL: DEF_ttH_nominal_query
  - Name: Background1
    XRootDFiles: DEF_ggH_input
    Tree: mini
    Filter: lep_n>2
    Columns: lep_pt, lep_eta
  - Name: Background2
    Transformer: atlasr21
    RucioDID: DEF_Zee_input
    FuncADL: DEF_Zee_query
  - Name: Background3
    LocalPath: /Users/kchoi/Work/data/background3
  - Name: Background4
    Transformer: python
    RucioDID: user.kchoi:user.kchoi.background4
    Function: |
      def run_query(input_filenames=None):
          import awkward as ak
          import uproot
          tree_name = "nominal"
          o = uproot.lazy({input_filenames: tree_name})
          return {"nominal": o}

Definition:
  DEF_ttH_nominal_query: "Where(lambda e: e.met_met>150e3). \
    Select(lambda event: {'el_pt': event.el_pt, 'jet_e': event.jet_e, \
    'jet_pt': event.jet_pt, 'met_met': event.met_met})"
  DEF_ggH_input: "root://eospublic.cern.ch//eos/opendata/atlas/OutreachDatasets\
    /2020-01-22/4lep/MC/mc_345060.ggH125_ZZ4lep.4lep.root"
  DEF_Zee_input: "mc15_13TeV:mc15_13TeV.361106.PowhegPythia8EvtGen_AZNLOCTEQ6L1_Zee.\
    merge.DAOD_STDM3.e3601_s2576_s2132_r6630_r6264_p2363_tid05630052_00"
  DEF_Zee_query: "SelectMany('lambda e: e.Jets(\"AntiKt4EMTopoJets\")'). \
    Where('lambda j: (j.pt() / 1000) > 30'). \
    Select('lambda j: j.pt() / 1000.0'). \
    AsROOTTTree('junk.root', 'my_tree', [\"JetPt\"])"
```
## Deliver data
```python
from servicex_databinder import DataBinder
sx_db = DataBinder('<CONFIG>.yml')
out = sx_db.deliver()
```
The function `deliver()` returns a Python nested dictionary that contains delivered files.
<!-- - for `uproot` backend and `parquet` output format: `out['<SAMPLE>']['<TREE>'] = [ List of output parquet files ]`
- for `uproot` backend and `root` output format: `out['<SAMPLE>'] = [ List of output root files ]`
- for `xAOD` backend: `out['<SAMPLE>'] = [ List of output root files ]` -->
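For example, with the `uproot` transformer and `parquet` output, the dictionary is keyed by sample and then by tree. The exact nesting for other transformer/format combinations may differ, so treat the shape below as an assumption and verify it against your own output:

```python
# Hedged sketch: walking the nested dictionary returned by deliver().
# The assumed nesting is sample -> tree -> list of delivered files.
out = {
    "ggH125_ZZ4lep": {"mini": ["out/ggH125_ZZ4lep/mini_0.parquet",
                               "out/ggH125_ZZ4lep/mini_1.parquet"]},
}
for sample, trees in out.items():
    for tree, files in trees.items():
        print(f"{sample}/{tree}: {len(files)} file(s)")
```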
The input configuration can also be passed as a Python dictionary.
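A sketch of the minimal YAML example above expressed as a dictionary, assuming `DataBinder` accepts the same structure the YAML parser would produce:

```python
# The minimal YAML example expressed as a Python dict. Passing a dict instead
# of a file path is assumed to use the same keys as the YAML configuration.
config = {
    "General": {
        "ServiceXName": "servicex-opendata",
        "OutputFormat": "parquet",
    },
    "Sample": [
        {
            "Name": "ggH125_ZZ4lep",
            "XRootDFiles": ("root://eospublic.cern.ch//eos/opendata/atlas/"
                            "OutreachDatasets/2020-01-22/4lep/MC/"
                            "mc_345060.ggH125_ZZ4lep.4lep.root"),
            "Tree": "mini",
            "Columns": "lep_pt, lep_eta",
        }
    ],
}
# from servicex_databinder import DataBinder
# sx_db = DataBinder(config)  # then sx_db.deliver() as with a file path
```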
Delivered Samples and files in the `OutputDirectory` are always synced with the DataBinder config file.
<!-- ## Currently available
- Dataset as Rucio DID + Input file format is ROOT TTree + ServiceX delivers output in parquet format
- Dataset as Rucio DID + Input file format is ATLAS xAOD + ServiceX delivers output in ROOT TTree format
- Dataset as XRootD + Input file format is ROOT TTree + ServiceX delivers output in parquet format -->
## Error handling
```python
failed_requests = sx_db.get_failed_requests()
```
If any ServiceX request fails, `deliver()` prints the number of failed requests along with the Sample name, the Tree (if present), and the input dataset. You can get the full list of failed samples, with the error message for each, from the `get_failed_requests()` function. If the message is not clear, you can browse `Logs` on the ServiceX instance web page for details.
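You might then summarize the failures like this. The per-request record layout shown is an assumption for illustration; inspect the actual objects returned by `get_failed_requests()` for the real fields:

```python
# Hedged sketch: summarizing failed requests. The record structure below
# (Sample / Tree / dataset / error) is assumed, not DataBinder's actual API.
failed_requests = [
    {"Sample": "Background2", "Tree": None,
     "dataset": "mc15_13TeV:mc15_13TeV.361106...", "error": "Transform failed"},
]
for req in failed_requests:
    tree = f" (tree: {req['Tree']})" if req["Tree"] else ""
    print(f"{req['Sample']}{tree}: {req['error']}")
```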
## Useful tools
### Create Rucio container for multiple DIDs
The current ServiceX generates one request per Rucio DID.
It's often the case that a physics analysis needs to process hundreds of DIDs.
In such cases, the script (`scripts/create_rucio_container.py`) can be used to create one Rucio container per Sample from a yaml file.
An example yaml file (`scripts/rucio_dids_example.yaml`) is included.
Here is the usage of the script:
```shell
usage: create_rucio_containers.py [-h] [--dry-run DRY_RUN]
                                  infile container_name version

Create Rucio containers from multiple DIDs

positional arguments:
  infile             yaml file contains Rucio DIDs for each Sample
  container_name     e.g. user.kchoi:user.kchoi.<container-name>.Sample.v1
  version            e.g. user.kchoi:user.kchoi.fcnc_ana.Sample.<version>

optional arguments:
  -h, --help         show this help message and exit
  --dry-run DRY_RUN  Run without creating new Rucio container
```
## Acknowledgements
Support for this work was provided by the U.S. Department of Energy, Office of High Energy Physics under Grant No. DE-SC0007890.