<p align="center">
<img src="./logo.png" />
</p>
# refyre: Large scale file management energized by AI
[![PyPI Version](https://img.shields.io/pypi/v/refyre.svg)](https://pypi.python.org/pypi/refyre)
___
Refyre is an AI-fused Python package that provides two high-level features:
- Easy large scale filesystem manipulations
- Efficient, code-less directory structuring and restructuring
Use it alongside your favorite Python packages such as Pandas, NumPy, Spark, and other data manipulation tools to quickly structure scattered data.
## Features
- Filesystem agnostic data handshakes
- Kickstart loading entire repositories & setting up virtual environments in a single command, your way
- Perform mass operations on files such as copying, moving, zipping, POST-ing, in 1 line of code
- Homebrew structured data such as Pandas DataFrames and image datasets in a snap of your fingers (< 30 lines)
- Refactor, organize, and analyze periodic research experiments with zero lines of code
## Quickstart
Simply provide refyre with an "input specification" that tells it which directories to focus on:
`sample_input_spec.txt`
```python
'''
Suppose you have a directory structure

a/
    a1.txt
    a2.txt
    ...
b/
    c/
        c1.txt
        c2.txt
    d.txt
    d2.txt
    ...

You seek to analyze the a files and the c files
'''
[dir="a"|name="a_var"]
[dir="b"]
    [dir="c"|pattern="g!c?.txt"|name="c_var"] #Glob patterns start with 'g!', regex with 'r!'; plain pattern matching needs no prefix
```
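To get a feel for which files a glob such as `c?.txt` selects, the matching can be tried with Python's standard `fnmatch` module. This is only an illustration of glob semantics on the file names from the example tree; it is not refyre's own matching code, and refyre's behaviour may differ in edge cases.

```python
from fnmatch import fnmatch

# File names taken from the example directory tree (contents of b/ and b/c/).
names = ["c1.txt", "c2.txt", "d.txt", "d2.txt"]

# '?' matches exactly one character, so only "c" + one character + ".txt" passes.
matched = [name for name in names if fnmatch(name, "c?.txt")]
print(matched)  # ['c1.txt', 'c2.txt']
```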
Have refyre analyze the directory with the following:
```python
from refyre import Refyre
from refyre.cluster import FileCluster #Module path taken from the type() output below

#Main analysis line
ref = Refyre(input_specs = ['sample_input_spec.txt'])

#Now, have a bit of fun!
a_var = ref["a_var"]
c_var = ref["c_var"]

print(len(a_var)) #Number of files

#Move all the files to another directory; copy works the same way
a_var = a_var.move('dir2') #.copy() ...

#Get all the files as a list of pathlib.Path objects
all_a_var = a_var.vals()

#Automatically zip a copy of all the files
zipped_c_var = c_var.zip()

print(len(zipped_c_var)) #1, the zipped c_var files

#Get all the parent dirs
c_var_parent_dirs = c_var.dirs()

print(type(c_var)) #refyre.cluster.FileCluster (this is what each variable type is)

#Do mass file management operations such as delete(), filter()
all_a_var_and_c_vars = FileCluster(values = []) #Values are strings of filepaths you want to do operations on
all_a_var_and_c_vars = a_var + c_var

filtered_c = all_a_var_and_c_vars.filter(lambda p : p.name.startswith('c'))

#Delete all files
filtered_c.delete()

#Automatically account for any modifications by variables
print(len(all_a_var_and_c_vars))
```
And finally, after any analysis, you can use the variables to generate specs.
Let's say you want to generate directories & data in the format specified by `output_spec.txt`:
```python
'''
Sample output spec, creates
directories d & e, and ports the data
from a_var and c_var into it.
'''
[dir="d"|name="a_var"]
[dir="e"|name="c_var"]
```
One line.
```python
ref.create_spec('output_spec.txt')
```
Alternatively, this entire process (minus the in-between analysis) can be done through our CLI.
```shell
refyre -i input.txt -o output.txt
```
## Microdocs
Let's provide a quick overview of the various capabilities refyre packs.
### Spec Attributes:
Specs, as shown above, are a Pythonic way for you to feed information to refyre. Each `[]` represents a *cluster*, which usually has a `dir` attribute specified. Attributes are separated using the `|` separator.
As shown above, comments can be written the same way as in Python. Back to the various attributes:
- **dir**: Specifies the directory the cluster is targeting. *Usually, these are relative paths*.
    - You can use any of the three pattern types to target multiple directories
- **pattern**: Allows you to target specific files by specifying a template pattern. Currently, glob, regex, and "generator expressions" are supported.
    - For glob patterns, add a `g!` before the pattern; ex: `g!*.txt`
    - For regex patterns, add an `r!` before the pattern; ex: `r!.txt`
    - Generator expressions are a simplified form of pattern matching that is easier to control by hand
        - Just one template character: `$` matches a number
        - Of the three pattern types, generator expressions have the most support in refyre
- **name**: The workhorse of specs. Arguably the **meat** of the spec. Assigns all the values to a variable specified by **name**. *Only a single variable per cluster is supported*
    - You can achieve 'appending' by specifying a `+` before the name
- **flags**: A grab bag of various tricks you can use. You can specify as many as you want, and they work together to bring out cool cluster behaviours
    - `*m` makes a directory if it doesn't exist in a read spec
    - `*d` (only during generation) deletes everything in the *current* directory except for the clusters specified
    - `*da` deletes everything in the *current & all subclusters* except for clusters listed
    - `*f` gets all the files listed in the current directory
    - `*d` gets all the directories listed in the current directory
    - `*r` allows `*f` and `*d` to behave recursively (i.e., get all files from subdirectories, etc.)
    - `*s` enables *step generation*. Each time `refyre.step()` is called, the next directory in the pattern is generated (ex: if `dir="test$"` and `*s`, `test1` would be generated on the first `.step()`, then `test2`, ...)
    - `*c` enables code analysis. If you seek to import a directory / repository you recently cloned, you can specify the `*c` flag and then import it in your code
    - `type` & `link` are used for specific behaviours, most commonly *git cloning*. Automatically clone in repos by specifying `type="git"` and the link to your git repository.
- **mode**: Can either be set to `cut` or `copy`. During generation, the variable files will either be cut or copied to their respective place.
- **limit**: Limits the number of results targeted, or directories generated
- **serialize**: Specifies a generator expression used to rename all the files into a consistent format

These are all the basic attributes you can use; they cover ~80% of refyre's inner power. The other 20% are pretty obscure and aren't that useful normally.
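To make the spec syntax concrete, here is a minimal sketch of how a single cluster line could be split into its attributes. The helpers `parse_cluster` and `pattern_kind` are hypothetical names written for this illustration and are not part of refyre; the sketch assumes values never contain `|` or `#`, which a real regex pattern might.

```python
def parse_cluster(line: str) -> dict:
    """Split one '[key="value"|key="value"]' line into a dict (illustrative only)."""
    body = line.split("#", 1)[0].strip()  # drop a trailing comment, if any
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError(f"not a cluster line: {line!r}")
    attrs = {}
    for part in body[1:-1].split("|"):  # attributes are separated by '|'
        key, _, value = part.partition("=")
        attrs[key.strip()] = value.strip().strip('"')
    return attrs


def pattern_kind(pattern: str) -> tuple:
    """Classify a pattern by its prefix: 'g!' is glob, 'r!' is regex, anything else a generator expression."""
    if pattern.startswith("g!"):
        return ("glob", pattern[2:])
    if pattern.startswith("r!"):
        return ("regex", pattern[2:])
    return ("generator", pattern)


attrs = parse_cluster('[dir="c"|pattern="g!c?.txt"|name="c_var"] # a comment')
print(attrs)
print(pattern_kind(attrs["pattern"]))
```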
### FileClusters (Variables)
Variables are the backbone of refyre. The clusters provide an *avenue* for the variables to easily target the data without you having to write any code. However, specs aren't the only way to access a variable's powers. The docs below, again, cover the most useful abilities of these variables.
`FileCluster(values = [], dirs = [], patterns = [], as_pathlib = False)`
  - `values`: string filepaths, or `Path` objects, depending on whether `as_pathlib` is false or true.
  - `patterns`: corresponds to the dirs; lists the patterns you want to target
FileClusters are strongly rooted in *object oriented operations*, meaning each operation returns another FileCluster, so you can keep chaining FileCluster operations. To get out of FileClusters, you can use the following options:
  - `.vals()`: Returns a list of Path objects
  - `.item()`: Returns the first Path object
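The chaining idea can be sketched with a toy class. `MiniCluster` below is a hypothetical stand-in written only for this illustration; it mimics the "every operation returns a new cluster" design with plain `pathlib` and does not touch the filesystem or reflect refyre's internals.

```python
from pathlib import Path


class MiniCluster:
    """Toy stand-in for a FileCluster; not refyre's implementation."""

    def __init__(self, values):
        self._values = [Path(v) for v in values]

    def filter(self, func):
        # Returns a new cluster, so calls can be chained.
        return MiniCluster(p for p in self._values if func(p))

    def map(self, func):
        return MiniCluster(func(p) for p in self._values)

    def vals(self):
        # Leave the cluster world: a plain list of Path objects.
        return list(self._values)

    def item(self):
        # Leave the cluster world: only the first Path object.
        return self._values[0]

    def __len__(self):
        return len(self._values)


cluster = MiniCluster(["a/a1.txt", "a/a2.txt", "b/c/c1.txt"])
renamed = (
    cluster
    .filter(lambda p: p.name.startswith("a"))
    .map(lambda p: p.with_suffix(".md"))
)
print(len(renamed), [p.name for p in renamed.vals()])
```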
Using this basic constructor, you can perform some easy operations:
- `.move(target_dir)`
- `.copy(target_dir)`
- `.filter(filter_func)`
- `.map(map_func)`
- `.zip()`
- `.delete()`
- `.post(url, additional_data, payload_name)`
- `.filesize()`
- `.clone()`
You can also do other operations between FileClusters:
- `+` (Returns the sum of the contents of two FileClusters)
- `-` (Returns the contents of the current FileCluster, minus anything that is also in the other FileCluster)
- `&` (Intersection operator)
- `|` (Union operator)
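The four operators can be pictured with plain lists of paths. The sketch below is an assumption-laden illustration, not refyre code: in particular, it treats `+` as simple concatenation that keeps duplicates, which the docs above do not confirm.

```python
from pathlib import Path

a = [Path("a/a1.txt"), Path("a/a2.txt"), Path("shared.txt")]
c = [Path("b/c/c1.txt"), Path("shared.txt")]

# '+' : everything from both sides (assumed here to keep duplicates)
summed = a + c

# '-' : what is in `a` but not in `c`
difference = [p for p in a if p not in c]

# '&' : what is in both
intersection = [p for p in a if p in c]

# '|' : everything from both sides, each path once, order preserved
union = list(dict.fromkeys(a + c))

print(len(summed), len(difference), len(intersection), len(union))  # 5 2 1 4
```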
### The Refyre Object
These docs are running too long already, so I will try to keep this as short as possible.
- `Refyre(input_specs = [], output_specs = [])`
    - Instantiates a Refyre object from the given specs
- `add_spec(spec_path, track = False)`
    - Adds a spec for refyre *reading*. If track is set to true, it can later be reused for step generation.
- `create_spec(spec_path, track = False)`
    - Creates a spec. If track is set to true, it can later be reused for step generation.
- `step()`
    - For every spec with a `*s` flag, generates the next directory in the pattern it specifies
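Step generation can be pictured as a counter substituted into the `$` placeholder. The `StepGenerator` class below is a hypothetical sketch of that idea using only the standard library; it is not how refyre implements `step()`, and it assumes numbering starts at 1 as in the `test1`, `test2` example.

```python
import tempfile
from pathlib import Path


class StepGenerator:
    """Creates the next directory for a template such as 'test$' on each step() call."""

    def __init__(self, root, template):
        self.root = Path(root)
        self.template = template
        self.counter = 0

    def step(self):
        # Substitute the running counter for '$' and create that directory.
        self.counter += 1
        target = self.root / self.template.replace("$", str(self.counter))
        target.mkdir(parents=True, exist_ok=True)
        return target


with tempfile.TemporaryDirectory() as tmp:
    gen = StepGenerator(tmp, "test$")
    created = [gen.step().name for _ in range(3)]

print(created)  # ['test1', 'test2', 'test3']
```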
Accessing variables can be done using the `[]` notation. Use it to get and attach variables to a `Refyre` object.
**Congratulations, you know everything to be a refyre expert!**
### Misc Docs
#### DataStack
Let's say you want to brew a dataset & structure data of your own. refyre allows you to combine the power of variables with the DataStack, processing them to create constructs such as Pandas DataFrames.
The process is twofold: (1) *secure your variables*, and then (2) *run them through the DataStack*. The DataStack itself is a *processor*, taking in a bunch of variables and producing a structured dataset from them.
Your job with the DataStack will be to figure out how you can convert the variables to the dataset format you want.
Consider the PandasStack (a DataStack). Here, your job is to figure out how you can convert each row of variables into a *DataFrame row*:
```python
from refyre import Refyre
from refyre.cluster import AssociationCluster #Assumed import path; the original snippet omits this import
from refyre.datastack import PandasStack
from PIL import Image
import pandas as pd

ref = Refyre(input_specs = ['specs/in.txt'])

#We will do some pandas visualizations on the input data
stack = PandasStack(AssociationCluster(input_vars = [ref["images"]]))

def processor(fp):
    print('processing', fp)
    im = Image.open(fp).convert('RGB')
    width, height = im.size

    #Sum each colour channel over every pixel, then divide by the pixel count
    ar, ag, ab = 0.0, 0.0, 0.0
    for i in range(width):
        for j in range(height):
            r, g, b = im.getpixel((i, j))
            ar, ag, ab = ar + r, ag + g, ab + b

    ar, ag, ab = ar / (width * height), ag / (width * height), ab / (width * height)
    return (fp.name, width, height, ar, ag, ab)

df = stack.create_dataframe(['image_name', 'image_width', 'image_height', 'average_red', 'average_green', 'average_blue'], processor)
```
As you can see, the majority of the work here comes from building a *processor* method to convert each row of variables into a DataFrame row.
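The processor pattern itself does not depend on images. Here is a minimal standard-library sketch of the same idea: a function turns each file into one tuple, and the tuples are paired with column names to form rows. The column names and the file contents are made up for the illustration, and the sketch does not use `PandasStack`.

```python
import tempfile
from pathlib import Path

columns = ["file_name", "size_bytes", "line_count"]


def processor(fp: Path):
    # One file in, one row (tuple) out; the order matches `columns`.
    text = fp.read_text()
    return (fp.name, fp.stat().st_size, len(text.splitlines()))


with tempfile.TemporaryDirectory() as tmp:
    # Two small throwaway files standing in for the variable's contents.
    for name, body in [("a1.txt", b"one\ntwo\n"), ("a2.txt", b"three\n")]:
        (Path(tmp) / name).write_bytes(body)
    rows = [processor(fp) for fp in sorted(Path(tmp).glob("*.txt"))]

# Pair every row with the column names to get table-like records.
records = [dict(zip(columns, row)) for row in rows]
print(records)
```

A list of records like this can be passed to `pandas.DataFrame(records)` if you want an actual DataFrame.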
Raw data
{
"_id": null,
"home_page": "https://github.com/flockfysh/refyre",
"name": "refyre",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "",
"keywords": "files,manipulation,data science",
"author": "Ansh",
"author_email": "eye.am.ansh@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/cb/fd/1eeeafd98c5510fc8765e5c997164db4839f24058ff13118b755e8878191/refyre-0.0.1.5.9.tar.gz",
"platform": null,
"description": "<p align=\"center\">\n <img src=\"./logo.png\" />\n</p>\n\n# refyre: Large scale file management energized by AI\n[![PyPI Version](https://img.shields.io/pypi/v/refyre.svg)](https://pypi.python.org/pypi/refyre)\n___ \n\nRefyre is an AI-fused Python package that provides two high level features:\n- Easy large scale filesystem manipulations\n- Efficient, code-less directory structuring and restructuring\n\nEnhance your favorite Python packages such as Pandas, NumPy, Spark, and other data manipulation tools to quickly structure scattered data.\n\n## Features\n- Filesystem agnostic data handshakes \n- Kickstart loading entire repositories & setting up virtual environments in a single command, your way\n- Perform mass operations on files such as copying, moving, zipping, POST-ing, in 1 line of code\n- Homebrew structured data such as Pandas DataFrames, and image datasets in a snap of your fingers (< 30 lines)\n- Refactor, organize, and analyze periodic research experiments with zero lines of code\n\n## Quickstart\nSimply provide refyre with an \"input specification\", telling it what directories to focus on\n\n`sample_input_spec.txt`\n```python\n'''\nSuppose you have a directory structure\n\na/\n a1.txt\n a2.txt\n ...\nb/\n c/\n c1.txt\n c2.txt\n d.txt\n d2.txt\n ...\n\nYou seek to analyze the a files and the c files\n'''\n[dir=\"a\"|name=\"a_var\"]\n[dir=\"b\"]\n [dir=\"c\"|pattern=\"g!c?.txt\"|name=\"c_var\"] #Glob patterns start with 'g!', regex with 'r!', no need for just normal pattern matching\n```\n\nHave refyre analyze the directory with the following:\n```python\n#Main analysis line\nref = Refyre(input_specs = ['sample_input_spec.txt'])\n\n#Now, have a bit of fun!\na_var = ref[\"a_var\"]\nc_var = ref[\"c_var\"]\n\nprint(len(a_var)) #Number of files\n\n#Move all the files to another directory, copy works the same way\na_var = a_var.move('dir2') #.copy() ...\n\n#Get all the files in a List[Pathlib.Path] objects\nall_a_var = 
a_var.vals()\n\n#Automatically zip a copy of all the files \nzipped_c_var = c_var.zip()\n\nprint(len(zipped_c_var)) #1, the zipped c_var files\n\n#Get all the parents dirs\nc_var_parent_dirs = c_var.dirs()\n\nprint(type(c_var)) #refyre.cluster.FileCluster (this is what each variable type is)\n\n#Do mass file management operations such as delete(), filter()\nall_a_var_and_c_vars = FileCluster(values = []) #Values are strings of filepaths you want to do operations on\nall_a_var_and_c_vars = a_var + c_var\n\nfiltered_c = all_a_var_and_c_vars.filter(lambda p : p.name.startswith('c'))\n\n#Delete all files\nfiltered_c.delete()\n\n#Automatically account for any modifications by variables\nprint(len(all_a_var_and_c_vars))\n\n```\n\nAnd finally, after any analysis, you can use the variables to generate\nspecs\n\nLet's say you want to generate directories & data in the format specified by `output_spec.txt`:\n\n```python\n'''\nSample output spec, creates\ndirectories d & e, and ports the data\nfrom a_var and c_var into it.\n'''\n[dir=\"d\"|name=\"a_var\"]\n[dir=\"e\"|name=\"c_var\"]\n```\n\nOne line.\n```python\nref.create_spec('output_spec.txt')\n```\n\nAlternatively, this entire process (minus the in-between analysis) can be done through our CLI.\n```python\nrefyre -i input.txt -o output.txt\n```\n\n## Microdocs\n\nLet's provide a quick overview of the various capabilities refyre packs.\n\n### Spec Attributes:\n\nSpecs, as shown above are a Pythonic way for you to feed information to refyre. Each [] represents a *cluster*, which usually has a dir attribute specified. Attributes are seperated using the '|' seperator. \n\nAs shown above, Pythonic comments can be used in a similar fashion to Python. Back to the various attributes:\n\n- **dir**: Specifies the directory the cluster is targeting. 
*Usually, the clusters are relative paths*.\n - You can specify the three pattern types to target multiple directories\n- **pattern**: Allows you to target specific files by specifying a template pattern. Currently, glob, regex, and \"generator expressions\" are supported.\n - For glob patterns, add a 'g!' before the pattern; ex: `g!*.txt`\n - For regex patterns, add an 'r!' before the pattern ex: `r!.txt`\n - Generator expressions a simplified pattern matching, that's more humanly controllable\n - Just one template matching --> `$` matches to a number\n - refyre supports generator expressions the most out of the three\n\n- **name**: The workhorse of specs. Arguably the **meat** of the spec. Assigns all the values to a variable specified by **name**. *Only single variable / cluster are supported*\n - You can achieve 'appending' by specifying a `+` before the name\n- **flags**: A grab bag of various tricks you can use. You can specify as many as you want, and they work together to bring out cool cluster behaviours\n - `*m` makes a directory if it doesn't exist in a read spec\n - `*d` (only during generation) deletes everything in the *current* directory except for the clusters specified\n - `*da` deletes everything in the *current & all subclusters* except for clusters listed\n - `*f` gets all the files listed in the current directory\n - `*d` gets all the directories listed in the current directory\n - `*r` allows `*f` and `*d` to behave recursively (i.e, get all files from subdirectories, etc.)\n - `*s` enables *step generation*, Each time `refyre.step()` is called, the next directory in the pattern is generated. (ex:, if `dir=\"test$\"` and `*s`, `test1` would be generated on first `.step()`, `test2`, ...)\n - `*c` enables code analysis. If you seek to import a directory / repository you recently cloned, you can specify the `*c` flag and then import it in your code\n - `type & link` are used for specific behaviours, most commonly *git cloning*. 
Automatically clone in repos by specifying `type=\"git\"` and the link to your git.\n- **mode**: Can either be set to `cut` or `copy`. During generation, the variable files will either be cut or copied to their respective place.\n- **limit**: Limits the number of results targeted, or directories generated\n- **serialize** specify a generator expression to rename all the files into a consistent format\n\nThese are all the basic quantifiers you can use, they cover ~80% of refyre's inner power. The other 20% are pretty obscure and aren't that useful normally.\n\n### FileClusters (Variables)\n\nVariables are the backbone of refyre. The clusters provide an *avenue* for the variables to easily target the data without worrying about writing any code. However, they aren't the only way to access variable's powers. The docs below, again, specify the most useful abilities for these variables.\n\n`FileCluster(values = [], dirs = [], patterns = [], as_pathlib = False,)`\n - `values`: string filepaths, or `Path` objects depending on whether `as_pathlib` is true or false.\n - `patterns`: corresponds to the dirs, lists what patterns you want to target\n\nFileClusters are strongly rooted in *object oriented operations*, meaning each operation returns another FileCluster, so you can continue channeling FileCluster capabilities. 
To get out of FileClusters, you can use the following options:\n - `.vals()`: Returns a list of Path objects\n - `.item()`: Returns the first Path object\n\nUsing this basic constructor, you can make some easy operations:\n - `.move(target_dir)`\n - `.copy(target_dir)`\n - `.filter(filter_func)`\n - `.map(map_func)`\n - `.zip()`\n - `.delete()`\n - `.post(url, additional_data, payload_name)` \n - `.filesize()`\n - `.clone()`\n\nYou can also do other operations between FileClusters\n - `+` (Returns the sum of the contents of two FileClusters) \n - `-` (Returns the contents in the current FileCluster while removing all other contents that are also in the other FileCluster)\n - `&` (Intersection operator)\n - `|` (Union operator)\n\n### The Refyre Object\n\nThese docs are running too long already, I will try to keep this as short as possible.\n\n- `Refyre(input_specs = [], output_specs = [])`\n - Instantiates a refyre spec\n\n- `add_spec(spec_path, track = False)`\n - Adds a spec for refyre *reading*. If track is set to true, it can\n later be reused for step generation.\n\n- `create_spec(spec_path, track = False)`\n - Creates a spec. If track is set to true, it can later be reused for step \n generation.\n\n- `step()`\n - Any specs with a `*s` attribute have the next directory in the patterns they specify generated\n\nAccessing variables can be done using the `[]` notation. Use it to get and attach variables to a `Refyre` object.\n\n**Congratulations, you know everything to be a refyre expert!**\n\n### Misc Docs\n\n#### DataStack\nLet's say you want to brew a dataset & structure data of you're own. refyre allows you to combine the power of variables with the DataStack, processing them to create constructs such as Pandas Dataframes.\n\nThe process is twofold - (1) *secure your variables*, and then (2) *run them through the DataStack*. 
The DataStack itself is a *processor*, taking in a bunch of variables, and producing the variables.\n\nYour job with the DataStack will be to figure out how can you convert the variables to the dataset format you want.\n\nConsider the PandasStack (a DataStack). Here, your job is to figure out how you can convert each row of variables into a *DataFrame column*\n\n```python\n\nfrom refyre import Refyre\nfrom refyre.datastack import PandasStack\nfrom PIL import Image\nimport pandas as pd\n\nref = Refyre(input_specs = ['specs/in.txt'])\n\n#We will do some pandas visualizations on the input data\nstack = PandasStack(AssociationCluster(input_vars = [ref[\"images\"]]))\n\ndef processor(fp):\n print('processing', fp)\n im = Image.open(fp).convert('RGB')\n width, height = im.size \n\n ar, ag, ab = 0.0, 0.0, 0.0\n for i in range(width):\n for j in range(height):\n r, g, b = im.getpixel((i, j))\n ar, ag, ab = ar + r, ag + g, ab + b \n \n ar, ag, ab = ar / (width * height), ag / (width * height), ab / (width * height)\n return (fp.name, width, height, ar, ag, ab)\n\ndf = stack.create_dataframe(['image_name', 'image_width', 'image_height', 'average_red', 'average_green', 'average_blue'], processor)\n```\n\nAs you can see, the majority of the work here comes from building a *processor* method to convert each row of variables into a DataFrame row.\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "Filesystem dominance is all you need.",
"version": "0.0.1.5.9",
"project_urls": {
"Homepage": "https://github.com/flockfysh/refyre",
"Repository": "https://github.com/flockfysh/refyre"
},
"split_keywords": [
"files",
"manipulation",
"data science"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "99271ca5a601cd742b1704681369e1a9136ce5ff59450b2e2dc5ede050f9e352",
"md5": "9aec952ec431d289a67100aa285dc269",
"sha256": "8bd5df9494d380d6abaa33d4059ebb6448bae4854a9cb91cca3ebd79faafeed7"
},
"downloads": -1,
"filename": "refyre-0.0.1.5.9-py3-none-any.whl",
"has_sig": false,
"md5_digest": "9aec952ec431d289a67100aa285dc269",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 40443,
"upload_time": "2024-01-15T04:46:32",
"upload_time_iso_8601": "2024-01-15T04:46:32.147785Z",
"url": "https://files.pythonhosted.org/packages/99/27/1ca5a601cd742b1704681369e1a9136ce5ff59450b2e2dc5ede050f9e352/refyre-0.0.1.5.9-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "cbfd1eeeafd98c5510fc8765e5c997164db4839f24058ff13118b755e8878191",
"md5": "922c5930e81c3fd0253fdb1fe1f42e8f",
"sha256": "2cf7948dd3028592da94c1c7e0edfcbc3db7fda6e4b59540c0986e17eb27ccae"
},
"downloads": -1,
"filename": "refyre-0.0.1.5.9.tar.gz",
"has_sig": false,
"md5_digest": "922c5930e81c3fd0253fdb1fe1f42e8f",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 34584,
"upload_time": "2024-01-15T04:46:33",
"upload_time_iso_8601": "2024-01-15T04:46:33.973131Z",
"url": "https://files.pythonhosted.org/packages/cb/fd/1eeeafd98c5510fc8765e5c997164db4839f24058ff13118b755e8878191/refyre-0.0.1.5.9.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-01-15 04:46:33",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "flockfysh",
"github_project": "refyre",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [],
"lcname": "refyre"
}