# dftxt
A Python library for a simple DataFrame text file format that makes it easier to
specify Pandas and Polars DataFrames in a reliable, human-readable text format, for
use in testing and wherever source data is small and human managed. Ultimately,
the goal of this project is to make a DataFrame transformation function test as easy
as:
```python
import pandas.testing

import dftxt


def test_my_transformation():
    """Should transform source DataFrame into the expected output."""
    data_frames = dftxt.read_all_to_pandas("./test_data.dftxt")
    observed = my_transformation(data_frames.source)
    pandas.testing.assert_frame_equal(observed, data_frames.expected)
```
by allowing one to express the attributes, structure and data that constitute a
DataFrame within a text file format and avoid having to post-process loaded data.
Here's an example showing what the basic dftxt format looks like:
```
Name      Planet      Numeral  Mean Radius (km)  Discovery Year  Discoverer
          &dtype=cat           &dtype=float      &dtype=Int
Moon      Earth       I        1738              None            None
Phobos    Mars        I        11.267            1877            Hall
Deimos    Mars        II       6.2               1877            Hall
Io        Jupiter     I        1821              1610            Galileo
Europa    Jupiter     II       1560              1610            Galileo
Ganymede  Jupiter     III      2634              1610            Galileo
Callisto  Jupiter     IV       2410              1610            Galileo
Amalthea  Jupiter     V        83.5              1892            Barnard
Himalia   Jupiter     VI       69.8              1904            Perrine
Mimas     Saturn      I        198.2             1789            Herschel
```
This is a fixed-width file format that uses 2+ spaces between column names to
define the width of each column.
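As a rough illustration of the idea (not dftxt's actual parser), the column boundaries can be recovered from the header line alone by treating any run of 2+ spaces as a separator. The `column_spans` and `slice_row` helpers below are hypothetical names used only for this sketch, which ignores quoting, modifiers, and wrapped values.

```python
import re


def column_spans(header: str) -> list[tuple[str, int]]:
    """Return (column name, start offset) pairs from a header line."""
    # A column name is a run of non-space tokens joined by single spaces,
    # so any gap of 2+ spaces ends the current name.
    return [(m.group(0), m.start()) for m in re.finditer(r"\S+(?: \S+)*", header)]


def slice_row(row: str, spans: list[tuple[str, int]]) -> list[str]:
    """Cut a data row at the header's column offsets and trim the padding."""
    starts = [start for _, start in spans]
    ends = starts[1:] + [None]
    return [row[start:end].strip() for start, end in zip(starts, ends)]


# Build the sample lines with ljust so the offsets are explicit.
header = "Name".ljust(10) + "Planet".ljust(9) + "Mean Radius (km)"
row = "Phobos".ljust(10) + "Mars".ljust(9) + "11.267"

spans = column_spans(header)
values = slice_row(row, spans)
print(spans)
print(values)
```

Note that `Mean Radius (km)` survives as a single column name because its words are separated by single spaces only.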
## Quick Start
One of the best ways to learn the dftxt format is to create a Pandas or Polars DataFrame,
save it to a file or string, and see what the output looks like.
```python
import dftxt
import polars as pl

df = pl.DataFrame([
    {"character": "Jay Gatsby", "book": "The Great Gatsby", "year": 1925},
    {"character": "Clarissa Dalloway", "book": "Mrs. Dalloway", "year": 1925},
    {"character": "Toad", "book": "The Wind & the Willow", "year": 1906},
])

# Could write to a file if you prefer:
# dftxt.write("./example.dftxt", df)
# But here we'll just print the serialized string:
print(dftxt.writes(df))
```
which would print out:
```
character          book                   year
                                          &dtype=int
Jay Gatsby         The Great Gatsby       1925
Clarissa Dalloway  Mrs. Dalloway          1925
Toad               The Wind & the Willow  1906
```
It's also possible to embed dftxt into Markdown files with fenced code blocks that use
the `df` or `dftxt` type signifier. Multiple fenced code blocks will be collectively
extracted into the loaded DataFrame, which makes inline commenting of blocks quite
useful.
For examples of what that looks like see:
- [Markdown with dftxt Example](./dftxt/tests/_io/_markdown/scenarios/multiple_frames/source.md)
- [Single DataFrame broken out across multiple blocks](./dftxt/tests/_io/_markdown/scenarios/single_frame/source.md)
## Benefits
The benefits of the dftxt DataFrame serialization format include:
### 1. Preserves DataFrame Structure
Most importantly, this format retains the information needed to reload the DataFrame
exactly as it was specified in the file. This includes data types, column
ordering, and indexing (Pandas only, as Polars has no index). In testing, it should be
possible to use `(pandas|polars).testing.assert_frame_equal()` on a DataFrame loaded
from a file without any further transformation.
For example,
```
sku         price_usd       originally_released_on  product_name
&int_index  &dtype=decimal  &dtype=date
109456      119.99          2023-07-09              Fancy Socks
450213      24.49           2020-11-12              Simple Socks
90210       299.99          1998-03-28              LA Heartthrob Socks
```
In this example the:
- `sku` column will be loaded as the DataFrame's index (Pandas only) and the values in
the column treated as integers.
- `price_usd` column will be loaded as `decimal.Decimal` or `polars.Decimal`
depending on the type of DataFrame.
- `originally_released_on` column will be loaded as `datetime.date` values.
- `product_name` values will be loaded as strings.
The order of the rows and columns is preserved when loaded as well. All of this avoids
having to post-process the DataFrame to achieve this configuration, as one would have
to do with other formats, e.g. CSV.
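For contrast, here is a minimal standard-library sketch of the post-processing that a CSV version of the same data would need. It does not use dftxt at all. The CSV text and the conversion code are illustrative assumptions, included only to show the kind of work the modifiers above make unnecessary.

```python
import csv
import datetime
import decimal
import io

# The same socks data expressed as CSV, which carries no type information.
CSV_TEXT = """sku,price_usd,originally_released_on,product_name
109456,119.99,2023-07-09,Fancy Socks
450213,24.49,2020-11-12,Simple Socks
"""

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# Every value comes back as a string, so each column has to be converted by hand.
typed = [
    {
        "sku": int(row["sku"]),
        "price_usd": decimal.Decimal(row["price_usd"]),
        "originally_released_on": datetime.date.fromisoformat(
            row["originally_released_on"]
        ),
        "product_name": row["product_name"],
    }
    for row in rows
]

print(rows[0])
print(typed[0])
```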
### 2. Human Friendly
The format is easy for humans to read and modify and requires few, if any,
machine-oriented characters. Whitespace is used as the delimiter - specifically
2+ spaces between column names - which also serves to align columns for easy
readability.
#### Quoting
Quoting and escaping are rarely needed as a result. Here's an example
showing three string columns, which require no quoting:
```
Name           Birth Month  Favorite Movie
Jane Doe       February     Back to the Future
John Doe       April        Harry Potter & the Goblet of Fire
Anna Johnson   November     Frozen
Steve Simpson  June         Avengers: Infinity War
```
Spaces in the column names and in the data values are not an issue for the format
because it is the 2+ spaces between columns that identify the columns.
Quoting is needed in the following cases:
1. The column name contains 2+ spaces, e.g. `"Hello  World"`.
2. The column name or value starts or ends with a space, e.g. `" foo "`.
3. The column name or value ends with a backslash.
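The three cases above can be sketched as a small predicate. This is an illustration of the rules as stated here, not code taken from dftxt. The `needs_quoting` name is hypothetical, and applying the 2+ spaces rule to values as well as column names is an assumption of this sketch.

```python
def needs_quoting(text: str) -> bool:
    """Return True if a column name or value falls under one of the quoting cases."""
    contains_column_gap = "  " in text  # 2+ consecutive spaces
    has_edge_space = text != text.strip(" ")  # leading or trailing space
    ends_with_backslash = text.endswith("\\")  # would look like a wrapped line
    return contains_column_gap or has_edge_space or ends_with_backslash


for sample in ["Back to the Future", "Hello  World", " foo ", "C:\\temp\\"]:
    print(repr(sample), needs_quoting(sample))
```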
#### Column Wrapping
Additionally, long values can be broken up over multiple lines on a per-column basis
by ending each wrapped line with a backslash, as in Python. This makes it possible to
keep long strings from bloating the width of the data and hurting readability. For example,
```
Play            Quotation                                    Act    Scene
                                                             &&int  &&int
Hamlet          To be, or not to be: that is the question    3      1

As You Like It  All the world's a stage, and all the men \   2      7
                and women merely players. They have their \
                exits and their entrances; And one man in \
                his time plays many parts.

Romeo & Juliet  Romeo, Romeo! Wherefore art thou Romeo?      2      2

Richard III     Now is the winter of our discontent          1      1

Macbeth         Is this a dagger which I see before me, \    2      1
                the handle toward my hand?
```
Note that the blank lines between rows here are optional and were included to emphasize
the wrapped lines in the `Quotation` column for the second and last rows.
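To make the wrapping rule concrete, here is a small sketch that rebuilds one wrapped value from its per-line fragments. It is not dftxt's implementation. The `join_wrapped` helper is a hypothetical name, and joining the fragments with a single space is an assumption of this sketch rather than documented behavior.

```python
def join_wrapped(fragments: list[str]) -> str:
    """Combine the fragments of one wrapped cell into a single value."""
    parts = []
    for fragment in fragments:
        text = fragment.strip()
        if text.endswith("\\"):
            # Drop the continuation marker and the padding before it.
            text = text[:-1].rstrip()
        parts.append(text)
    # Assumption: fragments are rejoined with one space between them.
    return " ".join(parts)


quotation = join_wrapped([
    "Is this a dagger which I see before me, \\",
    "the handle toward my hand?",
])
print(quotation)
```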
#### Comments
Importantly, this format also supports comments using the standard Python line
comment: a line, at any indentation, that begins with a `#` sign. Note that inline
comments are not supported because `#` is common enough in data values that
it would cause confusion. A commented dftxt file looks something like this:
```
# This is a heavily commented example showing that comments can be used throughout a
# dftxt file. Very useful for documenting data for tests with the nuances and reasoning
# behind the test data.
# The data is sourced from:
# https://www.irs.gov/individuals/international-taxpayers/yearly-average-currency-exchange-rates
Country         Currency  2023-01-01   2022-01-01   2021-01-01   2020-01-01   2019-01-01
                          # We're using Decimals here because of currency accuracy needs.
                          &&decimal    &&decimal    &&decimal    &&decimal    &&decimal
                          # We want the column names to be loaded in as dates.
                          &ntype=date  &ntype=date  &ntype=date  &ntype=date  &ntype=date
                          # The global pandemic strained the Argentine economy,
                          # which led to soaring inflation that remained going
                          # into 2024.
Argentina       Peso      296.154      130.792      95.098       70.635       48.192
                          # Despite economic issues of its own, Brazil did not
                          # see the same inflationary pressures of its neighbor.
Brazil          Real      4.994        5.165        5.395        5.151        3.946

Canada          Dollar    1.350        1.301        1.254        1.341        1.327
Cayman Islands  Dollar    0.833        0.833        0.833        0.833        0.833

Australia       Dollar    1.506        1.442        1.332        1.452        1.439
China           Yuan      7.075        6.730        6.452        6.900        6.910

Euro Zone       Euro      0.924        0.951        0.846        0.877        0.893

# The values here are yearly average currency exchange rates converting into USD.
```
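The comment rule is simple enough to sketch in a few lines: a line is a comment only when its first non-space character is `#`. The helper below is a hypothetical illustration, not part of the dftxt API, and the sample data is made up to show that a `#` inside a value is left alone.

```python
def strip_comment_lines(text: str) -> list[str]:
    """Drop whole-line comments, at any indentation, and keep every other line."""
    return [
        line
        for line in text.splitlines()
        if not line.lstrip().startswith("#")
    ]


sample = "\n".join([
    "# Inventory used by the pricing tests.",
    "Item    Label",
    "        # Labels may contain a hash sign.",
    "Widget  Bin #42",
])

kept = strip_comment_lines(sample)
print(kept)
```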
#### Embedded in Markdown
Markdown is a fairly ubiquitous way to create human-readable documentation that also
renders nicely in IDEs and code collaboration tools. As such, dftxt supports embedding
dftxt data within Markdown as fenced code blocks (triple backticks) that have the `df`
or `dftxt` specifier after them. It's possible to specify multiple DataFrames this way
and break DataFrames up into multiple markdown fenced code blocks for inline commenting
where desirable.
For examples of what that looks like see:
- [Markdown with dftxt Example](./dftxt/tests/_io/_markdown/scenarios/multiple_frames/source.md)
- [Single DataFrame broken out across multiple blocks](./dftxt/tests/_io/_markdown/scenarios/single_frame/source.md)
### 3. Diff/Code Review Friendly
The properties of the dftxt file format that make it human-friendly also make it
friendly for code review and for displaying diffs.
#### DataFrame Wrapping
Additionally, DataFrames can be separated into multiple blocks within a file to keep
wide datasets manageable and easy to read in diffs, and to prevent small changes
from producing large diffs.
For example, this:
```
Name           Birth Month
Jane Doe       February
John Doe       April
Anna Johnson   November
Steve Simpson  June


Favorite Movie
Back to the Future
Harry Potter & the Goblet of Fire
Frozen
Avengers: Infinity War
```
is identical to this:
```
Name           Birth Month  Favorite Movie
Jane Doe       February     Back to the Future
John Doe       April        Harry Potter & the Goblet of Fire
Anna Johnson   November     Frozen
Steve Simpson  June         Avengers: Infinity War
```
The 2+ blank lines in a dftxt file indicate that what follows is additional
column data for the same DataFrame. It's also possible to repeat columns - often this
will be an index or primary key - to make it easy to track lines in different blocks of
the wrapped DataFrame. The example above could be written as:
```
Name           Birth Month
Jane Doe       February
John Doe       April
Anna Johnson   November
Steve Simpson  June


Name           Favorite Movie
&repeat
Jane Doe       Back to the Future
John Doe       Harry Potter & the Goblet of Fire
Anna Johnson   Frozen
Steve Simpson  Avengers: Infinity War
```
Here the `&repeat` modifier on the second appearance of the `Name` column indicates
that this column is a repeat of one already in the DataFrame and should not be loaded
again. It exists purely for human-readability.
It's also possible to have an index column exist only in the file and have it never
be included in the loaded result. Continuing the example from above, this might look
something like:
```
ID        Name           Birth Month
&exclude
1         Jane Doe       February
2         John Doe       April
3         Anna Johnson   November
4         Steve Simpson  June


ID        Favorite Movie
&exclude
1         Back to the Future
2         Harry Potter & the Goblet of Fire
3         Frozen
4         Avengers: Infinity War
```
Here the `ID` column exists only in the file and will be excluded from the loading
process because of the `&exclude` column modifier.
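The block boundary used for DataFrame wrapping can be sketched with a regular expression: a single blank line is ignored, while a run of 2+ blank lines starts a new block of columns. This is only an illustration of the rule described in this section. The `split_wrapped_blocks` name is hypothetical and the real parser may well differ.

```python
import re


def split_wrapped_blocks(text: str) -> list[str]:
    """Split dftxt-style text into column blocks separated by 2+ blank lines."""
    # One newline ends the last data line; two or more (possibly
    # whitespace-only) lines after it form the 2+ blank line gap.
    blocks = re.split(r"\n(?:[ \t]*\n){2,}", text.strip("\n"))
    return [block for block in blocks if block.strip()]


sample = (
    "ID  Name\n"
    "1   Jane Doe\n"
    "\n"
    "2   John Doe\n"
    "\n"
    "\n"
    "ID  Favorite Movie\n"
    "1   Back to the Future\n"
    "2   Frozen\n"
)

blocks = split_wrapped_blocks(sample)
print(len(blocks))
print(blocks[1].splitlines()[0])
```

The single blank line between the first two rows stays inside the first block; only the double blank line splits the text.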
### 4. Flexibility
The dftxt format offers additional flexibility in a few key ways.
#### Multiple DataFrames
First, the dftxt format allows for specifying multiple DataFrames in a single file.
This can be used in a number of ways, but the most common one is to include all
DataFrames for a test in a single location for coherence. DataFrames within a file
are separated by a line that begins with 3+ dashes, with a blank line both before and
after it. This looks like:
```
Name           Birth Month  Favorite Movie
Jane Doe       February     Back to the Future
John Doe       April        Harry Potter & the Goblet of Fire
Anna Johnson   November     Frozen
Steve Simpson  June         Avengers: Infinity War

---

Movie                              Budget ($M)  Box Office ($M)
                                   &dtype=int   &dtype=decimal
Avengers: Infinity War             400          2052.0
Back to the Future                 19           388.8
Frozen                             150          1280.0
Harry Potter & the Goblet of Fire  150          896.8

---

Name           Birth Month  Favorite Movie                     Budget ($M)  Box Office ($M)
                                                               &dtype=int   &dtype=decimal
Jane Doe       February     Back to the Future                 19           388.8
John Doe       April        Harry Potter & the Goblet of Fire  150          896.8
Anna Johnson   November     Frozen                             150          1280.0
Steve Simpson  June         Avengers: Infinity War             400          2052.0
```
Loading and using this file would look something like this:
```python
import pandas.testing

import dftxt

data_frames = dftxt.read_all_to_pandas("./example.dftxt")
# Frames are accessed by their position in the file.
combined = (
    data_frames[0]
    .merge(
        data_frames[1],
        how="left",
        left_on="Favorite Movie",
        right_on="Movie",
    )
    .drop(columns=["Movie"])
)
pandas.testing.assert_frame_equal(combined, data_frames[2])
```
Notice how the DataFrames are accessed by their indexed order in the file. It is also
possible to name the DataFrames in the file and access them by name instead. This
would look something like this:
```
--- people ---

Name           Birth Month  Favorite Movie
Jane Doe       February     Back to the Future
John Doe       April        Harry Potter & the Goblet of Fire
Anna Johnson   November     Frozen
Steve Simpson  June         Avengers: Infinity War

--- movies ---

Movie                              Budget ($M)  Box Office ($M)
                                   &dtype=int   &dtype=decimal
Avengers: Infinity War             400          2052.0
Back to the Future                 19           388.8
Frozen                             150          1280.0
Harry Potter & the Goblet of Fire  150          896.8

--- expected ---

Name           Birth Month  Favorite Movie                     Budget ($M)  Box Office ($M)
                                                               &dtype=int   &dtype=decimal
Jane Doe       February     Back to the Future                 19           388.8
John Doe       April        Harry Potter & the Goblet of Fire  150          896.8
Anna Johnson   November     Frozen                             150          1280.0
Steve Simpson  June         Avengers: Infinity War             400          2052.0
```
and would be loaded and accessed like this:
```python
import pandas.testing

import dftxt

data_frames = dftxt.read_all_to_pandas("./example.dftxt")
# Frames are accessed by the names given in the separator lines.
combined = data_frames.people.merge(
    data_frames.movies,
    how="left",
    left_on="Favorite Movie",
    right_on="Movie",
).drop(columns=["Movie"])
pandas.testing.assert_frame_equal(combined, data_frames.expected)
```
Names must be valid Python variable names. Also, the trailing `---` in the named example
is optional. It could also have been `--- people` instead of `--- people ---`.
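As an illustration of the separator rules above, the sketch below recognizes a separator line, extracts the optional name, and checks it with `str.isidentifier()`. The regular expression and the `frame_name` helper are assumptions made for this example and are not taken from dftxt itself.

```python
import re
from typing import Optional

# 3+ leading dashes, an optional name, and optional trailing dashes.
SEPARATOR = re.compile(r"^-{3,}\s*(?P<name>.*?)\s*-*\s*$")


def frame_name(line: str) -> Optional[str]:
    """Return the frame name ("" if unnamed), or None if not a separator line."""
    match = SEPARATOR.match(line)
    if match is None:
        return None
    name = match.group("name")
    if name and not name.isidentifier():
        raise ValueError(f"Frame name is not a valid Python variable name: {name!r}")
    return name


for line in ["--- people ---", "--- people", "---", "Jane Doe  February"]:
    print(repr(line), "->", repr(frame_name(line)))
```

A name such as `my-frame` would raise `ValueError` in this sketch, since it could not be used for attribute-style access like `data_frames.people`.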
#### Column Filtering
It is also possible to add filters to columns so that different columns are loaded under
different circumstances. There are two types of filters: `if` and `if_not`. A column with
an `if` filter will only be included if the named filter is specified when loading. A
column with an `if_not` filter will be excluded if the named filter is specified. These
provide a lot of flexibility and are especially useful when testing mapping
transformations without having to specify data multiple times.
Continuing from the example in the previous "Multiple DataFrames" section, the expected
DataFrame could be omitted and the combined columns added to the people DataFrame with
filters. Also, in this example we'll drop the `Birth Month` column from the expected
result to show `if_not` filtering as well. Here's what the file would look like:
```
--- people ---

Name           Birth Month  Favorite Movie                     Budget ($M)  Box Office ($M)
               &-expected                                      &dtype=int   &dtype=decimal
                                                               &+expected   &+expected
Jane Doe       February     Back to the Future                 19           388.8
John Doe       April        Harry Potter & the Goblet of Fire  150          896.8
Anna Johnson   November     Frozen                             150          1280.0
Steve Simpson  June         Avengers: Infinity War             400          2052.0

--- movies ---

Movie                              Budget ($M)  Box Office ($M)
                                   &dtype=int   &dtype=decimal
Avengers: Infinity War             400          2052.0
Back to the Future                 19           388.8
Frozen                             150          1280.0
Harry Potter & the Goblet of Fire  150          896.8
```
The `if` filters can be specified as `&if=expected` and the `if_not` filters as
`&if_not=expected`. Here, however, the shorthand forms `&+expected` and `&-expected`
are used, respectively. In this case the `Birth Month` column will be loaded unless
the read call specifies the `expected` filter. In contrast, the `Budget ($M)` and
`Box Office ($M)` columns will only be included if the read call specifies the
`expected` filter. In practice, this would look like:
```python
import pandas.testing

import dftxt

# Load once without filters (the source data) and once with the "expected" filter.
frames = dftxt.read_all_to_pandas("./example.dftxt")
expected_frames = dftxt.read_all_to_pandas("./example.dftxt", filters=["expected"])
combined = frames.people.merge(
    frames.movies,
    how="left",
    left_on="Favorite Movie",
    right_on="Movie",
).drop(columns=["Movie", "Birth Month"])
pandas.testing.assert_frame_equal(combined, expected_frames.people)
```
It is possible to specify multiple `if` and `if_not` filters on a single column if
desired. In those cases a column will be included when any of its `if` filters are
present. The `if_not` filters take precedence: the column will be omitted if any of
its `if_not` filters are present.
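The inclusion rule described above can be written down as a small decision function. This is a sketch of the rule as this section states it, not the library's code. The `include_column` name and its parameters are hypothetical.

```python
from typing import Iterable


def include_column(
    if_filters: Iterable[str],
    if_not_filters: Iterable[str],
    active_filters: Iterable[str],
) -> bool:
    """Decide whether a column is loaded for the filters given to the read call."""
    active = set(active_filters)
    # if_not filters take precedence: any match omits the column.
    if active & set(if_not_filters):
        return False
    required = set(if_filters)
    # With if filters present, at least one of them must be active.
    if required:
        return bool(active & required)
    # A column without filters is always loaded.
    return True


# `Birth Month` carries &-expected; `Budget ($M)` carries &+expected.
print(include_column([], ["expected"], []))
print(include_column([], ["expected"], ["expected"]))
print(include_column(["expected"], [], []))
print(include_column(["expected"], [], ["expected"]))
```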
Raw data
{
"_id": null,
"home_page": "https://github.com/rocketboosters/dftxt",
"name": "dftxt",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.9,<4.0",
"maintainer_email": "",
"keywords": "",
"author": "Scott Ernst",
"author_email": "swernst@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/55/22/7ca5d922c0ebde9284ae13446393a405d055ffddc838158bc3d66bb51468/dftxt-1.0.3.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache Version 2.0",
"summary": "Human-friendly, VCS-friendly file format for Python Pandas and Polars DataFrames.",
"version": "1.0.3",
"project_urls": {
"Documentation": "https://github.com/rocketboosters/dftxt",
"Homepage": "https://github.com/rocketboosters/dftxt",
"Repository": "https://github.com/rocketboosters/dftxt"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "7a4c91693a8e9a32efb80c75899ed79f2f44b1d5fc56c3e28918df50d55d835c",
"md5": "3c0877f1ef8c817d2e2f44db1f7b854c",
"sha256": "41680a0143d1368afe13edfdf52ff83946e5ac2bc4c6324332263aee02129ba6"
},
"downloads": -1,
"filename": "dftxt-1.0.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "3c0877f1ef8c817d2e2f44db1f7b854c",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9,<4.0",
"size": 88294,
"upload_time": "2024-01-24T14:23:46",
"upload_time_iso_8601": "2024-01-24T14:23:46.411625Z",
"url": "https://files.pythonhosted.org/packages/7a/4c/91693a8e9a32efb80c75899ed79f2f44b1d5fc56c3e28918df50d55d835c/dftxt-1.0.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "55227ca5d922c0ebde9284ae13446393a405d055ffddc838158bc3d66bb51468",
"md5": "90716c5b3262d92535e64376ae7b5872",
"sha256": "15529b80a9b23a95d679c194a4ba07da15136da6bf018a13a9bb8b69aae5cbc3"
},
"downloads": -1,
"filename": "dftxt-1.0.3.tar.gz",
"has_sig": false,
"md5_digest": "90716c5b3262d92535e64376ae7b5872",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9,<4.0",
"size": 52527,
"upload_time": "2024-01-24T14:23:47",
"upload_time_iso_8601": "2024-01-24T14:23:47.537374Z",
"url": "https://files.pythonhosted.org/packages/55/22/7ca5d922c0ebde9284ae13446393a405d055ffddc838158bc3d66bb51468/dftxt-1.0.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-01-24 14:23:47",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "rocketboosters",
"github_project": "dftxt",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "dftxt"
}