tablite

Name: tablite
Version: 2020.6.30.66481
Home page: https://github.com/root-11/tablite
Summary: A table crunching library
Upload time: 2020-06-30 17:28:53
Author: Bjorn Madsen
License: MIT
Keywords: tables
# Tablite 

[![Build Status](https://travis-ci.org/root-11/tablite.svg?branch=master)](https://travis-ci.org/root-11/tablite)
[![Code coverage](https://codecov.io/gh/root-11/tablite/branch/master/graph/badge.svg)](https://codecov.io/gh/root-11/tablite)

We're all tired of reinventing the wheel when we need to process a bit of data.

- Pandas has a huge memory overhead.
- Numpy has become a language of its own. It just doesn't seem pythonic anymore.
- Arrow isn't ready.
- SQLite is great but just too slow, particularly on disk.
- Protocol buffers are overkill for storing data when I still need to implement all the analytics afterwards.

So what do we do? We write a custom-built class for the problem at hand and
discover that we've just spent 3 hours doing something that should have taken
20 minutes. No more, please!

### Enter: [Tablite](https://pypi.org/project/tablite/)
A Python library for tables that does everything you need in 60kB.

Install: `pip install tablite`  
Usage:  `>>> import table`  

- it handles all datatypes: `str`, `float`, `bool`, `int`, `date`, `datetime`, `time`.
- you can select: 
  - all rows in a column as `table['A']` 
  - rows across all columns as `table[4:8]`
  - or a slice as `list(table.filter('A', 'B', slice(4,8)))`.
- you can update with `table['A'][2] = new value`
- you can store or send data using json, by: 
  - dumping to json: `json_str = table.to_json()`, or 
  - loading it with `Table.from_json(json_str)`.
- type checking is automatic when you append or replace values. 
- it checks if the header name is already in use.
- you can add any type of metadata to the table as `table.metadata['some key'] = 'some value'`
- you can ask `column_xyz in Table.columns`
- load from file with `Table.from_file('this.csv')`, which has automatic datatype detection
- perform SQL JOINs between tables 
- summarise using `table.groupby( ... )` 
- create pivot tables using `groupby.pivot( ... )`


Here are some examples:

```
# 1. Create the table.
table = Table()

# 2. Add a column
table.add_column('A', int, False)

# 3. check that it really is there.
assert 'A' in table

# 4. Add another column that doesn't tolerate None's
table.add_column('B', str, allow_empty=False)

# 5. appending a couple of rows:
table.add_row((1, 'hello'))
table.add_row((2, 'world'))

# 6. converting to json is easy:
table_as_json = table.to_json()

# 7. loading from json is easy:
table2 = Table.from_json(table_as_json)

# 8. for storing in a database or on disk, I recommend zipping the json.
zipped = zlib.compress(table_as_json.encode())

# 9. copying is easy:
table3 = table.copy()

# 10. comparing tables is straightforward:
assert table == table2 == table3

# 11. comparing metadata is also straightforward 
# (for example if you want to append the one table to the other)
table.compare(table3)  # will raise exception if they're different.

# 12. The plus operator `+` also works:
table3x2 = table3 + table3
assert len(table3x2) == len(table3) * 2

# 13. and so does plus-equal: +=
table3x2 += table3
assert len(table3x2) == len(table3) * 3

# 14. updating values is familiar to any user who likes a list:
assert 'A' in table.columns
assert isinstance(table.columns['A'], list)
last_row = -1
table['A'][last_row] = 44
table['B'][last_row] = "Hallo"

# 15. type verification is included, and complains if you're doing it wrong:
try:
    table.columns['A'][0] = 'Hallo'
    assert False, "A TypeError should have been raised."
except TypeError:
    assert True

# 16. slicing is easy:
table_chunk = table2[2:4]
assert isinstance(table_chunk, Table)

# 17. we will handle duplicate names gracefully.
table2.add_column('B', int, allow_empty=True)
assert set(table2.columns) == {'A', 'B', 'B_1'}

# 18. you can delete a column as key...
del table2['B_1']
assert set(table2.columns) == {'A', 'B'}

# 19. adding a computed column is easy:
table.add_column('new column', str, allow_empty=False, data=[f"{r}" for r in table.rows])

# 20. iterating over the rows is easy, but if you forget `.rows` will remind you.
try:
    for row in table:  # <-- won't work; needs table.rows
        assert False, "not possible. Use for row in table.rows or for column in table.columns"
except AttributeError:
    assert True

print(table)
for row in table.rows:
    print(row)

# but if you just want to view it interactively (or a slice of it), use:
table.show()

+ =====+=====+============= +
|   A  |  B  |  new column  |
|  int | str |     str      |
| False|False|    False     |
+ -----+-----+------------- +
|     1|hello| (1, 'hello') |
|    44|Hallo|(44, 'Hallo') |
+ =====+=====+============= +

table.show('A', slice(0,1))

+ ===== +
|   A   |
|  int  |
| False |
+ ----- +
|     1 |
+ ===== +


# 21. updating a column with a function is easy:
f = lambda x: x * 10
table['A'] = [f(r) for r in table['A']]

# 22. using regular indexing will also work.
for ix, r in enumerate(table['A']):
    table['A'][ix] = r * 10

# 23. and it will tell you if you're not allowed:
try:
    f = lambda x: f"'{x} as text'"
    table['A'] = [f(r) for r in table['A']]
    assert False, "The line above must raise a TypeError"
except TypeError as error:
    print("The error is:", str(error))

# 24. works with all datatypes ...
now = datetime.now()

table4 = Table()
table4.add_column('A', int, allow_empty=False, data=[-1, 1])
table4.add_column('A', int, allow_empty=True, data=[None, 1])  # None!
table4.add_column('A', float, False, data=[-1.1, 1.1])
table4.add_column('A', str, False, data=["", "1"])  # Empty string is not a None, when dtype is str!
table4.add_column('A', str, True, data=[None, "1"])  # Empty string is not a None, when dtype is str!
table4.add_column('A', bool, False, data=[False, True])
table4.add_column('A', datetime, False, data=[now, now])
table4.add_column('A', date, False, data=[now.date(), now.date()])
table4.add_column('A', time, False, data=[now.time(), now.time()])

# ...to and from json:
table4_json = table4.to_json()
table5 = Table.from_json(table4_json)

assert table4 == table5

# 25. doing lookups is supported by indexing:
table6 = Table()
table6.add_column('A', str, data=['Alice', 'Bob', 'Bob', 'Ben', 'Charlie', 'Ben', 'Albert'])
table6.add_column('B', str, data=['Alison', 'Marley', 'Dylan', 'Affleck', 'Hepburn', 'Barnes', 'Einstein'])

index = table6.index('A')  # single key.
assert index[('Bob',)] == {1, 2}

index2 = table6.index('A', 'B')  # multiple keys.
assert index2[('Bob', 'Dylan')] == {2}

# 26. And finally: You can add metadata until the cows come home:
table5.metadata['db_mapping'] = {'A': 'customers.customer_name',
                                 'A_2': 'product.sku',
                                 'A_4': 'locations.sender'}

table5_json = table5.to_json()
table5_from_json = Table.from_json(table5_json)
assert table5 == table5_from_json

```
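Step 8 above zips the JSON with `zlib`. The full store-and-restore roundtrip is plain stdlib and works on any JSON string; a minimal sketch, using a stand-in payload instead of a real table:

```
import json
import zlib

# Any JSON-serialisable payload stands in for table.to_json() here.
table_as_json = json.dumps({"A": [1, 2], "B": ["hello", "world"]})

# Compress for storage in a database blob or on disk ...
zipped = zlib.compress(table_as_json.encode())

# ... and restore it later.
restored = zlib.decompress(zipped).decode()
assert json.loads(restored) == {"A": [1, 2], "B": ["hello", "world"]}
```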

### Okay, great. How do I load data?

Easy. Use `filereader`

```
from pathlib import Path
from table import filereader

for filename in ['data.csv', 'data.xlsx', 'data.txt', 'data.tsv', 'data.ods']:
    path = Path(filename)
    for table in filereader(path):
        assert isinstance(table, Table)
        ...
```

table.filereader currently accepts the following formats:

`csv, tsv, txt, xls, xlsx, xlsm, ods, zip, log.`

And if the file has some wicked format like:

input: `19-Sep 02:59:47.153 web_id821 LOG 62.13.11.127 [3] (USER_N) ProcessScannedItem() : Scan[35572] LineNo 201636 scanned 1 of product 2332`

you can provide the split sequence as a keyword:

```
table = filereader('web.log', split_sequence=(" ", " ", " ", " ", " [", "] (", ") ", " : ", "LineNo ", " scanned ", "of "))
```
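How such a split sequence might be applied is sketched below with the stdlib; `split_by_sequence` is a hypothetical helper for illustration, not part of tablite:

```
def split_by_sequence(text, sequence):
    """Split text by consuming each separator in order, left to right."""
    fields = []
    remainder = text
    for sep in sequence:
        head, _, remainder = remainder.partition(sep)
        fields.append(head)
    fields.append(remainder)  # whatever is left after the last separator
    return fields

line = "19-Sep 02:59:47.153 web_id821 LOG"
assert split_by_sequence(line, (" ", " ", " ")) == [
    "19-Sep", "02:59:47.153", "web_id821", "LOG"]
```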


----------------

### At this point you know it's a solid bread and butter table. 
### No surprises. Now onto the real time savers

**_Iteration_** supports for-loops and list comprehensions at the speed of C: 

Just use `[r for r in table.rows]`, or:

    for row in table.rows:
        row ...

Here's a more practical use case:

(1) Imagine a table with columns a,b,c,d,e (all integers) like this:
```
t = Table()
_ = [t.add_column(header=c, datatype=int, allow_empty=False, data=[i for i in range(5)]) for c in 'abcde']
```

(2) we want to add two new columns using the functions:
```
def f1(a,b,c): 
    return a+b+c+1
def f2(b,c,d): 
    return b*c*d
```

(3) and we want to compute two new columns `f` and `g`:
```
t.add_column(header='f', datatype=int, allow_empty=False)
t.add_column(header='g', datatype=int, allow_empty=True)
```

(4) we can now use filter to iterate over the table and append the values to the two new columns:
```
for row in t.filter('a', 'b', 'c', 'd'):
    a, b, c, d = row

    t['f'].append(f1(a, b, c))
    t['g'].append(f2(b, c, d))

assert len(t) == 5
assert list(t.columns) == list('abcdefg')

```

**_Sort_** supports multi-column sort as simple as `table.sort(**{'A': False, 'B': True, 'C': False})`

Here's an example:

```
table7 = Table()
table7.add_column('A', int, data=[1, None, 8, 3, 4, 6, 5, 7, 9], allow_empty=True)
table7.add_column('B', int, data=[10, 100, 1, 1, 1, 1, 10, 10, 10])
table7.add_column('C', int, data=[0, 1, 0, 1, 0, 1, 0, 1, 0])

table7.sort(**{'B': False, 'C': False, 'A': False})

assert list(table7.rows) == [
    (4, 1, 0),
    (8, 1, 0),
    (3, 1, 1),
    (6, 1, 1),
    (1, 10, 0),
    (5, 10, 0),
    (9, 10, 0),
    (7, 10, 1),
    (None, 100, 1)
]
```
This takes us to filter:

**_Filter_** allows selection of particular rows like: 

    for row in table.filter('b', 'a', 'a', 'c'):
        b, a, a, c = row

And if you only want a slice, for example columns A and B for rows 4-8 from 
table7, you'd do it like this: `list(table7.filter('A', 'B', slice(4, 8)))`.

This would yield:
```
assert list(table7.filter('A', 'B', slice(4, 8))) == [
    (1, 10), 
    (5, 10), 
    (9, 10), 
    (7, 10)
]
```
As you can see, the table is sorted first by column `B` in ascending order, then
by column `C` and finally by column `A`. Note that `None` is handled as `float('-inf')`.
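The described ordering is easy to reproduce with the stdlib by substituting `float('-inf')` for `None` in the sort key; a sketch of the behaviour, not tablite's implementation:

```
rows = [(1, 10, 0), (None, 100, 1), (8, 1, 0), (3, 1, 1), (4, 1, 0),
        (6, 1, 1), (5, 10, 0), (7, 10, 1), (9, 10, 0)]

def none_as_neg_inf(value):
    # None sorts before every number, as float('-inf') would.
    return float('-inf') if value is None else value

# Sort by B, then C, then A -- all ascending.
ordered = sorted(rows, key=lambda r: (none_as_neg_inf(r[1]),
                                      none_as_neg_inf(r[2]),
                                      none_as_neg_inf(r[0])))
assert ordered[0] == (4, 1, 0)
assert ordered[-1] == (None, 100, 1)
```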


**_Index_** supports multi-key indexing using args: `table.index('B','C')`. 

This gives you a dictionary with the key as a tuple and the indices as a set, e.g. 

    indices = {
        (1, 44): {2, 3, 33, 35},
        (2, 44): {4, 5, 32}
    }
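Such an index is easy to picture with the stdlib: a dict mapping tuple-keys to sets of row numbers. A minimal sketch, not tablite's implementation:

```
from collections import defaultdict

rows = [('Bob', 'Marley'), ('Bob', 'Dylan'), ('Ben', 'Affleck')]

index = defaultdict(set)
for row_number, row in enumerate(rows):
    index[row].add(row_number)  # key is the tuple of indexed column values

assert index[('Bob', 'Dylan')] == {1}
```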



**_All_** lets you copy a table where *all* criteria match.

This allows you to use custom functions like this:

```
before = [r for r in table2.rows]
assert before == [(1, 'hello'), (2, 'world'), (1, 'hello'), (44, 'Hallo')]

# as is filtering for ALL that match:
filter_1 = lambda x: 'llo' in x
filter_2 = lambda x: x > 3

after = table2.all(**{'B': filter_1, 'A': filter_2})

assert list(after.rows) == [(44, 'Hallo')]
```

**_Any_** works like `all`, except it retrieves rows where *any* criterion matches.

```
after = table2.any(**{'B': filter_1, 'A': filter_2})

assert list(after.rows) == [(1, 'hello'), (1, 'hello'), (44, 'Hallo')]
```
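`all` and `any` mirror Python's built-ins applied per row. A stdlib sketch of both behaviours on plain tuples (illustrative only; `criteria` maps column positions to tests):

```
rows = [(1, 'hello'), (2, 'world'), (44, 'Hallo')]
criteria = {0: lambda x: x > 3, 1: lambda x: 'llo' in x}

# Keep rows where every criterion holds ...
matches_all = [r for r in rows if all(f(r[i]) for i, f in criteria.items())]
# ... or where at least one criterion holds.
matches_any = [r for r in rows if any(f(r[i]) for i, f in criteria.items())]

assert matches_all == [(44, 'Hallo')]
assert matches_any == [(1, 'hello'), (44, 'Hallo')]
```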

----------------

**_SQL JOINs_** are supported out of the box. 

Here are a couple of examples:

```
# We start with creating two tables:
left = Table()
left.add_column('number', int, allow_empty=True, data=[1, 2, 3, 4, None])
left.add_column('colour', str, data=['black', 'blue', 'white', 'white', 'blue'])

right = Table()
right.add_column('letter', str, allow_empty=True, data=['a', 'b,', 'c', 'd', None])
right.add_column('colour', str, data=['blue', 'white', 'orange', 'white', 'blue'])
```

**Left join** would in SQL be:
`SELECT number, letter FROM left LEFT JOIN right ON left.colour = right.colour`

with tablite it's:
```
left_join = left.left_join(right, keys=['colour'], columns=['number', 'letter'])
```

**Inner join** would in SQL be:
`SELECT number, letter FROM left JOIN right ON left.colour = right.colour`

with tablite it's:
```
inner_join = left.inner_join(right, keys=['colour'], columns=['number', 'letter'])
```


**Outer join** would in SQL be:
`SELECT number, letter FROM left OUTER JOIN right ON left.colour = right.colour`

with tablite it's:

```
outer_join = left.outer_join(right, keys=['colour'], columns=['number', 'letter'])
```
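Under the hood, joins of this kind are typically hash joins: index the right-hand table by the key, then stream the left-hand rows. A stdlib sketch of the LEFT JOIN case (illustrative only, not tablite's implementation):

```
from collections import defaultdict

left = [(1, 'black'), (2, 'blue')]
right = [('a', 'blue'), ('b', 'white'), ('e', 'blue')]

# Index the right-hand table by the join key.
by_colour = defaultdict(list)
for letter, colour in right:
    by_colour[colour].append(letter)

# LEFT JOIN keeps every left row; unmatched rows pair with None.
joined = [(number, letter)
          for number, colour in left
          for letter in (by_colour.get(colour) or [None])]

assert joined == [(1, None), (2, 'a'), (2, 'e')]
```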

----------------

**_GroupBy_** operations are supported using the GroupBy class.

It allows summarising the data with any of the functions below:

```
g = GroupBy(keys=['a', 'b'],  # <-- Group by these columns
            functions=[('f', Max),  # <-- find the max on column `f` for each group.
                       ('f', Min),
                       ('f', Sum),
                       ('f', First),
                       ('f', Last),
                       ('f', Count),
                       ('f', CountUnique),
                       ('f', Average),
                       ('f', StandardDeviation),
                       ('a', StandardDeviation),
                       ('f', Median),
                       ('f', Mode),
                       ('g', Median)])
t2 = t + t
assert len(t2) == 2 * len(t)
t2.show()

+ =====+=====+=====+=====+=====+=====+===== +
|   a  |  b  |  c  |  d  |  e  |  f  |  g   |
|  int | int | int | int | int | int | int  |
| False|False|False|False|False|False| True |
+ -----+-----+-----+-----+-----+-----+----- +
|     0|    0|    0|    0|    0|    1|    0 |
|     1|    1|    1|    1|    1|    4|    1 |
|     2|    2|    2|    2|    2|    7|    8 |
|     3|    3|    3|    3|    3|   10|   27 |
|     4|    4|    4|    4|    4|   13|   64 |
|     0|    0|    0|    0|    0|    1|    0 |
|     1|    1|    1|    1|    1|    4|    1 |
|     2|    2|    2|    2|    2|    7|    8 |
|     3|    3|    3|    3|    3|   10|   27 |
|     4|    4|    4|    4|    4|   13|   64 |
+ =====+=====+=====+=====+=====+=====+===== +

g += t2

assert list(g.rows) == [
    (0, 0, 1, 1, 2, 1, 1, 2, 1, 1.0, 0.0, 0.0, 1, 1, 0),
    (1, 1, 4, 4, 8, 4, 4, 2, 1, 4.0, 0.0, 0.0, 4, 4, 1),
    (2, 2, 7, 7, 14, 7, 7, 2, 1, 7.0, 0.0, 0.0, 7, 7, 8),
    (3, 3, 10, 10, 20, 10, 10, 2, 1, 10.0, 0.0, 0.0, 10, 10, 27),
    (4, 4, 13, 13, 26, 13, 13, 2, 1, 13.0, 0.0, 0.0, 13, 13, 64)
]

g.table.show()

+ =====+=====+======+======+======+========+=======+========+==============+==========+====================+====================+=========+=======+========= +
|   a  |  b  |Max(f)|Min(f)|Sum(f)|First(f)|Last(f)|Count(f)|CountUnique(f)|Average(f)|StandardDeviation(f)|StandardDeviation(a)|Median(f)|Mode(f)|Median(g) |
|  int | int | int  | int  | int  |  int   |  int  |  int   |     int      |  float   |       float        |       float        |   int   |  int  |   int    |
| False|False| True | True | True |  True  |  True |  True  |     True     |   True   |        True        |        True        |   True  |  True |   True   |
+ -----+-----+------+------+------+--------+-------+--------+--------------+----------+--------------------+--------------------+---------+-------+--------- +
|     0|    0|     1|     1|     2|       1|      1|       2|             1|       1.0|                 0.0|                 0.0|        1|      1|        0 |
|     1|    1|     4|     4|     8|       4|      4|       2|             1|       4.0|                 0.0|                 0.0|        4|      4|        1 |
|     2|    2|     7|     7|    14|       7|      7|       2|             1|       7.0|                 0.0|                 0.0|        7|      7|        8 |
|     3|    3|    10|    10|    20|      10|     10|       2|             1|      10.0|                 0.0|                 0.0|       10|     10|       27 |
|     4|    4|    13|    13|    26|      13|     13|       2|             1|      13.0|                 0.0|                 0.0|       13|     13|       64 |
+ =====+=====+======+======+======+========+=======+========+==============+==========+====================+====================+=========+=======+========= +

```
 
Note that the groupby is instantiated on its own, without any data, and that
data is then added using `+=`. That's because I wanted the GroupBy class to be
friendly to updates that otherwise might run out of memory. Here's the case:

(1) Imagine you have a large number of files that you want to summarize.

For this you first need a groupby operation:
```
g = GroupBy(keys=['a','b'], functions=[('c', StandardDeviation), ('d', Average)])
```

(2) now you can just iterate over the files without having to worry about 
the memory footprint, as each table is consumed by the groupby function:
```
files = Path(__file__).parent / 'archive'
assert files.is_dir()
for file in files.iterdir():
    json_str = file.read_text()
    table = Table.from_json(json_str)
    g += table
```

(3) Once all files have been summarised, you can read the results using
Python's friendly for-loop:
```
for a, b, stdev, avg in g.rows:
    ...  # do something with each summarised group
```
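The memory-friendly `+=` pattern comes down to keeping running aggregates per key, so each chunk can be discarded once it has been consumed. A stdlib sketch with Count and Sum feeding an Average:

```
from collections import defaultdict

# running totals per key: [count, sum]
totals = defaultdict(lambda: [0, 0.0])

chunks = [  # stand-ins for the tables loaded file by file
    [('x', 1), ('y', 2)],
    [('x', 3), ('x', 5)],
]
for chunk in chunks:
    for key, value in chunk:  # each chunk can be garbage-collected after this
        totals[key][0] += 1
        totals[key][1] += value

averages = {key: s / n for key, (n, s) in totals.items()}
assert averages == {'x': 3.0, 'y': 2.0}
```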

**Pivot Table** included in the groupby? Yes. You can pivot the groupby on any
column that is used for grouping. Here's a simple example:

```
g2 = GroupBy(keys=['a', 'b'], functions=[('f', Max), ('f', Sum)])
g2 += t + t + t

g2.table.show()

+=====+=====+======+======+
|  a  |  b  |Max(f)|Sum(f)|
| int | int | int  | int  |
|False|False| True | True |
+-----+-----+------+------+
|    0|    0|     1|     3|
|    1|    1|     4|    12|
|    2|    2|     7|    21|
|    3|    3|    10|    30|
|    4|    4|    13|    39|
+=====+=====+======+======+

pivot_table = g2.pivot(columns=['b'])

pivot_table.show()

+=====+==========+==========+==========+==========+==========+==========+==========+==========+==========+==========+
|  a  |Max(f,b=0)|Sum(f,b=0)|Max(f,b=1)|Sum(f,b=1)|Max(f,b=2)|Sum(f,b=2)|Max(f,b=3)|Sum(f,b=3)|Max(f,b=4)|Sum(f,b=4)|
| int |   int    |   int    |   int    |   int    |   int    |   int    |   int    |   int    |   int    |   int    |
|False|   True   |   True   |   True   |   True   |   True   |   True   |   True   |   True   |   True   |   True   |
+-----+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+
|    0|         1|         3|      None|      None|      None|      None|      None|      None|      None|      None|
|    1|      None|      None|         4|        12|      None|      None|      None|      None|      None|      None|
|    2|      None|      None|      None|      None|         7|        21|      None|      None|      None|      None|
|    3|      None|      None|      None|      None|      None|      None|        10|        30|      None|      None|
|    4|      None|      None|      None|      None|      None|      None|      None|      None|        13|        39|
+=====+==========+==========+==========+==========+==========+==========+==========+==========+==========+==========+
```
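Conceptually, the pivot just re-keys the grouped results: one row per `a` value, one column per `(function, b)` pair, with `None` where a combination never occurred. A stdlib sketch under that assumption:

```
# grouped results: (a, b) -> Sum(f), as a groupby would produce them
grouped = {(0, 0): 3, (1, 1): 12, (2, 2): 21}

b_values = sorted({b for (_, b) in grouped})
pivot = {}
for (a, b), total in grouped.items():
    # one dict per a-value, pre-filled with None for every b column
    row = pivot.setdefault(a, {f"Sum(f,b={v})": None for v in b_values})
    row[f"Sum(f,b={b})"] = total

assert pivot[0] == {"Sum(f,b=0)": 3, "Sum(f,b=1)": None, "Sum(f,b=2)": None}
```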

---------------------


This concludes the mega-tutorial to `tablite`. There's nothing more to it.
But oh boy, it'll save a lot of time.
            

That's because I wanted the GroupBy class to be\nfriendly to updates that otherwise might run out of memory. Here's the case:\n\n(1) Imagine you have a large number of files that you want to summarize.\n\nFor this you first need a groupby operation:\n```\ng = GroupBy(keys=['a','b'], functions=[('c', StandardDeviation), ('d', Average)])\n```\n\n(2) now you can just iterate over the files and not having to worry about \nthe memory footprint, as each table is consumed by the groupby function:\n```\nfiles = Path(__file__).parent / 'archive'\nassert files.isdir()  \nfor file in files.iterdir():\n    json_str = json.loads(file.read())\n    table = Table.from_json(json)\n    g += table\n```\n\n(3) Once all files have been summarized, you can read the results using\nPythons friendly for loop:\n```\nfor a, b, stdev, avg in g.rows:\n     # ... do something ...\n```\n\n**Pivot Table** included in the groupby? Yes. You can pivot the groupby on any\ncolumn that is used for grouping. Here's a simple example:\n\n```\ng2 = GroupBy(keys=['a', 'b'], functions=[('f', Max), ('f', Sum)])\ng2 += t + t + t\n\ng2.table.show()\n\n+=====+=====+======+======+\n|  a  |  b  |Max(f)|Sum(f)|\n| int | int | int  | int  |\n|False|False| True | True |\n+-----+-----+------+------+\n|    0|    0|     1|     3|\n|    1|    1|     4|    12|\n|    2|    2|     7|    21|\n|    3|    3|    10|    30|\n|    4|    4|    13|    39|\n+=====+=====+======+======+\n\npivot_table = g2.pivot(columns=['b'])\n\npivot_table.show()\n\n+=====+==========+==========+==========+==========+==========+==========+==========+==========+==========+==========+\n|  a  |Max(f,b=0)|Sum(f,b=0)|Max(f,b=1)|Sum(f,b=1)|Max(f,b=2)|Sum(f,b=2)|Max(f,b=3)|Sum(f,b=3)|Max(f,b=4)|Sum(f,b=4)|\n| int |   int    |   int    |   int    |   int    |   int    |   int    |   int    |   int    |   int    |   int    |\n|False|   True   |   True   |   True   |   True   |   True   |   True   |   True   |   True   |   True   |   True   
|\n+-----+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+\n|    0|         1|         3|      None|      None|      None|      None|      None|      None|      None|      None|\n|    1|      None|      None|         4|        12|      None|      None|      None|      None|      None|      None|\n|    2|      None|      None|      None|      None|         7|        21|      None|      None|      None|      None|\n|    3|      None|      None|      None|      None|      None|      None|        10|        30|      None|      None|\n|    4|      None|      None|      None|      None|      None|      None|      None|      None|        13|        39|\n+=====+==========+==========+==========+==========+==========+==========+==========+==========+==========+==========+\n```\n\n---------------------\n\n\nThis concludes the mega-tutorial to `tablite`. There's nothing more to it.\nBut oh boy it'll save a lot of time.",
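As a plain-Python footnote: the multi-key index structure described under **_Index_** above (a dict mapping key-tuples to sets of row indices) can be sketched without tablite installed. The rows below are made up for illustration and are not tablite output:

```python
# Sketch of the structure table.index('B', 'C') is described as returning:
# a dict keyed by (B, C) tuples, whose values are sets of row indices.
rows = [(1, 10, 0), (5, 10, 0), (9, 10, 0), (7, 10, 1)]  # columns A, B, C

index = {}
for i, (a, b, c) in enumerate(rows):
    # group row index i under its (B, C) key
    index.setdefault((b, c), set()).add(i)

assert index == {(10, 0): {0, 1, 2}, (10, 1): {3}}
```

A set of indices (rather than a list) makes membership tests and intersections of several keys cheap.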