# Pandas DataFrame Operations 8 times faster (or even more)
DataFrame.query has never paid off for me. On my PC it is extremely slow on small DataFrames, and only marginally faster, if at all, on huge ones.
DataFrame.query uses pd.eval, and pd.eval uses numexpr. The strange thing is that numexpr is blazingly fast when used directly against a DataFrame's data, yet neither pd.eval nor DataFrame.query is. At first I thought there was a problem with my Pandas/environment configuration, but then I read on the [Pandas page](https://pandas.pydata.org/docs/user_guide/indexing.html#performance-of-query):
_You will only see the performance benefits of using the numexpr engine with DataFrame.query() if your frame has more than approximately 200,000 rows._
Well, **a_pandas_ex_numexpr** adds a set of methods to the DataFrame/Series classes and delivers substantial speed-ups **(up to 8 times faster in my tests)** even for small DataFrames. All tests were done with: [https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv](https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv)
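To see where the gap comes from, here is a minimal sanity check (not part of the package) that feeds the raw NumPy arrays straight to `numexpr.evaluate` and compares it with `pd.eval` on the same expression; the names in `local_dict` are arbitrary:

```python
import numexpr
import pandas as pd

df = pd.read_csv("https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv")

# numexpr operating directly on the underlying NumPy arrays
fast = numexpr.evaluate(
    "survived * 99.5 * pclass",
    local_dict={"survived": df["Survived"].to_numpy(), "pclass": df["Pclass"].to_numpy()},
)

# the same expression routed through pandas' eval machinery (also used by df.query)
slow = pd.eval("df.Survived * 99.5 * df.Pclass")
```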
**Let the numbers speak for themselves**
## How to import / use a_pandas_ex_numexpr
```python
from a_pandas_ex_numexpr import pd_add_numexpr
pd_add_numexpr()
import pandas as pd
dafra = "https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv"
df = pd.read_csv(dafra)
df
Out[3]:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S
.. ... ... ... ... ... ... ...
886 887 0 2 ... 13.0000 NaN S
887 888 1 1 ... 30.0000 B42 S
888 889 0 3 ... 23.4500 NaN S
889 890 1 1 ... 30.0000 C148 C
890 891 0 3 ... 7.7500 NaN Q
[891 rows x 12 columns]
```
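Under the hood, `pd_add_numexpr()` apparently attaches the `ne_*` helpers to the `DataFrame`/`Series` classes. The snippet below is a rough, hypothetical sketch of that monkey-patching pattern and of what `ne_query` roughly does; the real implementation may differ:

```python
# Hypothetical sketch of the idea behind pd_add_numexpr(); not the package's actual code.
import numexpr
import pandas as pd


def _ne_query_sketch(self, expr, return_np=True, local_dict=None):
    # 'b' always stands for the Series/DataFrame the method is called on
    variables = {"b": self.to_numpy()}
    for name, value in (local_dict or {}).items():
        # accept pandas objects as well as plain NumPy arrays/scalars
        variables[name] = value.to_numpy() if hasattr(value, "to_numpy") else value
    result = numexpr.evaluate(expr, local_dict=variables)
    if return_np or result.ndim > 1:
        return result
    return pd.Series(result, index=self.index)


def pd_add_numexpr_sketch():
    # register the helper on both classes, as pd_add_numexpr() presumably does
    pd.Series.ne_query = _ne_query_sketch
    pd.DataFrame.ne_query = _ne_query_sketch
```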
## Speed test - a_pandas_ex_numexpr
```python
# Code explanation at the end of the page
wholedict = {'c': df.Pclass}
%timeit df['Survived'].ne_query('b * 99.5 / 000.1 + 42123.323211 / 1335523.42232 * c', return_np=True, local_dict=wholedict)
30.8 µs ± 229 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df['Survived'].ne_query('b * 99.5 / 000.1 + 42123.323211 / 1335523.42232 * c', return_np=False, local_dict=wholedict)
70.1 µs ± 2.44 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df['Survived'] * 99.5 / 000.1 + 42123.323211 / 1335523.42232 * df['Pclass']
262 µs ± 4.25 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit pd.eval("df.Survived * 99.5 / 000.1 + 42123.323211 / 1335523.42232 * df.Pclass") #used by df.query
1.37 ms ± 45.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
df['Survived'].ne_query('b * 99.5 / 000.1 + 42123.323211 / 1335523.42232 * c', return_np=False, local_dict=wholedict)
Out[33]:
0 0.094622
1 995.031541
2 995.094622
3 995.031541
4 0.094622
...
886 0.063081
887 995.031541
888 0.094622
889 995.031541
890 0.094622
Length: 891, dtype: float64
df['Survived'] * 99.5 / 000.1 + 42123.323211 / 1335523.42232 * df['Pclass']
Out[34]:
0 0.094622
1 995.031541
2 995.094622
3 995.031541
4 0.094622
...
886 0.063081
887 995.031541
888 0.094622
889 995.031541
890 0.094622
Length: 891, dtype: float64
```
```python
wholedict = {'c': df.Pclass}
%timeit df['Survived'].ne_query('b * 99.5 * c', return_np=True, local_dict=wholedict)
27 µs ± 245 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df['Survived'].ne_query('b * 99.5 * c', return_np=False, local_dict=wholedict)
65.7 µs ± 1.65 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df['Survived'] * 99.5 * df['Pclass']
140 µs ± 5.46 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit pd.eval("df.Survived * 99.5 * df.Pclass")
916 µs ± 7.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
```python
wholedict = {'c': df.Pclass}
%timeit df['Survived'].ne_query('b / c', return_np=True, local_dict=wholedict)
26.5 µs ± 200 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df['Survived'].ne_query('b / c', return_np=False, local_dict=wholedict) # returns a Series
60.3 µs ± 336 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df['Survived'] / df['Pclass']
68.2 µs ± 599 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit pd.eval("df.Survived / df.Pclass")
929 µs ± 31.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
## All functions/methods
## Speed of some “ready-to-use” methods for Series
```python
%timeit df.loc[df.PassengerId.ne_less_than(100)]
142 µs ± 412 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df.loc[df.PassengerId <100]
212 µs ± 897 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
##############################################################
%timeit df.loc[df.Survived.ne_not_equal(0)]
157 µs ± 390 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df.loc[df.Survived!=0]
229 µs ± 1.46 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
##############################################################
%timeit df.loc[df.PassengerId.ne_greater_than(100)]
174 µs ± 375 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df.loc[df.PassengerId>100]
248 µs ± 2.26 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
##############################################################
%timeit df.loc[df.PassengerId.ne_equal(1)]
138 µs ± 626 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df.loc[df.PassengerId == 1]
209 µs ± 1.04 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
##############################################################
%timeit df.loc[df.Cabin.ne_search_for_string_contains('C1')]
329 µs ± 1.18 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit df.loc[df.Cabin.str.contains('C1',na=False)]
403 µs ± 924 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
##############################################################
%timeit df.loc[df.PassengerId.ne_greater_than_or_equal_to(100)]
175 µs ± 832 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df.loc[df.PassengerId>=100]
251 µs ± 2.77 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
##############################################################
%timeit df.loc[df.PassengerId.ne_less_than_or_equal_to(100)]
145 µs ± 1.82 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df.loc[df.PassengerId <=100]
212 µs ± 1.63 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
##############################################################
```
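The `ne_*` comparison helpers above presumably boil down to a single numexpr comparison on the column's raw NumPy array, which is where the saving over the plain pandas operators comes from. A minimal sketch of that idea (illustrative helper name, not the package's code):

```python
import numexpr
import numpy as np
import pandas as pd


def ne_less_than_sketch(series: pd.Series, value) -> np.ndarray:
    # boolean mask computed by numexpr on the raw array, usable inside df.loc[]
    return numexpr.evaluate("b < value", local_dict={"b": series.to_numpy(), "value": value})


# usage, roughly equivalent to df.loc[df.PassengerId.ne_less_than(100)]:
# df.loc[ne_less_than_sketch(df.PassengerId, 100)]
```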
## Overview - all methods for DataFrames/Series
```python
# Always use 'b' as the variable for the Series/DataFrame
df.ne_search_in_all_columns('b == 1')
array([ 0, 1, 2, 3, 8, 9, 10, 11, 15, 17, 19, 21, 22,
23, 25, 28, 31, 32, 36, 39, 43, 44, 47, 52, 53, 55,
56, 58, 61, 65, 66, 68, 74, 78, 79, 81, 82, 84, 85,
88, 97, 98, 106, 107, 109, 123, 125, 127, 128, 133, 136, 141,
142, 146, 151, 156, 161, 165, 166, 172, 183, 184, 186, 187, 190,
192, 193, 194, 195, 198 ...]
```
```python
# Returns a duplicated index if the value is found in
# several columns. Exceptions are ignored.
# The dtype argument is useful when searching for
# strings -> dtype='S' (ASCII only)
df.ne_search_in_all_columns('b == "1"', dtype='S')
array([ 0, 1, 2, 3, 8, 9, 10, 11, 15, 17, 19, 21, 22,
23, 25, 28, 31, 32, 36, 39, 43, 44, 47, 52, 53, 55,
56, 58, 61, 65, 66, 68, 74, 78, 79, 81, 82, 84, 85,
88, 97, 98, 106, 107, 109, ...]
```
```python
# Converts all columns to dtype='S' before searching
# Might not work with special characters
# UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 0:
df.ne_search_string_allhits_contains('C1')
Out[6]:
PassengerId Survived Pclass ... Fare Cabin Embarked
3 4 1 1 ... 53.1000 C123 S
11 12 1 1 ... 26.5500 C103 S
110 111 0 1 ... 52.0000 C110 S
137 138 0 1 ... 53.1000 C123 S
268 269 1 1 ... 153.4625 C125 S
273 274 0 1 ... 29.7000 C118 C
298 299 1 1 ... 30.5000 C106 S
331 332 0 1 ... 28.5000 C124 S
351 352 0 1 ... 35.0000 C128 S
449 450 1 1 ... 30.5000 C104 S
452 453 0 1 ... 27.7500 C111 C
571 572 1 1 ... 51.4792 C101 S
609 610 1 1 ... 153.4625 C125 S
669 670 1 1 ... 52.0000 C126 S
711 712 0 1 ... 26.5500 C124 S
712 713 1 1 ... 52.0000 C126 S
889 890 1 1 ... 30.0000 C148 C
[17 rows x 12 columns]
```
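The string searches presumably work by casting the values to fixed-width ASCII byte strings (`dtype='S'`) and then running NumExpr's `contains()` on the resulting array, which would also explain the possible `UnicodeEncodeError` for non-ASCII data. A rough sketch of that idea for a single Series (hypothetical helper, not the package's code):

```python
import numexpr
import numpy as np
import pandas as pd


def search_contains_sketch(series: pd.Series, needle: str) -> np.ndarray:
    # cast to ASCII byte strings; non-ASCII values would raise UnicodeEncodeError here
    haystack = series.fillna("").to_numpy().astype("S")
    # embed the needle as a bytes literal, since NumExpr string ops work on bytes
    expr = "contains(haystack, %r)" % needle.encode("ascii")
    return numexpr.evaluate(expr, local_dict={"haystack": haystack})


# usage, roughly what df.loc[df.Cabin.ne_search_for_string_contains('C1')] does:
# df.loc[search_contains_sketch(df.Cabin, "C1")]
```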
```python
# Series doesn't return duplicated results
df.Cabin.ne_search_string_allhits_contains('C1')
Out[9]:
3 C123
11 C103
110 C110
137 C123
268 C125
273 C118
298 C106
331 C124
351 C128
449 C104
452 C111
571 C101
609 C125
669 C126
711 C124
712 C126
889 C148
Name: Cabin, dtype: object
%timeit df.Cabin.ne_search_string_allhits_contains('C1')
274 µs ± 2.74 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit df.Cabin.loc[df.Cabin.str.contains('C1', na=False)]
351 µs ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
```python
# All rows where the string/substring C1 is found.
# Numbers are converted to string (ascii)
df.ne_search_string_dataframe_contains('C1')
Out[13]:
PassengerId Survived Pclass ... Fare Cabin Embarked
3 4 1 1 ... 53.1000 C123 S
11 12 1 1 ... 26.5500 C103 S
110 111 0 1 ... 52.0000 C110 S
137 138 0 1 ... 53.1000 C123 S
268 269 1 1 ... 153.4625 C125 S
273 274 0 1 ... 29.7000 C118 C
298 299 1 1 ... 30.5000 C106 S
331 332 0 1 ... 28.5000 C124 S
351 352 0 1 ... 35.0000 C128 S
449 450 1 1 ... 30.5000 C104 S
452 453 0 1 ... 27.7500 C111 C
571 572 1 1 ... 51.4792 C101 S
609 610 1 1 ... 153.4625 C125 S
669 670 1 1 ... 52.0000 C126 S
711 712 0 1 ... 26.5500 C124 S
712 713 1 1 ... 52.0000 C126 S
889 890 1 1 ... 30.0000 C148 C
[17 rows x 12 columns]
df.ne_search_string_dataframe_contains('610')
Out[14]:
PassengerId Survived Pclass ... Fare Cabin Embarked
194 195 1 1 ... 27.7208 B4 C
609 610 1 1 ... 153.4625 C125 S
[2 rows x 12 columns]
```
```python
# Converts all columns to ASCII and searches each column.
# For every column in which the value is found, the row's index is duplicated
df.ne_search_string_dataframe_allhits_equal('1')
Out[15]:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
1 2 1 1 ... 71.2833 C85 C
1 2 1 1 ... 71.2833 C85 C
.. ... ... ... ... ... ... ...
887 888 1 1 ... 30.0000 B42 S
887 888 1 1 ... 30.0000 B42 S
888 889 0 3 ... 23.4500 NaN S
889 890 1 1 ... 30.0000 C148 C
889 890 1 1 ... 30.0000 C148 C
[886 rows x 12 columns]
```
```python
# All equal strings in a Series
df.Embarked.ne_search_string_dataframe_allhits_equal('S')
Out[16]:
0 S
2 S
3 S
4 S
6 S
..
883 S
884 S
886 S
887 S
888 S
Name: Embarked, Length: 644, dtype: object
%timeit df.Embarked.ne_search_string_dataframe_allhits_equal('S')
160 µs ± 2.14 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df.Embarked.loc[df.Embarked=='S']
178 µs ± 3.04 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```
```python
# Converts the whole df to ASCII and checks where
# the value is present. Exceptions are ignored
df.ne_search_string_dataframe_equal('C123')
Out[20]:
PassengerId Survived Pclass ... Fare Cabin Embarked
3 4 1 1 ... 53.1 C123 S
137 138 0 1 ... 53.1 C123 S
[2 rows x 12 columns]
```
```python
# Might not be efficient (the only method that was slower in my tests)!
%timeit df.Cabin.loc[df.Cabin.ne_search_for_string_series_equal('C123')]
252 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit df.Cabin.loc[df.Cabin=='C123']
158 µs ± 1.28 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Out[21]:
PassengerId Survived Pclass ... Fare Cabin Embarked
3 4 1 1 ... 53.1 C123 S
137 138 0 1 ... 53.1 C123 S
[2 rows x 12 columns]
```
```python
# Returns a 2-D bool array (one value per cell)
df.ne_search_for_string_contains('C1')
Out[7]:
array([[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
...,
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, True, False],
[False, False, False, ..., False, False, False]])
```
```python
# Returns a bool mask, usable with df.loc
df.loc[df.Cabin.ne_search_for_string_contains('C1')]
Out[14]:
PassengerId Survived Pclass ... Fare Cabin Embarked
3 4 1 1 ... 53.1000 C123 S
11 12 1 1 ... 26.5500 C103 S
110 111 0 1 ... 52.0000 C110 S
137 138 0 1 ... 53.1000 C123 S
268 269 1 1 ... 153.4625 C125 S
273 274 0 1 ... 29.7000 C118 C
298 299 1 1 ... 30.5000 C106 S
331 332 0 1 ... 28.5000 C124 S
351 352 0 1 ... 35.0000 C128 S
449 450 1 1 ... 30.5000 C104 S
452 453 0 1 ... 27.7500 C111 C
571 572 1 1 ... 51.4792 C101 S
609 610 1 1 ... 153.4625 C125 S
669 670 1 1 ... 52.0000 C126 S
711 712 0 1 ... 26.5500 C124 S
712 713 1 1 ... 52.0000 C126 S
889 890 1 1 ... 30.0000 C148 C
[17 rows x 12 columns]
%timeit df.loc[df.Cabin.ne_search_for_string_contains('C1')]
329 µs ± 1.18 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit df.loc[df.Cabin.str.contains('C1',na=False)]
403 µs ± 924 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
```python
# Returns the index of all rows where the value was found.
# Exceptions (e.g. wrong datatype etc.) are ignored.
# Duplicates (several hits in one row) are not removed
df.ne_equal_df_ind(1)
array([ 0, 1, 2, 3, 8, 9, 10, 11, 15, 17, 19, 21, 22,
23, 25, 28, 31, 32, 36, 39, 43, 44, 47, 52, 53, 55,
56, 58, 61, 65, 66, 68, 74, 78, 79, 81, 82, 84, 85,
88, 97, 98, 106, 107, 109, 123...]
```
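The DataFrame-wide `*_df_ind` helpers presumably evaluate the condition column by column with NumExpr, skip columns whose dtype the expression cannot handle, and concatenate the matching row indices, which is why a row index can appear once per matching column. A rough, hypothetical sketch of that idea (not the package's code):

```python
import numexpr
import numpy as np
import pandas as pd


def equal_df_ind_sketch(df: pd.DataFrame, value) -> np.ndarray:
    hits = []
    for col in df.columns:
        arr = df[col].to_numpy()
        try:
            mask = numexpr.evaluate("b == value", local_dict={"b": arr, "value": value})
        except Exception:
            # columns with incompatible dtypes (e.g. strings vs. a number) are skipped
            continue
        hits.append(df.index.to_numpy()[mask])
    if not hits:
        return np.array([], dtype=np.int64)
    # one entry per hit, so duplicates remain when several columns match in the same row
    return np.sort(np.concatenate(hits))


# the *_no_dup variants would then amount to np.unique(...) on this result
```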
```python
# You can pass dtype='S' to convert the values to string
# (or other formats) before performing the search.
# If you use 'S', you have to pass a bytes value
df.ne_equal_df_ind(b'1', 'S')
Out[16]:
array([ 0, 1, 2, 3, 8, 9, 10, 11, 15, 17, 19, 21, 22,
23, 25, 28, 31, 32, 36, 39, 43, 44, 47, 52, 53, 55,
56, 58, 61, 65, 66, 68, 74,...]
```
```python
# same as DataFrame.ne_equal_df_ind
# but deletes all duplicates
df.ne_equal_df_ind_no_dup(b'1', 'S')
array([ 0, 1, 2, 3, 6, 7, 8, 9, 10, 11, 13, 15, 16,
17, 18, 19, 21, 22, 23, 24, 25, 27, 28, 30, 31, 32,
34, 35, 36, 39, 40, 41, 43, 44
```
```python
# Same as DataFrame.ne_equal_df_ind,
# but returns the DataFrame (df.loc[])
df.ne_equal_df_dup(b'1', 'S')
Out[18]:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
8 9 1 3 ... 11.1333 NaN S
.. ... ... ... ... ... ... ...
856 857 1 1 ... 164.8667 NaN S
869 870 1 3 ... 11.1333 NaN S
871 872 1 1 ... 52.5542 D35 S
879 880 1 1 ... 83.1583 C50 C
880 881 1 2 ... 26.0000 NaN S
[886 rows x 12 columns]
```
```python
# Same as DataFrame.ne_equal_df_ind_no_dup
# but returns the DataFrame (df.loc)
df.ne_equal_df_no_dup(b'1', 'S')
Out[19]:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
6 7 0 1 ... 51.8625 E46 S
.. ... ... ... ... ... ... ...
879 880 1 1 ... 83.1583 C50 C
880 881 1 2 ... 26.0000 NaN S
887 888 1 1 ... 30.0000 B42 S
888 889 0 3 ... 23.4500 NaN S
889 890 1 1 ... 30.0000 C148 C
[524 rows x 12 columns]
```
```python
# Returns bool
df.PassengerId.ne_equal(1)
array([ True, False, False, False, False, False, False, False, False,
       False, False, False, ...]
df.loc[df.PassengerId.ne_equal(1)]
%timeit df.loc[df.PassengerId.ne_equal(1)]
138 µs ± 626 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df.loc[df.PassengerId == 1]
209 µs ± 1.04 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
```python
# Every time the condition is False for a cell, the row's
# index is added to the return value.
# Example:
# A row has 6 columns and 2 of them contain the value 1;
# the row's index is then added 4 times (once per
# non-matching column) to the final result
df.ne_not_equal_df_ind(1)
array([ 1, 2, 3, ..., 888, 889, 890], dtype=int64)
df.loc[df.ne_not_equal_df_ind(1)]
PassengerId Survived Pclass ... Fare Cabin Embarked
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S
5 6 0 3 ... 8.4583 NaN Q
.. ... ... ... ... ... ... ...
886 887 0 2 ... 13.0000 NaN S
887 888 1 1 ... 30.0000 B42 S
888 889 0 3 ... 23.4500 NaN S
889 890 1 1 ... 30.0000 C148 C
890 891 0 3 ... 7.7500 NaN Q
[5344 rows x 12 columns]
```
```python
# Same as DataFrame.ne_not_equal_df_ind
# but drops all duplicates
df.ne_not_equal_df_ind_no_dup(0)
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16,...]
df.loc[df.ne_not_equal_df_ind_no_dup(0)]
Out[26]:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S
.. ... ... ... ... ... ... ...
886 887 0 2 ... 13.0000 NaN S
887 888 1 1 ... 30.0000 B42 S
888 889 0 3 ... 23.4500 NaN S
889 890 1 1 ... 30.0000 C148 C
890 891 0 3 ... 7.7500 NaN Q
[891 rows x 12 columns]
```
```python
# same as DataFrame.ne_not_equal_df_ind
# but returns the DataFrame (df.loc)
df.ne_not_equal_df_dup(0)
Out[28]:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S
.. ... ... ... ... ... ... ...
886 887 0 2 ... 13.0000 NaN S
887 888 1 1 ... 30.0000 B42 S
888 889 0 3 ... 23.4500 NaN S
889 890 1 1 ... 30.0000 C148 C
890 891 0 3 ... 7.7500 NaN Q
[4387 rows x 12 columns]
```
```python
# Same as DataFrame.ne_not_equal_df_ind_no_dup
# but returns the DataFrame (df.loc)
df.ne_not_equal_df_no_dup(0)
Out[29]:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S
.. ... ... ... ... ... ... ...
886 887 0 2 ... 13.0000 NaN S
887 888 1 1 ... 30.0000 B42 S
888 889 0 3 ... 23.4500 NaN S
889 890 1 1 ... 30.0000 C148 C
890 891 0 3 ... 7.7500 NaN Q
[891 rows x 12 columns]
```
```python
# Returns bool
df.Survived.ne_not_equal(0)
array([False, True, True, True, False, False, False, False, True,
True, True, True, False, False, False, True, False, True,
False, True ...]
df.loc[df.Survived.ne_not_equal(0)]
PassengerId Survived Pclass ... Fare Cabin Embarked
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
8 9 1 3 ... 11.1333 NaN S
9 10 1 2 ... 30.0708 NaN C
.. ... ... ... ... ... ... ...
875 876 1 3 ... 7.2250 NaN C
879 880 1 1 ... 83.1583 C50 C
880 881 1 2 ... 26.0000 NaN S
887 888 1 1 ... 30.0000 B42 S
889 890 1 1 ... 30.0000 C148 C
[342 rows x 12 columns]
%timeit df.loc[df.Survived.ne_not_equal(0)]
157 µs ± 390 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df.loc[df.Survived!=0]
229 µs ± 1.46 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
```python
# returns index, duplicates are possible
# if the condition is valid for more than one
# column. Exceptions (e.g. wrong dtype) are ignored
df.ne_greater_than_df_ind(100)
array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112,
113, 114, 115...]
```
```python
# Same as DataFrame.ne_greater_than_df_ind
# but gets rid of all duplicates
df.ne_greater_than_df_ind_no_dup(0)
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16...]
```
```python
# Same as DataFrame.ne_greater_than_df_ind
# but returns the DataFrame (df.loc)
df.ne_greater_than_df_dup(0)
Out[22]:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S
.. ... ... ... ... ... ... ...
886 887 0 2 ... 13.0000 NaN S
887 888 1 1 ... 30.0000 B42 S
888 889 0 3 ... 23.4500 NaN S
889 890 1 1 ... 30.0000 C148 C
890 891 0 3 ... 7.7500 NaN Q
[4210 rows x 12 columns]
```
```python
# same as DataFrame.ne_greater_than_df_ind_no_dup
# but returns the DataFrame (df.loc)
df.ne_greater_than_df_no_dup(600)
Out[24]:
PassengerId Survived Pclass ... Fare Cabin Embarked
600 601 1 2 ... 27.0000 NaN S
601 602 0 3 ... 7.8958 NaN S
602 603 0 1 ... 42.4000 NaN S
603 604 0 3 ... 8.0500 NaN S
604 605 1 1 ... 26.5500 NaN C
.. ... ... ... ... ... ... ...
886 887 0 2 ... 13.0000 NaN S
887 888 1 1 ... 30.0000 B42 S
888 889 0 3 ... 23.4500 NaN S
889 890 1 1 ... 30.0000 C148 C
890 891 0 3 ... 7.7500 NaN Q
[291 rows x 12 columns]
```
```python
# Returns bool
df.PassengerId.ne_greater_than(5)
Out[26]:
array([False, False, False, False, False, True, True, True, True,
True, True, True, True, True, True...]
df.loc[df.PassengerId.ne_greater_than(100)]
%timeit df.loc[df.PassengerId.ne_greater_than(100)]
174 µs ± 375 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df.loc[df.PassengerId>100]
248 µs ± 2.26 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
```python
# returns index, duplicates are possible
# if the condition is valid for more than one
# column. Exceptions (e.g. wrong dtype) are ignored
df.ne_less_than_df_ind(10)
array([ 0, 1, 2, ..., 881, 884, 890], dtype=int64)
```
```python
# Same as DataFrame.ne_less_than_df_ind
# but without duplicates
df.ne_less_than_df_ind_no_dup(100)
Out[28]:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21,...]
```
```python
# Same as DataFrame.ne_less_than_df_ind,
# but returns DataFrame (df.loc)
df.ne_less_than_df_dup(1)
Out[29]:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
4 5 0 3 ... 8.0500 NaN S
5 6 0 3 ... 8.4583 NaN Q
6 7 0 1 ... 51.8625 E46 S
7 8 0 3 ... 21.0750 NaN S
.. ... ... ... ... ... ... ...
674 675 0 2 ... 0.0000 NaN S
732 733 0 2 ... 0.0000 NaN S
806 807 0 1 ... 0.0000 A36 S
815 816 0 1 ... 0.0000 B102 S
822 823 0 1 ... 0.0000 NaN S
[1857 rows x 12 columns]
```
```python
# Same as DataFrame.ne_less_than_df_ind_no_dup
# but returns DataFrame (df.loc)
df.ne_less_than_df_no_dup(1)
Out[30]:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S
.. ... ... ... ... ... ... ...
886 887 0 2 ... 13.0000 NaN S
887 888 1 1 ... 30.0000 B42 S
888 889 0 3 ... 23.4500 NaN S
889 890 1 1 ... 30.0000 C148 C
890 891 0 3 ... 7.7500 NaN Q
[834 rows x 12 columns]
```
```python
# Returns bool
df.PassengerId.ne_less_than(100)
Out[31]:
array([ True, True, True, True, True, True...]
%timeit df.loc[df.PassengerId.ne_less_than(100)]
142 µs ± 412 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df.loc[df.PassengerId <100]
212 µs ± 897 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
```python
# returns index, duplicates are possible
# if the condition is valid for more than one
# column. Exceptions (e.g. wrong dtype) are ignored
df.ne_greater_than_or_equal_to_df_ind(100)
Out[35]:
array([ 99, 100, 101, 102, 103, 104, ...]
```
```python
# Same as DataFrame.ne_greater_than_or_equal_to_df_ind,
# but without duplicates
df.ne_greater_than_or_equal_to_df_ind_no_dup(100)
Out[36]:
array([ 27, 31, 88, 99, 100, 101, 102,...]
```
```python
# Same as DataFrame.ne_greater_than_or_equal_to_df_ind,
# but returns DataFrame (df.loc)
df.ne_greater_than_or_equal_to_df_dup(100)
Out[37]:
PassengerId Survived Pclass ... Fare Cabin Embarked
99 100 0 2 ... 26.0000 NaN S
100 101 0 3 ... 7.8958 NaN S
101 102 0 3 ... 7.8958 NaN S
102 103 0 1 ... 77.2875 D26 S
103 104 0 3 ... 8.6542 NaN S
.. ... ... ... ... ... ... ...
742 743 1 1 ... 262.3750 B57 B59 B63 B66 C
763 764 1 1 ... 120.0000 B96 B98 S
779 780 1 1 ... 211.3375 B3 S
802 803 1 1 ... 120.0000 B96 B98 S
856 857 1 1 ... 164.8667 NaN S
[845 rows x 12 columns]
```
```python
# Same as DataFrame.ne_greater_than_or_equal_to_df_ind_no_dup,
# but returns DataFrame (df.loc)
df.ne_greater_than_or_equal_to_df_no_dup(100)
Out[38]:
PassengerId Survived Pclass ... Fare Cabin Embarked
27 28 0 1 ... 263.0000 C23 C25 C27 S
31 32 1 1 ... 146.5208 B78 C
88 89 1 1 ... 263.0000 C23 C25 C27 S
99 100 0 2 ... 26.0000 NaN S
100 101 0 3 ... 7.8958 NaN S
.. ... ... ... ... ... ... ...
886 887 0 2 ... 13.0000 NaN S
887 888 1 1 ... 30.0000 B42 S
888 889 0 3 ... 23.4500 NaN S
889 890 1 1 ... 30.0000 C148 C
890 891 0 3 ... 7.7500 NaN Q
[795 rows x 12 columns]
```
```python
# returns bool
df.PassengerId.ne_greater_than_or_equal_to(100)
Out[39]:
array([False, False, False, False, False...])
%timeit df.loc[df.PassengerId.ne_greater_than_or_equal_to(100)]
175 µs ± 832 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df.loc[df.PassengerId>=100]
251 µs ± 2.77 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
```python
# returns index, duplicates are possible
# if the condition is valid for more than one
# column. Exceptions (e.g. wrong dtype) are ignored
df.ne_less_than_or_equal_to_df_ind(100)
Out[40]: array([ 0, 1, 2, ..., 888, 889, 890], dtype=int64)
```
```python
# Same as DataFrame.ne_less_than_or_equal_to_df_ind,
# but without duplicates
df.ne_less_than_or_equal_to_df_ind_no_dup(100)
Out[41]:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, ...])
```
```python
# Same as DataFrame.ne_less_than_or_equal_to_df_ind,
# but returns DataFrame (df.loc)
df.ne_less_than_or_equal_to_df_dup(100)
Out[42]:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S
.. ... ... ... ... ... ... ...
886 887 0 2 ... 13.0000 NaN S
887 888 1 1 ... 30.0000 B42 S
888 889 0 3 ... 23.4500 NaN S
889 890 1 1 ... 30.0000 C148 C
890 891 0 3 ... 7.7500 NaN Q
[5216 rows x 12 columns]
```
```python
# Same as DataFrame.ne_less_than_or_equal_to_df_ind_no_dup,
# but returns DataFrame (df.loc)
df.ne_less_than_or_equal_to_df_no_dup(0)
Out[53]:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S
.. ... ... ... ... ... ... ...
886 887 0 2 ... 13.0000 NaN S
887 888 1 1 ... 30.0000 B42 S
888 889 0 3 ... 23.4500 NaN S
889 890 1 1 ... 30.0000 C148 C
890 891 0 3 ... 7.7500 NaN Q
[829 rows x 12 columns]
```
```python
# returns bool
df.PassengerId.ne_less_than_or_equal_to(100)
Out[55]:
array([ True, True, True, True, ....]
%timeit df.loc[df.PassengerId.ne_less_than_or_equal_to(100)]
145 µs ± 1.82 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df.loc[df.PassengerId <=100]
212 µs ± 1.63 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
```python
# Combining conditions
%timeit df.loc[df.PassengerId.ne_greater_than(100) & df.Cabin.ne_search_for_string_series_contains('C1')]
360 µs ± 2.56 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit df.loc[(df.PassengerId>100) & df.Cabin.str.contains('C1',na=False)]
552 µs ± 3.49 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
```python
# You can pass your own queries.
# When comparing the DataFrame/Series to another array,
# the variable 'b' always represents the DataFrame/Series itself,
# so don't use 'b' for anything else
import numpy as np
wholedict = {'c': np.array([1])}
df[['Survived','Pclass']].ne_query('b == c',local_dict=wholedict)
Out[14]:
array([[False, False],
[ True, True],
[ True, False],
...,
[False, False],
[ True, True],
[False, False]])
# You can use any NumExpr operator/function
# https://numexpr.readthedocs.io/projects/NumExpr3/en/latest/user_guide.html
# And get a tremendous speedup (even with small DataFrames)
%timeit df['Survived'] + df.Pclass
68.6 µs ± 167 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df['Survived'] * df.Pclass
69 µs ± 260 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df['Survived'] == df.Pclass
72.3 µs ± 817 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
# You have to pass the Series/Arrays that you are using in the expression as a dict (local_dict)
wholedict = {'c': df.Pclass}
%timeit df['Survived'].ne_query('b + c',local_dict=wholedict)
25.2 µs ± 130 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df['Survived'].ne_query('b * c',local_dict=wholedict)
25.3 µs ± 177 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df['Survived'].ne_query('b == c',local_dict=wholedict)
25.2 µs ± 197 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
# Exceptions are not ignored
# If you want to compare the DataFrame with a scalar:
df[['Survived','Pclass']].ne_query('b == 1')
# works also for Series
wholedict = {'c': np.array([1])}
df['Survived'].ne_query('b == c',local_dict=wholedict)
# scalar
df['Pclass'].ne_query('b == 1')
%timeit df.loc[df['Pclass'].ne_query('b == 1')]
155 µs ± 530 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df.loc[df['Pclass'] == 1]
220 µs ± 3.96 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
Raw data
{
"_id": null,
"home_page": "https://github.com/hansalemaos/a_pandas_ex_numexpr",
"name": "a-pandas-ex-numexpr",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "numexpr,numpy,sort,pandas,series",
"author": "Johannes Fischer",
"author_email": "<aulasparticularesdealemaosp@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/7c/fa/67301d80ba78883296c68552984ecfd440d241831289440d87c895ac0c3c/a_pandas_ex_numexpr-0.10.tar.gz",
"platform": null,
"description": "\n# Pandas DataFrame Operations 8 times faster (or even more)\n\n\n\nDataFrame.query has never worked for me. On my PC, it has been extremely slow when using small DataFrames, and only a little bit, if at all, faster when using huge DataFrames. \n\n\n\nDataFrame.query uses pd.eval, and pd.eval uses numexpr. The weird thing is that numexpr is insanely fast when it is used against a DataFrame, but nor pd.eval neither DataFrame.query aren\u2019t. First I thought there was a problem with my Pandas/environment configuration, but then I read on the[ Pandas page](https://pandas.pydata.org/docs/user_guide/indexing.html#performance-of-query):\n\n\n\n_You will only see the performance benefits of using the numexpr engine with DataFrame.query() if your frame has more than approximately 200,000 rows._\n\n\n\nWell, **a_pandas_ex_numexpr** adds different methods to the DataFrame/Series classes, and will get tremendous speed-ups **(up to 8 times faster in my tests)** even for small DataFrames. All tests were done using: [https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv](https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv)\n\n\n\n**Let the numbers speak for themselves**\n\n\n\n## How to import / use a_pandas_ex_numexpr\n\n\n\n```python\n\nfrom a_pandas_ex_numexpr import pd_add_numexpr\n\npd_add_numexpr()\n\nimport pandas as pd\n\ndafra = \"https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv\"\n\ndf = pd.read_csv(dafra)\n\n\n\n\n\n\n\ndf\n\nOut[3]: \n\n PassengerId Survived Pclass ... Fare Cabin Embarked\n\n0 1 0 3 ... 7.2500 NaN S\n\n1 2 1 1 ... 71.2833 C85 C\n\n2 3 1 3 ... 7.9250 NaN S\n\n3 4 1 1 ... 53.1000 C123 S\n\n4 5 0 3 ... 8.0500 NaN S\n\n.. ... ... ... ... ... ... ...\n\n886 887 0 2 ... 13.0000 NaN S\n\n887 888 1 1 ... 30.0000 B42 S\n\n888 889 0 3 ... 23.4500 NaN S\n\n889 890 1 1 ... 30.0000 C148 C\n\n890 891 0 3 ... 7.7500 NaN Q\n\n[891 rows x 12 columns]\n\n```\n\n\n\n## Speed test - a_pandas_ex_numexpr\n\n\n\n```python\n\n# Code explanation at the end of the page\n\nwholedict = {'c': df.Pclass}\n\n%timeit df['Survived'].ne_query('b * 99.5 / 000.1 + 42123.323211 / 1335523.42232 * c', return_np=True, local_dict=wholedict)\n\n30.8 \u00b5s \u00b1 229 ns per loop (mean \u00b1 std. dev. of 7 runs, 10,000 loops each)\n\n%timeit df['Survived'].ne_query('b * 99.5 / 000.1 + 42123.323211 / 1335523.42232 * c', return_np=False, local_dict=wholedict)\n\n70.1 \u00b5s \u00b1 2.44 \u00b5s per loop (mean \u00b1 std. dev. of 7 runs, 10,000 loops each)\n\n%timeit df['Survived'] * 99.5 / 000.1 + 42123.323211 / 1335523.42232 * df['Pclass']\n\n262 \u00b5s \u00b1 4.25 \u00b5s per loop (mean \u00b1 std. dev. of 7 runs, 1,000 loops each)\n\n%timeit pd.eval(\"df.Survived * 99.5 / 000.1 + 42123.323211 / 1335523.42232 * df.Pclass\") #used by df.query\n\n1.37 ms \u00b1 45.4 \u00b5s per loop (mean \u00b1 std. dev. of 7 runs, 1,000 loops each)\n\n\n\n\n\ndf['Survived'].ne_query('b * 99.5 / 000.1 + 42123.323211 / 1335523.42232 * c', return_np=False, local_dict=wholedict)\n\nOut[33]: \n\n0 0.094622\n\n1 995.031541\n\n2 995.094622\n\n3 995.031541\n\n4 0.094622\n\n ... \n\n886 0.063081\n\n887 995.031541\n\n888 0.094622\n\n889 995.031541\n\n890 0.094622\n\nLength: 891, dtype: float64\n\n\n\n\n\ndf['Survived'] * 99.5 / 000.1 + 42123.323211 / 1335523.42232 * df['Pclass']\n\nOut[34]: \n\n0 0.094622\n\n1 995.031541\n\n2 995.094622\n\n3 995.031541\n\n4 0.094622\n\n ... 
\n\n886 0.063081\n\n887 995.031541\n\n888 0.094622\n\n889 995.031541\n\n890 0.094622\n\nLength: 891, dtype: float64\n\n```\n\n\n\n```python\n\nwholedict = {'c': df.Pclass}\n\n%timeit df['Survived'].ne_query('b * 99.5 * c', return_np=True, local_dict=wholedict)\n\n27 \u00b5s \u00b1 245 ns per loop (mean \u00b1 std. dev. of 7 runs, 10,000 loops each)\n\n%timeit df['Survived'].ne_query('b * 99.5 * c', return_np=False, local_dict=wholedict)\n\n65.7 \u00b5s \u00b1 1.65 \u00b5s per loop (mean \u00b1 std. dev. of 7 runs, 10,000 loops each)\n\n%timeit df['Survived'] * 99.5 * df['Pclass']\n\n140 \u00b5s \u00b1 5.46 \u00b5s per loop (mean \u00b1 std. dev. of 7 runs, 10,000 loops each)\n\n%timeit pd.eval(\"df.Survived * 99.5 * df.Pclass\")\n\n916 \u00b5s \u00b1 7.1 \u00b5s per loop (mean \u00b1 std. dev. of 7 runs, 1,000 loops each)\n\n```\n\n\n\n```python\n\nwholedict = {'c': df.Pclass}\n\n%timeit df['Survived'].ne_query('b / c', return_np=True, local_dict=wholedict)\n\n26.5 \u00b5s \u00b1 200 ns per loop (mean \u00b1 std. dev. of 7 runs, 10,000 loops each)\n\n%timeit df['Survived'].ne_query('b / c', return_np=False, local_dict=wholedict) # returns a Series\n\n60.3 \u00b5s \u00b1 336 ns per loop (mean \u00b1 std. dev. of 7 runs, 10,000 loops each)\n\n%timeit df['Survived'] / df['Pclass']\n\n68.2 \u00b5s \u00b1 599 ns per loop (mean \u00b1 std. dev. of 7 runs, 10,000 loops each)\n\n%timeit pd.eval(\"df.Survived / df.Pclass\")\n\n929 \u00b5s \u00b1 31.7 \u00b5s per loop (mean \u00b1 std. dev. of 7 runs, 1,000 loops each)\n\n```\n\n\n\n## All functions/methods\n\n\n\n## Speed of some \u201cready to use methods\u201d for Series\n\n\n\n```python\n\n%timeit df.loc[df.PassengerId.ne_less_than(100)]\n\n142 \u00b5s \u00b1 412 ns per loop (mean \u00b1 std. dev. of 7 runs, 10,000 loops each)\n\n%timeit df.loc[df.PassengerId <100]\n\n212 \u00b5s \u00b1 897 ns per loop (mean \u00b1 std. dev. of 7 runs, 1,000 loops each)\n\n##############################################################\n\n%timeit df.loc[df.Survived.ne_not_equal(0)]\n\n157 \u00b5s \u00b1 390 ns per loop (mean \u00b1 std. dev. of 7 runs, 10,000 loops each)\n\n%timeit df.loc[df.Survived!=0]\n\n229 \u00b5s \u00b1 1.46 \u00b5s per loop (mean \u00b1 std. dev. of 7 runs, 1,000 loops each)\n\n##############################################################\n\n%timeit df.loc[df.PassengerId.ne_greater_than(100)]\n\n174 \u00b5s \u00b1 375 ns per loop (mean \u00b1 std. dev. of 7 runs, 10,000 loops each)\n\n%timeit df.loc[df.PassengerId>100]\n\n248 \u00b5s \u00b1 2.26 \u00b5s per loop (mean \u00b1 std. dev. of 7 runs, 1,000 loops each)\n\n##############################################################\n\n%timeit df.loc[df.PassengerId.ne_equal(1)]\n\n138 \u00b5s \u00b1 626 ns per loop (mean \u00b1 std. dev. of 7 runs, 10,000 loops each)\n\n%timeit df.loc[df.PassengerId == 1]\n\n209 \u00b5s \u00b1 1.04 \u00b5s per loop (mean \u00b1 std. dev. of 7 runs, 1,000 loops each)\n\n##############################################################\n\n%timeit df.loc[df.Cabin.ne_search_for_string_contains('C1')]\n\n329 \u00b5s \u00b1 1.18 \u00b5s per loop (mean \u00b1 std. dev. of 7 runs, 1,000 loops each)\n\n%timeit df.loc[df.Cabin.str.contains('C1',na=False)]\n\n403 \u00b5s \u00b1 924 ns per loop (mean \u00b1 std. dev. of 7 runs, 1,000 loops each)\n\n##############################################################\n\n%timeit df.loc[df.PassengerId.ne_greater_than_or_equal_to(100)]\n\n175 \u00b5s \u00b1 832 ns per loop (mean \u00b1 std. dev. 
of 7 runs, 10,000 loops each)\n\n%timeit df.loc[df.PassengerId>=100]\n\n251 \u00b5s \u00b1 2.77 \u00b5s per loop (mean \u00b1 std. dev. of 7 runs, 1,000 loops each)\n\n##############################################################\n\n%timeit df.loc[df.PassengerId.ne_less_than_or_equal_to(100)]\n\n145 \u00b5s \u00b1 1.82 \u00b5s per loop (mean \u00b1 std. dev. of 7 runs, 10,000 loops each)\n\n%timeit df.loc[df.PassengerId <=100]\n\n212 \u00b5s \u00b1 1.63 \u00b5s per loop (mean \u00b1 std. dev. of 7 runs, 1,000 loops each)\n\n##############################################################\n\n```\n\n\n\n## Overview - all methods for DataFrames/Series\n\n\n\n```python\n\n# Always use 'b' as the variable for the Series/DataFrame\n\ndf.ne_search_in_all_columns('b == 1')\n\narray([ 0, 1, 2, 3, 8, 9, 10, 11, 15, 17, 19, 21, 22,\n\n 23, 25, 28, 31, 32, 36, 39, 43, 44, 47, 52, 53, 55,\n\n 56, 58, 61, 65, 66, 68, 74, 78, 79, 81, 82, 84, 85,\n\n 88, 97, 98, 106, 107, 109, 123, 125, 127, 128, 133, 136, 141,\n\n 142, 146, 151, 156, 161, 165, 166, 172, 183, 184, 186, 187, 190,\n\n 192, 193, 194, 195, 198 ...]\n\n```\n\n\n\n```python\n\n # Returns duplicated index if the value is found in\n\n # several columns. Exceptions will be ignored\n\n # the dtype argument is useful when searching for\n\n # strings -> dtype='S' (ascii only)\n\n df.ne_search_in_all_columns('b == \"1\"', dtype='S')\n\n\n\narray([ 0, 1, 2, 3, 8, 9, 10, 11, 15, 17, 19, 21, 22,\n\n 23, 25, 28, 31, 32, 36, 39, 43, 44, 47, 52, 53, 55,\n\n 56, 58, 61, 65, 66, 68, 74, 78, 79, 81, 82, 84, 85,\n\n 88, 97, 98, 106, 107, 109, ...]\n\n```\n\n\n\n```python\n\n # Converts all columns to dtype='S' before searching\n\n # Might not work with special characters\n\n # UnicodeEncodeError: 'ascii' codec can't encode character '\\xe4' in position 0:\n\n df.ne_search_string_allhits_contains('C1')\n\nOut[6]: \n\n PassengerId Survived Pclass ... Fare Cabin Embarked\n\n3 4 1 1 ... 53.1000 C123 S\n\n11 12 1 1 ... 26.5500 C103 S\n\n110 111 0 1 ... 52.0000 C110 S\n\n137 138 0 1 ... 53.1000 C123 S\n\n268 269 1 1 ... 153.4625 C125 S\n\n273 274 0 1 ... 29.7000 C118 C\n\n298 299 1 1 ... 30.5000 C106 S\n\n331 332 0 1 ... 28.5000 C124 S\n\n351 352 0 1 ... 35.0000 C128 S\n\n449 450 1 1 ... 30.5000 C104 S\n\n452 453 0 1 ... 27.7500 C111 C\n\n571 572 1 1 ... 51.4792 C101 S\n\n609 610 1 1 ... 153.4625 C125 S\n\n669 670 1 1 ... 52.0000 C126 S\n\n711 712 0 1 ... 26.5500 C124 S\n\n712 713 1 1 ... 52.0000 C126 S\n\n889 890 1 1 ... 30.0000 C148 C\n\n[17 rows x 12 columns]\n\n```\n\n\n\n\n\n```python\n\n# Series doesn't return duplicated results\n\ndf.Cabin.ne_search_string_allhits_contains('C1')\n\nOut[9]: \n\n3 C123\n\n11 C103\n\n110 C110\n\n137 C123\n\n268 C125\n\n273 C118\n\n298 C106\n\n331 C124\n\n351 C128\n\n449 C104\n\n452 C111\n\n571 C101\n\n609 C125\n\n669 C126\n\n711 C124\n\n712 C126\n\n889 C148\n\nName: Cabin, dtype: object\n\n\n\n%timeit df.Cabin.ne_search_string_allhits_contains('C1')\n\n274 \u00b5s \u00b1 2.74 \u00b5s per loop (mean \u00b1 std. dev. of 7 runs, 1,000 loops each)\n\n\n\n%timeit df.Cabin.loc[df.Cabin.str.contains('C1', na=False)]\n\n351 \u00b5s \u00b1 1.16 \u00b5s per loop (mean \u00b1 std. dev. of 7 runs, 1,000 loops each)\n\n```\n\n\n\n```python\n\n# All rows where the string/substring C1 is found.\n\n# Numbers are converted to string (ascii)\n\ndf.ne_search_string_dataframe_contains('C1')\n\nOut[13]: \n\n PassengerId Survived Pclass ... Fare Cabin Embarked\n\n3 4 1 1 ... 53.1000 C123 S\n\n11 12 1 1 ... 26.5500 C103 S\n\n110 111 0 1 ... 
52.0000 C110 S\n\n137 138 0 1 ... 53.1000 C123 S\n\n268 269 1 1 ... 153.4625 C125 S\n\n273 274 0 1 ... 29.7000 C118 C\n\n298 299 1 1 ... 30.5000 C106 S\n\n331 332 0 1 ... 28.5000 C124 S\n\n351 352 0 1 ... 35.0000 C128 S\n\n449 450 1 1 ... 30.5000 C104 S\n\n452 453 0 1 ... 27.7500 C111 C\n\n571 572 1 1 ... 51.4792 C101 S\n\n609 610 1 1 ... 153.4625 C125 S\n\n669 670 1 1 ... 52.0000 C126 S\n\n711 712 0 1 ... 26.5500 C124 S\n\n712 713 1 1 ... 52.0000 C126 S\n\n889 890 1 1 ... 30.0000 C148 C\n\n[17 rows x 12 columns]\n\n\n\n\n\ndf.ne_search_string_dataframe_contains('610')\n\nOut[14]: \n\n PassengerId Survived Pclass ... Fare Cabin Embarked\n\n194 195 1 1 ... 27.7208 B4 C\n\n609 610 1 1 ... 153.4625 C125 S\n\n[2 rows x 12 columns]\n\n```\n\n\n\n```python\n\n# Converts all columns to ascii and searches in each column\n\n# For each presence in a column, you get a duplicate of the index\n\ndf.ne_search_string_dataframe_allhits_equal('1')\n\ndf.ne_search_string_dataframe_allhits_equal('1')\n\nOut[15]: \n\n PassengerId Survived Pclass ... Fare Cabin Embarked\n\n0 1 0 3 ... 7.2500 NaN S\n\n0 1 0 3 ... 7.2500 NaN S\n\n1 2 1 1 ... 71.2833 C85 C\n\n1 2 1 1 ... 71.2833 C85 C\n\n1 2 1 1 ... 71.2833 C85 C\n\n.. ... ... ... ... ... ... ...\n\n887 888 1 1 ... 30.0000 B42 S\n\n887 888 1 1 ... 30.0000 B42 S\n\n888 889 0 3 ... 23.4500 NaN S\n\n889 890 1 1 ... 30.0000 C148 C\n\n889 890 1 1 ... 30.0000 C148 C\n\n[886 rows x 12 columns]\n\n```\n\n\n\n```python\n\n# All equal strings in a Series\n\ndf.Embarked.ne_search_string_dataframe_allhits_equal('S')\n\n Out[16]: \n\n0 S\n\n2 S\n\n3 S\n\n4 S\n\n6 S\n\n ..\n\n883 S\n\n884 S\n\n886 S\n\n887 S\n\n888 S\n\nName: Embarked, Length: 644, dtype: object\n\n\n\n%timeit df.Embarked.ne_search_string_dataframe_allhits_equal('S')\n\n160 \u00b5s \u00b1 2.14 \u00b5s per loop (mean \u00b1 std. dev. of 7 runs, 10,000 loops each)\n\n%timeit df.Embarked.loc[df.Embarked=='S']\n\n178 \u00b5s \u00b1 3.04 \u00b5s per loop (mean \u00b1 std. dev. of 7 runs, 10,000 loops each)\n\n```\n\n\n\n```python\n\n# Converts the whole df to ascii and checks where the\n\n# the value is present. Exceptions are ignored\n\ndf.ne_search_string_dataframe_equal('C123')\n\nOut[20]: \n\n PassengerId Survived Pclass ... Fare Cabin Embarked\n\n3 4 1 1 ... 53.1 C123 S\n\n137 138 0 1 ... 53.1 C123 S\n\n[2 rows x 12 columns]\n\n```\n\n\n\n```python\n\n# Might not be efficient (The only method that was slower during testing)! \n\n%timeit df.Cabin.loc[df.Cabin.ne_search_for_string_series_equal('C123')]\n\n252 \u00b5s \u00b1 1.02 \u00b5s per loop (mean \u00b1 std. dev. of 7 runs, 1,000 loops each)\n\n%timeit df.Cabin.loc[df.Cabin=='C123']\n\n158 \u00b5s \u00b1 1.28 \u00b5s per loop (mean \u00b1 std. dev. of 7 runs, 10,000 loops each)\n\n\n\nOut[21]: \n\n PassengerId Survived Pclass ... Fare Cabin Embarked\n\n3 4 1 1 ... 53.1 C123 S\n\n137 138 0 1 ... 53.1 C123 S\n\n[2 rows x 12 columns]\n\n```\n\n\n\n```python\n\n# Returns bool values\n\ndf.loc[df.ne_search_for_string_contains('C1')]\n\nOut[7]: \n\narray([[False, False, False, ..., False, False, False],\n\n [False, False, False, ..., False, False, False],\n\n [False, False, False, ..., False, False, False],\n\n ...,\n\n [False, False, False, ..., False, False, False],\n\n [False, False, False, ..., False, True, False],\n\n [False, False, False, ..., False, False, False]])\n\n```\n\n\n\n```python\n\n# returns Bool\n\ndf.loc[df.Cabin.ne_search_for_string_contains('C1')]\n\n\n\nOut[14]: \n\n PassengerId Survived Pclass ... Fare Cabin Embarked\n\n3 4 1 1 ... 
53.1000 C123 S\n\n11 12 1 1 ... 26.5500 C103 S\n\n110 111 0 1 ... 52.0000 C110 S\n\n137 138 0 1 ... 53.1000 C123 S\n\n268 269 1 1 ... 153.4625 C125 S\n\n273 274 0 1 ... 29.7000 C118 C\n\n298 299 1 1 ... 30.5000 C106 S\n\n331 332 0 1 ... 28.5000 C124 S\n\n351 352 0 1 ... 35.0000 C128 S\n\n449 450 1 1 ... 30.5000 C104 S\n\n452 453 0 1 ... 27.7500 C111 C\n\n571 572 1 1 ... 51.4792 C101 S\n\n609 610 1 1 ... 153.4625 C125 S\n\n669 670 1 1 ... 52.0000 C126 S\n\n711 712 0 1 ... 26.5500 C124 S\n\n712 713 1 1 ... 52.0000 C126 S\n\n889 890 1 1 ... 30.0000 C148 C\n\n[17 rows x 12 columns]\n\n\n\n\n\n%timeit df.loc[df.Cabin.ne_search_for_string_contains('C1')]\n\n329 \u00b5s \u00b1 1.18 \u00b5s per loop (mean \u00b1 std. dev. of 7 runs, 1,000 loops each)\n\n%timeit df.loc[df.Cabin.str.contains('C1',na=False)]\n\n403 \u00b5s \u00b1 924 ns per loop (mean \u00b1 std. dev. of 7 runs, 1,000 loops each)\n\n```\n\n\n\n```python\n\n# Returns the index of all rows where the value was found.\n\n# Exceptions (e.g. wrong datatype etc.) are ignored.\n\n# duplicates (more positive results in one row) are not deleted\n\ndf.ne_equal_df_ind(1)\n\narray([ 0, 1, 2, 3, 8, 9, 10, 11, 15, 17, 19, 21, 22,\n\n 23, 25, 28, 31, 32, 36, 39, 43, 44, 47, 52, 53, 55,\n\n 56, 58, 61, 65, 66, 68, 74, 78, 79, 81, 82, 84, 85,\n\n 88, 97, 98, 106, 107, 109, 123...]\n\n```\n\n\n\n```python\n\n# You can pass dtype='S' to convert the values to string \n\n# (or other formats) before performing the search.\n\n# df.ne_equal_df_ind(b'1', 'S')\n\n# If you use 'S', you have to pass a binary value\n\ndf.ne_equal_df_ind(b'1', 'S')\n\nOut[16]: \n\narray([ 0, 1, 2, 3, 8, 9, 10, 11, 15, 17, 19, 21, 22,\n\n 23, 25, 28, 31, 32, 36, 39, 43, 44, 47, 52, 53, 55,\n\n 56, 58, 61, 65, 66, 68, 74,...]\n\n```\n\n\n\n```python\n\n# same as DataFrame.ne_equal_df_ind\n\n# but deletes all duplicates\n\ndf.ne_equal_df_ind_no_dup(b'1', 'S')\n\narray([ 0, 1, 2, 3, 6, 7, 8, 9, 10, 11, 13, 15, 16,\n\n 17, 18, 19, 21, 22, 23, 24, 25, 27, 28, 30, 31, 32,\n\n 34, 35, 36, 39, 40, 41, 43, 44\n\n```\n\n\n\n```python\n\n# Same as DataFrame.ne_equal_df_ind,\n\n# but returns the DataFrame (df.loc[])\n\ndf.ne_equal_df_dup(b'1', 'S')\n\nOut[18]: \n\n PassengerId Survived Pclass ... Fare Cabin Embarked\n\n0 1 0 3 ... 7.2500 NaN S\n\n1 2 1 1 ... 71.2833 C85 C\n\n2 3 1 3 ... 7.9250 NaN S\n\n3 4 1 1 ... 53.1000 C123 S\n\n8 9 1 3 ... 11.1333 NaN S\n\n.. ... ... ... ... ... ... ...\n\n856 857 1 1 ... 164.8667 NaN S\n\n869 870 1 3 ... 11.1333 NaN S\n\n871 872 1 1 ... 52.5542 D35 S\n\n879 880 1 1 ... 83.1583 C50 C\n\n880 881 1 2 ... 26.0000 NaN S\n\n[886 rows x 12 columns]\n\n```\n\n\n\n```python\n\n# Same as DataFrame.ne_equal_df_ind_no_dup\n\n# but returns the DataFrame (df.loc)\n\ndf.ne_equal_df_no_dup(b'1', 'S')\n\nOut[19]: \n\n PassengerId Survived Pclass ... Fare Cabin Embarked\n\n0 1 0 3 ... 7.2500 NaN S\n\n1 2 1 1 ... 71.2833 C85 C\n\n2 3 1 3 ... 7.9250 NaN S\n\n3 4 1 1 ... 53.1000 C123 S\n\n6 7 0 1 ... 51.8625 E46 S\n\n.. ... ... ... ... ... ... ...\n\n879 880 1 1 ... 83.1583 C50 C\n\n880 881 1 2 ... 26.0000 NaN S\n\n887 888 1 1 ... 30.0000 B42 S\n\n888 889 0 3 ... 23.4500 NaN S\n\n889 890 1 1 ... 30.0000 C148 C\n\n[524 rows x 12 columns]\n\n```\n\n\n\n```python\n\n# Returns bool\n\narray([ True, False, False, False, False, False, False, False, False,\n\n False, False, False, ...]\n\ndf.loc[df.PassengerId.ne_equal(1)]\n\n%timeit df.loc[df.PassengerId.ne_equal(1)]\n\n138 \u00b5s \u00b1 626 ns per loop (mean \u00b1 std. dev. 
of 7 runs, 10,000 loops each)\n\n%timeit df.loc[df.PassengerId == 1]\n\n209 \u00b5s \u00b1 1.04 \u00b5s per loop (mean \u00b1 std. dev. of 7 runs, 1,000 loops each)\n\n```\n\n\n\n```python\n\n# Every time the condition is False, the index is\n\n# added to the return value.\n\n# Example:\n\n# A row has 6 columns. 2 of them have the value 1.\n\n# That means the index of the row will be added 4 times\n\n# to the final result\n\ndf.loc[df.ne_not_equal_df_ind(1)]\n\narray([ 1, 2, 3, ..., 888, 889, 890], dtype=int64)\n\n\n\ndf.loc[df.ne_not_equal_df_ind(1)]\n\n PassengerId Survived Pclass ... Fare Cabin Embarked\n\n1 2 1 1 ... 71.2833 C85 C\n\n2 3 1 3 ... 7.9250 NaN S\n\n3 4 1 1 ... 53.1000 C123 S\n\n4 5 0 3 ... 8.0500 NaN S\n\n5 6 0 3 ... 8.4583 NaN Q\n\n.. ... ... ... ... ... ... ...\n\n886 887 0 2 ... 13.0000 NaN S\n\n887 888 1 1 ... 30.0000 B42 S\n\n888 889 0 3 ... 23.4500 NaN S\n\n889 890 1 1 ... 30.0000 C148 C\n\n890 891 0 3 ... 7.7500 NaN Q\n\n[5344 rows x 12 columns]\n\n```\n\n\n\n```python\n\n# Same as DataFrame.ne_not_equal_df_ind\n\n# but drops all duplicates\n\n\n\ndf.ne_not_equal_df_ind_no_dup(0)\n\narray([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,\n\n 13, 14, 15, 16,...]\n\n\n\ndf.loc[df.ne_not_equal_df_ind_no_dup(0)]\n\nOut[26]: \n\n PassengerId Survived Pclass ... Fare Cabin Embarked\n\n0 1 0 3 ... 7.2500 NaN S\n\n1 2 1 1 ... 71.2833 C85 C\n\n2 3 1 3 ... 7.9250 NaN S\n\n3 4 1 1 ... 53.1000 C123 S\n\n4 5 0 3 ... 8.0500 NaN S\n\n.. ... ... ... ... ... ... ...\n\n886 887 0 2 ... 13.0000 NaN S\n\n887 888 1 1 ... 30.0000 B42 S\n\n888 889 0 3 ... 23.4500 NaN S\n\n889 890 1 1 ... 30.0000 C148 C\n\n890 891 0 3 ... 7.7500 NaN Q\n\n[891 rows x 12 columns]\n\n```\n\n\n\n```python\n\n# same as DataFrame.ne_not_equal_df_ind\n\n# but returns the DataFrame (df.loc)\n\ndf.ne_not_equal_df_dup(0)\n\nOut[28]: \n\n PassengerId Survived Pclass ... Fare Cabin Embarked\n\n0 1 0 3 ... 7.2500 NaN S\n\n1 2 1 1 ... 71.2833 C85 C\n\n2 3 1 3 ... 7.9250 NaN S\n\n3 4 1 1 ... 53.1000 C123 S\n\n4 5 0 3 ... 8.0500 NaN S\n\n.. ... ... ... ... ... ... ...\n\n886 887 0 2 ... 13.0000 NaN S\n\n887 888 1 1 ... 30.0000 B42 S\n\n888 889 0 3 ... 23.4500 NaN S\n\n889 890 1 1 ... 30.0000 C148 C\n\n890 891 0 3 ... 7.7500 NaN Q\n\n[4387 rows x 12 columns]\n\n```\n\n\n\n```python\n\n# same as DataFrame.ne_not_equal_df_no_dup\n\n# but returns the DataFrame (df.loc)\n\ndf.ne_not_equal_df_no_dup(0)\n\nOut[29]: \n\n PassengerId Survived Pclass ... Fare Cabin Embarked\n\n0 1 0 3 ... 7.2500 NaN S\n\n1 2 1 1 ... 71.2833 C85 C\n\n2 3 1 3 ... 7.9250 NaN S\n\n3 4 1 1 ... 53.1000 C123 S\n\n4 5 0 3 ... 8.0500 NaN S\n\n.. ... ... ... ... ... ... ...\n\n886 887 0 2 ... 13.0000 NaN S\n\n887 888 1 1 ... 30.0000 B42 S\n\n888 889 0 3 ... 23.4500 NaN S\n\n889 890 1 1 ... 30.0000 C148 C\n\n890 891 0 3 ... 7.7500 NaN Q\n\n[891 rows x 12 columns]\n\n```\n\n\n\n```python\n\nreturns Bool\n\ndf.Survived.ne_not_equal(0)\n\narray([False, True, True, True, False, False, False, False, True,\n\n True, True, True, False, False, False, True, False, True,\n\n False, True ...]\n\n\n\ndf.loc[df.Survived.ne_not_equal(0)]\n\n PassengerId Survived Pclass ... Fare Cabin Embarked\n\n1 2 1 1 ... 71.2833 C85 C\n\n2 3 1 3 ... 7.9250 NaN S\n\n3 4 1 1 ... 53.1000 C123 S\n\n8 9 1 3 ... 11.1333 NaN S\n\n9 10 1 2 ... 30.0708 NaN C\n\n.. ... ... ... ... ... ... ...\n\n875 876 1 3 ... 7.2250 NaN C\n\n879 880 1 1 ... 83.1583 C50 C\n\n880 881 1 2 ... 26.0000 NaN S\n\n887 888 1 1 ... 30.0000 B42 S\n\n889 890 1 1 ... 
30.0000 C148 C
[342 rows x 12 columns]

%timeit df.loc[df.Survived.ne_not_equal(0)]
157 µs ± 390 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df.loc[df.Survived != 0]
229 µs ± 1.46 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```

```python
# Returns the index; duplicates are possible
# if the condition holds in more than one
# column. Exceptions (e.g. wrong dtype) are ignored
df.ne_greater_than_df_ind(100)
array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112,
       113, 114, 115...]
```

```python
# Same as DataFrame.ne_greater_than_df_ind,
# but gets rid of all duplicates
df.ne_greater_than_df_ind_no_dup(0)
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
       13, 14, 15, 16...]
```

```python
# Same as DataFrame.ne_greater_than_df_ind,
# but returns the DataFrame (df.loc)
df.ne_greater_than_df_dup(0)
Out[22]:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S
.. ... ... ... ... ... ... ...
886 887 0 2 ... 13.0000 NaN S
887 888 1 1 ... 30.0000 B42 S
888 889 0 3 ... 23.4500 NaN S
889 890 1 1 ... 30.0000 C148 C
890 891 0 3 ... 7.7500 NaN Q
[4210 rows x 12 columns]
```

```python
# Same as DataFrame.ne_greater_than_df_ind_no_dup,
# but returns the DataFrame (df.loc)
df.ne_greater_than_df_no_dup(600)
Out[24]:
PassengerId Survived Pclass ... Fare Cabin Embarked
600 601 1 2 ... 27.0000 NaN S
601 602 0 3 ... 7.8958 NaN S
602 603 0 1 ... 42.4000 NaN S
603 604 0 3 ... 8.0500 NaN S
604 605 1 1 ... 26.5500 NaN C
.. ... ... ... ... ... ... ...
886 887 0 2 ... 13.0000 NaN S
887 888 1 1 ... 30.0000 B42 S
888 889 0 3 ... 23.4500 NaN S
889 890 1 1 ... 30.0000 C148 C
890 891 0 3 ... 7.7500 NaN Q
[291 rows x 12 columns]
```

```python
# Returns a boolean array
df.PassengerId.ne_greater_than(5)
Out[26]:
array([False, False, False, False, False, True, True, True, True,
       True, True, True, True, True, True...]

df.loc[df.PassengerId.ne_greater_than(100)]
%timeit df.loc[df.PassengerId.ne_greater_than(100)]
174 µs ± 375 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df.loc[df.PassengerId > 100]
248 µs ± 2.26 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```

```python
# Returns the index; duplicates are possible
# if the condition holds in more than one
# column. Exceptions (e.g. wrong dtype) are ignored
df.ne_less_than_df_ind(10)
array([ 0, 1, 2, ..., 881, 884, 890], dtype=int64)
```

```python
# Same as DataFrame.ne_less_than_df_ind,
# but without duplicates
df.ne_less_than_df_ind_no_dup(100)
Out[28]:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
       13, 14, 15, 16, 17, 18, 19, 20, 21,...]
```

```python
# Same as DataFrame.ne_less_than_df_ind,
# but returns the DataFrame (df.loc)
df.ne_less_than_df_dup(1)
Out[29]:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
4 5 0 3 ... 8.0500 NaN S
5 6 0 3 ... 8.4583 NaN Q
6 7 0 1 ... 51.8625 E46 S
7 8 0 3 ... 21.0750 NaN S
.. ... ... ... ... ... ... ...
674 675 0 2 ... 0.0000 NaN S
732 733 0 2 ... 0.0000 NaN S
806 807 0 1 ... 0.0000 A36 S
815 816 0 1 ... 0.0000 B102 S
822 823 0 1 ... 0.0000 NaN S
[1857 rows x 12 columns]
```

```python
# Same as DataFrame.ne_less_than_df_ind_no_dup,
# but returns the DataFrame (df.loc)
df.ne_less_than_df_no_dup(1)
Out[30]:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S
.. ... ... ... ... ... ... ...
886 887 0 2 ... 13.0000 NaN S
887 888 1 1 ... 30.0000 B42 S
888 889 0 3 ... 23.4500 NaN S
889 890 1 1 ... 30.0000 C148 C
890 891 0 3 ... 7.7500 NaN Q
[834 rows x 12 columns]
```

```python
# Returns a boolean array
df.PassengerId.ne_less_than(100)
Out[31]:
array([ True, True, True, True, True, True...]
%timeit df.loc[df.PassengerId.ne_less_than(100)]
142 µs ± 412 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df.loc[df.PassengerId < 100]
212 µs ± 897 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```

```python
# Returns the index; duplicates are possible
# if the condition holds in more than one
# column. Exceptions (e.g. wrong dtype) are ignored
df.ne_greater_than_or_equal_to_df_ind(100)
Out[35]:
array([ 99, 100, 101, 102, 103, 104, ...]
```

```python
# Same as DataFrame.ne_greater_than_or_equal_to_df_ind,
# but without duplicates
df.ne_greater_than_or_equal_to_df_ind_no_dup(100)
Out[36]:
array([ 27, 31, 88, 99, 100, 101, 102,...]
```

```python
# Same as DataFrame.ne_greater_than_or_equal_to_df_ind,
# but returns the DataFrame (df.loc)
df.ne_greater_than_or_equal_to_df_dup(100)
Out[37]:
PassengerId Survived Pclass ... Fare Cabin Embarked
99 100 0 2 ... 26.0000 NaN S
100 101 0 3 ... 7.8958 NaN S
101 102 0 3 ... 7.8958 NaN S
102 103 0 1 ... 77.2875 D26 S
103 104 0 3 ... 8.6542 NaN S
.. ... ... ... ... ... ... ...
742 743 1 1 ... 262.3750 B57 B59 B63 B66 C
763 764 1 1 ... 120.0000 B96 B98 S
779 780 1 1 ... 211.3375 B3 S
802 803 1 1 ... 120.0000 B96 B98 S
856 857 1 1 ... 164.8667 NaN S
[845 rows x 12 columns]
```

```python
# Same as DataFrame.ne_greater_than_or_equal_to_df_ind_no_dup,
# but returns the DataFrame (df.loc)
df.ne_greater_than_or_equal_to_df_no_dup(100)
Out[38]:
PassengerId Survived Pclass ... Fare Cabin Embarked
27 28 0 1 ... 263.0000 C23 C25 C27 S
31 32 1 1 ... 146.5208 B78 C
88 89 1 1 ... 263.0000 C23 C25 C27 S
99 100 0 2 ... 26.0000 NaN S
100 101 0 3 ... 7.8958 NaN S
.. ... ... ... ... ... ... ...
886 887 0 2 ... 13.0000 NaN S
887 888 1 1 ... 30.0000 B42 S
888 889 0 3 ... 23.4500 NaN S
889 890 1 1 ... 30.0000 C148 C
890 891 0 3 ... 7.7500 NaN Q
[795 rows x 12 columns]
```

```python
# Returns a boolean array
df.PassengerId.ne_greater_than_or_equal_to(100)
Out[39]:
array([False, False, False, False, False...])
%timeit df.loc[df.PassengerId.ne_greater_than_or_equal_to(100)]
175 µs ± 832 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df.loc[df.PassengerId >= 100]
251 µs ± 2.77 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```

```python
# Returns the index; duplicates are possible
# if the condition holds in more than one
# column. Exceptions (e.g. wrong dtype) are ignored
df.ne_less_than_or_equal_to_df_ind(100)
Out[40]: array([ 0, 1, 2, ..., 888, 889, 890], dtype=int64)
```

```python
# Same as DataFrame.ne_less_than_or_equal_to_df_ind,
# but without duplicates
df.ne_less_than_or_equal_to_df_ind_no_dup(100)
Out[41]:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, ...])
```

```python
# Same as DataFrame.ne_less_than_or_equal_to_df_ind,
# but returns the DataFrame (df.loc)
df.ne_less_than_or_equal_to_df_dup(100)
Out[42]:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S
.. ... ... ... ... ... ... ...
886 887 0 2 ... 13.0000 NaN S
887 888 1 1 ... 30.0000 B42 S
888 889 0 3 ... 23.4500 NaN S
889 890 1 1 ... 30.0000 C148 C
890 891 0 3 ... 7.7500 NaN Q
[5216 rows x 12 columns]
```

```python
# Same as DataFrame.ne_less_than_or_equal_to_df_ind_no_dup,
# but returns the DataFrame (df.loc)
df.ne_less_than_or_equal_to_df_no_dup(0)
Out[53]:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S
.. ... ... ... ... ... ... ...
886 887 0 2 ... 13.0000 NaN S
887 888 1 1 ... 30.0000 B42 S
888 889 0 3 ... 23.4500 NaN S
889 890 1 1 ... 30.0000 C148 C
890 891 0 3 ... 7.7500 NaN Q
[829 rows x 12 columns]
```

```python
# Returns a boolean array
df.PassengerId.ne_less_than_or_equal_to(100)
Out[55]:
array([ True, True, True, True, ....]

%timeit df.loc[df.PassengerId.ne_less_than_or_equal_to(100)]
145 µs ± 1.82 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df.loc[df.PassengerId <= 100]
212 µs ± 1.63 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```

```python
# Combining conditions
%timeit df.loc[df.PassengerId.ne_greater_than(100) & df.Cabin.ne_search_for_string_series_contains('C1')]
360 µs ± 2.56 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit df.loc[(df.PassengerId > 100) & df.Cabin.str.contains('C1', na=False)]
552 µs ± 3.49 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```

```python
# You can pass your own queries.
# If you want to compare the DataFrame/Series to another array,
# the variable 'b' represents the DataFrame/Series itself.
# That means: don't use 'b' for anything else.
import numpy as np  # needed for np.array below

wholedict = {'c': np.array([1])}
df[['Survived','Pclass']].ne_query('b == c', local_dict=wholedict)
Out[14]:
array([[False, False],
       [ True, True],
       [ True, False],
       ...,
       [False, False],
       [ True, True],
       [False, False]])

# You can use any NumExpr operator/function
# https://numexpr.readthedocs.io/projects/NumExpr3/en/latest/user_guide.html
# and get a tremendous speedup (even with small DataFrames)
%timeit df['Survived'] + df.Pclass
68.6 µs ± 167 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df['Survived'] * df.Pclass
69 µs ± 260 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df['Survived'] == df.Pclass
72.3 µs ± 817 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

# You have to pass the Series/arrays used in the expression as a dict (local_dict)
wholedict = {'c': df.Pclass}
%timeit df['Survived'].ne_query('b + c', local_dict=wholedict)
25.2 µs ± 130 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df['Survived'].ne_query('b * c', local_dict=wholedict)
25.3 µs ± 177 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df['Survived'].ne_query('b == c', local_dict=wholedict)
25.2 µs ± 197 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

# Exceptions are not ignored here.
# If you want to compare the DataFrame with a scalar:
df[['Survived','Pclass']].ne_query('b == 1')

# Also works for Series
wholedict = {'c': np.array([1])}
df['Survived'].ne_query('b == c', local_dict=wholedict)

# Scalar
df['Pclass'].ne_query('b == 1')

%timeit df.loc[df['Pclass'].ne_query('b == 1')]
155 µs ± 530 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df.loc[df['Pclass'] == 1]
220 µs ± 3.96 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
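## How the speed-up works (illustrative sketches)

The comparison helpers shown above are fast for the same reason numexpr itself is fast: they hand the Series' raw NumPy values straight to `numexpr.evaluate` and skip most of pandas' per-operation overhead. The sketch below is only an illustration of that idea, not the package's actual source code; the function name `ne_greater_than_sketch` is made up for this example.

```python
# Minimal sketch, assuming the helper just forwards the Series' values to numexpr.
# Hypothetical name - not the real implementation of a_pandas_ex_numexpr.
import numexpr
import numpy as np
import pandas as pd

def ne_greater_than_sketch(ser: pd.Series, value) -> np.ndarray:
    # Pull out the underlying NumPy buffer once and let numexpr do the comparison;
    # no index alignment, no new Series is built.
    a = ser.to_numpy()
    return numexpr.evaluate("a > value", local_dict={"a": a, "value": value})

# The boolean NumPy array can be fed straight into df.loc:
# df.loc[ne_greater_than_sketch(df.PassengerId, 100)]
```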
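The DataFrame-wide `*_df_ind` variants behave as their comments describe: the comparison runs column by column, so the same row index can appear once per matching column, and columns whose dtype cannot be compared (e.g. strings against a number) are simply skipped. A rough, hedged sketch of that behaviour follows; the helper name is again hypothetical and the exact ordering of the returned indices may differ from the library's.

```python
# Hedged sketch of the per-column index collection behind *_df_ind / *_df_ind_no_dup.
# Hypothetical helper, not the package's actual code.
import numexpr
import numpy as np
import pandas as pd

def greater_than_df_ind_sketch(df: pd.DataFrame, value, drop_duplicates: bool = False) -> np.ndarray:
    hits = []
    for col in df.columns:
        a = df[col].to_numpy()
        try:
            mask = numexpr.evaluate("a > value", local_dict={"a": a, "value": value})
        except Exception:
            # e.g. object/string columns cannot be compared numerically -> skip them
            continue
        hits.append(df.index.to_numpy()[mask])
    if not hits:
        return np.array([], dtype=np.int64)
    indices = np.concatenate(hits)
    # the *_no_dup variants keep each row index only once
    return np.unique(indices) if drop_duplicates else indices

# Roughly the rows selected by df.ne_greater_than_df_no_dup(600) above:
# df.loc[greater_than_df_ind_sketch(df, 600, drop_duplicates=True)]
```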
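Finally, the `ne_query` examples above boil down to a single `numexpr.evaluate` call in which `b` is bound to the values of the calling Series and every entry of `local_dict` is converted to a NumPy array. The sketch below mirrors that idea (a simplified, hypothetical rewrite, not the method's real source); it also illustrates why `return_np=True` is faster: the result comes back as a plain NumPy array instead of being wrapped in a new Series.

```python
# Simplified, hypothetical sketch of what a call like
#   df['Survived'].ne_query('b * c', local_dict={'c': df.Pclass})
# roughly corresponds to when using numexpr directly.
import numexpr
import numpy as np
import pandas as pd

def ne_query_sketch(ser: pd.Series, expr: str, local_dict=None, return_np: bool = True):
    names = {"b": ser.to_numpy()}  # 'b' is reserved for the calling Series
    for key, val in (local_dict or {}).items():
        names[key] = val.to_numpy() if hasattr(val, "to_numpy") else np.asarray(val)
    result = numexpr.evaluate(expr, local_dict=names)
    # return_np=True skips rebuilding a pandas object, which is where a good
    # part of the remaining per-call overhead sits
    return result if return_np else pd.Series(result, index=ser.index)

# ne_query_sketch(df['Survived'], 'b * c', local_dict={'c': df.Pclass})
```

Either way, the heavy lifting is done by numexpr's vectorized virtual machine; the wrappers mostly avoid round-tripping through pandas objects on every call, which is presumably where the speed-up on small DataFrames comes from.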
"bugtrack_url": null,
"license": "MIT",
"summary": "Pandas DataFrame/Series operations 8 times faster (or even more)",
"version": "0.10",
"split_keywords": [
"numexpr",
"numpy",
"sort",
"pandas",
"series"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "5580bd0efe141a087946f4e5ac1fea3ff25f47da8f8b0f4d322af6c9a981473a",
"md5": "fefeea7af88af6ff0ad3274ff24937da",
"sha256": "ef3fcbb36a2bbd67558da0076277c5d46f3364a7117a3311c581a4cc1702f825"
},
"downloads": -1,
"filename": "a_pandas_ex_numexpr-0.10-py3-none-any.whl",
"has_sig": false,
"md5_digest": "fefeea7af88af6ff0ad3274ff24937da",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 16788,
"upload_time": "2023-02-03T00:27:19",
"upload_time_iso_8601": "2023-02-03T00:27:19.195393Z",
"url": "https://files.pythonhosted.org/packages/55/80/bd0efe141a087946f4e5ac1fea3ff25f47da8f8b0f4d322af6c9a981473a/a_pandas_ex_numexpr-0.10-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "7cfa67301d80ba78883296c68552984ecfd440d241831289440d87c895ac0c3c",
"md5": "66f4bf63290e128eed656743e2a2fd86",
"sha256": "8b31c3907ae8e5117cf73615338ffdb9f549cbaf3904fdb015ad21d63dff045c"
},
"downloads": -1,
"filename": "a_pandas_ex_numexpr-0.10.tar.gz",
"has_sig": false,
"md5_digest": "66f4bf63290e128eed656743e2a2fd86",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 26161,
"upload_time": "2023-02-03T00:27:21",
"upload_time_iso_8601": "2023-02-03T00:27:21.767462Z",
"url": "https://files.pythonhosted.org/packages/7c/fa/67301d80ba78883296c68552984ecfd440d241831289440d87c895ac0c3c/a_pandas_ex_numexpr-0.10.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-02-03 00:27:21",
"github": true,
"gitlab": false,
"bitbucket": false,
"github_user": "hansalemaos",
"github_project": "a_pandas_ex_numexpr",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [],
"lcname": "a-pandas-ex-numexpr"
}