Regex enhancements for Pandas DataFrame / Series
###Installation
pip install a-pandas-ex-regex-enhancements
###Usage
from a_pandas_ex_regex_enhancements import pd_add_regex_enhancements
pd_add_regex_enhancements()
import pandas as pd
The code above will add some methods to! You can use pandas like you did before, but you will have a couple of methods more:
-
pandas.DataFrame.ds_trie_regex_search / pandas.Series.ds_trie_regex_search
-
pandas.DataFrame.ds_trie_regex_sub / pandas.Series.ds_trie_regex_sub
-
pandas.DataFrame.ds_trie_regex_find_all / pandas.Series.ds_trie_regex_find_all
-
pandas.DataFrame.ds_regex_find_all / pandas.Series.ds_regex_find_all
-
pandas.DataFrame.ds_regex_find_all_special / pandas.Series.ds_regex_find_all_special
-
pandas.DataFrame.ds_regex_search / pandas.Series.ds_regex_search
-
pandas.DataFrame.ds_regex_sub / pandas.Series.ds_regex_sub
###How to use the new methods
pandas.DataFrame.ds_regex_find_all_special / pandas.Series.ds_regex_find_all_special*
df=pd.read_csv( "https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv")
special=df.ds_regex_find_all_special(r'\b(Ma(\w)+)(\w+)\b', dtype_string=False)
aa_start_0 ... aa_match_6
aa_index aa_column aa_whole_match aa_whole_start aa_whole_end aa_group ...
7 Name Master 9 15 0 9 ... NaN
1 9 ... NaN
2 11 ... NaN
3 14 ... NaN
10 Name Marguerite 17 27 0 17 ... NaN
... ... ...
885 Name Margaret 20 28 3 27 ... NaN
887 Name Margaret 14 22 0 14 ... NaN
1 14 ... NaN
2 16 ... NaN
3 21 ... NaN
import re
re.findall('(Ma(\w)+)', 'Margaret')
Out[11]: [('Margaret', 't')]
aa_start_0 aa_start_1 aa_start_2 aa_start_3 aa_start_4 aa_start_5 aa_start_6 aa_stop_0 aa_stop_1 aa_stop_2 aa_stop_3 aa_stop_4 aa_stop_5 aa_stop_6 aa_match_0 aa_match_1 aa_match_2 aa_match_3 aa_match_4 aa_match_5 aa_match_6
aa_column aa_whole_match aa_whole_start aa_whole_end aa_group
Name Margaret 14 22 0 14 <NA> <NA> <NA> <NA> <NA> <NA> 22 <NA> <NA> <NA> <NA> <NA> <NA> Margaret <NA> <NA> <NA> <NA> <NA> <NA>
1 14 <NA> <NA> <NA> <NA> <NA> <NA> 21 <NA> <NA> <NA> <NA> <NA> <NA> Margare <NA> <NA> <NA> <NA> <NA> <NA>
2 16 17 18 19 20 <NA> <NA> 17 18 19 20 21 <NA> <NA> r g a r e <NA> <NA>
3 21 <NA> <NA> <NA> <NA> <NA> <NA> 22 <NA> <NA> <NA> <NA> <NA> <NA> t <NA> <NA> <NA> <NA> <NA> <NA>
If you want to convert the results to the best available dtype, use:
df.ds_regex_find_all_special(r'\b(Ma(\w)+)(\w+)\b', dtype_string=False)
Out[3]:
aa_start_0 ... aa_match_6
7 Name Master 9 15 0 9 ... <NA>
1 9 ... <NA>
2 11 ... <NA>
3 14 ... <NA>
10 Name Marguerite 17 27 0 17 ... <NA>
... ... ...
885 Name Margaret 20 28 3 27 ... <NA>
887 Name Margaret 14 22 0 14 ... <NA>
1 14 ... <NA>
2 16 ... <NA>
3 21 ... <NA>
[764 rows x 21 columns]
aa_start_0 uint8
aa_start_1 Int64
aa_start_2 Int64
aa_start_3 Int64
aa_start_4 Int64
aa_start_5 Int64
aa_start_6 Int64
aa_stop_0 uint8
aa_stop_1 Int64
aa_stop_2 Int64
aa_stop_3 Int64
aa_stop_4 Int64
aa_stop_5 Int64
aa_stop_6 Int64
aa_match_0 category
aa_match_1 category
aa_match_2 category
aa_match_3 category
aa_match_4 category
aa_match_5 category
aa_match_6 category
Parameters:
df: Union[pd.DataFrame, pd.Series]
regular_expression: str
Syntax from https://pypi.org/project/regex/
flags:int
You can use any flag that is available here: https://pypi.org/project/regex/
(default =regex.UNICODE)
dtype_string:bool
If True, it returns all results as a string
If False, data types are converted to the best available
(default =True)
Returns:
Union[pd.Series, pd.DataFrame]
pandas.DataFrame.ds_regex_find_all / pandas.Series.ds_regex_find_all
df=pd.read_csv( "https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv")
df.Name.ds_regex_find_all(regular_expression=r'(\bM\w+\b)\s+(\bW\w+\b)')
result_0 result_1
426 Name Maria Winfield
472 Name Mary Worth
862 Name Margaret Welles
multilinetest=df.Name.map(lambda x: f'{x}\n' * 3)
multilinetest.ds_regex_find_all(regular_expression=r'^.*(\bM\w+\b)\s+(\bW\w+\b)', line_by_line=False)
Out[3]:
result_0 result_1
58 Name Mirium West
426 Name Maria Winfield
472 Name Mary Worth
862 Name Margaret Welles
multilinetest.ds_regex_find_all(regular_expression=r'^.*(\bM\w+\b)\s+(\bW\w+\b)', line_by_line=True)
Out[7]:
result_0 result_1
426 Name Maria Winfield
Name Maria Winfield
Name Maria Winfield
472 Name Mary Worth
Name Mary Worth
Name Mary Worth
862 Name Margaret Welles
Name Margaret Welles
Name Margaret Welles
Parameters:
df: Union[pd.DataFrame, pd.Series]
regular_expression: str
Syntax from https://pypi.org/project/regex/
flags:int
You can use any flag that is available here: https://pypi.org/project/regex/
(default =regex.UNICODE)
dtype_string:bool
If True, it returns all results as a string
If False, data types are converted to the best available
(default =True)
line_by_line:bool
If you want to split the line before searching. Useful, if you want to use ^....$ more than once.
(default =False)
Returns:
Union[pd.Series, pd.DataFrame]
pandas.DataFrame.ds_trie_regex_find_all / pandas.Series.ds_trie_regex_find_all
Example:
df=pd.read_csv( "https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv")
allstrings=pd.DataFrame([[df.Name.to_string() *2] *2,[df.Name.to_string() *2] *2])
hugeregexlist=df.Name.str.extract(r'^\s*(\w+)').drop_duplicates()[0].to_list()
%timeit allstrings.ds_trie_regex_find_all(hugeregexlist,add_left_to_regex=r'\b',add_right_to_regex=r'\b')
776 ms ± 2.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
allstrings.ds_trie_regex_find_all(hugeregexlist, add_left_to_regex=r'\b', add_right_to_regex=r'\b')
Out[6]:
result_0 result_1 result_2 ... result_2133 result_2134 result_2135
0 0 Braund Harris Cumings ... Johnston Behr Dooley
1 Braund Harris Cumings ... Johnston Behr Dooley
1 0 Braund Harris Cumings ... Johnston Behr Dooley
1 Braund Harris Cumings ... Johnston Behr Dooley
Let's compare with a regular regex search
hugeregex=r"\b(?:" + "|".join([f'(?:{y})' for y in df.Name.str.extract(r'^\s*(\w+)').drop_duplicates()[0].to_list()]) + r")\b" #let's create a regex from all names
%timeit allstrings.ds_regex_find_all(hugeregex)
945 ms ± 3.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Parameters:
df: Union[pd.DataFrame, pd.Series]
wordlist: list[str]
All strings you are looking for
add_left_to_regex: str
if you want to add something before the generated Trie regex -> \b for example
allstrings.ds_trie_regex_find_all(hugeregexlist,add_left_to_regex=r'\b',add_right_to_regex=r'\b')
(default = "")
add_right_to_regex: str
if you want to add something after the generated Trie regex -> \b for example
allstrings.ds_trie_regex_find_all(hugeregexlist,add_left_to_regex=r'\b',add_right_to_regex=r'\b')
(default = "")
flags:int
You can use any flag that is available here: https://pypi.org/project/regex/
(default =regex.UNICODE)
dtype_string:bool
If True, it returns all results as a string
If False, data types are converted to the best available
(default =True)
line_by_line:bool
If you want to split the line before searching. Useful, if you want to use ^....$ more than once.
(default =False)
Returns:
Union[pd.Series, pd.DataFrame]
pandas.DataFrame.ds_regex_sub / pandas.Series.ds_regex_sub
df=pd.read_csv( "https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv")
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S
.. ... ... ... ... ... ... ...
886 887 0 2 ... 13.0000 NaN S
887 888 1 1 ... 30.0000 B42 S
888 889 0 3 ... 23.4500 NaN S
889 890 1 1 ... 30.0000 C148 C
890 891 0 3 ... 7.7500 NaN Q
[891 rows x 12 columns]
subst=df.ds_regex_sub(regular_expression=r'^\b8\d(\d)\b', replace=r'\g<1>00000',dtype_string=False)
Out[5]:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 <NA> S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 <NA> S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 <NA> S
.. ... ... ... ... ... ... ...
886 700000 0 2 ... 13.0000 <NA> S
887 800000 1 1 ... 30.0000 B42 S
888 900000 0 3 ... 23.4500 <NA> S
889 0 1 1 ... 30.0000 C148 C
890 100000 0 3 ... 7.7500 <NA> Q
[891 rows x 12 columns]
subst.dtypes
Out[8]:
PassengerId uint32
Survived uint8
Pclass uint8
Name string
Sex category
Age object
SibSp uint8
Parch uint8
Ticket object
Fare float64
Cabin category
Embarked category
df.Fare
Out[16]:
0 7.2500
1 71.2833
2 7.9250
3 53.1000
4 8.0500
...
886 13.0000
887 30.0000
888 23.4500
889 30.0000
890 7.7500
Name: Fare, Length: 891, dtype: float64
Fareint=df.Fare.ds_regex_sub(r'(\d+)\.\d+$', r'\g<1>',dtype_string=False)
0 7
1 71
2 7
3 53
4 8
.. ...
886 13
887 30
888 23
889 30
890 7
Fareint.dtypes
Out[18]:
Fare uint16
Parameters:
df: Union[pd.DataFrame, pd.Series]
regular_expression: str
Syntax from https://pypi.org/project/regex/
replace: str
the replacement you want to use (groups are allowed)
flags:int
You can use any flag that is available here: https://pypi.org/project/regex/
(default =regex.UNICODE)
dtype_string:bool
If True, it returns all results as a string
If False, data types are converted to the best available
(default =True)
line_by_line:bool
If you want to split the line before searching. Useful, if you want to use ^....$ more than once.
(default =False)
Returns:
Union[pd.Series, pd.DataFrame]
pandas.DataFrame.ds_trie_regex_sub / pandas.Series.ds_trie_regex_sub
df=pd.read_csv( "https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv")
df.Name.ds_trie_regex_sub(hugeregexlist, 'HANS',add_left_to_regex=r'^\b',add_right_to_regex=r'\b')
Out[16]:
Name
0 HANS, Mr. Owen Harris
1 HANS, Mrs. John Bradley (Florence Briggs Thayer)
2 HANS, Miss. Laina
3 HANS, Mrs. Jacques Heath (Lily May Peel)
4 HANS, Mr. William Henry
.. ...
886 HANS, Rev. Juozas
887 HANS, Miss. Margaret Edith
888 HANS, Miss. Catherine Helen "Carrie"
889 HANS, Mr. Karl Howell
890 HANS, Mr. Patrick
[891 rows x 1 columns]
allstrings.ds_trie_regex_search(hugeregexlist,line_by_line=True)
Out[25]:
result_0
0 0 Braund
0 Cumings
0 Heikkinen
0 Futrelle
0 Allen
.. ...
1 1 Montvila
1 Graham
1 Johnston
1 Behr
1 Dooley
[7124 rows x 1 columns]
Parameters:
df: Union[pd.DataFrame, pd.Series]
wordlist: list[str]
All strings you are looking for
replace: str
the replacement you want to use (groups are allowed)
add_left_to_regex: str
if you want to add something before the generated Trie regex -> \b for example
allstrings.ds_trie_regex_find_all(hugeregexlist,add_left_to_regex=r'\b',add_right_to_regex=r'\b')
(default = "")
add_right_to_regex: str
if you want to add something after the generated Trie regex -> \b for example
allstrings.ds_trie_regex_find_all(hugeregexlist,add_left_to_regex=r'\b',add_right_to_regex=r'\b')
(default = "")
flags:int
You can use any flag that is available here: https://pypi.org/project/regex/
(default =regex.UNICODE)
dtype_string:bool
If True, it returns all results as a string
If False, data types are converted to the best available
(default =True)
line_by_line:bool
If you want to split the line before searching. Useful, if you want to use ^....$ more than once.
(default =False)
Returns:
Union[pd.Series, pd.DataFrame]
pandas.DataFrame.ds_regex_search / pandas.Series.ds_regex_search
df=pd.read_csv( "https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv")
multilinetest=df.Name.map(lambda x: f'{x}\n' * 3)
multilinetest.ds_regex_search(regular_expression=r'^.*(\bM\w+\b)\s+(\bW\w+\b)', line_by_line=False, flags=re.IGNORECASE)
Out[13]:
result_0
58 Name West, Miss. Constance Mirium\nWest
Name Mirium
Name West
426 Name Clarke, Mrs. Charles V (Ada Maria Winfield
Name Maria
Name Winfield
472 Name West, Mrs. Edwy Arthur (Ada Mary Worth
Name Mary
Name Worth
862 Name Swift, Mrs. Frederick Joel (Margaret Welles
Name Margaret
Name Welles
multilinetest.ds_regex_search(regular_expression=r'^.*(\bM\w+\b)\s+(\bW\w+\b)', line_by_line=True, flags=re.IGNORECASE)
Out[19]:
result_0
426 Name Clarke, Mrs. Charles V (Ada Maria Winfield
Name Maria
Name Winfield
Name Clarke, Mrs. Charles V (Ada Maria Winfield
Name Maria
Name Winfield
Name Clarke, Mrs. Charles V (Ada Maria Winfield
Name Maria
Name Winfield
472 Name West, Mrs. Edwy Arthur (Ada Mary Worth
Name Mary
Name Worth
Name West, Mrs. Edwy Arthur (Ada Mary Worth
Name Mary
Name Worth
Name West, Mrs. Edwy Arthur (Ada Mary Worth
Name Mary
Name Worth
862 Name Swift, Mrs. Frederick Joel (Margaret Welles
Name Margaret
Name Welles
Name Swift, Mrs. Frederick Joel (Margaret Welles
Name Margaret
Name Welles
Name Swift, Mrs. Frederick Joel (Margaret Welles
Name Margaret
Name Welles
Parameters:
df: Union[pd.DataFrame, pd.Series]
regular_expression: str
Syntax from https://pypi.org/project/regex/
flags:int
You can use any flag that is available here: https://pypi.org/project/regex/
(default =regex.UNICODE)
dtype_string:bool
If True, it returns all results as a string
If False, data types are converted to the best available
(default =True)
line_by_line:bool
If you want to split the line before searching. Useful, if you want to use ^....$ more than once.
(default =False)
Returns:
Union[pd.Series, pd.DataFrame]
pandas.DataFrame.ds_regex_sub / pandas.Series.ds_regex_sub
df=pd.read_csv( "https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv")
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S
.. ... ... ... ... ... ... ...
886 887 0 2 ... 13.0000 NaN S
887 888 1 1 ... 30.0000 B42 S
888 889 0 3 ... 23.4500 NaN S
889 890 1 1 ... 30.0000 C148 C
890 891 0 3 ... 7.7500 NaN Q
[891 rows x 12 columns]
subst=df.ds_regex_sub(regular_expression=r'^\b8\d(\d)\b', replace=r'\g<1>00000',dtype_string=False)
Out[5]:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 <NA> S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 <NA> S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 <NA> S
.. ... ... ... ... ... ... ...
886 700000 0 2 ... 13.0000 <NA> S
887 800000 1 1 ... 30.0000 B42 S
888 900000 0 3 ... 23.4500 <NA> S
889 0 1 1 ... 30.0000 C148 C
890 100000 0 3 ... 7.7500 <NA> Q
[891 rows x 12 columns]
subst.dtypes
Out[8]:
PassengerId uint32
Survived uint8
Pclass uint8
Name string
Sex category
Age object
SibSp uint8
Parch uint8
Ticket object
Fare float64
Cabin category
Embarked category
df.Fare
Out[16]:
0 7.2500
1 71.2833
2 7.9250
3 53.1000
4 8.0500
...
886 13.0000
887 30.0000
888 23.4500
889 30.0000
890 7.7500
Name: Fare, Length: 891, dtype: float64
Fareint=df.Fare.ds_regex_sub(r'(\d+)\.\d+$', r'\g<1>',dtype_string=False)
0 7
1 71
2 7
3 53
4 8
.. ...
886 13
887 30
888 23
889 30
890 7
Fareint.dtypes
Out[18]:
Fare uint16
Parameters:
df: Union[pd.DataFrame, pd.Series]
regular_expression: str
Syntax from https://pypi.org/project/regex/
replace: str
the replacement you want to use (groups are allowed)
flags:int
You can use any flag that is available here: https://pypi.org/project/regex/
(default =regex.UNICODE)
dtype_string:bool
If True, it returns all results as a string
If False, data types are converted to the best available
(default =True)
line_by_line:bool
If you want to split the line before searching. Useful, if you want to use ^....$ more than once.
(default =False)
Returns:
Union[pd.Series, pd.DataFrame]