How to use regex with Pandas DataFrame
In this tutorial, we will go over some useful pandas functions that accept regular expressions for processing text. Each one is called through the .str accessor of a Series or Index.
Function | Description |
---|---|
contains() | Test if pattern or regex is contained within a string of a Series or Index. |
count() | Count occurrences of pattern in each string of the Series/Index. |
findall() | Find all occurrences of pattern or regular expression in the Series/Index. |
replace() | Replace each occurrence of pattern/regex in the Series/Index with a custom string. |
split() | Split strings around given pattern. |
Create a DataFrame if you'd like to follow along with the tutorial:
from datasets import load_dataset
agnews = load_dataset('ag_news')
agnews.set_format(type="pandas")
df = agnews['train'][:]
df.head()
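If downloading the dataset is not convenient, a tiny hand-made DataFrame is enough to try the functions below. This is only a sketch: the sentences and label values are invented for illustration, and only the 'text' and 'label' column names are meant to mirror the real data.

```python
import pandas as pd

# Hypothetical stand-in for the AG News training split. The rows are made up;
# the label values are arbitrary placeholders.
df = pd.DataFrame({
    "text": [
        "Wall St. business leaders met today to discuss the economy.",
        "The team won a close game, and a record crowd cheered.",
        "Today the business of the tech sector is a growing concern.",
    ],
    "label": [2, 1, 3],
})
df.head()
```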
contains
- find texts containing the word "business"
df[df['text'].str.contains(r'\bbusiness\b')].head()
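The pattern above is case-sensitive and assumes every row holds a string. A minimal sketch of two optional arguments that often help, case and na, using a made-up Series in place of df['text']:

```python
import pandas as pd

# Toy data standing in for df['text']; the strings are invented for illustration.
texts = pd.Series([
    "Business is booming this quarter.",
    "The small business sector grew.",
    "Busy businessmen skipped lunch.",
    None,
])

# case=False ignores capitalization; na=False counts missing values as
# "no match", so the result can be used directly as a boolean mask.
mask = texts.str.contains(r"\bbusiness\b", case=False, na=False)
matches = texts[mask]
print(matches)
```

The word boundaries keep "businessmen" out of the result, so only the first two rows are selected.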
count
- count the total number of times the word "business" occurs in texts
df['text'].str.count(r'\bbusiness\b').sum()
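Before summing, str.count returns one number per row. The sketch below, on invented sentences, shows the per-row counts and how the flags argument can make the count case-insensitive:

```python
import re
import pandas as pd

# Toy data; the sentences are invented for illustration.
texts = pd.Series([
    "Business news: business is good.",
    "No match here.",
    "Family business, big business, Business Week.",
])

# One count per row, matching the lowercase word only.
per_row = texts.str.count(r"\bbusiness\b")

# The same count ignoring capitalization, via the standard re flag.
per_row_ci = texts.str.count(r"\bbusiness\b", flags=re.IGNORECASE)

print(per_row.tolist())     # [1, 0, 2]
print(per_row_ci.tolist())  # [2, 0, 3]
```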
findall
- equivalent to applying re.findall() to each text
- see another tutorial on re.findall() and re.search()
- the example below finds every occurrence of the standalone word "a" in each text
df['text'].str.findall(r'\ba\b')
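Each element of the result is a list, which is empty when nothing matches. A small sketch on invented sentences, also showing that a pattern with one capture group returns only the captured part, as re.findall() does:

```python
import pandas as pd

# Toy data; the sentences are invented for illustration.
texts = pd.Series([
    "Oil hits $50 a barrel as a storm nears.",
    "Stocks rally.",
])

# Standalone "a" only: the a's inside "as", "barrel" and "rally" do not match.
articles = texts.str.findall(r"\ba\b")
print(articles.tolist())  # [['a', 'a'], []]

# Number of matches per row.
n_articles = articles.str.len()

# With one capture group, only the group's content is returned.
prices = texts.str.findall(r"\$(\d+)")
print(prices.tolist())    # [['50'], []]
```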
replace
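- replace every occurrence of "today" or "Today" with "TODAYYYYYY"
- pass regex=True so the pattern is treated as a regular expression (pandas 2.0 and later default to a literal match)
- check second to the last row!
df['text'].str.replace(r'\b[Tt]oday\b', 'TODAYYYYYY', regex=True)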
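Whether the pattern is read as a regular expression depends on the regex argument, whose default changed to False in pandas 2.0. The following sketch, on invented strings, contrasts the two settings:

```python
import pandas as pd

# Toy data; the strings are invented for illustration.
texts = pd.Series(["Today is sunny.", "See you today!", "Todays news"])

# regex=True: the pattern is a regular expression, so whole-word
# "today"/"Today" is replaced and "Todays" is left alone.
replaced = texts.str.replace(r"\b[Tt]oday\b", "TODAYYYYYY", regex=True)
print(replaced.tolist())
# ['TODAYYYYYY is sunny.', 'See you TODAYYYYYY!', 'Todays news']

# regex=False: the pattern is searched for as literal characters,
# which never occur in these strings, so nothing changes.
literal = texts.str.replace(r"\b[Tt]oday\b", "TODAYYYYYY", regex=False)
print(literal.tolist())
```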
split
- split each text on the word "the"; the function returns a list of strings for each row
- check the first row of the output
df['text'].str.split(r"\bthe\b")
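The pieces keep the whitespace that surrounded the separator. A short sketch on invented sentences, assuming pandas 1.4 or later, where the regex argument of str.split is available; n and expand are optional and shown only as one possible variation:

```python
import pandas as pd

# Toy data; the sentences are invented for illustration.
texts = pd.Series(["On the way to the market", "No article here"])

# One list per row; a row without the separator yields a single-element list.
parts = texts.str.split(r"\bthe\b", regex=True)
print(parts.tolist())
# [['On ', ' way to ', ' market'], ['No article here']]

# n=1 stops after the first split; expand=True spreads the pieces over columns.
columns = texts.str.split(r"\bthe\b", n=1, expand=True, regex=True)
print(columns)
```

Chaining .str.strip() on the resulting columns is one way to drop the leftover spaces.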