In this tutorial, we will go over some useful pandas string functions that you can combine with regular expressions to process text.

Function	Description
contains()	Test if a pattern or regex is contained within each string of a Series or Index.
count()	Count occurrences of a pattern in each string of the Series/Index.
findall()	Find all occurrences of a pattern or regular expression in the Series/Index.
replace()	Replace each occurrence of a pattern/regex in the Series/Index with a custom string.
split()	Split strings around a given pattern.
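Before loading the full dataset, it may help to see all five functions side by side on a tiny Series. The snippet below is only a sketch: the three sample strings are made up for illustration and are not taken from the AG News data. All five functions are reached through the `.str` accessor of a Series.

```python
import pandas as pd

# Hypothetical toy data, not part of the AG News dataset.
s = pd.Series(["Small business news", "the cat and the hat", "no match here"])

# contains(): one boolean per row
has_business = s.str.contains(r"\bbusiness\b", regex=True)

# count(): number of matches per row
the_counts = s.str.count(r"\bthe\b")

# findall(): list of matches per row
the_matches = s.str.findall(r"\bthe\b")

# replace(): substitute every match; regex=True makes the pattern a regex
replaced = s.str.replace(r"\bthe\b", "THE", regex=True)

# split(): list of pieces per row, here split on runs of whitespace
pieces = s.str.split(r"\s+")

print(has_business.tolist())
print(the_counts.tolist())
print(the_matches.tolist())
print(replaced.tolist())
print(pieces.tolist())
```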

Create a DataFrame if you'd like to follow along with the tutorial:

from datasets import load_dataset
agnews = load_dataset('ag_news')

agnews.set_format(type="pandas")
df = agnews['train'][:]
df.head()

contains

Find the texts that contain the word "business" (case-sensitive, whole word only):

df[df['text'].str.contains(r'\bbusiness\b')].head()
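The pattern above is case-sensitive, so "Business" at the start of a sentence is not matched. As a rough sketch on made-up strings (not the AG News data), `contains()` also accepts `case=False` for case-insensitive matching and `na=False` so that missing values count as non-matches, which keeps the resulting mask safe to use for row filtering.

```python
import pandas as pd

# Hypothetical toy data with mixed case and one missing value.
s = pd.Series(["Business is booming", "small business owners", "busybody", None])

# case=False ignores case; na=False treats missing values as "no match"
mask = s.str.contains(r"\bbusiness\b", case=False, na=False)

# The boolean mask can be used directly to filter rows
filtered = s[mask]

print(mask.tolist())
print(filtered.tolist())
```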

count

Count the total number of times the word "business" occurs across all texts. count() returns the number of matches per row, so we sum the result:

df['text'].str.count(r'\bbusiness\b').sum()

2759
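To make the difference between per-row counts and the overall total concrete, here is a small sketch on invented strings. It also passes `flags=re.IGNORECASE`, a standard `re` flag that `count()` accepts, to show how the total changes when case is ignored.

```python
import re
import pandas as pd

# Hypothetical toy data, not part of the AG News dataset.
s = pd.Series(["Business as usual for business", "no match", "BUSINESS business"])

# One count per row (case-sensitive)
per_row = s.str.count(r"\bbusiness\b")

# Sum the per-row counts to get a single total
total = int(per_row.sum())

# Same pattern, ignoring case
total_ignore_case = int(s.str.count(r"\bbusiness\b", flags=re.IGNORECASE).sum())

print(per_row.tolist(), total, total_ignore_case)
```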

findall

findall() is equivalent to applying re.findall() to each string in the Series.
See the separate tutorial on re.findall() and re.search() for details.
Below is an example that finds every standalone word "a" in each text:

df['text'].str.findall(r'\ba\b')

0                   []
1                  [a]
2                   []
3                  [a]
4                  [a]
              ...     
119995             [a]
119996    [a, a, a, a]
119997             [a]
119998              []
119999             [a]
Name: text, Length: 120000, dtype: object
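The equivalence with `re.findall()` can be checked directly on a few invented strings. This is only an illustrative sketch: it applies `re.findall()` row by row and compares the result with `str.findall()`. Note that `\ba\b` matches the standalone word "a", not every letter "a", so "banana" yields an empty list.

```python
import re
import pandas as pd

# Hypothetical toy data, not part of the AG News dataset.
s = pd.Series(["a cat sat on a mat", "banana", "such a day"])

# pandas version: one list of matches per row
via_pandas = s.str.findall(r"\ba\b")

# plain-Python version: call re.findall() on each string
via_re = s.apply(lambda text: re.findall(r"\ba\b", text))

# Number of matches per row, taken from the list lengths
match_counts = via_pandas.str.len()

print(via_pandas.tolist())
print(via_re.tolist())
print(match_counts.tolist())
```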

replace

Replace every occurrence of "today" or "Today" with "TODAYYYYYY".
Check the second-to-last row of the output!

df['text'].str.replace(r'\b[Tt]oday\b', 'TODAYYYYYY', regex=True)

0         Wall St. Bears Claw Back Into the Black (Reute...
1         Carlyle Looks Toward Commercial Aerospace (Reu...
2         Oil and Economy Cloud Stocks' Outlook (Reuters...
3         Iraq Halts Oil Exports from Main Southern Pipe...
4         Oil prices soar to all-time record, posing new...
                                ...                        
119995    Pakistan's Musharraf Says Won't Quit as Army C...
119996    Renteria signing a top-shelf deal Red Sox gene...
119997    Saban not going to Dolphins yet The Miami Dolp...
119998    TODAYYYYYY's NFL games PITTSBURGH at NY GIANTS...
119999    Nets get Carter from Raptors INDIANAPOLIS -- A...
Name: text, Length: 120000, dtype: object
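In recent pandas versions, `str.replace()` treats the pattern as a literal string unless `regex=True` is passed, so it is safest to state the intent explicitly. The sketch below uses made-up strings to contrast the two modes, and also shows a backreference (`\1`) in the replacement, which works the same way as in `re.sub()`.

```python
import pandas as pd

# Hypothetical toy data, not part of the AG News dataset.
s = pd.Series(["Today's games", "news for today", "Todays"])

# regex=False: the pattern is searched for as literal text, so nothing changes
literal = s.str.replace(r"\b[Tt]oday\b", "TODAY", regex=False)

# regex=True: the pattern is interpreted as a regular expression
as_regex = s.str.replace(r"\b[Tt]oday\b", "TODAY", regex=True)

# A capture group can be reused in the replacement via \1
wrapped = s.str.replace(r"\b([Tt]oday)\b", r"<\1>", regex=True)

print(literal.tolist())
print(as_regex.tolist())
print(wrapped.tolist())
```

"Todays" is left alone in every case because `\b` requires a word boundary right after "Today".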

split

Split each text on the word "the". The function returns a list of strings for each row.
Check the first row of the output:

df['text'].str.split(r"\bthe\b")

0         [Wall St. Bears Claw Back Into ,  Black (Reute...
1         [Carlyle Looks Toward Commercial Aerospace (Re...
2         [Oil and Economy Cloud Stocks' Outlook (Reuter...
3         [Iraq Halts Oil Exports from Main Southern Pip...
4         [Oil prices soar to all-time record, posing ne...
                                ...                        
119995    [Pakistan's Musharraf Says Won't Quit as Army ...
119996    [Renteria signing a top-shelf deal Red Sox gen...
119997    [Saban not going to Dolphins yet The Miami Dol...
119998    [Today's NFL games PITTSBURGH at NY GIANTS Tim...
119999    [Nets get Carter from Raptors INDIANAPOLIS -- ...
Name: text, Length: 120000, dtype: object
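Two optional parameters of `split()` are often useful here: `n` limits the number of splits, and `expand=True` returns the pieces as DataFrame columns instead of lists. The following is a sketch on invented strings. The pattern is a slight variation on the one above: it also consumes the whitespace around "the" so that the pieces come out trimmed.

```python
import pandas as pd

# Hypothetical toy data, not part of the AG News dataset.
s = pd.Series(["into the black", "the end of the road", "nothing"])

# Variation on the tutorial pattern: also swallow surrounding whitespace
pattern = r"\s*\bthe\b\s*"

# Split on every match
parts = s.str.split(pattern)

# n=1 stops after the first split
first_split = s.str.split(pattern, n=1)

# expand=True spreads the pieces across columns, padding short rows
expanded = s.str.split(pattern, expand=True)

print(parts.tolist())
print(first_split.tolist())
print(expanded.shape)
```

A row that starts with the separator, like the second one, gets an empty string as its first piece, just as `re.split()` would produce.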

You may also be interested in

How to load datasets from Hugging Face Datasets

Output of the contains example (first five matching rows):

	text	label
42	Technology company sues five ex-employees A M...	2
62	Downhome Pinoy Blues, Intersecting Life Paths,...	2
63	The Real Time Modern Manila Blues: Bill Monroe...	2
65	What are the best cities for business in Asia?...	2
74	HP to Buy Synstar Hewlett-Packard will pay $2...	2

Output of df.head():

	text	label
0	Wall St. Bears Claw Back Into the Black (Reute...	2
1	Carlyle Looks Toward Commercial Aerospace (Reu...	2
2	Oil and Economy Cloud Stocks' Outlook (Reuters...	2
3	Iraq Halts Oil Exports from Main Southern Pipe...	2
4	Oil prices soar to all-time record, posing new...	2