Removing Emojis and Special Characters in Python: Cleaning Dirty Data
We pulled 84,000 contact records from a client’s CRM last month to feed into their email automation pipeline. Thirty-one percent of them had something wrong.

Names in ALL CAPS. Phone numbers with country codes, without country codes, with extensions written as “x847” or “ext. 847” or just “847” appended after a space. First names containing emojis that a sales rep pasted from a LinkedIn message without thinking. Company names with zero-width spaces — invisible Unicode characters that broke string matching against their contract database.

The automation pipeline choked on every single one of these. Not spectacularly. Quietly. A phone number that didn’t match E.164 format got silently skipped when the system tried to send an SMS. A name stored as “SARAH CHEN” went out shouting in all caps in every outbound email. A company name containing a zero-width space never matched the whitelist, so those contacts never got scored.

This is what dirty CRM data looks like in practice. Not corrupted files or missing columns. Subtle formatting inconsistencies that only surface when your automation tries to act on them.

Three problems. Three fixes. All in Python.

Part 1: Stripping Emojis and Non-ASCII Characters

The instinct is to use a single regex that catches everything non-ASCII and strips it. That works if your data is purely English. It breaks the moment a legitimate name contains an accented character — José becomes Jos, München becomes Mnchen.

You need two layers. One that removes emojis specifically. One that transliterates accented characters into their ASCII equivalents instead of deleting them entirely.

The Emoji Problem

Emojis span dozens of Unicode blocks. A single regex range like \U0001F600-\U0001F64F catches emoticons but misses transport symbols, flags, supplemental pictographs, and the newer additions that ship with every Unicode version. Writing a comprehensive emoji regex yourself is a losing game — it needs updating every time Unicode releases a new spec.

Use the emoji library instead:

import emoji

def strip_emojis(text: str) -> str:
    return emoji.replace_emoji(text, replace='')

# Before: "Thanks Sarah!! 🙌🔥 let's connect"
# After:  "Thanks Sarah!!  let's connect"

One library call. It pulls from the official Unicode emoji specification, so it stays current without you maintaining a regex. The double space left behind is intentional — we clean that up in a final normalization pass at the end.

If you’d rather avoid the dependency and your data is relatively static, here’s the manual regex. It covers the major blocks but will miss edge cases in newer Unicode versions:

import re

EMOJI_PATTERN = re.compile(
    "["
    "\U0001F1E0-\U0001F1FF"   # Regional indicator flags
    "\U0001F300-\U0001F5FF"   # Symbols & pictographs
    "\U0001F600-\U0001F64F"   # Emoticons
    "\U0001F680-\U0001F6FF"   # Transport & map
    "\U0001F900-\U0001F9FF"   # Supplemental symbols
    "\U0001FA00-\U0001FAFF"   # Symbols Extended-A
    "\U00002702-\U000027B0"   # Dingbats
    "\U0000200D"              # Zero-width joiner
    "\U0000FE0F"              # Variation selector
    "]+",
    flags=re.UNICODE
)

def strip_emojis(text: str) -> str:
    return EMOJI_PATTERN.sub('', text)

The \U0000200D (zero-width joiner) and \U0000FE0F (variation selector) entries matter. Modern emojis like 👨‍💻 are actually three Unicode code points joined together: a base character, a ZWJ, and a modifier. Strip only the visible emoji without catching the ZWJ and you leave invisible characters in your string. They won’t display, but they’ll break exact-match queries.
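The leftover-joiner problem is easy to demonstrate with plain Python — no emoji library needed. A quick sketch using the 👨‍💻 sequence from above:

```python
import re

# "Man technologist" is a ZWJ sequence: base emoji + zero-width
# joiner + second emoji — three code points for one visible glyph.
technologist = "\U0001F468\u200D\U0001F4BB"   # 👨‍💻
print(len(technologist))   # 3

# A regex covering only the pictographic blocks strips the two
# visible emojis but leaves the invisible joiner behind.
partial = re.sub(r'[\U0001F300-\U0001FAFF]', '', technologist)
print(repr(partial))       # '\u200d' — one invisible character left
```

This is exactly the kind of residue that passes visual inspection and then fails an exact-match query.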

Transliterating Accented Characters

After emojis are gone, handle accented characters. Don’t delete them. Transliterate them.

from unidecode import unidecode

def transliterate(text: str) -> str:
    return unidecode(text)

# "José García"  →  "Jose Garcia"
# "München"      →  "Munchen"
# "São Paulo"    →  "Sao Paulo"

unidecode maps every Unicode character to its closest ASCII equivalent using a phonetic approximation. It’s not perfect for all languages — transliterating Chinese or Arabic into Latin script produces questionable results — but for European names and company names in a B2B CRM, it’s exactly right.
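If you can’t take the dependency, the standard library gets you partway there. NFKD normalization splits an accented letter into its base letter plus a combining mark, and an ASCII encode with errors='ignore' then drops the marks. A sketch, with the caveat that characters lacking a decomposition (ß, ø, æ) are deleted outright rather than transliterated:

```python
import unicodedata

def ascii_fold(text: str) -> str:
    # Decompose: "é" becomes "e" + U+0301 (combining acute accent),
    # then the ASCII encode/ignore round-trip drops the combining marks.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(ascii_fold("José García"))   # Jose Garcia
print(ascii_fold("München"))       # Munchen
print(ascii_fold("Straße"))        # Strae — ß is silently lost
```

For name data where losing a character means corrupting the name, unidecode is the safer choice.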

The Combined Cleaning Pass

Run these in order. Emojis first, then transliterate, then collapse whitespace:

import re
import emoji
from unidecode import unidecode

def clean_text_field(text: str) -> str:
    if not text or not isinstance(text, str):
        return text

    # 1. Strip emojis
    text = emoji.replace_emoji(text, replace='')

    # 2. Transliterate accented characters to ASCII
    text = unidecode(text)

    # 3. Remove any remaining control characters and zero-width spaces
    text = re.sub(r'[\x00-\x1F\x7F\u200B\u200C\u200D\uFEFF]', '', text)

    # 4. Collapse multiple spaces, strip leading/trailing
    text = re.sub(r'\s+', ' ', text).strip()

    return text

Step 3 is the one people forget. Zero-width spaces (\u200B), zero-width non-joiners (\u200C), and byte order marks (\uFEFF) are invisible. They don’t show up when you look at the data. They absolutely show up when you try to match strings. We found 340 company names in that 84,000-record export that contained at least one invisible character. Every single one failed whitelist matching.
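A minimal demonstration of why those invisible characters break matching (the company name here is illustrative):

```python
import re

visible = "Acme Corp"
sneaky = "Acme\u200b Corp"   # identical on screen; zero-width space inside

print(visible == sneaky)           # False — exact match silently fails
print(len(visible), len(sneaky))   # 9 10

# Step 3 of clean_text_field collapses them to the same string
cleaned = re.sub(r'[\x00-\x1F\x7F\u200B\u200C\u200D\uFEFF]', '', sneaky)
print(cleaned == visible)          # True
```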

Part 2: Normalizing Phone Numbers to E.164

E.164 is the international standard for phone number formatting. It looks like this: +12125551234. A plus sign, the country code, the national number. No spaces, no dashes, no parentheses. Maximum 15 digits after the plus.

Every phone number in your CRM should be stored in E.164 format. Not because it’s pretty — it’s not — but because it’s the only format that’s unambiguous. (212) 555-1234 could be a US number or it could be someone who doesn’t know how to write an international number. +12125551234 is always, definitively, a US number.
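Checking whether a stored value already has E.164 *shape* takes one line of regex — useful as a sanity filter on a column, though it says nothing about whether the number is actually assigned. A hypothetical helper, not part of any library:

```python
import re

# Shape of E.164: "+", a non-zero first digit, then at most 14 more
# digits (15 digits total). This checks format only, not assignment.
E164_RE = re.compile(r'\+[1-9]\d{1,14}')

def looks_like_e164(number: str) -> bool:
    return bool(E164_RE.fullmatch(number))

print(looks_like_e164("+12125551234"))    # True
print(looks_like_e164("(212) 555-1234"))  # False
print(looks_like_e164("+0123456"))        # False — country codes never start with 0
```

Normalizing raw input into that shape is a different problem entirely, which is the next point.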

Don’t try to normalize phone numbers with regex. The rules are too varied across countries — different lengths, different trunk prefixes, different conventions for writing mobile vs. landline. A regex that correctly validates US numbers will mishandle UK numbers, which will mangle Indian numbers, which will completely break Brazilian numbers.

Use Google’s phonenumbers library. It’s a Python port of libphonenumber, which contains numbering-plan metadata for every country on Earth:

import phonenumbers

def normalize_phone(raw: str, default_region: str = "US") -> str | None:
    """
    Normalize a phone number to E.164 format.
    Returns None if the number can't be parsed or isn't valid.
    default_region is the fallback country code (ISO 3166-1 alpha-2)
    used when the number doesn't include a country code prefix.
    """
    if not raw or not isinstance(raw, str):
        return None

    cleaned = raw.strip()

    try:
        # If the number starts with +, no region needed.
        # Otherwise, default_region tells the library what country to assume.
        parsed = phonenumbers.parse(cleaned, default_region)
    except phonenumbers.NumberParseException:
        return None  # Unparseable — flag for manual review

    # is_possible_number: fast length check.
    # is_valid_number: full validation against country-specific rules.
    if not phonenumbers.is_possible_number(parsed):
        return None
    if not phonenumbers.is_valid_number(parsed):
        return None

    return phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.E164)

Here’s what this handles automatically — no custom logic required:

normalize_phone("(212) 555-1234")          # → "+12125551234"
normalize_phone("212-555-1234")            # → "+12125551234"
normalize_phone("212 555 1234")            # → "+12125551234"
normalize_phone("+44 20 8366 1177")        # → "+442083661177"
normalize_phone("020 8366 1177", "GB")     # → "+442083661177"
normalize_phone("+1 (650) 253-0000")       # → "+16502530000"
normalize_phone("011 1 212 555 1234")     # → "+12125551234"  (011 is the US IDD prefix)
normalize_phone("ext. 847")               # → None
normalize_phone("555-1234")               # → None  (no area code)
normalize_phone("+1 200 123 0101")        # → None  (NPA 200 not assigned)

The last three cases are where regex-based approaches fail silently. ext. 847 isn’t a phone number — it’s an extension fragment copy-pasted into the wrong field. 555-1234 has no area code. +1 200 123 0101 looks valid but NPA 200 is not assigned to any US region. is_valid_number catches this because it checks against actual numbering plan data, not just format.

The default_region Parameter

If someone enters 212 555 1234 without a country code, the library needs to know what country to assume. That’s default_region. For a US-based company, set it to "US". If you have a country field in your CRM, use it per-record:

def normalize_with_country(raw: str, country: str | None) -> str | None:
    region = country.upper() if country else "US"
    return normalize_phone(raw, default_region=region)

Getting this wrong doesn’t crash anything. It just produces the wrong country code, which means the SMS goes nowhere and nobody knows why.

Part 3: Fixing Name Capitalization

Python’s built-in .title() method handles the simple case:

"JOHN DOE".title()       # → "John Doe"
"sarah chen".title()     # → "Sarah Chen"

It capitalizes the first letter after any non-alphabetic character, lowercases everything else. Fine for straightforward names. It breaks on three specific patterns that appear constantly in CRM data.

The Three Cases .title() Doesn’t Handle

Prefix names — McDonald, MacLeod:

"MCDONALD".title()   # → "Mcdonald" ✗
"MACLEOD".title()    # → "Macleod"  ✗

Single words, so there’s no internal boundary for .title() to act on, and the second capital never appears.

Possessives after apostrophes:

.title() treats an apostrophe as a word boundary and uppercases the letter that follows it. That happens to be right for Irish and Scottish surnames — "O'BRIEN".title() correctly yields "O'Brien" — but wrong whenever the letter after the apostrophe should stay lowercase:

"BILL'S HARDWARE".title()   # → "Bill'S Hardware" ✗

Curly quotes masquerading as apostrophes:

CRM exports frequently replace straight apostrophes (') with right single quotation marks (\u2019). They look identical on screen, but they don’t match straight apostrophes in string comparisons — so any fix-up rule or lookup written against ' silently skips them. Normalize them to straight apostrophes before doing anything else.

The Fix

import re

def capitalize_name(name: str) -> str:
    if not name or not isinstance(name, str):
        return name

    # Normalize curly quotes to straight apostrophes first
    name = name.replace('\u2019', "'").replace('\u2018', "'")

    # .title() handles the 90% case, including "O'BRIEN" → "O'Brien"
    name = name.title()

    # Fix Mc/Mac prefixes: "Mcdonald" → "McDonald", "Macleod" → "MacLeod"
    def fix_mc(match):
        prefix = match.group(1)   # "Mc" or "Mac"
        letter = match.group(2)   # First letter after prefix
        return prefix + letter.upper()

    name = re.sub(r'\b(Mc|Mac)([a-z])', fix_mc, name)

    # Undo .title()'s over-capitalization after apostrophes:
    # "Bill'S" → "Bill's" (leaves "O'Sullivan" untouched)
    name = re.sub(r"'S\b", "'s", name)

    return name

Against real CRM data:

capitalize_name("JOHN DOE")          # → "John Doe"
capitalize_name("sarah chen")        # → "Sarah Chen"
capitalize_name("O'BRIEN")           # → "O'Brien"
capitalize_name("o'connor")          # → "O'Connor"
capitalize_name("o\u2019brien")      # → "O'Brien"  (curly quote normalized)
capitalize_name("MCDONALD")          # → "McDonald"
capitalize_name("mcdonald")          # → "McDonald"
capitalize_name("MACLEOD")           # → "MacLeod"
capitalize_name("BILL'S DINER")      # → "Bill's Diner"
capitalize_name("mary-jane watson")  # → "Mary-Jane Watson"

One edge case worth noting: names like “DeVito” or “LeBron” carry internal capitals that no Mc/Mac-style rule predicts, and the Mc/Mac regex itself false-positives on names like “Macy”, which it turns into “MacY”. No general rule exists for these. If they matter for your data, maintain an exception list and apply it after the regex pass. For most B2B CRMs, they’re rare enough that manual correction is faster than building a pattern matcher.
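The exception list is a few lines. EXCEPTIONS here is a hypothetical hand-maintained mapping — grow it as oddities surface in your own data:

```python
# Hypothetical hand-curated casings, applied after the regex pass.
EXCEPTIONS = {"devito": "DeVito", "lebron": "LeBron", "macy": "Macy"}

def apply_exceptions(name: str) -> str:
    # Look up each word by its lowercased form; fall back to the
    # word unchanged when no curated casing exists.
    return " ".join(EXCEPTIONS.get(word.lower(), word) for word in name.split())

print(apply_exceptions("Tony Devito"))   # Tony DeVito
print(apply_exceptions("MacY"))          # Macy — also undoes a Mc/Mac false positive
```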

Putting It All Together

One function. Takes a raw CRM record. Returns a cleaned record:

import re
import emoji
from unidecode import unidecode
import phonenumbers

def clean_crm_record(record: dict) -> dict:
    cleaned = record.copy()

    # Text fields: strip emojis, transliterate, remove invisibles
    for field in ['first_name', 'last_name', 'company']:
        if cleaned.get(field):
            cleaned[field] = clean_text_field(cleaned[field])

    # Name fields: capitalize correctly
    for field in ['first_name', 'last_name']:
        if cleaned.get(field):
            cleaned[field] = capitalize_name(cleaned[field])

    # Phone: normalize to E.164
    if cleaned.get('phone'):
        # "or" (not a .get default) also covers country stored as None or ""
        country = (cleaned.get('country') or 'US').upper()
        normalized = normalize_phone(cleaned['phone'], default_region=country)
        if normalized:
            cleaned['phone'] = normalized
        else:
            cleaned['phone_invalid'] = cleaned['phone']  # Flag for review
            cleaned['phone'] = None

    # Email: lowercase (case-insensitive by spec)
    if cleaned.get('email'):
        cleaned['email'] = cleaned['email'].strip().lower()

    return cleaned


def clean_crm_batch(records: list[dict]) -> dict:
    cleaned = []
    flagged = []

    for record in records:
        result = clean_crm_record(record)
        if result.get('phone_invalid'):
            flagged.append(result)
        cleaned.append(result)

    return {
        'cleaned': cleaned,
        'total': len(records),
        'flagged_for_review': len(flagged)
    }

Running this against the 84,000-record export: 26,140 records were modified. 847 phone numbers flagged as invalid — most of them extension fragments or incomplete numbers sitting in the CRM for years without anyone noticing. Names fixed, invisible characters stripped, emojis gone.

The pipeline ran without a single silent skip after that.

pip install emoji Unidecode phonenumbers

Three libraries. None heavy. All maintained. Don’t reinvent any of this from scratch.


The Triumphoid Team
