
Building a search term n-gram analyzer in Python

Alex Langton · Senior B2B paid media manager · ~$650K/mo industrial spend

The problem with analyzing search term reports manually: you're looking at individual queries. You can't see the patterns.

"Industrial marking machine" — irrelevant. "Industrial marking system specifications" — relevant. "Industrial marking for metal parts" — relevant. "How to do industrial marking" — irrelevant.

Reading row by row, you catch each bad term. You add it as a negative. Two weeks later, a hundred new variations of the same bad terms appear.

What you're missing is the pattern: the word "how" appears in 40 irrelevant queries and zero relevant ones. If you could see that pattern, you'd add "how" as a negative and catch all 40 at once.

That's what an n-gram analyzer does.

What n-grams are

An n-gram is a sequence of n words. Unigrams are single words. Bigrams are two-word combinations. Trigrams are three-word sequences.

In a search term n-gram analysis, you're breaking every query into its component pieces and counting how often each piece appears across your entire query dataset — and how it correlates with conversions.
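For example, the query "industrial marking system specifications" breaks down like this (a quick standalone illustration; the reusable helpers are in the script below):

tokens = "industrial marking system specifications".lower().split()
bigrams = [' '.join(tokens[i:i+2]) for i in range(len(tokens) - 1)]
# ['industrial marking', 'marking system', 'system specifications']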

If "how" appears in 200 queries and zero of them converted: add "how" as a negative.

If "arc flash" appears in 50 queries and 35 of them converted: "arc flash" is a strong positive signal. Bid more aggressively on anything containing "arc flash."

This is pattern recognition that's impossible to do manually on a 10,000-row search term report.

The Python implementation

The core logic is simple. Here's the structure:

import pandas as pd
from collections import Counter

def tokenize(query):
    # Clean, lowercase, split
    return query.lower().replace(',', '').split()

def get_ngrams(tokens, n):
    # Sliding window of n consecutive tokens, joined back into strings
    return [' '.join(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

def analyze_ngrams(df, n=1):
    # df has columns: query, clicks, conversions, cost
    all_ngrams = []
    
    for _, row in df.iterrows():
        tokens = tokenize(row['query'])
        ngrams = get_ngrams(tokens, n)
        for ngram in ngrams:
            all_ngrams.append({
                'ngram': ngram,
                'clicks': row['clicks'],
                'conversions': row['conversions'],
                'cost': row['cost']
            })
    
    result = pd.DataFrame(all_ngrams).groupby('ngram').agg({
        'clicks': 'sum',
        'conversions': 'sum',
        'cost': 'sum'
    }).reset_index()
    
    result['conversion_rate'] = result['conversions'] / result['clicks']
    # Leave CPA undefined (NaN) where there are no conversions; dividing by
    # inf would report a misleading $0 CPA for exactly the terms you're hunting
    result['cpa'] = result['cost'] / result['conversions'].replace(0, float('nan'))
    
    return result.sort_values('cost', ascending=False)

Export your search term report from Google Ads as CSV. Load it into pandas. Run the analysis for unigrams, bigrams, and trigrams.
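A minimal sketch of that loading step (the filename and the column rename mapping are assumptions; match them to whatever headers your export actually uses):

import pandas as pd

df = pd.read_csv('search_terms.csv')  # hypothetical filename
df = df.rename(columns={
    'Search term': 'query',      # adjust these keys to your export's headers
    'Clicks': 'clicks',
    'Conversions': 'conversions',
    'Cost': 'cost',
})

unigrams = analyze_ngrams(df, n=1)
bigrams = analyze_ngrams(df, n=2)
trigrams = analyze_ngrams(df, n=3)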

Sort by cost descending. Look at the n-grams with high cost and zero conversions. Those are your new negative keywords.
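Continuing from the frames above, a quick filter for negative candidates (the $100 spend floor is an arbitrary threshold; tune it to your account):

negatives = unigrams[(unigrams['conversions'] == 0) & (unigrams['cost'] > 100)]
print(negatives.sort_values('cost', ascending=False).head(25))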

What to do with the output

Zero-conversion n-grams with significant spend: These are category-level patterns to add as negatives. "How to," "free," "tutorial," "schematic," "for sale" (if you're B2B only), "Amazon" — all appear as high-spend, zero-conversion unigrams in industrial accounts.

High-conversion n-grams: These are signals to build toward. If "arc flash" converts at 15% across all queries containing it, you want to find every reasonable variation of "arc flash" keywords you're not already bidding on and add them.
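Continuing the same sketch, one way to surface those positive signals (the click floor and rate cutoff are illustrative, not rules):

winners = unigrams[(unigrams['clicks'] >= 50) & (unigrams['conversion_rate'] >= 0.10)]
print(winners.sort_values('conversions', ascending=False).head(25))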

High-spend, low-conversion bigrams: "industrial equipment" might be converting at 0.5% while "industrial marking equipment" converts at 8%. The word "marking" is doing the filtering work. This tells you to add "marking" as a required element to any broad or phrase match targeting "industrial equipment."
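A quick way to see that kind of split, using the same frames (the term "equipment" here is just an illustration):

term = 'equipment'
subset = pd.concat([bigrams, trigrams])
subset = subset[subset['ngram'].str.contains(term, regex=False)]
print(subset.sort_values('conversion_rate', ascending=False)[
    ['ngram', 'clicks', 'conversions', 'conversion_rate', 'cost']])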

Scale and frequency

Run this analysis quarterly minimum. Monthly if you're running phrase or broad match campaigns that generate high query volume.

On a $650K/month portfolio, I run it monthly. The Python script takes about 3 minutes to run. The analysis time is maybe 30 minutes. The new negatives and positive keyword discoveries from each run are worth thousands of dollars in improved targeting.

Signal, one of the Langton Tools extensions, automates this in the Google Ads UI — highlights which n-grams are statistically associated with irrelevant traffic vs. qualified traffic, so you can make decisions without running the Python script manually.

But even without Signal, the Python version above is enough to transform how you understand your search term data.

Alex Langton

Senior B2B paid media manager · ~$650K/mo industrial spend

12+ years running B2B Google Ads accounts in industrial, manufacturing, and B2B e-commerce. Builds Langton Tools because generic PPC SaaS was never designed for the multi-MCC, complex-pacing, B2B-vocabulary reality of the accounts that actually drive industrial revenue.