Overview
A keyword intelligence tool that transforms raw product listings and searchbar autocomplete data into structured, ranked signals for commercially meaningful product research and listing optimisation across marketplaces and languages.
Why it’s interesting
It’s a multi-faceted production project that combines browser automation, NLP and ranking logic in a single tool that solves a real commercial problem: finding the language that top-performing listings actually use. Listings are scraped asynchronously to maximise speed.
What it does
- Corpus-Based Keyword Extraction
  - Ingests the top N product listings for a given seed term or query and asynchronously scrapes the Title, Bullets, Description, etc. from each listing.
  - Cleans and lemmatises the scraped data, then extracts candidate keyword phrases using a language-agnostic approach.
  - Scores each candidate keyword phrase on section frequency and weighting, the ranks of the listings it appeared in, and a measure of cross-listing coverage.
  - Breaks down each score by the contribution of each original surface term.
- Searchbar Autocomplete Expansion
  - Uses searchbar autocomplete suggestions to determine the best-performing keywords for a seed query.
  - Iteratively expands the seed query to discover new keywords from the base keyword.
  - Ranks keywords using frequency and suggestion-order statistics.
- Reverse Search Listing
  - Scrapes the keywords found in a single listing and compares them to a pre-stored corpus of listings to highlight which keyphrases that listing used.
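As a rough illustration of the corpus-based scoring step, the sketch below combines the three signals named above: section weighting, listing rank and cross-listing coverage. The section weights, the logarithmic rank discount and the multiplicative coverage term are all assumptions made for this example, not the tool's actual formula, and `score_phrases` is a hypothetical name.

```python
import math
from collections import defaultdict

# Hypothetical section weights: a hit in the title counts more than one in the description.
SECTION_WEIGHTS = {"title": 3.0, "bullets": 2.0, "description": 1.0}


def score_phrases(listings):
    """Score candidate phrases across ranked listings.

    `listings` is ordered by search rank (best first); each entry maps a
    section name to the candidate phrases found in that section.
    """
    weighted = defaultdict(float)  # section-weighted, rank-discounted frequency
    seen_in = defaultdict(set)     # ranks of the listings containing each phrase

    for rank, listing in enumerate(listings, start=1):
        rank_discount = 1.0 / math.log2(rank + 1)  # rank 1 -> 1.0, then decays
        for section, phrases in listing.items():
            section_weight = SECTION_WEIGHTS.get(section, 1.0)
            for phrase in phrases:
                weighted[phrase] += section_weight * rank_discount
                seen_in[phrase].add(rank)

    # Coverage: the fraction of listings in which the phrase appears at all.
    total = len(listings)
    scores = {p: weighted[p] * len(seen_in[p]) / total for p in weighted}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


listings = [
    {"title": ["steel bottle"], "bullets": ["leak proof"]},
    {"title": ["steel bottle"], "description": ["leak proof"]},
    {"title": ["water flask"]},
]
ranked = score_phrases(listings)
```

With this toy corpus, "steel bottle" ranks first because it appears in the titles of the two highest-ranked listings, while "water flask" ranks last because it appears only once, in the lowest-ranked listing.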
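The autocomplete expansion can be sketched in a similar way. The example below assumes a hypothetical `get_suggestions(query)` callable that returns the ordered suggestion list for a query (in the real tool this would come from scraping the searchbar), expands the seed breadth-first, and ranks keywords by how often they were suggested and by their mean suggestion position. The depth limit and the exact ranking key are illustrative assumptions.

```python
from collections import defaultdict, deque


def expand_keywords(seed, get_suggestions, max_depth=2):
    """Breadth-first expansion of a seed query via autocomplete suggestions."""
    positions = defaultdict(list)  # keyword -> positions at which it was suggested
    visited = {seed}
    frontier = deque([(seed, 0)])

    while frontier:
        query, depth = frontier.popleft()
        for position, suggestion in enumerate(get_suggestions(query), start=1):
            positions[suggestion].append(position)
            # Only queue unseen suggestions, and stop expanding at max_depth.
            if suggestion not in visited and depth + 1 < max_depth:
                visited.add(suggestion)
                frontier.append((suggestion, depth + 1))

    def sort_key(keyword):
        seen = positions[keyword]
        # More frequent first; ties broken by lower mean suggestion position.
        return (-len(seen), sum(seen) / len(seen))

    return sorted(positions, key=sort_key)


# Toy stand-in for a real searchbar.
FAKE_AUTOCOMPLETE = {
    "bottle": ["water bottle", "bottle opener"],
    "water bottle": ["water bottle insulated", "bottle opener"],
    "bottle opener": ["water bottle"],
}
keywords = expand_keywords("bottle", lambda q: FAKE_AUTOCOMPLETE.get(q, []))
```

Passing the suggestion source in as a callable keeps the ranking logic testable without a browser.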
Key Technical Points
- Asynchronous Pipeline: Used an asynchronous Producer-Batch Consumer pattern so that network-bound scraping and CPU/GPU-heavy spaCy operations could both run quickly and efficiently.
- Reliable Browser Automation: Fault-tolerant, marketplace-aware Playwright automation with retry paths, robust page-state detection and handling of inconsistent page states.
- Efficient Workflow: Used spaCy pipe operations, NumPy vectorised calculations, Pandas column operations, asyncio and dictionary/list comprehensions wherever possible to ensure fast runtime.
- Multi-lingual & Multi-market: Dynamically loads marketplace details and the correct spaCy model to support finding keywords in multiple languages across multiple marketplaces.
- Easy-to-use PyQt GUI: Custom, modern GUI with table inspection, filtering, error handling and logging.
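The Producer-Batch Consumer pattern mentioned above can be sketched with `asyncio` alone. This is a minimal illustration under stated assumptions, not the project's actual pipeline: `fetch_listing` stands in for a Playwright scrape, `process_batch` stands in for batched spaCy processing, and the queue size, batch size and all function names are hypothetical.

```python
import asyncio

SENTINEL = None  # marks the end of the stream


async def fetch_listing(url):
    """Stand-in for a Playwright scrape: simulates network latency."""
    await asyncio.sleep(0.01)
    return f"text scraped from {url}"


async def producer(urls, queue):
    """Scrape all URLs concurrently and enqueue each result as it completes."""
    async def scrape_one(url):
        await queue.put(await fetch_listing(url))

    await asyncio.gather(*(scrape_one(u) for u in urls))
    await queue.put(SENTINEL)


def process_batch(texts):
    """Stand-in for CPU-heavy NLP work on a whole batch of texts."""
    return [t.upper() for t in texts]


async def batch_consumer(queue, batch_size):
    """Collect items into batches and process each batch off the event loop."""
    results, batch, done = [], [], False
    while not done:
        item = await queue.get()
        if item is SENTINEL:
            done = True
        else:
            batch.append(item)
        # Flush when the batch is full, or when the stream has ended.
        if batch and (done or len(batch) >= batch_size):
            results.extend(await asyncio.to_thread(process_batch, batch))
            batch = []
    return results


async def run_pipeline(urls, batch_size=4):
    queue = asyncio.Queue(maxsize=16)  # bounded queue applies backpressure
    _, results = await asyncio.gather(
        producer(urls, queue), batch_consumer(queue, batch_size)
    )
    return results


urls = [f"https://example.com/item/{i}" for i in range(10)]
processed = asyncio.run(run_pipeline(urls))
```

Running the batch step through `asyncio.to_thread` (Python 3.9+) keeps the event loop free to continue scraping while a batch is being processed. Results arrive in completion order, so callers that need a stable order should carry an index alongside each item.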
Tech Stack
Language: Python
Data: Pandas, NumPy, scikit-learn
Scraping: Playwright, BeautifulSoup