Client Success Story

Discover How We Helped a Client Automate Product Data Scraping and Enrichment While Keeping Humans in the Loop

The client

eCommerce Reseller of Machinery, Industrial Supplies, and Household Appliances

Founded in 1981, our client has established itself as a reliable eCommerce reseller with long-standing relationships with top brand manufacturers. They specialize in machinery, equipment, and supplies, operating across various sectors such as hardware, plumbing, and electrical goods. Their extensive portfolio includes over 7,000 brands, all managed through two dedicated eCommerce platforms, ensuring a broad and diverse product offering for their vendors.

PROJECT REQUIREMENTS

Product Information Management for 2M+ SKUs

The client, managing over 2 million products across its websites, needed our help in extracting relevant data and ensuring the information was clean and enriched. Our support was requested in the following key areas:

PROJECT CHALLENGES

Addressing Diverse Website Structures, Anti-Bot Obstacles, and Script Scalability Challenges in Web Scraping

Our Solution

Custom Script Creation, Taxonomy Design & AI-Driven Enrichment

To meet the project requirements, we assigned a dedicated team of six professionals, including data scraping experts, a prompt engineer, and a QA resource.

Custom Script Creation for Product Data Scraping

The websites to be scraped were provided in batches. Our data scraping team developed custom Python scripts and used extraction tools to pull complex, dynamic content from multiple brand websites. These scripts automated the collection of thousands of product entries with high precision. The extracted data included key attributes such as product categories, descriptions, pricing, taxonomy, and reviews. Our team then categorized and organized this data and removed special characters, all within the client's specified time frame.
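As an illustration of the special-character cleanup mentioned above, here is a minimal Python sketch using only the standard library. The case study does not say which characters were removed, so the allowed character set and the function name `clean_text` are assumptions made for this example, not the team's actual rules.

```python
import html
import re
import unicodedata

# Assumption: "special characters" means HTML entities, stray symbols, and
# irregular whitespace. The allowed punctuation below is illustrative only.
_DISALLOWED = re.compile(r"[^\w\s.,;:%/()&+\-'\"]")
_WHITESPACE = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Normalize one scraped text field."""
    text = html.unescape(raw)                   # "&amp;" becomes "&"
    text = unicodedata.normalize("NFKC", text)  # non-breaking space becomes a plain space
    text = _DISALLOWED.sub(" ", text)           # drop symbols outside the allowed set
    return _WHITESPACE.sub(" ", text).strip()   # collapse runs of whitespace


print(clean_text("Cordless&nbsp;Drill &amp; Charger\u00ae \n 18V"))
```

Keeping the allowed set explicit makes it easy to adjust per client specification, for example to preserve inch marks in dimensions.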

To address scraping challenges, we employed a variety of techniques:

1. Curl Requests were used to interact directly with the client's web resources, enabling efficient data testing and retrieval while bypassing web interface restrictions.

2. Python Requests were employed to automate HTTP requests to specified URLs, ensuring consistent and error-free data downloads.

3. BeautifulSoup Objects were utilized for parsing HTML content, allowing for efficient extraction and cleaning of specific data points for further processing.
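The Requests and BeautifulSoup steps above can be sketched roughly as follows. This is a simplified illustration, not the project's scripts: the CSS selectors, the field names, and the sample HTML are hypothetical, and each real brand website would need its own selector mapping.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical selectors: every brand site needs its own mapping.
SELECTORS = {
    "name": "h1.product-title",
    "price": "span.price",
    "description": "div.product-description",
}


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page, raising on HTTP errors instead of parsing an error page."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def parse_product(page_html: str) -> dict:
    """Extract the configured fields; elements that are absent become None."""
    soup = BeautifulSoup(page_html, "html.parser")
    record = {}
    for field, selector in SELECTORS.items():
        node = soup.select_one(selector)
        record[field] = node.get_text(" ", strip=True) if node else None
    return record


# Made-up sample page so the parsing step can run without network access.
SAMPLE = """
<html><body>
  <h1 class="product-title">Bench Grinder 8 in.</h1>
  <span class="price">$129.00</span>
</body></html>
"""
print(parse_product(SAMPLE))
```

Separating the download from the parsing lets a partially scraped record (here, one with no description) be passed on with an explicit `None` rather than failing the whole batch.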

To reduce the risk of website blocking and ensure compliance with legal and ethical standards, the client provided clear guidelines on which data could be scraped and which should be excluded.

Taxonomy Development and UNSPSC Categorization

To meet the client’s need for a structured product categorization system, we designed a custom taxonomy based on Google’s framework, tailored to the client’s specific product line. Using ChatGPT and the UNSPSC website, we assigned accurate UNSPSC codes to each product. ChatGPT was used to identify the closest category matches, and our team meticulously verified all codes and categories to ensure there were no errors or discrepancies in the classification.
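To show how a suggested code might be screened before manual verification, here is a small Python sketch. It relies on the fact that a UNSPSC code has eight digits made up of four two-digit levels (segment, family, class, commodity). The function names and the reference set are assumptions for illustration; in practice the reference list would come from the official UNSPSC code set.

```python
import re

UNSPSC_PATTERN = re.compile(r"[0-9]{8}")


def split_unspsc(code: str) -> dict:
    """Split an 8-digit UNSPSC code into its four two-digit levels."""
    if not UNSPSC_PATTERN.fullmatch(code):
        raise ValueError(f"not an 8-digit UNSPSC code: {code!r}")
    return {
        "segment": code[0:2],
        "family": code[2:4],
        "class": code[4:6],
        "commodity": code[6:8],
    }


def needs_review(suggested: str, reference_codes: set) -> bool:
    """Flag a model-suggested code that is malformed or absent from the reference list."""
    return not UNSPSC_PATTERN.fullmatch(suggested) or suggested not in reference_codes


# Placeholder reference set; real values would be loaded from the UNSPSC code set.
reference = {"27112700"}
print(split_unspsc("27112700"))
print(needs_review("2711270", reference))
```

A check like this only catches malformed or unknown codes. Whether a valid code is the right one for a product still requires the human verification described above.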

Data Enrichment Using Custom ChatGPT Development

We purchased API tokens and developed a custom GPT solution using ChatGPT-4 to address partially scraped data and missing product information in the client's database. This automated the filling of gaps such as product descriptions, weights, and categories and kept the data consistent. Our prompt engineers crafted highly specific prompts to guide ChatGPT toward accurate and relevant results, so that the enriched data met the client’s quality standards.
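The gap-filling flow described above could look something like the following Python sketch. The prompt wording is illustrative and is not the team's actual prompt. The `complete` parameter is a hypothetical caller-supplied function that sends a prompt to the model API and returns its text reply, which keeps the sketch independent of any particular client library.

```python
import json
from typing import Callable

ENRICHABLE_FIELDS = ("description", "weight", "category")  # gaps named in the case study


def missing_fields(record: dict) -> list:
    """List the enrichable fields that are empty or absent."""
    return [name for name in ENRICHABLE_FIELDS if not record.get(name)]


def build_prompt(record: dict, fields: list) -> str:
    """Build a narrowly scoped prompt from the attributes already known."""
    known = {k: v for k, v in record.items() if v}
    return (
        "You are completing an industrial product catalog record.\n"
        f"Known attributes (JSON): {json.dumps(known, sort_keys=True)}\n"
        f"Return a JSON object with exactly these keys: {', '.join(fields)}.\n"
        "Use null for any value you cannot determine; do not guess."
    )


def enrich(record: dict, complete: Callable[[str], str]) -> dict:
    """Fill gaps using `complete`, a hypothetical wrapper around the model API."""
    fields = missing_fields(record)
    if not fields:
        return dict(record)
    suggestions = json.loads(complete(build_prompt(record, fields)))
    enriched = dict(record)
    for name in fields:
        if suggestions.get(name) is not None:
            enriched[name] = suggestions[name]
    # Track which fields came from the model so reviewers can validate them.
    enriched["_needs_review"] = [name for name in fields if enriched.get(name)]
    return enriched
```

Asking for JSON with a fixed key list and an explicit null option makes the reply easy to parse and gives the model a way to decline rather than invent a value. A production version would also need to handle replies that are not valid JSON.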

We also leveraged our in-house master database for enrichment, effectively supplementing and improving the quality of records through the integration of additional relevant information.
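Supplementing records from a master database can be pictured as a fill-only merge, sketched below in Python. The structure of the in-house database is not described in the case study, so the dictionary keyed by SKU, the function name `fill_from_master`, and the sample values are all assumptions.

```python
def fill_from_master(record: dict, master: dict, key: str = "sku") -> dict:
    """Copy master values into empty fields only; never overwrite scraped data."""
    reference = master.get(record.get(key), {})
    merged = dict(record)
    for field, value in reference.items():
        if merged.get(field) in (None, "") and value not in (None, ""):
            merged[field] = value
    return merged


# Made-up sample data for illustration.
master_db = {"A1": {"weight": "38 lb", "description": "Master description"}}
scraped = {"sku": "A1", "description": "Scraped description", "weight": None}
print(fill_from_master(scraped, master_db))
```

Filling only empty fields keeps the scraped source as the primary record and uses the master database strictly as a supplement.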

Optimizing Efficiency Through Automation with Human Oversight

Our approach combines automated processes with strategic human oversight to maximize efficiency and ensure accuracy. As requested by the client, we have successfully automated multiple workflows, including data scraping, enrichment, categorization, and cleansing. While automation significantly accelerates these processes, human expertise remains essential for validation and refinement, ultimately providing the most reliable solution.

| TASK | AUTOMATION | HUMAN INTERVENTION |
| --- | --- | --- |
| Data Scraping | Automated using Python scripts or scraping tools to extract data. | Manual review to verify accuracy, remove unwanted characters, and ensure the data is formatted to the client's specifications. |
| Data Enrichment | ChatGPT was used to fill in missing information in the dataset. | Human resources validated the data and used the in-house master database to ensure all gaps were accurately filled. |
| Product Categorization | ChatGPT provided initial category suggestions and assigned UNSPSC codes. | Resources manually mapped codes using the UNSPSC website, as ChatGPT occasionally provided outdated or ambiguous codes. |

Project Outcomes

Comprehensive Taxonomy Implementation

Successfully developed and implemented a product categorization system for over 2 million items on the client's platform

Superior Data Quality Assurance

Delivered 99.8% error-free data through meticulous cleansing and accurate product categorization

Operational Efficiency Optimization

Achieved a 78% increase in efficiency through strategic task automation

Project Workflow

[Flowchart: project workflow]
