10 Commits

Author SHA1 Message Date
6eac0b92ed Fix converter reading nested scraped data and brand priority
Both the motousher and dirtstreet converters were reading product
fields (title, sku, price, images) directly from the aggregated record,
but those fields live inside record.scraped after fetchWebsiteData wraps
them. As a result, products were imported as "Untitled Product" with
missing images and SKU=variant-1.

Also fixed brand priority: per-product brand (e.g. Evans Coolant,
SC Project) now takes precedence over the global SHOPIFY_BRAND env
var (KYT), which was incorrectly overriding all products from the
new sources.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-04 16:39:54 +05:30
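
The fix described in 6eac0b92ed can be pictured roughly as follows. This is a minimal sketch, not the repository's converter: the function name `toPipelineProduct`, the `brand` field on the scraped data, the fallback values, and the sample product are all assumptions made for illustration. Only `record.scraped`, the `SHOPIFY_BRAND` env var, and the field names title, sku, price, and images come from the commit message.

```javascript
// Hypothetical converter sketch. Only record.scraped and SHOPIFY_BRAND
// are named in the commit message; the rest is illustrative.
function toPipelineProduct(record, env = process.env) {
  // fetchWebsiteData wraps the product fields in record.scraped,
  // so read from there rather than from the aggregated record itself.
  const scraped = record.scraped ?? {};

  // A per-product brand wins; the global env var is only a fallback.
  const brand = scraped.brand || env.SHOPIFY_BRAND || null;

  return {
    title: scraped.title ?? 'Untitled Product',
    sku: scraped.sku ?? null,
    price: scraped.price ?? null,
    images: Array.isArray(scraped.images) ? scraped.images : [],
    brand,
  };
}

// Made-up sample record, with the global brand set to KYT.
const example = toPipelineProduct(
  { scraped: { title: 'Coolant 1L', sku: 'EV-1', price: '999.00', images: [], brand: 'Evans Coolant' } },
  { SHOPIFY_BRAND: 'KYT' }
);
console.log(example.brand); // "Evans Coolant"
```

Passing the environment as a parameter is a design choice for this sketch only; it keeps the precedence rule easy to test without touching `process.env`.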
c0132ab0aa Add motousher and dirtstreet as active import pipeline sources
Both sources are now registered in sources/index.js and fully wired
into the 6-stage pipeline (fetch → download → watermark → upload →
convert → upsert). The frontend will automatically show them as tabs
via GET /pipeline/sources, with no frontend changes needed.

motousher/ (Shopify JSON API — 12 brands, ~2,446 products):
- scraper.js: fetches /collections/{slug}/products.json + /products/{handle}.json
- converter.js: maps scraped products to standard pipeline format
- index.js: fetchWebsiteData() loops all brands, normalises to
  productSummary.img format for shared download/upload utilities
- Supports MOTOUSHER_BRANDS env var to filter brands on a run

dirtstreet/ (WooCommerce HTML + JSON-LD — 5 brands, ~1,087 products):
- scraper.js: pure fetch, paginates /brand/{slug}/page/N/,
  extracts price from offers.priceSpecification[0].price,
  stock from JSON-LD availability field
- converter.js: maps scraped products to standard pipeline format,
  builds descriptionHtml from body + short desc + attributes table
- index.js: fetchWebsiteData() loops all brands, normalises to
  productSummary.img format
- Supports DIRTSTREET_BRANDS env var to filter brands on a run

sources/index.js: registered all 4 sources (kyt, brocks-performance,
motousher, dirtstreet). GET /pipeline/sources now returns all 4.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-04 12:25:31 +05:30
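
As a rough illustration of the registration and brand filtering described in c0132ab0aa, the sketch below shows one way a source registry and an env-var brand filter could fit together. The registry shape, the `brandsEnvVar` field, the helper names `listSources` and `selectBrands`, the brand slugs, and the comma-separated format of `MOTOUSHER_BRANDS` / `DIRTSTREET_BRANDS` are all assumptions; the commit message only names the four sources and the two env vars.

```javascript
// Hypothetical registry shape; the real sources/index.js may differ.
const sources = {
  kyt: {},
  'brocks-performance': {},
  motousher: { brandsEnvVar: 'MOTOUSHER_BRANDS' },
  dirtstreet: { brandsEnvVar: 'DIRTSTREET_BRANDS' },
};

// What a GET /pipeline/sources handler could return: the registered ids.
function listSources() {
  return Object.keys(sources);
}

// Assumes the env var holds a comma-separated list of brand slugs.
// An unset or empty value means "all brands".
function selectBrands(allBrands, rawFilter) {
  const wanted = (rawFilter ?? '')
    .split(',')
    .map((s) => s.trim().toLowerCase())
    .filter(Boolean);
  if (wanted.length === 0) return allBrands;
  return allBrands.filter((b) => wanted.includes(b.toLowerCase()));
}

console.log(listSources());
console.log(selectBrands(['ebc-brakes', 'putoline', 'ram-mount'], 'putoline, ram-mount'));
```

In a real run the second argument would presumably be something like `process.env[sources.motousher.brandsEnvVar]`.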
b8d9478afa Add test_source scrapers for motousher.com and dirtstreet.in
Adds two new experimental product scrapers under test_source/, isolated
from the active pipeline until verified and ready to promote.

motousher/ (Shopify store — Shopify JSON API):
- Scrapes 12 brands: All Balls Racing, DID Chains, EBC Brakes, Esjot
  Sprockets, Evans Coolant, Grip Puppies, HiFlo Filters, JT Sprockets,
  Maxima Racing Oils, Putoline, Ram Mount, Wunderlich
- 2,446 products total scraped and verified
- Uses /collections/{slug}/products.json + /products/{handle}.json
- Parallel fetch (concurrency 3), paginated collection listing
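
The paginated listing and the concurrency limit mentioned above can be sketched as below. This is an assumption-laden illustration rather than the scraper's actual code: `mapWithConcurrency`, `listCollection`, the injected `fetchJson` function, the `limit`/`page` query parameters, and the stop-on-short-page rule are all choices made for this sketch, and the sample data is invented. Injecting `fetchJson` lets the example run without network access.

```javascript
// Run `worker` over `items` with at most `limit` calls in flight.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function run() {
    // Each runner pulls the next unclaimed index until none are left.
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }
  const runners = Array.from({ length: Math.min(limit, items.length) }, () => run());
  await Promise.all(runners);
  return results;
}

// Walk /collections/{slug}/products.json page by page.
// Stops at the first page that returns fewer than `pageSize` products.
async function listCollection(baseUrl, slug, fetchJson, pageSize = 250) {
  const products = [];
  for (let page = 1; ; page++) {
    const url = `${baseUrl}/collections/${slug}/products.json?limit=${pageSize}&page=${page}`;
    const body = await fetchJson(url);
    const batch = body.products ?? [];
    products.push(...batch);
    if (batch.length < pageSize) break;
  }
  return products;
}

// Demo with an in-memory stand-in for the network (made-up data).
const fakePages = { 1: [{ handle: 'a' }, { handle: 'b' }], 2: [{ handle: 'c' }] };
const fakeFetchJson = async (url) => {
  const page = Number(new URL(url).searchParams.get('page'));
  return { products: fakePages[page] ?? [] };
};

listCollection('https://example.test', 'ebc-brakes', fakeFetchJson, 2)
  .then((listing) =>
    // Fetch per-product detail with at most 3 requests in flight.
    mapWithConcurrency(listing, 3, async (p) => `/products/${p.handle}.json`)
  )
  .then((paths) => console.log(paths));
```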

dirtstreet/ (WooCommerce store — HTML + JSON-LD):
- Scrapes 5 brands: SC Project, Evotech Performance, DNA Air Filters,
  WRS, Zero Gravity Racing
- 1,087 products total scraped and verified
- Pure fetch with JSON-LD schema.org extraction (no browser)
- Handles paginated /brand/{slug}/page/N/ archives
- Price extracted from offers.priceSpecification[0].price
- Stock status derived from JSON-LD availability field
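
The JSON-LD extraction described above might look something like the following sketch. It is not the scraper's actual code: the function name `extractJsonLdProduct`, the regex-based script matching, the handling of arrays and `@graph` wrappers, the treatment of `offers` as either an object or an array, and the sample HTML are all assumptions. The `offers.priceSpecification[0].price` path and the `availability` field are taken from the commit message.

```javascript
// Pull the first schema.org Product out of a page's JSON-LD blocks.
function extractJsonLdProduct(html) {
  const re = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  for (const match of html.matchAll(re)) {
    let data;
    try {
      data = JSON.parse(match[1]);
    } catch {
      continue; // skip malformed blocks
    }
    if (!data || typeof data !== 'object') continue;

    // A block may hold a single node, an array, or an @graph wrapper.
    const graph = data['@graph'];
    const nodes = Array.isArray(data) ? data : Array.isArray(graph) ? graph : [data];
    const product = nodes.find((n) => n && n['@type'] === 'Product');
    if (!product) continue;

    // Assumption: offers may be a single object or an array of offers.
    const offer = Array.isArray(product.offers) ? product.offers[0] : product.offers;
    const availability = offer?.availability ?? '';
    return {
      name: product.name ?? null,
      price: offer?.priceSpecification?.[0]?.price ?? null,
      // Matches both http:// and https:// schema.org/InStock URLs.
      inStock: /InStock$/i.test(availability),
    };
  }
  return null;
}

// Made-up sample page for illustration.
const sampleHtml = `
<html><head>
<script type="application/ld+json">
{"@type":"Product","name":"Slip-on Exhaust",
 "offers":{"availability":"https://schema.org/InStock",
           "priceSpecification":[{"price":"45000.00"}]}}
</script>
</head></html>`;

console.log(extractJsonLdProduct(sampleHtml));
```

A regex is enough here only because the script tag's body is opaque JSON; a page with unusual markup would call for a real HTML parser.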

Both scrapers are standalone (node index.js), support --brand and
--limit flags, save per-brand JSON files and a combined.json.
Scraped data lives in data/sources/test_source/ (gitignored).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-04 12:17:23 +05:30
1d254a9009 Install Playwright browser for backend 2026-05-15 00:25:40 +05:30
68949f124e Add multi-source import pipeline 2026-05-14 23:57:27 +05:30
bef07eff10 Refactor code structure for improved readability and maintainability 2026-05-14 23:56:13 +05:30
2320d1d5c3 Fix typo in health check service name 2026-04-14 14:23:00 +05:30
33ad269821 Fix health check service name in response 2026-04-14 14:22:25 +05:30
9480832478 Add concurrency handling and logging enhancements to KYT pipeline 2026-04-14 13:26:56 +05:30
e87bd907ea first commit 2026-04-13 17:31:26 +05:30