12 Commits

Author SHA1 Message Date
08f21d9bc9 Add job cancellation — backend pipeline + cancel API route
pipelineJobs.js:
- cancelJob(jobId): marks job as cancelled=true, status=cancelling
- isJobCancelled(jobId): checked by the pipeline between stages

runSourcePipeline.js:
- PipelineCancelledError class
- checkCancelled() called before each of the 6 pipeline stages
- Accepts options.isCancelled() callback from the job runner

runKytPipelineJob.js:
- Passes isCancelled: () => isJobCancelled(job.id) into pipeline
- Catches PipelineCancelledError separately, sets status=cancelled

routes/pipeline.js:
- POST /pipeline/cancel/:jobId — marks job for cancellation
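
The cancellation flow above can be pictured roughly as follows. This is a minimal sketch, not the committed code: the in-memory `jobs` map, the `runJob` wrapper, the `stageFns` object, and the job shape are assumptions made for illustration, and the stage names are taken from the 6-stage pipeline description (fetch, download, watermark, upload, convert, upsert).

```javascript
// Sketch of cooperative cancellation. Only PipelineCancelledError,
// checkCancelled, isCancelled, cancelJob and isJobCancelled are names from
// the commit message; everything else here is hypothetical.
class PipelineCancelledError extends Error {
  constructor(stage) {
    super(`Pipeline cancelled before stage "${stage}"`);
    this.name = 'PipelineCancelledError';
    this.stage = stage;
  }
}

// Assumed in-memory job store keyed by job id.
const jobs = new Map();

// Marks the job for cancellation; the pipeline notices at the next stage boundary.
function cancelJob(jobId) {
  const job = jobs.get(jobId);
  if (!job) return false;
  job.cancelled = true;
  job.status = 'cancelling';
  return true;
}

function isJobCancelled(jobId) {
  const job = jobs.get(jobId);
  return Boolean(job && job.cancelled);
}

const STAGES = ['fetch', 'download', 'watermark', 'upload', 'convert', 'upsert'];

// Runs each stage in order, checking for cancellation before every stage.
async function runSourcePipeline(stageFns, options = {}) {
  const isCancelled = options.isCancelled || (() => false);
  const checkCancelled = (stage) => {
    if (isCancelled()) throw new PipelineCancelledError(stage);
  };
  const completed = [];
  for (const stage of STAGES) {
    checkCancelled(stage);
    await stageFns[stage]();
    completed.push(stage);
  }
  return completed;
}

// Hypothetical job runner: cancellation is handled separately from real failures.
async function runJob(job, stageFns) {
  jobs.set(job.id, job);
  job.status = 'running';
  try {
    await runSourcePipeline(stageFns, { isCancelled: () => isJobCancelled(job.id) });
    job.status = 'completed';
  } catch (err) {
    if (err instanceof PipelineCancelledError) {
      job.status = 'cancelled';
    } else {
      job.status = 'failed';
      throw err;
    }
  }
  return job;
}
```

Because the check only runs between stages, a stage that is already in flight finishes before the job moves from cancelling to cancelled.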

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-04 16:57:35 +05:30
4e536f08b3 Add 24h cache to motousher and dirtstreet fetchWebsiteData
Subsequent pipeline runs within 24 hours reuse the existing
01_products_aggregated.json instead of re-scraping all brands,
eliminating redundant HTTP requests and 429 rate-limit retries.

Cache lifetime is controlled per source via environment variables:
  MOTOUSHER_CACHE_HOURS=0  → always re-scrape
  DIRTSTREET_CACHE_HOURS=0 → always re-scrape
  (default: 24h)
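
A freshness check along these lines would give the behaviour described. It is a sketch under assumptions: the commit does not say how cache age is measured, so the file's last-modified time is assumed here, and the helper names `cacheHoursFor` and `isCacheFresh` are hypothetical.

```javascript
// Reads <SOURCE>_CACHE_HOURS (e.g. MOTOUSHER_CACHE_HOURS) from an env object.
// Falls back to the 24h default when the variable is unset or invalid.
function cacheHoursFor(source, env) {
  const raw = env[`${source.toUpperCase()}_CACHE_HOURS`];
  if (raw === undefined || raw === '') return 24;
  const hours = Number(raw);
  return Number.isFinite(hours) && hours >= 0 ? hours : 24;
}

// fileMtimeMs: last-modified time of 01_products_aggregated.json in ms since
// the epoch, or null when the file does not exist (assumed freshness signal).
function isCacheFresh(fileMtimeMs, cacheHours, nowMs = Date.now()) {
  // A lifetime of 0 disables the cache, so every run re-scrapes.
  if (fileMtimeMs === null || cacheHours <= 0) return false;
  return nowMs - fileMtimeMs < cacheHours * 60 * 60 * 1000;
}
```

In a real run the modification time could come from `fs.statSync(path).mtimeMs`, with `process.env` passed as `env`.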

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-04 16:44:38 +05:30
6eac0b92ed Fix converter reading nested scraped data and brand priority
Both motousher and dirtstreet converters were reading product fields
(title, sku, price, images) directly from the aggregated record, but
those fields live inside record.scraped after fetchWebsiteData wraps
them. As a result, imported products showed the title "Untitled
Product", missing images, and SKU=variant-1.

Also fixed brand priority: per-product brand (e.g. Evans Coolant,
SC Project) now takes precedence over the global SHOPIFY_BRAND env
var (KYT), which was incorrectly overriding all products from the
new sources.
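
The two fixes can be illustrated with a small field-mapping helper. This is only a sketch: `pickProductFields` is a hypothetical name, the location of the per-product brand (`scraped.brand`, then `record.brand`) is an assumption, and the "Untitled Product" and "variant-1" fallbacks are inferred from the symptoms listed above rather than taken from the converter source.

```javascript
// Hypothetical helper: read product fields from record.scraped (where
// fetchWebsiteData nests them) and resolve the brand per product first.
function pickProductFields(record, env = {}) {
  const scraped = record.scraped || {};
  return {
    title: scraped.title || 'Untitled Product',
    sku: scraped.sku || 'variant-1',
    price: scraped.price ?? null,
    images: Array.isArray(scraped.images) ? scraped.images : [],
    // Per-product brand wins; the global SHOPIFY_BRAND is only a fallback.
    brand: scraped.brand || record.brand || env.SHOPIFY_BRAND || '',
  };
}
```

Reading the same fields from the top level of the record would hit the fallbacks on every product, which matches the broken output described in this commit.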

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-04 16:39:54 +05:30
c0132ab0aa Add motousher and dirtstreet as active import pipeline sources
Both sources are now registered in sources/index.js and fully wired
into the 6-stage pipeline (fetch → download → watermark → upload →
convert → upsert). The frontend will automatically show them as tabs
via GET /pipeline/sources without any frontend changes needed.

motousher/ (Shopify JSON API — 12 brands, ~2,446 products):
- scraper.js: fetches /collections/{slug}/products.json + /products/{handle}.json
- converter.js: maps scraped products to standard pipeline format
- index.js: fetchWebsiteData() loops all brands, normalises to
  productSummary.img format for shared download/upload utilities
- Supports MOTOUSHER_BRANDS env var to filter brands on a run

dirtstreet/ (WooCommerce HTML + JSON-LD — 5 brands, ~1,087 products):
- scraper.js: pure fetch, paginates /brand/{slug}/page/N/,
  extracts price from offers.priceSpecification[0].price,
  stock from JSON-LD availability field
- converter.js: maps scraped products to standard pipeline format,
  builds descriptionHtml from body + short desc + attributes table
- index.js: fetchWebsiteData() loops all brands, normalises to
  productSummary.img format
- Supports DIRTSTREET_BRANDS env var to filter brands on a run

sources/index.js: registered all 4 sources (kyt, brocks-performance,
motousher, dirtstreet). GET /pipeline/sources now returns all 4.
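
For orientation, a registry of this kind could look like the sketch below. The real sources/index.js is not shown in this log, so `registerSource`, `listSources`, the `label` values, and the stub `fetchWebsiteData` functions are all assumptions; only the four source ids come from the commit.

```javascript
// Hypothetical registry shape; the actual sources/index.js may differ.
const sources = new Map();

function registerSource(source) {
  if (sources.has(source.id)) {
    throw new Error(`Duplicate source id: ${source.id}`);
  }
  sources.set(source.id, source);
}

// What a GET /pipeline/sources handler might return: ids and display labels
// only, so the frontend can render one tab per registered source.
function listSources() {
  return [...sources.values()].map(({ id, label }) => ({ id, label }));
}

// Stub entries; labels and fetchWebsiteData bodies are placeholders.
registerSource({ id: 'kyt', label: 'KYT', fetchWebsiteData: async () => [] });
registerSource({ id: 'brocks-performance', label: 'Brocks Performance', fetchWebsiteData: async () => [] });
registerSource({ id: 'motousher', label: 'Motousher', fetchWebsiteData: async () => [] });
registerSource({ id: 'dirtstreet', label: 'Dirtstreet', fetchWebsiteData: async () => [] });
```

Driving the tabs from the registry is what lets a new source appear in the UI without a frontend change.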

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-04 12:25:31 +05:30
b8d9478afa Add test_source scrapers for motousher.com and dirtstreet.in
Adds two new experimental product scrapers under test_source/, isolated
from the active pipeline until verified and ready to promote.

motousher/ (Shopify store — Shopify JSON API):
- Scrapes 12 brands: All Balls Racing, DID Chains, EBC Brakes, Esjot
  Sprockets, Evans Coolant, Grip Puppies, HiFlo Filters, JT Sprockets,
  Maxima Racing Oils, Putoline, Ram Mount, Wunderlich
- 2,446 products total scraped and verified
- Uses /collections/{slug}/products.json + /products/{handle}.json
- Parallel fetch (concurrency 3), paginated collection listing
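
One common way to cap parallel requests at 3 is a small worker pool, sketched below. The scraper's actual mechanism is not shown in this log; `mapWithConcurrency` is a hypothetical helper, and in practice `worker` would fetch one /products/{handle}.json document.

```javascript
// Runs worker(item, index) over all items with at most `limit` calls in
// flight at once, and returns results in the original item order.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each runner repeatedly claims the next unprocessed index until none remain.
  async function run() {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }

  const runnerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => run()));
  return results;
}
```

A call such as `mapWithConcurrency(handles, 3, fetchProduct)` would then keep three requests open at a time, where `handles` and `fetchProduct` stand in for the scraper's own list and fetch function.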

dirtstreet/ (WooCommerce store — HTML + JSON-LD):
- Scrapes 5 brands: SC Project, Evotech Performance, DNA Air Filters,
  WRS, Zero Gravity Racing
- 1,087 products total scraped and verified
- Pure fetch with JSON-LD schema.org extraction (no browser)
- Handles paginated /brand/{slug}/page/N/ archives
- Price extracted from offers.priceSpecification[0].price
- Stock status derived from JSON-LD availability field
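
The JSON-LD extraction described above might be approximated as follows. This is a sketch, not the scraper's code: `extractJsonLdProduct` is a hypothetical name, the regex-based script-tag lookup is an assumption (the commit only says "pure fetch" with no browser), and real product pages may nest the Product node differently.

```javascript
// Finds the first schema.org Product in a page's JSON-LD blocks and pulls
// out the price (offers.priceSpecification[0].price) and stock status.
function extractJsonLdProduct(html) {
  const re = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  let match;
  while ((match = re.exec(html)) !== null) {
    let data;
    try {
      data = JSON.parse(match[1]);
    } catch {
      continue; // Skip malformed JSON-LD blocks.
    }
    // JSON-LD may be a single node, an array of nodes, or an @graph wrapper.
    let nodes = [data];
    if (Array.isArray(data)) nodes = data;
    else if (data && Array.isArray(data['@graph'])) nodes = data['@graph'];

    const product = nodes.find((n) => n && n['@type'] === 'Product');
    if (!product) continue;

    const offer = Array.isArray(product.offers) ? product.offers[0] : product.offers;
    const spec = offer && offer.priceSpecification;
    const rawPrice = Array.isArray(spec) && spec[0] ? spec[0].price : undefined;
    const availability = (offer && offer.availability) || '';
    return {
      name: product.name || null,
      price: rawPrice !== undefined ? Number(rawPrice) : null,
      // schema.org availability values end in InStock, OutOfStock, etc.
      inStock: /InStock$/i.test(availability),
    };
  }
  return null;
}
```

Treating only values ending in `InStock` as available is a simplification; other availability values such as `PreOrder` or `BackOrder` would count as out of stock here.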

Both scrapers are standalone (node index.js), support --brand and
--limit flags, save per-brand JSON files and a combined.json.
Scraped data lives in data/sources/test_source/ (gitignored).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-04 12:17:23 +05:30
1d254a9009 Install Playwright browser for backend 2026-05-15 00:25:40 +05:30
68949f124e Add multi-source import pipeline 2026-05-14 23:57:27 +05:30
bef07eff10 Refactor code structure for improved readability and maintainability 2026-05-14 23:56:13 +05:30
2320d1d5c3 Fix typo in health check service name 2026-04-14 14:23:00 +05:30
33ad269821 Fix health check service name in response 2026-04-14 14:22:25 +05:30
9480832478 Add concurrency handling and logging enhancements to KYT pipeline 2026-04-14 13:26:56 +05:30
e87bd907ea first commit 2026-04-13 17:31:26 +05:30