Building a Production Web Scraper
Jan 18, 2026
Most Go concurrency tutorials show you how to spawn goroutines and pass data through channels. Then you try to build something real and immediately hit problems the tutorials never mentioned: deadlocks, backpressure, graceful shutdowns that aren’t actually graceful.
I built Drop, a price tracking service that scrapes hundreds of product URLs daily and notifies users about price drops. Here’s what I learned about Go concurrency patterns that actually matter in production.
The Architecture: Scheduler + Worker Pool + Scraper
Drop’s core is simple: check product prices periodically, update the database, notify users when prices drop below their targets.
The naive approach would be to loop through items and scrape them one by one. At roughly one second per page, 1,000 items checked every hour means over 16 minutes of sequential scraping per run. Unacceptable.
Instead, I built a worker pool that processes items concurrently while respecting resource constraints.
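The arithmetic behind that estimate, as a rough sketch. The one-second-per-page figure and the worker count of 10 are assumptions for illustration, not measurements from Drop, and batchDuration is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

// batchDuration estimates how long a batch takes when `workers` goroutines
// each handle one item at a time. It ignores scheduling overhead and assumes
// every item takes the same time, which real pages will not.
func batchDuration(items, workers int, perItem time.Duration) time.Duration {
	if workers < 1 {
		workers = 1
	}
	rounds := (items + workers - 1) / workers // ceiling division
	return time.Duration(rounds) * perItem
}

func main() {
	perItem := time.Second // assumed average scrape time
	fmt.Println("sequential:", batchDuration(1000, 1, perItem)) // 16m40s
	fmt.Println("10 workers:", batchDuration(1000, 10, perItem)) // 1m40s
}
```

The estimate is a ceiling, not a promise: slow pages and rate limits will stretch it, but the order-of-magnitude gap is the point.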
Pattern 1: Buffered Channels for Producer-Consumer Decoupling
The scheduler needs to distribute work to multiple workers. Here’s the critical decision:
// Bad: Unbuffered channel
jobs := make(chan ItemJob)
// Good: Buffered to item count
jobs := make(chan ItemJob, len(items))
Why buffer to len(items)?
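With an unbuffered channel, every send blocks until a worker receives. This couples producer and consumer speeds tightly. If no worker is ever ready to receive, for example because workerCount is zero or because every worker is itself blocked sending a result that nobody is collecting yet, the producer waits forever and you have a deadlock.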
Buffered channels decouple this completely. The producer enqueues all jobs immediately without waiting for workers:
func (s *PriceRefresherScheduler) refreshAllPrices() {
ctx := context.Background()
items, err := s.itemsService.GetItemsDueForCheck(ctx)
if err != nil {
log.Printf("Error while refreshing prices: %s", err.Error())
return
}
if len(items) == 0 {
log.Printf("No items due for price refresh")
return
}
log.Printf("Starting concurrent refresh of %d items with %d workers",
len(items), s.workerCount)
// Create channels for work distribution
jobs := make(chan ItemJob, len(items))
results := make(chan string, len(items))
// Start worker pool
for w := 1; w <= s.workerCount; w++ {
go s.priceRefreshWorker(w, jobs, results)
}
// Producer fills the queue
for _, item := range items {
jobs <- ItemJob{
ID: item.ID,
UserID: item.UserID,
URL: item.URL,
Name: item.Name,
}
}
close(jobs) // Signal: no more work coming
// Collect results...
}
This gives you:
- No producer blocking - fill the queue instantly
- Workers can start anytime - no timing assumptions
- Clean shutdown - close the channel when done
The buffer size matters. Too small and you’re back to blocking. Too large and you waste memory. Buffering to exactly len(items) is perfect for bounded work batches.
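When a batch is too large to buffer in full, one alternative is a small fixed buffer with the producer moved into its own goroutine. The sketch below is not Drop's code: processAll and the squaring "work" are hypothetical stand-ins, and it assumes at least one worker.

```go
package main

import "fmt"

// processAll squares each input using `workers` goroutines and a small,
// fixed-size job buffer. The producer runs in its own goroutine, so a full
// buffer only pauses the producer; it cannot stall result collection.
// Assumes workers >= 1; with zero workers nothing would drain the queue.
func processAll(inputs []int, workers, buffer int) int {
	jobs := make(chan int, buffer)
	results := make(chan int, buffer)

	for w := 0; w < workers; w++ {
		go func() {
			for n := range jobs {
				results <- n * n
			}
		}()
	}

	go func() {
		for _, n := range inputs {
			jobs <- n // blocks while the buffer is full: this is the backpressure
		}
		close(jobs)
	}()

	// Collect on the calling goroutine while the producer is still sending.
	sum := 0
	for range inputs {
		sum += <-results
	}
	return sum
}

func main() {
	fmt.Println(processAll([]int{1, 2, 3, 4, 5}, 2, 1)) // 55
}
```

The tradeoff is one extra goroutine in exchange for memory use that no longer grows with the batch size.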
Pattern 2: Worker Pool with Independent Goroutines
Each worker is dead simple - no state shared with other workers, just a loop over the jobs channel:
// priceRefreshWorker processes individual refresh jobs
// Each worker runs independently with no shared state
func (s *PriceRefresherScheduler) priceRefreshWorker(
workerID int,
jobs <-chan ItemJob,
results chan<- string,
) {
for job := range jobs {
log.Printf("Worker %d processing item: %s (ID: %s)",
workerID, job.Name, job.ID)
err := s.itemsService.RefreshPrice(
context.Background(),
job.ID,
job.UserID,
job.URL,
)
if err != nil {
results <- fmt.Sprintf("FAILED: %s (%s): %v",
job.ID, job.Name, err)
} else {
results <- fmt.Sprintf("SUCCESS: %s", job.Name)
}
}
}
The for job := range jobs pattern is crucial. It:
- Automatically handles channel closing (loop exits when channel closes)
- Processes jobs until queue is empty
- Requires zero synchronization primitives
Workers are completely independent. No mutexes, no wait groups in the worker itself, no coordination needed.
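One detail worth seeing in isolation: closing a buffered channel does not discard what is already queued. A range loop keeps receiving until the channel is both closed and empty, which is why the scheduler can call close(jobs) right after enqueueing. A minimal sketch, with drain as a hypothetical helper:

```go
package main

import "fmt"

// drain reads from jobs until the channel is closed AND empty.
func drain(jobs <-chan string) []string {
	var seen []string
	for job := range jobs {
		seen = append(seen, job)
	}
	return seen
}

func main() {
	jobs := make(chan string, 3)
	jobs <- "a"
	jobs <- "b"
	jobs <- "c"
	close(jobs) // closing does not discard queued values

	fmt.Println(drain(jobs)) // [a b c]
}
```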
Pattern 3: Results Collection with Blocking Receive
After dispatching jobs, we need to wait for all results:
successCount := 0
failCount := 0
for range items {
result := <-results
if strings.HasPrefix(result, "SUCCESS:") {
successCount++
log.Println(result)
} else {
failCount++
log.Println(result)
}
}
log.Printf("Price refresh complete: %d succeeded, %d failed out of %d total",
successCount, failCount, len(items))
This blocks until exactly len(items) results come back. No busy waiting, no sleep loops - just synchronous collection of async work.
The results channel is also buffered to len(items), preventing workers from blocking when sending results. Workers finish faster, resources are released sooner.
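Counting by string prefix ties the tally to the log format, and receiving exactly len(items) results would block forever if a worker ever exited without sending. A possible variation, sketched with hypothetical names (jobResult, runBatch, failOn), carries the error in a struct and closes the results channel once all workers have returned:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// jobResult is a hypothetical typed replacement for the "SUCCESS:/FAILED:" strings.
type jobResult struct {
	Name string
	Err  error
}

// runBatch applies work to every name with `workers` goroutines and returns
// the success and failure counts.
func runBatch(names []string, workers int, work func(string) error) (ok, failed int) {
	jobs := make(chan string, len(names))
	results := make(chan jobResult, len(names))

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for name := range jobs {
				results <- jobResult{Name: name, Err: work(name)}
			}
		}()
	}

	for _, name := range names {
		jobs <- name
	}
	close(jobs)

	// Close results once every worker has returned, so the loop below ends
	// even if fewer results arrive than jobs were queued.
	go func() {
		wg.Wait()
		close(results)
	}()

	for r := range results {
		if r.Err != nil {
			failed++
		} else {
			ok++
		}
	}
	return ok, failed
}

// failOn returns a fake work function that fails for one specific name.
func failOn(bad string) func(string) error {
	return func(name string) error {
		if name == bad {
			return errors.New("no price found")
		}
		return nil
	}
}

func main() {
	ok, failed := runBatch([]string{"lamp", "broken", "desk"}, 2, failOn("broken"))
	fmt.Printf("%d succeeded, %d failed\n", ok, failed) // 2 succeeded, 1 failed
}
```

This does bring a WaitGroup into the scheduler, but the workers themselves stay as simple as before.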
Pattern 4: Timeout Context for Individual Operations
Web scraping has an enemy: hanging requests. A single stuck HTTP call can block a worker indefinitely.
In the service layer, I wrap each scrape with a timeout context:
func (s *service) CreateItem(ctx context.Context, userID string,
req CreateItemRequest) (*ItemResponse, error) {
if err := utils.ValidateURL(req.URL); err != nil {
return nil, fmt.Errorf("invalid URL: %w", err)
}
if err := s.checkForDuplicates(ctx, userID, req.URL); err != nil {
return nil, fmt.Errorf("duplicate item: %w", err)
}
// Create a timeout context for scraping to prevent hanging
scrapeCtx, cancel := context.WithTimeout(ctx, 15*time.Second)
defer cancel()
// Create channels to receive the scrape result or timeout
resultChan := make(chan *scraper.PriceInfo, 1)
errorChan := make(chan error, 1)
// Run scraping in a separate goroutine
go func() {
pi, err := s.scraper.ScrapePrice(req.URL)
if err != nil {
errorChan <- err
} else {
resultChan <- pi
}
}()
// Wait for result or timeout
var priceInfo *scraper.PriceInfo
var scrapeErr error
select {
case pi := <-resultChan:
priceInfo = pi
case err := <-errorChan:
scrapeErr = err
case <-scrapeCtx.Done():
// Done fires on the 15-second deadline or when the caller cancels ctx
return nil, fmt.Errorf("price scraping aborted: %w", scrapeCtx.Err())
}
// Handle the result
if scrapeErr != nil {
if strings.Contains(scrapeErr.Error(), "out of stock") {
currentPrice := 0.0
if req.TargetPrice != nil {
currentPrice = *req.TargetPrice
}
inStock := false
req.CurrentPrice = currentPrice
req.InStock = &inStock
} else {
return nil, fmt.Errorf("failed to scrape price: %w", scrapeErr)
}
} else if priceInfo != nil {
req.CurrentPrice = priceInfo.Price
req.InStock = &priceInfo.InStock
}
// Create item in database...
}
Why not just rely on HTTP client timeout?
http.Client.Timeout covers connecting, following redirects, and reading the response body. It doesn’t account for work the scraping function does around the request:
- HTML parsing time (goquery can be slow on massive pages)
- Price extraction logic
- Any other processing in the scraping function
The context timeout covers the entire operation. After 15 seconds, we abandon it completely and return an error. The goroutine might still be running, but we’ve moved on.
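The goroutine keeps running because ScrapePrice never sees the context. One way to let it stop early is to build the request with http.NewRequestWithContext, so the transport aborts the call when the deadline passes. This is a sketch under assumptions: fetchStatus and slowServerTimesOut are hypothetical names, and a local httptest server stands in for a slow product page.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// fetchStatus issues a GET bound to ctx. When ctx expires, the transport
// aborts the request, so the calling goroutine returns promptly.
func fetchStatus(ctx context.Context, client *http.Client, url string) (int, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return 0, fmt.Errorf("failed to create request: %w", err)
	}
	resp, err := client.Do(req)
	if err != nil {
		return 0, fmt.Errorf("failed to fetch page: %w", err)
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// slowServerTimesOut starts a local test server that stalls for up to 300ms
// and reports whether a 50ms deadline cut the request short.
func slowServerTimesOut() bool {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(300 * time.Millisecond):
		case <-r.Context().Done(): // client gave up
		}
	}))
	defer srv.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	_, err := fetchStatus(ctx, srv.Client(), srv.URL)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println("timed out:", slowServerTimesOut())
}
```

With the context threaded through, the select in the service layer is still useful for covering parsing time, but the network call no longer outlives the caller.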
The channel buffers (make(chan X, 1)) prevent goroutine leaks - even if we timeout and stop listening, the goroutine can still send its result without blocking.
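The same idea can be pulled into a small helper. The sketch below is a simplification, not Drop's code: scrapeWithTimeout, outcome, and the fake scrape functions are hypothetical, and it uses one channel carrying a struct instead of separate result and error channels.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// outcome bundles the value and error so a single channel is enough.
type outcome struct {
	price float64
	err   error
}

// scrapeWithTimeout runs fn in a goroutine and waits up to d for it to finish.
func scrapeWithTimeout(ctx context.Context, d time.Duration, fn func() (float64, error)) (float64, error) {
	ctx, cancel := context.WithTimeout(ctx, d)
	defer cancel()

	// Capacity 1 lets the goroutine send and exit even if nobody is receiving.
	done := make(chan outcome, 1)
	go func() {
		p, err := fn()
		done <- outcome{price: p, err: err}
	}()

	select {
	case o := <-done:
		return o.price, o.err
	case <-ctx.Done():
		return 0, fmt.Errorf("scrape abandoned: %w", ctx.Err())
	}
}

// demo runs one fast and one slow fake scrape and reports how each ended.
func demo() (price float64, fastErr, slowErr error) {
	fast := func() (float64, error) { return 19.99, nil }
	slow := func() (float64, error) {
		time.Sleep(200 * time.Millisecond)
		return 0, nil
	}
	price, fastErr = scrapeWithTimeout(context.Background(), 50*time.Millisecond, fast)
	_, slowErr = scrapeWithTimeout(context.Background(), 20*time.Millisecond, slow)
	return price, fastErr, slowErr
}

func main() {
	price, fastErr, slowErr := demo()
	fmt.Println(price, fastErr)
	fmt.Println(slowErr)
}
```

Wrapping ctx.Err() with %w lets callers tell a deadline from a caller-side cancellation with errors.Is.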
Pattern 5: Ticker-Based Scheduling with Graceful Stop
The scheduler runs continuously, checking prices at regular intervals:
type PriceRefresherScheduler struct {
itemsService items.Service
interval time.Duration
workerCount int
stopChan chan bool
}
func NewPriceRefresherScheduler(itemsService items.Service,
interval time.Duration, workerCount int) *PriceRefresherScheduler {
return &PriceRefresherScheduler{
itemsService: itemsService,
interval: interval,
workerCount: workerCount,
stopChan: make(chan bool),
}
}
func (s *PriceRefresherScheduler) Start() {
s.refreshAllPrices() // Initial run
ticker := time.NewTicker(s.interval)
go func() {
for {
select {
case <-ticker.C:
s.refreshAllPrices()
case <-s.stopChan:
ticker.Stop()
return
}
}
}()
}
func (s *PriceRefresherScheduler) Stop() {
s.stopChan <- true
}
The select statement handles two cases:
- ticker.C: Time to run another batch
- stopChan: Shutdown signal received
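This is intentionally simple. Stop() sends on an unbuffered channel, so it blocks until the scheduler goroutine is back in its select; a batch that is already running finishes first. If a tick is already pending at that moment, select may pick it and run one more batch before the stop signal is received. In-flight scraping jobs continue until completion - we don’t forcefully cancel them.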
Why not cancel in-flight jobs?
In practice, scraping jobs complete quickly (under 15s due to our timeout). Forcefully canceling them mid-scrape creates more problems than it solves - half-written database records, resource leaks, complex cleanup logic.
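The tradeoff: shutdown is not instant. Each in-flight job can take up to 15 seconds, and jobs still queued in the current batch run before the scheduler exits. For a background service, that’s acceptable.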
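Two sharp edges in the version above: Start() runs the first batch on the caller's goroutine, and a second call to Stop() (or a call before Start()) blocks forever because nothing receives from stopChan. A possible hardening, sketched with a hypothetical stripped-down scheduler type, closes the channel through sync.Once and uses a second channel to wait for the loop to exit:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// scheduler is a stripped-down stand-in for PriceRefresherScheduler.
type scheduler struct {
	interval time.Duration
	run      func()        // one batch; stands in for refreshAllPrices
	stop     chan struct{} // closed by Stop
	done     chan struct{} // closed when the loop goroutine exits
	once     sync.Once
}

func newScheduler(interval time.Duration, run func()) *scheduler {
	return &scheduler{
		interval: interval,
		run:      run,
		stop:     make(chan struct{}),
		done:     make(chan struct{}),
	}
}

// Start launches the loop and returns immediately.
func (s *scheduler) Start() {
	go func() {
		defer close(s.done)
		ticker := time.NewTicker(s.interval)
		defer ticker.Stop()
		s.run() // initial run, off the caller's goroutine
		for {
			select {
			case <-ticker.C:
				s.run()
			case <-s.stop:
				return
			}
		}
	}()
}

// Stop signals the loop and waits for the current batch to finish.
// It is safe to call more than once, but it assumes Start was called;
// otherwise it would wait forever on done.
func (s *scheduler) Stop() {
	s.once.Do(func() { close(s.stop) })
	<-s.done
}

// countRuns starts a scheduler with a short interval and returns how many
// batches ran before it was stopped.
func countRuns() int {
	runs := 0 // only touched by the loop goroutine, read after Stop returns
	s := newScheduler(10*time.Millisecond, func() { runs++ })
	s.Start()
	time.Sleep(55 * time.Millisecond)
	s.Stop()
	s.Stop() // second call is a no-op
	return runs
}

func main() {
	fmt.Println("batches run:", countRuns())
}
```

The behavior stays the same in spirit: no job is cancelled mid-scrape, and Stop() returns only after the current batch is done.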
The Scraper: Keeping It Simple
The actual scraping logic is deliberately simple:
type Scraper struct {
client *http.Client
}
func NewScraper() *Scraper {
return &Scraper{
client: &http.Client{
Timeout: 10 * time.Second,
},
}
}
func (s *Scraper) ScrapePrice(url string) (*PriceInfo, error) {
req, err := http.NewRequest(http.MethodGet, url, nil)
if err != nil {
return nil, fmt.Errorf("failed to create request: %w", err)
}
req.Header.Set("User-Agent",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36...")
resp, err := s.client.Do(req)
if err != nil {
return nil, fmt.Errorf("failed to fetch page: %w", err)
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
return nil, fmt.Errorf("bad status code: %d", resp.StatusCode)
}
...