Instagram Scraper - Anti-Bot Detection Recommendations

Based on Scrapfly's Instagram Scraping Guide

Already Implemented

  1. Puppeteer Stealth Plugin - Bypasses basic browser detection
  2. Random User Agents - Different browser signatures
  3. Human-like behaviors:
    • Mouse movements
    • Random scrolling
    • Variable delays (2.5-6 seconds between profiles)
    • Typing delays
    • Breaks every 10 profiles
  4. Variable viewport sizes - Randomized window dimensions
  5. Network payload interception - Capturing API responses instead of DOM scraping
  6. Critical headers - Including x-ig-app-id: 936619743392459

⚠️ Critical Improvements Needed

1. Residential Proxies (MOST IMPORTANT)

Status: Not implemented

Issue:

  • Datacenter IPs (AWS, Google Cloud, etc.) are blocked instantly by Instagram
  • Your current setup will be detected as soon as you deploy to any cloud server

Solution:

const browser = await puppeteer.launch({
  headless: true,
  args: [
    "--proxy-server=residential-proxy-provider.com:port",
    // Residential proxies required - NOT datacenter
  ],
});

Recommended Proxy Providers:

  • Bright Data (formerly Luminati)
  • Oxylabs
  • Smartproxy
  • GeoSurf

Requirements:

  • Must be residential IPs (from real ISPs like Comcast, AT&T)
  • Rotate IPs every 5-10 minutes (sticky sessions)
  • Each IP allows ~200 requests/hour
  • Cost: ~$10-15 per GB
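
The rotation requirements above can be sketched as a small bookkeeping helper. This is a minimal sketch, not the scraper's existing code: `createProxySession`, `shouldRotate`, and `authenticateProxy` are hypothetical names, and the 5-10 minute lifetime and 200-request budget are simply the figures from the list above. Chromium's `--proxy-server` flag takes only a host and port, so proxy credentials are usually supplied through Puppeteer's `page.authenticate`.

```javascript
// Hypothetical helpers; names and thresholds are illustrative assumptions.
const MIN_LIFETIME_MS = 5 * 60 * 1000; // 5 minutes
const MAX_LIFETIME_MS = 10 * 60 * 1000; // 10 minutes
const MAX_REQUESTS_PER_IP = 200; // approximate hourly budget per IP

// Start tracking a new sticky session with a random 5-10 minute lifetime.
function createProxySession(now = Date.now(), random = Math.random) {
  const spread = MAX_LIFETIME_MS - MIN_LIFETIME_MS;
  return {
    startedAt: now,
    lifetimeMs: MIN_LIFETIME_MS + Math.floor(random() * spread),
    requests: 0,
  };
}

// True once the session has outlived its lifetime or used up its request budget.
function shouldRotate(session, now = Date.now()) {
  const expired = now - session.startedAt >= session.lifetimeMs;
  const exhausted = session.requests >= MAX_REQUESTS_PER_IP;
  return expired || exhausted;
}

// Supply the provider's proxy credentials to a Puppeteer page.
async function authenticateProxy(page, username, password) {
  await page.authenticate({ username, password });
}
```

Increment `session.requests` after each request and switch to a fresh proxy endpoint once `shouldRotate` returns true. How a new IP is actually requested depends on the provider.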

2. Rate Limit Handling with Exponential Backoff

Status: ⚠️ Partial - needs improvement

Current: Random delays exist

Needed: Proper 429 error handling

async function makeRequest(fn, retries = 3) {
  for (let i = 0; i < retries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (error.status === 429 && i < retries - 1) {
        const delay = Math.pow(2, i) * 2000; // 2s, then 4s; the final attempt rethrows instead of waiting
        console.log(`Rate limited, waiting ${delay}ms...`);
        await new Promise((res) => setTimeout(res, delay));
        continue;
      }
      throw error;
    }
  }
}
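
`makeRequest` only retries when the thrown error carries `status === 429`, but Puppeteer's `page.goto` resolves with a response object rather than throwing on HTTP error codes. A minimal sketch of an adapter that bridges the two, assuming the wrapped call returns a Puppeteer `HTTPResponse` (or `null`); `throwIfRateLimited` is a hypothetical name:

```javascript
// Hypothetical adapter: turn an HTTP 429 response into an error with a
// `status` property, which is the shape makeRequest checks for.
function throwIfRateLimited(response) {
  // page.goto can resolve to null, so guard before reading the status code.
  const status = response ? response.status() : 0;
  if (status === 429) {
    const error = new Error("Rate limited (HTTP 429)");
    error.status = status;
    throw error;
  }
  return response;
}

// Possible usage with the retry wrapper above:
// const response = await makeRequest(async () =>
//   throwIfRateLimited(await page.goto(profileUrl))
// );
```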

3. Session Cookies Management

Status: ⚠️ Partial - extractSession exists, but the saved session is never reused

Issue: Creating new sessions repeatedly looks suspicious

Solution:

  • Save cookies after login
  • Reuse cookies across multiple scraping sessions
  • Rotate sessions periodically

const fs = require("fs");

// Save the session after login.
// Assumes extractSession returns an object of the form { cookies: [...] }.
const session = await extractSession(page);
fs.writeFileSync("session.json", JSON.stringify(session));

// Reuse the saved cookies in the next session
const savedSession = JSON.parse(fs.readFileSync("session.json", "utf8"));
await page.setCookie(...savedSession.cookies);
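
To cover the "rotate sessions periodically" point, one option is to treat a saved session as stale after a fixed age. A minimal sketch, assuming the `session.json` file from the snippet above; `loadSessionIfFresh` and the 24-hour default are hypothetical choices, not values from the source:

```javascript
const fs = require("fs");

const DEFAULT_MAX_AGE_MS = 24 * 60 * 60 * 1000; // assumption: rotate daily

// Return the parsed session if the file exists and is recent enough, otherwise null.
function loadSessionIfFresh(path, maxAgeMs = DEFAULT_MAX_AGE_MS, now = Date.now()) {
  if (!fs.existsSync(path)) return null;
  const ageMs = now - fs.statSync(path).mtimeMs;
  if (ageMs > maxAgeMs) return null; // stale: log in again and save a new session
  return JSON.parse(fs.readFileSync(path, "utf8"));
}
```

When the function returns `null`, the scraper would log in again and overwrite the file. Otherwise the returned cookies can be passed to `page.setCookie` as before.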

4. Realistic Browsing Patterns

Status: Implemented but can improve

Additional improvements:

  • Visit homepage before going to target profile
  • Occasionally view posts/stories during following list scraping
  • Don't always scrape in the same order (randomize)
  • Add occasional "browsing breaks" of 30-60 seconds
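
Two of these points, randomized order and occasional browsing breaks, can be sketched as a visit planner. This is a minimal sketch under stated assumptions: `shuffle` and `planVisits` are hypothetical names, and the default break probability is an arbitrary illustrative value. Only the 30-60 second break length comes from the list above.

```javascript
// Fisher-Yates shuffle; returns a new array and leaves the input untouched.
function shuffle(items, random = Math.random) {
  const result = [...items];
  for (let i = result.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    [result[i], result[j]] = [result[j], result[i]];
  }
  return result;
}

// Build a randomized visit plan with occasional 30-60 second breaks.
function planVisits(profiles, breakProbability = 0.1, random = Math.random) {
  const plan = [];
  for (const profile of shuffle(profiles, random)) {
    plan.push({ type: "visit", profile });
    if (random() < breakProbability) {
      const breakMs = 30000 + Math.floor(random() * 30000); // 30-60 s
      plan.push({ type: "break", ms: breakMs });
    }
  }
  return plan;
}
```

The scraper loop would then walk the plan, scraping on `visit` entries and sleeping for `ms` milliseconds on `break` entries.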

5. Monitor doc_id Changes

Status: Not monitoring

Issue: Instagram changes GraphQL doc_id values every 2-4 weeks

doc_ids at the time the Scrapfly guide was written (verify before use):

  • Profile posts: 9310670392322965
  • Post details: 8845758582119845
  • Reels: 25981206651899035

Solution:

  • Monitor Instagram's GraphQL requests in browser DevTools
  • Update when API calls start failing
  • Or use a service like Scrapfly that auto-updates
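
The DevTools monitoring step can be partly automated by logging the `doc_id` of each request the page makes. A minimal sketch, assuming the `doc_id` travels either in the URL query string or in a form-encoded POST body (worth confirming in DevTools first); `extractDocId` and `watchDocIds` are hypothetical names:

```javascript
// Pull a doc_id out of a request URL or form-encoded POST body; null if absent.
function extractDocId(url, postData) {
  const fromQuery = new URL(url).searchParams.get("doc_id");
  if (fromQuery) return fromQuery;
  if (typeof postData === "string") {
    return new URLSearchParams(postData).get("doc_id");
  }
  return null;
}

// Record every distinct doc_id seen on a Puppeteer page so changes show up in the log.
function watchDocIds(page, seen = new Set()) {
  page.on("request", (request) => {
    const docId = extractDocId(request.url(), request.postData());
    if (docId && !seen.has(docId)) {
      seen.add(docId);
      console.log(`New doc_id observed: ${docId} (${request.url()})`);
    }
  });
  return seen;
}
```

Comparing the returned set against the hard-coded values gives an early signal that a `doc_id` has changed, before API calls start failing.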

📊 Instagram's Blocking Layers

  1. IP Quality Check → Blocks datacenter IPs instantly
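  2. TLS Fingerprinting → Detects non-browser HTTP clients (Puppeteer drives real Chromium, so its TLS handshake is browser-like; the Stealth plugin addresses JavaScript-level signals instead)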
  3. Rate Limiting → ~200 requests/hour per IP
  4. Behavioral Detection → Flags unnatural patterns

🎯 Priority Implementation Order

  1. HIGH PRIORITY: Add residential proxy support
  2. HIGH PRIORITY: Implement exponential backoff for 429 errors
  3. MEDIUM: Improve session cookie reuse
  4. MEDIUM: Add doc_id monitoring system
  5. LOW: Additional browsing pattern randomization

💰 Cost Estimates (for 10,000 profiles)

  • Proxy bandwidth: ~750 MB
  • Cost: $7.50-$11.25 in residential proxy fees (at ~$10-15 per GB)
  • With Proxy Saver (30-50% savings): roughly $3.75-$7.88

Legal and Ethical Considerations

  • Only scrape publicly available data
  • Respect rate limits
  • Don't store PII of EU citizens without GDPR compliance
  • Add delays to avoid damaging Instagram's servers
  • Check Instagram's Terms of Service

Quick Wins

Already done:

  1. Critical headers added (x-ig-app-id)
  2. Human simulation functions integrated
  3. Exponential backoff added (see EXPONENTIAL-BACKOFF.md)

Things you can implement immediately:

  1. Implement cookie persistence (15 min)
  2. Research residential proxy providers (1 hour)

Bottom Line: Without residential proxies, this scraper will be blocked immediately on any cloud infrastructure. That's the #1 priority to address.