# Instagram Scraper - Anti-Bot Detection Recommendations

Based on [Scrapfly's Instagram Scraping Guide](https://scrapfly.io/blog/posts/how-to-scrape-instagram)

## ✅ Already Implemented

1. **Puppeteer Stealth Plugin** - Bypasses basic browser detection
2. **Random User Agents** - Different browser signatures
3. **Human-like behaviors**:
   - Mouse movements
   - Random scrolling
   - Variable delays (2.5-6 seconds between profiles)
   - Typing delays
   - Breaks every 10 profiles
4. **Variable viewport sizes** - Randomized window dimensions
5. **Network payload interception** - Capturing API responses instead of DOM scraping
6. **Critical headers** - Including `x-ig-app-id: 936619743392459`

## ⚠️ Critical Improvements Needed

### 1. **Residential Proxies** (MOST IMPORTANT)

**Status**: ❌ Not implemented

**Issue**:

- Datacenter IPs (AWS, Google Cloud, etc.) are **blocked instantly** by Instagram
- Your current setup will be detected as soon as you deploy to any cloud server

**Solution**:

```javascript
const browser = await puppeteer.launch({
  headless: true,
  args: [
    "--proxy-server=residential-proxy-provider.com:port", // Residential proxies required - NOT datacenter
  ],
});
```

**Recommended Proxy Providers**:

- Bright Data (formerly Luminati)
- Oxylabs
- Smartproxy
- GeoSurf

**Requirements**:

- Must be residential IPs (from real ISPs like Comcast, AT&T)
- Rotate IPs every 5-10 minutes (sticky sessions)
- Each IP allows ~200 requests/hour
- Cost: ~$10-15 per GB

### 2. **Rate Limit Handling with Exponential Backoff**

**Status**: ⚠️ Partial - needs improvement

**Current**: Random delays exist
**Needed**: Proper 429 error handling

```javascript
async function makeRequest(fn, retries = 3) {
  for (let i = 0; i < retries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (error.status === 429 && i < retries - 1) {
        const delay = Math.pow(2, i) * 2000; // 2s, 4s, 8s
        console.log(`Rate limited, waiting ${delay}ms...`);
        await new Promise((res) => setTimeout(res, delay));
        continue;
      }
      throw error;
    }
  }
}
```

### 3. **Session Cookies Management**

**Status**: ⚠️ Partial - extractSession exists but is not reused

**Issue**: Creating new sessions repeatedly looks suspicious

**Solution**:

- Save cookies after login
- Reuse cookies across multiple scraping sessions
- Rotate sessions periodically

```javascript
const fs = require("fs");

// Save cookies after login
const cookies = await extractSession(page);
fs.writeFileSync("session.json", JSON.stringify(cookies));

// Reuse cookies in the next session
const savedCookies = JSON.parse(fs.readFileSync("session.json", "utf8"));
await page.setCookie(...savedCookies.cookies);
```

### 4. **Realistic Browsing Patterns**

**Status**: ✅ Implemented but can improve

**Additional improvements**:

- Visit the homepage before going to the target profile
- Occasionally view posts or stories while scraping the following list
- Don't always scrape in the same order (randomize)
- Add occasional "browsing breaks" of 30-60 seconds

### 5. **Monitor doc_id Changes**

**Status**: ❌ Not monitoring

**Issue**: Instagram changes GraphQL `doc_id` values every 2-4 weeks

**Current doc_ids** (as of the article):

- Profile posts: `9310670392322965`
- Post details: `8845758582119845`
- Reels: `25981206651899035`

**Solution**:

- Monitor Instagram's GraphQL requests in browser DevTools (see the sketch below)
- Update when API calls start failing
- Or use a service like Scrapfly that auto-updates
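The drift can also be caught programmatically rather than by hand. Below is a minimal sketch, not from the guide, that listens to the page's outgoing requests and warns when a `doc_id` appears that isn't in the known list; `KNOWN_DOC_IDS` and the warning format are assumptions for illustration.

```javascript
// Hypothetical list of the doc_id values the scraper currently hard-codes.
const KNOWN_DOC_IDS = new Set([
  "9310670392322965", // profile posts
  "8845758582119845", // post details
  "25981206651899035", // reels
]);

function monitorDocIds(page) {
  // Passive listener only - no request interception, so it does not
  // interfere with the existing network payload capture.
  page.on("request", (request) => {
    const url = request.url();
    if (!url.includes("/graphql/query")) return;

    let docId = null;
    if (request.method() === "POST") {
      // doc_id is usually sent in the form-encoded POST body
      const body = request.postData() || "";
      docId = new URLSearchParams(body).get("doc_id");
    } else {
      docId = new URL(url).searchParams.get("doc_id");
    }

    if (docId && !KNOWN_DOC_IDS.has(docId)) {
      console.warn(`Unknown GraphQL doc_id observed: ${docId} (${url})`);
    }
  });
}
```

Calling `monitorDocIds(page)` once after the page is created is enough; when warnings start appearing, the hard-coded values above are likely stale.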
## 📊 Instagram's Blocking Layers

1. **IP Quality Check** → Blocks datacenter IPs instantly
2. **TLS Fingerprinting** → Detects non-browser tools (Puppeteer Stealth helps)
3. **Rate Limiting** → ~200 requests/hour per IP
4. **Behavioral Detection** → Flags unnatural patterns

## 🎯 Priority Implementation Order

1. **HIGH PRIORITY**: Add residential proxy support
2. **HIGH PRIORITY**: Implement exponential backoff for 429 errors
3. **MEDIUM**: Improve session cookie reuse
4. **MEDIUM**: Add doc_id monitoring system
5. **LOW**: Additional browsing pattern randomization

## 💰 Cost Estimates (for 10,000 profiles)

- **Proxy bandwidth**: ~750 MB
- **Cost**: $7.50-$11.25 in residential proxy fees
- **With Proxy Saver**: $5.25-$7.88 (30-50% savings)

## 🚨 Legal Considerations

- Only scrape **publicly available** data
- Respect rate limits
- Don't store PII of EU citizens without GDPR compliance
- Add delays to avoid overloading Instagram's servers
- Check Instagram's Terms of Service

## 📚 Additional Resources

- [Scrapfly Instagram Scraper](https://github.com/scrapfly/scrapfly-scrapers/tree/main/instagram-scraper) - Open source reference
- [Instagram GraphQL Endpoint Documentation](https://scrapfly.io/blog/posts/how-to-scrape-instagram#how-instagrams-scraping-api-works)
- [Proxy comparison guide](https://scrapfly.io/blog/best-proxy-providers-for-web-scraping)

## ⚡ Quick Wins

Things you can implement immediately:

1. ✅ Critical headers added (x-ig-app-id)
2. ✅ Human simulation functions integrated
3. ✅ Exponential backoff added (see EXPONENTIAL-BACKOFF.md)
4. Implement cookie persistence (15 min)
5. Research residential proxy providers (1 hour)

---

**Bottom Line**: Without residential proxies, this scraper will be blocked immediately on any cloud infrastructure. That's the #1 priority to address; a minimal configuration sketch follows below.
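As a reference for that first step, here is a minimal sketch assuming a generic residential provider with username/password authentication. The hostname, port, and environment variable names are placeholders, not a specific provider's values.

```javascript
const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

puppeteer.use(StealthPlugin());

// Placeholder proxy settings - replace with your residential provider's values.
const PROXY_HOST = "gate.example-residential-proxy.com";
const PROXY_PORT = 7000;
const PROXY_USER = process.env.PROXY_USER;
const PROXY_PASS = process.env.PROXY_PASS;

async function launchWithProxy() {
  const browser = await puppeteer.launch({
    headless: true,
    args: [`--proxy-server=${PROXY_HOST}:${PROXY_PORT}`],
  });

  const page = await browser.newPage();

  // Most residential providers require proxy authentication;
  // Puppeteer handles this per page via page.authenticate().
  await page.authenticate({ username: PROXY_USER, password: PROXY_PASS });

  return { browser, page };
}
```

Many residential providers also expose sticky sessions through a session parameter embedded in the proxy username, which is how the 5-10 minute rotation recommended above is typically configured; the exact format is provider-specific.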