Instagram Scraper - Anti-Bot Detection Recommendations
Based on Scrapfly's Instagram Scraping Guide
✅ Already Implemented
- Puppeteer Stealth Plugin - Bypasses basic browser detection
- Random User Agents - Different browser signatures
- Human-like behaviors:
- Mouse movements
- Random scrolling
- Variable delays (2.5-6 seconds between profiles)
- Typing delays
- Breaks every 10 profiles
- Variable viewport sizes - Randomized window dimensions
- Network payload interception - Capturing API responses instead of DOM scraping
- Critical headers - Including x-ig-app-id: 936619743392459
⚠️ Critical Improvements Needed
1. Residential Proxies (MOST IMPORTANT)
Status: ❌ Not implemented
Issue:
- Datacenter IPs (AWS, Google Cloud, etc.) are blocked instantly by Instagram
- Your current setup will be detected as soon as you deploy to any cloud server
Solution:
const browser = await puppeteer.launch({
  headless: true,
  args: [
    // Residential proxies required - NOT datacenter
    "--proxy-server=residential-proxy-provider.com:port",
  ],
});
Recommended Proxy Providers:
- Bright Data (formerly Luminati)
- Oxylabs
- Smartproxy
- GeoSurf
Requirements:
- Must be residential IPs (from real ISPs like Comcast, AT&T)
- Rotate IPs every 5-10 minutes (sticky sessions)
- Each IP allows ~200 requests/hour
- Cost: ~$10-15 per GB
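Most residential providers require a username and password, and Chromium's --proxy-server flag accepts only a host and port. A minimal sketch of one way to wire that up, assuming placeholder gateway details; buildProxyLaunchOptions and openProxiedPage are hypothetical helper names, and puppeteer is passed in rather than imported so the snippet stands alone:

```javascript
// Sketch only: host, port, and credentials are placeholders for whatever
// your proxy provider issues. Helper names are hypothetical.
function buildProxyLaunchOptions({ host, port }) {
  return {
    headless: true,
    // Chromium takes only scheme://host:port here; credentials are
    // supplied separately through page.authenticate().
    args: [`--proxy-server=http://${host}:${port}`],
  };
}

async function openProxiedPage(puppeteer, proxy) {
  const browser = await puppeteer.launch(buildProxyLaunchOptions(proxy));
  const page = await browser.newPage();
  // Answers the proxy's HTTP authentication challenge for this page.
  await page.authenticate({
    username: proxy.username,
    password: proxy.password,
  });
  return { browser, page };
}
```

Many providers encode the sticky-session ID in the proxy username; check your provider's documentation for the exact format.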
2. Rate Limit Handling with Exponential Backoff
Status: ⚠️ Partial - needs improvement
Current: Random delays exist
Needed: Proper 429 error handling
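// Retries fn when it fails with HTTP 429, doubling the wait each time.
// Assumes the thrown error exposes the HTTP status as error.status.
async function makeRequest(fn, retries = 3) {
  for (let i = 0; i < retries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (error.status === 429 && i < retries - 1) {
        // 2s, 4s, 8s, ... (with retries = 3, only the 2s and 4s waits occur)
        const delay = Math.pow(2, i) * 2000;
        console.log(`Rate limited, waiting ${delay}ms...`);
        await new Promise((res) => setTimeout(res, delay));
        continue;
      }
      throw error; // Not a 429, or out of attempts
    }
  }
}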
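Fixed doubling delays can make many workers retry at the same moment. A possible variant, sketched here under the same assumption that errors expose error.status, caps the delay and adds random jitter. The names backoffDelay and withBackoff are hypothetical, and here retries counts retries after the first attempt rather than total attempts:

```javascript
// Delay for a given attempt: exponential growth, capped at maxMs,
// then scaled by a random factor ("full jitter").
function backoffDelay(attempt, { baseMs = 2000, maxMs = 60000, random = Math.random } = {}) {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Calls fn, retrying up to `retries` times when it fails with a 429.
// `sleep` can be injected for testing; by default it uses setTimeout.
async function withBackoff(fn, { retries = 3, sleep, ...delayOpts } = {}) {
  const wait = sleep || ((ms) => new Promise((res) => setTimeout(res, ms)));
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // Assumption: the caller's errors carry the HTTP status as error.status.
      if (error.status !== 429 || attempt >= retries) throw error;
      await wait(backoffDelay(attempt, delayOpts));
    }
  }
}
```

If Instagram's 429 response includes a Retry-After header, honoring it would be preferable to a computed delay; this sketch does not read it.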
3. Session Cookies Management
Status: ⚠️ Partial - extractSession exists, but saved sessions are not reused
Issue: Creating new sessions repeatedly looks suspicious
Solution:
- Save cookies after login
- Reuse cookies across multiple scraping sessions
- Rotate sessions periodically
const fs = require("fs");

// Save cookies after login
const cookies = await extractSession(page);
fs.writeFileSync("session.json", JSON.stringify(cookies));

// Reuse cookies in next session
// (assumes extractSession returns an object with a `cookies` array)
const savedCookies = JSON.parse(fs.readFileSync("session.json", "utf8"));
await page.setCookie(...savedCookies.cookies);
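To cover the "rotate sessions periodically" point, one option is to store a timestamp next to the cookies and discard the file once it is too old. A rough sketch; saveSession, loadSession, and the file layout are hypothetical and not part of the existing extractSession code:

```javascript
const fs = require("fs");

// Writes the cookie array together with a timestamp so that staleness
// can be checked on the next run.
function saveSession(file, cookies, now = Date.now()) {
  fs.writeFileSync(file, JSON.stringify({ savedAt: now, cookies }));
}

// Returns the saved cookie array, or null when the file is missing,
// unreadable, or older than maxAgeMs.
function loadSession(file, maxAgeMs, now = Date.now()) {
  try {
    const { savedAt, cookies } = JSON.parse(fs.readFileSync(file, "utf8"));
    if (!Array.isArray(cookies) || now - savedAt > maxAgeMs) return null;
    return cookies;
  } catch {
    return null;
  }
}

// Usage sketch (inside an async function, with a Puppeteer `page`):
//   const cookies = loadSession("session.json", 24 * 60 * 60 * 1000);
//   if (cookies) await page.setCookie(...cookies);
//   else { /* log in again, then call saveSession("session.json", freshCookies) */ }
```

The 24-hour maximum age in the usage comment is an arbitrary example value, not a documented Instagram limit.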
4. Realistic Browsing Patterns
Status: ✅ Implemented but can improve
Additional improvements:
- Visit homepage before going to target profile
- Occasionally view posts/stories during following list scraping
- Don't always scrape in the same order (randomize)
- Add occasional "browsing breaks" of 30-60 seconds
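The ordering and break suggestions above could be sketched as a visit planner. The function names and the 10% break probability are assumptions for illustration; the 2.5-6 second and 30-60 second ranges come from this document:

```javascript
// Fisher-Yates shuffle; returns a new array and leaves the input untouched.
function shuffle(items, random = Math.random) {
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

// Uniform random integer in [min, max].
function randomBetween(min, max, random = Math.random) {
  return min + Math.floor(random() * (max - min + 1));
}

// Builds a visit plan: profiles in random order, each followed by a normal
// pause (2.5-6 s) or, with probability breakChance, a longer break (30-60 s).
function planVisits(profiles, { breakChance = 0.1, random = Math.random } = {}) {
  return shuffle(profiles, random).map((profile) => ({
    profile,
    pauseMs:
      random() < breakChance
        ? randomBetween(30000, 60000, random)
        : randomBetween(2500, 6000, random),
  }));
}
```

The scraper loop would then iterate over the plan, visiting each profile and sleeping for pauseMs before the next one.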
5. Monitor doc_id Changes
Status: ❌ Not monitoring
Issue: Instagram changes GraphQL doc_id values every 2-4 weeks
Current doc_ids (as of the Scrapfly article):
- Profile posts: 9310670392322965
- Post details: 8845758582119845
- Reels: 25981206651899035
Solution:
- Monitor Instagram's GraphQL requests in browser DevTools
- Update when API calls start failing
- Or use a service like Scrapfly that auto-updates
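Since the scraper already intercepts network traffic, doc_id monitoring could piggyback on that. A minimal sketch that reads doc_id from a request's query string or form-encoded body and flags values missing from the table above. The helper names are hypothetical, and where Instagram actually places doc_id (URL or POST body) is an assumption to verify in DevTools:

```javascript
// doc_ids listed in this document, keyed by the query they belong to.
const KNOWN_DOC_IDS = {
  profilePosts: "9310670392322965",
  postDetails: "8845758582119845",
  reels: "25981206651899035",
};

// Pulls doc_id out of a request's URL query string or, failing that,
// its form-encoded POST body. Returns null when neither contains one.
function extractDocId(url, postData) {
  const fromUrl = new URL(url).searchParams.get("doc_id");
  if (fromUrl) return fromUrl;
  if (!postData) return null;
  return new URLSearchParams(postData).get("doc_id");
}

// Returns the distinct observed doc_ids that are not in the known table.
function findUnknownDocIds(observedIds, known = KNOWN_DOC_IDS) {
  const knownSet = new Set(Object.values(known));
  return [...new Set(observedIds)].filter((id) => !knownSet.has(id));
}

// Usage sketch with Puppeteer (the "/graphql/" path filter is an assumption):
//   const seen = [];
//   page.on("request", (req) => {
//     if (!req.url().includes("/graphql/")) return;
//     const id = extractDocId(req.url(), req.postData());
//     if (id) seen.push(id);
//   });
//   // later: console.warn("New doc_ids:", findUnknownDocIds(seen));
```

An unknown doc_id is only a hint that something changed; it may also belong to a query the scraper does not use.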
📊 Instagram's Blocking Layers
- IP Quality Check → Blocks datacenter IPs instantly
- TLS Fingerprinting → Detects non-browser HTTP clients (driving a real Chromium through Puppeteer helps; the Stealth plugin covers JavaScript-level checks)
- Rate Limiting → ~200 requests/hour per IP
- Behavioral Detection → Flags unnatural patterns
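To stay under the ~200 requests/hour figure, each proxy IP could get its own client-side limiter. A sketch of a sliding-window counter; createRateLimiter is a hypothetical helper, and the default limit simply mirrors the estimate above rather than any published Instagram quota:

```javascript
// Sliding-window limiter: allows at most `limit` requests per `windowMs`.
// Create one instance per proxy IP. `now` is injectable for testing.
function createRateLimiter({ limit = 200, windowMs = 60 * 60 * 1000, now = Date.now } = {}) {
  let timestamps = [];
  return {
    // Returns 0 and records the request if a slot is free; otherwise
    // returns how many milliseconds to wait until the oldest request
    // leaves the window.
    tryAcquire() {
      const t = now();
      timestamps = timestamps.filter((ts) => t - ts < windowMs);
      if (timestamps.length < limit) {
        timestamps.push(t);
        return 0;
      }
      return timestamps[0] + windowMs - t;
    },
  };
}
```

A caller would check tryAcquire() before each request and either sleep for the returned time or switch to another IP.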
🎯 Priority Implementation Order
- HIGH PRIORITY: Add residential proxy support
- HIGH PRIORITY: Implement exponential backoff for 429 errors
- MEDIUM: Improve session cookie reuse
- MEDIUM: Add doc_id monitoring system
- LOW: Additional browsing pattern randomization
💰 Cost Estimates (for 10,000 profiles)
- Proxy bandwidth: ~750 MB
- Cost: $7.50-$11.25 in residential proxy fees
- With Proxy Saver: $5.25-$7.88 at 30% savings, down to $3.75-$5.63 at 50% savings
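The figures above can be reproduced with a small calculation. This sketch assumes ~75 KB per profile (750 MB divided by 10,000 profiles) and decimal units; estimateProxyCost is a hypothetical helper:

```javascript
// Estimated proxy cost in USD. Uses decimal units (1 GB = 1,000,000 KB)
// so that 750 MB counts as 0.75 GB, matching the estimate above.
function estimateProxyCost({ profiles, kbPerProfile, usdPerGb, savingsPercent = 0 }) {
  const gigabytes = (profiles * kbPerProfile) / 1e6;
  return (gigabytes * usdPerGb * (100 - savingsPercent)) / 100;
}

const base = { profiles: 10000, kbPerProfile: 75 };
const low = estimateProxyCost({ ...base, usdPerGb: 10 });
const high = estimateProxyCost({ ...base, usdPerGb: 15 });
console.log(`Without savings: $${low} to $${high}`);
```

With savingsPercent set to 30 the range works out to roughly $5.25-$7.88, and with 50 to roughly $3.75-$5.63. Actual bandwidth per profile depends on how much media the page loads, so treat 75 KB as a planning figure only.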
🚨 Legal Considerations
- Only scrape publicly available data
- Respect rate limits
- Don't store PII of EU citizens without GDPR compliance
- Add delays to avoid damaging Instagram's servers
- Check Instagram's Terms of Service
📚 Additional Resources
- Scrapfly Instagram Scraper - Open source reference
- Instagram GraphQL Endpoint Documentation
- Proxy comparison guide
⚡ Quick Wins
Things you can implement immediately (✅ marks items already done):
- ✅ Critical headers added (x-ig-app-id)
- ✅ Human simulation functions integrated
- ✅ Exponential backoff added (see EXPONENTIAL-BACKOFF.md)
- Implement cookie persistence (15 min)
- Research residential proxy providers (1 hour)
Bottom Line: Without residential proxies, this scraper will be blocked immediately on any cloud infrastructure. That's the #1 priority to address.