feat: Instagram scraper with GraphQL API integration

- Automated followings list extraction via API interception
- Profile scraping using GraphQL endpoint interception
- DOM fallback for edge cases
- Performance timing for all operations
- Anti-bot measures and human-like behavior simulation

2025-10-31 23:06:06 +05:45
parent ba2dcec881
commit 6f4f37bee5
8 changed files with 3474 additions and 0 deletions

ANTI-BOT-RECOMMENDATIONS.md Normal file

@@ -0,0 +1,179 @@
# Instagram Scraper - Anti-Bot Detection Recommendations
Based on [Scrapfly's Instagram Scraping Guide](https://scrapfly.io/blog/posts/how-to-scrape-instagram)
## ✅ Already Implemented
1. **Puppeteer Stealth Plugin** - Bypasses basic browser detection
2. **Random User Agents** - Different browser signatures
3. **Human-like behaviors**:
- Mouse movements
- Random scrolling
- Variable delays (2.5-6 seconds between profiles)
- Typing delays
- Breaks every 10 profiles
4. **Variable viewport sizes** - Randomized window dimensions
5. **Network payload interception** - Capturing API responses instead of DOM scraping
6. **Critical headers** - Including `x-ig-app-id: 936619743392459`
## ⚠️ Critical Improvements Needed
### 1. **Residential Proxies** (MOST IMPORTANT)
**Status**: ❌ Not implemented
**Issue**:
- Datacenter IPs (AWS, Google Cloud, etc.) are **blocked instantly** by Instagram
- Your current setup will be detected as soon as you deploy to any cloud server
**Solution**:
```javascript
const browser = await puppeteer.launch({
headless: true,
args: [
"--proxy-server=residential-proxy-provider.com:port",
// Residential proxies required - NOT datacenter
],
});
```
**Recommended Proxy Providers**:
- Bright Data (formerly Luminati)
- Oxylabs
- Smartproxy
- GeoSurf
**Requirements**:
- Must be residential IPs (from real ISPs like Comcast, AT&T)
- Rotate IPs every 5-10 minutes (sticky sessions)
- Each IP allows ~200 requests/hour
- Cost: ~$10-15 per GB
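Sticky-session rotation (hold one IP for a few minutes, then advance through the pool) can be sketched as a small stateful helper. The endpoints below are placeholders, not real provider addresses, and the injectable `clock` parameter exists only to make the rotation logic testable:

```javascript
// Cycle through a pool of sticky proxy sessions, advancing every rotateMs.
function makeProxyRotator(proxies, rotateMs = 5 * 60 * 1000, clock = Date.now) {
  let index = 0;
  let lastRotate = clock();
  return function currentProxy() {
    const now = clock();
    if (now - lastRotate >= rotateMs) {
      index = (index + 1) % proxies.length;
      lastRotate = now;
    }
    return proxies[index];
  };
}

// Usage sketch: feed the current proxy into puppeteer.launch's
// --proxy-server argument (hostnames below are placeholders).
const nextProxy = makeProxyRotator([
  "session-1.residential-provider.example:8000",
  "session-2.residential-provider.example:8000",
]);
// args: [`--proxy-server=${nextProxy()}`]
```

If the provider requires credentials per request, Puppeteer's `page.authenticate({ username, password })` pairs with this.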
### 2. **Rate Limit Handling with Exponential Backoff**
**Status**: ⚠️ Partial - needs improvement
**Current**: Random delays exist
**Needed**: Proper 429 error handling
```javascript
async function makeRequest(fn, retries = 3) {
for (let i = 0; i < retries; i++) {
try {
return await fn();
} catch (error) {
      // Assumes fn surfaces the HTTP status on the thrown error
      if (error.status === 429 && i < retries - 1) {
const delay = Math.pow(2, i) * 2000; // 2s, 4s, 8s
console.log(`Rate limited, waiting ${delay}ms...`);
await new Promise((res) => setTimeout(res, delay));
continue;
}
throw error;
}
}
}
```
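To see the retry behavior end to end, the helper can be exercised against a stub that is rate-limited twice before succeeding. The base delay is made a parameter here purely so the demo finishes quickly (production would keep 2000 ms); the stub stands in for a real rate-limited call:

```javascript
// Same retry helper as above, with the base delay parameterized for the demo.
async function makeRequest(fn, retries = 3, baseDelay = 10) {
  for (let i = 0; i < retries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (error.status === 429 && i < retries - 1) {
        const delay = Math.pow(2, i) * baseDelay; // 10ms, 20ms
        await new Promise((res) => setTimeout(res, delay));
        continue;
      }
      throw error;
    }
  }
}

// Stub that throws a 429 twice, then succeeds on the third attempt.
let attempts = 0;
async function flaky() {
  attempts += 1;
  if (attempts < 3) {
    const err = new Error("Too Many Requests");
    err.status = 429;
    throw err;
  }
  return "ok";
}

makeRequest(flaky).then((result) => console.log(result)); // resolves with "ok" on the third attempt
```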
### 3. **Session Cookies Management**
**Status**: ⚠️ Partial - extractSession exists but not reused
**Issue**: Creating new sessions repeatedly looks suspicious
**Solution**:
- Save cookies after login
- Reuse cookies across multiple scraping sessions
- Rotate sessions periodically
```javascript
const fs = require("fs");

// Save cookies after login
const session = await extractSession(page);
fs.writeFileSync("session.json", JSON.stringify(session));

// Reuse cookies in the next run
const saved = JSON.parse(fs.readFileSync("session.json", "utf8"));
await page.setCookie(...saved.cookies);
```
### 4. **Realistic Browsing Patterns**
**Status**: ✅ Implemented but can improve
**Additional improvements**:
- Visit homepage before going to target profile
- Occasionally view posts/stories during following list scraping
- Don't always scrape in the same order (randomize)
- Add occasional "browsing breaks" of 30-60 seconds
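Two of these ideas, randomized scrape order and occasional breaks, can be sketched with plain helpers (the function names and the ~20% break probability are illustrative choices, not part of the current codebase):

```javascript
// Fisher-Yates shuffle so profiles are never scraped in the same order.
function shuffle(items) {
  const arr = [...items];
  for (let i = arr.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [arr[i], arr[j]] = [arr[j], arr[i]];
  }
  return arr;
}

// Occasionally (default ~20% of profiles) take a 30-60 second break;
// returns 0 when no break is due.
function maybeBreakMs(probability = 0.2) {
  if (Math.random() >= probability) return 0;
  return 30000 + Math.floor(Math.random() * 30001);
}
```

Between profiles, `await new Promise((res) => setTimeout(res, maybeBreakMs()))` layers on top of the existing per-profile delays.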
### 5. **Monitor doc_id Changes**
**Status**: ❌ Not monitoring
**Issue**: Instagram changes GraphQL `doc_id` values every 2-4 weeks
**Current doc_ids** (as of the referenced article; expect them to rotate):
- Profile posts: `9310670392322965`
- Post details: `8845758582119845`
- Reels: `25981206651899035`
**Solution**:
- Monitor Instagram's GraphQL requests in browser DevTools
- Update when API calls start failing
- Or use a service like Scrapfly that auto-updates
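A lightweight way to catch doc_id drift is to log the doc_id of every intercepted GraphQL call and compare it against the hard-coded values. A sketch of the parsing half; the Puppeteer wiring in the trailing comment is illustrative, and the form-encoded payload shape is an observation, not documented API:

```javascript
// Pull the doc_id field out of a form-encoded GraphQL request body.
// Returns null when the field is absent; treat as best-effort parsing.
function extractDocId(postData) {
  return new URLSearchParams(postData).get("doc_id");
}

// Hypothetical wiring inside the scraper's request interception:
// page.on("request", (req) => {
//   if (req.url().includes("/graphql/query")) {
//     const seen = extractDocId(req.postData() || "");
//     if (seen && !KNOWN_DOC_IDS.includes(seen)) {
//       console.warn("unknown doc_id observed:", seen);
//     }
//   }
// });
```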
## 📊 Instagram's Blocking Layers
1. **IP Quality Check** → Blocks datacenter IPs instantly
2. **TLS Fingerprinting** → Detects non-browser tools (Puppeteer Stealth helps)
3. **Rate Limiting** → ~200 requests/hour per IP
4. **Behavioral Detection** → Flags unnatural patterns
## 🎯 Priority Implementation Order
1. **HIGH PRIORITY**: Add residential proxy support
2. **HIGH PRIORITY**: Implement exponential backoff for 429 errors
3. **MEDIUM**: Improve session cookie reuse
4. **MEDIUM**: Add doc_id monitoring system
5. **LOW**: Additional browsing pattern randomization
## 💰 Cost Estimates (for 10,000 profiles)
- **Proxy bandwidth**: ~750 MB
- **Cost**: $7.50-$11.25 in residential proxy fees
- **With Proxy Saver**: $5.25-$7.88 (30-50% savings)
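The numbers above follow from roughly 75 KB of proxied traffic per profile (an assumption consistent with 750 MB for 10,000 profiles, not a measured figure) at $10-15 per GB:

```javascript
// Back-of-envelope proxy cost in USD; kbPerProfile is an assumption.
function proxyCostUSD(profiles, kbPerProfile = 75, usdPerGB = 10) {
  const gb = (profiles * kbPerProfile) / 1e6; // decimal KB -> GB
  return gb * usdPerGB;
}

console.log(proxyCostUSD(10000, 75, 10)); // → 7.5
console.log(proxyCostUSD(10000, 75, 15)); // → 11.25
```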
## 🚨 Legal Considerations
- Only scrape **publicly available** data
- Respect rate limits
- Don't store PII of EU citizens without GDPR compliance
- Add delays to avoid overloading Instagram's servers
- Check Instagram's Terms of Service
## 📚 Additional Resources
- [Scrapfly Instagram Scraper](https://github.com/scrapfly/scrapfly-scrapers/tree/main/instagram-scraper) - Open source reference
- [Instagram GraphQL Endpoint Documentation](https://scrapfly.io/blog/posts/how-to-scrape-instagram#how-instagrams-scraping-api-works)
- [Proxy comparison guide](https://scrapfly.io/blog/best-proxy-providers-for-web-scraping)
## ⚡ Quick Wins
Things you can implement immediately:
1. ✅ Critical headers added (x-ig-app-id)
2. ✅ Human simulation functions integrated
3. ✅ Exponential backoff added (see EXPONENTIAL-BACKOFF.md)
4. Implement cookie persistence (15 min)
5. Research residential proxy providers (1 hour)
---
**Bottom Line**: Without residential proxies, this scraper will be blocked immediately on any cloud infrastructure. That's the #1 priority to address.