feat: Instagram scraper with GraphQL API integration

- Automated followings list extraction via API interception
- Profile scraping using GraphQL endpoint interception
- DOM fallback for edge cases
- Performance timing for all operations
- Anti-bot measures and human-like behavior simulation

2025-10-31 23:06:06 +05:45
parent ba2dcec881
commit 6f4f37bee5
8 changed files with 3474 additions and 0 deletions

ANTI-BOT-RECOMMENDATIONS.md Normal file

@@ -0,0 +1,179 @@
# Instagram Scraper - Anti-Bot Detection Recommendations
Based on [Scrapfly's Instagram Scraping Guide](https://scrapfly.io/blog/posts/how-to-scrape-instagram)
## ✅ Already Implemented
1. **Puppeteer Stealth Plugin** - Bypasses basic browser detection
2. **Random User Agents** - Different browser signatures
3. **Human-like behaviors**:
- Mouse movements
- Random scrolling
- Variable delays (2.5-6 seconds between profiles)
- Typing delays
- Breaks every 10 profiles
4. **Variable viewport sizes** - Randomized window dimensions
5. **Network payload interception** - Capturing API responses instead of DOM scraping
6. **Critical headers** - Including `x-ig-app-id: 936619743392459`
## ⚠️ Critical Improvements Needed
### 1. **Residential Proxies** (MOST IMPORTANT)
**Status**: ❌ Not implemented
**Issue**:
- Datacenter IPs (AWS, Google Cloud, etc.) are **blocked instantly** by Instagram
- Your current setup will be detected as soon as you deploy to any cloud server
**Solution**:
```javascript
const browser = await puppeteer.launch({
headless: true,
args: [
"--proxy-server=residential-proxy-provider.com:port",
// Residential proxies required - NOT datacenter
],
});
```
**Recommended Proxy Providers**:
- Bright Data (formerly Luminati)
- Oxylabs
- Smartproxy
- GeoSurf
**Requirements**:
- Must be residential IPs (from real ISPs like Comcast, AT&T)
- Rotate IPs every 5-10 minutes (sticky sessions)
- Each IP allows ~200 requests/hour
- Cost: ~$10-15 per GB
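Sticky-session rotation (hold one IP for a few minutes, then advance through the pool) can be sketched as a small stateful helper. The endpoints below are placeholders, not real provider addresses, and the injectable `clock` parameter exists only to make the rotation logic testable:

```javascript
// Cycle through a pool of sticky proxy sessions, advancing every rotateMs.
function makeProxyRotator(proxies, rotateMs = 5 * 60 * 1000, clock = Date.now) {
  let index = 0;
  let lastRotate = clock();
  return function currentProxy() {
    const now = clock();
    if (now - lastRotate >= rotateMs) {
      index = (index + 1) % proxies.length;
      lastRotate = now;
    }
    return proxies[index];
  };
}

// Usage sketch: feed the current proxy into puppeteer.launch's
// --proxy-server argument (hostnames below are placeholders).
const nextProxy = makeProxyRotator([
  "session-1.residential-provider.example:8000",
  "session-2.residential-provider.example:8000",
]);
// args: [`--proxy-server=${nextProxy()}`]
```

If the provider requires credentials per request, Puppeteer's `page.authenticate({ username, password })` pairs with this.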
### 2. **Rate Limit Handling with Exponential Backoff**
**Status**: ⚠️ Partial - needs improvement
**Current**: Random delays exist
**Needed**: Proper 429 error handling
```javascript
async function makeRequest(fn, retries = 3) {
for (let i = 0; i < retries; i++) {
try {
return await fn();
} catch (error) {
      // Assumes fn surfaces the HTTP status on the thrown error
      if (error.status === 429 && i < retries - 1) {
const delay = Math.pow(2, i) * 2000; // 2s, 4s, 8s
console.log(`Rate limited, waiting ${delay}ms...`);
await new Promise((res) => setTimeout(res, delay));
continue;
}
throw error;
}
}
}
```
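To see the retry behavior end to end, the helper can be exercised against a stub that is rate-limited twice before succeeding. The base delay is made a parameter here purely so the demo finishes quickly (production would keep 2000 ms); the stub stands in for a real rate-limited call:

```javascript
// Same retry helper as above, with the base delay parameterized for the demo.
async function makeRequest(fn, retries = 3, baseDelay = 10) {
  for (let i = 0; i < retries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (error.status === 429 && i < retries - 1) {
        const delay = Math.pow(2, i) * baseDelay; // 10ms, 20ms
        await new Promise((res) => setTimeout(res, delay));
        continue;
      }
      throw error;
    }
  }
}

// Stub that throws a 429 twice, then succeeds on the third attempt.
let attempts = 0;
async function flaky() {
  attempts += 1;
  if (attempts < 3) {
    const err = new Error("Too Many Requests");
    err.status = 429;
    throw err;
  }
  return "ok";
}

makeRequest(flaky).then((result) => console.log(result)); // resolves with "ok" on the third attempt
```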
### 3. **Session Cookies Management**
**Status**: ⚠️ Partial - extractSession exists but not reused
**Issue**: Creating new sessions repeatedly looks suspicious
**Solution**:
- Save cookies after login
- Reuse cookies across multiple scraping sessions
- Rotate sessions periodically
```javascript
const fs = require("fs");

// Save cookies after login
const session = await extractSession(page);
fs.writeFileSync("session.json", JSON.stringify(session));

// Reuse cookies in the next run
const saved = JSON.parse(fs.readFileSync("session.json", "utf8"));
await page.setCookie(...saved.cookies);
```
### 4. **Realistic Browsing Patterns**
**Status**: ✅ Implemented but can improve
**Additional improvements**:
- Visit homepage before going to target profile
- Occasionally view posts/stories during following list scraping
- Don't always scrape in the same order (randomize)
- Add occasional "browsing breaks" of 30-60 seconds
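Two of these ideas, randomized scrape order and occasional breaks, can be sketched with plain helpers (the function names and the ~20% break probability are illustrative choices, not part of the current codebase):

```javascript
// Fisher-Yates shuffle so profiles are never scraped in the same order.
function shuffle(items) {
  const arr = [...items];
  for (let i = arr.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [arr[i], arr[j]] = [arr[j], arr[i]];
  }
  return arr;
}

// Occasionally (default ~20% of profiles) take a 30-60 second break;
// returns 0 when no break is due.
function maybeBreakMs(probability = 0.2) {
  if (Math.random() >= probability) return 0;
  return 30000 + Math.floor(Math.random() * 30001);
}
```

Between profiles, `await new Promise((res) => setTimeout(res, maybeBreakMs()))` layers on top of the existing per-profile delays.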
### 5. **Monitor doc_id Changes**
**Status**: ❌ Not monitoring
**Issue**: Instagram changes GraphQL `doc_id` values every 2-4 weeks
**Current doc_ids** (as of the referenced article; expect them to rotate):
- Profile posts: `9310670392322965`
- Post details: `8845758582119845`
- Reels: `25981206651899035`
**Solution**:
- Monitor Instagram's GraphQL requests in browser DevTools
- Update when API calls start failing
- Or use a service like Scrapfly that auto-updates
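A lightweight way to catch doc_id drift is to log the doc_id of every intercepted GraphQL call and compare it against the hard-coded values. A sketch of the parsing half; the Puppeteer wiring in the trailing comment is illustrative, and the form-encoded payload shape is an observation, not documented API:

```javascript
// Pull the doc_id field out of a form-encoded GraphQL request body.
// Returns null when the field is absent; treat as best-effort parsing.
function extractDocId(postData) {
  return new URLSearchParams(postData).get("doc_id");
}

// Hypothetical wiring inside the scraper's request interception:
// page.on("request", (req) => {
//   if (req.url().includes("/graphql/query")) {
//     const seen = extractDocId(req.postData() || "");
//     if (seen && !KNOWN_DOC_IDS.includes(seen)) {
//       console.warn("unknown doc_id observed:", seen);
//     }
//   }
// });
```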
## 📊 Instagram's Blocking Layers
1. **IP Quality Check** → Blocks datacenter IPs instantly
2. **TLS Fingerprinting** → Detects non-browser tools (Puppeteer Stealth helps)
3. **Rate Limiting** → ~200 requests/hour per IP
4. **Behavioral Detection** → Flags unnatural patterns
## 🎯 Priority Implementation Order
1. **HIGH PRIORITY**: Add residential proxy support
2. **HIGH PRIORITY**: Implement exponential backoff for 429 errors
3. **MEDIUM**: Improve session cookie reuse
4. **MEDIUM**: Add doc_id monitoring system
5. **LOW**: Additional browsing pattern randomization
## 💰 Cost Estimates (for 10,000 profiles)
- **Proxy bandwidth**: ~750 MB
- **Cost**: $7.50-$11.25 in residential proxy fees
- **With Proxy Saver**: $5.25-$7.88 (30-50% savings)
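The numbers above follow from roughly 75 KB of proxied traffic per profile (an assumption consistent with 750 MB for 10,000 profiles, not a measured figure) at $10-15 per GB:

```javascript
// Back-of-envelope proxy cost in USD; kbPerProfile is an assumption.
function proxyCostUSD(profiles, kbPerProfile = 75, usdPerGB = 10) {
  const gb = (profiles * kbPerProfile) / 1e6; // decimal KB -> GB
  return gb * usdPerGB;
}

console.log(proxyCostUSD(10000, 75, 10)); // → 7.5
console.log(proxyCostUSD(10000, 75, 15)); // → 11.25
```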
## 🚨 Legal Considerations
- Only scrape **publicly available** data
- Respect rate limits
- Don't store PII of EU citizens without GDPR compliance
- Add delays to avoid overloading Instagram's servers
- Check Instagram's Terms of Service
## 📚 Additional Resources
- [Scrapfly Instagram Scraper](https://github.com/scrapfly/scrapfly-scrapers/tree/main/instagram-scraper) - Open source reference
- [Instagram GraphQL Endpoint Documentation](https://scrapfly.io/blog/posts/how-to-scrape-instagram#how-instagrams-scraping-api-works)
- [Proxy comparison guide](https://scrapfly.io/blog/best-proxy-providers-for-web-scraping)
## ⚡ Quick Wins
Things you can implement immediately:
1. ✅ Critical headers added (x-ig-app-id)
2. ✅ Human simulation functions integrated
3. ✅ Exponential backoff added (see EXPONENTIAL-BACKOFF.md)
4. Implement cookie persistence (15 min)
5. Research residential proxy providers (1 hour)
---
**Bottom Line**: Without residential proxies, this scraper will be blocked immediately on any cloud infrastructure. That's the #1 priority to address.