# Instagram Scraper - Anti-Bot Detection Recommendations

Based on [Scrapfly's Instagram Scraping Guide](https://scrapfly.io/blog/posts/how-to-scrape-instagram)

## ✅ Already Implemented

1. **Puppeteer Stealth Plugin** - Bypasses basic browser detection
2. **Random User Agents** - Different browser signatures
3. **Human-like behaviors**:
   - Mouse movements
   - Random scrolling
   - Variable delays (2.5-6 seconds between profiles)
   - Typing delays
   - Breaks every 10 profiles
4. **Variable viewport sizes** - Randomized window dimensions
5. **Network payload interception** - Capturing API responses instead of DOM scraping
6. **Critical headers** - Including `x-ig-app-id: 936619743392459`

## ⚠️ Critical Improvements Needed

### 1. **Residential Proxies** (MOST IMPORTANT)

**Status**: ❌ Not implemented

**Issue**:

- Datacenter IPs (AWS, Google Cloud, etc.) are **blocked instantly** by Instagram
- Your current setup will be detected as soon as you deploy to any cloud server

**Solution**:

```javascript
const browser = await puppeteer.launch({
  headless: true,
  args: [
    "--proxy-server=residential-proxy-provider.com:port",
    // Residential proxies required - NOT datacenter
  ],
});
```

**Recommended Proxy Providers**:

- Bright Data (formerly Luminati)
- Oxylabs
- Smartproxy
- GeoSurf

**Requirements**:

- Must be residential IPs (from real ISPs like Comcast, AT&T)
- Rotate IPs every 5-10 minutes (sticky sessions)
- Each IP allows ~200 requests/hour
- Cost: ~$10-15 per GB

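
A minimal sketch of how the launch flag and the sticky-session requirement might fit together. The `proxies` pool, its hostnames, and its credentials are placeholders, and `stickyProxy` and `buildLaunchOptions` are hypothetical helper names. Chromium's `--proxy-server` flag does not carry credentials, so they would typically be supplied through Puppeteer's `page.authenticate()` once a page exists.

```javascript
// Hypothetical pool; replace with endpoints from your residential provider.
const proxies = [
  { host: "proxy-a.example.com", port: 8000, username: "user", password: "pass" },
  { host: "proxy-b.example.com", port: 8000, username: "user", password: "pass" },
];

// Pick a proxy that stays fixed for one sticky window (default 10 minutes),
// then moves on to the next pool entry.
function stickyProxy(pool, nowMs = Date.now(), windowMs = 10 * 60 * 1000) {
  const windowIndex = Math.floor(nowMs / windowMs);
  return pool[windowIndex % pool.length];
}

// Build the options object passed to puppeteer.launch().
function buildLaunchOptions(proxy) {
  return {
    headless: true,
    args: [`--proxy-server=http://${proxy.host}:${proxy.port}`],
  };
}

const proxy = stickyProxy(proxies);
const launchOptions = buildLaunchOptions(proxy);
// With Puppeteer available:
//   const browser = await puppeteer.launch(launchOptions);
//   const page = await browser.newPage();
//   await page.authenticate({ username: proxy.username, password: proxy.password });
```

Many providers can also rotate IPs on their side, in which case the pool collapses to a single gateway endpoint and the rotation helper is unnecessary.
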
### 2. **Rate Limit Handling with Exponential Backoff**

**Status**: ⚠️ Partial - needs improvement

**Current**: Random delays exist

**Needed**: Proper 429 error handling

```javascript
// Retry `fn` when it fails with HTTP 429, doubling the wait each time.
// Assumes `fn` throws an error that carries the HTTP status in `error.status`.
async function makeRequest(fn, retries = 3) {
  for (let i = 0; i < retries; i++) {
    try {
      return await fn();
    } catch (error) {
      // Back off only on rate limiting, and only while attempts remain
      if (error.status === 429 && i < retries - 1) {
        const delay = Math.pow(2, i) * 2000; // 2s, 4s (8s and up only if retries > 3)
        console.log(`Rate limited, waiting ${delay}ms...`);
        await new Promise((res) => setTimeout(res, delay));
        continue;
      }
      throw error;
    }
  }
}
```

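
One common refinement, not part of the code above, is to add random jitter and an upper cap so that parallel workers do not retry in lockstep. The sketch below shows one way this could look; `backoffDelay` and its parameters are illustrative names, and `random` is injectable only to make the function easy to test.

```javascript
// "Full jitter" backoff: wait a random time between 0 and
// min(capMs, baseMs * 2^attempt).
function backoffDelay(attempt, baseMs = 2000, capMs = 60000, random = Math.random) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Upper bounds with the defaults above:
// attempt 0 -> 2000 ms, attempt 1 -> 4000 ms, attempt 5 -> capped at 60000 ms
```

In `makeRequest`, the fixed `Math.pow(2, i) * 2000` delay could be swapped for `backoffDelay(i)`.
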
### 3. **Session Cookies Management**

**Status**: ⚠️ Partial - `extractSession` exists, but its output is not reused

**Issue**: Creating new sessions repeatedly looks suspicious

**Solution**:

- Save cookies after login
- Reuse cookies across multiple scraping sessions
- Rotate sessions periodically

```javascript
const fs = require("fs");

// Save the session after login
// (assumes extractSession returns an object with a `cookies` array)
const session = await extractSession(page);
fs.writeFileSync("session.json", JSON.stringify(session));

// Reuse cookies in the next session
const savedSession = JSON.parse(fs.readFileSync("session.json", "utf8"));
await page.setCookie(...savedSession.cookies);
```

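
Before reusing a saved session, it may help to drop cookies that have already expired and to discard session files older than some rotation age. A minimal sketch, assuming the file holds an object of the form `{ savedAt, cookies }`, where `savedAt` is a millisecond timestamp written at save time (an assumption, not something `extractSession` is known to produce) and each cookie follows Puppeteer's cookie shape, in which `expires` is in Unix seconds and `-1` marks a session cookie. The helper names and the 24-hour rotation age are illustrative.

```javascript
const fs = require("node:fs");

const MAX_SESSION_AGE_MS = 24 * 60 * 60 * 1000; // illustrative rotation age

// Write cookies together with the time they were saved.
function saveSession(file, cookies, nowMs = Date.now()) {
  fs.writeFileSync(file, JSON.stringify({ savedAt: nowMs, cookies }));
}

// Return reusable cookies, or null if a fresh login is needed.
function loadSession(file, nowMs = Date.now()) {
  if (!fs.existsSync(file)) return null;
  const { savedAt, cookies } = JSON.parse(fs.readFileSync(file, "utf8"));
  if (nowMs - savedAt > MAX_SESSION_AGE_MS) return null; // rotate old sessions
  const nowSeconds = nowMs / 1000;
  const live = cookies.filter((c) => c.expires === -1 || c.expires > nowSeconds);
  return live.length > 0 ? live : null;
}
```

A caller would then log in again whenever `loadSession("session.json")` returns `null`, and otherwise pass the result to `page.setCookie(...cookies)`.
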
### 4. **Realistic Browsing Patterns**

**Status**: ✅ Implemented but can improve

**Additional improvements**:

- Visit homepage before going to target profile
- Occasionally view posts/stories during following list scraping
- Don't always scrape in the same order (randomize)
- Add occasional "browsing breaks" of 30-60 seconds

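
The order randomization and browsing breaks listed above could be sketched as follows. `shuffle`, `browsingBreakMs`, `visitAll`, and the 10% break probability are illustrative choices rather than values from the guide; `random` is injectable for testing.

```javascript
// Fisher-Yates shuffle; returns a new array and leaves the input untouched.
function shuffle(items, random = Math.random) {
  const result = [...items];
  for (let i = result.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    [result[i], result[j]] = [result[j], result[i]];
  }
  return result;
}

// Occasionally return a 30-60 second pause, otherwise 0.
function browsingBreakMs(random = Math.random, probability = 0.1) {
  if (random() >= probability) return 0;
  return 30000 + Math.floor(random() * 30001);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Visit profiles in random order, pausing now and then.
// `visit` is whatever async function scrapes a single profile.
async function visitAll(usernames, visit) {
  for (const username of shuffle(usernames)) {
    await visit(username);
    await sleep(browsingBreakMs());
  }
}
```
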
### 5. **Monitor doc_id Changes**

**Status**: ❌ Not monitoring

**Issue**: Instagram changes GraphQL `doc_id` values every 2-4 weeks

**Current doc_ids** (as of the Scrapfly article):

- Profile posts: `9310670392322965`
- Post details: `8845758582119845`
- Reels: `25981206651899035`

**Solution**:

- Monitor Instagram's GraphQL requests in browser DevTools
- Update the stored values when API calls start failing
- Or use a service like Scrapfly that auto-updates

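
A minimal sketch of the monitoring step, assuming the GraphQL request body is form-encoded and carries a `doc_id` field (verify this against live traffic in DevTools). `KNOWN_DOC_IDS`, `extractDocId`, and `findUnknownDocIds` are hypothetical names. With Puppeteer, the bodies could be collected from `request.postData()` inside a `page.on("request", ...)` handler.

```javascript
// doc_ids the scraper currently relies on (values from the list above).
const KNOWN_DOC_IDS = {
  profilePosts: "9310670392322965",
  postDetails: "8845758582119845",
  reels: "25981206651899035",
};

// Pull doc_id out of a form-encoded request body; null if absent.
function extractDocId(postData) {
  if (!postData) return null;
  return new URLSearchParams(postData).get("doc_id");
}

// Return observed doc_ids that are not in the known set.
function findUnknownDocIds(observedBodies, known = KNOWN_DOC_IDS) {
  const knownIds = new Set(Object.values(known));
  const unknown = new Set();
  for (const body of observedBodies) {
    const id = extractDocId(body);
    if (id && !knownIds.has(id)) unknown.add(id);
  }
  return [...unknown];
}
```

A non-empty result from `findUnknownDocIds` is a hint that Instagram has rotated an ID and the stored values may need updating.
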
## 📊 Instagram's Blocking Layers

1. **IP Quality Check** → Blocks datacenter IPs instantly
2. **TLS Fingerprinting** → Detects non-browser HTTP clients (Puppeteer drives a real Chromium, so its TLS fingerprint is already a browser's; the Stealth plugin addresses JavaScript-level checks)
3. **Rate Limiting** → ~200 requests/hour per IP
4. **Behavioral Detection** → Flags unnatural patterns

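
The ~200 requests/hour figure can be enforced client-side with a sliding-window counter per proxy IP. A minimal sketch; the class name and its methods are illustrative, and the limit is the approximate figure quoted above rather than a documented Instagram value.

```javascript
// Sliding-window limiter: at most `limit` requests per `windowMs` for each key.
class PerIpRateLimiter {
  constructor(limit = 200, windowMs = 60 * 60 * 1000) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.timestamps = new Map(); // ip -> request times (ms), oldest first
  }

  // Milliseconds to wait before `ip` may send another request (0 = go now).
  waitTimeMs(ip, nowMs = Date.now()) {
    const recent = (this.timestamps.get(ip) || []).filter(
      (t) => nowMs - t < this.windowMs
    );
    this.timestamps.set(ip, recent);
    if (recent.length < this.limit) return 0;
    return recent[0] + this.windowMs - nowMs;
  }

  // Record a request that was actually sent.
  record(ip, nowMs = Date.now()) {
    const times = this.timestamps.get(ip) || [];
    times.push(nowMs);
    this.timestamps.set(ip, times);
  }
}
```

Before each request the scraper would call `waitTimeMs(ip)`, sleep for that long (or switch to another proxy), send the request, and then call `record(ip)`.
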
## 🎯 Priority Implementation Order

1. **HIGH PRIORITY**: Add residential proxy support
2. **HIGH PRIORITY**: Implement exponential backoff for 429 errors
3. **MEDIUM**: Improve session cookie reuse
4. **MEDIUM**: Add doc_id monitoring system
5. **LOW**: Additional browsing pattern randomization

## 💰 Cost Estimates (for 10,000 profiles)

- **Proxy bandwidth**: ~750 MB
- **Cost**: $7.50-$11.25 in residential proxy fees (at $10-15 per GB)
- **With Proxy Saver**: $5.25-$7.88 at 30% savings, $3.75-$5.63 at 50% savings

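
The base estimate follows from the ~$10-15 per GB pricing quoted earlier. A sketch of the arithmetic, assuming 1 GB = 1000 MB and about 75 KB of proxy traffic per profile (750 MB spread over 10,000 profiles); the function name is illustrative.

```javascript
// Estimate residential proxy cost for a scrape.
function estimateProxyCost(profiles, kbPerProfile = 75, pricePerGb = [10, 15]) {
  const gigabytes = (profiles * kbPerProfile) / 1_000_000; // KB -> GB (decimal)
  const [low, high] = pricePerGb.map((price) => gigabytes * price);
  return { gigabytes, low, high };
}

const estimate = estimateProxyCost(10000);
console.log(estimate); // { gigabytes: 0.75, low: 7.5, high: 11.25 }
```
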
## 🚨 Legal Considerations

- Only scrape **publicly available** data
- Respect rate limits
- Don't store PII of people in the EU without GDPR compliance
- Add delays to avoid overloading Instagram's servers
- Check Instagram's Terms of Service

## 📚 Additional Resources

- [Scrapfly Instagram Scraper](https://github.com/scrapfly/scrapfly-scrapers/tree/main/instagram-scraper) - Open source reference
- [Instagram GraphQL Endpoint Documentation](https://scrapfly.io/blog/posts/how-to-scrape-instagram#how-instagrams-scraping-api-works)
- [Proxy comparison guide](https://scrapfly.io/blog/best-proxy-providers-for-web-scraping)

## ⚡ Quick Wins

Things you can implement immediately (✅ marks items already done):

1. ✅ Critical headers added (`x-ig-app-id`)
2. ✅ Human simulation functions integrated
3. ✅ Exponential backoff added (see EXPONENTIAL-BACKOFF.md)
4. Implement cookie persistence (15 min)
5. Research residential proxy providers (1 hour)

---

**Bottom Line**: Without residential proxies, this scraper will be blocked immediately on any cloud infrastructure. That's the #1 priority to address.