instagram-scraper/ANTI-BOT-RECOMMENDATIONS.md

# Instagram Scraper - Anti-Bot Detection Recommendations

Based on [Scrapfly's Instagram Scraping Guide](https://scrapfly.io/blog/posts/how-to-scrape-instagram)

## ✅ Already Implemented

1. **Puppeteer Stealth Plugin** - Bypasses basic browser detection
2. **Random User Agents** - Different browser signatures
3. **Human-like behaviors**:
   - Mouse movements
   - Random scrolling
   - Variable delays (2.5-6 seconds between profiles)
   - Typing delays
   - Breaks every 10 profiles
4. **Variable viewport sizes** - Randomized window dimensions
5. **Network payload interception** - Capturing API responses instead of DOM scraping
6. **Critical headers** - Including `x-ig-app-id: 936619743392459`

## ⚠️ Critical Improvements Needed

### 1. **Residential Proxies** (MOST IMPORTANT)

**Status**: ❌ Not implemented

**Issue**:

- Datacenter IPs (AWS, Google Cloud, etc.) are typically **blocked almost immediately** by Instagram
- The current setup is likely to be detected as soon as it is deployed to a cloud server

**Solution**:

```javascript
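// Assumes `puppeteer` (or `puppeteer-extra` with the stealth plugin) is already imported.
const browser = await puppeteer.launch({
  headless: true,
  args: [
    // Placeholder endpoint: substitute your provider's real host and port.
    // Residential proxies required - NOT datacenter
    "--proxy-server=residential-proxy-provider.com:port",
  ],
});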
```
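
Residential providers usually require a username and password, and Chromium's `--proxy-server` flag does not take credentials. A minimal sketch of one way to handle that, using Puppeteer's `page.authenticate`. The `proxy` object and its field names (`host`, `port`, `username`, `password`) are hypothetical and not part of this project:

```javascript
// Hypothetical config shape; none of these field names come from the project.
function buildLaunchOptions(proxy) {
  return {
    headless: true,
    args: [`--proxy-server=http://${proxy.host}:${proxy.port}`],
  };
}

// page.authenticate() supplies credentials for HTTP auth challenges,
// which covers the proxy's authentication prompt.
async function applyProxyAuth(page, proxy) {
  if (proxy.username && proxy.password) {
    await page.authenticate({
      username: proxy.username,
      password: proxy.password,
    });
  }
}

// Usage (assumes `puppeteer` is already imported):
// const proxy = { host: "proxy.example.com", port: 8000, username: "user", password: "pass" };
// const browser = await puppeteer.launch(buildLaunchOptions(proxy));
// const page = await browser.newPage();
// await applyProxyAuth(page, proxy);
```

Call `applyProxyAuth` before the first `page.goto`, otherwise the first request may fail the proxy's auth challenge.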

**Recommended Proxy Providers**:

- Bright Data (formerly Luminati)
- Oxylabs
- Smartproxy
- GeoSurf

**Requirements**:

- Must be residential IPs (from real ISPs like Comcast, AT&T)
- Use sticky sessions that hold one IP for 5-10 minutes, then rotate to a new IP
- Budget roughly 200 requests/hour per IP
- Cost: ~$10-15 per GB
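
The rotation and per-IP budget above can be tracked with a small helper. This is a sketch only: `ProxySessionBudget` is a hypothetical name, and the default limits are the rough figures quoted above, not values confirmed by Instagram. How a session is actually rotated depends on the proxy provider and is left to the caller.

```javascript
// Tracks requests made through the current proxy session in a sliding
// one-hour window and signals when the session should be rotated.
class ProxySessionBudget {
  constructor({ maxPerHour = 200, maxSessionMs = 10 * 60 * 1000 } = {}) {
    this.maxPerHour = maxPerHour;
    this.maxSessionMs = maxSessionMs;
    this.reset(Date.now());
  }

  // Call after switching to a new IP / sticky session.
  reset(now = Date.now()) {
    this.startedAt = now;
    this.timestamps = [];
  }

  // Record one request made through the current session.
  record(now = Date.now()) {
    this.timestamps.push(now);
  }

  // True when the session is too old or the hourly budget is spent.
  shouldRotate(now = Date.now()) {
    const hourAgo = now - 60 * 60 * 1000;
    this.timestamps = this.timestamps.filter((t) => t > hourAgo);
    const overBudget = this.timestamps.length >= this.maxPerHour;
    const tooOld = now - this.startedAt >= this.maxSessionMs;
    return overBudget || tooOld;
  }
}
```

Check `shouldRotate()` before each profile, call `record()` after each request, and call `reset()` once a new session is in place.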

### 2. **Rate Limit Handling with Exponential Backoff**

**Status**: ⚠️ Partial - needs improvement

**Current**: Random delays exist

**Needed**: Proper 429 error handling

```javascript
async function makeRequest(fn, retries = 3) {
  for (let i = 0; i < retries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (error.status === 429 && i < retries - 1) {
        const delay = 2 ** i * 2000; // doubles each retry: 2s, then 4s with the default retries = 3
        console.log(`Rate limited, waiting ${delay}ms...`);
        await new Promise((res) => setTimeout(res, delay));
        continue;
      }
      throw error;
    }
  }
}
```
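
Note that `makeRequest` only retries when the thrown error has `status === 429`, while `fetch` resolves normally on HTTP error responses. If requests go through `fetch` (an assumption; the project may use a different client), the response has to be turned into a thrown error first. A sketch, with `HttpError`, `assertOk`, and `fetchJson` as hypothetical helper names:

```javascript
// Error type carrying the HTTP status, which is the field makeRequest inspects.
class HttpError extends Error {
  constructor(status, url) {
    super(`HTTP ${status} for ${url}`);
    this.name = "HttpError";
    this.status = status;
  }
}

// Throws HttpError for any non-2xx response; otherwise returns the response.
function assertOk(response, url) {
  if (!response.ok) throw new HttpError(response.status, url);
  return response;
}

// fetchImpl defaults to the global fetch (Node 18+); injectable for testing.
async function fetchJson(url, options = {}, fetchImpl = fetch) {
  const response = assertOk(await fetchImpl(url, options), url);
  return response.json();
}

// Usage with the retry wrapper above:
// const data = await makeRequest(() => fetchJson(url, { headers }));
```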

### 3. **Session Cookies Management**

**Status**: ⚠️ Partial - `extractSession` exists, but its output is not reused

**Issue**: Creating new sessions repeatedly looks suspicious

**Solution**:

- Save cookies after login
- Reuse cookies across multiple scraping sessions
- Rotate sessions periodically

```javascript
const fs = require("fs");

// Save cookies after login
const session = await extractSession(page);
fs.writeFileSync("session.json", JSON.stringify(session));

// Reuse cookies in the next session.
// Assumes extractSession returns an object with a `cookies` array.
const savedSession = JSON.parse(fs.readFileSync("session.json", "utf8"));
await page.setCookie(...savedSession.cookies);
```
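
To cover the "rotate sessions periodically" point, the saved file can carry a timestamp so stale sessions are discarded. A minimal sketch, assuming a hypothetical file layout of `{ savedAt, cookies }`; `saveSession` and `loadSession` are illustrative names, not existing project functions:

```javascript
const fs = require("fs");

// Hypothetical file layout: { savedAt: <epoch ms>, cookies: [...] }.
function saveSession(file, cookies, now = Date.now()) {
  fs.writeFileSync(file, JSON.stringify({ savedAt: now, cookies }));
}

// Returns the saved cookie array, or null when the file is missing,
// unreadable, or older than maxAgeMs (meaning a fresh login is needed).
function loadSession(file, maxAgeMs, now = Date.now()) {
  try {
    const { savedAt, cookies } = JSON.parse(fs.readFileSync(file, "utf8"));
    if (!Array.isArray(cookies) || now - savedAt > maxAgeMs) return null;
    return cookies;
  } catch {
    return null;
  }
}
```

When `loadSession` returns an array, pass it to `page.setCookie(...cookies)`; when it returns `null`, log in again and call `saveSession`.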

### 4. **Realistic Browsing Patterns**

**Status**: ✅ Implemented, but can be improved

**Additional improvements**:

- Visit homepage before going to target profile
- Occasionally view posts/stories during following list scraping
- Don't always scrape in the same order (randomize)
- Add occasional "browsing breaks" of 30-60 seconds
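
The order randomization and browsing breaks could look roughly like this. It is a sketch, not the project's code: `visitProfile` is a hypothetical callback supplied by the caller, and the 10% break probability is an arbitrary illustrative default.

```javascript
// Random integer in [min, max], inclusive.
function randomInt(min, max) {
  return Math.floor(Math.random() * (max - min + 1)) + min;
}

// Fisher-Yates shuffle; returns a new array and leaves the input untouched.
function shuffle(items) {
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = randomInt(0, i);
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Visits profiles in random order, pausing 30-60 s after some of them.
async function scrapeInRandomOrder(usernames, visitProfile, breakChance = 0.1) {
  for (const username of shuffle(usernames)) {
    await visitProfile(username);
    if (Math.random() < breakChance) {
      await sleep(randomInt(30000, 60000)); // occasional browsing break
    }
  }
}
```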

### 5. **Monitor doc_id Changes**

**Status**: ❌ Not monitoring

**Issue**: Instagram changes GraphQL `doc_id` values every 2-4 weeks

**Current doc_ids** (as of the Scrapfly article):

- Profile posts: `9310670392322965`
- Post details: `8845758582119845`
- Reels: `25981206651899035`

**Solution**:

- Monitor Instagram's GraphQL requests in browser DevTools
- Update when API calls start failing
- Or use a service like Scrapfly that auto-updates
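
The DevTools check can be partly automated by watching the browser's own GraphQL traffic. A sketch under two assumptions: GraphQL calls are POST requests whose URL contains `graphql`, and `doc_id` is sent as a form-encoded body field. `extractDocId` and `watchDocIds` are hypothetical names; `page.on("request")`, `request.method()`, `request.url()`, and `request.postData()` are standard Puppeteer APIs.

```javascript
// doc_ids listed above; update this set whenever they are refreshed.
const KNOWN_DOC_IDS = new Set([
  "9310670392322965", // profile posts
  "8845758582119845", // post details
  "25981206651899035", // reels
]);

// Pulls doc_id out of a form-encoded request body, or returns null.
function extractDocId(postData) {
  if (!postData) return null;
  return new URLSearchParams(postData).get("doc_id");
}

// Logs any doc_id seen in the page's GraphQL traffic that is not in the set.
function watchDocIds(page, known = KNOWN_DOC_IDS, log = console.warn) {
  page.on("request", (request) => {
    if (request.method() !== "POST" || !request.url().includes("graphql")) return;
    const docId = extractDocId(request.postData());
    if (docId && !known.has(docId)) {
      log(`Unrecognized doc_id ${docId} requested at ${request.url()}`);
    }
  });
}
```

An unrecognized ID is not necessarily a replacement for one of the three above, since Instagram issues many different queries. Treat logged IDs as candidates to inspect manually.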

## 📊 Instagram's Blocking Layers

1. **IP Quality Check** → Blocks datacenter IPs instantly
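2. **TLS Fingerprinting** → Detects non-browser tools (driving a real Chromium build via Puppeteer gives a genuine browser TLS fingerprint; the Stealth plugin covers JavaScript-level fingerprinting instead)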
3. **Rate Limiting** → ~200 requests/hour per IP
4. **Behavioral Detection** → Flags unnatural patterns

## 🎯 Priority Implementation Order

1. **HIGH PRIORITY**: Add residential proxy support
2. **HIGH PRIORITY**: Implement exponential backoff for 429 errors
3. **MEDIUM**: Improve session cookie reuse
4. **MEDIUM**: Add doc_id monitoring system
5. **LOW**: Additional browsing pattern randomization

## 💰 Cost Estimates (for 10,000 profiles)

- **Proxy bandwidth**: ~750 MB
- **Cost**: $7.50-$11.25 in residential proxy fees
- **With Proxy Saver**: $3.75-$7.88 (30-50% savings; the low end applies 50% to $7.50, the high end 30% to $11.25)
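
For other batch sizes, the base estimate is straightforward arithmetic. A sketch using only the figures in this document (~750 MB per 10,000 profiles, $10-15 per GB, 1 GB treated as 1000 MB); `estimateProxyCost` and its option names are hypothetical:

```javascript
// Returns the estimated bandwidth in GB and the low/high cost in USD.
// `savings` is a fraction (e.g. 0.3 for 30%) applied to both ends.
function estimateProxyCost(
  profiles,
  { mbPer10k = 750, usdPerGb = [10, 15], savings = 0 } = {}
) {
  const gb = (profiles / 10000) * (mbPer10k / 1000);
  const [low, high] = usdPerGb.map((rate) => gb * rate * (1 - savings));
  return { gb, low, high };
}

// Example: base cost for 10,000 profiles.
const base = estimateProxyCost(10000);
console.log(`${base.gb} GB, $${base.low.toFixed(2)}-$${base.high.toFixed(2)}`);
```

Real bandwidth per profile depends on how much media the page loads, so treat the result as a rough planning number.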

## 🚨 Legal Considerations

- Only scrape **publicly available** data
- Respect rate limits
- Don't store PII of EU residents without GDPR compliance
- Add delays to avoid overloading Instagram's servers
- Check Instagram's Terms of Service

## 📚 Additional Resources

- [Scrapfly Instagram Scraper](https://github.com/scrapfly/scrapfly-scrapers/tree/main/instagram-scraper) - Open source reference
- [Instagram GraphQL Endpoint Documentation](https://scrapfly.io/blog/posts/how-to-scrape-instagram#how-instagrams-scraping-api-works)
- [Proxy comparison guide](https://scrapfly.io/blog/best-proxy-providers-for-web-scraping)

## ⚡ Quick Wins

Items marked ✅ are already done; the rest can be implemented immediately:

1. ✅ Critical headers added (x-ig-app-id)
2. ✅ Human simulation functions integrated
3. ✅ Exponential backoff added (see EXPONENTIAL-BACKOFF.md)
4. Implement cookie persistence (15 min)
5. Research residential proxy providers (1 hour)

---

**Bottom Line**: Without residential proxies, this scraper will be blocked immediately on any cloud infrastructure. That's the #1 priority to address.