feat: Instagram scraper with GraphQL API integration

- Automated followings list extraction via API interception
- Profile scraping using GraphQL endpoint interception
- DOM fallback for edge cases
- Performance timing for all operations
- Anti-bot measures and human-like behavior simulation
# Instagram Scraper - Usage Guide

Complete guide to using the Instagram scraper with all available workflows.

## 🚀 Quick Start

### 1. Full Workflow (Recommended)

The most comprehensive workflow, using all scraper functions:

```bash
# Windows PowerShell
$env:INSTAGRAM_USERNAME="your_username"
$env:INSTAGRAM_PASSWORD="your_password"
$env:TARGET_USERNAME="instagram"
$env:MAX_FOLLOWING="20"
$env:MAX_PROFILES="5"
$env:MODE="full"

node server.js
```

**What happens:**

1. 🔐 **Login** - Logs into Instagram with human-like behavior
2. 💾 **Save Session** - Extracts and saves cookies to `session_cookies.json`
3. 🌐 **Browse** - Simulates random mouse movements and scrolling
4. 👥 **Fetch Followings** - Gets the following list using API interception
5. 👤 **Scrape Profiles** - Scrapes detailed data for each profile
6. 📁 **Save Data** - Creates JSON files with all collected data

**Output files:**

- `followings_[username]_[timestamp].json` - Full following list
- `profiles_[username]_[timestamp].json` - Detailed profile data
- `session_cookies.json` - Reusable session cookies
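
The timestamped file names above can be produced with a small helper. This is a sketch only: `buildOutputName` is a hypothetical name, and the actual naming code in `server.js` may differ.

```javascript
// Hypothetical helper mirroring the followings_[username]_[timestamp].json
// naming pattern above; the real code in server.js may differ.
function buildOutputName(prefix, username, date = new Date()) {
  // ISO timestamp with characters that are unsafe in file names replaced
  const timestamp = date.toISOString().replace(/[:.]/g, "-");
  return `${prefix}_${username}_${timestamp}.json`;
}

console.log(buildOutputName("followings", "instagram"));
```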

### 2. Simple Workflow

Uses the built-in `scrapeWorkflow()` function:

```bash
$env:MODE="simple"
node server.js
```

**What it does:**

- Combines login + following fetch + profile scraping
- Single output file with all data
- Less granular control, but simpler

### 3. Scheduled Workflow

Runs scraping on a schedule using `cronJobs()`:

```bash
$env:MODE="scheduled"
$env:SCRAPE_INTERVAL="60"  # Minutes between runs
$env:MAX_RUNS="5"          # Stop after 5 runs
node server.js
```

**Use case:** Monitor a profile's followings over time

## 📋 Environment Variables

| Variable             | Description                           | Default         | Example               |
| -------------------- | ------------------------------------- | --------------- | --------------------- |
| `INSTAGRAM_USERNAME` | Your Instagram username               | `your_username` | `john_doe`            |
| `INSTAGRAM_PASSWORD` | Your Instagram password               | `your_password` | `MySecureP@ss`        |
| `TARGET_USERNAME`    | Profile to scrape                     | `instagram`     | `cristiano`           |
| `MAX_FOLLOWING`      | Max followings to fetch               | `20`            | `100`                 |
| `MAX_PROFILES`       | Max profiles to scrape                | `5`             | `50`                  |
| `PROXY`              | Proxy server                          | `None`          | `proxy.com:8080`      |
| `MODE`               | Workflow type                         | `full`          | `simple`, `scheduled` |
| `SCRAPE_INTERVAL`    | Minutes between runs (scheduled mode) | `60`            | `30`                  |
| `MAX_RUNS`           | Max runs (scheduled mode)             | `5`             | `10`                  |
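
These are plain environment variables, so the defaults above can be applied at startup. A sketch of how such a config loader might look (`loadConfig` is a hypothetical name; the actual parsing in `server.js` may differ):

```javascript
// Sketch of reading the environment variables above with their documented
// defaults; the actual code in server.js may differ.
function loadConfig(env = process.env) {
  return {
    username: env.INSTAGRAM_USERNAME || "your_username",
    password: env.INSTAGRAM_PASSWORD || "your_password",
    target: env.TARGET_USERNAME || "instagram",
    maxFollowing: parseInt(env.MAX_FOLLOWING || "20", 10),
    maxProfiles: parseInt(env.MAX_PROFILES || "5", 10),
    proxy: env.PROXY || null,
    mode: env.MODE || "full",
    scrapeIntervalMin: parseInt(env.SCRAPE_INTERVAL || "60", 10),
    maxRuns: parseInt(env.MAX_RUNS || "5", 10),
  };
}

console.log(loadConfig({ TARGET_USERNAME: "cristiano", MODE: "scheduled" }));
```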

## 🎯 Workflow Details

### Full Workflow Step-by-Step

```javascript
async function fullScrapingWorkflow() {
  // Step 1: Login
  const { browser, page } = await login(credentials, proxy);

  // Step 2: Extract session
  const session = await extractSession(page);

  // Step 3: Simulate browsing
  await simulateHumanBehavior(page, { mouseMovements: 5, scrolls: 3 });

  // Step 4: Get followings list
  const followingsData = await getFollowingsList(
    page,
    targetUsername,
    maxFollowing
  );

  // Step 5: Scrape individual profiles
  for (const username of followingsData.usernames) {
    const profileData = await scrapeProfile(page, username);
    // ... takes breaks every 3 profiles
  }

  // Step 6: Save all data
  // ... creates JSON files
}
```

### What Each Function Does

#### `login(credentials, proxy)`

- Launches browser with stealth mode
- Sets anti-detection headers
- Simulates human login behavior
- Returns `{ browser, page }`

#### `extractSession(page)`

- Gets all cookies from the current session
- Returns `{ cookies: [...] }`
- Save the result for session reuse

#### `simulateHumanBehavior(page, options)`

- Random mouse movements
- Random scrolling
- Mimics real user behavior
- Options: `{ mouseMovements, scrolls, randomClicks }`
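
The mouse-movement part can be pictured as generating random waypoints and visiting them one by one with short pauses. A hypothetical sketch of the waypoint step (the real `simulateHumanBehavior` may differ):

```javascript
// Hypothetical sketch of the waypoint generation behind random mouse
// movements; the real simulateHumanBehavior may differ.
function randomWaypoints(count, width, height, rng = Math.random) {
  const points = [];
  for (let i = 0; i < count; i++) {
    points.push({
      x: Math.floor(rng() * width),
      y: Math.floor(rng() * height),
    });
  }
  return points;
}

// Each waypoint would then be passed to page.mouse.move(x, y) with a
// random pause in between.
console.log(randomWaypoints(5, 1280, 720));
```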

#### `getFollowingsList(page, username, maxUsers)`

- Navigates to the profile
- Clicks the "following" button
- Intercepts Instagram API responses
- Returns `{ usernames: [...], fullData: [...] }`

**Full data includes:**

```json
{
  "pk": "310285748",
  "username": "example_user",
  "full_name": "Example User",
  "profile_pic_url": "https://...",
  "is_verified": true,
  "is_private": false,
  "fbid_v2": "...",
  "latest_reel_media": 1761853039
}
```
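
Once a following-list response is intercepted, turning it into the `{ usernames, fullData }` shape is a matter of walking the JSON body. A sketch, assuming the payload carries a top-level `users` array (Instagram can change the response shape without notice):

```javascript
// Sketch of pulling usernames and full records out of an intercepted
// following-list payload; assumes a top-level "users" array, which
// Instagram can change without notice.
function parseFollowingsPayload(payload) {
  const users = Array.isArray(payload.users) ? payload.users : [];
  return {
    usernames: users.map((u) => u.username),
    fullData: users,
  };
}

const sample = {
  users: [{ pk: "310285748", username: "example_user", is_verified: true }],
};
console.log(parseFollowingsPayload(sample).usernames); // logs [ 'example_user' ]
```

Guarding on `Array.isArray` keeps the scraper from crashing when a response arrives in an unexpected shape, which is exactly when the DOM fallback should take over.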

#### `scrapeProfile(page, username)`

- Navigates to the profile
- Intercepts the API endpoint
- Falls back to DOM scraping if needed
- Returns detailed profile data

**Profile data includes:**

```json
{
  "username": "example_user",
  "full_name": "Example User",
  "bio": "Biography text...",
  "followerCount": 15000,
  "followingCount": 500,
  "postsCount": 100,
  "is_verified": true,
  "is_private": false,
  "is_business_account": true,
  "email": "contact@example.com",
  "phone": "+1234567890"
}
```

#### `scrapeWorkflow(creds, targetUsername, proxy, maxFollowing)`

- Complete workflow in one function
- Combines all steps above
- Returns aggregated results

#### `cronJobs(fn, intervalSec, stopAfter)`

- Runs a function on an interval
- Returns a stop function
- Used for scheduled scraping

## 💡 Usage Examples

### Example 1: Scrape a Top Influencer's Followings

```bash
$env:INSTAGRAM_USERNAME="your_account"
$env:INSTAGRAM_PASSWORD="your_password"
$env:TARGET_USERNAME="cristiano"
$env:MAX_FOLLOWING="100"
$env:MAX_PROFILES="20"
node server.js
```

### Example 2: Monitor Competitor Every Hour

```bash
$env:TARGET_USERNAME="competitor_account"
$env:MODE="scheduled"
$env:SCRAPE_INTERVAL="60"
$env:MAX_RUNS="24"  # Run for 24 hours
node server.js
```

### Example 3: Scrape Multiple Accounts

Create `scrape-multiple.js`:

```javascript
const { fullScrapingWorkflow } = require("./server.js");

const targets = ["account1", "account2", "account3"];

async function scrapeAll() {
  for (const target of targets) {
    process.env.TARGET_USERNAME = target;
    await fullScrapingWorkflow();

    // Wait between accounts
    await new Promise((r) => setTimeout(r, 300000)); // 5 minutes
  }
}

scrapeAll();
```

### Example 4: Custom Workflow with Your Logic

```javascript
const { login, getFollowingsList, scrapeProfile } = require("./scraper.js");

async function myCustomWorkflow() {
  // Login once
  const { browser, page } = await login({
    username: "your_username",
    password: "your_password",
  });

  try {
    // Get followings from multiple accounts
    const accounts = ["account1", "account2"];

    for (const account of accounts) {
      const followings = await getFollowingsList(page, account, 50);

      // Filter verified users only
      const verified = followings.fullData.filter((u) => u.is_verified);

      // Scrape verified profiles
      for (const user of verified) {
        const profile = await scrapeProfile(page, user.username);

        // Custom logic: save only if business account
        if (profile.is_business_account) {
          console.log(`Business: ${profile.username} - ${profile.email}`);
        }
      }
    }
  } finally {
    await browser.close();
  }
}

myCustomWorkflow();
```

## 🔍 Output Format

### Followings Data

```json
{
  "targetUsername": "instagram",
  "scrapedAt": "2025-10-31T12:00:00.000Z",
  "totalFollowings": 20,
  "followings": [
    {
      "pk": "123456",
      "username": "user1",
      "full_name": "User One",
      "is_verified": true,
      ...
    }
  ]
}
```

### Profiles Data

```json
{
  "targetUsername": "instagram",
  "scrapedAt": "2025-10-31T12:00:00.000Z",
  "totalProfiles": 5,
  "profiles": [
    {
      "username": "user1",
      "followerCount": 50000,
      "email": "contact@user1.com",
      ...
    }
  ]
}
```

## ⚡ Performance Tips

### 1. Optimize Delays

```javascript
// Faster (more aggressive, higher block risk)
await randomSleep(1000, 2000);

// Balanced (recommended)
await randomSleep(2500, 6000);

// Safer (slower but less likely to be blocked)
await randomSleep(5000, 10000);
```
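
For reference, a `randomSleep(minMs, maxMs)` helper can be as small as a uniform pick plus a timer. A sketch (the project's own implementation may differ):

```javascript
// Sketch of a randomSleep(minMs, maxMs) helper; the project's own
// implementation may differ.
function randomDelay(minMs, maxMs, rng = Math.random) {
  // Uniformly pick an integer delay in [minMs, maxMs]
  return minMs + Math.floor(rng() * (maxMs - minMs + 1));
}

function randomSleep(minMs, maxMs) {
  return new Promise((resolve) => setTimeout(resolve, randomDelay(minMs, maxMs)));
}

randomSleep(10, 20).then(() => console.log("slept"));
```

Keeping the delay pick in its own function makes the range easy to test without actually waiting.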

### 2. Batch Processing

Scrape in batches to avoid overwhelming Instagram:

```javascript
const batchSize = 10;
for (let i = 0; i < usernames.length; i += batchSize) {
  const batch = usernames.slice(i, i + batchSize);
  // Scrape batch
  // Long break between batches
  await randomSleep(60000, 120000); // 1-2 minutes
}
```
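
The slicing above generalizes to a small chunking helper (`chunk` is a hypothetical name, not part of the project's API):

```javascript
// Hypothetical helper generalizing the batch-slicing loop above.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

console.log(chunk(["a", "b", "c", "d", "e"], 2)); // 3 batches: 2 + 2 + 1
```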

### 3. Session Reuse

Reuse cookies to avoid logging in repeatedly:

```javascript
const fs = require("fs");

const savedCookies = JSON.parse(fs.readFileSync("session_cookies.json", "utf8"));
await page.setCookie(...savedCookies.cookies);
```
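
Saved cookies expire, so it is worth dropping stale entries before restoring a session. A sketch, assuming Puppeteer-style cookie objects where `expires` is a Unix timestamp in seconds and `-1` marks a session cookie:

```javascript
// Drop cookies that have already expired before restoring a session.
// Assumes Puppeteer-style cookie objects: `expires` is a Unix timestamp
// in seconds, and -1 marks a session cookie.
function freshCookies(cookies, nowSec = Date.now() / 1000) {
  return cookies.filter((c) => c.expires === -1 || c.expires > nowSec);
}

const sample = [
  { name: "sessionid", expires: -1 },
  { name: "old", expires: 1000 },
];
console.log(freshCookies(sample, 2000)); // keeps only "sessionid"
```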

## 🚨 Common Issues

### "Rate limited (429)"

✅ **Solution**: Exponential backoff is automatic. If the errors persist:

- Reduce `MAX_FOLLOWING` and `MAX_PROFILES`
- Increase delays
- Add residential proxies
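
The automatic backoff mentioned above typically doubles the wait on each retry, caps it, and adds jitter so parallel clients do not retry in lockstep. A sketch of the delay schedule (see EXPONENTIAL-BACKOFF.md for the project's actual policy):

```javascript
// Sketch of an exponential backoff delay with full jitter; see
// EXPONENTIAL-BACKOFF.md for the project's actual policy.
function backoffDelay(attempt, baseMs = 1000, maxMs = 60000, rng = Math.random) {
  // Double the window each attempt, capped at maxMs
  const windowMs = Math.min(baseMs * 2 ** attempt, maxMs);
  // Full jitter: pick uniformly in [0, windowMs) to desynchronize retries
  return Math.floor(rng() * windowMs);
}

for (let attempt = 0; attempt < 5; attempt++) {
  console.log(`attempt ${attempt}: wait up to ${Math.min(1000 * 2 ** attempt, 60000)} ms`);
}
```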

### "Login failed"

- Check credentials
- Instagram may require verification
- Try from your home IP first

### "No data captured"

- Instagram may have changed its API structure
- Check if `doc_id` values need updating
- DOM fallback should still work

### Blocked on cloud servers

❌ **Problem**: Using datacenter IPs

✅ **Solution**: Get residential proxies (see ANTI-BOT-RECOMMENDATIONS.md)

## 📊 Best Practices

1. **Start Small**: Test with `MAX_FOLLOWING=5`, `MAX_PROFILES=2`
2. **Use Residential Proxies**: Critical for production use
3. **Respect Rate Limits**: ~200 requests/hour per IP
4. **Save Sessions**: Reuse cookies to avoid repeated logins
5. **Monitor Logs**: Watch for 429 errors
6. **Add Randomness**: Vary delays and patterns
7. **Take Breaks**: Schedule longer breaks every N profiles
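
The ~200 requests/hour budget can also be enforced client-side with a simple token bucket. A sketch (`makeRateLimiter` is a hypothetical helper, not part of the project's API):

```javascript
// Sketch of a token bucket enforcing roughly maxPerHour requests;
// a hypothetical helper, not part of the project's API.
function makeRateLimiter(maxPerHour, now = Date.now) {
  let tokens = maxPerHour;
  let last = now();
  return function tryAcquire() {
    const t = now();
    // Refill proportionally to elapsed time, capped at the hourly budget
    tokens = Math.min(maxPerHour, tokens + ((t - last) / 3600000) * maxPerHour);
    last = t;
    if (tokens >= 1) {
      tokens -= 1;
      return true;
    }
    return false;
  };
}

const tryAcquire = makeRateLimiter(200);
console.log(tryAcquire()); // true while budget remains
```

Calling `tryAcquire()` before each request and sleeping when it returns `false` keeps the scraper under the budget even across bursts.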

## 🎓 Learning Path

1. **Start**: Run `MODE=simple` with small numbers
2. **Understand**: Read the logs and output files
3. **Customize**: Modify `MAX_FOLLOWING` and `MAX_PROFILES`
4. **Advanced**: Use `MODE=full` for complete control
5. **Production**: Add proxies and session management

---

**Need help?** Check:

- [ANTI-BOT-RECOMMENDATIONS.md](./ANTI-BOT-RECOMMENDATIONS.md)
- [EXPONENTIAL-BACKOFF.md](./EXPONENTIAL-BACKOFF.md)
- Test script: `node test-retry.js`