# Instagram Scraper - Usage Guide
Complete guide to using the Instagram scraper with all available workflows.
## 🚀 Quick Start
### 1. Full Workflow (Recommended)
The most comprehensive workflow that uses all scraper functions:
```powershell
# Windows PowerShell
$env:INSTAGRAM_USERNAME="your_username"
$env:INSTAGRAM_PASSWORD="your_password"
$env:TARGET_USERNAME="instagram"
$env:MAX_FOLLOWING="20"
$env:MAX_PROFILES="5"
$env:MODE="full"
node server.js
```
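On macOS/Linux, set the same variables with `export` instead:
```bash
export INSTAGRAM_USERNAME="your_username"
export INSTAGRAM_PASSWORD="your_password"
export TARGET_USERNAME="instagram"
export MAX_FOLLOWING="20"
export MAX_PROFILES="5"
export MODE="full"
node server.js
```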
**What happens:**
1. 🔐 **Login** - Logs into Instagram with human-like behavior
2. 💾 **Save Session** - Extracts and saves cookies to `session_cookies.json`
3. 🌐 **Browse** - Simulates random mouse movements and scrolling
4. 👥 **Fetch Followings** - Gets following list using API interception
5. 👤 **Scrape Profiles** - Scrapes detailed data for each profile
6. 📁 **Save Data** - Creates JSON files with all collected data
**Output files:**
- `followings_[username]_[timestamp].json` - Full following list
- `profiles_[username]_[timestamp].json` - Detailed profile data
- `session_cookies.json` - Reusable session cookies
### 2. Simple Workflow
Uses the built-in `scrapeWorkflow()` function:
```powershell
$env:MODE="simple"
node server.js
```
**What it does:**
- Combines login + following fetch + profile scraping
- Single output file with all data
- Less granular control but simpler
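You can also call it from code instead of via environment variables. A minimal sketch, assuming `scrapeWorkflow` is exported from `scraper.js` like the functions used in Example 4 below (argument order per the signature documented under "What Each Function Does"):
```javascript
// Sketch: invoking scrapeWorkflow() directly; pass null to skip the proxy.
const { scrapeWorkflow } = require("./scraper.js");

(async () => {
  const creds = { username: "your_username", password: "your_password" };
  const results = await scrapeWorkflow(creds, "instagram", null, 20);
  console.log(results);
})();
```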
### 3. Scheduled Workflow
Runs scraping on a schedule using `cronJobs()`:
```powershell
$env:MODE="scheduled"
$env:SCRAPE_INTERVAL="60" # Minutes between runs
$env:MAX_RUNS="5" # Stop after 5 runs
node server.js
```
**Use case:** Monitor a profile's followings over time
## 📋 Environment Variables
| Variable | Description | Default | Example |
| -------------------- | ------------------------------------- | --------------- | --------------------- |
| `INSTAGRAM_USERNAME` | Your Instagram username | `your_username` | `john_doe` |
| `INSTAGRAM_PASSWORD` | Your Instagram password | `your_password` | `MySecureP@ss` |
| `TARGET_USERNAME` | Profile to scrape | `instagram` | `cristiano` |
| `MAX_FOLLOWING` | Max followings to fetch | `20` | `100` |
| `MAX_PROFILES` | Max profiles to scrape | `5` | `50` |
| `PROXY` | Proxy server | `None` | `proxy.com:8080` |
| `MODE` | Workflow type | `full` | `simple`, `scheduled` |
| `SCRAPE_INTERVAL` | Minutes between runs (scheduled mode) | `60` | `30` |
| `MAX_RUNS` | Max runs (scheduled mode) | `5` | `10` |
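These all come from `process.env`. A minimal sketch of how the defaults above might be applied (the exact parsing in `server.js` may differ):
```javascript
// Hypothetical sketch: reading the table's variables with their defaults.
const config = {
  username: process.env.INSTAGRAM_USERNAME || "your_username",
  password: process.env.INSTAGRAM_PASSWORD || "your_password",
  target: process.env.TARGET_USERNAME || "instagram",
  maxFollowing: parseInt(process.env.MAX_FOLLOWING || "20", 10),
  maxProfiles: parseInt(process.env.MAX_PROFILES || "5", 10),
  proxy: process.env.PROXY || null, // table default "None"
  mode: process.env.MODE || "full",
};
```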
## 🎯 Workflow Details
### Full Workflow Step-by-Step
```javascript
async function fullScrapingWorkflow() {
  // Step 1: Login
  const { browser, page } = await login(credentials, proxy);

  // Step 2: Extract session
  const session = await extractSession(page);

  // Step 3: Simulate browsing
  await simulateHumanBehavior(page, { mouseMovements: 5, scrolls: 3 });

  // Step 4: Get followings list
  const followingsData = await getFollowingsList(
    page,
    targetUsername,
    maxFollowing
  );

  // Step 5: Scrape individual profiles
  for (const username of followingsData.usernames) {
    const profileData = await scrapeProfile(page, username);
    // ... takes breaks every 3 profiles
  }

  // Step 6: Save all data
  // ... creates JSON files
}
```
### What Each Function Does
#### `login(credentials, proxy)`
- Launches browser with stealth mode
- Sets anti-detection headers
- Simulates human login behavior
- Returns `{ browser, page }`
#### `extractSession(page)`
- Gets all cookies from current session
- Returns `{ cookies: [...] }`
- Save the result to reuse the session later
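A minimal sketch of persisting it (file name taken from the Quick Start output list):
```javascript
// Sketch: save the extracted session for later reuse.
const fs = require("fs");
const session = await extractSession(page);
fs.writeFileSync("session_cookies.json", JSON.stringify(session, null, 2));
```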
#### `simulateHumanBehavior(page, options)`
- Random mouse movements
- Random scrolling
- Mimics real user behavior
- Options: `{ mouseMovements, scrolls, randomClicks }`
#### `getFollowingsList(page, username, maxUsers)`
- Navigates to profile
- Clicks "following" button
- Intercepts Instagram API responses
- Returns `{ usernames: [...], fullData: [...] }`
**Full data includes:**
```json
{
  "pk": "310285748",
  "username": "example_user",
  "full_name": "Example User",
  "profile_pic_url": "https://...",
  "is_verified": true,
  "is_private": false,
  "fbid_v2": "...",
  "latest_reel_media": 1761853039
}
```
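In outline, the interception works by listening for the responses Instagram's own frontend makes after the "following" dialog opens. A minimal Puppeteer-style sketch (the endpoint filter is an assumption; Instagram renames these paths regularly):
```javascript
// Hypothetical sketch of the interception pattern behind getFollowingsList().
const captured = [];
page.on("response", async (response) => {
  if (!response.url().includes("following")) return; // assumed endpoint substring
  try {
    const body = await response.json();
    if (Array.isArray(body.users)) captured.push(...body.users);
  } catch (_) {
    // Ignore non-JSON responses (images, scripts, etc.)
  }
});
```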
#### `scrapeProfile(page, username)`
- Navigates to profile
- Intercepts API endpoint
- Falls back to DOM scraping if needed
- Returns detailed profile data
**Profile data includes:**
```json
{
  "username": "example_user",
  "full_name": "Example User",
  "bio": "Biography text...",
  "followerCount": 15000,
  "followingCount": 500,
  "postsCount": 100,
  "is_verified": true,
  "is_private": false,
  "is_business_account": true,
  "email": "contact@example.com",
  "phone": "+1234567890"
}
```
#### `scrapeWorkflow(creds, targetUsername, proxy, maxFollowing)`
- Complete workflow in one function
- Combines all steps above
- Returns aggregated results
#### `cronJobs(fn, intervalSec, stopAfter)`
- Runs function on interval
- Returns stop function
- Used for scheduled scraping
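For example, a minimal sketch of scheduling the full workflow (note `intervalSec` is in seconds, while `SCRAPE_INTERVAL` is in minutes):
```javascript
// Sketch: run fullScrapingWorkflow every hour, stopping after 5 runs.
const stop = cronJobs(fullScrapingWorkflow, 3600, 5);

// The returned function ends the schedule early if needed:
// stop();
```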
## 💡 Usage Examples
### Example 1: Scrape a Top Influencer's Followings
```powershell
$env:INSTAGRAM_USERNAME="your_account"
$env:INSTAGRAM_PASSWORD="your_password"
$env:TARGET_USERNAME="cristiano"
$env:MAX_FOLLOWING="100"
$env:MAX_PROFILES="20"
node server.js
```
### Example 2: Monitor Competitor Every Hour
```powershell
$env:TARGET_USERNAME="competitor_account"
$env:MODE="scheduled"
$env:SCRAPE_INTERVAL="60"
$env:MAX_RUNS="24" # Run for 24 hours
node server.js
```
### Example 3: Scrape Multiple Accounts
Create `scrape-multiple.js`:
```javascript
const { fullScrapingWorkflow } = require("./server.js");
const targets = ["account1", "account2", "account3"];
async function scrapeAll() {
  for (const target of targets) {
    process.env.TARGET_USERNAME = target;
    await fullScrapingWorkflow();

    // Wait between accounts
    await new Promise((r) => setTimeout(r, 300000)); // 5 minutes
  }
}

scrapeAll();
```
### Example 4: Custom Workflow with Your Logic
```javascript
const { login, getFollowingsList, scrapeProfile } = require("./scraper.js");
async function myCustomWorkflow() {
  // Login once
  const { browser, page } = await login({
    username: "your_username",
    password: "your_password",
  });

  try {
    // Get followings from multiple accounts
    const accounts = ["account1", "account2"];
    for (const account of accounts) {
      const followings = await getFollowingsList(page, account, 50);

      // Filter verified users only
      const verified = followings.fullData.filter((u) => u.is_verified);

      // Scrape verified profiles
      for (const user of verified) {
        const profile = await scrapeProfile(page, user.username);

        // Custom logic: save only if business account
        if (profile.is_business_account) {
          console.log(`Business: ${profile.username} - ${profile.email}`);
        }
      }
    }
  } finally {
    await browser.close();
  }
}

myCustomWorkflow();
```
## 🔍 Output Format
### Followings Data
```json
{
  "targetUsername": "instagram",
  "scrapedAt": "2025-10-31T12:00:00.000Z",
  "totalFollowings": 20,
  "followings": [
    {
      "pk": "123456",
      "username": "user1",
      "full_name": "User One",
      "is_verified": true,
      ...
    }
  ]
}
```
### Profiles Data
```json
{
  "targetUsername": "instagram",
  "scrapedAt": "2025-10-31T12:00:00.000Z",
  "totalProfiles": 5,
  "profiles": [
    {
      "username": "user1",
      "followerCount": 50000,
      "email": "contact@user1.com",
      ...
    }
  ]
}
```
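Both files are plain JSON, so post-processing is straightforward. A minimal sketch of loading a profiles file (the file name below is hypothetical; use your actual timestamped output):
```javascript
// Sketch: load a saved profiles file and print a quick summary.
const fs = require("fs");
const data = JSON.parse(
  fs.readFileSync("profiles_instagram_1761900000000.json", "utf8") // hypothetical name
);
for (const p of data.profiles) {
  console.log(`${p.username}: ${p.followerCount} followers`);
}
```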
## ⚡ Performance Tips
### 1. Optimize Delays
```javascript
// Faster (more aggressive, higher block risk)
await randomSleep(1000, 2000);

// Balanced (recommended)
await randomSleep(2500, 6000);

// Safer (slower but less likely to be blocked)
await randomSleep(5000, 10000);
```
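If you need `randomSleep` in a standalone script, a minimal sketch of an equivalent helper (the scraper's own implementation may differ):
```javascript
// Sketch: resolve after a random delay between min and max milliseconds.
const randomSleep = (min, max) =>
  new Promise((resolve) => setTimeout(resolve, min + Math.random() * (max - min)));
```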
### 2. Batch Processing
Scrape in batches to avoid overwhelming Instagram:
```javascript
const batchSize = 10;
for (let i = 0; i < usernames.length; i += batchSize) {
  const batch = usernames.slice(i, i + batchSize);

  // Scrape every profile in the current batch
  for (const username of batch) {
    await scrapeProfile(page, username);
  }

  // Long break between batches
  await randomSleep(60000, 120000); // 1-2 minutes
}
```
### 3. Session Reuse
Reuse cookies to avoid logging in repeatedly:
```javascript
const fs = require("fs");
const savedCookies = JSON.parse(fs.readFileSync("session_cookies.json", "utf8"));
await page.setCookie(...savedCookies.cookies);
```
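In practice you also want a fallback to a fresh login when no saved session exists. A minimal sketch combining the documented functions; `launchBrowser()` is a hypothetical helper that opens a page without logging in:
```javascript
// Sketch: reuse saved cookies when present, otherwise log in and save them.
const fs = require("fs");

async function restoreOrLogin(credentials, proxy) {
  if (fs.existsSync("session_cookies.json")) {
    const { browser, page } = await launchBrowser(proxy); // hypothetical helper
    const saved = JSON.parse(fs.readFileSync("session_cookies.json", "utf8"));
    await page.setCookie(...saved.cookies);
    return { browser, page };
  }
  const { browser, page } = await login(credentials, proxy);
  const session = await extractSession(page);
  fs.writeFileSync("session_cookies.json", JSON.stringify(session, null, 2));
  return { browser, page };
}
```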
## 🚨 Common Issues
### "Rate limited (429)"
**Solution**: Exponential backoff is applied automatically. If the errors persist:
- Reduce `MAX_FOLLOWING` and `MAX_PROFILES`
- Increase delays
- Add residential proxies
### "Login failed"
- Check credentials
- Instagram may require verification
- Try from your home IP first
### "No data captured"
- Instagram may have changed its API structure
- Check if `doc_id` values need updating
- DOM fallback should still work
### Blocked on cloud servers
**Problem**: Using datacenter IPs
**Solution**: Get residential proxies (see [ANTI-BOT-RECOMMENDATIONS.md](./ANTI-BOT-RECOMMENDATIONS.md))
## 📊 Best Practices
1. **Start Small**: Test with `MAX_FOLLOWING=5`, `MAX_PROFILES=2`
2. **Use Residential Proxies**: Critical for production use
3. **Respect Rate Limits**: ~200 requests/hour per IP
4. **Save Sessions**: Reuse cookies to avoid repeated logins
5. **Monitor Logs**: Watch for 429 errors
6. **Add Randomness**: Vary delays and patterns
7. **Take Breaks**: Schedule longer breaks every N profiles (see the sketch below)
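For item 7, a minimal sketch of the break pattern (the full workflow does something similar, pausing every 3 profiles; the break length here is an assumption):
```javascript
// Sketch: take a longer pause after every 3 profiles scraped.
let count = 0;
for (const username of usernames) {
  await scrapeProfile(page, username);
  count++;
  if (count % 3 === 0) {
    await randomSleep(30000, 60000); // assumed 30-60s break; tune to taste
  }
}
```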
## 🎓 Learning Path
1. **Start**: Run `MODE=simple` with small numbers
2. **Understand**: Read the logs and output files
3. **Customize**: Modify `MAX_FOLLOWING` and `MAX_PROFILES`
4. **Advanced**: Use `MODE=full` for complete control
5. **Production**: Add proxies and session management
---
**Need help?** Check:
- [ANTI-BOT-RECOMMENDATIONS.md](./ANTI-BOT-RECOMMENDATIONS.md)
- [EXPONENTIAL-BACKOFF.md](./EXPONENTIAL-BACKOFF.md)
- Test script: `node test-retry.js`