
Instagram Scraper - Usage Guide

Complete guide to using the Instagram scraper with all available workflows.

🚀 Quick Start

1. Full Workflow

The most comprehensive workflow, using all scraper functions:

# Windows PowerShell
$env:INSTAGRAM_USERNAME="your_username"
$env:INSTAGRAM_PASSWORD="your_password"
$env:TARGET_USERNAME="instagram"
$env:MAX_FOLLOWING="20"
$env:MAX_PROFILES="5"
$env:MODE="full"

node server.js

What happens:

  1. 🔐 Login - Logs into Instagram with human-like behavior
  2. 💾 Save Session - Extracts and saves cookies to session_cookies.json
  3. 🌐 Browse - Simulates random mouse movements and scrolling
  4. 👥 Fetch Followings - Gets following list using API interception
  5. 👤 Scrape Profiles - Scrapes detailed data for each profile
  6. 📁 Save Data - Creates JSON files with all collected data

Output files:

  • followings_[username]_[timestamp].json - Full following list
  • profiles_[username]_[timestamp].json - Detailed profile data
  • session_cookies.json - Reusable session cookies

2. Simple Workflow

Uses the built-in scrapeWorkflow() function:

$env:MODE="simple"
node server.js

What it does:

  • Combines login + following fetch + profile scraping
  • Single output file with all data
  • Less granular control but simpler

3. Scheduled Workflow

Runs scraping on a schedule using cronJobs():

$env:MODE="scheduled"
$env:SCRAPE_INTERVAL="60"  # Minutes between runs
$env:MAX_RUNS="5"          # Stop after 5 runs
node server.js

Use case: Monitor a profile's followings over time

📋 Environment Variables

| Variable | Description | Default | Example |
|----------|-------------|---------|---------|
| INSTAGRAM_USERNAME | Your Instagram username | your_username | john_doe |
| INSTAGRAM_PASSWORD | Your Instagram password | your_password | MySecureP@ss |
| TARGET_USERNAME | Profile to scrape | instagram | cristiano |
| MAX_FOLLOWING | Max followings to fetch | 20 | 100 |
| MAX_PROFILES | Max profiles to scrape | 5 | 50 |
| PROXY | Proxy server | None | proxy.com:8080 |
| MODE | Workflow type | full | simple, scheduled |
| SCRAPE_INTERVAL | Minutes between runs (scheduled mode) | 60 | 30 |
| MAX_RUNS | Max runs (scheduled mode) | 5 | 10 |
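
A minimal sketch of how server.js might assemble these into a config object (the variable names and defaults come from the table above; the object shape itself is illustrative, not the actual implementation):

const config = {
  credentials: {
    username: process.env.INSTAGRAM_USERNAME || "your_username",
    password: process.env.INSTAGRAM_PASSWORD || "your_password",
  },
  targetUsername: process.env.TARGET_USERNAME || "instagram",
  maxFollowing: parseInt(process.env.MAX_FOLLOWING || "20", 10),
  maxProfiles: parseInt(process.env.MAX_PROFILES || "5", 10),
  proxy: process.env.PROXY || null, // e.g. "proxy.com:8080"
  mode: process.env.MODE || "full", // "full" | "simple" | "scheduled"
};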

🎯 Workflow Details

Full Workflow Step-by-Step

async function fullScrapingWorkflow() {
  // Step 1: Login
  const { browser, page } = await login(credentials, proxy);

  // Step 2: Extract session
  const session = await extractSession(page);

  // Step 3: Simulate browsing
  await simulateHumanBehavior(page, { mouseMovements: 5, scrolls: 3 });

  // Step 4: Get followings list
  const followingsData = await getFollowingsList(
    page,
    targetUsername,
    maxFollowing
  );

  // Step 5: Scrape individual profiles
  for (const username of followingsData.usernames) {
    const profileData = await scrapeProfile(page, username);
    // ... takes breaks every 3 profiles
  }

  // Step 6: Save all data
  // ... creates JSON files
}

What Each Function Does

login(credentials, proxy)

  • Launches browser with stealth mode
  • Sets anti-detection headers
  • Simulates human login behavior
  • Returns { browser, page }

extractSession(page)

  • Gets all cookies from current session
  • Returns { cookies: [...] }
  • Save for session reuse
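
A minimal sketch of saving the extracted session for later reuse (the file name matches the session_cookies.json output listed above):

const fs = require("fs");

const session = await extractSession(page); // { cookies: [...] }
fs.writeFileSync("session_cookies.json", JSON.stringify(session, null, 2));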

simulateHumanBehavior(page, options)

  • Random mouse movements
  • Random scrolling
  • Mimics real user behavior
  • Options: { mouseMovements, scrolls, randomClicks }

getFollowingsList(page, username, maxUsers)

  • Navigates to profile
  • Clicks "following" button
  • Intercepts Instagram API responses
  • Returns { usernames: [...], fullData: [...] }

Full data includes:

{
  "pk": "310285748",
  "username": "example_user",
  "full_name": "Example User",
  "profile_pic_url": "https://...",
  "is_verified": true,
  "is_private": false,
  "fbid_v2": "...",
  "latest_reel_media": 1761853039
}

scrapeProfile(page, username)

  • Navigates to profile
  • Intercepts API endpoint
  • Falls back to DOM scraping if needed
  • Returns detailed profile data

Profile data includes:

{
  "username": "example_user",
  "full_name": "Example User",
  "bio": "Biography text...",
  "followerCount": 15000,
  "followingCount": 500,
  "postsCount": 100,
  "is_verified": true,
  "is_private": false,
  "is_business_account": true,
  "email": "contact@example.com",
  "phone": "+1234567890"
}

scrapeWorkflow(creds, targetUsername, proxy, maxFollowing)

  • Complete workflow in one function
  • Combines all steps above
  • Returns aggregated results
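
A minimal usage sketch; it assumes scrapeWorkflow is exported from scraper.js, like the functions imported in Example 4 below:

const { scrapeWorkflow } = require("./scraper.js");

(async () => {
  const results = await scrapeWorkflow(
    { username: "your_username", password: "your_password" }, // creds
    "instagram", // targetUsername
    null,        // proxy (none)
    20           // maxFollowing
  );
  console.log(results);
})();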

cronJobs(fn, intervalSec, stopAfter)

  • Runs function on interval
  • Returns stop function
  • Used for scheduled scraping
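
A minimal sketch of scheduling the simple workflow with cronJobs; it assumes both functions are exported from scraper.js and that intervalSec is in seconds (SCRAPE_INTERVAL is documented in minutes, so multiply by 60 here):

const { cronJobs, scrapeWorkflow } = require("./scraper.js");

const creds = { username: "your_username", password: "your_password" };

// Run hourly, stop after 5 runs; cronJobs returns a stop function
const stop = cronJobs(
  () => scrapeWorkflow(creds, "instagram", null, 20),
  60 * 60, // intervalSec: 60 minutes
  5        // stopAfter: 5 runs
);

// Call stop() to cancel the schedule early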

💡 Usage Examples

Example 1: Scrape a Top Influencer's Followings

$env:INSTAGRAM_USERNAME="your_account"
$env:INSTAGRAM_PASSWORD="your_password"
$env:TARGET_USERNAME="cristiano"
$env:MAX_FOLLOWING="100"
$env:MAX_PROFILES="20"
node server.js

Example 2: Monitor Competitor Every Hour

$env:TARGET_USERNAME="competitor_account"
$env:MODE="scheduled"
$env:SCRAPE_INTERVAL="60"
$env:MAX_RUNS="24"  # Run for 24 hours
node server.js

Example 3: Scrape Multiple Accounts

Create scrape-multiple.js:

const { fullScrapingWorkflow } = require("./server.js");

const targets = ["account1", "account2", "account3"];

async function scrapeAll() {
  for (const target of targets) {
    process.env.TARGET_USERNAME = target;
    await fullScrapingWorkflow();

    // Wait between accounts
    await new Promise((r) => setTimeout(r, 300000)); // 5 minutes
  }
}

scrapeAll();

Example 4: Custom Workflow with Your Logic

const { login, getFollowingsList, scrapeProfile } = require("./scraper.js");

async function myCustomWorkflow() {
  // Login once
  const { browser, page } = await login({
    username: "your_username",
    password: "your_password",
  });

  try {
    // Get followings from multiple accounts
    const accounts = ["account1", "account2"];

    for (const account of accounts) {
      const followings = await getFollowingsList(page, account, 50);

      // Filter verified users only
      const verified = followings.fullData.filter((u) => u.is_verified);

      // Scrape verified profiles
      for (const user of verified) {
        const profile = await scrapeProfile(page, user.username);

        // Custom logic: save only if business account
        if (profile.is_business_account) {
          console.log(`Business: ${profile.username} - ${profile.email}`);
        }
      }
    }
  } finally {
    await browser.close();
  }
}

myCustomWorkflow();

🔍 Output Format

Followings Data

{
  "targetUsername": "instagram",
  "scrapedAt": "2025-10-31T12:00:00.000Z",
  "totalFollowings": 20,
  "followings": [
    {
      "pk": "123456",
      "username": "user1",
      "full_name": "User One",
      "is_verified": true,
      ...
    }
  ]
}

Profiles Data

{
  "targetUsername": "instagram",
  "scrapedAt": "2025-10-31T12:00:00.000Z",
  "totalProfiles": 5,
  "profiles": [
    {
      "username": "user1",
      "followerCount": 50000,
      "email": "contact@user1.com",
      ...
    }
  ]
}

⚡ Performance Tips

1. Optimize Delays

// Faster (more aggressive, higher block risk)
await randomSleep(1000, 2000);

// Balanced (recommended)
await randomSleep(2500, 6000);

// Safer (slower but less likely to be blocked)
await randomSleep(5000, 10000);
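
randomSleep is used throughout these examples; if you need it in a standalone script, a minimal implementation (an assumption, in case scraper.js does not export one) looks like:

function randomSleep(minMs, maxMs) {
  // Sleep for a random duration between minMs and maxMs milliseconds
  const delayMs = minMs + Math.random() * (maxMs - minMs);
  return new Promise((resolve) => setTimeout(resolve, delayMs));
}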

2. Batch Processing

Scrape in batches to avoid overwhelming Instagram:

const batchSize = 10;
for (let i = 0; i < usernames.length; i += batchSize) {
  const batch = usernames.slice(i, i + batchSize);

  // Scrape the batch sequentially
  for (const username of batch) {
    await scrapeProfile(page, username);
  }

  // Long break between batches
  await randomSleep(60000, 120000); // 1-2 minutes
}

3. Session Reuse

Reuse cookies to avoid logging in repeatedly:

const fs = require("fs");

const savedCookies = JSON.parse(fs.readFileSync("session_cookies.json", "utf8"));
await page.setCookie(...savedCookies.cookies);

🚨 Common Issues

"Rate limited (429)"

Solution: Exponential backoff is automatic. If persistent:

  • Reduce MAX_FOLLOWING and MAX_PROFILES
  • Increase delays
  • Add residential proxies

"Login failed"

  • Check credentials
  • Instagram may require verification
  • Try from your home IP first

"No data captured"

  • Instagram changed their API structure
  • Check if doc_id values need updating
  • DOM fallback should still work

Blocked on cloud servers

Problem: Using datacenter IPs
Solution: Get residential proxies (see ANTI-BOT-RECOMMENDATIONS.md)
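
A minimal sketch of passing a proxy to login; the host:port string follows the PROXY example in the table above, and the exact format login accepts is an assumption:

const { browser, page } = await login(
  { username: "your_username", password: "your_password" },
  "proxy.com:8080" // residential proxy, host:port
);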

📊 Best Practices

  1. Start Small: Test with MAX_FOLLOWING=5, MAX_PROFILES=2
  2. Use Residential Proxies: Critical for production use
  3. Respect Rate Limits: ~200 requests/hour per IP
  4. Save Sessions: Reuse cookies to avoid repeated logins
  5. Monitor Logs: Watch for 429 errors
  6. Add Randomness: Vary delays and patterns
  7. Take Breaks: Schedule longer breaks every N profiles (see the sketch below)
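
A minimal sketch of the break pattern from item 7, using the every-3-profiles cadence mentioned in the full workflow (the 30-60 second break length is illustrative):

let scraped = 0;
for (const username of usernames) {
  await scrapeProfile(page, username);
  scraped++;

  // Longer pause every 3 profiles
  if (scraped % 3 === 0) {
    await randomSleep(30000, 60000); // illustrative 30-60 s break
  }
}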

🎓 Learning Path

  1. Start: Run MODE=simple with small numbers
  2. Understand: Read the logs and output files
  3. Customize: Modify MAX_FOLLOWING and MAX_PROFILES
  4. Advanced: Use MODE=full for complete control
  5. Production: Add proxies and session management

Need help? Check ANTI-BOT-RECOMMENDATIONS.md for proxy setup and anti-detection guidance.