# Instagram Scraper - Usage Guide

Complete guide to using the Instagram scraper with all available workflows.

## 🚀 Quick Start

### 1. Full Workflow (Recommended)

The most comprehensive workflow that uses all scraper functions:

```bash
# Windows PowerShell
$env:INSTAGRAM_USERNAME="your_username"
$env:INSTAGRAM_PASSWORD="your_password"
$env:TARGET_USERNAME="instagram"
$env:MAX_FOLLOWING="20"
$env:MAX_PROFILES="5"
$env:MODE="full"
node server.js
```

**What happens:**

1. 🔐 **Login** - Logs into Instagram with human-like behavior
2. 💾 **Save Session** - Extracts and saves cookies to `session_cookies.json`
3. 🌐 **Browse** - Simulates random mouse movements and scrolling
4. 👥 **Fetch Followings** - Gets the following list using API interception
5. 👤 **Scrape Profiles** - Scrapes detailed data for each profile
6. 📁 **Save Data** - Creates JSON files with all collected data

**Output files:**

- `followings_[username]_[timestamp].json` - Full following list
- `profiles_[username]_[timestamp].json` - Detailed profile data
- `session_cookies.json` - Reusable session cookies

### 2. Simple Workflow

Uses the built-in `scrapeWorkflow()` function:

```bash
$env:MODE="simple"
node server.js
```

**What it does:**

- Combines login + following fetch + profile scraping
- Single output file with all data
- Less granular control but simpler

### 3. Scheduled Workflow

Runs scraping on a schedule using `cronJobs()`:

```bash
$env:MODE="scheduled"
$env:SCRAPE_INTERVAL="60"  # Minutes between runs
$env:MAX_RUNS="5"          # Stop after 5 runs
node server.js
```

**Use case:** Monitor a profile's followings over time

## 📋 Environment Variables

| Variable             | Description                           | Default         | Example               |
| -------------------- | ------------------------------------- | --------------- | --------------------- |
| `INSTAGRAM_USERNAME` | Your Instagram username               | `your_username` | `john_doe`            |
| `INSTAGRAM_PASSWORD` | Your Instagram password               | `your_password` | `MySecureP@ss`        |
| `TARGET_USERNAME`    | Profile to scrape                     | `instagram`     | `cristiano`           |
| `MAX_FOLLOWING`      | Max followings to fetch               | `20`            | `100`                 |
| `MAX_PROFILES`       | Max profiles to scrape                | `5`             | `50`                  |
| `PROXY`              | Proxy server                          | `None`          | `proxy.com:8080`      |
| `MODE`               | Workflow type                         | `full`          | `simple`, `scheduled` |
| `SCRAPE_INTERVAL`    | Minutes between runs (scheduled mode) | `60`            | `30`                  |
| `MAX_RUNS`           | Max runs (scheduled mode)             | `5`             | `10`                  |

## 🎯 Workflow Details

### Full Workflow Step-by-Step

```javascript
async function fullScrapingWorkflow() {
  // Step 1: Login
  const { browser, page } = await login(credentials, proxy);

  // Step 2: Extract session
  const session = await extractSession(page);

  // Step 3: Simulate browsing
  await simulateHumanBehavior(page, { mouseMovements: 5, scrolls: 3 });

  // Step 4: Get followings list
  const followingsData = await getFollowingsList(
    page,
    targetUsername,
    maxFollowing
  );

  // Step 5: Scrape individual profiles
  for (const username of followingsData.usernames) {
    const profileData = await scrapeProfile(page, username);
    // ... takes breaks every 3 profiles
  }

  // Step 6: Save all data
  // ... creates JSON files
}
```

### What Each Function Does

#### `login(credentials, proxy)`

- Launches browser with stealth mode
- Sets anti-detection headers
- Simulates human login behavior
- Returns `{ browser, page }`

#### `extractSession(page)`

- Gets all cookies from the current session
- Returns `{ cookies: [...] }`
- Save for session reuse (see the sketch below)
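For example, a minimal sketch of saving and restoring a session with `extractSession`. The helper names here are illustrative, and it assumes `extractSession` is exported from `scraper.js` like the other helpers; `page.setCookie` is the same Puppeteer call used in the session-reuse tip further down:

```javascript
const fs = require("fs");
// Assumption: extractSession is exported from scraper.js alongside login, etc.
const { extractSession } = require("./scraper.js");

// Persist the live session's cookies for later runs (illustrative helper)
async function saveSession(page) {
  const session = await extractSession(page);
  fs.writeFileSync("session_cookies.json", JSON.stringify(session, null, 2));
}

// Replay saved cookies so a later run can skip the login step
async function restoreSession(page) {
  const saved = JSON.parse(fs.readFileSync("session_cookies.json", "utf8"));
  await page.setCookie(...saved.cookies);
}
```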
#### `simulateHumanBehavior(page, options)`

- Random mouse movements
- Random scrolling
- Mimics real user behavior
- Options: `{ mouseMovements, scrolls, randomClicks }`

#### `getFollowingsList(page, username, maxUsers)`

- Navigates to the profile
- Clicks the "following" button
- Intercepts Instagram API responses
- Returns `{ usernames: [...], fullData: [...] }`

**Full data includes:**

```json
{
  "pk": "310285748",
  "username": "example_user",
  "full_name": "Example User",
  "profile_pic_url": "https://...",
  "is_verified": true,
  "is_private": false,
  "fbid_v2": "...",
  "latest_reel_media": 1761853039
}
```

#### `scrapeProfile(page, username)`

- Navigates to the profile
- Intercepts the API endpoint
- Falls back to DOM scraping if needed
- Returns detailed profile data

**Profile data includes:**

```json
{
  "username": "example_user",
  "full_name": "Example User",
  "bio": "Biography text...",
  "followerCount": 15000,
  "followingCount": 500,
  "postsCount": 100,
  "is_verified": true,
  "is_private": false,
  "is_business_account": true,
  "email": "contact@example.com",
  "phone": "+1234567890"
}
```

#### `scrapeWorkflow(creds, targetUsername, proxy, maxFollowing)`

- Complete workflow in one function
- Combines all the steps above
- Returns aggregated results

#### `cronJobs(fn, intervalSec, stopAfter)`

- Runs a function on an interval
- Returns a stop function
- Used for scheduled scraping

## 💡 Usage Examples

### Example 1: Scrape a Top Influencer's Followings

```bash
$env:INSTAGRAM_USERNAME="your_account"
$env:INSTAGRAM_PASSWORD="your_password"
$env:TARGET_USERNAME="cristiano"
$env:MAX_FOLLOWING="100"
$env:MAX_PROFILES="20"
node server.js
```

### Example 2: Monitor a Competitor Every Hour

```bash
$env:TARGET_USERNAME="competitor_account"
$env:MODE="scheduled"
$env:SCRAPE_INTERVAL="60"
$env:MAX_RUNS="24"  # Run for 24 hours
node server.js
```

### Example 3: Scrape Multiple Accounts

Create `scrape-multiple.js`:

```javascript
const { fullScrapingWorkflow } = require("./server.js");

const targets = ["account1", "account2", "account3"];

async function scrapeAll() {
  for (const target of targets) {
    process.env.TARGET_USERNAME = target;
    await fullScrapingWorkflow();

    // Wait between accounts
    await new Promise((r) => setTimeout(r, 300000)); // 5 minutes
  }
}

scrapeAll();
```

### Example 4: Custom Workflow with Your Own Logic

```javascript
const { login, getFollowingsList, scrapeProfile } = require("./scraper.js");

async function myCustomWorkflow() {
  // Login once
  const { browser, page } = await login({
    username: "your_username",
    password: "your_password",
  });

  try {
    // Get followings from multiple accounts
    const accounts = ["account1", "account2"];

    for (const account of accounts) {
      const followings = await getFollowingsList(page, account, 50);

      // Filter verified users only
      const verified = followings.fullData.filter((u) => u.is_verified);

      // Scrape verified profiles
      for (const user of verified) {
        const profile = await scrapeProfile(page, user.username);

        // Custom logic: save only if it's a business account
        if (profile.is_business_account) {
          console.log(`Business: ${profile.username} - ${profile.email}`);
        }
      }
    }
  } finally {
    await browser.close();
  }
}

myCustomWorkflow();
```

## 🔍 Output Format

### Followings Data

```json
{
  "targetUsername": "instagram",
  "scrapedAt": "2025-10-31T12:00:00.000Z",
  "totalFollowings": 20,
  "followings": [
    {
      "pk": "123456",
      "username": "user1",
      "full_name": "User One",
      "is_verified": true,
      ...
    }
  ]
}
```
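A quick way to inspect one of these files, using the field names from the sample above (the filename here is hypothetical; real files carry the username and timestamp shown under **Output files**):

```javascript
const fs = require("fs");

// Hypothetical filename; actual files are named followings_[username]_[timestamp].json
const data = JSON.parse(
  fs.readFileSync("followings_instagram_1730376000000.json", "utf8")
);

// Example: list the verified accounts this profile follows
const verified = data.followings.filter((u) => u.is_verified);
console.log(`${verified.length} of ${data.totalFollowings} followings are verified`);
verified.forEach((u) => console.log(`@${u.username} (${u.full_name})`));
```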
### Profiles Data

```json
{
  "targetUsername": "instagram",
  "scrapedAt": "2025-10-31T12:00:00.000Z",
  "totalProfiles": 5,
  "profiles": [
    {
      "username": "user1",
      "followerCount": 50000,
      "email": "contact@user1.com",
      ...
    }
  ]
}
```

## ⚡ Performance Tips

### 1. Optimize Delays

```javascript
// Faster (more aggressive, higher block risk)
await randomSleep(1000, 2000);

// Balanced (recommended)
await randomSleep(2500, 6000);

// Safer (slower but less likely to be blocked)
await randomSleep(5000, 10000);
```

### 2. Batch Processing

Scrape in batches to avoid overwhelming Instagram:

```javascript
const batchSize = 10;

for (let i = 0; i < usernames.length; i += batchSize) {
  const batch = usernames.slice(i, i + batchSize);

  // Scrape the batch
  for (const username of batch) {
    await scrapeProfile(page, username);
  }

  // Long break between batches
  await randomSleep(60000, 120000); // 1-2 minutes
}
```

### 3. Session Reuse

Reuse cookies to avoid logging in repeatedly:

```javascript
const fs = require("fs");

const savedCookies = JSON.parse(fs.readFileSync("session_cookies.json", "utf8"));
await page.setCookie(...savedCookies.cookies);
```

## 🚨 Common Issues

### "Rate limited (429)"

✅ **Solution**: Exponential backoff is automatic. If the problem persists:

- Reduce `MAX_FOLLOWING` and `MAX_PROFILES`
- Increase delays
- Add residential proxies

### "Login failed"

- Check credentials
- Instagram may require verification
- Try from your home IP first

### "No data captured"

- Instagram may have changed its API structure
- Check whether `doc_id` values need updating
- The DOM fallback should still work

### Blocked on cloud servers

❌ **Problem**: Using datacenter IPs
✅ **Solution**: Get residential proxies (see [ANTI-BOT-RECOMMENDATIONS.md](./ANTI-BOT-RECOMMENDATIONS.md))

## 📊 Best Practices

1. **Start Small**: Test with `MAX_FOLLOWING=5`, `MAX_PROFILES=2`
2. **Use Residential Proxies**: Critical for production use
3. **Respect Rate Limits**: ~200 requests/hour per IP
4. **Save Sessions**: Reuse cookies to avoid repeated logins
5. **Monitor Logs**: Watch for 429 errors
6. **Add Randomness**: Vary delays and patterns
7. **Take Breaks**: Schedule longer breaks every N profiles

## 🎓 Learning Path

1. **Start**: Run `MODE=simple` with small numbers
2. **Understand**: Read the logs and output files
3. **Customize**: Modify `MAX_FOLLOWING` and `MAX_PROFILES`
4. **Advanced**: Use `MODE=full` for complete control
5. **Production**: Add proxies and session management

---

**Need help?** Check:

- [ANTI-BOT-RECOMMENDATIONS.md](./ANTI-BOT-RECOMMENDATIONS.md)
- [EXPONENTIAL-BACKOFF.md](./EXPONENTIAL-BACKOFF.md)
- Test script: `node test-retry.js`
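For orientation, the retry pattern those documents cover looks roughly like the sketch below. This is an illustrative example only, not the project's actual implementation: the helper name `withBackoff` is hypothetical, and it assumes errors carry a `status` field.

```javascript
// Hypothetical helper illustrating exponential backoff with jitter.
// The project's real implementation is documented in EXPONENTIAL-BACKOFF.md.
async function withBackoff(fn, maxRetries = 5, baseMs = 1000) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Assumption: rate-limit errors expose an HTTP status; rethrow anything else
      if (err.status !== 429 || attempt === maxRetries - 1) throw err;
      const delay = baseMs * 2 ** attempt + Math.random() * 1000;
      console.log(`429 received, retrying in ${Math.round(delay)} ms...`);
      await new Promise((r) => setTimeout(r, delay));
    }
  }
}
```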