feat: Instagram scraper with GraphQL API integration

- Automated followings list extraction via API interception
- Profile scraping using GraphQL endpoint interception
- DOM fallback for edge cases
- Performance timing for all operations
- Anti-bot measures and human-like behavior simulation
# Instagram Scraper - Usage Guide

Complete guide to using the Instagram scraper with all available workflows.

## 🚀 Quick Start

### 1. Full Workflow (Recommended)

The most comprehensive workflow, using all scraper functions:

```bash
# Windows PowerShell
$env:INSTAGRAM_USERNAME="your_username"
$env:INSTAGRAM_PASSWORD="your_password"
$env:TARGET_USERNAME="instagram"
$env:MAX_FOLLOWING="20"
$env:MAX_PROFILES="5"
$env:MODE="full"

node server.js
```

**What happens:**

1. 🔐 **Login** - Logs into Instagram with human-like behavior
2. 💾 **Save Session** - Extracts and saves cookies to `session_cookies.json`
3. 🌐 **Browse** - Simulates random mouse movements and scrolling
4. 👥 **Fetch Followings** - Gets the following list using API interception
5. 👤 **Scrape Profiles** - Scrapes detailed data for each profile
6. 📁 **Save Data** - Creates JSON files with all collected data

**Output files:**

- `followings_[username]_[timestamp].json` - Full following list
- `profiles_[username]_[timestamp].json` - Detailed profile data
- `session_cookies.json` - Reusable session cookies
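
The timestamped file names above can be produced with a small helper. This is a sketch only: `buildOutputName` is a hypothetical name, and the actual naming code in `server.js` may differ.

```javascript
// Hypothetical helper mirroring the followings_[username]_[timestamp].json
// naming pattern above; the real code in server.js may differ.
function buildOutputName(prefix, username, date = new Date()) {
  // ISO timestamp with characters that are unsafe in file names replaced
  const timestamp = date.toISOString().replace(/[:.]/g, "-");
  return `${prefix}_${username}_${timestamp}.json`;
}

console.log(buildOutputName("followings", "instagram"));
```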

### 2. Simple Workflow

Uses the built-in `scrapeWorkflow()` function:

```bash
$env:MODE="simple"
node server.js
```

**What it does:**

- Combines login + following fetch + profile scraping
- Single output file with all data
- Less granular control, but simpler

### 3. Scheduled Workflow

Runs scraping on a schedule using `cronJobs()`:

```bash
$env:MODE="scheduled"
$env:SCRAPE_INTERVAL="60"  # Minutes between runs
$env:MAX_RUNS="5"          # Stop after 5 runs
node server.js
```

**Use case:** Monitor a profile's followings over time

## 📋 Environment Variables

| Variable             | Description                           | Default         | Example               |
| -------------------- | ------------------------------------- | --------------- | --------------------- |
| `INSTAGRAM_USERNAME` | Your Instagram username               | `your_username` | `john_doe`            |
| `INSTAGRAM_PASSWORD` | Your Instagram password               | `your_password` | `MySecureP@ss`        |
| `TARGET_USERNAME`    | Profile to scrape                     | `instagram`     | `cristiano`           |
| `MAX_FOLLOWING`      | Max followings to fetch               | `20`            | `100`                 |
| `MAX_PROFILES`       | Max profiles to scrape                | `5`             | `50`                  |
| `PROXY`              | Proxy server                          | `None`          | `proxy.com:8080`      |
| `MODE`               | Workflow type                         | `full`          | `simple`, `scheduled` |
| `SCRAPE_INTERVAL`    | Minutes between runs (scheduled mode) | `60`            | `30`                  |
| `MAX_RUNS`           | Max runs (scheduled mode)             | `5`             | `10`                  |
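
These are plain environment variables, so the defaults above can be applied at startup. A sketch of how such a config loader might look (`loadConfig` is a hypothetical name; the actual parsing in `server.js` may differ):

```javascript
// Sketch of reading the environment variables above with their documented
// defaults; the actual code in server.js may differ.
function loadConfig(env = process.env) {
  return {
    username: env.INSTAGRAM_USERNAME || "your_username",
    password: env.INSTAGRAM_PASSWORD || "your_password",
    target: env.TARGET_USERNAME || "instagram",
    maxFollowing: parseInt(env.MAX_FOLLOWING || "20", 10),
    maxProfiles: parseInt(env.MAX_PROFILES || "5", 10),
    proxy: env.PROXY || null,
    mode: env.MODE || "full",
    scrapeIntervalMin: parseInt(env.SCRAPE_INTERVAL || "60", 10),
    maxRuns: parseInt(env.MAX_RUNS || "5", 10),
  };
}

console.log(loadConfig({ TARGET_USERNAME: "cristiano", MODE: "scheduled" }));
```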

## 🎯 Workflow Details

### Full Workflow Step-by-Step

```javascript
async function fullScrapingWorkflow() {
  // Step 1: Login
  const { browser, page } = await login(credentials, proxy);

  // Step 2: Extract session
  const session = await extractSession(page);

  // Step 3: Simulate browsing
  await simulateHumanBehavior(page, { mouseMovements: 5, scrolls: 3 });

  // Step 4: Get followings list
  const followingsData = await getFollowingsList(
    page,
    targetUsername,
    maxFollowing
  );

  // Step 5: Scrape individual profiles
  for (const username of followingsData.usernames) {
    const profileData = await scrapeProfile(page, username);
    // ... takes breaks every 3 profiles
  }

  // Step 6: Save all data
  // ... creates JSON files
}
```

### What Each Function Does

#### `login(credentials, proxy)`

- Launches browser with stealth mode
- Sets anti-detection headers
- Simulates human login behavior
- Returns `{ browser, page }`

#### `extractSession(page)`

- Gets all cookies from the current session
- Returns `{ cookies: [...] }`
- Save the result for session reuse

#### `simulateHumanBehavior(page, options)`

- Random mouse movements
- Random scrolling
- Mimics real user behavior
- Options: `{ mouseMovements, scrolls, randomClicks }`
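
The mouse-movement part can be pictured as generating random waypoints and visiting them one by one with short pauses. A hypothetical sketch of the waypoint step (the real `simulateHumanBehavior` may differ):

```javascript
// Hypothetical sketch of the waypoint generation behind random mouse
// movements; the real simulateHumanBehavior may differ.
function randomWaypoints(count, width, height, rng = Math.random) {
  const points = [];
  for (let i = 0; i < count; i++) {
    points.push({
      x: Math.floor(rng() * width),
      y: Math.floor(rng() * height),
    });
  }
  return points;
}

// Each waypoint would then be passed to page.mouse.move(x, y) with a
// random pause in between.
console.log(randomWaypoints(5, 1280, 720));
```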

#### `getFollowingsList(page, username, maxUsers)`

- Navigates to the profile
- Clicks the "following" button
- Intercepts Instagram API responses
- Returns `{ usernames: [...], fullData: [...] }`

**Full data includes:**

```json
{
  "pk": "310285748",
  "username": "example_user",
  "full_name": "Example User",
  "profile_pic_url": "https://...",
  "is_verified": true,
  "is_private": false,
  "fbid_v2": "...",
  "latest_reel_media": 1761853039
}
```
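
Once a following-list response is intercepted, turning it into the `{ usernames, fullData }` shape is a matter of walking the JSON body. A sketch, assuming the payload carries a top-level `users` array (Instagram can change the response shape without notice):

```javascript
// Sketch of pulling usernames and full records out of an intercepted
// following-list payload; assumes a top-level "users" array, which
// Instagram can change without notice.
function parseFollowingsPayload(payload) {
  const users = Array.isArray(payload.users) ? payload.users : [];
  return {
    usernames: users.map((u) => u.username),
    fullData: users,
  };
}

const sample = {
  users: [{ pk: "310285748", username: "example_user", is_verified: true }],
};
console.log(parseFollowingsPayload(sample).usernames); // logs [ 'example_user' ]
```

Guarding on `Array.isArray` keeps the scraper from crashing when a response arrives in an unexpected shape, which is exactly when the DOM fallback should take over.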

#### `scrapeProfile(page, username)`

- Navigates to the profile
- Intercepts the API endpoint
- Falls back to DOM scraping if needed
- Returns detailed profile data

**Profile data includes:**

```json
{
  "username": "example_user",
  "full_name": "Example User",
  "bio": "Biography text...",
  "followerCount": 15000,
  "followingCount": 500,
  "postsCount": 100,
  "is_verified": true,
  "is_private": false,
  "is_business_account": true,
  "email": "contact@example.com",
  "phone": "+1234567890"
}
```

#### `scrapeWorkflow(creds, targetUsername, proxy, maxFollowing)`

- Complete workflow in one function
- Combines all steps above
- Returns aggregated results

#### `cronJobs(fn, intervalSec, stopAfter)`

- Runs a function on an interval
- Returns a stop function
- Used for scheduled scraping

## 💡 Usage Examples

### Example 1: Scrape a Top Influencer's Followings

```bash
$env:INSTAGRAM_USERNAME="your_account"
$env:INSTAGRAM_PASSWORD="your_password"
$env:TARGET_USERNAME="cristiano"
$env:MAX_FOLLOWING="100"
$env:MAX_PROFILES="20"
node server.js
```

### Example 2: Monitor Competitor Every Hour

```bash
$env:TARGET_USERNAME="competitor_account"
$env:MODE="scheduled"
$env:SCRAPE_INTERVAL="60"
$env:MAX_RUNS="24"  # Run for 24 hours
node server.js
```

### Example 3: Scrape Multiple Accounts

Create `scrape-multiple.js`:

```javascript
const { fullScrapingWorkflow } = require("./server.js");

const targets = ["account1", "account2", "account3"];

async function scrapeAll() {
  for (const target of targets) {
    process.env.TARGET_USERNAME = target;
    await fullScrapingWorkflow();

    // Wait between accounts
    await new Promise((r) => setTimeout(r, 300000)); // 5 minutes
  }
}

scrapeAll();
```

### Example 4: Custom Workflow with Your Logic

```javascript
const { login, getFollowingsList, scrapeProfile } = require("./scraper.js");

async function myCustomWorkflow() {
  // Login once
  const { browser, page } = await login({
    username: "your_username",
    password: "your_password",
  });

  try {
    // Get followings from multiple accounts
    const accounts = ["account1", "account2"];

    for (const account of accounts) {
      const followings = await getFollowingsList(page, account, 50);

      // Filter verified users only
      const verified = followings.fullData.filter((u) => u.is_verified);

      // Scrape verified profiles
      for (const user of verified) {
        const profile = await scrapeProfile(page, user.username);

        // Custom logic: save only if business account
        if (profile.is_business_account) {
          console.log(`Business: ${profile.username} - ${profile.email}`);
        }
      }
    }
  } finally {
    await browser.close();
  }
}

myCustomWorkflow();
```

## 🔍 Output Format

### Followings Data

```json
{
  "targetUsername": "instagram",
  "scrapedAt": "2025-10-31T12:00:00.000Z",
  "totalFollowings": 20,
  "followings": [
    {
      "pk": "123456",
      "username": "user1",
      "full_name": "User One",
      "is_verified": true,
      ...
    }
  ]
}
```

### Profiles Data

```json
{
  "targetUsername": "instagram",
  "scrapedAt": "2025-10-31T12:00:00.000Z",
  "totalProfiles": 5,
  "profiles": [
    {
      "username": "user1",
      "followerCount": 50000,
      "email": "contact@user1.com",
      ...
    }
  ]
}
```

## ⚡ Performance Tips

### 1. Optimize Delays

```javascript
// Faster (more aggressive, higher block risk)
await randomSleep(1000, 2000);

// Balanced (recommended)
await randomSleep(2500, 6000);

// Safer (slower but less likely to be blocked)
await randomSleep(5000, 10000);
```
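
For reference, a `randomSleep(minMs, maxMs)` helper can be as small as a uniform pick plus a timer. A sketch (the project's own implementation may differ):

```javascript
// Sketch of a randomSleep(minMs, maxMs) helper; the project's own
// implementation may differ.
function randomDelay(minMs, maxMs, rng = Math.random) {
  // Uniformly pick an integer delay in [minMs, maxMs]
  return minMs + Math.floor(rng() * (maxMs - minMs + 1));
}

function randomSleep(minMs, maxMs) {
  return new Promise((resolve) => setTimeout(resolve, randomDelay(minMs, maxMs)));
}

randomSleep(10, 20).then(() => console.log("slept"));
```

Keeping the delay pick in its own function makes the range easy to test without actually waiting.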

### 2. Batch Processing

Scrape in batches to avoid overwhelming Instagram:

```javascript
const batchSize = 10;
for (let i = 0; i < usernames.length; i += batchSize) {
  const batch = usernames.slice(i, i + batchSize);
  // Scrape batch
  // Long break between batches
  await randomSleep(60000, 120000); // 1-2 minutes
}
```
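
The slicing above generalizes to a small chunking helper (`chunk` is a hypothetical name, not part of the project's API):

```javascript
// Hypothetical helper generalizing the batch-slicing loop above.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

console.log(chunk(["a", "b", "c", "d", "e"], 2)); // 3 batches: 2 + 2 + 1
```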

### 3. Session Reuse

Reuse cookies to avoid logging in repeatedly:

```javascript
const fs = require("fs");

const savedCookies = JSON.parse(fs.readFileSync("session_cookies.json", "utf8"));
await page.setCookie(...savedCookies.cookies);
```
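
Saved cookies expire, so it is worth dropping stale entries before restoring a session. A sketch, assuming Puppeteer-style cookie objects where `expires` is a Unix timestamp in seconds and `-1` marks a session cookie:

```javascript
// Drop cookies that have already expired before restoring a session.
// Assumes Puppeteer-style cookie objects: `expires` is a Unix timestamp
// in seconds, and -1 marks a session cookie.
function freshCookies(cookies, nowSec = Date.now() / 1000) {
  return cookies.filter((c) => c.expires === -1 || c.expires > nowSec);
}

const sample = [
  { name: "sessionid", expires: -1 },
  { name: "old", expires: 1000 },
];
console.log(freshCookies(sample, 2000)); // keeps only "sessionid"
```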

## 🚨 Common Issues

### "Rate limited (429)"

✅ **Solution**: Exponential backoff is automatic. If the errors persist:

- Reduce `MAX_FOLLOWING` and `MAX_PROFILES`
- Increase delays
- Add residential proxies
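
The automatic backoff mentioned above typically doubles the wait on each retry, caps it, and adds jitter so parallel clients do not retry in lockstep. A sketch of the delay schedule (see EXPONENTIAL-BACKOFF.md for the project's actual policy):

```javascript
// Sketch of an exponential backoff delay with full jitter; see
// EXPONENTIAL-BACKOFF.md for the project's actual policy.
function backoffDelay(attempt, baseMs = 1000, maxMs = 60000, rng = Math.random) {
  // Double the window each attempt, capped at maxMs
  const windowMs = Math.min(baseMs * 2 ** attempt, maxMs);
  // Full jitter: pick uniformly in [0, windowMs) to desynchronize retries
  return Math.floor(rng() * windowMs);
}

for (let attempt = 0; attempt < 5; attempt++) {
  console.log(`attempt ${attempt}: wait up to ${Math.min(1000 * 2 ** attempt, 60000)} ms`);
}
```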

### "Login failed"

- Check credentials
- Instagram may require verification
- Try from your home IP first

### "No data captured"

- Instagram may have changed its API structure
- Check if `doc_id` values need updating
- DOM fallback should still work

### Blocked on cloud servers

❌ **Problem**: Using datacenter IPs

✅ **Solution**: Get residential proxies (see ANTI-BOT-RECOMMENDATIONS.md)

## 📊 Best Practices

1. **Start Small**: Test with `MAX_FOLLOWING=5`, `MAX_PROFILES=2`
2. **Use Residential Proxies**: Critical for production use
3. **Respect Rate Limits**: ~200 requests/hour per IP
4. **Save Sessions**: Reuse cookies to avoid repeated logins
5. **Monitor Logs**: Watch for 429 errors
6. **Add Randomness**: Vary delays and patterns
7. **Take Breaks**: Schedule longer breaks every N profiles
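
The ~200 requests/hour budget can also be enforced client-side with a simple token bucket. A sketch (`makeRateLimiter` is a hypothetical helper, not part of the project's API):

```javascript
// Sketch of a token bucket enforcing roughly maxPerHour requests;
// a hypothetical helper, not part of the project's API.
function makeRateLimiter(maxPerHour, now = Date.now) {
  let tokens = maxPerHour;
  let last = now();
  return function tryAcquire() {
    const t = now();
    // Refill proportionally to elapsed time, capped at the hourly budget
    tokens = Math.min(maxPerHour, tokens + ((t - last) / 3600000) * maxPerHour);
    last = t;
    if (tokens >= 1) {
      tokens -= 1;
      return true;
    }
    return false;
  };
}

const tryAcquire = makeRateLimiter(200);
console.log(tryAcquire()); // true while budget remains
```

Calling `tryAcquire()` before each request and sleeping when it returns `false` keeps the scraper under the budget even across bursts.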

## 🎓 Learning Path

1. **Start**: Run `MODE=simple` with small numbers
2. **Understand**: Read the logs and output files
3. **Customize**: Modify `MAX_FOLLOWING` and `MAX_PROFILES`
4. **Advanced**: Use `MODE=full` for complete control
5. **Production**: Add proxies and session management

---

**Need help?** Check:

- [ANTI-BOT-RECOMMENDATIONS.md](./ANTI-BOT-RECOMMENDATIONS.md)
- [EXPONENTIAL-BACKOFF.md](./EXPONENTIAL-BACKOFF.md)
- Test script: `node test-retry.js`