# Instagram Scraper - Usage Guide
Complete guide to using the Instagram scraper with all available workflows.
## 🚀 Quick Start
### 1. Full Workflow (Recommended)
The most comprehensive workflow that uses all scraper functions:
```powershell
# Windows PowerShell
$env:INSTAGRAM_USERNAME="your_username"
$env:INSTAGRAM_PASSWORD="your_password"
$env:TARGET_USERNAME="instagram"
$env:MAX_FOLLOWING="20"
$env:MAX_PROFILES="5"
$env:MODE="full"
node server.js
```
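On macOS/Linux, set the same variables with `export` instead:
```bash
export INSTAGRAM_USERNAME="your_username"
export INSTAGRAM_PASSWORD="your_password"
export TARGET_USERNAME="instagram"
export MAX_FOLLOWING="20"
export MAX_PROFILES="5"
export MODE="full"
node server.js
```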
**What happens:**
1. 🔐 **Login** - Logs into Instagram with human-like behavior
2. 💾 **Save Session** - Extracts and saves cookies to `session_cookies.json`
3. 🌐 **Browse** - Simulates random mouse movements and scrolling
4. 👥 **Fetch Followings** - Gets following list using API interception
5. 👤 **Scrape Profiles** - Scrapes detailed data for each profile
6. 📁 **Save Data** - Creates JSON files with all collected data
**Output files:**
- `followings_[username]_[timestamp].json` - Full following list
- `profiles_[username]_[timestamp].json` - Detailed profile data
- `session_cookies.json` - Reusable session cookies
### 2. Simple Workflow
Uses the built-in `scrapeWorkflow()` function:
```powershell
$env:MODE="simple"
node server.js
```
**What it does:**
- Combines login + following fetch + profile scraping
- Single output file with all data
- Less granular control but simpler
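You can also call it from code instead of via environment variables. A minimal sketch, assuming `scrapeWorkflow` is exported from `scraper.js` like the functions used in Example 4 below (argument order per the signature documented under "What Each Function Does"):
```javascript
// Sketch: invoking scrapeWorkflow() directly; pass null to skip the proxy.
const { scrapeWorkflow } = require("./scraper.js");

(async () => {
  const creds = { username: "your_username", password: "your_password" };
  const results = await scrapeWorkflow(creds, "instagram", null, 20);
  console.log(results);
})();
```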
### 3. Scheduled Workflow
Runs scraping on a schedule using `cronJobs()`:
```powershell
$env:MODE="scheduled"
$env:SCRAPE_INTERVAL="60" # Minutes between runs
$env:MAX_RUNS="5" # Stop after 5 runs
node server.js
```
**Use case:** Monitor a profile's followings over time
## 📋 Environment Variables
| Variable | Description | Default | Example |
| -------------------- | ------------------------------------- | --------------- | --------------------- |
| `INSTAGRAM_USERNAME` | Your Instagram username | `your_username` | `john_doe` |
| `INSTAGRAM_PASSWORD` | Your Instagram password | `your_password` | `MySecureP@ss` |
| `TARGET_USERNAME` | Profile to scrape | `instagram` | `cristiano` |
| `MAX_FOLLOWING` | Max followings to fetch | `20` | `100` |
| `MAX_PROFILES` | Max profiles to scrape | `5` | `50` |
| `PROXY` | Proxy server | `None` | `proxy.com:8080` |
| `MODE` | Workflow type | `full` | `simple`, `scheduled` |
| `SCRAPE_INTERVAL` | Minutes between runs (scheduled mode) | `60` | `30` |
| `MAX_RUNS` | Max runs (scheduled mode) | `5` | `10` |
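These all come from `process.env`. A minimal sketch of how the defaults above might be applied (the exact parsing in `server.js` may differ):
```javascript
// Hypothetical sketch: reading the table's variables with their defaults.
const config = {
  username: process.env.INSTAGRAM_USERNAME || "your_username",
  password: process.env.INSTAGRAM_PASSWORD || "your_password",
  target: process.env.TARGET_USERNAME || "instagram",
  maxFollowing: parseInt(process.env.MAX_FOLLOWING || "20", 10),
  maxProfiles: parseInt(process.env.MAX_PROFILES || "5", 10),
  proxy: process.env.PROXY || null, // table default "None"
  mode: process.env.MODE || "full",
};
```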
## 🎯 Workflow Details
### Full Workflow Step-by-Step
```javascript
async function fullScrapingWorkflow() {
  // Step 1: Login
  const { browser, page } = await login(credentials, proxy);

  // Step 2: Extract session
  const session = await extractSession(page);

  // Step 3: Simulate browsing
  await simulateHumanBehavior(page, { mouseMovements: 5, scrolls: 3 });

  // Step 4: Get followings list
  const followingsData = await getFollowingsList(
    page,
    targetUsername,
    maxFollowing
  );

  // Step 5: Scrape individual profiles
  for (const username of followingsData.usernames) {
    const profileData = await scrapeProfile(page, username);
    // ... takes breaks every 3 profiles
  }

  // Step 6: Save all data
  // ... creates JSON files
}
```
### What Each Function Does
#### `login(credentials, proxy)`
- Launches browser with stealth mode
- Sets anti-detection headers
- Simulates human login behavior
- Returns `{ browser, page }`
#### `extractSession(page)`
- Gets all cookies from current session
- Returns `{ cookies: [...] }`
- Save the result to reuse the session later
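A minimal sketch of persisting it (file name taken from the Quick Start output list):
```javascript
// Sketch: save the extracted session for later reuse.
const fs = require("fs");
const session = await extractSession(page);
fs.writeFileSync("session_cookies.json", JSON.stringify(session, null, 2));
```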
#### `simulateHumanBehavior(page, options)`
- Random mouse movements
- Random scrolling
- Mimics real user behavior
- Options: `{ mouseMovements, scrolls, randomClicks }`
#### `getFollowingsList(page, username, maxUsers)`
- Navigates to profile
- Clicks "following" button
- Intercepts Instagram API responses
- Returns `{ usernames: [...], fullData: [...] }`
**Full data includes:**
```json
{
  "pk": "310285748",
  "username": "example_user",
  "full_name": "Example User",
  "profile_pic_url": "https://...",
  "is_verified": true,
  "is_private": false,
  "fbid_v2": "...",
  "latest_reel_media": 1761853039
}
```
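In outline, the interception works by listening for the responses Instagram's own frontend makes after the "following" dialog opens. A minimal Puppeteer-style sketch (the endpoint filter is an assumption; Instagram renames these paths regularly):
```javascript
// Hypothetical sketch of the interception pattern behind getFollowingsList().
const captured = [];
page.on("response", async (response) => {
  if (!response.url().includes("following")) return; // assumed endpoint substring
  try {
    const body = await response.json();
    if (Array.isArray(body.users)) captured.push(...body.users);
  } catch (_) {
    // Ignore non-JSON responses (images, scripts, etc.)
  }
});
```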
#### `scrapeProfile(page, username)`
- Navigates to profile
- Intercepts API endpoint
- Falls back to DOM scraping if needed
- Returns detailed profile data
**Profile data includes:**
```json
{
  "username": "example_user",
  "full_name": "Example User",
  "bio": "Biography text...",
  "followerCount": 15000,
  "followingCount": 500,
  "postsCount": 100,
  "is_verified": true,
  "is_private": false,
  "is_business_account": true,
  "email": "contact@example.com",
  "phone": "+1234567890"
}
```
#### `scrapeWorkflow(creds, targetUsername, proxy, maxFollowing)`
- Complete workflow in one function
- Combines all steps above
- Returns aggregated results
#### `cronJobs(fn, intervalSec, stopAfter)`
- Runs function on interval
- Returns stop function
- Used for scheduled scraping
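For example, a minimal sketch of scheduling the full workflow (note `intervalSec` is in seconds, while `SCRAPE_INTERVAL` is in minutes):
```javascript
// Sketch: run fullScrapingWorkflow every hour, stopping after 5 runs.
const stop = cronJobs(fullScrapingWorkflow, 3600, 5);

// The returned function ends the schedule early if needed:
// stop();
```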
## 💡 Usage Examples
### Example 1: Scrape a Top Influencer's Followings
```powershell
$env:INSTAGRAM_USERNAME="your_account"
$env:INSTAGRAM_PASSWORD="your_password"
$env:TARGET_USERNAME="cristiano"
$env:MAX_FOLLOWING="100"
$env:MAX_PROFILES="20"
node server.js
```
### Example 2: Monitor Competitor Every Hour
```powershell
$env:TARGET_USERNAME="competitor_account"
$env:MODE="scheduled"
$env:SCRAPE_INTERVAL="60"
$env:MAX_RUNS="24" # Run for 24 hours
node server.js
```
### Example 3: Scrape Multiple Accounts
Create `scrape-multiple.js`:
```javascript
const { fullScrapingWorkflow } = require("./server.js");
const targets = ["account1", "account2", "account3"];
async function scrapeAll() {
  for (const target of targets) {
    process.env.TARGET_USERNAME = target;
    await fullScrapingWorkflow();

    // Wait between accounts
    await new Promise((r) => setTimeout(r, 300000)); // 5 minutes
  }
}

scrapeAll();
```
### Example 4: Custom Workflow with Your Logic
```javascript
const { login, getFollowingsList, scrapeProfile } = require("./scraper.js");
async function myCustomWorkflow() {
  // Login once
  const { browser, page } = await login({
    username: "your_username",
    password: "your_password",
  });

  try {
    // Get followings from multiple accounts
    const accounts = ["account1", "account2"];
    for (const account of accounts) {
      const followings = await getFollowingsList(page, account, 50);

      // Filter verified users only
      const verified = followings.fullData.filter((u) => u.is_verified);

      // Scrape verified profiles
      for (const user of verified) {
        const profile = await scrapeProfile(page, user.username);

        // Custom logic: save only if business account
        if (profile.is_business_account) {
          console.log(`Business: ${profile.username} - ${profile.email}`);
        }
      }
    }
  } finally {
    await browser.close();
  }
}

myCustomWorkflow();
```
## 🔍 Output Format
### Followings Data
```json
{
  "targetUsername": "instagram",
  "scrapedAt": "2025-10-31T12:00:00.000Z",
  "totalFollowings": 20,
  "followings": [
    {
      "pk": "123456",
      "username": "user1",
      "full_name": "User One",
      "is_verified": true,
      ...
    }
  ]
}
```
### Profiles Data
```json
{
  "targetUsername": "instagram",
  "scrapedAt": "2025-10-31T12:00:00.000Z",
  "totalProfiles": 5,
  "profiles": [
    {
      "username": "user1",
      "followerCount": 50000,
      "email": "contact@user1.com",
      ...
    }
  ]
}
```
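Both files are plain JSON, so post-processing is straightforward. A minimal sketch of loading a profiles file (the file name below is hypothetical; use your actual timestamped output):
```javascript
// Sketch: load a saved profiles file and print a quick summary.
const fs = require("fs");
const data = JSON.parse(
  fs.readFileSync("profiles_instagram_1761900000000.json", "utf8") // hypothetical name
);
for (const p of data.profiles) {
  console.log(`${p.username}: ${p.followerCount} followers`);
}
```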
## ⚡ Performance Tips
### 1. Optimize Delays
```javascript
// Faster (more aggressive, higher block risk)
await randomSleep(1000, 2000);

// Balanced (recommended)
await randomSleep(2500, 6000);

// Safer (slower but less likely to be blocked)
await randomSleep(5000, 10000);
```
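If you need `randomSleep` in a standalone script, a minimal sketch of an equivalent helper (the scraper's own implementation may differ):
```javascript
// Sketch: resolve after a random delay between min and max milliseconds.
const randomSleep = (min, max) =>
  new Promise((resolve) => setTimeout(resolve, min + Math.random() * (max - min)));
```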
### 2. Batch Processing
Scrape in batches to avoid overwhelming Instagram:
```javascript
const batchSize = 10;
for (let i = 0; i < usernames.length; i += batchSize) {
  const batch = usernames.slice(i, i + batchSize);

  // Scrape every profile in the current batch
  for (const username of batch) {
    await scrapeProfile(page, username);
  }

  // Long break between batches
  await randomSleep(60000, 120000); // 1-2 minutes
}
```
### 3. Session Reuse
Reuse cookies to avoid logging in repeatedly:
```javascript
const fs = require("fs");
const savedCookies = JSON.parse(fs.readFileSync("session_cookies.json", "utf8"));
await page.setCookie(...savedCookies.cookies);
```
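In practice you also want a fallback to a fresh login when no saved session exists. A minimal sketch combining the documented functions; `launchBrowser()` is a hypothetical helper that opens a page without logging in:
```javascript
// Sketch: reuse saved cookies when present, otherwise log in and save them.
const fs = require("fs");

async function restoreOrLogin(credentials, proxy) {
  if (fs.existsSync("session_cookies.json")) {
    const { browser, page } = await launchBrowser(proxy); // hypothetical helper
    const saved = JSON.parse(fs.readFileSync("session_cookies.json", "utf8"));
    await page.setCookie(...saved.cookies);
    return { browser, page };
  }
  const { browser, page } = await login(credentials, proxy);
  const session = await extractSession(page);
  fs.writeFileSync("session_cookies.json", JSON.stringify(session, null, 2));
  return { browser, page };
}
```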
## 🚨 Common Issues
### "Rate limited (429)"
**Solution**: Exponential backoff is applied automatically. If the errors persist:
- Reduce `MAX_FOLLOWING` and `MAX_PROFILES`
- Increase delays
- Add residential proxies
### "Login failed"
- Check credentials
- Instagram may require verification
- Try from your home IP first
### "No data captured"
- Instagram may have changed its API structure
- Check if `doc_id` values need updating
- DOM fallback should still work
### Blocked on cloud servers
**Problem**: Using datacenter IPs
**Solution**: Get residential proxies (see [ANTI-BOT-RECOMMENDATIONS.md](./ANTI-BOT-RECOMMENDATIONS.md))
## 📊 Best Practices
1. **Start Small**: Test with `MAX_FOLLOWING=5`, `MAX_PROFILES=2`
2. **Use Residential Proxies**: Critical for production use
3. **Respect Rate Limits**: ~200 requests/hour per IP
4. **Save Sessions**: Reuse cookies to avoid repeated logins
5. **Monitor Logs**: Watch for 429 errors
6. **Add Randomness**: Vary delays and patterns
7. **Take Breaks**: Schedule longer breaks every N profiles (see the sketch below)
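For item 7, a minimal sketch of the break pattern (the full workflow does something similar, pausing every 3 profiles; the break length here is an assumption):
```javascript
// Sketch: take a longer pause after every 3 profiles scraped.
let count = 0;
for (const username of usernames) {
  await scrapeProfile(page, username);
  count++;
  if (count % 3 === 0) {
    await randomSleep(30000, 60000); // assumed 30-60s break; tune to taste
  }
}
```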
## 🎓 Learning Path
1. **Start**: Run `MODE=simple` with small numbers
2. **Understand**: Read the logs and output files
3. **Customize**: Modify `MAX_FOLLOWING` and `MAX_PROFILES`
4. **Advanced**: Use `MODE=full` for complete control
5. **Production**: Add proxies and session management
---
**Need help?** Check:
- [ANTI-BOT-RECOMMENDATIONS.md](./ANTI-BOT-RECOMMENDATIONS.md)
- [EXPONENTIAL-BACKOFF.md](./EXPONENTIAL-BACKOFF.md)
- Test script: `node test-retry.js`