feat: Instagram scraper with GraphQL API integration

- Automated followings list extraction via API interception
- Profile scraping using GraphQL endpoint interception
- DOM fallback for edge cases
- Performance timing for all operations
- Anti-bot measures and human-like behavior simulation
6
.gitignore
vendored
@@ -136,3 +136,9 @@ dist
.yarn/install-state.gz
.pnp.*

# Instagram scraper sensitive files
session_cookies.json
*.json
!package.json
!package-lock.json
179
ANTI-BOT-RECOMMENDATIONS.md
Normal file
@@ -0,0 +1,179 @@
|
||||
# Instagram Scraper - Anti-Bot Detection Recommendations
|
||||
|
||||
Based on [Scrapfly's Instagram Scraping Guide](https://scrapfly.io/blog/posts/how-to-scrape-instagram)
|
||||
|
||||
## ✅ Already Implemented
|
||||
|
||||
1. **Puppeteer Stealth Plugin** - Bypasses basic browser detection
|
||||
2. **Random User Agents** - Different browser signatures
|
||||
3. **Human-like behaviors**:
|
||||
- Mouse movements
|
||||
- Random scrolling
|
||||
- Variable delays (2.5-6 seconds between profiles)
|
||||
- Typing delays
|
||||
- Breaks every 10 profiles
|
||||
4. **Variable viewport sizes** - Randomized window dimensions
|
||||
5. **Network payload interception** - Capturing API responses instead of DOM scraping
|
||||
6. **Critical headers** - Including `x-ig-app-id: 936619743392459`
|
||||
|
||||
## ⚠️ Critical Improvements Needed
|
||||
|
||||
### 1. **Residential Proxies** (MOST IMPORTANT)
|
||||
|
||||
**Status**: ❌ Not implemented
|
||||
|
||||
**Issue**:
|
||||
|
||||
- Datacenter IPs (AWS, Google Cloud, etc.) are **blocked instantly** by Instagram
|
||||
- Your current setup will be detected as soon as you deploy to any cloud server
|
||||
|
||||
**Solution**:
|
||||
|
||||
```javascript
|
||||
const browser = await puppeteer.launch({
|
||||
headless: true,
|
||||
args: [
|
||||
"--proxy-server=residential-proxy-provider.com:port",
|
||||
// Residential proxies required - NOT datacenter
|
||||
],
|
||||
});
|
||||
```
|
||||
|
||||
**Recommended Proxy Providers**:
|
||||
|
||||
- Bright Data (formerly Luminati)
|
||||
- Oxylabs
|
||||
- Smartproxy
|
||||
- GeoSurf
|
||||
|
||||
**Requirements**:
|
||||
|
||||
- Must be residential IPs (from real ISPs like Comcast, AT&T)
|
||||
- Rotate IPs every 5-10 minutes (sticky sessions)
|
||||
- Each IP allows ~200 requests/hour
|
||||
- Cost: ~$10-15 per GB
|
||||
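For reference, a minimal sketch of how a rotating residential gateway could be wired into the existing `puppeteer-extra` setup. The hostnames, ports, and credentials are placeholders; many providers expose rotation or sticky sessions through the proxy username rather than separate ports, so check your provider's docs.

```javascript
const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");
puppeteer.use(StealthPlugin());

// Placeholder sticky-session endpoints - substitute your provider's gateway and credentials
const PROXIES = [
  { server: "gate.residential-proxy.example:10001", username: "user-session-1", password: "secret" },
  { server: "gate.residential-proxy.example:10002", username: "user-session-2", password: "secret" },
];

async function launchWithResidentialProxy() {
  // Pick a different sticky session per launch; rotate every 5-10 minutes as recommended above
  const proxy = PROXIES[Math.floor(Math.random() * PROXIES.length)];

  const browser = await puppeteer.launch({
    headless: true,
    args: [`--proxy-server=${proxy.server}`],
  });
  const page = await browser.newPage();

  // Residential gateways usually require HTTP auth on the proxy connection
  await page.authenticate({ username: proxy.username, password: proxy.password });

  return { browser, page };
}
```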
|
||||
### 2. **Rate Limit Handling with Exponential Backoff**
|
||||
|
||||
**Status**: ⚠️ Partial - needs improvement
|
||||
|
||||
**Current**: Random delays exist
|
||||
**Needed**: Proper 429 error handling
|
||||
|
||||
```javascript
|
||||
async function makeRequest(fn, retries = 3) {
|
||||
for (let i = 0; i < retries; i++) {
|
||||
try {
|
||||
return await fn();
|
||||
} catch (error) {
|
||||
if (error.status === 429 && i < retries - 1) {
|
||||
const delay = Math.pow(2, i) * 2000; // 2s, 4s, 8s
|
||||
console.log(`Rate limited, waiting ${delay}ms...`);
|
||||
await new Promise((res) => setTimeout(res, delay));
|
||||
continue;
|
||||
}
|
||||
throw error;
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 3. **Session Cookies Management**
|
||||
|
||||
**Status**: ⚠️ Partial - extractSession exists but not reused
|
||||
|
||||
**Issue**: Creating new sessions repeatedly looks suspicious
|
||||
|
||||
**Solution**:
|
||||
|
||||
- Save cookies after login
|
||||
- Reuse cookies across multiple scraping sessions
|
||||
- Rotate sessions periodically
|
||||
|
||||
```javascript
|
||||
// Save cookies after login
|
||||
const cookies = await extractSession(page);
|
||||
fs.writeFileSync("session.json", JSON.stringify(cookies));
|
||||
|
||||
// Reuse cookies in next session
|
||||
const savedCookies = JSON.parse(fs.readFileSync("session.json"));
|
||||
await page.setCookie(...savedCookies.cookies);
|
||||
```
|
||||
|
||||
### 4. **Realistic Browsing Patterns**
|
||||
|
||||
**Status**: ✅ Implemented but can improve
|
||||
|
||||
**Additional improvements**:
|
||||
|
||||
- Visit homepage before going to target profile
|
||||
- Occasionally view posts/stories during following list scraping
|
||||
- Don't always scrape in the same order (randomize)
|
||||
- Add occasional "browsing breaks" of 30-60 seconds
|
||||
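A minimal sketch of the ideas above, reusing the helpers that already exist in `utils.js`; the shuffle and the 20% break probability are illustrative choices, not part of the current code.

```javascript
const { randomSleep, simulateHumanBehavior } = require("./utils.js");

// Fisher-Yates shuffle so profiles are never visited in the same order twice
function shuffle(items) {
  const arr = [...items];
  for (let i = arr.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [arr[i], arr[j]] = [arr[j], arr[i]];
  }
  return arr;
}

async function browseLikeAHuman(page, usernames, scrapeProfile) {
  // Visit the homepage first instead of jumping straight to the target
  await page.goto("https://www.instagram.com/", { waitUntil: "networkidle2" });
  await simulateHumanBehavior(page, { mouseMovements: 4, scrolls: 2 });

  for (const username of shuffle(usernames)) {
    await scrapeProfile(page, username);

    // Occasional 30-60 second "browsing break"
    if (Math.random() < 0.2) {
      await randomSleep(30000, 60000);
    }
  }
}
```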
|
||||
### 5. **Monitor doc_id Changes**
|
||||
|
||||
**Status**: ❌ Not monitoring
|
||||
|
||||
**Issue**: Instagram changes GraphQL `doc_id` values every 2-4 weeks
|
||||
|
||||
**Current doc_ids** (as of article):
|
||||
|
||||
- Profile posts: `9310670392322965`
|
||||
- Post details: `8845758582119845`
|
||||
- Reels: `25981206651899035`
|
||||
|
||||
**Solution**:
|
||||
|
||||
- Monitor Instagram's GraphQL requests in browser DevTools
|
||||
- Update when API calls start failing
|
||||
- Or use a service like Scrapfly that auto-updates
|
||||
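A lightweight way to monitor this from your own runs, sketched below: attach a request listener to an existing Puppeteer page and log every `doc_id` Instagram sends, then compare the log against the values hard-coded in your queries. The log file name is arbitrary.

```javascript
const fs = require("fs");

// Attach to an existing Puppeteer page to record the doc_id of each GraphQL request
function monitorDocIds(page, logFile = "doc_ids.log") {
  page.on("request", (request) => {
    const url = request.url();
    if (!url.includes("/graphql/query")) return;

    // doc_id can travel as a query parameter or in the POST body
    const params = new URLSearchParams(url.split("?")[1] || "");
    let docId = params.get("doc_id");
    if (!docId && request.postData()) {
      const match = request.postData().match(/doc_id=(\d+)/);
      docId = match ? match[1] : null;
    }

    if (docId) {
      fs.appendFileSync(logFile, `${new Date().toISOString()} ${docId}\n`);
    }
  });
}
```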
|
||||
## 📊 Instagram's Blocking Layers
|
||||
|
||||
1. **IP Quality Check** → Blocks datacenter IPs instantly
|
||||
2. **TLS Fingerprinting** → Detects non-browser tools (Puppeteer Stealth helps)
|
||||
3. **Rate Limiting** → ~200 requests/hour per IP
|
||||
4. **Behavioral Detection** → Flags unnatural patterns
|
||||
|
||||
## 🎯 Priority Implementation Order
|
||||
|
||||
1. **HIGH PRIORITY**: Add residential proxy support
|
||||
2. **HIGH PRIORITY**: Implement exponential backoff for 429 errors
|
||||
3. **MEDIUM**: Improve session cookie reuse
|
||||
4. **MEDIUM**: Add doc_id monitoring system
|
||||
5. **LOW**: Additional browsing pattern randomization
|
||||
|
||||
## 💰 Cost Estimates (for 10,000 profiles)
|
||||
|
||||
- **Proxy bandwidth**: ~750 MB (roughly 75 KB per profile)
- **Cost**: $7.50-$11.25 in residential proxy fees (750 MB at $10-15 per GB)
|
||||
- **With Proxy Saver**: $5.25-$7.88 (30-50% savings)
|
||||
|
||||
## 🚨 Legal Considerations
|
||||
|
||||
- Only scrape **publicly available** data
|
||||
- Respect rate limits
|
||||
- Don't store PII of EU citizens without GDPR compliance
|
||||
- Add delays to avoid damaging Instagram's servers
|
||||
- Check Instagram's Terms of Service
|
||||
|
||||
## 📚 Additional Resources
|
||||
|
||||
- [Scrapfly Instagram Scraper](https://github.com/scrapfly/scrapfly-scrapers/tree/main/instagram-scraper) - Open source reference
|
||||
- [Instagram GraphQL Endpoint Documentation](https://scrapfly.io/blog/posts/how-to-scrape-instagram#how-instagrams-scraping-api-works)
|
||||
- [Proxy comparison guide](https://scrapfly.io/blog/best-proxy-providers-for-web-scraping)
|
||||
|
||||
## ⚡ Quick Wins
|
||||
|
||||
Things you can implement immediately:
|
||||
|
||||
1. ✅ Critical headers added (x-ig-app-id)
|
||||
2. ✅ Human simulation functions integrated
|
||||
3. ✅ Exponential backoff added (see EXPONENTIAL-BACKOFF.md)
|
||||
4. Implement cookie persistence (15 min)
|
||||
5. Research residential proxy providers (1 hour)
|
||||
|
||||
---
|
||||
|
||||
**Bottom Line**: Without residential proxies, this scraper will be blocked immediately on any cloud infrastructure. That's the #1 priority to address.
|
||||
407
USAGE-GUIDE.md
Normal file
@@ -0,0 +1,407 @@
|
||||
# Instagram Scraper - Usage Guide
|
||||
|
||||
Complete guide to using the Instagram scraper with all available workflows.
|
||||
|
||||
## 🚀 Quick Start
|
||||
|
||||
### 1. Full Workflow (Recommended)
|
||||
|
||||
The most comprehensive workflow that uses all scraper functions:
|
||||
|
||||
```bash
|
||||
# Windows PowerShell
|
||||
$env:INSTAGRAM_USERNAME="your_username"
|
||||
$env:INSTAGRAM_PASSWORD="your_password"
|
||||
$env:TARGET_USERNAME="instagram"
|
||||
$env:MAX_FOLLOWING="20"
|
||||
$env:MAX_PROFILES="5"
|
||||
$env:MODE="full"
|
||||
|
||||
node server.js
|
||||
```
|
||||
|
||||
**What happens:**
|
||||
|
||||
1. 🔐 **Login** - Logs into Instagram with human-like behavior
|
||||
2. 💾 **Save Session** - Extracts and saves cookies to `session_cookies.json`
|
||||
3. 🌐 **Browse** - Simulates random mouse movements and scrolling
|
||||
4. 👥 **Fetch Followings** - Gets following list using API interception
|
||||
5. 👤 **Scrape Profiles** - Scrapes detailed data for each profile
|
||||
6. 📁 **Save Data** - Creates JSON files with all collected data
|
||||
|
||||
**Output files:**
|
||||
|
||||
- `followings_[username]_[timestamp].json` - Full following list
|
||||
- `profiles_[username]_[timestamp].json` - Detailed profile data
|
||||
- `session_cookies.json` - Reusable session cookies
|
||||
|
||||
### 2. Simple Workflow
|
||||
|
||||
Uses the built-in `scrapeWorkflow()` function:
|
||||
|
||||
```bash
|
||||
$env:MODE="simple"
|
||||
node server.js
|
||||
```
|
||||
|
||||
**What it does:**
|
||||
|
||||
- Combines login + following fetch + profile scraping
|
||||
- Single output file with all data
|
||||
- Less granular control but simpler
|
||||
|
||||
### 3. Scheduled Workflow
|
||||
|
||||
Runs scraping on a schedule using `cronJobs()`:
|
||||
|
||||
```bash
|
||||
$env:MODE="scheduled"
|
||||
$env:SCRAPE_INTERVAL="60" # Minutes between runs
|
||||
$env:MAX_RUNS="5" # Stop after 5 runs
|
||||
node server.js
|
||||
```
|
||||
|
||||
**Use case:** Monitor a profile's followings over time
|
||||
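To turn the scheduled runs into an actual monitor, you can diff the `followings_*.json` files produced by consecutive runs. A minimal sketch, assuming the output format shown later in this guide (the file names below are examples):

```javascript
const fs = require("fs");

// Compare the username sets from two followings_<user>_<timestamp>.json files
function diffFollowings(olderFile, newerFile) {
  const load = (file) =>
    new Set(JSON.parse(fs.readFileSync(file, "utf-8")).followings.map((u) => u.username));

  const before = load(olderFile);
  const after = load(newerFile);

  return {
    added: [...after].filter((u) => !before.has(u)),
    removed: [...before].filter((u) => !after.has(u)),
  };
}

console.log(
  diffFollowings(
    "followings_instagram_2025-10-30T12-00-00-000Z.json",
    "followings_instagram_2025-10-31T12-00-00-000Z.json"
  )
);
```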
|
||||
## 📋 Environment Variables
|
||||
|
||||
| Variable | Description | Default | Example |
|
||||
| -------------------- | ------------------------------------- | --------------- | --------------------- |
|
||||
| `INSTAGRAM_USERNAME` | Your Instagram username | `your_username` | `john_doe` |
|
||||
| `INSTAGRAM_PASSWORD` | Your Instagram password | `your_password` | `MySecureP@ss` |
|
||||
| `TARGET_USERNAME` | Profile to scrape | `instagram` | `cristiano` |
|
||||
| `MAX_FOLLOWING` | Max followings to fetch | `20` | `100` |
|
||||
| `MAX_PROFILES` | Max profiles to scrape | `5` | `50` |
|
||||
| `PROXY` | Proxy server | `None` | `proxy.com:8080` |
|
||||
| `MODE` | Workflow type | `full` | `simple`, `scheduled` |
|
||||
| `SCRAPE_INTERVAL` | Minutes between runs (scheduled mode) | `60` | `30` |
|
||||
| `MAX_RUNS` | Max runs (scheduled mode) | `5` | `10` |
|
||||
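Because `server.js` calls `require("dotenv").config()`, the same variables can also live in a `.env` file in the project root instead of being set per shell session. The values below are placeholders:

```bash
# .env - loaded automatically by dotenv in server.js
INSTAGRAM_USERNAME=your_username
INSTAGRAM_PASSWORD=your_password
TARGET_USERNAME=instagram
MAX_FOLLOWING=20
MAX_PROFILES=5
MODE=full
# PROXY=proxy.example.com:8080
```

If you go this route, consider adding `.env` to `.gitignore` alongside `session_cookies.json`, since it holds credentials.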
|
||||
## 🎯 Workflow Details
|
||||
|
||||
### Full Workflow Step-by-Step
|
||||
|
||||
```javascript
|
||||
async function fullScrapingWorkflow() {
|
||||
// Step 1: Login
|
||||
  const { browser, page } = await loginWithSession(credentials, proxy, true);
|
||||
|
||||
// Step 2: Extract session
|
||||
const session = await extractSession(page);
|
||||
|
||||
// Step 3: Simulate browsing
|
||||
await simulateHumanBehavior(page, { mouseMovements: 5, scrolls: 3 });
|
||||
|
||||
// Step 4: Get followings list
|
||||
const followingsData = await getFollowingsList(
|
||||
page,
|
||||
targetUsername,
|
||||
maxFollowing
|
||||
);
|
||||
|
||||
// Step 5: Scrape individual profiles
|
||||
for (const username of followingsData.usernames) {
|
||||
const profileData = await scrapeProfile(page, username);
|
||||
// ... takes breaks every 3 profiles
|
||||
}
|
||||
|
||||
// Step 6: Save all data
|
||||
// ... creates JSON files
|
||||
}
|
||||
```
|
||||
|
||||
### What Each Function Does
|
||||
|
||||
#### `loginWithSession(credentials, proxy, useExistingSession)`
|
||||
|
||||
- Launches the browser with stealth mode
- Applies anti-detection settings (random user agent, desktop viewport, timezone)
- Reuses cookies from `session_cookies.json` when available, otherwise performs a human-like fresh login
- Returns `{ browser, page, sessionReused }`
|
||||
|
||||
#### `extractSession(page)`
|
||||
|
||||
- Gets all cookies from current session
|
||||
- Returns `{ cookies: [...] }`
|
||||
- Save for session reuse
|
||||
|
||||
#### `simulateHumanBehavior(page, options)`
|
||||
|
||||
- Random mouse movements
|
||||
- Random scrolling
|
||||
- Mimics real user behavior
|
||||
- Options: `{ mouseMovements, scrolls, randomClicks }`
|
||||
|
||||
#### `getFollowingsList(page, username, maxUsers)`
|
||||
|
||||
- Navigates to profile
|
||||
- Clicks "following" button
|
||||
- Intercepts Instagram API responses
|
||||
- Returns `{ usernames: [...], fullData: [...] }`
|
||||
|
||||
**Full data includes:**
|
||||
|
||||
```json
|
||||
{
|
||||
"pk": "310285748",
|
||||
"username": "example_user",
|
||||
"full_name": "Example User",
|
||||
"profile_pic_url": "https://...",
|
||||
"is_verified": true,
|
||||
"is_private": false,
|
||||
"fbid_v2": "...",
|
||||
"latest_reel_media": 1761853039
|
||||
}
|
||||
```
|
||||
|
||||
#### `scrapeProfile(page, username)`
|
||||
|
||||
- Navigates to profile
|
||||
- Intercepts API endpoint
|
||||
- Falls back to DOM scraping if needed
|
||||
- Returns detailed profile data
|
||||
|
||||
**Profile data includes:**
|
||||
|
||||
```json
|
||||
{
|
||||
"username": "example_user",
|
||||
"full_name": "Example User",
|
||||
"bio": "Biography text...",
|
||||
"followerCount": 15000,
|
||||
"followingCount": 500,
|
||||
"postsCount": 100,
|
||||
"is_verified": true,
|
||||
"is_private": false,
|
||||
"is_business_account": true,
|
||||
"email": "contact@example.com",
|
||||
"phone": "+1234567890"
|
||||
}
|
||||
```
|
||||
|
||||
#### `scrapeWorkflow(creds, targetUsername, proxy, maxFollowing)`
|
||||
|
||||
- Complete workflow in one function
|
||||
- Combines all steps above
|
||||
- Returns aggregated results
|
||||
|
||||
#### `cronJobs(fn, intervalSec, stopAfter)`
|
||||
|
||||
- Runs function on interval
|
||||
- Returns stop function
|
||||
- Used for scheduled scraping
|
||||
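A short usage sketch based on the exports from `scraper.js` (credentials and target are placeholders):

```javascript
const { cronJobs, scrapeWorkflow } = require("./scraper.js");

async function main() {
  const creds = { username: "your_username", password: "your_password" };

  // Run the built-in workflow every 30 minutes, at most 5 times
  const stop = await cronJobs(
    () => scrapeWorkflow(creds, "instagram", null, 20),
    30 * 60, // interval in seconds
    5 // stopAfter
  );

  // Call stop() at any point to cancel the remaining scheduled runs
  // stop();
}

main();
```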
|
||||
## 💡 Usage Examples
|
||||
|
||||
### Example 1: Scrape a Top Influencer's Followings
|
||||
|
||||
```bash
|
||||
$env:INSTAGRAM_USERNAME="your_account"
|
||||
$env:INSTAGRAM_PASSWORD="your_password"
|
||||
$env:TARGET_USERNAME="cristiano"
|
||||
$env:MAX_FOLLOWING="100"
|
||||
$env:MAX_PROFILES="20"
|
||||
node server.js
|
||||
```
|
||||
|
||||
### Example 2: Monitor Competitor Every Hour
|
||||
|
||||
```bash
|
||||
$env:TARGET_USERNAME="competitor_account"
|
||||
$env:MODE="scheduled"
|
||||
$env:SCRAPE_INTERVAL="60"
|
||||
$env:MAX_RUNS="24" # Run for 24 hours
|
||||
node server.js
|
||||
```
|
||||
|
||||
### Example 3: Scrape Multiple Accounts
|
||||
|
||||
Create `scrape-multiple.js`:
|
||||
|
||||
```javascript
|
||||
const { fullScrapingWorkflow } = require("./server.js");
|
||||
|
||||
const targets = ["account1", "account2", "account3"];
|
||||
|
||||
async function scrapeAll() {
|
||||
for (const target of targets) {
|
||||
process.env.TARGET_USERNAME = target;
|
||||
await fullScrapingWorkflow();
|
||||
|
||||
// Wait between accounts
|
||||
await new Promise((r) => setTimeout(r, 300000)); // 5 minutes
|
||||
}
|
||||
}
|
||||
|
||||
scrapeAll();
|
||||
```
|
||||
|
||||
### Example 4: Custom Workflow with Your Logic
|
||||
|
||||
```javascript
|
||||
const { loginWithSession, getFollowingsList, scrapeProfile } = require("./scraper.js");
|
||||
|
||||
async function myCustomWorkflow() {
|
||||
// Login once
|
||||
  const { browser, page } = await loginWithSession({
|
||||
username: "your_username",
|
||||
password: "your_password",
|
||||
});
|
||||
|
||||
try {
|
||||
// Get followings from multiple accounts
|
||||
const accounts = ["account1", "account2"];
|
||||
|
||||
for (const account of accounts) {
|
||||
const followings = await getFollowingsList(page, account, 50);
|
||||
|
||||
// Filter verified users only
|
||||
const verified = followings.fullData.filter((u) => u.is_verified);
|
||||
|
||||
// Scrape verified profiles
|
||||
for (const user of verified) {
|
||||
const profile = await scrapeProfile(page, user.username);
|
||||
|
||||
// Custom logic: save only if business account
|
||||
if (profile.is_business_account) {
|
||||
console.log(`Business: ${profile.username} - ${profile.email}`);
|
||||
}
|
||||
}
|
||||
}
|
||||
} finally {
|
||||
await browser.close();
|
||||
}
|
||||
}
|
||||
|
||||
myCustomWorkflow();
|
||||
```
|
||||
|
||||
## 🔍 Output Format
|
||||
|
||||
### Followings Data
|
||||
|
||||
```json
|
||||
{
|
||||
"targetUsername": "instagram",
|
||||
"scrapedAt": "2025-10-31T12:00:00.000Z",
|
||||
"totalFollowings": 20,
|
||||
"followings": [
|
||||
{
|
||||
"pk": "123456",
|
||||
"username": "user1",
|
||||
"full_name": "User One",
|
||||
"is_verified": true,
|
||||
...
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### Profiles Data
|
||||
|
||||
```json
|
||||
{
|
||||
"targetUsername": "instagram",
|
||||
"scrapedAt": "2025-10-31T12:00:00.000Z",
|
||||
"totalProfiles": 5,
|
||||
"profiles": [
|
||||
{
|
||||
"username": "user1",
|
||||
"followerCount": 50000,
|
||||
"email": "contact@user1.com",
|
||||
...
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
## ⚡ Performance Tips
|
||||
|
||||
### 1. Optimize Delays
|
||||
|
||||
```javascript
|
||||
// Faster (more aggressive, higher block risk)
|
||||
await randomSleep(1000, 2000);
|
||||
|
||||
// Balanced (recommended)
|
||||
await randomSleep(2500, 6000);
|
||||
|
||||
// Safer (slower but less likely to be blocked)
|
||||
await randomSleep(5000, 10000);
|
||||
```
|
||||
|
||||
### 2. Batch Processing
|
||||
|
||||
Scrape in batches to avoid overwhelming Instagram:
|
||||
|
||||
```javascript
|
||||
const batchSize = 10;
|
||||
for (let i = 0; i < usernames.length; i += batchSize) {
|
||||
const batch = usernames.slice(i, i + batchSize);
|
||||
// Scrape batch
|
||||
// Long break between batches
|
||||
await randomSleep(60000, 120000); // 1-2 minutes
|
||||
}
|
||||
```
|
||||
|
||||
### 3. Session Reuse
|
||||
|
||||
Reuse cookies to avoid logging in repeatedly:
|
||||
|
||||
```javascript
|
||||
const savedCookies = JSON.parse(fs.readFileSync("session_cookies.json"));
|
||||
await page.setCookie(...savedCookies.cookies);
|
||||
```
|
||||
|
||||
## 🚨 Common Issues
|
||||
|
||||
### "Rate limited (429)"
|
||||
|
||||
✅ **Solution**: Exponential backoff is applied automatically. If the problem persists:
|
||||
|
||||
- Reduce MAX_FOLLOWING and MAX_PROFILES
|
||||
- Increase delays
|
||||
- Add residential proxies
|
||||
|
||||
### "Login failed"
|
||||
|
||||
- Check credentials
|
||||
- Instagram may require verification
|
||||
- Try from your home IP first
|
||||
|
||||
### "No data captured"
|
||||
|
||||
- Instagram changed their API structure
|
||||
- Check if `doc_id` values need updating
|
||||
- DOM fallback should still work
|
||||
|
||||
### Blocked on cloud servers
|
||||
|
||||
❌ **Problem**: Using datacenter IPs
|
||||
✅ **Solution**: Get residential proxies (see ANTI-BOT-RECOMMENDATIONS.md)
|
||||
|
||||
## 📊 Best Practices
|
||||
|
||||
1. **Start Small**: Test with MAX_FOLLOWING=5, MAX_PROFILES=2
|
||||
2. **Use Residential Proxies**: Critical for production use
|
||||
3. **Respect Rate Limits**: ~200 requests/hour per IP
|
||||
4. **Save Sessions**: Reuse cookies to avoid repeated logins
|
||||
5. **Monitor Logs**: Watch for 429 errors
|
||||
6. **Add Randomness**: Vary delays and patterns
|
||||
7. **Take Breaks**: Schedule longer breaks every N profiles
|
||||
|
||||
## 🎓 Learning Path
|
||||
|
||||
1. **Start**: Run `MODE=simple` with small numbers
|
||||
2. **Understand**: Read the logs and output files
|
||||
3. **Customize**: Modify `MAX_FOLLOWING` and `MAX_PROFILES`
|
||||
4. **Advanced**: Use `MODE=full` for complete control
|
||||
5. **Production**: Add proxies and session management
|
||||
|
||||
---
|
||||
|
||||
**Need help?** Check:
|
||||
|
||||
- [ANTI-BOT-RECOMMENDATIONS.md](./ANTI-BOT-RECOMMENDATIONS.md)
|
||||
- [EXPONENTIAL-BACKOFF.md](./EXPONENTIAL-BACKOFF.md)
|
||||
- Test script: `node test-retry.js`
|
||||
1648
package-lock.json
generated
Normal file
File diff suppressed because it is too large
9
package.json
Normal file
@@ -0,0 +1,9 @@
|
||||
{
|
||||
"dependencies": {
|
||||
"dotenv": "^17.2.3",
|
||||
"puppeteer": "^24.27.0",
|
||||
"puppeteer-extra": "^3.3.6",
|
||||
"puppeteer-extra-plugin-stealth": "^2.11.2",
|
||||
"random-useragent": "^0.5.0"
|
||||
}
|
||||
}
|
||||
723
scraper.js
Normal file
@@ -0,0 +1,723 @@
|
||||
const puppeteer = require("puppeteer-extra");
|
||||
const StealthPlugin = require("puppeteer-extra-plugin-stealth");
|
||||
const randomUseragent = require("random-useragent");
|
||||
const fs = require("fs");
|
||||
const {
|
||||
randomSleep,
|
||||
simulateHumanBehavior,
|
||||
handleRateLimitedRequest,
|
||||
} = require("./utils.js");
|
||||
|
||||
puppeteer.use(StealthPlugin());
|
||||
|
||||
const INSTAGRAM_URL = "https://www.instagram.com";
|
||||
const SESSION_FILE = "session_cookies.json";
|
||||
|
||||
async function loginWithSession(
|
||||
{ username, password },
|
||||
proxy = null,
|
||||
useExistingSession = true
|
||||
) {
|
||||
const browserArgs = [];
|
||||
if (proxy) browserArgs.push(`--proxy-server=${proxy}`);
|
||||
const userAgent = randomUseragent.getRandom();
|
||||
|
||||
const browser = await puppeteer.launch({
|
||||
headless: false,
|
||||
args: browserArgs,
|
||||
});
|
||||
const page = await browser.newPage();
|
||||
await page.setUserAgent(userAgent);
|
||||
|
||||
// Set a large viewport to ensure modal behavior (Instagram shows modals on desktop/large screens)
|
||||
await page.setViewport({
|
||||
width: 1920, // Standard desktop width
|
||||
height: 1080, // Standard desktop height
|
||||
});
|
||||
|
||||
// Set browser timezone
|
||||
await page.evaluateOnNewDocument(() => {
|
||||
Object.defineProperty(Intl.DateTimeFormat.prototype, "resolvedOptions", {
|
||||
value: function () {
|
||||
return { timeZone: "America/New_York" };
|
||||
},
|
||||
});
|
||||
});
|
||||
|
||||
// Monitor for rate limit responses
|
||||
page.on("response", (response) => {
|
||||
if (response.status() === 429) {
|
||||
console.log(
|
||||
`WARNING: Rate limit detected (429) on ${response
|
||||
.url()
|
||||
.substring(0, 80)}...`
|
||||
);
|
||||
}
|
||||
});
|
||||
|
||||
// Try to load existing session if available
|
||||
if (useExistingSession && fs.existsSync(SESSION_FILE)) {
|
||||
try {
|
||||
console.log("Found existing session, attempting to reuse...");
|
||||
const sessionData = JSON.parse(fs.readFileSync(SESSION_FILE, "utf-8"));
|
||||
|
||||
if (sessionData.cookies && sessionData.cookies.length > 0) {
|
||||
await page.setCookie(...sessionData.cookies);
|
||||
console.log(
|
||||
`Loaded ${sessionData.cookies.length} cookies from session`
|
||||
);
|
||||
|
||||
// Navigate to Instagram to check if session is valid
|
||||
await page.goto(INSTAGRAM_URL, { waitUntil: "networkidle2" });
|
||||
await randomSleep(2000, 3000);
|
||||
|
||||
// Check if we're logged in by looking for profile link or login page
|
||||
const isLoggedIn = await page.evaluate(() => {
|
||||
// If we see login/signup links, we're not logged in
|
||||
const loginLink = document.querySelector(
|
||||
'a[href="/accounts/login/"]'
|
||||
);
|
||||
return !loginLink;
|
||||
});
|
||||
|
||||
if (isLoggedIn) {
|
||||
console.log("Session is valid! Skipping login.");
|
||||
return { browser, page, sessionReused: true };
|
||||
} else {
|
||||
console.log("Session expired, proceeding with fresh login...");
|
||||
}
|
||||
}
|
||||
} catch (error) {
|
||||
console.log("Failed to load session, proceeding with fresh login...");
|
||||
}
|
||||
}
|
||||
|
||||
// Fresh login flow
|
||||
return await performLogin(page, { username, password }, browser);
|
||||
}
|
||||
|
||||
async function performLogin(page, { username, password }, browser) {
|
||||
// Navigate to login page
|
||||
await handleRateLimitedRequest(
|
||||
page,
|
||||
async () => {
|
||||
await page.goto(`${INSTAGRAM_URL}/accounts/login/`, {
|
||||
waitUntil: "networkidle2",
|
||||
});
|
||||
},
|
||||
"during login page load"
|
||||
);
|
||||
|
||||
console.log("Waiting for login form to appear...");
|
||||
|
||||
// Wait for the actual login form to load
|
||||
await page.waitForSelector('input[name="username"]', {
|
||||
visible: true,
|
||||
timeout: 60000,
|
||||
});
|
||||
|
||||
console.log("Login form loaded!");
|
||||
|
||||
// Simulate human behavior
|
||||
await simulateHumanBehavior(page, { mouseMovements: 3, scrolls: 1 });
|
||||
await randomSleep(500, 1000);
|
||||
|
||||
await page.type('input[name="username"]', username, { delay: 130 });
|
||||
await randomSleep(300, 700);
|
||||
await page.type('input[name="password"]', password, { delay: 120 });
|
||||
|
||||
await simulateHumanBehavior(page, { mouseMovements: 2, scrolls: 0 });
|
||||
await randomSleep(500, 1000);
|
||||
|
||||
await Promise.all([
|
||||
page.click('button[type="submit"]'),
|
||||
page.waitForNavigation({ waitUntil: "networkidle2" }),
|
||||
]);
|
||||
|
||||
await randomSleep(1000, 2000);
|
||||
|
||||
return { browser, page, sessionReused: false };
|
||||
}
|
||||
|
||||
async function extractSession(page) {
|
||||
// Return cookies/session tokens for reuse
|
||||
const cookies = await page.cookies();
|
||||
return { cookies };
|
||||
}
|
||||
|
||||
async function getFollowingsList(page, targetUsername, maxUsers = 100) {
|
||||
const followingData = [];
|
||||
const followingUsernames = [];
|
||||
let requestCount = 0;
|
||||
const requestsPerBatch = 12; // Instagram typically returns ~12 users per request
|
||||
|
||||
// Set up response listener to capture API responses (no need for request interception)
|
||||
page.on("response", async (response) => {
|
||||
const url = response.url();
|
||||
|
||||
// Intercept the following list API endpoint
|
||||
if (url.includes("/friendships/") && url.includes("/following/")) {
|
||||
try {
|
||||
const json = await response.json();
|
||||
|
||||
// Check for rate limit in response
|
||||
if (json.status === "fail" || json.message?.includes("rate limit")) {
|
||||
console.log("WARNING: Rate limit detected in API response");
|
||||
return;
|
||||
}
|
||||
|
||||
if (json.users && Array.isArray(json.users)) {
|
||||
json.users.forEach((user) => {
|
||||
if (followingData.length < maxUsers) {
|
||||
followingData.push({
|
||||
pk: user.pk,
|
||||
pk_id: user.pk_id,
|
||||
username: user.username,
|
||||
full_name: user.full_name,
|
||||
profile_pic_url: user.profile_pic_url,
|
||||
is_verified: user.is_verified,
|
||||
is_private: user.is_private,
|
||||
fbid_v2: user.fbid_v2,
|
||||
latest_reel_media: user.latest_reel_media,
|
||||
account_badges: user.account_badges,
|
||||
});
|
||||
followingUsernames.push(user.username);
|
||||
}
|
||||
});
|
||||
|
||||
requestCount++;
|
||||
console.log(
|
||||
`Captured ${followingData.length} users so far (Request #${requestCount})...`
|
||||
);
|
||||
}
|
||||
} catch (err) {
|
||||
// Not JSON or parsing error, ignore
|
||||
}
|
||||
}
|
||||
});
|
||||
|
||||
await handleRateLimitedRequest(
|
||||
page,
|
||||
async () => {
|
||||
await page.goto(`${INSTAGRAM_URL}/${targetUsername}/`, {
|
||||
waitUntil: "networkidle2",
|
||||
});
|
||||
},
|
||||
`while loading profile @${targetUsername}`
|
||||
);
|
||||
|
||||
// Simulate browsing the profile before clicking following
|
||||
await simulateHumanBehavior(page, { mouseMovements: 4, scrolls: 2 });
|
||||
await randomSleep(1000, 2000);
|
||||
|
||||
await page.waitForSelector('a[href$="/following/"]', { timeout: 10000 });
|
||||
|
||||
// Hover over the following link before clicking
|
||||
await page.hover('a[href$="/following/"]');
|
||||
await randomSleep(300, 600);
|
||||
|
||||
await page.click('a[href$="/following/"]');
|
||||
|
||||
// Wait for either modal or page navigation
|
||||
await randomSleep(1500, 2500);
|
||||
|
||||
// Detect if modal opened or if we navigated to a new page
|
||||
const layoutType = await page.evaluate(() => {
|
||||
const hasModal = !!document.querySelector('div[role="dialog"]');
|
||||
const urlHasFollowing = window.location.pathname.includes("/following");
|
||||
return { hasModal, urlHasFollowing };
|
||||
});
|
||||
|
||||
if (layoutType.hasModal) {
|
||||
console.log("Following modal opened (desktop layout)");
|
||||
} else if (layoutType.urlHasFollowing) {
|
||||
console.log("Navigated to following page (mobile/small viewport layout)");
|
||||
} else {
|
||||
console.log("Warning: Could not detect following list layout");
|
||||
}
|
||||
|
||||
// Wait for the list content to load
|
||||
await randomSleep(1500, 2500);
|
||||
|
||||
// Verify we can see the list items
|
||||
const hasListItems = await page.evaluate(() => {
|
||||
return (
|
||||
document.querySelectorAll('div.x1qnrgzn, a[href*="following"]').length > 0
|
||||
);
|
||||
});
|
||||
|
||||
if (hasListItems) {
|
||||
console.log("Following list loaded successfully");
|
||||
} else {
|
||||
console.log("Warning: List items not detected, but continuing...");
|
||||
}
|
||||
|
||||
// Scroll to load more users while simulating human behavior
|
||||
const totalRequests = Math.ceil(maxUsers / requestsPerBatch);
|
||||
let scrollAttempts = 0;
|
||||
const maxScrollAttempts = Math.min(totalRequests * 3, 50000); // Cap at 50k attempts
|
||||
let lastDataLength = 0;
|
||||
let noNewDataCount = 0;
|
||||
|
||||
console.log(
|
||||
`Will attempt to scroll up to ${maxScrollAttempts} times to reach ${maxUsers} users...`
|
||||
);
|
||||
|
||||
while (
|
||||
followingData.length < maxUsers &&
|
||||
scrollAttempts < maxScrollAttempts
|
||||
) {
|
||||
// Check if we're still getting new data
|
||||
if (followingData.length === lastDataLength) {
|
||||
noNewDataCount++;
|
||||
// If no new data after 8 consecutive scroll attempts, we've reached the end
|
||||
if (noNewDataCount >= 8) {
|
||||
console.log(
|
||||
`No new data after ${noNewDataCount} attempts. Reached end of list.`
|
||||
);
|
||||
break;
|
||||
}
|
||||
if (noNewDataCount % 3 === 0) {
|
||||
console.log(
|
||||
`Still at ${followingData.length} users after ${noNewDataCount} scrolls...`
|
||||
);
|
||||
}
|
||||
} else {
|
||||
if (noNewDataCount > 0) {
|
||||
console.log(
|
||||
`Got new data! Now at ${followingData.length} users (was stuck for ${noNewDataCount} attempts)`
|
||||
);
|
||||
}
|
||||
noNewDataCount = 0; // Reset counter when we get new data
|
||||
lastDataLength = followingData.length;
|
||||
}
|
||||
|
||||
// Every ~12 users loaded (one request completed), simulate human behavior
|
||||
if (
|
||||
requestCount > 0 &&
|
||||
requestCount % Math.max(1, Math.ceil(totalRequests / 5)) === 0
|
||||
) {
|
||||
await simulateHumanBehavior(page, {
|
||||
mouseMovements: 2,
|
||||
scrolls: 0, // We're manually controlling scroll below
|
||||
});
|
||||
}
|
||||
|
||||
// Occasionally move mouse while scrolling
|
||||
if (scrollAttempts % 5 === 0) {
|
||||
const viewport = await page.viewport();
|
||||
await page.mouse.move(
|
||||
Math.floor(Math.random() * viewport.width),
|
||||
Math.floor(Math.random() * viewport.height),
|
||||
{ steps: 10 }
|
||||
);
|
||||
}
|
||||
|
||||
// Scroll the dialog's scrollable container - comprehensive approach
|
||||
const scrollResult = await page.evaluate(() => {
|
||||
// Find the scrollable container inside the dialog
|
||||
const dialog = document.querySelector('div[role="dialog"]');
|
||||
if (!dialog) {
|
||||
return { success: false, error: "No dialog found", scrolled: false };
|
||||
}
|
||||
|
||||
// Look for the scrollable div - it has overflow: hidden auto
|
||||
const scrollableElements = dialog.querySelectorAll("div");
|
||||
let scrollContainer = null;
|
||||
|
||||
for (const elem of scrollableElements) {
|
||||
const style = window.getComputedStyle(elem);
|
||||
const overflow = style.overflow || style.overflowY;
|
||||
|
||||
// Check if element is scrollable
|
||||
if (
|
||||
(overflow === "auto" || overflow === "scroll") &&
|
||||
elem.scrollHeight > elem.clientHeight
|
||||
) {
|
||||
scrollContainer = elem;
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
if (!scrollContainer) {
|
||||
      // Fallback: try the class seen in Instagram's current markup or an inline overflow style
|
||||
scrollContainer =
|
||||
dialog.querySelector("div.x6nl9eh") ||
|
||||
dialog.querySelector('div[style*="overflow"]');
|
||||
}
|
||||
|
||||
if (!scrollContainer) {
|
||||
return {
|
||||
success: false,
|
||||
error: "No scrollable container found",
|
||||
scrolled: false,
|
||||
};
|
||||
}
|
||||
|
||||
const oldScrollTop = scrollContainer.scrollTop;
|
||||
const scrollHeight = scrollContainer.scrollHeight;
|
||||
const clientHeight = scrollContainer.clientHeight;
|
||||
|
||||
// Scroll down
|
||||
scrollContainer.scrollTop += 400 + Math.floor(Math.random() * 200);
|
||||
|
||||
const newScrollTop = scrollContainer.scrollTop;
|
||||
const actuallyScrolled = newScrollTop > oldScrollTop;
|
||||
const atBottom = scrollHeight - newScrollTop - clientHeight < 50;
|
||||
|
||||
return {
|
||||
success: true,
|
||||
scrolled: actuallyScrolled,
|
||||
atBottom: atBottom,
|
||||
scrollTop: newScrollTop,
|
||||
scrollHeight: scrollHeight,
|
||||
};
|
||||
});
|
||||
|
||||
if (!scrollResult.success) {
|
||||
console.log(`Scroll error: ${scrollResult.error}`);
|
||||
// Try alternative: scroll the page itself
|
||||
await page.evaluate(() => window.scrollBy(0, 300));
|
||||
} else if (!scrollResult.scrolled) {
|
||||
console.log("Reached scroll bottom - cannot scroll further");
|
||||
}
|
||||
|
||||
// Check if we've reached the bottom and loading indicator is visible
|
||||
const loadingStatus = await page.evaluate(() => {
|
||||
const loader = document.querySelector('svg[aria-label="Loading..."]');
|
||||
|
||||
if (!loader) {
|
||||
return { exists: false, visible: false, reachedBottom: true };
|
||||
}
|
||||
|
||||
// Check if loader is in viewport (visible)
|
||||
const rect = loader.getBoundingClientRect();
|
||||
const isVisible =
|
||||
rect.top >= 0 &&
|
||||
rect.left >= 0 &&
|
||||
rect.bottom <= window.innerHeight &&
|
||||
rect.right <= window.innerWidth;
|
||||
|
||||
return { exists: true, visible: isVisible, reachedBottom: isVisible };
|
||||
});
|
||||
|
||||
if (!loadingStatus.exists) {
|
||||
// No loading indicator at all - might have reached the actual end
|
||||
console.log("No loading indicator found - may have reached end of list");
|
||||
} else if (loadingStatus.visible) {
|
||||
// Loader is visible, meaning we've scrolled to it
|
||||
console.log("Loading indicator visible, waiting for more data...");
|
||||
await randomSleep(2500, 3500); // Wait longer for Instagram to load more
|
||||
} else {
|
||||
// Loader exists but not visible yet, keep scrolling
|
||||
await randomSleep(1500, 2500);
|
||||
}
|
||||
|
||||
scrollAttempts++;
|
||||
|
||||
// Progress update every 50 scrolls
|
||||
if (scrollAttempts % 50 === 0) {
|
||||
console.log(
|
||||
`Progress: ${followingData.length} users captured after ${scrollAttempts} scroll attempts...`
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
console.log(`Total users captured: ${followingData.length}`);
|
||||
|
||||
return {
|
||||
usernames: followingUsernames.slice(0, maxUsers),
|
||||
fullData: followingData.slice(0, maxUsers),
|
||||
};
|
||||
}
|
||||
|
||||
async function scrapeProfile(page, username) {
|
||||
console.log(`Scraping profile: @${username}`);
|
||||
|
||||
let profileData = { username };
|
||||
let dataCapture = false;
|
||||
|
||||
// Set up response listener to intercept API calls
|
||||
const responseHandler = async (response) => {
|
||||
const url = response.url();
|
||||
|
||||
try {
|
||||
// Check for GraphQL or REST API endpoints
|
||||
if (
|
||||
url.includes("/api/v1/users/web_profile_info/") ||
|
||||
url.includes("/graphql/query")
|
||||
) {
|
||||
const contentType = response.headers()["content-type"] || "";
|
||||
if (!contentType.includes("json")) return;
|
||||
|
||||
const json = await response.json();
|
||||
|
||||
// Handle web_profile_info endpoint (REST API)
|
||||
if (url.includes("web_profile_info") && json.data?.user) {
|
||||
if (dataCapture) return; // Already captured, skip duplicate
|
||||
|
||||
const user = json.data.user;
|
||||
profileData = {
|
||||
username: user.username,
|
||||
full_name: user.full_name,
|
||||
bio: user.biography || "",
|
||||
followerCount: user.edge_followed_by?.count || 0,
|
||||
followingCount: user.edge_follow?.count || 0,
|
||||
profile_pic_url:
|
||||
user.hd_profile_pic_url_info?.url || user.profile_pic_url,
|
||||
is_verified: user.is_verified,
|
||||
is_private: user.is_private,
|
||||
is_business: user.is_business_account,
|
||||
category: user.category_name,
|
||||
external_url: user.external_url,
|
||||
email: null,
|
||||
phone: null,
|
||||
};
|
||||
|
||||
// Extract email/phone from bio
|
||||
if (profileData.bio) {
|
||||
const emailMatch = profileData.bio.match(
|
||||
/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/
|
||||
);
|
||||
profileData.email = emailMatch ? emailMatch[0] : null;
|
||||
|
||||
const phoneMatch = profileData.bio.match(
|
||||
/(\+\d{1,3}[- ]?)?\d{10,14}/
|
||||
);
|
||||
profileData.phone = phoneMatch ? phoneMatch[0] : null;
|
||||
}
|
||||
|
||||
dataCapture = true;
|
||||
}
|
||||
// Handle GraphQL endpoint
|
||||
else if (url.includes("graphql") && json.data?.user) {
|
||||
if (dataCapture) return; // Already captured, skip duplicate
|
||||
|
||||
const user = json.data.user;
|
||||
profileData = {
|
||||
username: user.username,
|
||||
full_name: user.full_name,
|
||||
bio: user.biography || "",
|
||||
followerCount: user.follower_count || 0,
|
||||
followingCount: user.following_count || 0,
|
||||
profile_pic_url:
|
||||
user.hd_profile_pic_url_info?.url || user.profile_pic_url,
|
||||
is_verified: user.is_verified,
|
||||
is_private: user.is_private,
|
||||
is_business: user.is_business_account || user.is_business,
|
||||
category: user.category_name || user.category,
|
||||
external_url: user.external_url,
|
||||
email: null,
|
||||
phone: null,
|
||||
};
|
||||
|
||||
// Extract email/phone from bio
|
||||
if (profileData.bio) {
|
||||
const emailMatch = profileData.bio.match(
|
||||
/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/
|
||||
);
|
||||
profileData.email = emailMatch ? emailMatch[0] : null;
|
||||
|
||||
const phoneMatch = profileData.bio.match(
|
||||
/(\+\d{1,3}[- ]?)?\d{10,14}/
|
||||
);
|
||||
profileData.phone = phoneMatch ? phoneMatch[0] : null;
|
||||
}
|
||||
|
||||
dataCapture = true;
|
||||
}
|
||||
}
|
||||
} catch (e) {
|
||||
// Ignore errors from parsing non-JSON responses
|
||||
}
|
||||
};
|
||||
|
||||
page.on("response", responseHandler);
|
||||
|
||||
// Navigate to profile page
|
||||
await handleRateLimitedRequest(
|
||||
page,
|
||||
async () => {
|
||||
await page.goto(`${INSTAGRAM_URL}/${username}/`, {
|
||||
waitUntil: "domcontentloaded",
|
||||
});
|
||||
},
|
||||
`while loading profile @${username}`
|
||||
);
|
||||
|
||||
// Wait for API calls to complete
|
||||
await randomSleep(2000, 3000);
|
||||
|
||||
// Remove listener
|
||||
page.off("response", responseHandler);
|
||||
|
||||
// If API capture worked, return the data
|
||||
if (dataCapture) {
|
||||
return profileData;
|
||||
}
|
||||
|
||||
// Otherwise, fall back to DOM scraping
|
||||
console.log(`⚠️ API capture failed for @${username}, using DOM fallback...`);
|
||||
return await scrapeProfileFallback(page, username);
|
||||
}
|
||||
|
||||
// Fallback function using DOM scraping
|
||||
async function scrapeProfileFallback(page, username) {
|
||||
console.log(`Using DOM scraping for @${username}...`);
|
||||
|
||||
const domData = await page.evaluate(() => {
|
||||
// Try multiple selectors for bio
|
||||
let bio = "";
|
||||
const bioSelectors = [
|
||||
"span._ap3a._aaco._aacu._aacx._aad7._aade", // Updated bio class (2025)
|
||||
"span._ap3a._aaco._aacu._aacx._aad6._aade", // Previous bio class
|
||||
"div._aacl._aaco._aacu._aacx._aad7._aade", // Alternative bio with _aad7
|
||||
"div._aacl._aaco._aacu._aacx._aad6._aade", // Alternative bio with _aad6
|
||||
"h1 + div span", // Bio after username
|
||||
"header section div span", // Generic header bio
|
||||
'div.x7a106z span[dir="auto"]', // Bio container with dir attribute
|
||||
];
|
||||
|
||||
for (const selector of bioSelectors) {
|
||||
const elem = document.querySelector(selector);
|
||||
if (elem && elem.innerText && elem.innerText.length > 3) {
|
||||
bio = elem.innerText;
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
// Get follower/following counts using href-based selectors (stable)
|
||||
let followerCount = 0;
|
||||
let followingCount = 0;
|
||||
|
||||
// Method 1: Find by href (most reliable)
|
||||
const followersLink = document.querySelector('a[href*="/followers/"]');
|
||||
const followingLink = document.querySelector('a[href*="/following/"]');
|
||||
|
||||
if (followersLink) {
|
||||
const text = followersLink.innerText || followersLink.textContent || "";
|
||||
const match = text.match(/[\d,\.]+/);
|
||||
if (match) {
|
||||
followerCount = match[0].replace(/,/g, "").replace(/\./g, "");
|
||||
}
|
||||
}
|
||||
|
||||
if (followingLink) {
|
||||
const text = followingLink.innerText || followingLink.textContent || "";
|
||||
const match = text.match(/[\d,\.]+/);
|
||||
if (match) {
|
||||
followingCount = match[0].replace(/,/g, "").replace(/\./g, "");
|
||||
}
|
||||
}
|
||||
|
||||
// Alternative: Look in meta tags if href method fails
|
||||
if (!followerCount) {
|
||||
const metaContent =
|
||||
document.querySelector('meta[property="og:description"]')?.content ||
|
||||
"";
|
||||
const followerMatch = metaContent.match(/([\d,\.KMB]+)\s+Followers/i);
|
||||
const followingMatch = metaContent.match(/([\d,\.KMB]+)\s+Following/i);
|
||||
|
||||
if (followerMatch) followerCount = followerMatch[1].replace(/,/g, "");
|
||||
if (followingMatch) followingCount = followingMatch[1].replace(/,/g, "");
|
||||
}
|
||||
|
||||
// Extract email/phone from bio
|
||||
let emailMatch = bio.match(
|
||||
/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/
|
||||
);
|
||||
let email = emailMatch ? emailMatch[0] : null;
|
||||
let phoneMatch = bio.match(/(\+\d{1,3}[- ]?)?\d{10,14}/);
|
||||
let phone = phoneMatch ? phoneMatch[0] : null;
|
||||
|
||||
return {
|
||||
bio,
|
||||
followerCount: parseInt(followerCount) || 0,
|
||||
followingCount: parseInt(followingCount) || 0,
|
||||
email,
|
||||
phone,
|
||||
};
|
||||
});
|
||||
|
||||
return {
|
||||
username,
|
||||
...domData,
|
||||
};
|
||||
}
|
||||
|
||||
async function cronJobs(fn, intervalSec, stopAfter = 0) {
|
||||
let runCount = 0;
|
||||
let stop = false;
|
||||
const timer = setInterval(async () => {
|
||||
if (stop || (stopAfter && runCount >= stopAfter)) {
|
||||
clearInterval(timer);
|
||||
return;
|
||||
}
|
||||
await fn();
|
||||
runCount++;
|
||||
}, intervalSec * 1000);
|
||||
return () => {
|
||||
stop = true;
|
||||
};
|
||||
}
|
||||
|
||||
async function scrapeWorkflow(
|
||||
creds,
|
||||
targetUsername,
|
||||
proxy = null,
|
||||
maxFollowingToScrape = 10
|
||||
) {
|
||||
  const { browser, page } = await loginWithSession(creds, proxy);
|
||||
try {
|
||||
// Extract current session details for persistence
|
||||
const session = await extractSession(page);
|
||||
|
||||
// Grab followings with full data
|
||||
const followingsData = await getFollowingsList(
|
||||
page,
|
||||
targetUsername,
|
||||
maxFollowingToScrape
|
||||
);
|
||||
|
||||
console.log(
|
||||
`Processing ${followingsData.usernames.length} following accounts...`
|
||||
);
|
||||
|
||||
for (let i = 0; i < followingsData.usernames.length; i++) {
|
||||
// Add occasional longer breaks to simulate human behavior
|
||||
if (i > 0 && i % 10 === 0) {
|
||||
console.log(`Taking a human-like break after ${i} profiles...`);
|
||||
await simulateHumanBehavior(page, { mouseMovements: 5, scrolls: 3 });
|
||||
await randomSleep(5000, 10000); // Longer break every 10 profiles
|
||||
}
|
||||
|
||||
const profileInfo = await scrapeProfile(
|
||||
page,
|
||||
followingsData.usernames[i]
|
||||
);
|
||||
console.log(JSON.stringify(profileInfo));
|
||||
// Implement rate limiting + anti-bot sleep
|
||||
await randomSleep(2500, 6000);
|
||||
}
|
||||
|
||||
// Optionally return the full data for further processing
|
||||
return {
|
||||
session,
|
||||
followingsFullData: followingsData.fullData,
|
||||
scrapedProfiles: followingsData.usernames.length,
|
||||
};
|
||||
} catch (err) {
|
||||
console.error("Scrape error:", err);
|
||||
} finally {
|
||||
await browser.close();
|
||||
}
|
||||
}
|
||||
|
||||
module.exports = {
|
||||
loginWithSession,
|
||||
extractSession,
|
||||
scrapeWorkflow,
|
||||
getFollowingsList,
|
||||
scrapeProfile,
|
||||
cronJobs,
|
||||
};
|
||||
356
server.js
Normal file
@@ -0,0 +1,356 @@
|
||||
const {
|
||||
loginWithSession,
|
||||
extractSession,
|
||||
scrapeWorkflow,
|
||||
getFollowingsList,
|
||||
scrapeProfile,
|
||||
cronJobs,
|
||||
} = require("./scraper.js");
|
||||
const { randomSleep, simulateHumanBehavior } = require("./utils.js");
|
||||
const fs = require("fs");
|
||||
require("dotenv").config();
|
||||
|
||||
// Full workflow: Login, browse, scrape followings and profiles
|
||||
async function fullScrapingWorkflow() {
|
||||
console.log("Starting Instagram Full Scraping Workflow...\n");
|
||||
|
||||
// Start total timer
|
||||
const totalStartTime = Date.now();
|
||||
|
||||
const credentials = {
|
||||
username: process.env.INSTAGRAM_USERNAME || "your_username",
|
||||
password: process.env.INSTAGRAM_PASSWORD || "your_password",
|
||||
};
|
||||
|
||||
const targetUsername = process.env.TARGET_USERNAME || "instagram";
|
||||
const maxFollowing = parseInt(process.env.MAX_FOLLOWING || "20", 10);
|
||||
const maxProfilesToScrape = parseInt(process.env.MAX_PROFILES || "5", 10);
|
||||
const proxy = process.env.PROXY || null;
|
||||
|
||||
let browser, page;
|
||||
|
||||
try {
|
||||
console.log("Configuration:");
|
||||
console.log(` Target: @${targetUsername}`);
|
||||
console.log(` Max following to fetch: ${maxFollowing}`);
|
||||
console.log(` Max profiles to scrape: ${maxProfilesToScrape}`);
|
||||
console.log(` Proxy: ${proxy || "None"}\n`);
|
||||
|
||||
// Step 1: Login (with session reuse)
|
||||
console.log("Step 1: Logging in to Instagram...");
|
||||
const loginResult = await loginWithSession(credentials, proxy, true);
|
||||
browser = loginResult.browser;
|
||||
page = loginResult.page;
|
||||
|
||||
if (loginResult.sessionReused) {
|
||||
console.log("Reused existing session!\n");
|
||||
} else {
|
||||
console.log("Fresh login successful!\n");
|
||||
}
|
||||
|
||||
// Step 2: Extract and save session
|
||||
console.log("Step 2: Extracting session cookies...");
|
||||
const session = await extractSession(page);
|
||||
fs.writeFileSync("session_cookies.json", JSON.stringify(session, null, 2));
|
||||
console.log(`Session saved (${session.cookies.length} cookies)\n`);
|
||||
|
||||
// Step 3: Simulate browsing before scraping
|
||||
console.log("Step 3: Simulating human browsing behavior...");
|
||||
await simulateHumanBehavior(page, { mouseMovements: 5, scrolls: 3 });
|
||||
await randomSleep(2000, 4000);
|
||||
console.log("Browsing simulation complete\n");
|
||||
|
||||
// Step 4: Get followings list
|
||||
console.log(`👥 Step 4: Fetching following list for @${targetUsername}...`);
|
||||
const followingsStartTime = Date.now();
|
||||
|
||||
const followingsData = await getFollowingsList(
|
||||
page,
|
||||
targetUsername,
|
||||
maxFollowing
|
||||
);
|
||||
|
||||
const followingsEndTime = Date.now();
|
||||
const followingsTime = (
|
||||
(followingsEndTime - followingsStartTime) /
|
||||
1000
|
||||
).toFixed(2);
|
||||
|
||||
console.log(
|
||||
`✓ Captured ${followingsData.fullData.length} followings in ${followingsTime}s\n`
|
||||
);
|
||||
|
||||
// Save followings data
|
||||
const timestamp = new Date().toISOString().replace(/[:.]/g, "-");
|
||||
const followingsFile = `followings_${targetUsername}_${timestamp}.json`;
|
||||
fs.writeFileSync(
|
||||
followingsFile,
|
||||
JSON.stringify(
|
||||
{
|
||||
targetUsername,
|
||||
scrapedAt: new Date().toISOString(),
|
||||
totalFollowings: followingsData.fullData.length,
|
||||
followings: followingsData.fullData,
|
||||
},
|
||||
null,
|
||||
2
|
||||
)
|
||||
);
|
||||
console.log(`Followings data saved to: ${followingsFile}\n`);
|
||||
|
||||
// Step 5: Scrape individual profiles
|
||||
console.log(
|
||||
`📊 Step 5: Scraping ${maxProfilesToScrape} individual profiles...`
|
||||
);
|
||||
const profilesStartTime = Date.now();
|
||||
const profilesData = [];
|
||||
const usernamesToScrape = followingsData.usernames.slice(
|
||||
0,
|
||||
maxProfilesToScrape
|
||||
);
|
||||
|
||||
for (let i = 0; i < usernamesToScrape.length; i++) {
|
||||
const username = usernamesToScrape[i];
|
||||
console.log(
|
||||
` [${i + 1}/${usernamesToScrape.length}] Scraping @${username}...`
|
||||
);
|
||||
|
||||
try {
|
||||
const profileData = await scrapeProfile(page, username);
|
||||
profilesData.push(profileData);
|
||||
console.log(` @${username}: ${profileData.followerCount} followers`);
|
||||
|
||||
// Human-like delay between profiles
|
||||
await randomSleep(3000, 6000);
|
||||
|
||||
// Take a longer break every 3 profiles
|
||||
if ((i + 1) % 3 === 0 && i < usernamesToScrape.length - 1) {
|
||||
console.log(" ⏸ Taking a human-like break...");
|
||||
await simulateHumanBehavior(page, { mouseMovements: 4, scrolls: 2 });
|
||||
await randomSleep(8000, 12000);
|
||||
}
|
||||
} catch (error) {
|
||||
console.log(` Failed to scrape @${username}: ${error.message}`);
|
||||
}
|
||||
}
|
||||
|
||||
const profilesEndTime = Date.now();
|
||||
const profilesTime = ((profilesEndTime - profilesStartTime) / 1000).toFixed(
|
||||
2
|
||||
);
|
||||
|
||||
console.log(
|
||||
`\n✓ Scraped ${profilesData.length} profiles in ${profilesTime}s\n`
|
||||
);
|
||||
|
||||
// Step 6: Save profiles data
|
||||
console.log("Step 6: Saving profile data...");
|
||||
const profilesFile = `profiles_${targetUsername}_${timestamp}.json`;
|
||||
fs.writeFileSync(
|
||||
profilesFile,
|
||||
JSON.stringify(
|
||||
{
|
||||
targetUsername,
|
||||
scrapedAt: new Date().toISOString(),
|
||||
totalProfiles: profilesData.length,
|
||||
profiles: profilesData,
|
||||
},
|
||||
null,
|
||||
2
|
||||
)
|
||||
);
|
||||
console.log(`Profiles data saved to: ${profilesFile}\n`);
|
||||
|
||||
// Calculate total time
|
||||
const totalEndTime = Date.now();
|
||||
const totalTime = ((totalEndTime - totalStartTime) / 1000).toFixed(2);
|
||||
const totalMinutes = Math.floor(totalTime / 60);
|
||||
const totalSeconds = (totalTime % 60).toFixed(2);
|
||||
|
||||
// Step 7: Summary
|
||||
console.log("=".repeat(60));
|
||||
console.log("📊 SCRAPING SUMMARY");
|
||||
console.log("=".repeat(60));
|
||||
console.log(`✓ Logged in successfully`);
|
||||
console.log(`✓ Session cookies saved`);
|
||||
console.log(
|
||||
`✓ ${followingsData.fullData.length} followings captured in ${followingsTime}s`
|
||||
);
|
||||
console.log(
|
||||
`✓ ${profilesData.length} profiles scraped in ${profilesTime}s`
|
||||
);
|
||||
console.log(`\n📁 Files created:`);
|
||||
console.log(` • ${followingsFile}`);
|
||||
console.log(` • ${profilesFile}`);
|
||||
console.log(` • session_cookies.json`);
|
||||
console.log(
|
||||
`\n⏱️ Total execution time: ${totalMinutes}m ${totalSeconds}s`
|
||||
);
|
||||
console.log("=".repeat(60) + "\n");
|
||||
|
||||
return {
|
||||
success: true,
|
||||
followingsCount: followingsData.fullData.length,
|
||||
profilesCount: profilesData.length,
|
||||
followingsData: followingsData.fullData,
|
||||
profilesData,
|
||||
session,
|
||||
timings: {
|
||||
followingsTime: parseFloat(followingsTime),
|
||||
profilesTime: parseFloat(profilesTime),
|
||||
totalTime: parseFloat(totalTime),
|
||||
},
|
||||
};
|
||||
} catch (error) {
|
||||
console.error("\nScraping workflow failed:");
|
||||
console.error(error.message);
|
||||
console.error(error.stack);
|
||||
throw error;
|
||||
} finally {
|
||||
if (browser) {
|
||||
console.log("Closing browser...");
|
||||
await browser.close();
|
||||
console.log("Browser closed\n");
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Alternative: Use the built-in scrapeWorkflow function
|
||||
async function simpleWorkflow() {
|
||||
console.log("Starting Simple Scraping Workflow (using scrapeWorkflow)...\n");
|
||||
|
||||
const credentials = {
|
||||
username: process.env.INSTAGRAM_USERNAME || "your_username",
|
||||
password: process.env.INSTAGRAM_PASSWORD || "your_password",
|
||||
};
|
||||
|
||||
const targetUsername = process.env.TARGET_USERNAME || "instagram";
|
||||
const maxFollowing = parseInt(process.env.MAX_FOLLOWING || "20", 10);
|
||||
const proxy = process.env.PROXY || null;
|
||||
|
||||
try {
|
||||
console.log(`Target: @${targetUsername}`);
|
||||
console.log(`Max following to scrape: ${maxFollowing}`);
|
||||
console.log(`Using proxy: ${proxy || "None"}\n`);
|
||||
|
||||
const result = await scrapeWorkflow(
|
||||
credentials,
|
||||
targetUsername,
|
||||
proxy,
|
||||
maxFollowing
|
||||
);
|
||||
|
||||
console.log("\nScraping completed successfully!");
|
||||
console.log(`Total profiles scraped: ${result.scrapedProfiles}`);
|
||||
console.log(
|
||||
`Full following data captured: ${result.followingsFullData.length} users`
|
||||
);
|
||||
|
||||
// Save the data
|
||||
if (result.followingsFullData.length > 0) {
|
||||
const timestamp = new Date().toISOString().replace(/[:.]/g, "-");
|
||||
const filename = `scraped_data_${targetUsername}_${timestamp}.json`;
|
||||
|
||||
fs.writeFileSync(
|
||||
filename,
|
||||
JSON.stringify(
|
||||
{
|
||||
targetUsername,
|
||||
scrapedAt: new Date().toISOString(),
|
||||
totalUsers: result.followingsFullData.length,
|
||||
data: result.followingsFullData,
|
||||
},
|
||||
null,
|
||||
2
|
||||
)
|
||||
);
|
||||
|
||||
console.log(`Data saved to: ${filename}`);
|
||||
}
|
||||
|
||||
return result;
|
||||
} catch (error) {
|
||||
console.error("\nScraping failed:");
|
||||
console.error(error.message);
|
||||
throw error;
|
||||
}
|
||||
}
|
||||
|
||||
// Scheduled scraping with cron
|
||||
async function scheduledScraping() {
|
||||
console.log("Starting Scheduled Scraping...\n");
|
||||
|
||||
const credentials = {
|
||||
username: process.env.INSTAGRAM_USERNAME || "your_username",
|
||||
password: process.env.INSTAGRAM_PASSWORD || "your_password",
|
||||
};
|
||||
|
||||
const targetUsername = process.env.TARGET_USERNAME || "instagram";
|
||||
const intervalMinutes = parseInt(process.env.SCRAPE_INTERVAL || "60", 10);
|
||||
const maxRuns = parseInt(process.env.MAX_RUNS || "5", 10);
|
||||
|
||||
console.log(
|
||||
`Will scrape @${targetUsername} every ${intervalMinutes} minutes`
|
||||
);
|
||||
console.log(`Maximum runs: ${maxRuns}\n`);
|
||||
|
||||
let runCount = 0;
|
||||
|
||||
const stopCron = await cronJobs(
|
||||
async () => {
|
||||
runCount++;
|
||||
console.log(`\n${"=".repeat(60)}`);
|
||||
console.log(
|
||||
`📅 Scheduled Run #${runCount} - ${new Date().toLocaleString()}`
|
||||
);
|
||||
console.log("=".repeat(60));
|
||||
|
||||
try {
|
||||
await simpleWorkflow();
|
||||
} catch (error) {
|
||||
console.error(`Run #${runCount} failed:`, error.message);
|
||||
}
|
||||
|
||||
if (runCount >= maxRuns) {
|
||||
console.log(`\nCompleted ${maxRuns} scheduled runs. Stopping...`);
|
||||
process.exit(0);
|
||||
}
|
||||
},
|
||||
intervalMinutes * 60, // Convert to seconds
|
||||
maxRuns
|
||||
);
|
||||
|
||||
console.log("Cron job started. Press Ctrl+C to stop.\n");
|
||||
}
|
||||
|
||||
// Main entry point
|
||||
if (require.main === module) {
|
||||
const mode = process.env.MODE || "full"; // full, simple, or scheduled
|
||||
|
||||
console.log(`Mode: ${mode}\n`);
|
||||
|
||||
let workflow;
|
||||
if (mode === "simple") {
|
||||
workflow = simpleWorkflow();
|
||||
} else if (mode === "scheduled") {
|
||||
workflow = scheduledScraping();
|
||||
} else {
|
||||
workflow = fullScrapingWorkflow();
|
||||
}
|
||||
|
||||
workflow
|
||||
.then(() => {
|
||||
console.log("All done!");
|
||||
process.exit(0);
|
||||
})
|
||||
.catch((err) => {
|
||||
console.error("\nFatal error:", err);
|
||||
process.exit(1);
|
||||
});
|
||||
}
|
||||
|
||||
module.exports = {
|
||||
fullScrapingWorkflow,
|
||||
simpleWorkflow,
|
||||
scheduledScraping,
|
||||
};
|
||||
146
utils.js
Normal file
@@ -0,0 +1,146 @@
|
||||
function randomSleep(minMs = 2000, maxMs = 5000) {
|
||||
const delay = Math.floor(Math.random() * (maxMs - minMs + 1)) + minMs;
|
||||
return new Promise((res) => setTimeout(res, delay));
|
||||
}
|
||||
|
||||
async function humanLikeMouseMovement(page, steps = 10) {
|
||||
// Simulate human-like mouse movements across the page
|
||||
const viewport = await page.viewport();
|
||||
const width = viewport.width;
|
||||
const height = viewport.height;
|
||||
|
||||
for (let i = 0; i < steps; i++) {
|
||||
const x = Math.floor(Math.random() * width);
|
||||
const y = Math.floor(Math.random() * height);
|
||||
|
||||
await page.mouse.move(x, y, { steps: Math.floor(Math.random() * 10) + 5 });
|
||||
await randomSleep(100, 500);
|
||||
}
|
||||
}
|
||||
|
||||
async function randomScroll(page, scrollCount = 3) {
|
||||
// Perform random scrolling to simulate human behavior
|
||||
for (let i = 0; i < scrollCount; i++) {
|
||||
const scrollAmount = Math.floor(Math.random() * 300) + 100;
|
||||
await page.evaluate((amount) => {
|
||||
window.scrollBy(0, amount);
|
||||
}, scrollAmount);
|
||||
await randomSleep(800, 1500);
|
||||
}
|
||||
}
|
||||
|
||||
async function simulateHumanBehavior(page, options = {}) {
|
||||
// Combined function to simulate various human-like behaviors
|
||||
const { mouseMovements = 5, scrolls = 2, randomClicks = false } = options;
|
||||
|
||||
// Random mouse movements
|
||||
if (mouseMovements > 0) {
|
||||
await humanLikeMouseMovement(page, mouseMovements);
|
||||
}
|
||||
|
||||
// Random scrolling
|
||||
if (scrolls > 0) {
|
||||
await randomScroll(page, scrolls);
|
||||
}
|
||||
|
||||
// Optional: Random clicks on non-interactive elements
|
||||
if (randomClicks) {
|
||||
try {
|
||||
await page.evaluate(() => {
|
||||
const elements = document.querySelectorAll("div, span, p");
|
||||
if (elements.length > 0) {
|
||||
const randomElement =
|
||||
elements[Math.floor(Math.random() * elements.length)];
|
||||
const rect = randomElement.getBoundingClientRect();
|
||||
// Just move to it, don't actually click to avoid triggering actions
|
||||
}
|
||||
});
|
||||
} catch (err) {
|
||||
// Ignore errors from random element selection
|
||||
}
|
||||
}
|
||||
|
||||
await randomSleep(500, 1000);
|
||||
}
|
||||
|
||||
async function withRetry(fn, options = {}) {
|
||||
const {
|
||||
maxRetries = 3,
|
||||
initialDelay = 2000,
|
||||
maxDelay = 30000,
|
||||
shouldRetry = (error) => true,
|
||||
} = options;
|
||||
|
||||
for (let attempt = 0; attempt < maxRetries; attempt++) {
|
||||
try {
|
||||
return await fn();
|
||||
} catch (error) {
|
||||
const isLastAttempt = attempt === maxRetries - 1;
|
||||
|
||||
// Check if we should retry this error
|
||||
if (!shouldRetry(error) || isLastAttempt) {
|
||||
throw error;
|
||||
}
|
||||
|
||||
// Calculate exponential backoff delay: 2s, 4s, 8s, 16s, 30s (capped)
|
||||
const exponentialDelay = Math.min(
|
||||
initialDelay * Math.pow(2, attempt),
|
||||
maxDelay
|
||||
);
|
||||
|
||||
// Add jitter (randomize ±20%) to avoid thundering herd
|
||||
const jitter = exponentialDelay * (0.8 + Math.random() * 0.4);
|
||||
const delay = Math.floor(jitter);
|
||||
|
||||
console.log(
|
||||
`Retry attempt ${attempt + 1}/${maxRetries} after ${delay}ms delay...`
|
||||
);
|
||||
console.log(`Error: ${error.message || error}`);
|
||||
|
||||
await randomSleep(delay, delay);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
async function handleRateLimitedRequest(page, requestFn, context = "") {
|
||||
return withRetry(requestFn, {
|
||||
maxRetries: 5,
|
||||
initialDelay: 2000,
|
||||
maxDelay: 60000,
|
||||
shouldRetry: (error) => {
|
||||
// Retry on rate limit (429) or temporary errors
|
||||
if (error.status === 429 || error.statusCode === 429) {
|
||||
console.log(`Rate limited (429) ${context}. Backing off...`);
|
||||
return true;
|
||||
}
|
||||
|
||||
// Retry on 5xx server errors
|
||||
if (error.status >= 500 || error.statusCode >= 500) {
|
||||
console.log(
|
||||
`Server error (${
|
||||
error.status || error.statusCode
|
||||
}) ${context}. Retrying...`
|
||||
);
|
||||
return true;
|
||||
}
|
||||
|
||||
// Retry on network errors
|
||||
if (error.code === "ECONNRESET" || error.code === "ETIMEDOUT") {
|
||||
console.log(`Network error (${error.code}) ${context}. Retrying...`);
|
||||
return true;
|
||||
}
|
||||
|
||||
// Don't retry on client errors (4xx except 429)
|
||||
return false;
|
||||
},
|
||||
});
|
||||
}
|
||||
|
||||
module.exports = {
|
||||
randomSleep,
|
||||
humanLikeMouseMovement,
|
||||
randomScroll,
|
||||
simulateHumanBehavior,
|
||||
withRetry,
|
||||
handleRateLimitedRequest,
|
||||
};
|
||||