Jacek Białas
Server log analysis for SEO
The reports in Google Search Console are useful, but they only show Google’s interpretation of the data. If you want the uncensored truth about how Googlebot (and other bots) sees and crawls your site, you have to go to the source: raw server logs. This is the only place you’ll find every single bot request, every server response it encounters, and every wasted byte of your crawl budget.
This 5-step guide will show you exactly how to conduct such an analysis to make SEO decisions based on hard data, not guesswork.
Step 1. Get access to raw server logs
Server logs are text files that record every single request made to the server. Before you can begin your analysis, you need to get these files.
- Where to find them? The location depends on your server configuration, but the most common paths are /var/log/apache2/ (Apache) and /var/log/nginx/ (Nginx).
- How to download them? Typically via SSH or FTP, or from the “Raw Access Logs” section of your hosting panel.
You need the raw access logs (access.log), not the error logs (error.log). These files can be very large (several gigabytes for popular sites), so ensure you have enough disk space.
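Before committing to a multi-gigabyte download and parse, it helps to sample the first few lines and confirm you are looking at the access log format you expect. A minimal sketch (the access.log filename is a placeholder for your actual log path):

```python
import os
from itertools import islice

def peek_log(path, n=5):
    """Return only the first n lines of a potentially huge log file,
    without loading the whole file into memory."""
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        return list(islice(f, n))

# Guarded usage: 'access.log' is a placeholder path
if os.path.exists("access.log"):
    for line in peek_log("access.log"):
        print(line.rstrip())
```

If the sampled lines don’t look like the standard combined log format (IP, timestamp, request, status, user agent), check with your host which log format is configured before parsing.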
Step 2. Choose and configure an analysis tool
Manually reading millions of lines in a text file is impossible. You need specialized software to process this data and present it in a readable format.
- Screaming Frog SEO Log File Analyser (Paid, with a free version) – this is the industry standard. It’s relatively inexpensive and incredibly powerful. The free version allows you to analyze up to 1,000 log lines, which is enough to get familiar with the tool.
- Other options: open-source log analyzers (e.g., GoAccess or the ELK stack), or a custom script like the Python example below.
Example script in Python
Here’s the Python code. You can save it as a .py file, e.g., seo_log_analyzer.py.
```python
import re
import socket
from collections import Counter
import concurrent.futures
import os  # For path handling and file existence checks

# --- Configuration ---
# Specify the path to your log file
# Example: LOG_FILE_PATH = 'C:/logs/access.log' or '/var/log/apache2/access.log'
LOG_FILE_PATH = 'access.log'

# Number of top URLs to display in the report
TOP_URLS_COUNT = 20

# Regular expression to parse the standard Apache/Nginx log format
# Format: IP - - [timestamp] "Method URL Protocol" Status_Code Size "Referer" "User-Agent"
LOG_REGEX = re.compile(
    r'(?P<ip>[\d\.]+) - - \[(?P<timestamp>.*?)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>.*?)" '
    r'(?P<status_code>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>.*?)" "(?P<user_agent>.*?)"'
)

# Cache to store IP verification results and avoid repeated DNS queries
verified_ips = {}


def is_verified_googlebot(ip):
    """
    Checks if a given IP address belongs to Googlebot via reverse DNS lookup.
    This is critical to avoid analyzing requests from spoofed bots.
    """
    if ip in verified_ips:
        return verified_ips[ip]
    try:
        # Reverse DNS lookup to get the hostname
        hostname = socket.gethostbyaddr(ip)[0]
        # Forward DNS lookup on the hostname to get its IP addresses.
        # This double check ensures the IP and hostname are legitimate.
        if hostname.endswith(('.googlebot.com', '.google.com')):
            for resolved_ip in socket.gethostbyname_ex(hostname)[2]:
                if resolved_ip == ip:
                    verified_ips[ip] = True
                    return True
        verified_ips[ip] = False
        return False
    except (socket.herror, socket.gaierror):
        # Reverse DNS lookup failed (e.g., no PTR record exists)
        verified_ips[ip] = False
        return False


def analyze_log_file(log_path):
    """
    Main function: parses the log file, filters for Googlebot, verifies
    its authenticity, and aggregates key data.
    """
    googlebot_requests = []
    if not os.path.exists(log_path):
        print(f"ERROR: File '{log_path}' not found. Please check the path.")
        return None

    print(f"Starting analysis of file: {log_path}...")
    try:
        # errors='ignore' handles malformed characters
        with open(log_path, 'r', encoding='utf-8', errors='ignore') as f:
            for line in f:
                # Preliminary string filter for efficiency
                if 'Googlebot' in line:
                    match = LOG_REGEX.match(line)
                    if match:
                        googlebot_requests.append(match.groupdict())
    except Exception as e:
        print(f"An unexpected error occurred while reading the file: {e}")
        return None

    if not googlebot_requests:
        print("No requests from Googlebot (or lines containing 'Googlebot') found in the file.")
        return None

    print(f"Found {len(googlebot_requests)} potential Googlebot requests. "
          "Starting IP verification (this may take some time for large files)...")

    verified_requests = []
    # Use multiple threads to speed up DNS queries, which can be slow for many IPs
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
        # Map each IP verification task to the executor, keyed by the original request
        future_to_ip = {
            executor.submit(is_verified_googlebot, req['ip']): req
            for req in googlebot_requests
        }
        for future in concurrent.futures.as_completed(future_to_ip):
            req = future_to_ip[future]
            try:
                if future.result():
                    verified_requests.append(req)
            except Exception as exc:
                print(f"IP verification generated an exception for {req['ip']}: {exc}")

    print(f"Successfully verified {len(verified_requests)} requests from authentic Googlebot IPs.")
    if not verified_requests:
        print("No verified Googlebot requests found after IP verification.")
        return None

    # Data analysis on verified requests only
    crawled_urls = Counter(req['url'] for req in verified_requests)
    status_codes = Counter(req['status_code'] for req in verified_requests)

    return {
        "total_googlebot_hits": len(googlebot_requests),
        "verified_googlebot_hits": len(verified_requests),
        "top_urls": crawled_urls.most_common(TOP_URLS_COUNT),
        "status_codes": status_codes,
    }


def print_report(results):
    """Prints a formatted, human-readable report from the analysis results."""
    if not results:
        return
    print("\n" + "=" * 40)
    print("      SEO LOG ANALYSIS REPORT")
    print("=" * 40)
    print(f"Total 'Googlebot' User-Agent hits: {results['total_googlebot_hits']}")
    print(f"Number of **verified** Googlebot hits: {results['verified_googlebot_hits']}")
    print("-" * 40)

    print("\n## Googlebot Response Status Codes Breakdown:")
    status_summary = Counter()
    for code, count in sorted(results['status_codes'].items()):
        category = f"{code[0]}xx"  # e.g., 2xx (Success), 4xx (Client Error)
        status_summary[category] += count
        print(f" - Code {code}: {count} times")

    print("\nStatus code category summary:")
    for category, count in sorted(status_summary.items()):
        print(f" - Category {category}: {count} times")
    print("-" * 40)

    print(f"\n## TOP {TOP_URLS_COUNT} Most Crawled URLs by Googlebot:")
    if results['top_urls']:
        for i, (url, count) in enumerate(results['top_urls'], 1):
            print(f"{i:2d}. {url} ({count} times)")
    else:
        print("No specific URLs found in verified Googlebot requests.")
    print("-" * 40)

    print("\nActionable Insights:")
    print(" - High 4xx/5xx codes suggest broken internal links or server issues. Investigate these URLs.")
    print(" - If Googlebot frequently crawls low-value URLs (e.g., filtered results, old content), "
          "consider using robots.txt or canonical tags to manage crawl budget.")
    print(" - Pages with low crawl frequency but high importance might need improved internal linking or sitemap updates.")
    print("=" * 40)


# --- Script Execution ---
if __name__ == "__main__":
    # The following block is for testing purposes only. When analyzing your
    # actual log file, remove or comment it out. It creates a dummy
    # access.log file to demonstrate functionality.
    dummy_log_data = """\
66.249.75.132 - - [26/Sep/2025:12:05:11 +0200] "GET /home-page HTTP/1.1" 200 15214 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
123.123.123.123 - - [26/Sep/2025:12:05:12 +0200] "GET /secret-data HTTP/1.1" 200 800 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.75.135 - - [26/Sep/2025:12:05:13 +0200] "GET /products/new-model HTTP/1.1" 200 2345 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.75.132 - - [26/Sep/2025:12:05:14 +0200] "GET /old-article HTTP/1.1" 404 300 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
192.168.1.1 - - [26/Sep/2025:12:05:15 +0200] "GET /home-page HTTP/1.1" 200 15214 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
66.249.75.135 - - [26/Sep/2025:12:05:16 +0200] "GET /products/new-model HTTP/1.1" 200 2345 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.75.132 - - [26/Sep/2025:12:05:17 +0200] "GET /home-page HTTP/1.1" 200 15214 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.75.132 - - [26/Sep/2025:12:05:18 +0200] "GET /assets/style.css HTTP/1.1" 200 5000 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.75.132 - - [26/Sep/2025:12:05:19 +0200] "GET /api/data HTTP/1.1" 503 100 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
"""
    # Write the dummy data to the configured log file path
    with open(LOG_FILE_PATH, 'w') as f:
        f.write(dummy_log_data)

    analysis_results = analyze_log_file(LOG_FILE_PATH)
    print_report(analysis_results)

    # Clean up the dummy log file after analysis
    try:
        os.remove(LOG_FILE_PATH)
        print(f"\nCleaned up dummy log file: {LOG_FILE_PATH}")
    except OSError as e:
        print(f"Error removing dummy log file: {e}")
```
How to use the script
Prerequisites:
- Python – ensure Python 3.x is installed on your system.
- Log file – obtain your access.log file(s) from your web server (Apache, Nginx). You can usually find these in /var/log/apache2/ or /var/log/nginx/ via SSH/FTP, or download them from your hosting panel’s “Raw Access Logs” section.
- Placement – place your access.log file(s) in the same directory as your Python script, or update the LOG_FILE_PATH variable to point to their exact location.
Save the Script – save the code above as a Python file (e.g., seo_log_analyzer.py).
Prepare your log file:
- If your log file is compressed (e.g., access.log.gz), you’ll need to decompress it first. You can use tools like gunzip on Linux/macOS or 7-Zip on Windows.
- If you have multiple log files (e.g., access.log.1, access.log.2), you can concatenate them into one large file for a comprehensive analysis using cat access.log.* > combined_access.log (Linux/macOS) or combine them manually. Then update LOG_FILE_PATH to point to this combined file.
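The cat command above is Linux/macOS-only. If you need a cross-platform equivalent that also handles gzipped rotations without a separate decompression step, here is a small Python sketch; the filename pattern is an assumption about your rotation scheme:

```python
import glob
import gzip

def combine_logs(pattern, output_path):
    """Concatenate plain and gzipped rotated logs into one file for analysis."""
    with open(output_path, "w", encoding="utf-8") as out:
        for path in sorted(glob.glob(pattern)):
            # gzip.open transparently decompresses .gz rotations
            opener = gzip.open if path.endswith(".gz") else open
            with opener(path, "rt", encoding="utf-8", errors="ignore") as f:
                for line in f:
                    out.write(line)

# e.g. combine_logs("access.log.*", "combined_access.log")
```

Note that sorted() orders files lexicographically (access.log.1, access.log.10, access.log.2, ...); for the counting analysis in this guide the line order doesn’t matter, only completeness.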
Configure LOG_FILE_PATH:
- Open seo_log_analyzer.py in a text editor.
- Locate the LOG_FILE_PATH variable.
- Change 'access.log' to the actual name/path of your log file. For example: LOG_FILE_PATH = 'my_website_access_logs.log' or LOG_FILE_PATH = '/path/to/your/logs/access.log'.
Remove dummy data block:
- The if __name__ == "__main__" block at the bottom of the script contains dummy data creation for testing. When analyzing your real logs, you must remove or comment out the lines that create the dummy_log_data and write it to LOG_FILE_PATH. Keep only the analysis_results = analyze_log_file(LOG_FILE_PATH) and print_report(analysis_results) lines.
- The cleanup call os.remove(LOG_FILE_PATH) at the end should also be removed if you’re working with your actual log files, as you don’t want to delete them.
Run the script:
- Open your terminal or command prompt.
- Navigate to the directory where you saved seo_log_analyzer.py and your log file.
- Execute the script using:
python seo_log_analyzer.py
Interpreting the report & taking action
The script will output a detailed report directly to your console, similar to this:
Starting analysis of file: access.log...
Found 8 potential Googlebot requests. Starting IP verification (this may take some time for large files)...
Successfully verified 7 requests from authentic Googlebot IPs.
========================================
SEO LOG ANALYSIS REPORT
========================================
Total 'Googlebot' User-Agent hits: 8
Number of **verified** Googlebot hits: 7
----------------------------------------
## Googlebot Response Status Codes Breakdown:
- Code 200: 5 times
- Code 404: 1 times
- Code 503: 1 times
Status code category summary:
- Category 2xx: 5 times
- Category 4xx: 1 times
- Category 5xx: 1 times
----------------------------------------
## TOP 20 Most Crawled URLs by Googlebot:
1. /home-page (2 times)
2. /products/new-model (2 times)
3. /old-article (1 times)
4. /assets/style.css (1 times)
5. /api/data (1 times)
----------------------------------------
Actionable Insights:
- High 4xx/5xx codes suggest broken internal links or server issues. Investigate these URLs.
- If Googlebot frequently crawls low-value URLs (e.g., filtered results, old content), consider using robots.txt or canonical tags to manage crawl budget.
- Pages with low crawl frequency but high importance might need improved internal linking or sitemap updates.
========================================
Configuration in Screaming Frog SEO Log File Analyser:
- Launch the program and create a new project.
- Drag and drop your .log or .gz file(s) into the program window.
- The tool will automatically start processing the data. In the “User Agents” tab, you’ll see a list of all bots that have visited your site.
Step 3. Identify and verify Googlebot’s activity
Not every request with a “Googlebot” User-Agent actually comes from Google. Malicious bots often spoof their user agent to bypass security measures. That’s why verification is critical.
In the Screaming Frog Log File Analyser, navigate to the “Bots” tab. The tool automatically performs a reverse DNS lookup to verify if the request’s IP address truly belongs to Google. You will see a breakdown of “Verified Bots” and “Spoofed Bots.” For your analysis, only consider the verified bots.
Step 4. Analyze key data – what is Googlebot actually doing?
This is the core of the entire process. You must now interpret the data to understand the bot’s behavior. Focus on the following reports and metrics:
- Most Crawled URLs (URLs -> All URLs)
- Server Response Codes (Response Codes)
- Crawl Waste
- Crawl Frequency (Events)
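Of these metrics, crawl waste is the one the script above doesn’t report directly, but it can be sketched from the same parsed requests: count verified-bot hits whose URL carries a query string or matches known low-value path patterns. The LOW_VALUE_PREFIXES below are illustrative assumptions, not a universal list — adapt them to your site:

```python
from collections import Counter
from urllib.parse import urlsplit

# Illustrative low-value path patterns; adjust for your own site structure
LOW_VALUE_PREFIXES = ("/search", "/tag/", "/cart")

def crawl_waste(urls):
    """Split crawled URLs into 'clean' and likely wasteful buckets."""
    clean, waste = Counter(), Counter()
    for url in urls:
        parts = urlsplit(url)
        # Query strings (?sort=, ?filter=) and utility paths rarely deserve crawl budget
        if parts.query or parts.path.startswith(LOW_VALUE_PREFIXES):
            waste[url] += 1
        else:
            clean[url] += 1
    return clean, waste

clean, waste = crawl_waste([
    "/products/new-model",
    "/products?sort=price",
    "/products?sort=price",
    "/cart/add",
])
print(f"Wasted hits: {sum(waste.values())} of {sum(waste.values()) + sum(clean.values())}")
# → Wasted hits: 3 of 4
```

Feeding it the url fields from the script’s verified_requests gives you a rough waste percentage to track over time.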
Step 5. Take concrete actions based on your analysis
The analysis itself is worthless without implementation. Here is a table of common problems and their solutions:
| Problem Identified in Logs | Specific Action to Take |
| --- | --- |
| Googlebot frequently hits pages that return a 404 error. | 1. Identify these URLs. 2. If they have valuable replacements, set up 301 redirects. 3. Fix the internal links that point to these broken pages. |
| The bot wastes time on URLs with parameters (e.g., ?sort=price). | 1. Block these parameters in your robots.txt file using the Disallow directive. 2. Use the rel="canonical" tag to point to the “clean” version of the URL. |
| The most important business pages are rarely crawled. | 1. Increase the number of internal links pointing to these pages. 2. Ensure they are in your sitemap.xml. |
| 5xx server errors appear in the logs. | 1. Immediately contact your server administrator or hosting company. 2. Analyze the error.log files to diagnose the cause. |
| The bot is crawling non-canonical versions of pages (e.g., with and without www). | Implement server-level 301 redirects to force a single, preferred version of your domain. |
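For the “rarely crawled important pages” row, a quick way to surface candidates is to diff your sitemap’s paths against the verified-crawl counter the script already builds. A minimal sketch — the sitemap paths and counts below are hypothetical examples:

```python
from collections import Counter

def never_crawled(sitemap_paths, crawled_counter):
    """Return sitemap paths Googlebot never requested —
    candidates for stronger internal linking."""
    return [p for p in sitemap_paths if crawled_counter.get(p, 0) == 0]

# Hypothetical inputs: in practice, parse sitemap.xml for the paths and
# reuse the crawled_urls Counter from the analysis script
crawled = Counter({"/home-page": 3, "/products/new-model": 2})
print(never_crawled(["/home-page", "/pricing", "/contact"], crawled))
# → ['/pricing', '/contact']
```

Run this over a few weeks of logs rather than a single day, since Googlebot may legitimately visit low-priority pages only occasionally.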
Server log analysis is one of the most powerful techniques in a technical SEO’s toolkit. It allows you to stop guessing and start acting based on hard, undeniable evidence. It shows you where you’re losing money, where Google is encountering problems, and which elements of your site require immediate attention. Dedicate one day to this analysis, and you’ll get a concrete to-do list that will yield far better results than months of “creative” marketing.