A Step-by-Step Guide to php-llmscan
Large language models (LLMs) increasingly pull information from websites, but they don’t “browse” the way humans do. They read fragments, often out of context, and can easily misrepresent your product, service, or documentation.
That’s where llms.txt comes in. Inspired by robots.txt, this simple file tells LLMs: “Here’s what this site is actually responsible for, and here’s where the real technical truth lives.”
If you run a WordPress site (or any site with a sitemap), you can now generate a compliant llms.txt file without Python, Docker, or complex dependencies, using just standard PHP. Here’s how.
Why Bother With llms.txt?
Without explicit guidance, LLMs often:
- Flatten multi-feature tools into vague labels
- Distill a multifaceted business website down to its bare basics
- Confuse marketing pages with technical docs
- Treat blog posts as authoritative sources
An llms.txt file solves this by acting as a machine-readable index of responsibility. It answers three questions clearly:
- Who maintains this site?
- What software or services does it represent?
- Where is the clean, factual documentation?
The llmstxt.org spec makes this simple: a root-level /llms.txt file linking to neutral, Markdown-formatted pages stripped of fluff, CTAs, and pricing.
Enter php-llmscan: LLM Docs for PHP Hosts
Most existing tools require Python, but 90% of WordPress sites run on shared PHP hosting, where installing extra runtimes isn’t an option.
php-llmscan fixes that. It’s a lightweight, MIT-licensed script that:
- Reads your sitemap.xml
- Uses AI (OpenAI or DeepSeek) to filter out non-technical pages (blogs, legal, marketing, etc.)
- Converts valid pages into clean .html.md files
- Generates a compliant llms.txt in your site root
No bloat. Just facts.
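To make the first step concrete, here is a minimal sketch of pulling page URLs out of a sitemap. It is not php-llmscan’s actual code: the function name extractSitemapUrls is made up for illustration, and it assumes PHP’s DOM extension is available (it is enabled by default on most builds).

```php
<?php
// Illustrative helper (hypothetical name), not part of php-llmscan itself.
// Assumes PHP's DOM extension, which is enabled by default on most builds.
function extractSitemapUrls(string $sitemapXml): array
{
    if (trim($sitemapXml) === '') {
        return [];
    }

    // Collect parse errors quietly instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $document = new DOMDocument();
    $loaded = $document->loadXML($sitemapXml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($loaded === false) {
        return [];
    }

    $urls = [];
    // Every sitemap entry carries its address in a <loc> element.
    foreach ($document->getElementsByTagName('loc') as $loc) {
        $url = trim($loc->textContent);
        if ($url !== '') {
            $urls[] = $url;
        }
    }

    // Drop duplicates and reindex the result.
    return array_values(array_unique($urls));
}
```

Fetching the XML (for example with curl) is left out on purpose. Note that a sitemap index file also uses <loc> elements, so for an index this sketch returns the child sitemap URLs rather than page URLs.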
💡 Good to know: This same functionality will be built into Blue Raven Pro 1.4, so WordPress users won’t need to run CLI scripts soon. But for now, anyone with PHP 8.0+ can use it.
Step 1: Check Requirements
You’ll need:
- PHP 8.0 or higher (with curl, json, and PCRE, enabled by default on most hosts)
- A publicly accessible sitemap.xml
- An OpenAI or DeepSeek API key (cost is minimal: $5–$10/year for most small sites)
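If you’re not sure what your host provides, a quick check along these lines can tell you. This is a sketch, not something shipped with php-llmscan; missingRequirements is a hypothetical helper name, and it only tests the PHP version and the three extensions listed above.

```php
<?php
// Hypothetical helper: lists which of the stated requirements are missing.
function missingRequirements(string $phpVersion, array $loadedExtensions): array
{
    $missing = [];

    if (version_compare($phpVersion, '8.0.0', '<')) {
        $missing[] = 'PHP 8.0 or higher (found ' . $phpVersion . ')';
    }

    // Extension names are compared case-insensitively.
    $loaded = array_map('strtolower', $loadedExtensions);
    foreach (['curl', 'json', 'pcre'] as $extension) {
        if (!in_array($extension, $loaded, true)) {
            $missing[] = 'extension: ' . $extension;
        }
    }

    return $missing;
}

// Report on the PHP binary that is running this file.
foreach (missingRequirements(PHP_VERSION, get_loaded_extensions()) as $item) {
    echo 'Missing: ' . $item . "\n";
}
```

Run it with the same PHP binary your cron job will use, since the CLI and web versions of PHP on a shared host can differ.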
⚠️ Important: The script must run via the command line or cron; it blocks web access for security.
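The article doesn’t describe how php-llmscan enforces this, but one common way for a PHP script to refuse web requests is to check the PHP_SAPI constant. A minimal sketch of that general technique, with a hypothetical helper name:

```php
<?php
// Hypothetical helper: true only for the command-line SAPI.
function isCommandLine(string $sapi): bool
{
    return $sapi === 'cli';
}

// PHP_SAPI is 'cli' for terminal and typical cron runs; web requests
// report values such as 'fpm-fcgi' or 'apache2handler'.
if (!isCommandLine(PHP_SAPI)) {
    http_response_code(403);
    exit("This script can only be run from the command line.\n");
}
```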
Step 2: Install & Configure
- Download the latest release from GitHub
- Rename config files:
  - llms-config.php.example → llms-config.php
  - oai-api.php.example → oai-api.php (or ds-api.php for DeepSeek)
- Store API keys securely outside your web root (e.g., /home/user/secrets/oai-api.php)
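As a rough illustration of why the location matters, the sketch below loads a key file that returns ['api_key' => '...'] and refuses to use it if it sits inside the web root. The loadApiKey function is hypothetical and is not how php-llmscan necessarily reads its key; treat it as a sanity check you could run yourself.

```php
<?php
// Hypothetical helper: loads an API key and rejects unsafe locations.
function loadApiKey(string $keyFile, string $webRoot): string
{
    $keyPath = realpath($keyFile);
    if ($keyPath === false || !is_readable($keyPath)) {
        throw new RuntimeException('Key file not found or unreadable: ' . $keyFile);
    }

    // Refuse key files that a web server could end up serving.
    $rootPath = realpath($webRoot);
    if ($rootPath !== false) {
        $prefix = rtrim($rootPath, DIRECTORY_SEPARATOR) . DIRECTORY_SEPARATOR;
        if (str_starts_with($keyPath, $prefix)) {
            throw new RuntimeException('Key file is inside the web root; move it elsewhere.');
        }
    }

    // The key file is plain PHP that returns an array.
    $config = require $keyPath;
    if (!is_array($config) || !isset($config['api_key'])
        || !is_string($config['api_key']) || $config['api_key'] === '') {
        throw new RuntimeException("Key file must return ['api_key' => '...'].");
    }

    return $config['api_key'];
}
```

For example, loadApiKey('/home/user/secrets/oai-api.php', '/var/www/yourdomain.com') returns the key, while a key file placed under /var/www/yourdomain.com triggers an exception.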
Edit llms-config.php:
<?php
return [
    'sitemap_url' => 'https://yourdomain.com/sitemap.xml',
    // Keep the key file outside the web root.
    'openai_api_key_file' => '/home/user/secrets/oai-api.php',
    'ai_engine' => 'openai', // or 'deepseek'
    'web_root' => '/var/www/yourdomain.com',
    'llms_output_dir' => '/var/www/yourdomain.com/llms',
    'project_name' => 'Your Project Name',
    'project_summary' => 'A brief neutral summary of what this site offers.',
    'site_url' => 'https://yourdomain.com',
    // Existing .html.md files younger than this are reused.
    'regenerate_after_days' => 90,
    'skip_non_technical_cache' => true,
];
Your ‘project_name’ should be your business or organization name, such as “Smith and Sons” or “Montana State University”. If the site is code-related, use the project name instead, such as “PHP-LLMSCAN” or “Postfix Mail Server”.
✅ Pro tip: Set 'regenerate_after_days' => 90 to avoid wasting tokens on unchanged content.
Step 3: Run the Script
From your server terminal:
php /path/to/php-llmscan/php-llmscan.php
Or schedule weekly updates via cron:
# Every Sunday at 2:30 AM
30 2 * * 0 /usr/bin/php /path/to/php-llmscan/php-llmscan.php
The script will:
- Skip pages already marked as “non-technical”
- Reuse existing .html.md files if younger than 90 days (or as set in your config file)
- Generate new Markdown only when needed
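The reuse rule above can be pictured as a simple age check. The sketch below assumes the age is measured from the file’s modification time, which the article doesn’t confirm, and needsRegeneration is a hypothetical name rather than a function from the script.

```php
<?php
// Hypothetical helper: decides whether a Markdown file should be rebuilt.
// Assumption: "age" means time since the file was last modified.
function needsRegeneration(string $markdownFile, int $maxAgeDays, ?int $now = null): bool
{
    // Avoid stale results from PHP's stat cache.
    clearstatcache(true, $markdownFile);

    // A file that does not exist yet always needs generating.
    if (!is_file($markdownFile)) {
        return true;
    }

    $now ??= time();
    $ageSeconds = $now - filemtime($markdownFile);

    return $ageSeconds > $maxAgeDays * 86400;
}
```

With 'regenerate_after_days' => 90, a file last written 91 days ago would be rebuilt, while one written yesterday would be reused and cost no tokens.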
Output appears in:
- /llms/page-slug.html.md (publicly accessible)
- /llms.txt (in your site root)
Step 4: Verify Your Output
Visit:
- https://yourdomain.com/llms.txt
- https://yourdomain.com/llms/some-page.html.md
Your llms.txt should look like this:
# Your Project Name
> A brief neutral summary of what this site offers.
## Documentation
- [/llms/product-setup.html.md](https://yourdomain.com/llms/product-setup.html.md): Explains how to configure core features.
- [/llms/api-reference.html.md](https://yourdomain.com/llms/api-reference.html.md): Lists all available endpoints and parameters.
Notice: no hype, no pricing, no “best-in-class” claims. Just facts.
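If you ever want to build or patch this file by hand, the structure is simple enough to generate in a few lines. The sketch below reproduces the layout shown above (title, blockquote summary, one Documentation section of links); renderLlmsTxt and its parameters are hypothetical and not taken from php-llmscan.

```php
<?php
// Hypothetical helper: renders an llms.txt body from a title, a summary,
// and a map of page paths to one-line descriptions.
function renderLlmsTxt(string $projectName, string $projectSummary, string $siteUrl, array $pages): string
{
    $lines = [
        '# ' . $projectName,
        '',
        '> ' . $projectSummary,
        '',
        '## Documentation',
        '',
    ];

    foreach ($pages as $path => $description) {
        // Build an absolute URL so the link works off-site too.
        $url = rtrim($siteUrl, '/') . $path;
        $lines[] = sprintf('- [%s](%s): %s', $path, $url, $description);
    }

    return implode("\n", $lines) . "\n";
}
```

Calling it with 'Your Project Name', the summary, 'https://yourdomain.com', and ['/llms/product-setup.html.md' => 'Explains how to configure core features.'] yields the first entry of the example above, with blank lines separating the title, summary, and section heading.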
Best Practices
- Link to documentation, not landing pages. LLMs need explanations, not sales pitches.
- Keep descriptions short and factual. Example: “Configures automatic redirects for deleted WooCommerce products.”
- Exclude blogs, changelogs, and testimonials. They add noise, not authority.
- Use absolute URLs if your docs might be consumed off-site.
For more on what to include (and what to leave out), see our guide: Why We Publish llms.txt—and What’s in Ours.
Troubleshooting
- “No URLs found”? Double-check your sitemap.xml is public and valid.
- API errors? Ensure your key file returns ['api_key' => 'sk-...'] and isn’t web-accessible.
- Markdown looks messy? The AI strips HTML, but very complex layouts may need manual cleanup.
- File permissions denied? Make sure your output directories are writable by the PHP user.
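For the permissions case, a quick check like the one below can confirm which directories the PHP user cannot write to. It is a sketch with a hypothetical helper name, unwritablePaths, and should be run as the same user that runs the script or cron job.

```php
<?php
// Hypothetical helper: returns the directories PHP cannot write to.
function unwritablePaths(array $paths): array
{
    $problems = [];

    foreach ($paths as $path) {
        // Avoid stale results from PHP's stat cache.
        clearstatcache(true, $path);

        // A missing directory counts as a problem too.
        if (!is_dir($path) || !is_writable($path)) {
            $problems[] = $path;
        }
    }

    return $problems;
}
```

Pass it the 'web_root' and 'llms_output_dir' values from llms-config.php; an empty result means both locations are writable.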
You don’t need to wait for Blue Raven Pro 1.4 to make your site LLM-friendly. With php-llmscan, you can publish a clear, honest signal to AI systems today, so they represent your work accurately rather than as a vague blur of marketing speak.
Be precise. Be boring. Be useful.
That’s how you earn trust from humans and machines.
Get started: github.com/enterrahost/php-llmscan
Live example: enterrahost.com/llms.txt