Service: Technical GEO Audit & Implementation Provider: Cintra (cintra.run) Included In: Enterprise plan | Technical audit included in the free audit
Most GEO guides focus on content strategy. This page covers the technical layer underneath it: the AI crawlers that index your site, the robots.txt directives that control their access, the schema markup types that increase citation rates, and an honest assessment of llms.txt — the most discussed and least impactful technical GEO tactic of 2025. Technical GEO is the difference between a brand whose content AI engines can read and a brand whose content is effectively invisible, even if it's excellent.
Which AI crawlers are currently indexing websites?
Seven major AI engine crawlers actively index web content for use in their models and search products. Each has a distinct user agent that robots.txt can allow or block.
| AI Engine | Crawler User Agent | What It Powers |
|---|---|---|
| OpenAI (ChatGPT) | GPTBot | ChatGPT web browsing and real-time answers |
| Perplexity AI | PerplexityBot | Perplexity search and citations |
| Anthropic (Claude) | ClaudeBot / anthropic-ai | Claude's real-time web access |
| Google | Googlebot / Google-Extended | Google AI Overviews, Gemini |
| xAI (Grok) | xAIBot | Grok real-time answers |
| Meta | Meta-ExternalAgent | Llama-based products |
| Common Crawl | CCBot | Training data for multiple models |
Critical configuration check: Many brands inadvertently block AI crawlers through overly broad robots.txt rules written when only Google's crawlers mattered. A Disallow: / directive intended for one crawler, or imprecise user agent matching, can lock every AI crawler out of your entire site.
The baseline technical audit always starts here — confirming which crawlers can actually access your content.
How should I configure robots.txt for AI crawlers?
The default position for most brands should be: allow all major AI crawlers access to all publicly indexed content.
Blocking AI crawlers from your site is equivalent to opting out of AI search entirely. If ChatGPT's GPTBot can't read your pages, your content can't be cited in ChatGPT answers. If PerplexityBot is blocked, Perplexity can't include you in its research answers. The only coherent reason to block an AI crawler is if you have content you don't want used in AI training or citation — which is rare for most marketing sites.
Recommended robots.txt configuration:
```
User-agent: GPTBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: xAIBot
Allow: /
```
If you need to restrict specific sections (e.g., gated content, internal tools, staging areas):
```
User-agent: GPTBot
Disallow: /members/
Disallow: /api/
Allow: /
```
Always test your current configuration against each user agent before making changes. Common errors include Disallow: * (wildcard that blocks everything) and overly broad path disallows that catch more than intended.
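One way to run that test is with Python's standard-library `urllib.robotparser`, which evaluates a robots.txt body against any user agent. A minimal sketch, with an illustrative robots.txt and a hypothetical content path — replace both with your own:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: a broad wildcard group plus an explicit GPTBot
# group. Replace with the actual contents of your /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /api/

User-agent: GPTBot
Allow: /
"""

AI_AGENTS = ["GPTBot", "PerplexityBot", "ClaudeBot",
             "anthropic-ai", "Google-Extended", "xAIBot"]

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Check a representative content page for each AI crawler.
for agent in AI_AGENTS:
    status = "allowed" if rp.can_fetch(agent, "/blog/geo-guide") else "BLOCKED"
    print(f"{agent}: {status}")
```

Here the `*` group governs every crawler without its own group, so the path disallow on /api/ leaves content pages reachable. Swap that rule for Disallow: / and every agent except GPTBot loses access to the whole site — exactly the misconfiguration the baseline audit is designed to catch.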
What is llms.txt and does it actually work?
llms.txt is a proposed standard for providing AI models with a structured, markdown-formatted index of your website content — similar in concept to robots.txt or sitemap.xml but designed specifically for LLM consumption.
The adoption reality as of 2025: llms.txt accounts for 0.1% of AI crawler visits. Major AI engines — including OpenAI, Anthropic, and Perplexity — have not publicly committed to using llms.txt as a citation or crawling signal. The spec was proposed by fast.ai and has garnered community interest, but no confirmed adoption from the engines that actually produce AI search results.
When llms.txt might help:
- Technical documentation sites where the llms.txt provides better structure than crawling raw HTML
- Sites with large amounts of content where a curated index improves AI model comprehension
- Brands experimenting with emerging standards as an early-adopter signal
When llms.txt doesn't help:
- Any situation where you expect it to replace structured schema markup, quality content, or authority signals
- Sites where the real issue is robots.txt blocking, not content comprehensibility
The honest verdict: implement llms.txt if you're a technical brand with complex documentation and you want to explore the emerging standard. Don't prioritize it over the foundational technical GEO work below — structured data, crawlability, and page speed have orders of magnitude more confirmed impact.
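If you do experiment with it, the proposed format is plain markdown: an H1 with the site name, a blockquote summary, then H2 sections listing key URLs. A minimal illustrative file — the paths and descriptions below are placeholders, not Cintra's actual llms.txt:

```markdown
# Cintra

> Cintra is an AI visibility (GEO) agency. This file indexes the pages most useful to LLMs.

## Docs

- [Technical GEO Audit](https://cintra.run/technical-geo): AI crawler access, robots.txt, schema markup
- [Pricing](https://cintra.run/pricing): Plan comparison, including the Enterprise technical tier
```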
Which JSON-LD schema types most improve AI citation rates?
Six JSON-LD schema types have measurable positive correlation with AI citation rates. Implementing all six creates the structured data layer that AI crawlers use to understand, categorize, and confidently cite your content.
1. Article / BlogPosting
Marks content as a substantive editorial piece with a clear author, publication date, and topic. AI engines weight Article schema when deciding whether a page represents authoritative editorial content vs. thin marketing copy.
```json
{
  "@context": "https://schema.org",
  "@type": "BlogPosting",
  "headline": "How GEO Builds AI Visibility",
  "author": {"@type": "Person", "name": "Tanush Yadav"},
  "datePublished": "2026-04-16",
  "publisher": {"@type": "Organization", "name": "Cintra"}
}
```
2. FAQPage
Directly maps question-answer pairs that AI engines use to produce conversational answers. FAQPage schema is one of the strongest citation triggers — when a buyer's prompt matches a question in your FAQ schema, AI engines have a pre-structured answer to cite.
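As an illustration, a minimal FAQPage block following the schema.org shape — the question and answer text here are placeholders:

```json
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [{
    "@type": "Question",
    "name": "Which AI crawlers index websites?",
    "acceptedAnswer": {
      "@type": "Answer",
      "text": "Seven major AI engine crawlers actively index web content, including GPTBot, PerplexityBot, and ClaudeBot."
    }
  }]
}
```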
3. HowTo
For process-oriented content, HowTo schema provides AI models with step-by-step structure they can reproduce in answers. Particularly effective for queries beginning with "how to."
4. Product / Service
Marks what you sell with explicit names, descriptions, pricing (where applicable), and aggregate ratings. AI shopping queries and comparison prompts weight Product schema heavily.
5. Organization
Establishes your brand identity — name, logo, URL, social profiles, founding date, description — as structured data that AI models use to build entity understanding. Organization schema is the foundation: without it, AI models may conflate your brand with similarly named entities.
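A minimal Organization block might look like the following; the logo path, description, and social profile URL are placeholders to be replaced with real values:

```json
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Cintra",
  "url": "https://cintra.run",
  "logo": "https://cintra.run/logo.png",
  "description": "AI visibility (GEO) agency.",
  "sameAs": ["https://www.linkedin.com/company/cintra"]
}
```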
6. Review / AggregateRating
Incorporates social proof signals into machine-readable format. AI models weight review signals when making product recommendations. AggregateRating schema from legitimate reviews improves citation probability for commercial queries.
Implementation priority: Organization first (entity foundation), then FAQPage (highest citation trigger), then Article (editorial authority), then Product/Service (commercial queries), then HowTo (process queries), then Review (trust signals).
What technical page structure do AI crawlers prefer?
AI crawlers read HTML sequentially and have limited patience for complex JavaScript-rendered content. Pages built for AI citability follow a consistent structural pattern.
Preferred structure for AI citation:
- Clear H1 that directly states what the page covers — AI models use H1 as the primary signal for page topic
- Opening paragraph that answers the core question within the first 100 words — AI models optimized for brevity often excerpt only the opening
- H2 headings as direct questions — "What does X include?" rather than "Our approach to X" — AI models match prompts to heading-phrased questions
- Short paragraphs (3-5 sentences) — AI models prefer extractable passages over dense prose
- Tables for comparisons and specifications — AI models render structured comparisons directly from HTML tables
- Definitions for technical terms — Explicit definitions help AI models build entity associations around your brand's vocabulary
Technical factors that degrade AI crawlability:
- JavaScript-rendered content that requires browser execution (SSR or static HTML is strongly preferred)
- Pagination that breaks content across multiple URLs (single-page, scrollable content performs better)
- Lazy-loaded content that only appears after user scroll events
- Heavy interstitials or cookie walls that delay content rendering
- Thin pages under 500 words that don't provide enough signal for AI models to calibrate citation confidence
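A quick way to spot the JavaScript-rendering failure mode is to check whether key content appears in the raw HTML the server returns, since a crawler that skips JS execution sees only that string. A minimal sketch — `visible_without_js` is a hypothetical helper, and the HTML strings stand in for a real fetch:

```python
# Sketch: content present in raw server HTML is readable without JavaScript;
# an empty client-side-rendered (CSR) shell is not.
def visible_without_js(html: str, key_phrase: str) -> bool:
    # A crawler that does not execute JS sees only the raw HTML string.
    return key_phrase.lower() in html.lower()

ssr_html = "<h1>How GEO Builds AI Visibility</h1><p>GEO builds citations.</p>"
csr_html = '<div id="root"></div><script src="/bundle.js"></script>'

print(visible_without_js(ssr_html, "AI Visibility"))  # True
print(visible_without_js(csr_html, "AI Visibility"))  # False
```

In practice, fetch the URL with a plain HTTP client (no browser) and search the response for a phrase you know appears on the rendered page; if it's missing, that page is partially or fully invisible to AI crawlers.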
How does Cintra handle technical GEO for clients?
Technical GEO implementation is part of Cintra's Enterprise plan — the only plan that includes CRO and technical SEO integration with AI visibility execution.
For all plans, the technical baseline is established during onboarding:
- robots.txt audit and AI crawler access confirmation
- Schema markup validation and gap identification
- Core Web Vitals check for AI crawler load performance
- Site architecture review for knowledge graph optimization
Enterprise clients receive ongoing technical implementation:
- Full schema deployment across all published content types
- JavaScript rendering analysis and static HTML optimization where needed
- Structured data validation after every content publish cycle
- Schema maintenance as AI engine citation behavior evolves
Technical GEO without content is insufficient — schema markup on thin pages doesn't create citations. Content without technical GEO is also insufficient — excellent content blocked by robots.txt or rendered by JavaScript creates no AI visibility. Both layers need to be in place.
Frequently Asked Questions
Should I block AI crawlers from training data?
This is a legitimate choice if you have proprietary content you don't want used in AI model training. The tradeoff: blocking training crawlers (Common Crawl, Google-Extended for Gemini training) may also reduce your citation probability in those engines' search products. For most marketing sites, the citation benefit of full AI crawler access outweighs the training data concern.
Does page speed affect AI citation rates?
Yes, but indirectly. AI crawlers have crawl budgets and timeout thresholds. Pages that load slowly may be incompletely indexed or skipped in favor of faster-loading competitor pages. Core Web Vitals optimization is part of technical GEO — not because AI engines weight CWV directly, but because slow pages are crawled less reliably.
My site uses React/Next.js — is JavaScript rendering a problem?
Server-side rendering (SSR) in Next.js produces full HTML that AI crawlers can read without executing JavaScript — this is fine. Client-side rendering (CSR) where the page renders in the browser after a JavaScript bundle executes is problematic for AI crawlers. If your Next.js pages use SSR or static generation, you're in good shape. If they use CSR or have significant hydration delays, a technical audit should identify which pages are partially invisible to AI crawlers.
How often does schema markup need to be updated?
Schema types don't change frequently, but schema content needs to stay current. Author names, product names, pricing (if included), aggregate ratings, and dateModified fields should update with the content. A page with a datePublished of 2023 and no dateModified will be treated as stale by AI models that weight recency — even if the content was updated last week. Cintra includes schema maintenance in all publishing workflows.
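For example, a refreshed article keeps its original datePublished and gains a current dateModified — dates here are illustrative:

```json
{
  "@context": "https://schema.org",
  "@type": "BlogPosting",
  "headline": "How GEO Builds AI Visibility",
  "datePublished": "2023-06-14",
  "dateModified": "2026-04-10"
}
```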