Parse Fandom Wiki HTML with PyQuery in Python

To scrape a Fandom wiki page with PyQuery, you fetch the page HTML using requests, load it into a PyQuery document object, and then query elements using CSS selectors through .find() and .children(). PyQuery is a Python library that lets you parse and traverse HTML documents using jQuery-style CSS selectors and DOM methods. This guide walks you through the full workflow, from fetching a wiki page to extracting infobox fields, article text, and internal links, with copy-paste code at every step.

Why Fandom Wiki HTML Needs a Structured Parsing Approach

Fandom wiki pages are not simple HTML documents. They layer MediaWiki-generated markup with Fandom’s own portable infobox system, producing deeply nested, class-heavy HTML where a single character stat might sit three or four levels inside aside.portable-infobox. Naive tag searches like “find all td elements” will pull hundreds of unrelated cells.

PyQuery’s CSS selector syntax gives you precise control. You can target .pi-data-value inside aside.portable-infobox without touching the rest of the page. That specificity is what makes PyQuery a better fit than a generic loop over parsed tags.

Before writing any selectors, open your browser’s DevTools (F12) on the target wiki page and inspect the Elements panel. You’ll see the actual class names, which vary between wikis. The patterns in this guide apply across most Fandom properties.

Setup: Installing PyQuery and Fetching a Fandom Page

Install both libraries with one command:

pip install pyquery requests

Then fetch a page and load it into PyQuery:

import requests
from pyquery import PyQuery as pq

url = "https://leagueoflegends.fandom.com/wiki/Jinx"
headers = {"User-Agent": "Mozilla/5.0 (compatible; WikiScraper/1.0)"}

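# A timeout keeps the script from hanging on a stalled connection
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on 4xx/5xx responses
doc = pq(response.text)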

Setting a User-Agent header matters here: requests sent without one may be blocked or served different content. When scraping multiple pages, space out your requests and check the wiki’s robots.txt before building a crawler. One request per second is a reasonable starting rate.
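One possible way to enforce that pacing and the robots.txt check uses only the standard library, as sketched below. The Throttle class, the parse_robots helper, and the sample rules are hypothetical; a real crawler would download the file from the wiki’s /robots.txt and pass its text in.

```python
import time
from urllib import robotparser

class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = None  # monotonic timestamp of the previous request

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


def parse_robots(robots_txt):
    """Build a rule checker from the text of a robots.txt file."""
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


# Made-up rules for illustration only; real rules come from the wiki itself
sample_robots = """User-agent: *
Disallow: /api.php
"""

rules = parse_robots(sample_robots)
allowed = rules.can_fetch("WikiScraper", "https://example.fandom.com/wiki/Jinx")
blocked = rules.can_fetch("WikiScraper", "https://example.fandom.com/api.php")
print(allowed, blocked)  # True False
```

In a crawl loop you would create one Throttle, call its wait() method before each requests.get, and skip any URL for which can_fetch returns False.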

The doc object is now your PyQuery root, equivalent to $(document) in jQuery. Every method below chains from it.

Using find() to Locate Article Content Elements

What does PyQuery find() do?

.find() searches all descendants of the current selection matching a CSS selector. It doesn’t stop at direct children. It traverses the entire subtree, which makes it your primary tool for reaching deeply nested wiki content.

# Get the main article body
content = doc.find("div.mw-parser-output")

# Extract all paragraph text
for p in content.find("p").items():
    print(p.text())

# Extract section headings
for h2 in content.find("h2").items():
    print(h2.find("span.mw-headline").text())

What this does: doc.find("div.mw-parser-output") selects the main article container that MediaWiki generates for all wiki pages. Chaining .find("p") on that result then searches only within that container, keeping you out of the navigation and sidebar HTML.

You can chain attribute selectors directly into .find(). For example, content.find("a[href^='/wiki/']") selects only anchor tags whose href starts with /wiki/, filtering out external links in a single selector string.

Using children() to Traverse Direct Wiki Section Elements

What is the difference between find() and children() in PyQuery?

.children() returns only the immediate child nodes of the current selection, not all descendants. Use it when you want to iterate over the top-level structure of a wiki section without accidentally pulling nested content from deeper levels.

content = doc.find("div.mw-parser-output")

for child in content.children().items():
    tag = child[0].tag
    text_preview = child.text()[:60]
    print(f"<{tag}>: {text_preview}")

What this does: looping over .children() gives you each direct child element of the article body, one by one. You’ll see h2, p, div, and aside elements in sequence, matching the visual reading order of the wiki page.

One edge case to watch: Fandom sometimes wraps article sections inside an extra div with a class like mw-collapsible. If .children() returns a wrapper div instead of the content you expect, add one more .find() call to step inside it. The combination of both methods handles most nesting patterns you’ll encounter.

Extracting Fandom Infobox Data with CSS Selectors

What CSS selectors work with Fandom wiki HTML?

Fandom’s Portable Infobox system generates a consistent HTML structure across wikis. The outer container is aside.portable-infobox, individual data rows use div.pi-item, labels sit in h3.pi-data-label, and values sit in div.pi-data-value. The data-source attribute on each div.pi-item tells you the field name.

infobox = doc.find("aside.portable-infobox")
data = {}

for item in infobox.find("div.pi-item[data-source]").items():
    label = item.find("h3.pi-data-label").text().strip()
    value = item.find("div.pi-data-value").text().strip()
    if label and value:
        data[label] = value

print(data)
# Example output (actual fields depend on the page's infobox template):
# {'Role': 'Marksman', 'Range type': 'Ranged', 'Resource': 'Mana', ...}

What this does: div.pi-item[data-source] uses an attribute selector to grab only rows that have a data-source attribute, which filters out decorative infobox elements like image containers and section headers. The result is a clean Python dictionary of field names to values.

Class names vary across wikis. A wiki built on an older Portable Infobox template might use .pi-item differently or add custom classes. Always inspect the actual HTML before writing selectors. The data-source attribute is the most reliable anchor point because it’s part of the Portable Infobox specification.

Extracting Internal Wiki Links from a Parsed Page

content = doc.find("div.mw-parser-output")
internal_links = []

for a in content.find("a[href^='/wiki/']").items():
    href = a.attr("href")
    # Skip file pages, categories, and edit links
    if any(skip in href for skip in ["/wiki/File:", "/wiki/Category:", "action=edit"]):
        continue
    full_url = "https://leagueoflegends.fandom.com" + href
    internal_links.append(full_url)

print(internal_links[:5])

What this does: a[href^='/wiki/'] selects anchor tags whose href starts with /wiki/, which covers all internal wiki article links. The filter loop drops file pages, category pages, and edit action links, leaving you with clean article URLs you can feed into a crawler loop.
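Before feeding links into a crawler, it can help to resolve them against the page URL, strip section fragments, and remove duplicates. The helper below is a sketch using only the standard library; the name clean_links, the extra /wiki/Special: prefix, and the sample hrefs are assumptions.

```python
from urllib.parse import urljoin, urldefrag

# Prefixes to skip; "/wiki/Special:" is an added assumption
SKIP_PREFIXES = ("/wiki/File:", "/wiki/Category:", "/wiki/Special:")

def clean_links(hrefs, page_url):
    """Resolve hrefs against page_url, drop fragments, and remove duplicates."""
    seen = set()
    cleaned = []
    for href in hrefs:
        if not href.startswith("/wiki/") or href.startswith(SKIP_PREFIXES):
            continue
        if "action=edit" in href:
            continue
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if absolute not in seen:
            seen.add(absolute)
            cleaned.append(absolute)
    return cleaned

# Hypothetical hrefs as they might come out of a.attr("href")
hrefs = [
    "/wiki/Vi",
    "/wiki/Vi#Abilities",
    "/wiki/File:Jinx.png",
    "/wiki/Category:Champions",
    "/wiki/Zaun?action=edit",
    "/wiki/Zaun",
]
links = clean_links(hrefs, "https://leagueoflegends.fandom.com/wiki/Jinx")
print(links)
# ['https://leagueoflegends.fandom.com/wiki/Vi', 'https://leagueoflegends.fandom.com/wiki/Zaun']
```

Resolving against the page URL rather than a hard-coded host means the same helper works unchanged on any wiki.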

PyQuery vs BeautifulSoup for Fandom Wiki Scraping

Here’s the same infobox extraction in both libraries:

# PyQuery
data = {}
for item in doc.find("div.pi-item[data-source]").items():
    label = item.find("h3.pi-data-label").text().strip()
    value = item.find("div.pi-data-value").text().strip()
    if label and value:
        data[label] = value

# BeautifulSoup
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")
data = {}
for item in soup.select("div.pi-item[data-source]"):
    label = item.select_one("h3.pi-data-label")
    value = item.select_one("div.pi-data-value")
    if label and value:
        data[label.get_text(strip=True)] = value.get_text(strip=True)

The PyQuery version is shorter and reads like CSS. If you have a front-end background or any jQuery experience, PyQuery’s chaining syntax will feel natural immediately. BeautifulSoup has broader community documentation and a reputation for tolerating malformed HTML, so it’s a reasonable choice for one-off scripts where you expect to lean on existing tutorials and answers.

For Fandom wiki scraping with attribute selectors and chained queries, PyQuery wins on readability. Pick BeautifulSoup if you’re working with a team that already uses it and consistency matters more than syntax preference.

A Complete Fandom Wiki Scraper You Can Run Now

import json
from urllib.parse import urljoin

import requests
from pyquery import PyQuery as pq

def scrape_wiki_page(url):
    headers = {"User-Agent": "Mozilla/5.0 (compatible; WikiScraper/1.0)"}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # fail fast on 4xx/5xx responses
    doc = pq(response.text)
    content = doc.find("div.mw-parser-output")

    # Extract infobox key-value pairs
    infobox = {}
    for item in doc.find("div.pi-item[data-source]").items():
        label = item.find("h3.pi-data-label").text().strip()
        value = item.find("div.pi-data-value").text().strip()
        if label:
            infobox[label] = value

    # Extract section headings
    sections = []
    for h2 in content.find("h2").items():
        heading = h2.find("span.mw-headline").text()
        if heading:
            sections.append(heading)

    # Extract internal article links, resolved against the page's own wiki
    links = []
    for a in content.find("a[href^='/wiki/']").items():
        href = a.attr("href")
        if not any(s in href for s in ["/wiki/File:", "/wiki/Category:"]):
            links.append(urljoin(url, href))

    return {"infobox": infobox, "sections": sections, "links": links[:10]}

result = scrape_wiki_page("https://leagueoflegends.fandom.com/wiki/Jinx")
print(json.dumps(result, indent=2))

Run this script against any Fandom wiki page by swapping the URL. The function returns a JSON-serializable dictionary that you can load into a pandas DataFrame or write to a database. The next logical step is adding .filter() to narrow a selection by tag type, or chaining .not_() to exclude elements you don’t want, such as footnote markers (sup.reference).

FAQ: Common Questions About PyQuery and Fandom Wiki Scraping

Does PyQuery work with JavaScript-rendered pages?

PyQuery parses static HTML only. Fandom loads most article content server-side, so requests plus PyQuery handles the majority of pages. If a section is blank in your output but visible in the browser, that content is JavaScript-rendered and requires Playwright or Selenium to access.

Is scraping Fandom wikis legal?

Fandom wikis are publicly accessible and most content is licensed under Creative Commons. Always check the specific wiki’s licensing terms and Fandom’s robots.txt. Rate-limit your requests and don’t republish extracted content without attribution.

What is the difference between find() and children() in PyQuery?

.find() searches all descendants at any depth. .children() returns only immediate child elements. Use .find() when the target element is nested deeply; use .children() when you want to iterate over the top-level structure of a container.

Can I use the fandom-py package instead?

The fandom-py package on PyPI wraps the Fandom API and returns pre-parsed data for common fields. It’s faster for simple lookups but gives you less control over the HTML structure. PyQuery is the better choice when you need to extract custom infobox fields, navbox data, or content patterns that the API doesn’t expose.

Ryan French

Ryan French is the driving force behind PyQuery.org, a leading platform dedicated to the PyQuery ecosystem. As the founder and chief editor, Ryan combines his extensive experience in the developer arena with a passion for sharing knowledge about PyQuery, a third-party Python package designed for parsing and extracting data from XML and HTML pages. Inspired by the jQuery JavaScript library, PyQuery boasts a similar syntax, enabling developers to manipulate document trees with ease and efficiency.
