How to Extract Gaming Leaderboards and Player Stats with PyQuery

PyQuery is a Python library that lets you parse and query HTML documents using jQuery-style CSS selector syntax, making it a natural fit for extracting structured data like leaderboard tables. This guide walks you through pulling rank, player name, and score data from gaming sites, covering both static HTML tables and the JSON API endpoints that many modern leaderboards use under the hood. You’ll end up with a working Python script that outputs a clean list of player dictionaries, ready for analysis or storage.

HTML Tables vs JSON APIs: What You’re Actually Dealing With

Gaming leaderboards come in two formats, and knowing which one you’re facing determines your entire approach. Static HTML tables render the leaderboard data directly in the page source. JSON APIs load data asynchronously after the page loads, meaning the HTML source you get from a plain requests.get() call will be empty or show a loading spinner.

To tell the difference, open your browser DevTools and go to the Network tab. Reload the leaderboard page and filter by XHR or Fetch requests. If you see a request returning JSON with player names and scores, that’s your real data source. If the table data appears in the initial HTML response, PyQuery can parse it directly.

Setup: Installing PyQuery and requests

Install both libraries with a single pip command:

pip install pyquery requests

PyQuery wraps lxml under the hood, so you don’t need a separate lxml install in most environments. Your import block for every example in this guide looks like this:

import requests
from pyquery import PyQuery as pq
import json
import time

Fetching the Leaderboard Page

A bare requests.get() call will get you blocked on most gaming sites. Set a User-Agent header that looks like a real browser request:

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}

response = requests.get("https://example-gaming-site.com/leaderboard", headers=headers)
doc = pq(response.text)
print(doc("title").text())  # Verify the page loaded correctly

Passing response.text directly into pq() creates a queryable document object. If doc("title").text() returns the page title you expect, your setup is working.

Selecting Leaderboard Rows with PyQuery

Targeting Table Rows with .find()

Most leaderboard tables use a standard <table> structure. You target rows with a CSS selector passed to .find():

rows = doc.find("table.leaderboard tbody tr")

The selector table.leaderboard tbody tr targets every row inside the <tbody> of a table with class leaderboard. Adjust the class name to match your target site’s actual HTML.

Extracting Rank, Name, and Score Per Row

players = []

for row in rows.items():
    try:
        rank = row.find("td.rank").text().strip()
        name = row.find("td.player-name").text().strip()
        score = row.find("td.score").text().strip()
        players.append({"rank": rank, "name": name, "score": score})
    except Exception:
        continue

print(players[:3])

In the code above, .items() iterates over each matched row as a separate PyQuery object. .find("td.rank") selects the rank cell within that row, and .text() pulls the visible text content. The try/except block skips malformed rows without crashing your script.

Expected output:

[
  {"rank": "1", "name": "ProSniper99", "score": "48200"},
  {"rank": "2", "name": "VoidRunner", "score": "47850"},
  {"rank": "3", "name": "NightOwl_X", "score": "46100"}
]

Handling Data Attributes Instead of Class Names

Some gaming sites skip semantic class names and use data-* attributes on cells. PyQuery handles attribute selectors cleanly:

rank = row.find("[data-col='rank']").text().strip()
name = row.find("[data-col='username']").text().strip()

This is where PyQuery’s jQuery-style syntax pays off. You’d write the same selector in a browser console, so there’s no mental translation required.

Handling Dynamic Leaderboards: JSON Endpoints

How do I get player stats when the HTML is empty?

If the leaderboard HTML source shows no data, the site is loading it via an XHR request. Go back to the Network tab, find the JSON call, copy the URL, and call it directly with requests. This approach is faster and more reliable than scraping rendered HTML.

api_url = "https://example-gaming-site.com/api/leaderboard?game=fps&page=1"
response = requests.get(api_url, headers=headers)
data = response.json()

players = []
for entry in data.get("entries", []):
    players.append({
        "rank": entry.get("rank"),
        "name": entry.get("username"),
        "score": entry.get("score")
    })

In the code above, response.json() parses the JSON response directly. data.get("entries", []) safely handles cases where the key doesn’t exist. For paginated APIs, loop over page numbers and append results until you hit an empty entries list:

all_players = []
page = 1

while True:
    response = requests.get(f"{api_url}&page={page}", headers=headers)
    data = response.json()
    entries = data.get("entries", [])
    if not entries:
        break
    all_players.extend(entries)
    page += 1
    time.sleep(1)

Scraping Per-Player Stats with PyQuery

Extracting Profile URLs from Leaderboard Rows

Many leaderboard rows link to individual player profile pages. You can pull those URLs with PyQuery’s .attr() method:

for row in rows.items():
    profile_url = row.find("td.player-name a").attr("href")
    players.append({"name": row.find("td.player-name").text(), "profile_url": profile_url})

Scraping Individual Player Stat Pages

With a list of profile URLs, you can loop over them and scrape per-player stats. Use .filter() when a profile page contains multiple stat sections and you only want one:

for player in players:
    time.sleep(1)
    resp = requests.get(player["profile_url"], headers=headers)
    profile = pq(resp.text)

    stat_block = profile.find(".stats-section").filter(".combat-stats")
    player["kd_ratio"] = stat_block.find("[data-stat='kd']").text().strip()
    player["win_rate"] = stat_block.find("[data-stat='winrate']").text().strip()

.filter(".combat-stats") narrows the matched set to only the combat stats section, ignoring other stat blocks on the page. This is the cleaner approach compared to writing a more specific selector that may break if the page layout shifts.

Structuring and Storing the Extracted Data

Convert your list of dicts to a pandas DataFrame for analysis:

import pandas as pd
from datetime import datetime

for player in players:
    player["scraped_at"] = datetime.utcnow().isoformat()

df = pd.DataFrame(players)
print(df.head())
df.to_csv("leaderboard.csv", index=False)

If you’d rather skip pandas, write directly to CSV with Python’s built-in module:

import csv

with open("leaderboard.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["rank", "name", "score", "scraped_at"])
    writer.writeheader()
    writer.writerows(players)

The scraped_at timestamp field lets you track leaderboard changes over time by comparing snapshots.

PyQuery vs Other Python Scraping Libraries

Comparison of PyQuery, BeautifulSoup, and Scrapy for leaderboard scraping
Library Selector Syntax Learning Curve Best For jQuery Familiarity
PyQuery CSS / jQuery-style Low (for front-end devs) HTML tables, attribute filtering High
BeautifulSoup Python-native find/select Low Flexible HTML parsing None
Scrapy XPath / CSS High Large-scale crawls Low

If you have any front-end background, PyQuery’s selector syntax will feel natural immediately. BeautifulSoup is a fine choice for one-off scrapes, but PyQuery’s chained selectors make leaderboard table traversal more concise. Scrapy is overkill for a single leaderboard endpoint.

Practical Tips for Reliable Leaderboard Scraping

  • Add time.sleep(1) between requests to avoid triggering rate limits. Two seconds is safer for sites with aggressive throttling.
  • Check the site’s robots.txt before scraping and respect any Crawl-delay directives you find there.
  • Wrap per-row extraction in try/except blocks so a single malformed row doesn’t stop your entire scrape.
  • Cache JSON API responses locally during development so you’re not hammering the endpoint while iterating on your selector logic.

FAQ: Gaming Leaderboard Scraping with PyQuery

Can PyQuery scrape dynamic JavaScript leaderboards?

PyQuery parses static HTML, so it can’t execute JavaScript. If the leaderboard loads via JavaScript rendering, find the underlying JSON API endpoint in the Network tab and call it directly with requests. This is faster anyway.

What is the difference between PyQuery and BeautifulSoup for gaming data?

PyQuery uses CSS selector syntax identical to jQuery, making it more intuitive for developers with front-end experience. BeautifulSoup uses Python-native methods like find() and find_all(). For leaderboard tables with class-based or attribute-based selectors, PyQuery’s chaining syntax is more concise.

How do I handle leaderboard pagination with PyQuery?

For HTML-rendered pagination, find the “next page” link with doc.find("a.pagination-next").attr("href") and loop until no next link exists. For JSON APIs, increment a page parameter in the URL until the response returns an empty results array.

Which PyQuery selectors work best for leaderboard tables?

Use tbody tr to target data rows, td:nth-child(n) for positional column selection, and [data-*] attribute selectors when the site uses data attributes instead of class names. Chain these with .find() for precise DOM traversal.

Start Building Your Leaderboard Pipeline

You now have a complete workflow for extracting gaming leaderboard data with Python: fetch the page with requests, parse HTML tables using PyQuery’s CSS selectors, fall back to JSON endpoints when the HTML is empty, scrape per-player stats by iterating over profile URLs, and store everything in a DataFrame or CSV. The next step is applying .filter() and .not() to refine your player stat extraction when profile pages contain multiple overlapping data sections. Both methods accept any CSS selector string, giving you surgical control over which elements you pull from complex gaming stat pages.