Parsing Match History Pages in Python Using PyQuery Selectors

To parse a match history page with Python, install PyQuery, load the HTML, and use CSS selectors to target table rows and stat containers. This guide walks you through a complete working pipeline: from loading a match history HTML file to outputting a clean list of Python dicts you can write to CSV or JSON. If you’ve already tried regex and watched it fall apart on nested tags, you’re in the right place.

Why Match History HTML Breaks Regex-Based Parsing

Match history pages are structured data wrapped in deeply nested HTML. A typical page has an outer container div, repeated row elements for each game, and nested cells holding kills, deaths, assists, duration, and result. That structure is exactly what makes regex a poor choice here.

Consider what a regex pattern for extracting kills looks like when the HTML reads <td class="stat kills" data-value="12">12</td>. Your pattern works until a developer changes the whitespace, adds a wrapper span, or appends a dynamic class suffix. Then it silently breaks and you get empty results with no error to trace.

The bigger problem is attribute-encoded data. Match history pages frequently store metadata in attributes like data-match-id, data-champion, or data-result. Regex can technically match those, but the patterns get fragile fast. PyQuery’s CSS selector syntax reads [data-match-id] and handles it cleanly because it understands HTML structure rather than treating the page as a flat string.

Setting Up PyQuery for HTML Parsing

Installation

Run this in your terminal to install PyQuery and its lxml dependency:

pip install pyquery lxml

Verify the install works by opening a Python shell and running from pyquery import PyQuery as pq. No errors means you’re ready.

Loading Match History HTML

PyQuery accepts HTML from three sources: a local file, a string variable, or a live URL. Here’s each approach:

from pyquery import PyQuery as pq

# From a local file
with open("match_history.html", "r", encoding="utf-8") as f:
    doc = pq(f.read())

# From a string
html_string = "<div class='match-list'>...</div>"
doc = pq(html_string)

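# From a live URL (requires requests: pip install requests)
import requests
response = requests.get("https://example.com/match-history", timeout=10)
response.raise_for_status()  # stop on 4xx/5xx instead of parsing an error page
doc = pq(response.text)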

Always respect the site’s robots.txt and add a delay between requests when scraping live pages. The doc object is now your entry point for all selector queries.
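One way to honor robots.txt is Python's standard-library urllib.robotparser. The sketch below parses an inline robots.txt so it runs offline. The rules, the my-scraper user agent, and the example.com URLs are placeholders. Against a real site you would call set_url() and read() on the parser instead of passing it a string.

```python
from urllib.robotparser import RobotFileParser

# Placeholder robots.txt content (assumption, not from a real site)
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

user_agent = "my-scraper"  # hypothetical user agent string

# Check individual URLs before requesting them
allowed = parser.can_fetch(user_agent, "https://example.com/match-history")
blocked = parser.can_fetch(user_agent, "https://example.com/private/stats")

# Use the site's requested delay if it declares one, else fall back to 1 second
delay = parser.crawl_delay(user_agent) or 1

print(allowed, blocked, delay)
```

You could then pass delay to time.sleep() between requests.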

Understanding the Match History HTML Structure

Before writing selectors, you need to know what you’re targeting. A realistic match history page looks something like this:

<div class="match-list">
  <div class="match-row" data-match-id="9812" data-result="win">
    <div class="match-meta">
      <span class="champion-name">Jinx</span>
      <span class="match-duration">32:14</span>
    </div>
    <div class="match-stats">
      <span class="kills">12</span>
      <span class="deaths">3</span>
      <span class="assists">7</span>
    </div>
  </div>
  <!-- more .match-row elements -->
</div>

The key observation: every match entry shares the class match-row, and critical metadata lives in data-match-id and data-result attributes. Stat values sit inside consistently named child spans. This consistent class naming is what makes CSS selectors the right tool. You can target the entire set of matches with one selector string instead of writing loops around regex patterns.

Selecting Match Rows with PyQuery CSS Selectors

Grabbing All Match Entries

Use a class selector to grab every match row at once:

matches = doc(".match-list .match-row")
print(f"Found {len(matches)} matches")

Scoping the selector to .match-list .match-row rather than just .match-row protects you from false positives if other parts of the page reuse that class name. The len() call confirms you got the expected count before you start extracting data.

Drilling Into Nested Elements with .find()

The .find() method accepts any CSS selector string and searches within the matched element’s subtree. It’s your primary tool for reaching nested stat containers:

# Within a single match row element
row = pq(matches[0])
kills = row.find(".kills").text()
champion = row.find(".champion-name").text()

What this does: row.find(".kills") returns a new PyQuery object containing only the kills span within that row. Calling .text() on it gives you the visible text content as a string.

Extracting Text and Attribute Data

Pulling Visible Stat Values with .text()

Use .text() to get the visible content of any matched element. It strips tags and returns a plain string:

row = pq(matches[0])
kills   = row.find(".kills").text()
deaths  = row.find(".deaths").text()
assists = row.find(".assists").text()
duration = row.find(".match-duration").text()

Reading Attribute Values with .attr()

For data encoded in HTML attributes, use .attr():

match_id = row.attr("data-match-id")
result   = row.attr("data-result")

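You can chain .find() and .attr() together to reach attributes on deeply nested elements in a single line. The example below assumes the row also contains a .champion-icon element carrying a data-champion-id attribute, which the sample markup above does not include. If the element or attribute is missing, .attr() returns None: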

champion_id = row.find(".champion-icon").attr("data-champion-id")

What this does: .find() locates the element, then .attr() reads the named attribute directly. No intermediate variable needed.

Iterating Over Match Rows to Build a Dataset

The .items() method is what you want when iterating. It yields each matched element as its own PyQuery object, so you can call .find() and .attr() directly without wrapping in pq() each time.

from pyquery import PyQuery as pq
import requests

response = requests.get("https://example.com/match-history", timeout=10)
response.raise_for_status()  # fail fast on HTTP errors
doc = pq(response.text)

match_records = []

for row in doc(".match-list .match-row").items():
    record = {
        "match_id":  row.attr("data-match-id"),
        "result":    row.attr("data-result"),
        "champion":  row.find(".champion-name").text().strip(),
        "duration":  row.find(".match-duration").text().strip(),
        "kills":     row.find(".kills").text().strip(),
        "deaths":    row.find(".deaths").text().strip(),
        "assists":   row.find(".assists").text().strip(),
    }
    match_records.append(record)

print(match_records)

What this does: the loop visits each .match-row element, extracts text and attribute values into a dict, and appends it to match_records. After the loop, you have a list of dicts ready for analysis or export.
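Every value in those dicts is still a string, so you may want to convert the numeric fields before analysis. The sketch below is one possible approach. It assumes durations always look like MM:SS or HH:MM:SS, and the helper names duration_to_seconds and to_typed are made up for this example.

```python
def duration_to_seconds(value):
    """Convert 'MM:SS' or 'HH:MM:SS' into total seconds."""
    seconds = 0
    for part in value.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def to_typed(record):
    """Return a copy of a scraped record with numeric fields converted."""
    typed = dict(record)
    for key in ("kills", "deaths", "assists"):
        typed[key] = int(record[key])
    typed["duration_seconds"] = duration_to_seconds(record["duration"])
    return typed


# One record shaped like the output of the loop above
raw = {
    "match_id": "9812",
    "result": "win",
    "champion": "Jinx",
    "duration": "32:14",
    "kills": "12",
    "deaths": "3",
    "assists": "7",
}

typed = to_typed(raw)
print(typed["kills"], typed["duration_seconds"])  # 12 1934
```

int() raises ValueError on empty or non-numeric text, so rows with missing stats need a fallback before conversion.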

To write the results to CSV, add this after the loop:

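import csv

# Explicit columns avoid an IndexError on match_records[0] when no rows matched
fieldnames = ["match_id", "result", "champion", "duration", "kills", "deaths", "assists"]

with open("match_history.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(match_records)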
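JSON export follows the same pattern with the standard-library json module. This is a minimal sketch: the single record stands in for the list built by the loop, and the match_history.json filename is an arbitrary choice.

```python
import json
from pathlib import Path

# Stand-in for the list of dicts produced by the extraction loop
match_records = [
    {
        "match_id": "9812",
        "result": "win",
        "champion": "Jinx",
        "duration": "32:14",
        "kills": "12",
        "deaths": "3",
        "assists": "7",
    },
]

out_path = Path("match_history.json")

# indent=2 keeps the file readable; ensure_ascii=False preserves non-ASCII names
out_path.write_text(
    json.dumps(match_records, indent=2, ensure_ascii=False),
    encoding="utf-8",
)

# Read the file back to confirm the round trip
reloaded = json.loads(out_path.read_text(encoding="utf-8"))
print(reloaded == match_records)  # True
```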

PyQuery vs BeautifulSoup vs Regex

Here’s a direct comparison for the same task: extracting kills from a match row.

Regex extracts kills with something like re.search(r'class="kills">(\d+)<', html). That breaks the moment the tag gains an extra attribute or whitespace. BeautifulSoup’s row.find("span", class_="kills").get_text() works but gets verbose when you’re chaining three or four levels deep (BeautifulSoup also accepts CSS selectors through select() and select_one()). PyQuery’s row.find(".kills").text() is shorter, reads like a CSS selector you’d write in DevTools, and stays readable as the selector complexity grows.

Use regex only for single-value extractions from predictable, flat strings. Use PyQuery when your HTML has repeating rows and nested stat containers.

PyQuery Selector Quick Reference for Match History Parsing

| Selector or Method | Use Case in Match History Parsing |
| --- | --- |
| .match-row | Select all match entry containers |
| .match-list .match-row | Scope to parent to avoid false positives |
| [data-match-id] | Target elements with a specific data attribute |
| .find(".kills") | Drill into nested stat spans within a row |
| .attr("data-result") | Read win/loss metadata from HTML attributes |
| .filter("[data-result='win']") | Filter match rows to wins only (use .filter(".win") if the page marks wins with a class instead) |
| .text().strip() | Get clean visible text from stat elements |

Handling Edge Cases in Match History Pages

Missing Elements

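Calling .text() or .attr() on an empty selection does not raise an error. You get an empty or None value back, so a missing element can slip through unnoticed. Check .length first and substitute an explicit placeholder for elements that might not exist: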

kills_el = row.find(".kills")
kills = kills_el.text().strip() if kills_el.length else "N/A"

Inconsistent Class Names

When class names vary across rows, target a stable attribute instead, if the page provides one (the data-type attribute here is illustrative):

row.find("[data-type='stat-kills']").text()

JavaScript-Rendered Pages

PyQuery parses static HTML only. If the match history page loads data via JavaScript after the initial page load, PyQuery will return empty results because the stat rows won’t exist in the raw HTML response. For those cases, use Playwright or Selenium to render the page first, then pass the rendered HTML to PyQuery.

Frequently Asked Questions

Can PyQuery handle JavaScript-rendered pages?

No. PyQuery parses static HTML. If your match history page loads data dynamically via JavaScript, use Playwright or Selenium to render the page, then pass the resulting HTML to PyQuery for parsing.

Is PyQuery faster than BeautifulSoup?

PyQuery uses lxml under the hood, which is generally faster than BeautifulSoup’s default html.parser. BeautifulSoup can also be configured to use lxml as its parser, which narrows the gap. For large match history tables with hundreds of rows, the difference can be noticeable. For small pages, both are fast enough.

Why is my PyQuery selector returning empty results?

The most common cause is a class name mismatch. Open the page in browser DevTools and inspect the exact class on the element. Also check whether the page is JavaScript-rendered, which means the HTML you’re parsing won’t contain the rows you see in the browser.

How do I extract data from HTML attributes like data-match-id?

Use .attr("data-match-id") on the matched element. You can also select elements that have a specific attribute using the selector [data-match-id].

Can I parse a local HTML file with PyQuery?

Yes. Open the file with Python’s built-in open(), read the content as a string, and pass it to pq(). This is the recommended approach when working with exported match history HTML files.

Your next step is to adapt the CSS selector patterns in this guide to match the exact class names on your target page. Open DevTools, inspect a match row, and copy the class names directly into your .find() calls. From there, explore .filter() to narrow results by win/loss, and .not_() (spelled with a trailing underscore because not is a reserved word in Python) to exclude specific row types from your dataset.