
Scraping my own Hacker News comments

I've left a lot of comments on Hacker News since 2009, when I moved over from Slashdot. HN Search is great for finding old comments, but it can't see comment scores, which are private to the author of the comment. I thought it would be interesting to find my highest-voted comments, so I had to scrape my comments from HN myself. Here's how I did it:

HN's HTML is very simple and doesn't change much, which is great for scraping. I tried a couple of web scraping extensions for Chrome, and none of them were very good. I ended up using Web Scraper, which has a horrible UI, but the scraping part actually worked.

The scraping code is called a "Sitemap" in this tool for some reason. Here's the "Sitemap" I made. You can use it by replacing YOUR_USERNAME_HERE with your own username.

{
  "_id": "hnSelfCommentsAndScores",
  "startUrl": ["https://news.ycombinator.com/threads?id=YOUR_USERNAME_HERE"],
  "selectors": [
    {
      "id": "score",
      "parentSelectors": ["comment"],
      "type": "SelectorText",
      "selector": ".score",
      "multiple": false,
      "delay": 0,
      "regex": "-?\\d+"
    },
    {
      "id": "comment",
      "parentSelectors": ["more"],
      "type": "SelectorElement",
      "selector": "td.default:has(.score)",
      "multiple": true,
      "delay": 0
    },
    {
      "id": "text",
      "parentSelectors": ["comment"],
      "type": "SelectorText",
      "selector": ".commtext",
      "multiple": false,
      "delay": 0,
      "regex": ""
    },
    {
      "id": "link",
      "parentSelectors": ["comment"],
      "type": "SelectorElementAttribute",
      "selector": ".age a",
      "multiple": false,
      "delay": 0,
      "extractAttribute": "href"
    },
    {
      "id": "more",
      "parentSelectors": ["_root", "more"],
      "paginationType": "linkFromHref",
      "selector": "a.morelink",
      "type": "SelectorPagination"
    }
  ]
}
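The "regex" on the score selector is what turns the score text into a bare number. Here is a minimal Python sketch of the same pattern, assuming the .score element contains text like "42 points" (that text format is my assumption, and extract_score is just an illustrative helper, not part of the tool):

```python
import re

# Same pattern as the "regex" field in the sitemap: an optional minus sign
# followed by one or more digits. (In JSON the backslash is doubled.)
SCORE_PATTERN = re.compile(r"-?\d+")


def extract_score(score_text):
    """Return the first integer found in score_text, or None if there is none.

    Hypothetical helper for illustration; the assumed input looks like
    "42 points" or "-3 points".
    """
    match = SCORE_PATTERN.search(score_text)
    return int(match.group()) if match else None
```

The leading -? matters because comments can be voted below zero, and without it a negative score would come out as a positive number.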

Since it's not obvious, you access the Web Scraper UI by opening dev tools (on any page, it doesn't matter which) and finding the new "Web Scraper" tab on the far right. Then you can "Import Sitemap", and then "Scrape". I used the default request timing, which makes requests every four seconds, and didn't run into any throttling issues.

You can watch the scraper do its thing in its own popup window. Once that window disappears, after a few minutes, you can save the data with the "Export data" menu item. Now you should have the score, comment text, and comment link for every comment you've made, including nested comments.
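From there, finding the highest-voted comments is a small sorting job. Here is a rough Python sketch, assuming the data was exported as CSV and that the columns are named after the selector ids in the sitemap ("score", "text", "link"). The column names, the relative form of the link values, and the top_comments helper are all assumptions on my part, so check them against your actual export:

```python
import csv
from urllib.parse import urljoin

# The scraped href values are assumed to be relative (e.g. "item?id=123"),
# so they are resolved against the HN base URL.
HN_BASE = "https://news.ycombinator.com/"


def top_comments(csv_file, limit=10):
    """Return the `limit` highest-scoring rows from an exported CSV file object.

    Assumes columns named "score", "text", and "link". Rows whose score is
    missing or not an integer are skipped.
    """
    rows = []
    for row in csv.DictReader(csv_file):
        raw_score = (row.get("score") or "").strip()
        try:
            score = int(raw_score)
        except ValueError:
            continue  # no usable score on this row
        rows.append({
            "score": score,
            "text": row.get("text") or "",
            "link": urljoin(HN_BASE, row.get("link") or ""),
        })
    # Highest score first.
    rows.sort(key=lambda r: r["score"], reverse=True)
    return rows[:limit]
```

To use it, open the exported file with open(path, newline="", encoding="utf-8"), pass the file object to top_comments, and print the score and link of each result.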

I found some interesting things in my comment history, like the time I had a conversation with a skeptical Andrej Karpathy, who didn't think gradient descent on neural nets could produce anything like a thinking brain. He was then just starting a PhD at Stanford, but now of course he is the director of AI at Tesla, and I think he holds a higher opinion of stochastic gradient descent on neural nets these days :)

