PyQuery for scraping the talkbass classifieds forum

The talkbass classifieds forum has been a go-to source of data for my side projects (e.g., talkbassPricing, talkbassHistory, and talkbassFilter). As a data scientist and a bass player, I’ve found talkbass to be an excellent place to combine my profession and my hobby. I’m currently in the midst of my most ambitious pursuit yet, which I will be documenting on this blog. All good stories start at the beginning, and in this case, the beginning is data scraping.

Once you add web scraping to your toolbelt, the world (wide web) becomes your oyster. Of course, many sites, e.g. Twitter, offer APIs that allow users to easily pull down data sets, but alas, talkbass is not one of them. I’ve always used PyQuery for my data scraping projects, and this choice seems somewhat rare among the data science community. Nearly everyone else I know uses Beautiful Soup, which, from what I hear, is a perfectly fine solution. I do not wish to make the argument for one package over the other. Rather, I intend to introduce PyQuery to those new to web scraping and to those Beautiful Soup users who may be interested in an alternative.

Quickly, a word of caution; then, I promise, we’ll get to the code. Programmatic access to web content is by design much faster than the human interactions sites were built to serve. It is polite to include some time during which your program sleeps so as not to overwork the web servers. When scraping data, please be mindful of the other users of the site and especially of the folks who work hard to serve its content. If your scraper uses too much bandwidth, web admins can identify your IP address and ban you.
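As a minimal sketch, a polite scraping loop might look like the following (the page-{} url suffix is my guess at how the forum paginates; verify it in your browser before relying on it):

import time
from pyquery import PyQuery as pq

# assumed pagination pattern -- confirm in a browser before relying on it
BASE_URL = 'https://www.talkbass.com/forums/for-sale-bass-guitars.126/page-{}'

pages = []
for n in range(1, 4):
    pages.append(pq(url=BASE_URL.format(n)))
    time.sleep(5)  # pause between requests so we don't overwork the server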

Finally, to the good stuff. PyQuery is a wrapper around the lxml Python library that emulates the syntax of the JavaScript library jQuery. The PyQuery documentation itself is not exactly vast, but that’s ok because virtually all jQuery documentation is directly relevant. The reason I’ve always used PyQuery is that when I was getting started with web scraping, I was also pursuing a project using JavaScript and jQuery; it was a two birds, one stone situation. If you’d like to get familiar with jQuery before diving straight into web scraping, Codecademy is an excellent resource.
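If you’ve written jQuery, PyQuery’s API will feel immediately familiar. Here’s a tiny illustration on a made-up html fragment:

from pyquery import PyQuery as pq

# parse a small html fragment (the markup is purely illustrative)
d = pq('<div><a class="title" href="/threads/123/">1972 Fender Jazz</a></div>')

# jQuery-style selection and accessors
print(d('a.title').text())        # 1972 Fender Jazz
print(d('a.title').attr('href'))  # /threads/123/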

You can install PyQuery using

pip install pyquery

Then load the library using

from pyquery import PyQuery as pq

We can initialize a PyQuery object from the web url of interest using

d = pq(NAME_OF_URL)

where the site we want to load in this case is https://www.talkbass.com/forums/for-sale-bass-guitars.126/. Let’s open up this page in a web browser and have a look around. The typical layout is 30 basses per page, and on the first page, there are two sticky threads from which we do not wish to scrape data.

[Screenshot: the first page of the talkbass For Sale: Bass Guitars forum]

To be able to scrape the data we want to keep, we need to know what the html structure of the web page looks like. For this, I’ve found the Mozilla Firefox Web Console to be invaluable. The Google Chrome browser has similar functionality, but I’ve never used it. To access the console from Firefox, the easiest thing to do is right-click an element on the page and select “Inspect Element”. The console will pop up at the bottom of your web browser, and the html block that corresponds to where you clicked will be highlighted.

As you hover over the lines of html in the console, the corresponding items on the web page will be highlighted. Alternatively, if you click the cursor icon at the top left of the console, you can hover over the web page and the corresponding lines of html will be highlighted. I’ve found this to be the easiest way to figure out where the data I want to scrape is embedded in the html. A final word on the console: you’ll see grey triangles next to the lines of html; you can click these to expand or collapse the html objects. If you’re not seeing the html code you’re looking for, try expanding some of the html objects.

“The L. I. Mystique — you sneak to peak
A look, and then you know that we’re never weak.”
–Chuck D, “Timebomb”

We’ve already loaded the page into a PyQuery object using the code above. Next, let’s create a list of PyQuery objects that contain the thread data.

def get_threads(d):
    """
    d : a PyQuery object containing web page html
    returns: a PyQuery selection of the thread li elements whose id begins
    with "thread-" and whose class does not contain the string 'sticky'
    """
    return d('li[id^="thread-"]:not(.sticky)')

Each thread is identified by an html list item (li) with id attribute beginning with (^=) “thread-”. But as I mentioned, the first page has a couple of sticky threads regarding the rules of the forum that don’t contain information about basses for sale. The code :not(.sticky) tells PyQuery to exclude any threads that have the string “sticky” as part of their class (.) attribute.
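To see the selector in action, here’s a toy example using simplified markup (real talkbass threads carry many more attributes than this, and the class names here are illustrative):

sample_html = """
<ul>
  <li id="thread-1" class="discussionListItem sticky">Forum rules</li>
  <li id="thread-2" class="discussionListItem">1964 Fender Precision</li>
</ul>
"""

threads = get_threads(pq(sample_html))
print(len(threads))        # 1 -- the sticky thread is excluded
print(threads.attr('id'))  # thread-2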

Next, let’s feed each thread in this list to a class that extracts the data we wish to keep, building up a list of dictionaries corresponding to the threads. The choice of dictionary objects and some of the naming conventions will become clearer in my next post, where I’ll discuss MongoDB.

class ThreadDataExtractor(object):
    """
    Extracts thread data to be stored as MongoDB document
    Attributes
    ----------
        thread: lxml.html.HtmlElement
            contains a for sale thread link
        data: dictionary
            contains thread data
    Methods
    -------
        extract_data
            populates fields of data attribute
    """

    def __init__(self, thread):
        self.thread = thread
        self._d = self._parse_thread()
        self.data = {}

    def _parse_thread(self):
        return pq(self.thread)

    def extract_data(self):
        self.data['_id'] = self._extract_thread_id()
        self.data['username'] = self._extract_username()
        self.data['thread_title'] = self._extract_thread_title()
        self.data['image_url'] = self._extract_image_url()
        self.data['post_date'] = self._extract_post_date()

    def _extract_thread_id(self):
        return self._d('li').attr['id'][len('thread-'):]

    def _extract_username(self):
        return self._d('li').attr['data-author']

    def _extract_thread_title(self):
        return self._d('.PreviewTooltip').text()

    def _extract_image_url(self):
        return self._d('.thumb.Av1s.Thumbnail').attr['data-thumbnailurl']

    def _extract_post_date(self):
        post_date = self._d('span.DateTime').text()
        # if thread has been posted within the last week, date is contained
        # elsewhere
        if post_date == '':
            post_date = self._d('abbr.DateTime').attr['data-datestring']

        return post_date

The first element we want to keep is the thread id. As we saw above, all the threads can be identified by a list item with id attribute beginning with “thread-”. We want to keep the string of digits that follows in the id, and we’ll use it as a unique identifier for each thread. Also, we can easily navigate to the thread page itself using the url https://www.talkbass.com/threads/*, substituting the thread id for the *; talkbass resolves the thread from the id alone. We can also extract the thread author from the list item attribute “data-author”.
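For example (a hypothetical snippet; the exact url pattern is worth double-checking against the site):

# substitute the extracted thread id for the * in the url above
thread_id = '123456'  # an illustrative thread id
thread_url = 'https://www.talkbass.com/threads/{}/'.format(thread_id)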

The thread title can be pulled from the text of a hyperlink tag with “PreviewTooltip” class (.) attribute. Another hyperlink tag with class “thumb.Av1s.Thumbnail” contains the image url in its “data-thumbnailurl” attribute. Finally, the post date is a little tricky. If the thread is more than a week old, the post date can be found in a span tag with class attribute “DateTime”. If the thread is more recent, talkbass displays the day of the week of the post rather than the month, day, and year. In this case, the date can be found in an abbreviation tag with class “DateTime” under the attribute “data-datestring”.
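Putting the pieces together, a sketch of the full loop over one page of threads might look like this:

# build a list of data dictionaries, one per thread on the page
records = []
for thread in get_threads(d):  # iterating a PyQuery selection yields lxml elements
    extractor = ThreadDataExtractor(thread)
    extractor.extract_data()
    records.append(extractor.data)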

Hopefully this example will have you well on your way to using PyQuery for your web scraping. Here is the github link to the code above. Additional examples of how I’ve used PyQuery can be found here and here.
