AHK: How to Write a Web Scraper

I love AirBnB. But every now and then, I don’t want to scroll through multiple pages of listings to find a place to stay. I just want to be able to see a simple spreadsheet. That’s exactly what a web scraper is meant to do.

How Does It Work?

The first part of the code is actually ridiculously easy. UrlDownloadToFile downloads a webpage’s HTML into a text file that you can open in Notepad. The real difficulty comes in sifting through the HTML for the pieces you want.
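
As a minimal sketch of that first step (AutoHotkey v1 syntax), the snippet below downloads a page to a text file and reads it back into a variable. The URL and the temp file name are placeholders chosen for illustration, not values from this post.

```autohotkey
; Download a page's HTML to a text file, then load it into a variable.
; The URL and file name below are placeholders, not values from the post.
Url := "https://www.example.com/"
TempFile := A_Temp . "\scrape_demo.txt"

UrlDownloadToFile, %Url%, %TempFile%
if ErrorLevel  ; ErrorLevel is set to 1 when the download fails
{
    MsgBox, Download failed.
    ExitApp
}

FileRead, PageHtml, %TempFile%
MsgBox, % "Downloaded " . StrLen(PageHtml) . " characters of HTML."
```

Checking ErrorLevel right after the download is optional, but it keeps the rest of the script from parsing an empty or stale file.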

Never heard of HTML? It’s the language that makes webpages look pretty. And readable. Wired has a marvelous HTML Cheatsheet here.

Unfortunately, AirBnB’s search results are absolute insanity. I wish I was kidding. This one page is 11,849 words long. That translates into 149 pages in Word.

Airbnb search results in HTML. It's not pretty.

I didn’t actually go through every line because I know I only want one thing from the initial search page—the listing ID. Every listing’s individual page (the page that has the full description, price, amenities info) has a set template to its URL.

https://www.airbnb.com/rooms/ROOM-ID#

What’s lovely is that if you search the HTML for the name of the first listing on the AirBnB search results, a data-hosting_id attribute follows soon after, and its value matches the ID in that listing’s “About this listing” URL.

It's a match.

Part 1: Trim the Listing IDs

Ideally, we’re going to need to pull all those listing IDs into an array so we can scrape those pages for information in Part 2. Manipulating strings in AutoHotkey can be a bit tricky, but for our purposes today, there are really only a few commands to get it done.

The ID in the HTML starts off with data-hosting_id=", so we’ll start off by parsing the text using the < as a delimiter. We only need the pieces with data-hosting_id inside them, and that’s where if InStr comes in.

if InStr(String, Search4ThisTextWithinString)
{do this}

Plug that into the parsing loop and voilà: you’ll only be cycling through the pieces that contain a data-hosting_id.

Find the pieces you need.
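
Here is a rough, self-contained sketch of that filtering loop (AutoHotkey v1). It runs on a made-up one-line sample instead of a real download; the markup, IDs, and review counts are invented for illustration and will not match Airbnb's actual HTML.

```autohotkey
; Sample markup standing in for the downloaded page; every value is made up.
SampleHtml := "<div data-hosting_id=""1234567"" data-review_count=""12""><span>Cozy loft</span><div data-hosting_id=""7654321"" data-review_count=""3"">"

Matches := ""
; Split on "<" and drop any ">" found at the very start or end of each piece.
Loop, Parse, SampleHtml, <, >
{
    ; Keep only the pieces that mention the attribute we care about.
    if InStr(A_LoopField, "data-hosting_id")
        Matches .= A_LoopField . "`n"
}
MsgBox, %Matches%
```

Only the two div pieces survive the filter; the span and closing-tag pieces are skipped because they do not contain the attribute name.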

You might have noticed that A_LoopField is still far too much text. I’m sure there’s a simpler way of doing this, but since I couldn’t find it, I’m using StringTrimLeft, StringTrimRight, and RegExMatch. RegExMatch searches a string for a pattern and returns the position of the first match (or 0 if there is none), which you can assign to a variable.

FoundPos := RegExMatch(String, Search4ThisTextWithinString)

Then you can use StringTrimLeft and StringTrimRight to chop off the unnecessary pieces.

StringTrimLeft, Var4TrimmedString, String, # of Characters to cut from the left
StringTrimRight, Var4TrimmedString, String, # of Characters to cut from the right
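
To make the trimming arithmetic concrete, here is a small worked sketch (AutoHotkey v1) on a single made-up piece of markup. It assumes the ID is immediately followed by a closing quote, one space, and data-review_count, which is the layout the offsets in this post rely on; the sample values are invented.

```autohotkey
; One piece as the parsing loop might deliver it; the values are made up.
Piece := "div class=""listing"" data-hosting_id=""1234567"" data-review_count=""12"""

; Cut everything up to and including data-hosting_id=" :
; (FoundPos - 1) characters before the match, plus 16 for the name and "=", plus 1 for the quote.
FoundPos := RegExMatch(Piece, "data-hosting_id=")
StringTrimLeft, Rest, Piece, FoundPos + 16

; Rest now starts with the ID. Keep only the characters before the closing quote
; and the space that precede data-review_count.
FoundPos := RegExMatch(Rest, "data-review_count")
StringTrimRight, ListingId, Rest, StrLen(Rest) - FoundPos + 3

MsgBox, %ListingId%   ; shows 1234567
```

If the attributes appear in a different order or with different spacing, the fixed offsets (16 and 3) have to be adjusted to match.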

Here’s how it applies inside the parsing loop:

Trim it all down.
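
For comparison, one possibly simpler route is to let RegExMatch capture the ID directly with a subpattern instead of trimming by position. This is only a sketch of an alternative, not the method the script in this post uses, and it assumes the ID consists solely of digits; the sample piece is made up.

```autohotkey
; A made-up piece of markup, as the parsing loop might deliver it.
Piece := "div class=""listing"" data-hosting_id=""1234567"" data-review_count=""12"""

; The parentheses capture the digits between the quotes. In AutoHotkey v1 the
; whole match is stored in Match and the first captured group in Match1.
if RegExMatch(Piece, "data-hosting_id=""(\d+)""", Match)
    MsgBox, %Match1%   ; shows 1234567
```

The capture version does not depend on which attribute follows the ID, so it is less sensitive to changes in attribute order.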

Part 2: Put the Data into a CSV File

Once you’ve got the listing IDs into an array, RegExMatch, StringTrimRight, and StringTrimLeft will help you carve out everything you want from the individual listing pages. The code below makes price, latitude, longitude, and description arrays for easy file appending.

But before putting those into a CSV file, it’s important to remember that CSV stands for comma-separated values. When the file opens in Excel, every comma is treated as a signal to start a new cell.

This is a problem because AirBnB descriptions have commas all over the place. My suggestion is to just replace them with spaces to make life easier.

StringReplace, NewString, StringToSearch, CharactersToSearchFor, ReplacementText, All

The StringToSearch and NewString variables can be the same if you don’t want to create a new variable. Since we’re looking to replace a comma, we’ll have to escape the comma with a ` in the code so that it’ll work properly.

In this case, that’ll look like this:

StringReplace, description, description, `,, %A_Space%, All
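
A small self-contained sketch of the comma replacement (AutoHotkey v1), run on a made-up description rather than real listing text:

```autohotkey
; A made-up description with commas in it.
description := "Sunny, quiet room, close to the metro"

; `, is an escaped (literal) comma, %A_Space% is the replacement text,
; and All replaces every occurrence instead of just the first.
StringReplace, description, description, `,, %A_Space%, All

MsgBox, %description%

; On AutoHotkey v1.1.21 or later, the StrReplace function does the same job:
; description := StrReplace(description, ",", " ")
```

Because each comma in the sample is followed by a space, the result contains double spaces where the commas used to be. That is harmless in a spreadsheet cell.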

If all has gone well, you should be able to get a nice spreadsheet with just these few last steps:

An AirBnB scraper at work.
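
As a rough sketch of those last steps (AutoHotkey v1), the loop below appends one CSV row per entry from two parallel arrays. The sample data, the header row, and the output path are all invented for illustration; the full script below writes more columns.

```autohotkey
; Made-up sample data standing in for the scraped arrays.
ids    := ["1234567", "7654321"]
prices := ["85", "120"]
CsvPath := A_Temp . "\listings_demo.csv"   ; placeholder output path

FileDelete, %CsvPath%                      ; start from an empty file
FileAppend, id`,price`n, %CsvPath%         ; header row (commas must be escaped)
for index, id in ids
{
    price := prices[index]                 ; same index in the parallel array
    FileAppend, %id%`,%price%`n, %CsvPath%
}
MsgBox, Wrote %CsvPath%
```

Every comma that should end up in the file has to be escaped with a backtick, since an unescaped comma would be read as the separator between FileAppend's parameters.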

The Code

;PART 1 - EXTRACT THE DATA

Filepath:= "C:\Change this to a File Path for TXT file"
cvFilePath:= "C:\Change this to a File Path for CSV file"

URLDownloadToFile, https://www.airbnb.com/Change this to Search Result URL, %FilePath%

;TRIM IT DOWN TO THE LISTING IDS
listing := []
ind :=0
FileRead, AllTheText, %FilePath%
Loop, Parse, AllTheText,<,>
{
If InStr(A_LoopField, "data-hosting_id")
{
foundPos := RegExMatch(A_LoopField, "data-hosting_id=")
StringTrimLeft, OutputVar, A_LoopField, foundPos+16
foundPos := RegExMatch(OutputVar, "data-review_count")
StringTrimRight, Finalvar, OutputVar, StrLen(OutputVar)-foundPos+3
listing[ind] := FinalVar
ind++
}
}
filedelete, %FilePath%

;TRIM THE LISTINGS
;PREPARE THE ARRAYS
desc := []
lat := []
lon := []
price := []
dind :=0
latind:=0
lonind:=0
p:=0

;DOWNLOAD HTML FOR EACH LISTING PAGE
For index, value in listing
{
URLDownloadToFile, https://www.airbnb.com/rooms/%value%, %filepath%
FileRead, AllTheText, %filepath%

;TRIM THE LISTING HTML TEXT DOWN TO A MANAGEABLE SIZE
;Keep only the text up to the first "tax_amount_usd". If that marker is missing, foundpos is 0 and TrimmedText ends up empty.
foundpos := RegExMatch(AllTheText, "tax_amount_usd")
StringTrimRight, TrimmedText, AllTheText, StrLen(AllTheText)-foundpos
FileDelete, %filepath% ;The trimmed text is already in the TrimmedText variable, so the temp file can go

;TRIM OUT THE DESCRIPTION, PRICE, AND COORDINATES
Loop, Parse, TrimmedText,<,>
{
If InStr(A_LoopField, "og:description")
{
foundPos := RegExMatch(A_LoopField, "content")
StringTrimLeft, OutputVar, A_LoopField, foundPos+8 ; To delete the meta property bit
StringTrimRight, Finalvar, OutputVar, 9
desc[dind]:=FinalVar
dind++
}
else If Instr(A_LoopField, "latitude")
{
foundPos := RegExMatch(A_LoopField, "content")
StringTrimLeft, OutputVar, A_LoopField, foundPos+8 ; To delete the meta property bit
StringTrimRight, Finalvar, OutputVar, 9
lat[latind]:=FinalVar
latind++
}
else If Instr(A_LoopField, "longitude")
{
foundPos := RegExMatch(A_LoopField, "content")
StringTrimLeft, OutputVar, A_LoopField, foundPos+8 ; To delete the meta property bit
StringTrimRight, Finalvar, OutputVar, 9
lon[lonind]:=FinalVar
lonind++
}
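else If Instr(A_LoopField,"base_price_usd")
{
;Search the whole page text (not just this piece) for the first base_price_usd
foundPos := RegExMatch(AllTheText, "base_price_usd")
StringTrimLeft, OutputVar, AllTheText, foundPos+15 ; To delete the label and the two characters after it
StringTrimRight, Finalvar, OutputVar, StrLen(OutputVar)-2 ; Keeps only the first 2 characters, so a price of 100 or more gets cut short
price[p]:= Finalvar
p++
}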
}
}

; LOOP THROUGH EACH ARRAY AND APPEND CONTENT TO CSV FILE
For index, value in listing
{
description := desc[index]
StringReplace, description, description, `,, %A_Space%, All
li := listing[index]
pri := price[index]
latitude := lat[index]
longitude := lon[index]
fileappend, %li%`,%pri%`,%latitude%`,%longitude%`,%description%`n, %cvFilePath%
}

MsgBox Done
return
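
A possible variation, which is not what the script above does: instead of writing every page to a temp file, the HTML can be fetched straight into a variable with the WinHttpRequest COM object that ships with Windows. This is only a sketch with a placeholder URL, and it has not been tried against Airbnb's pages.

```autohotkey
; Fetch a page straight into a variable, with no temp file on disk.
; The URL is a placeholder, not one from the post.
Url := "https://www.example.com/"

whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
whr.Open("GET", Url, true)   ; true = send the request asynchronously
whr.Send()
whr.WaitForResponse()        ; block until the response has arrived
PageHtml := whr.ResponseText

if InStr(PageHtml, "data-hosting_id")
    MsgBox, Found at least one listing ID.
else
    MsgBox, No listing IDs in this page.
```

With the page text already in PageHtml, the same Loop, Parse and InStr steps from Part 1 can run on it directly, and there is no file to delete afterwards.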

 


4 thoughts on “AHK: How to Write a Web Scraper”

  1. J. Rowda says:

    Interesting stuff! Thanks for sharing. I use something similar for ebay listings, but instead of having a fixed path, I use a quick GUI to ask the user to paste the URL or to enter a keyword, say like “red shoes”, to get results containing those terms only.

    • Yeah, this script could use a GUI. It’d be awesome to program something that could scan Airbnb listings for specific keywords, but it’s beyond my capabilities at this point. Any chance you’ve posted your ebay script?

      • J. Rowda says:

        What I usually do is just use IfInString to find the specific keywords you want. You need to run a loop that goes through every sentence, and if the keyword is included, just do something with that string, like add to a different file or just print it to the user. I’m no expert and I do a lot of trial and error, but that approach should work. I can’t post the ebay one as I developed it for work but can certainly share something somewhat similar!

  2. It’d be cool to see something similar. Out of curiosity, are you using URLDownloadToFile to pull each ebay listing’s HTML into a text file saved on your computer so you can scan it with IfInString? I’d really like to find a way to adapt this code to scan all airbnb listings in every city for keywords, but it’ll generate a ridiculous number of text files. My only solution thus far is having AHK create a file with the HTML, scan it for keywords, push any information with the keywords into a different folder, then delete the HTML file before moving onto the next listing, but well, it seems like there should be a more efficient way of scanning the page directly.
