AHK: How to Fix an Airbnb Scraper

Just when I thought I was going to have a nice, relaxing Processing weekend, I realized my Airbnb scraper is broken. Fun fact: If they change their HTML, this script is WILDLY unlikely to work. I’m getting the sense that Airbnb is going to be the wild duck to my Sumatran tiger.

Since the scraper is how I’m gathering data for all the other scripts I’m trying to write, I spent the weekend fixing it.

I’d never had to debug a scraper before, but the main thing I learned is that it’s a task best undertaken with the Page Inspector.

What’s the Page Inspector?

Originally, I was violently rage clicking through Notepad documents full of HTML. Turns out, there exists a marvelous Firefox tool called the Page Inspector. Right-click whatever specific piece of the page you want to see the HTML for, and it’ll show it to you COLOR CODED.

Why didn’t I use this bastard the first time around? Didn’t know it was a thing.

It’s a thing.

Hold up. This is longer. What else did you change?

Added a GUI because it was really necessary.

And the code to loop through several page results within Airbnb because a single page only has 18 listings. To get a decent data set to play with, Imma need more than that. Now… The Airbnb URL looks hellish, but it’s actually somewhat logical. Inside all the &’s and %’s, you’ll find the city, check-in date, check-out date, page number, and coordinates for the northeast and southwest map corners.
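To see the logic hiding in that URL, here’s a rough sketch that splits the query string on & and drops each key=value pair into an object. The URL is a shortened copy of the one used in this post, and the variable names (url, query, params) are just ones I made up for the example. Percent-encoded values are left as-is.

```autohotkey
; Shortened copy of the search URL from this post (most parameters dropped).
url := "https://www.airbnb.com/s/San-Francisco--CA?guests=1&page=1&ne_lat=37.912159028101655&sw_lat=37.86051531470982&zoom=11"

; Everything after the "?" is the query string.
query := SubStr(url, InStr(url, "?") + 1)

; Split on "&" and store each key=value pair in an object.
params := {}
Loop, Parse, query, &
{
    eq := InStr(A_LoopField, "=")
    if (eq = 0)
        continue    ; skip anything that is not key=value
    key := SubStr(A_LoopField, 1, eq - 1)
    params[key] := SubStr(A_LoopField, eq + 1)
}

MsgBox % "Page: " . params["page"] . "`nNE latitude: " . params["ne_lat"]
```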

The GUI needs the URL of the first search results page so the script can chop it in half, before and after the “page” parameter. Then the script loops through the page numbers to get all the data.
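The script does the chopping with fixed character offsets, which quietly assumes the pasted URL has a single-digit page number. As an alternative (not what the script does), here’s a minimal sketch that swaps the page number with RegExReplace instead. The shortened URL and the names firstUrl, pageCount, and urls are made up for illustration.

```autohotkey
; Shortened stand-in for the first search results page URL.
firstUrl := "https://www.airbnb.com/s/San-Francisco--CA?guests=1&page=1&zoom=11"
pageCount := 3    ; hypothetical number of result pages to fetch

urls := ""
Loop, %pageCount%
{
    ; Replace the digits after "page=" with the current loop number.
    pageUrl := RegExReplace(firstUrl, "page=\d+", "page=" . A_Index)
    urls .= pageUrl . "`n"
}

; Shows one URL per line, for pages 1 through 3.
MsgBox % urls
```

Since the pattern matches any run of digits after page=, it shouldn’t matter which results page the starting URL was copied from.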

The Script Annotated to Within an Inch of its Life

Despite it being a relatively simple fix, I had to walk through the original script line by line to figure out where exactly the problem was. While I was doing that, I added more comments to help me along next time it breaks.

[Images: five screenshots of the annotated script]

The Copy/Paste Version:


; GUI: collect the temp file paths, the page count, and the first search results URL.
; Literal percent signs in the default URL are escaped with a backtick.
Gui, Add, Text, , Temporary txt Filepath
Gui, Add, Edit, vFilepath W400, C:\Users\Desktop\Airbnb.txt
Gui, Add, Text, , Temporary CSV Filepath
Gui, Add, Edit, vcvFilePath W400, C:\Users\Desktop\Airbnb.csv
Gui, Add, Text, , How many pages?
Gui, Add, Edit, vpagecount,
Gui, Add, Text, , Page URL
Gui, Add, Edit, vINITIALURL R8 W400, https://www.airbnb.com/s/San-Francisco--CA?guests=1&adults=1&children=0&infants=0&ss_id=x3k8ohu0&ss_preload=true&source=bb&page=1&allow_override`%5B`%5D=&ne_lat=37.912159028101655&ne_lng=-122.32165604091637&sw_lat=37.86051531470982&sw_lng=-122.56060867763512&zoom=11&search_by_map=true&s_tag=aB5-b2TQUnited-States

Gui, Add, Button, , OK
Gui, Show, ,The Scraper
Return

ButtonOK:
Gui, Submit

; Split the URL around "page=N". Note that RIGHTURL ends up holding the
; left-hand chunk (everything up to the "p" of "page") and LEFTURL the
; right-hand chunk (everything after the page number).
; The fixed offset assumes the pasted URL has a single-digit page number.
If Instr(INITIALURL, "page")
{
    foundPos := InStr(INITIALURL, "page", false, 1,1)
    StringTrimLeft, LEFTURL, INITIALURL, foundPos + 5
    StringTrimRight, RIGHTURL, INITIALURL, StrLen(INITIALURL)-foundpos
}

; Parallel arrays, one slot per listing, plus a counter for each.
listing := []
price := []
desc := []
lat := []
lon := []
dind :=0
latind:=0
lonind:=0
ind :=0

; Pass 1: walk the search result pages and collect listing IDs and prices.
LoopCount:= 1
Loop, %pagecount%
{
    ; Rebuild the URL for this page number and download it.
    AURL := RIGHTURL . "age=" . LoopCount . LEFTURL
    URLDownloadToFile, %AURL%, %FilePath%
    FileRead, AllTheText, %FilePath%
    ; Split the HTML into chunks at each "<".
    Loop, Parse, AllTheText,<,>
    {
        If InStr(A_LoopField, "data-hosting_id")
        {
            ; Keep the listing ID that sits between data-hosting_id=" and data-review_count.
            foundPos := InStr(A_LoopField, "data-hosting_id=")
            StringTrimLeft, OutputVar, A_LoopField, foundPos+16
            foundPos := RegExMatch(OutputVar, "data-review_count")
            StringTrimRight, Finalvar, OutputVar, StrLen(OutputVar)-foundPos+3
            listing[ind] := FinalVar
        }
        else If Instr(A_LoopField, "pricerate")
        {
            ; The price is whatever follows the first ">" in this chunk.
            foundPos := InStr(A_LoopField, ">", false, 1, 1)
            StringTrimLeft, OutputVar, A_LoopField, foundPos
            price[ind] := OutputVar
            ind++
        }
        else
        {
            continue
        }
    }
    filedelete, %FilePath%
    Loopcount++
}

; Pass 2: open each listing page and pull the description and coordinates.
For index, value in listing
{
    URLDownloadToFile, https://www.airbnb.com/rooms/%value%, %filepath%
    FileRead, AllTheText, %filepath%
    ; Keep only the text up to the first "Twitter" and overwrite the temp file with it.
    foundPos := InStr(AllTheText, "Twitter",False,1,1)
    StringTrimRight, TText, AllTheText, StrLen(AllTheText) - foundPos
    fileDelete, %Filepath%
    FileAppend, %TText%, %Filepath%
    Loop, Parse, TText,<,>
    {
        ; In each branch: drop everything through content=" and the 9 trailing characters.
        If InStr(A_LoopField, "og:description")
        {
            foundPos := InStr(A_LoopField, "content")
            StringTrimLeft, OutputVar, A_LoopField, foundPos+8
            StringTrimRight, Finalvar, OutputVar, 9
            desc[dind]:=FinalVar
            dind++
        }
        else If Instr(A_LoopField, "latitude")
        {
            foundPos := InStr(A_LoopField, "content")
            StringTrimLeft, OutputVar, A_LoopField, foundPos+8
            StringTrimRight, Finalvar, OutputVar, 9
            lat[latind]:=FinalVar
            latind++
        }
        else If Instr(A_LoopField, "longitude")
        {
            foundPos := InStr(A_LoopField, "content")
            StringTrimLeft, OutputVar, A_LoopField, foundPos+8
            StringTrimRight, Finalvar, OutputVar, 9
            lon[lonind]:=FinalVar
            lonind++
        }
    }
}
filedelete, %FilePath%

; Write one CSV row per listing: index, listing ID, price, latitude, longitude, description.
U:=0
For index, value in listing
{
    description := desc[U]
    ; Swap commas for spaces so the description stays in one CSV column.
    StringReplace, description, description, `,, %A_Space%, All
    li := listing[U]
    pri := price[U]
    latitude := lat[U]
    longitude := lon[U]
    fileappend, %index%`,%li%`,%pri%`,%latitude%`,%longitude%`,%description%`n, %cvFilePath%
    U++
}
MsgBox Done
return
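
The og:description, latitude, and longitude branches all repeat the same trim arithmetic, which is exactly the kind of thing that breaks when the HTML shifts by a character. One possible way to tidy that up (a sketch only, not part of the script above) is a small helper that grabs whatever sits inside content="..." with RegExMatch. The function name GetContentAttr and the sample tag text are hypothetical, and it assumes the value is wrapped in double quotes.

```autohotkey
; Hypothetical chunk, shaped like what Loop, Parse hands back for a meta tag.
tag := "meta property=""og:description"" content=""Sunny room near the park"" /"
MsgBox % GetContentAttr(tag)
return

; Returns the text inside content="...", or an empty string if there is none.
GetContentAttr(tag)
{
    ; Capture everything between content=" and the next double quote.
    if (RegExMatch(tag, "content=""([^""]*)""", m))
        return m1
    return ""
}
```

Because it looks for the closing quote instead of counting characters from the end, it shouldn’t care how many characters trail after the value.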

I don’t know how many weeks I’ve been saying this now, but next weekend it’s back to Processing. Pretty graphics. Maybe even a gif or two.