In this post, I will describe the process of bringing tweets from over the past year into a single heat map with interactive (clickable) tweets displaying more information. Most importantly perhaps, in what follows I will also readily discuss the flaws and problems of the methodology, in order to hopefully gain some insights into how to change the methodology to make this part of my project more interesting.
- male strip1
- male striptease1
- male stripclub1
- Male Erotic Dancing1
- male revue1
All of those tweets were collected in (two separate) Google Docs spreadsheets. As of October 28, I have collected 22,683 tweets (9,511 + 13,172).
Note: some bugs in the script creates some double entries of the same tweets into the spreadsheet. They’re easy to identify as every tweet has an ID assigned from Twitter, which makes it easy to sort them out. More about that below.
Note: Some tweets overlap with Instagram posts but I haven’t implemented any other APIs at this point, on top of the Twitter API, which means that there will be some overlap in the future, should I choose to implement the Instagram API, for instance.
2. Export the tweets, import and mapping onto a mysql database
The tweet spreadsheet was exported as a .csv file into a MySQL database, set up by myself on servers hosted by Dreamhost. I chose Dreamhost as they allow for remote access, so I could use a desktop client to administer the databases.
Note: Step 1 and 2 could be combined into one, if I wrote my own cron script that would run every hour requesting the same information as TAGS. I chose to use Hawksey’s TAGS because of the expedience in setting it up. Rather than having to program the code myself, the script is already set up and ready to go in a Google Doc.
Note: I could have used phpMyAdmin to administer the database but prefer to use a desktop client as it’s faster to handle.
3. Clean up some of the data
I was in need to removing some duplicate data that existed, because of some programming error in Hawksey’s script, in my own data. I set up the MySQL database with a UNIQUE key for the tweet ID column, which means that the import feature of MySQL will automatically disregard any duplicate values for any entries. (Note: it doesn’t actually disregard but produce an error; in this case, however, the error is in fact desired.)
In this step, I also made sure to do some manual cleanup of the data that was necessary by going over the ~8,000 unique tweets I ended up with after the procedure with the UNIQUE key to see if any tweets were unintentionally cut off, or containing characters that were encoded in a different format, for instance. (Note: I made sure that both the CSV file and the database were UTF-16 encoded to be able to include a few tweets written in Japanese, Arabic, Hebrew, etc.)
Note: Hawksey’s script contains an option to wipe duplicate entries but that didn’t work so I needed a solution for this.
4. Pull user data from the database of tweets
One of the columns in my data contained the user ID for the tweeter behind the tweet. I wrote a PHP script to pull the unique user IDs, and then their respective affiliated tweet IDs from the MySQL database, and add these to a separate table to reduce the amount of queries necessary in step 5 below.
5. Request the user info from the Twitter API
For each of the unique user IDs, I submit 100 requests at a time to the Twitter API, since I am interested in pulling the location for the user who has tweeted
Note: this is a step I actually have had to do manually as of now because I haven’t written a cron script to perform this part of the process, since there needs to be some time in between the requests to the Twitter REST API.
Note: This step is where the ethical and the methodological aspects of my method become somewhat questionable, which I’d like to address here:
- Methodologically, here’s a related problem as well: There’s not really any way to anonymize data that’s publicly available. You can easily search for the text quoted, for instance. (This becomes a problem in studies such as Kim Price-Glynn’s where she doesn’t want to write the real name of the strip bar where she does her studies; yet, in her last chapter she analyzes and quotes online conversations about the club, which makes it easy for someone to search the web, or archive.org, for the literal quotes, and find the name of the bar that way.)
- Methodologically as well, it is problematic because it doesn’t indeed show the location of the tweet, per se, but rather the (self-professed) location of the user. That means that there will be many users layered on top of each other in a location such as New York, and few in New Orleans, despite the fact that many New Yorkers are indeed tweeting from New Orleans while traveling. (That is, the method doesn’t capture tweets from other locations than the user’s professed “home base.”) Moreover, some users have multiple locations on their accounts: New York, San Francisco, and New Orleans, for example. The method has no way, currently, of mapping such accounts. On the prototype, I have used the first city mentioned, while acknowledging that this is a flaw.
6. Pull all unique locations from user information
From all of the users locations, I have written a PHP script to, once again, pull all the locations into a list of unique locations, mapped onto a location ID (which will prove necessary in the step creating relationships between the database tables).
6. Manual clean-up of the locations
In an attempt at creating a consolidated list of locations, I needed to manually clean up misspellings (since this is information a user types in, New York can be spelled in a number of ways) or abbreviations of city names (change NYC and at times NY to New York).
Note: I haven’t figured out a way to do this automatically yet. Theoretically, I could write a script that would pull from a list of the occurrences of corrections that I’d like to do and perform them automatically but the list would still have to be managed manually.
7. Geolocate each of the locations
Batch geocoding of all the locations is done through the free and open Batch Geocoding website.
Note: There should be other ways of performing this step but I haven’t figured that out yet. I might use Google’s geocoding API. The problem with their API (and many others) is that it’s limited to 2,500 requests per day. I may need to apply for research funds for this part of the project.
8. Assign weights to locations
The locations are not very precise, which means that lots of tweets will be layered on top of each other. One way to deal with this for now is to assign weights to each location, in order to create a heatmap for the respective tweets.
Problem here: that means that the interactive aspect of the map is lost. I don’t believe CartoDB has a way to solve this yet which means that to maintain the interactive aspect of the map, I might need to migrate from the CartoDB platform.
9. Create relationships between all the tables in the database
10. Export the data to CartoDB
11. Creating the maps in CartoDB
A note on collaboration
As a goal in the project, I want to offer my package back to the community of DH scholars, by making all of it available via GitHub. This would probably entail the need for more research funding, unfortunately, as I will need to make my scripts a little less specific to my own research. The Graduate Center, CUNY offers Digital Fellowships which may cover this part of the project in the near future.
1 Since July 17 — before then, the project was going to focus explicitly and only on boylesque as a genre.