Want to share a PDF on the Internet? Not as trivial as you would think.

“Big Data in Malaysia: Emerging Sector Profile 2014” was launched and released a month ago, but we only just removed the registration wall. There were four distinct challenges with getting to this point:

  1. Conceptualising this whole thing and doing the research and interviews and analysis and shepherding all the stakeholders including sponsors for the lucky draw prizes and probably a bunch of other things that I forgot about!
  2. Writing up the report.
  3. Working with a designer to put lipstick on the whole thing.
  4. Finding a place to host the doggamn file.

Point 4 was more work than I anticipated, which is why I wanted to document our experience. I should mention up front that paid hosting was out of the question: we didn’t make a cent off any of this work, and it had already cost us serious cash, not counting our time. So we were going to stick with FREE AS IN BEER for this one. There may have been better solutions that we hadn’t considered; we had to come up with something in a very limited amount of time. I’d love to hear about better solutions in the comments.

Hiding behind our registration wall

We had two distinct file hosting phases. Phase 1 was for a few weeks after we launched the report, and these were our requirements:

  • Reasonable availability; we didn’t want people to fail to access the report.
  • Data collection; we wanted some information on who was downloading the file (name, email address, plus some optional things like organisation name).
  • Traffic stats visibility.
  • Let people easily download the actual PDF file rather than presenting them some “value-added” view of it.

We considered a few alternatives. Google Drive on its own would have been reliable, but as far as I’m aware there’s no way to get stats out of it directly, nor any way to stick a registration wall on it. Most annoying of all was that (by default) Google Drive “helpfully” converts the file to its internal Google Docs format and presents it for webpage consumption, in the process breaking colour representation. This was a dealbreaker for us because we have some heatmaps in the report that we wanted reproduced as faithfully as possible. Also, embarrassingly, the image of MDeC’s CEO (who gave us a foreword) was rendered like 1980s CGA graphics for some reason. Not good.

Scribd might have done a better job, but we got some first-hand feedback that Scribd makes it hard to access the native PDF file. It’s understandable that they try to keep viewers within their walled garden (in return for extra organic traffic to your doc), but we preferred not to do that, valuing the download ease of a few key individuals over mass download traffic. This HN thread simply reinforced how we felt about Scribd.

We looked briefly at something called GumRoad, but there were too many unknowns and at this point we just needed a fix. So here’s what we came up with:

  • Upload the file to Dropbox.
  • Stick a bitly link to our Dropbox file (we’ll call this bitly link A).
  • Set up a Google Form to collect the data we wanted, with bitly link A presented on the “thanks” screen.
  • Stick a bitly link to our Google Form page (we’ll call this bitly link B).
  • Point our canonical report download URL (survey.bigdatamalaysia.org) to bitly link B.

Despite looking rather ugly, this actually got us to where we needed to be. Thanks to the bitly links we got the stats we wanted and, importantly, we could compute the registration wall dropoff by comparing clicks on link A against clicks on link B (A/B). There are some apparent downsides, such as the fact that the registration “wall” is thwarted simply by people sharing bitly link A. But the feedback we received was that far more people were simply sending the PDF to each other, which we can do nothing about. Nor were we super committed to preventing all that; we just wanted some idea of who was interested in our report.
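To make the A/B arithmetic concrete, here is a minimal sketch in Python. The function names and the click counts are made up for illustration (they are not our real numbers), and it assumes you have read the two click totals off the bitly stats pages by hand.

```python
def pass_through_rate(form_clicks: int, download_clicks: int) -> float:
    """Share of people who hit the form (link B) and went on to the file (link A)."""
    if form_clicks <= 0:
        raise ValueError("form_clicks must be positive")
    return download_clicks / form_clicks


def registration_dropoff(form_clicks: int, download_clicks: int) -> float:
    """Share of people who hit the form (link B) but never clicked the file (link A)."""
    return 1.0 - pass_through_rate(form_clicks, download_clicks)


# Hypothetical click counts, purely for illustration.
clicks_b = 400  # clicks on bitly link B (the Google Form)
clicks_a = 300  # clicks on bitly link A (the PDF itself)

print(pass_through_rate(clicks_b, clicks_a))     # 0.75
print(registration_dropoff(clicks_b, clicks_a))  # 0.25
```

Bear in mind that if people pass link A around directly, clicks on A can exceed clicks on B, in which case the "dropoff" goes negative and stops meaning much.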

So we set that all up nicely, tested it, and then on launch day… we got emails that the download was broken. Predictable.

Somehow Dropbox decided it needed people to log in to download the file, which obviously wasn’t going to work for us. We had no time to look into what precisely went wrong. We needed a quick fix (as in, I had minutes to fix it while at a conference). So we turned back to Google Drive. But what to do about the broken Google Docs rendering?

They don’t make it obvious, but there is a way to create a Google Drive link that downloads the file instead of bringing it up in Google Docs. So we crafted such a link (https://docs.google.com/uc?id=0Byj0SwnDzFHWbzljU200a2pvQUU&export=download), swapped links around to point to this, and voila, we were back in business.
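For anyone wanting to do the same, the link is just the file ID dropped into a fixed URL pattern. A minimal Python sketch follows; the helper name is made up, and the pattern is simply the one from our link above, so treat it as something that worked for us at the time rather than documented Google behaviour.

```python
from urllib.parse import urlencode


def drive_download_url(file_id: str) -> str:
    """Build a direct-download link from a Google Drive file ID.

    Assumes the docs.google.com/uc?id=...&export=download pattern
    used in this post still applies.
    """
    query = urlencode({"id": file_id, "export": "download"})
    return f"https://docs.google.com/uc?{query}"


# The file ID is the one from the link in this post.
print(drive_download_url("0Byj0SwnDzFHWbzljU200a2pvQUU"))
```

The file ID is the long string that appears in the normal sharing link for the file; the file also needs to be shared so that anyone with the link can access it.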

Tear down this wall

That’s what got us through Phase 1. For Phase 2 we had simpler requirements: make the PDF open to the Internet, bring in organic search traffic, and give us some idea how many people were coming in at least (if not proper traffic stats). Doesn’t sound like much, right?

The trouble with sticking with Google Drive would have been that in order to bring in organic traffic we’d need to open it up to the Internet. In doing so, we were concerned that Google would drive traffic to their bullshit sucky Google Docs rendering instead of the raw PDF. That would make MDeC’s CEO look bad, and us even worse if our heatmaps were broken. So we ruled that out.

We weren’t left with a lot of options. Dropbox had already failed us once. Scribd was still bad. What to do?

Then we discovered PDFy. Its minimalism was refreshing. tl;dr it ticked all our boxes, except that we had to settle for a basic hit count, and we’re not even sure it counts direct PDF hits, which is precisely where Google sends people (and is what we wanted). Unfortunately it doesn’t look like PDFy links get much love right now in terms of Google search rankings. Hopefully that will change in time.

1 Response to Want to share a PDF on the Internet? Not as trivial as you would think.

  1. admin says:

    Hi, PDFy admin here 🙂

    Happy to hear that it helped you out. As for the statistics: currently the hitcount is indeed just with regards to the embedded viewer (which includes the viewer page on pdf.yt itself), but a better statistics mechanism is underway that will separate out on-site views, embedded views, and direct downloads. There’s no real ETA for when that will be available, but it’s definitely on the roadmap.

    Also, a tip: currently the on-site viewer only displays the filename, and not the internal document title as defined in the PDF metadata (which is what Google displays in the results you linked). If you want the on-site viewer to show up as well, the easiest way is to simply rename the PDF to include the document title. I eventually want to show the document title and other metadata on the on-site viewer page as well, but that requires writing some basic PDF parsing code that I simply haven’t gotten around to yet.

    – Sven
