How to get the data off eonline

by Magni Onsoien

This short article tells the story of how we scraped data from the eonline.com web page and were able to publish our own updated statistics during the polls. We did NOT hack the site. Our fathers did not hack the site. We did not have access to any data that wasn't available to everyone. We just gathered the public data with openly available tools and then presented it in a user-friendly way. As simple as that.

The Evak fans have been accused of cheating and hacking, and this article is our way to show how we obtained the data we used during the polling periods.

A little background

Isak and Even were up for the "TV couple of the year" contest on eonline.com, an entertainment site that seemed kind of big, but that none of us actually knew. But it was probably big, right?

Turned out it was. The day after Norway didn't get any prizes in the prestigious 50 km male cross country skiing contest ("5-mila") in the World Championship in Lahti, a friend commented that "If we win this, it's ok even if we didn't win 5-mila". She later confirmed that, yes, this was big. Ok.

The polls went on. Nicolaj Paaske Holm Hansen put up a results page on his site kosegruppa.dk, with numbers that he copied from eonline.com's web page. They didn't publish proper statistics during the poll, but after voting the vote-button would show the percentage for each participant in the poll, so they clearly wanted to provide results to the poll participants.

He inspected the web page source code - a simple task you can do by choosing "view source" or a similar menu item (depending on the browser) when you visit a page. What you get then is the HTML code of the web page as your browser received it. It may look cryptic, but it's actually just HyperText Markup Language, which is how all web pages are ultimately delivered. The web page itself may be written differently by the developer and may look completely different from that end. How all this works depends on what publishing system is used - from simple HTML to rather complex database-based systems. In the end they all come out in the same way: they go through a web server, which serves content to web browsers (clients) - and tada, you can read the web page in your browser. And view the source code your browser is parsing to give you the content.

So, in this content Nicolaj was able to identify both the current percentage and the current total number of votes cast. It was actually not that difficult, but the two numbers were in two different files: one in the desktop version and one in the mobile version.

Obtaining data

The percentage comes from the desktop version (line 2156):

	"choiceId":106265,
	"choiceText":"Isak and Even (Skam)", // escape special tag characters and quotes, etc.
	"choicePercentage":"55.026382"
      
while the total number of votes cast comes from the mobile version of the page (line 1406):

	<input class="total-votes" type="hidden" value="5626143">
      
So that's it. The percentage for Isak and Even, and the total number of votes - enough to do a lot of fun arithmetic, like finding the percentage of the competitor, dividing the votes between them, finding the difference in votes cast between updates by comparing numbers etc.
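As an illustration of that arithmetic, here is a minimal sketch. It is not one of the scripts we ran during the poll; the function name poll_arithmetic and the output layout are made up for this example, and it assumes a poll with exactly two candidates, so that everything not cast for Isak and Even went to the competitor.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original scripts): derive the other
# figures from the two scraped values, assuming a two-candidate poll.
# Usage: poll_arithmetic PERCENTAGE TOTAL_VOTES
poll_arithmetic() {
    awk -v pct="$1" -v total="$2" 'BEGIN {
        ours   = total * pct / 100   # estimated votes for Isak and Even
        theirs = total - ours        # the rest went to the competitor
        # Fields: competitor percentage; our votes; their votes; our lead
        printf "%.6f;%.0f;%.0f;%.0f\n", 100 - pct, ours, theirs, ours - theirs
    }'
}

# The sample values quoted above.
poll_arithmetic 55.026382 5626143
```

With the sample values this prints 44.973618;3095863;2530280;565583. The vote counts are estimates, since they are reconstructed from a rounded percentage.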

In the middle of the Final Four round (from February 27 to March 2) Magni came around and made a script that fetched these values automatically. It was still a matter of getting the two versions of the web page, finding the numbers and providing them to Nicolaj, who would use them on his web page.

Magni wrote a very simple shell script that fetched the data:

	#! /bin/bash

	wget -q -O desktop_source --no-cache http://www.eonline.com/news/833507/tv-s-top-couple-2017-vote-in-the-top-2-now
	grep -A2 choiceText\":\"Isak desktop_source | grep choicePercentage|cut -d: -f2 | cut -d\" -f2

	wget -q -O mobile_source --user-agent "Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19" --no-cache http://m.eonline.com/news/833507/tv-s-top-couple-2017-vote-in-the-top-2-now
	grep -A3 "Isak and Even (Skam)" mobile_source | grep total-votes | cut -d" " -f4| cut -d\" -f2

	echo
	LANG=EN_us date -u +"%B %d, %H:%M %Z"
	LANG=EN_us date -u -d "20 minutes" +"%B %d, %H:%M %Z"
      
This script wrote to stdout, and was combined with a crontab entry that copied the old data, then fetched the new data and appended the old data again. This provided Nicolaj with a file containing just the numbers he needed to present the results. He'd also get accumulated historic data to use for comparison.

This script was run every 20 minutes. For the first days it ran at 08, 28 and 48 minutes past every hour, later at 10, 30 and 50 minutes past.
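The copy, fetch and append steps can be sketched roughly like this. It is a reconstruction for illustration, not the crontab entry we actually used; the function name prepend_snapshot and the paths in the comments are hypothetical.

```shell
#!/bin/bash
# Hypothetical wrapper: run a command and put its output on top of the
# existing data file, so the newest snapshot always comes first.
# Usage: prepend_snapshot DATAFILE COMMAND [ARGS...]
prepend_snapshot() {
    local datafile=$1
    shift
    local tmp
    tmp=$(mktemp) || return 1
    # Fetch the new data first; leave the data file untouched on failure.
    if ! "$@" > "$tmp"; then
        rm -f "$tmp"
        return 1
    fi
    # Append the old data after the new snapshot.
    if [ -f "$datafile" ]; then
        cat "$datafile" >> "$tmp"
    fi
    cat "$tmp" > "$datafile"
    rm -f "$tmp"
}

# A crontab line for the 08/28/48 schedule could look like this
# (script path is an assumption):
# 8,28,48 * * * * /home/user/bin/run_fetch.sh
```

Writing to a temporary file first means a failed fetch never truncates the history, which matters when the job runs unattended.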

The complete datafile we gathered is available here: datafile.txt.
The input files used: desktop_version.final and mobile_version.final.
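The extraction pipelines from the fetch script can be tried offline, without contacting eonline.com at all. The sketch below wraps them in two functions that read from stdin (the function names are made up for this example) and feeds them the snippets quoted earlier. Note that the article only quotes the input line of the mobile page; the span line in the second sample is an invented stand-in for whatever markup preceded it.

```shell
#!/bin/bash
# Same grep/cut pipelines as in the fetch script, reading from stdin.
extract_percentage() {
    grep -A2 'choiceText":"Isak' | grep choicePercentage | cut -d: -f2 | cut -d\" -f2
}

extract_total() {
    grep -A3 "Isak and Even (Skam)" | grep total-votes | cut -d" " -f4 | cut -d\" -f2
}

# Desktop snippet as quoted in this article.
extract_percentage <<'EOF'
"choiceId":106265,
"choiceText":"Isak and Even (Skam)", // escape special tag characters and quotes, etc.
"choicePercentage":"55.026382"
EOF

# Mobile snippet: the <span> line is a hypothetical stand-in, the <input>
# line is the one quoted in this article.
extract_total <<'EOF'
<span>Isak and Even (Skam)</span>
<input class="total-votes" type="hidden" value="5626143">
EOF
```

This prints 55.026382 and then 5626143. The pipelines depend on the exact layout of the page: the total is picked as the fourth space-separated field, so any change in spacing or attribute order on the site would have broken the script.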

The data storage was not very elegant and probably not in a format either of us would have chosen if we had had time to actually choose something. But all this happened during an evening in the middle of the poll, and while we planned to rewrite it a bit for the final round we didn't want to do it during the poll.

In order to simplify other statistics a bit, Magni also provided another file with basically the same data. This data file was semicolon-separated and suitable for importing into Excel to make graphs or other presentations. This was used as the basis for the "giraffe graph". The code for making that file was:

	#! /bin/bash

	TIME_EPOCH=$(date +%s)
	TIME=$(LANG=EN_us date -u +%Y-%m-%dT%H:%M)

	wget -q -O desktop_source2 --no-cache http://www.eonline.com/news/833507/tv-s-top-couple-2017-vote-in-the-top-2-now
	PERCENTAGE=$(grep -A2 choiceText\":\"Isak desktop_source2 | grep choicePercentage|cut -d: -f2 | cut -d\" -f2)

	wget -q -O mobile_source2 --user-agent "Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19" --no-cache http://m.eonline.com/news/833507/tv-s-top-couple-2017-vote-in-the-top-2-now
	TOTAL=$(grep -A3 "Isak and Even (Skam)" mobile_source2 | grep total-votes | cut -d" " -f4| cut -d\" -f2)

	echo $TIME_EPOCH";"$TIME";"$PERCENTAGE";"$TOTAL > ~/web-docs/datafile2.tmp
	cp  ~/web-docs/datafile2.txt  ~/web-docs/datafile2.bak
	cat ~/web-docs/datafile2.tmp ~/web-docs/datafile2.bak > ~/web-docs/datafile2.txt
      
Also a very simple script... it was run every 5 minutes (5, 10, ..., 55).

The complete datafile we gathered is available here: datafile2.txt.
The input files were the same as above.
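A file in this format also makes it easy to see how many votes came in between two updates. The following is a minimal sketch, not something we ran during the poll: the function name votes_per_interval is made up, and it assumes the layout written by the script above (epoch;time;percentage;total, newest line first).

```shell
#!/bin/bash
# Hypothetical helper: for each pair of consecutive lines in a newest-first
# semicolon-separated data file, print the newer timestamp and the number of
# votes cast since the previous fetch.
# Usage: votes_per_interval < datafile2.txt
votes_per_interval() {
    awk -F';' '
        NR > 1 { print newer_time ";" newer_total - $4 }
        { newer_time = $2; newer_total = $4 }
    '
}

# Invented sample rows in the same format (not real poll data).
votes_per_interval <<'EOF'
1488400500;2017-03-01T20:35;55.10;5627000
1488400200;2017-03-01T20:30;55.03;5626143
1488399900;2017-03-01T20:25;55.00;5625000
EOF
```

For the invented rows this prints 2017-03-01T20:35;857 and 2017-03-01T20:30;1143.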

Then the semi-final was over on Thursday afternoon (or Friday at 2 am in our time zone), and we went to sleep and planned to do some weekend work - have a look at the data files, maybe preprocess the files a bit so they were easier to use for others. But we all know what happened on Friday, only 17 hours after finishing the previous marathon, and then the rest is history...

Presenting the data

We later used the data to present results on our web page. How we did that will be described later. It did, however, only use regular PHP code for simple arithmetic on the data provided by this gathering.

About the tools used

All the tools used were standard software available on a Debian server. They are all available under a GNU license.

The scripts and the contents of this summary can be shared under the Creative Commons Attribution License (CC-BY).

Author: Magni Onsoien