How AI can help you read the newspaper
An interview with Timo Kats
On November 11th, ACED presented its latest work from the Reverb Channel Programme at BNAIC/BENELEARN. Timo Kats, Peter van der Putten and Jasper Schelling created models that can distinguish commercial content from editorial content in an instant.
This matters because readers find it hard to tell the two apart. In a study by Bartosz Wojdynski and Nathaniel Evans, only 7% of the participants recognized advertorials as advertisements, whereas the algorithm developed in the Reverb Channel Programme distinguishes advertorials from editorials 90% of the time.
To investigate the impact of Artificial Intelligence on the media landscape, ACED has initiated the Reverb Channel programme, a trans-disciplinary programme in which artists, data scientists and media professionals carry out investigative experiments and share the results to spark debate within the field and among the general public. In collaboration with Dr. Peter van der Putten and Timo Kats of the LIACS institute of Leiden University, we investigated whether machines can help you distinguish editorial content from advertorial content. It turns out that they can. Timo Kats reports on his investigation.
Creating a dataset
For this project, the websites of four major Dutch news publications were scraped for advertorials and regular articles. Scraping is a method of gathering information where a web crawler collects specific data from the web. This information is stored in a central database, where it is later analysed.
Timo Kats elaborates: “This was quite hard to do. I wanted to gather information from different newspapers, because I didn’t want a biased model. The problem is that lots of Dutch newspapers don’t use advertorials. Some newspapers were technically very hard to scrape. So I ended up with Nu.nl, NRC, Telegraaf and the Ondernemer, which is a bit of an entrepreneurial medium where you have quite a lot of advertorials.”
Timo gathered 1,000 advertorials and 1,000 regular articles in total. Each item had to be labelled in the database as either an advertorial or an article.
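The gathering and labelling steps can be pictured with a minimal sketch. It is not Timo's actual crawler: the HTML snippet, the table layout, the source name and the helper names (ParagraphExtractor, extract_text) are all assumptions made for illustration, and a real scraper would first have to download each page.

```python
import sqlite3
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text found inside <p> tags."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self.paragraphs[-1] += data


def extract_text(html):
    """Return the paragraph text of a page as one string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(p.strip() for p in parser.paragraphs if p.strip())


# Hypothetical page; a real crawler would fetch this from a news site.
SAMPLE_HTML = (
    "<html><body><h1>Title</h1>"
    "<p>First paragraph.</p><p>Second one.</p>"
    "</body></html>"
)

# In-memory database standing in for the central database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE articles (source TEXT, body TEXT, label TEXT)")
db.execute(
    "INSERT INTO articles VALUES (?, ?, ?)",
    ("example-news-site", extract_text(SAMPLE_HTML), "advertorial"),
)
db.commit()
rows = db.execute("SELECT source, body, label FROM articles").fetchall()
```

Each row keeps the text together with its label, which is what a supervised model needs later on.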
The next step was to find an accurate machine learning algorithm that could perform the task of distinguishing between the two types of content. “There are lots of existing algorithms you can use. I just experimented with a lot of them and examined which ones had the most accurate results. I found some that worked, so I tweaked some parameters. I then just started ruling out the bad ones and I got one with really good results.”
An algorithm is the method used to get a task done or to solve a problem. A model is what results from running an algorithm on data: it takes a set of values as input and produces a set of values as output.
The training of the models went as follows: Timo took all 2,000 articles and ‘threw them on a pile’. He then used 90 percent of those articles to train the models. To check whether the models worked, he used the remaining 10 percent of the articles. On that 10 percent, the models reached 90 percent accuracy.
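The difference between an algorithm and a model can be made concrete with a toy example. In this sketch the function train plays the role of the algorithm and the function it returns is the model. The word-counting approach and the two training sentences are assumptions chosen for brevity; they are not the algorithms used in the project.

```python
from collections import Counter


def train(examples):
    """The algorithm: count how often each word occurs per label."""
    counts = {}
    for text, label in examples:
        counts.setdefault(label, Counter()).update(text.lower().split())

    def model(text):
        """The model: return the label whose training words match most."""
        words = text.lower().split()
        scores = {
            label: sum(word_counts[w] for w in words)
            for label, word_counts in counts.items()
        }
        return max(scores, key=scores.get)

    return model


# Made-up training examples, for illustration only.
examples = [
    ("discover our innovation and order today", "advertorial"),
    ("the council voted on the budget yesterday", "article"),
]
model = train(examples)
prediction = model("order our new innovation")
```

Here prediction ends up as "advertorial", because three of the four input words were seen in the advertorial example and none in the article example.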
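A rough sketch of this evaluation set-up, assuming a simple random shuffle (the article does not say how the split was made). The names split_dataset and accuracy are hypothetical, and the dataset below contains placeholders rather than real articles.

```python
import random


def split_dataset(items, train_fraction=0.9, seed=0):
    """Shuffle the items and split them into a training and a test part."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Placeholder dataset: 1,000 advertorials and 1,000 articles, as in the text.
dataset = [("advertorial", i) for i in range(1000)]
dataset += [("article", i) for i in range(1000)]

# 90 percent (1,800 items) for training, 10 percent (200 items) for testing.
train_set, test_set = split_dataset(dataset)

# Worked arithmetic: 180 correct out of 200 test items is 90 percent.
example_accuracy = accuracy(["a"] * 180 + ["b"] * 20, ["a"] * 200)
```

With 2,000 items, a 90 percent accuracy on the test part corresponds to roughly 180 of the 200 held-back articles being classified correctly.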
Timo: “Now you have a number, but you want to say something useful about it. It had to be useful for people who aren’t really technical. So we thought: what can we say about the use of language? What are the differences and the similarities? Apparently the words are hard to separate for readers.”
To make it easy for the audience to see the difference in word use, Timo created a lexicon. “It’s funny, because there are some commercial words that I didn’t expect, like innovation. That’s the most commercial word we have.” Timo hopes this lexicon can be useful for readers, but for journalists, too.
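One way such a lexicon could be built is to score every word by how much more often it appears in advertorials than in articles. The article does not describe Timo's exact method, so the smoothed log-ratio below is an assumption, as are the function name build_lexicon and the made-up snippets.

```python
import math
from collections import Counter


def build_lexicon(advertorials, articles):
    """Score each word: positive means more typical of advertorials."""
    ad_counts = Counter(w for text in advertorials for w in text.lower().split())
    ed_counts = Counter(w for text in articles for w in text.lower().split())
    ad_total = sum(ad_counts.values())
    ed_total = sum(ed_counts.values())
    vocabulary = set(ad_counts) | set(ed_counts)
    size = len(vocabulary)
    scores = {}
    for word in vocabulary:
        # Add-one smoothing keeps words seen in only one class finite.
        ad_rate = (ad_counts[word] + 1) / (ad_total + size)
        ed_rate = (ed_counts[word] + 1) / (ed_total + size)
        scores[word] = math.log(ad_rate / ed_rate)
    return scores


# Made-up snippets, for illustration only.
advertorials = ["innovation drives innovation", "order our innovation today"]
articles = ["the minister said the plan failed", "the court ruled today"]
lexicon = build_lexicon(advertorials, articles)
most_commercial = max(lexicon, key=lexicon.get)
```

In this toy data "innovation" gets the highest score, while a word such as "the", which only occurs in the articles, gets a negative one.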
The lexicon alone did not give Timo the insights he was looking for. So he made a web-based visualisation of word associations, in which he connected words that were frequently used in the same sentence. “It’s cool to see which words are often used together. Like artificial and intelligence are used a lot together.” These words have a strong connection in the web. With the web, it is easy to zoom out and see which commercial and editorial words are used in different kinds of contexts. “I have a technical background, but I am a human, too. It’s nice to see the story behind the mountain of numbers that produced my paper.”
Interested in further exploring Timo's work? The full project is available on GitHub.
For a full write-up of Timo's research, have a look at the paper that was published at the 33rd Benelux Conference on Artificial Intelligence and the 30th Belgian Dutch Conference on Machine Learning (BNAIC/BENELEARN 2021), Luxembourg, November 10-12, 2021.
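The idea of connecting words that share a sentence can be sketched as a count of word pairs. This is only an illustration under simple assumptions: sentences are split on punctuation, only the letters a to z are treated as word characters (Dutch text with accents would need a broader pattern), and the sample text and the name cooccurrences are invented here.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrences(text):
    """Count how often two distinct words appear in the same sentence."""
    pairs = Counter()
    for sentence in re.split(r"[.!?]+", text.lower()):
        # Sorting makes each pair appear in one fixed order.
        words = sorted(set(re.findall(r"[a-z]+", sentence)))
        pairs.update(combinations(words, 2))
    return pairs


sample = (
    "Artificial intelligence helps readers. "
    "Newsrooms test artificial intelligence. "
    "Readers trust newsrooms."
)
edges = cooccurrences(sample)
strongest_pair, weight = edges.most_common(1)[0]
```

Each counted pair can be drawn as a line between two words, with the count as its thickness. In the sample, "artificial" and "intelligence" share two sentences, so they form the strongest connection.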