Film Classification & the Semantic Web

The following post is a slightly modified transcript of a paper I presented at the 2017 Alphaville Conference, “Cinema is Dead

In July 2015 at the San Diego Comic-Con the horror director Eli Roth gave an interview to the L.A. Times as part of the promotion of his exploitation horror film, The Green Inferno.  In the interview Roth encouraged audiences to see his film, about a group of American university activists captured and tortured by a cannibal tribe in the Amazon, as a criticism of online social activism.  This reading of the film was quickly seized upon by the right wing publication Breitbart, with its then-contributor Milo Yiannopoulos penning an endorsement of the film and positioning it within what he saw as a growing anti-political correctness movement in popular culture.  The Green Inferno’s marketing department appeared to accept this endorsement: several days after the publication of the Breitbart article, the movie’s official Twitter account posted a mocked-up version of the film’s poster, which showed a severed hand holding a smartphone displaying several faux activist hashtags.

Social Justice Warriors – they’re what’s for dinner. #TheGreenInferno #StopSJW

— The Green Inferno (@TheGreenInferno) July 24, 2015

Although The Green Inferno is ostensibly a medium to low budget, trashy, grindhouse horror and an homage to the cycle of Italian cannibal horror movies from the late ‘70s and early ‘80s, Roth was clearly trying to position it as a piece of social commentary and publications like Breitbart were willing to read it as such.  The implication one might draw from this is that an audience exists for popular, ideologically driven cinema.  Perhaps The Green Inferno could even be categorized within a broader class of right wing popular cinema, alongside such aesthetically diverse films as the family comedy Ghostbusters, which has been read as an argument for small government and the virtues of the private sector, and a more obvious right wing text like Dirty Harry, where the individualistic detective has to battle bureaucracy and a bleeding heart liberal press in his quest to bring the serial killer, Scorpio, to justice.  But, outside of academia, how practical are such ideologically based classifications?  The idea of a movie streaming service having a Conservative Cannibal Horror Movies category seems unlikely.  That is, however, until one looks at some of the extensive microgenres on Netflix.

Everything is a Recommendation

According to a workshop presentation from the 2016 Association for Computing Machinery (ACM) Conference on Recommender Systems (RecSys), Netflix attributes 80% of the content watched by its users to recommendations.  Recommendations include many of the carousels and displays that appear on the user’s homepage such as the “Continue to Watch” feature, the “Top Picks” curated selection and the “Because You Watched” banner.

Another key feature of Netflix’s recommendation system is its oft-ridiculed selection of suggested “microgenres” such as –

  • Crime Auteur Cinema
  • Raunchy Sitcoms
  • Emotional Independent Dramas for Hopeless Romantics
  • Latin American Forbidden-Love Movies
  • Cerebral Military Movies based on Real Life
  • Cynical Comedies Featuring a Strong Female Lead

These recommendations are generated using algorithms, user behaviour data and also very detailed manual classifications.

The fact that Netflix catalogues its content manually is well publicised.  Every few years, when Netflix announces one of its remote content tagger roles, there is usually a slew of “Best Job in the World” puff pieces about how great it must be to get paid to sit around all day watching Netflix.  In a 2014 article published in The Atlantic, however, the tech writer Alexis Madrigal investigated how Netflix generated its very specific microgenres.  As part of his investigation he interviewed Todd Yellin, Netflix’s VP of Product Innovation.  Yellin described how manual, human classification drove both the direct recommendations and the creation of these microgenres.

One aspect of the tagger’s role is, unsurprisingly, tagging content with terms from a Netflix-created lexicon of subjects.  (Indeed, in a 2012 interview with the L.A. Times, Yellin took credit for introducing the term “squirm factor” into the Netflix controlled vocabulary of subjects, thereby linking such disparate works as the UK TV series The Office and the Todd Solondz black comedy Happiness.)  Taggers are also required to assess certain thematic or aesthetic elements on a numeric scale based on a set of Netflix-defined parameters.  Rather than classification being a facet of cataloguing, classification to many of these microgenres is automated once certain sets of combined features appear on a work.  For example, a happy ending is but one dimension of a work that may contribute to it being classified under “Feel Good Movies”.
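The automated step described here, where a work falls into a microgenre once a certain combination of tags and ratings is present, can be sketched in a few lines of Python.  This is only an illustration: the field names, the 1 to 5 scale, the thresholds, the second microgenre name and the values given to The Office are all assumptions made for the example, not Netflix’s actual schema or logic.

```python
from dataclasses import dataclass, field


@dataclass
class TaggedWork:
    """A work as a human tagger might describe it (hypothetical schema)."""
    title: str
    tags: set = field(default_factory=set)      # terms from a controlled vocabulary
    scores: dict = field(default_factory=dict)  # dimension -> rating on an assumed 1-5 scale


# Each microgenre is a rule over combined features. The rules are invented.
MICROGENRE_RULES = {
    "Feel Good Movies": lambda w: (
        "happy ending" in w.tags and w.scores.get("humour", 0) >= 3
    ),
    "Squirm-Inducing Comedies": lambda w: (
        "comedy" in w.tags and w.scores.get("squirm factor", 0) >= 4
    ),
}


def classify(work):
    """Return every microgenre whose rule the work satisfies, sorted by name."""
    return sorted(name for name, rule in MICROGENRE_RULES.items() if rule(work))


# Illustrative values only, not real Netflix tagging data.
office = TaggedWork(
    "The Office",
    tags={"comedy", "workplace"},
    scores={"squirm factor": 5, "humour": 4},
)
print(classify(office))
```

The point of the sketch is that nobody files the work under a microgenre directly.  The tagger records features, and membership follows from whichever rules those features happen to satisfy, so a single work can land in several microgenres at once.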

Using this novel multifaceted approach to cataloguing, Netflix can present obscure, esoteric collections of films and TV shows to its users.  Therefore the idea that a film could be rated ideologically is actually a lot less ridiculous than one might think.  Indeed, the “Featuring a Strong Female Lead” dimension of many of the microgenres could often be read as a synonym for feminist.

Although these niche connections may well influence how users classify or consider films on an individual, idiosyncratic level, Netflix’s recommendation system seems unlikely to become a driving force in genre or canon creation.  It eschews all but the broadest of existing genres in its microgenre syntax.  For instance, a search for what one might consider to be a fairly well known subgenre, Spaghetti Western, yields no results.  The wordiness and inelegance of Netflix’s microgenres betray the omission of a key element of genuine genre creation: the institutionalisation of a critically developed shorthand for labelling subcategories of films.

A more likely source of canon creation and genre development is the Linked Data model, the conceptual backbone of the Semantic Web.  In a 2009 paper co-authored by Tim Berners-Lee, Linked Data was referred to as “the set of best practices for publishing and connecting structured data on the web”.


Linked Data defines objects on the web according to their relationships to each other.  Using existing knowledge, such as established yet esoteric genres, it makes it possible to create connections between diverse films.

Traditionally in repositories of information, all of the metadata considered to be relevant to an object was contained within an exhaustive bibliographic record.  These records defined what dimensions of an object were worthy of cataloguing.

During the 20th century, the fundamental principle behind most shelf based classification schemes was the APUPA pattern.  Defined by the librarian and mathematician S.R. Ranganathan, APUPA is an acronym of Alien-Penumbral-Umbral-Penumbral-Alien.  It is an idealised model of how information resources should be classified upon a shelf or within a catalogue.  The Umbral region would be the subject area upon which a researcher was focused.  The most relevant documentation for that information seeker’s search should be found within that area.  Located on either side of the Umbral region are the Penumbral regions, containing related but less relevant resources.  As one moves away from the Umbral area the documents become gradually less relevant to the original area of research until one reaches the Alien regions.
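As a rough illustration of the APUPA pattern, the following Python sketch arranges a handful of subjects so that relevance peaks in the middle of the shelf and falls away toward both ends.  Ranganathan’s model does not assign numbers to documents; the relevance scores, the subjects and the alternating placement strategy are all assumptions made for the example.

```python
from collections import deque


def apupa_order(relevance):
    """Arrange items so relevance peaks in the middle and falls off toward both ends.

    `relevance` maps each item to a score for one searcher's focus (assumed input).
    """
    ranked = sorted(relevance, key=relevance.get, reverse=True)
    shelf = deque()
    for i, item in enumerate(ranked):
        # Place items alternately to the right and left of the growing centre,
        # so each new, less relevant item sits further from the Umbral region.
        if i % 2 == 0:
            shelf.append(item)
        else:
            shelf.appendleft(item)
    return list(shelf)


# Hypothetical scores for a searcher focused on cannibal horror.
scores = {
    "cannibal horror": 1.0,  # Umbral
    "zombie horror": 0.7,    # Penumbral
    "slasher": 0.6,          # Penumbral
    "family comedy": 0.1,    # Alien
    "sports": 0.05,          # Alien
}
print(apupa_order(scores))
# ['family comedy', 'zombie horror', 'cannibal horror', 'slasher', 'sports']
```

The arrangement only works for one focus at a time.  A different searcher, with a different Umbral region, would need the same shelf to be ordered differently.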


This model relied on being able to classify an object according to its defining characteristic.  For example, in the pre-internet video store it was common for new releases to be shelved together while older films were usually shelved according to genre or subject (e.g. sports) or, in the case of children’s films, target audience.  When a film was released its newness defined it, but once a film had been released for a certain period of time its genre became its pre-eminent feature in the store’s classification scheme.

However, determining a film’s defining subject or dimension is a fundamentally arbitrary task.  As the example of The Green Inferno shows, even a dimension as seemingly nebulous as ideology is a potential data point for useful and achievable classification despite not necessarily being relevant to horror enthusiasts.

Linked Data facilitates extensible classification.  Rather than relying on limiting and exhaustive records, a data model standard such as the Resource Description Framework (RDF) allows a resource to be described with a potentially infinite number of statements called triples.  A triple is the method within RDF for describing resources according to a subject-predicate-object statement.  The subject is the resource being described, the object is the value of the statement and the predicate is the defined relationship between subject and object.  By defining standards for publishing and linking data, it is possible to create applications and services that can make seemingly semantically based queries to retrieve information.  Working from a controlled vocabulary or ontology, it is possible to define the nature of the relationships between resources.


To take The Green Inferno as an example, the triple statement to describe Roth’s role as director would look something like the following –

The Green Inferno → director → Eli Roth

The Green Inferno is the subject, Eli Roth is the object and the predicate describes his role as director.
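The statement described here can also be written down as data.  The Python sketch below models a triple as a three-field record; the identifiers and the `ex:` prefix are made up for illustration, whereas real Linked Data would normally use URIs drawn from a published vocabulary.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One subject-predicate-object statement."""
    subject: str
    predicate: str
    object: str


# Hypothetical identifiers, standing in for real URIs.
directed_by = Triple(
    subject="ex:TheGreenInferno",
    predicate="ex:director",
    object="ex:EliRoth",
)

print(directed_by.subject, directed_by.predicate, directed_by.object)
```

Because each statement is independent, describing another aspect of the film means adding another triple rather than redesigning a record.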


Why is this useful? It facilitates semantic queries and semantic searches.  Collectively, triples form graphs of data.  Using a semantic query language such as SPARQL, it would be possible to retrieve and manipulate data from across the graph.  So, to take the example of the right wing canon, it would be possible to link the resource pages for The Green Inferno, Ghostbusters and Dirty Harry with three triple statements denoting the target audience of the films.  It would then be possible to write a query to retrieve all films in the database with an intended audience of Conservative.  The extensibility of the model means that films can have potentially infinite connections, turning them into palimpsests with the potential to appear simultaneously within a Conservative canon and the Cannibal Horror genre.
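The kind of query described here can be sketched without a real triple store.  The Python example below keeps a graph as a plain set of tuples and matches patterns against it, which is a much simplified stand-in for what a SPARQL engine does; the predicate names `intendedAudience` and `genre` are assumptions for the example, not terms from a published ontology.

```python
# A tiny in-memory graph: a set of (subject, predicate, object) tuples.
GRAPH = {
    ("The Green Inferno", "director", "Eli Roth"),
    ("The Green Inferno", "genre", "Cannibal Horror"),
    ("The Green Inferno", "intendedAudience", "Conservative"),
    ("Ghostbusters", "genre", "Family Comedy"),
    ("Ghostbusters", "intendedAudience", "Conservative"),
    ("Dirty Harry", "intendedAudience", "Conservative"),
}


def match(graph, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in graph
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def films_with(graph, predicate, obj):
    """Return the sorted subjects of every triple with this predicate and object."""
    return sorted(s for (s, _, _) in match(graph, predicate=predicate, obj=obj))


conservative = films_with(GRAPH, "intendedAudience", "Conservative")
cannibal = films_with(GRAPH, "genre", "Cannibal Horror")

print(conservative)  # ['Dirty Harry', 'Ghostbusters', 'The Green Inferno']
print(sorted(set(conservative) & set(cannibal)))  # ['The Green Inferno']
```

The second query shows the palimpsest effect: The Green Inferno is returned for both the audience and the genre, so it sits in two collections at once without either classification displacing the other.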


By combining semantic data like this with a semantic search service (based on natural language processing), it is also possible to leverage this data for natural language searches.  This is how the Google Knowledge Graph works.  A search for the best films of a particular year or films from a particular genre presents the user with an authoritative-looking set of entries from Google’s Knowledge Graph in a carousel at the top of the page.  These results are not simply scraped together from the web but are actually retrieved from Google’s own linked data graph.  Consequently, from a single search query one is presented with a seemingly definitive list of films from a particular class.



During the first decade of the 21st century the “Long Tail” theory dominated discussions surrounding the distribution of popular culture online.  In a 2004 article for Wired, Chris Anderson applied the frequency distribution model of the “Long Tail” to describe how the internet had drastically altered the economics of popular culture, shifting markets in this area from being reliant on “hits” to being driven by collections of “niches”.  Without the same limitations of space as the brick and mortar store, he posited that retailers and distributors benefitted from offering their customers a vast range of specialist, esoteric content rather than simply being reliant on tentpole releases.

The theory is now over a decade old and appears to hold less relevance to the video streaming market now that the wild west era of torrents and illegal streaming seems to be coming to a close.  Over the past few years Netflix, the dominant force in online streaming with over 86 million subscribers worldwide according to its shareholder letter for Q3 2016, has actually been reducing the size of its catalogue.  Its focus has shifted toward producing its own content and building up a catalogue of “exclusive” movies and TV shows, with Netflix CFO David Wells last year stating that the aim for Netflix is to eventually have a 50/50 balance between originally produced and licensed content.

Nevertheless, with linked data and recommendation systems, how films are organized and arranged is more diverse than ever.  This will inevitably drive how we evaluate cinema.  Films no longer need to be defined by a single subject or dimension and can practically exist within multiple canons, each with the potential to raise or lower their value or prestige amongst certain audiences.  It also means that classifications, such as genre, become less temporal and more permanent, though only to particular audiences.

In Film/Genre, Rick Altman demonstrates the temporality of genre with the example of the 1921 Benjamin Disraeli biopic, Disraeli.  Altman notes that though we would now classify this work as a biopic, in 1921 such a generic classification did not exist.  At various points following its release it was classified as a drama, a comedy and a romance.  Applying the principles of Saussurian linguistics, Altman sees genre as parole (language as it is used and understood by its users) rather than as langue (language as a fixed universal system).  Altman described the process of Disraeli coming to be seen as a biopic as follows –

Implicitly, every film – as well as every critical term – considered along with Disraeli may initiate a new commutation process, but only when the conclusions about that process are generally shared and consecrated by formulaic production and critical vocabulary will a new genre emerge (pp 176-177)

And even then, at some point in the future some alternative dimension of Disraeli and similar films may emerge in the critical and popular consciousness above its biographical nature.  Once this textual signifier is identified it could mean that the very idea of a biopic fades into insignificance, relevant only to academics and historians.  By storing all information relevant to a film the semantic web has the potential to ensure that such classifications are not temporary but permanent.

The Saussurian model does not just apply to genre.  Take, for instance, William Friedkin’s To Live and Die in LA.  Until 2011 the film would have been seen by many as little more than a footnote in Friedkin’s canon, significant only for its famous director and undermined by its overt and dated 80s aesthetic.  Yet, following the release of the Nicolas Winding Refn film Drive in 2011, which aped the 80s aesthetic of To Live and Die in LA and kicked off a whole cycle of 80s inspired cinema scored to synthwave soundtracks (It Follows, The Guest etc.), the “80sness” of To Live and Die in LA became its defining and greatest quality.  If associations drive how we evaluate films, then surely the linked data model, which encourages the creation of such associations, has the potential to prompt many evaluations and re-evaluations of films as new connections are made and discovered.

In his essay The Analytical Language of John Wilkins, the Argentine writer and librarian Jorge Luis Borges attempts to demonstrate the ultimate arbitrariness of all classification.  In it he cites a taxonomy of animals from a (fictitious) Chinese encyclopaedia entitled The Celestial Emporium of Benevolent Knowledge.  In the taxonomy, every animal was classified according to one of the following categories –

  • Belonging to the Emperor
  • Embalmed
  • Trained
  • Piglets
  • Sirens
  • Fabulous
  • Stray dogs
  • Included in this classification
  • Trembling like crazy
  • Innumerables
  • Drawn with a very fine camelhair brush
  • Et cetera
  • Just broke the vase
  • From a distance look like flies

This taxonomy has become a touchstone for postmodern theory and the notion of cultural relativism, inasmuch as it shows how classification schemes are socially constructed methods of imposing limitations of meaning upon knowledge.  However, the linked data model actually allows classification to become an expansion of meaning for film, not a limitation, potentially driving the creation of unconsidered canons, such as the Conservative Cannibal Horror Canon.
