NYCDH Week Reflection on IMDb as a Dataset for Digital Humanities Workshop

I attended the IMDb as a Dataset for Digital Humanities workshop on Wed. Feb. 6 (Nancy Foasberg was in attendance also).

The presenters were friendly and knowledgeable, and their combined expertise areas were Media Studies (Cindy Conaway) and Computer Science (Diane Shichtman).

An early valuable takeaway from the workshop for me was their definition of Digital Humanities, which was/is: “the intersection of humanities’ and social science methods with tools of computer science.”

Cindy and Diane then asked us if we knew Microsoft Access and SQL, and if any of us had ever used IMDb as a data source and/or downloaded data from it. They then listed the content found on IMDb as a “data source” of “media items” (their term) which are: movies, shows, video games, TV series, video short films, audio books (some).

Their biggest caveat in downloading data from IMDb is that you will get a “static subset” and as the IMDb content is constantly changing, this could be a problem in the long-term analysis of a large dataset project.

They then listed some projects using IMDb: The Oracle of Bacon, “Adult Films”, UCLA Race Film, Deb Vanderhoeven, Fan Use, Seinfeld (their project).

With that they launched into their demo, which Cindy introduced as “there are people who are more ‘Bacon-y’ than [the actor] Kevin Bacon now…” as in the well-known trivia “game” of “six degrees of separation from Kevin Bacon”.

Cindy reported that Seinfeld has 32,500 people associated with him via this six degrees “parlor game” ratio.
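
For anyone who hasn't seen how these "degrees" get counted, here is a minimal sketch of the idea, not the presenters' actual method: treat people as nodes, link any two people who share a credit, and run a breadth-first search outward from one person. The titles and names below are made-up placeholders, and `build_graph` and `degrees_from` are hypothetical helper names, not anything taken from IMDb or from the workshop.

```python
from collections import defaultdict, deque

# Hypothetical toy credits: title -> people credited on it.
# Placeholder names only; this is not real IMDb data.
CREDITS = {
    "Show A": ["Person 1", "Person 2"],
    "Film B": ["Person 2", "Person 3"],
    "Film C": ["Person 3", "Person 4"],
    "Short D": ["Person 5"],
}


def build_graph(credits):
    """Link every pair of people who share at least one title."""
    graph = defaultdict(set)
    for people in credits.values():
        for person in people:
            graph[person].update(p for p in people if p != person)
    return graph


def degrees_from(graph, start):
    """Breadth-first search: fewest shared-title hops from start to each reachable person."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in graph[current]:
            if neighbour not in distance:
                distance[neighbour] = distance[current] + 1
                queue.append(neighbour)
    return distance


graph = build_graph(CREDITS)
degrees = degrees_from(graph, "Person 1")
```

In this toy data "Person 4" ends up three hops from "Person 1", while "Person 5" never appears in the result because nothing connects them. Counting how many people fall within a given number of hops is presumably how a figure like the one Cindy quoted gets produced, though that is my assumption.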

I was wondering where she was headed with this information which struck me as rather superficial.

Their knowledge of IMDb as a dataset and its shortcomings (of which there are many), combined with their computer science and SQL skills, is excellent.

But I wasn’t “getting” the point of their “research” which I say in quotes because, as much as I love Seinfeld as a fan, I was unclear about the purpose of their work.

So I asked, “what is your objective in using IMDb as a dataset via Seinfeld?” Cindy’s response was that the purpose was something on the order of writing a book titled something like “Seinfeld by the Numbers” (a la the existing book “Seinfeldia”). Hmmm…

I was then quite stunned that they’ve gotten serious funding from the NEH for this project and have traveled to conferences such as DH 2018 presenting this project. Cindy has a kindly personality and she was excitedly reporting that “because [their project] is Digital Humanities, it can be anything…!”

This made me realize how valuable and disciplined the GC’s MA in DH and ITP Programs are because, as students in these programs, we must defend the DH-y-ness of our projects for department and project approval.

Having shaped and re-shaped my ITP Project toward DH-y-ness, and now doing the same for DH Praxis group project Lost Art Collective, I do believe Cindy and Diane would be hard pressed to get their Seinfeld project approved by GC DH and ITP professors.

To use Seinfeld itself as a model, for those familiar with the show, Jerry Seinfeld is known to have pitched the show to NBC as a “show about nothing” and even re-created this on the show as a plot line. So, forgive me, but I perceive this project is about “nothing” and yet, it is getting grant $ and conference credits for being DH-y.

Their closing rhetorical questions were helpful and important:

Is the data meaningful? Does it have everything — is it complete? Is it accurate — is it correct? Is the data that IS there meaningful? Does it represent what we think it represents? (validity) Is it consistent? (reliability)
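
Some of these questions can be partly turned into routine checks. The sketch below is my own illustration under stated assumptions, not something shown at the workshop: the records, the field names (`id`, `name`, `birth_year`), and the three helper functions are all hypothetical and do not reflect IMDb's actual schema.

```python
# Hypothetical records; field names and values are illustrative only,
# not IMDb's schema or data.
RECORDS = [
    {"id": "p1", "name": "Pat Example", "birth_year": 1950},
    {"id": "p2", "name": "Pat Example", "birth_year": None},
    {"id": "p3", "name": "Sam Sample", "birth_year": 2150},
]


def missing_rate(records, field):
    """Completeness: share of records where the field is empty."""
    missing = sum(1 for r in records if r.get(field) is None)
    return missing / len(records)


def out_of_range(records, field, low, high):
    """Accuracy (crudely): ids whose value falls outside a plausible range."""
    return [
        r["id"]
        for r in records
        if r.get(field) is not None and not (low <= r[field] <= high)
    ]


def shared_names(records):
    """Consistency: names that map to more than one id and need disambiguation."""
    ids_by_name = {}
    for r in records:
        ids_by_name.setdefault(r["name"], []).append(r["id"])
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}
```

Checks like these only cover completeness, gross accuracy, and consistency. The validity question (does the data represent what we think it represents?) still needs a human who knows the domain.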

But I couldn’t get over the hurdle that they approach a database such as IMDb — even though it technically falls under the domain of Media Studies and according to them also DH — with such an un-academic content pursuit, especially given its technical shortcomings (such as the shoddy solution of merely numbering same names as 1, 2, 3, etc., so that Paul Newman the actor is listed as Paul Newman 1) and IMDb’s not-so-nice tech management style (which they’ve had personal experience with).

I daresay Jerry, Elaine, George, Kramer and Newman would find this, forgive me, laughable…?

They didn’t offer the PowerPoint they projected, and have not emailed it to participants yet, but I’m quite sure they would provide it upon request.

Using IMDb as a Dataset for Digital Humanities

Cindy Conaway, an associate professor in Media Studies and Communication, and Diane Shichtman, an associate professor in Information Systems at SUNY Empire State College, will discuss using the Internet Movie Database (IMDb) and its advantages and challenges as a dataset for Digital Humanities. In many ways IMDb is an excellent source for Digital Humanities projects and gives media studies scholars a new way to use Digital Humanities. The organization makes it free to download a great deal of its very robust data. However, much of IMDb’s data is inconsistent, incomplete, and often wrong or misleading. The downloadable information is also limited to certain categories. This presentation will also discuss the challenges of interdisciplinary work, and how changes in IMDb’s process over several years, and differing views available to scholars, can also create issues, as we have found in our project tracing connections using the show Seinfeld.




  1. Posted February 11, 2019 at 10:18 am | Permalink

    Hmmm. I don’t completely disagree as far as this particular project goes, but … I disagree about a lot of this.
    One of the things that I’ve found about data in general and DH in particular is that data is messy. Data’s always messy, but a lot of the data that you find out in the world is especially messy because it isn’t gathered, maintained, or structured by professionals. When it comes to looking at data in the world, I don’t think that means we need to restrict ourselves only to “professional” or “academic” data, because sometimes that doesn’t capture what we need! One of the things that’s important about IMDb is that it’s building a detailed catalog of information that, so far as I’m aware, isn’t captured neatly elsewhere (though perhaps Wikidata will be able to do some of this work eventually). In fact, IMDb is so old and so big that I wouldn’t be surprised if its existence has discouraged others from trying to build a similar dataset. So, it’s really the most efficient way to gather this data. It’s imperfect, for a lot of the reasons you’ve mentioned here! In particular, you mention authority control (that is, getting a specific identity for a person that will save them from being confused with others who have the same name, while keeping together all their work even if it’s under variations of their name), which, to be honest, is a challenge even for real catalogers and metadata people (I’m not one). So, yes, that’s an issue! But it’s a case of best available data, and I think they did a nice job of acknowledging its weaknesses, which is the really important part.
    But I also want to caution about the other issue that you bring up, which is the perceived triviality of a popular culture-related project. While noting here that I’m someone who dislikes Seinfeld more than I like it (though I do enjoy many popular culture products that aren’t any more respectable), I think you are overlooking the value of popular culture. Popular culture (as Stuart Hall argues) is important because it’s the culture that people make use of in their everyday lives. It’s often a good indicator of what our culture values and how people live. In any case, I’ve been taught to be suspicious of the high-culture/popular-culture distinction. I guess my question is, if this project about Francis Bacon’s network is worth funding and publishing — why is a project about Seinfeld intrinsically less so?
    HOWEVER, I agree with you that they didn’t make as much of a case as I would have liked for why they find this project important or what they plan to learn from it. There are lots of interesting things they could do with this data: I liked the idea they had about showing how men’s careers and women’s careers in TV differ over time, though I was less excited about their idea about looking at race because work like that shouldn’t be based on looking at pictures of people and trying to guess their background. Their idea about looking at genre and seeing how people moved in genre circles was also interesting, although I don’t really agree with their conception of what genre is. So I think there ARE interesting questions that can be asked about this data, although I’m skeptical of some of the questions they asked.
    The other thing that was striking to me was their rather dismissive attitude toward fan studies. Here you are, in digital humanities, doing media studies work about popular culture, and you’re not going to respect the work of fan studies to establish itself the same way both those fields have had to, OR the effects that fan studies has had on media studies? Hmm.

    • Posted February 11, 2019 at 2:56 pm | Permalink

      I didn’t intend to telegraph a subtext of high/low culture, as I also value pop culture on many levels — academic, cultural, avocational — to name a few. My first MA is also in Media Studies and my thesis for that degree was an application of Adorno & Horkheimer’s seminal essay “The Culture Industry” to content trends in broadcast advertising for the year 2006, which I’m currently revising and updating for citation in my Independent Study. I’ve also been a voting member of NARAS (the Grammy awards organization).

      That said, my only concern with the Seinfeld project was that it wasn’t clear what they were gathering the data for. I recall from the workshop after I asked them what the objective was, you said, “I was wondering that also” which confirmed to me that I indeed hadn’t missed this.

      Pop culture data is incredibly rich and worthy of study. Hence my wondering what they’re using this data for and to what end, because it was not clear. I even recall Cindy saying that they’re so busy presenting they haven’t been able to decide on what it is they’re analyzing. So maybe they’re still in the collecting stage and their area of analysis will reveal itself to them as they go? I do think/agree that dismissing fan studies is a miss on their part, especially regarding such an iconic TV show turned cultural phenomenon (achieved via… fans!)

      All the power to them that they’re being funded so well, but I remain surprised that they’re receiving grant money and conference opportunities for research without an apparent analytical purpose — academic or otherwise — at least, not yet.

      • Posted February 12, 2019 at 10:00 am | Permalink

        Yes, they definitely need to make a better case for their project! I hope they do; I’m interested in seeing what kind of work this analysis can do.

        • Posted February 12, 2019 at 3:39 pm | Permalink

          Hear, hear to a better case and same, esp. as they’re quite skilled and congenial colleagues!
