Data Management Plan Reflection

We’ve been working on the data management plan. Rob has taken the lead on this part of the project and has drafted a really nice plan that includes a description of the data, a statement of our protocols for working with data, and our plans for archiving the content. This has brought up a lot of good questions that we need to think about as a group. These questions often go far beyond the technical aspects of collecting, storing, and conveying our data and get into how we’re thinking and talking about our project. For instance, we’re working through some questions of language. Since our project explicitly seeks to highlight the always-already constructed nature of data, we want to show how the data changes as we work with it, while still rejecting the common language of raw/dirty/unprocessed data vs. “cooked”/clean/processed data. So, how do we talk about the different stages we are presenting? What is the best way for us to refer to the datasets we’ve found or constructed without falling into the same language that gives rise to the perceptions we seek to challenge? These are questions we’re still considering, and it’s going to be important throughout the project to keep paying close attention to the language we use.

There are more practical considerations we are also working our way through. Some questions that have arisen:

  • What kinds of data are available for us to use in the ways we have in mind? Do we need to find public domain or openly licensed data? If we are making our own datasets, what kinds of configurations are allowable under fair use and which are infringing? What can we re-license under our own free licenses? What about the software itself? I know that the GNU General Public License (GPL) is closely related to Creative Commons licenses, about which I know a lot, but I don’t know much more about the GPL and definitely need to read up on that. We’re also consulting Jill Cirasella with some of our fair use questions. For my part, I want to work with some proprietary metadata, but I think that after the first stage of processing, I will probably be able to make a more “data-like” version of it available. Similarly, Natasha is working with scholarly articles, not necessarily openly licensed ones, but is using them to create a dataset that she can share.
  • How should we break up our “stages” of data? We are planning to make multiple copies of each stage of our datasets, following the LOCKSS principle (Lots of Copies Keep Stuff Safe), and we have a plan to use GitHub to keep track of each of these. But it turns out that, just as “data” is not a naturally occurring element, neither are the points at which we’ll break it from one stage to another. This is something that might vary from one dataset to another, and of course, it’s a difficult thing to plan before we actually start getting in there and doing that work.
  • How might our data practices look different from one dataset to the next? This is very relevant to our project as we’re working with very different types of data! Because my data is very text-based, the problems it poses are ones that are generally well understood, and I think following best practices will serve us well in that case. But we also want to work with sound and possibly image files. That is useful work too! How do we preserve the data that comes out of it?
  • How will this be archived? We don’t have a formal plan to continue beyond the end of the semester, but the data will still exist! We plan to use GitHub, and the website will continue to exist, but we’re also talking about how CUNY Academic Works supports the archiving of our data.

This is a really good project for us to work on early in our DH lives, because it’s explicitly about how data is selected, transformed, and presented, so the questions that any researcher would need to ask about data are far more deeply integrated into our thought processes than they might be if we were only interested in analyzing data and finding a result. The DMP encourages us to explicitly ask questions about data types, documentation, and platform that we need to answer in any case. Additionally, it puts us in a position to appreciate (and/or critique) the data practices of the researchers who came before us, because in some cases, we may well be using datasets that already exist.

As for me personally, I’m contributing to the data management aspects of this project by:

  • Writing a narrative that lays out in detail exactly how the dataset was created. If neither MLA nor EBSCO makes significant changes in the near future, a reader could use this to recreate the parts of my dataset that I can’t make public. If such changes ARE made, I hope to create enough documentation that it will be easy to tell how the data is different.
  • Brushing up on GitHub. I was introduced to it during the Graduate Center Digital Institute and got my feet wet there, but I’m not yet comfortable with the platform and need to work on that further. Rob has also introduced us to git-lfs (Git Large File Storage), so I need to walk through the tutorials on this.
  • Contacting Jill for conversations about fair use.
  • Reading up on the GPL.
  • Creating my dataset, making a copy on my hard drive and in the cloud, and uploading it to GitHub.
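
For that last item, here is a rough sketch of what "making copies" could look like in practice. It is only an illustration, not part of Rob's plan: the function names (`sha256_of`, `replicate`), the file name `stage1.csv`, and the folder names are all hypothetical, and the "cloud" folder simply stands in for a locally synced cloud directory. The idea is to copy one stage of a dataset to several places and confirm with a SHA-256 checksum that every copy matches the original.

```python
from __future__ import annotations

import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate(source: Path, destinations: list[Path]) -> dict[str, str]:
    """Copy `source` into each destination folder and verify every copy.

    Returns a mapping of copy path -> checksum. Raises ValueError if any
    copy does not match the original.
    """
    expected = sha256_of(source)
    checksums = {}
    for folder in destinations:
        folder.mkdir(parents=True, exist_ok=True)
        copy_path = folder / source.name
        shutil.copy2(source, copy_path)  # copy2 also keeps file timestamps
        actual = sha256_of(copy_path)
        if actual != expected:
            raise ValueError(f"Checksum mismatch for {copy_path}")
        checksums[str(copy_path)] = actual
    return checksums


if __name__ == "__main__":
    # Hypothetical demo in a temporary directory; real paths would differ.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        stage = root / "stage1.csv"
        stage.write_text("title,year\nExample,2016\n", encoding="utf-8")
        copies = replicate(stage, [root / "hard_drive", root / "cloud_sync"])
        for location, checksum in copies.items():
            print(location, checksum)
```

Recording the checksums next to the narrative documentation would also make it easier to tell later whether a copy has changed, though how we actually track that is still an open question for the group.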

I’ve already been working on some of these, of course (it’s already Monday!!), but I have a little more to do before class tomorrow.
