|Scan of February 15, 2012 Article on Charlotte Observer|
I came across this article while looking through the newspaper a couple weeks ago. I clipped it out and put it on my refrigerator for inspiration. You can read it online here. I find it inspiring because the article describes a project that is very similar to what I envision.
The 573 letters exchanged between Robert Browning and Elizabeth Barrett were previously only available to people who went to Wellesley College and viewed them in the library. However, in a collaboration between Baylor University and Wellesley, the letters have now been digitized and made public.
The whole world can now access the letters here: www.wellesley.edu/browning
The article goes on to explain:
"The website set up for readers to see the correspondence includes both the handwritten letters and transcriptions, as well as a zoom function for readers to try to decipher faded or illegible words. The body of letters will also be searchable by keywords."
I plan to explore this more in a deeper analysis in a separate blog post.
I began this week where I left off with my project. I couldn't rest until I figured out those menus across the top! I have trouble moving forward if things aren't just right. I used the WordPress "Support" page search to find a tutorial on making a static home page - also called front page - so the blog will have a more traditional web site feel. I found this page: Writing & Editing - Front Page.
I also finally figured out that, with this theme, the tabs across the top are not considered "menus," as I had thought. Each "page" I create becomes a tab across the top. I was able to create two tabs: one for publications and one for project updates. The publications tab now has tiered pages - a main page with the story of the magazine, with each issue uploaded as a link to a pdf. The updates tab will be where my blog posts are stored. I can use this to keep readers updated on progress. I converted the first post - a basic overview - into a static front page. This is important because it will prevent the overview from being replaced with newer posts.
I also figured out how to space the title, so I now have this:
|Screen Shot by Suzanne Sink of southernundergroundpress|
Also on the agenda is to find out if there is a way to have the pdf files be visible on the post without having to link out to them. I also need to add descriptions and tags to the posts for the pdf files, so people can be directed to the authors and content they are interested in.
This week I decided to investigate OCR (Optical Character Recognition) technology. This technology is designed to take a scanned pdf, or another image of text, and create a layer of machine-readable text behind the page image that records the words it recognizes. This allows the text to be searched for words, shows the reader where those words appear, and helps a search engine locate the different texts containing the search terms. This is a major part of the project's functionality as I envision it, and it is what makes the Browning database so helpful to scholars.
My goal here is not only to preserve and archive these cultural artifacts, but to make them useful to scholarship in literary, sociological, artistic, and historical fields of study. If the user is unable to search for his or her particular topic - the draft, for example - then the archive is just an interesting read. I need it to be a tool. OCR is at the heart of that function.
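To make the idea concrete, here is a minimal Python sketch of what keyword search over OCR output could look like once the text exists. It is only an illustration under stated assumptions: the file names and sentences are invented, the function names (`tokenize`, `build_index`, `search`) are hypothetical, and a real archive or WordPress plugin would likely work differently.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Return (lowercased word, character offset) pairs found in the text."""
    return [(m.group().lower(), m.start()) for m in re.finditer(r"[A-Za-z']+", text)]

def build_index(documents):
    """Map each word to the list of (document name, offset) places it appears."""
    index = defaultdict(list)
    for name, text in documents.items():
        for word, offset in tokenize(text):
            index[word].append((name, offset))
    return index

def search(index, documents, keyword, context=30):
    """Return (document name, snippet) pairs showing each hit with surrounding text."""
    hits = []
    for name, offset in index.get(keyword.lower(), []):
        text = documents[name]
        start = max(0, offset - context)
        end = min(len(text), offset + len(keyword) + context)
        hits.append((name, text[start:end]))
    return hits

# Hypothetical OCR output: these names and sentences are made up for illustration.
documents = {
    "issue_01.pdf": "Students gathered downtown to protest the draft on Saturday.",
    "issue_02.pdf": "The editors reviewed a new album and announced a poetry reading.",
}
index = build_index(documents)
results = search(index, documents, "draft")
for name, snippet in results:
    print(name, "->", snippet)
```

The index is what lets a search jump straight to the issues that mention a term, and the stored offsets are what let the reader see where in the text the word appears. Neither is possible while a scan is only an image.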
Being the novice (fearless novice, but novice nonetheless) that I am, I started this exploration on Wikipedia's page on OCR. As I tell my students, Wikipedia can be a good place to get some background information, but it doesn't replace further research. I needed a primer - something that would help me understand the vocabulary and terms surrounding these products.
OCR - technology that converts an image or scan of text into machine-encoded text, or computer-recognizable characters. This allows data to be searched by keywords and is required for text mining.
Text Mining - deriving information from texts with the ability to categorize and summarize that information. This is a function that may be useful in the future of this project when there are multiple publications to search and analyze for patterns.
OCR vs. ICR - An important concept for me was the difference between OCR and ICR (Intelligent Character Recognition). ICR is necessary for converting handwriting to machine-encoded text. Many of the publications are a combination of handwriting and typing, so ICR capability will be something to look for when I purchase software.
Error Rate - Accuracy varies a great deal among the different OCR software packages. I expect that accuracy depends on both the quality of the image to be converted and the sophistication of the software. Since I have older texts in various stages of readability, I will need a more advanced OCR system to get a low error rate.
Digital Libraries - I fell into a bit of a black hole on Wikipedia and found NC ECHO - North Carolina Exploring Cultural Heritage Online. This could potentially be an excellent resource for assistance in the process of gathering texts and publishing digitally. They provide some grants, but like NEH, grants are only awarded to institutions and not individuals. However, I could certainly look into partnering with a local library or working through ODU to obtain this kind of funding. This site is one I will need to revisit when funding becomes more of an issue.
Edit vs. Search - One potential area for further research is locating software that doesn't simply allow a scan to be edited but also makes it possible to search within it. For example, there are free versions of software that will convert a pdf to a Microsoft Word document. I have used this before to take an older worksheet and put it into a Word document. In my experience, this process resulted in a high number of errors and no option to search the text. It is something to consider when shopping for OCR software.
Search on my Computer vs. Search on the Blog - I found one resource that at first seemed promising. It can be purchased from a company called Lucion. I was very excited when I started watching the videos about how it works. However, the problem is that it converts pdf files to tif files, which is not a file type supported by WordPress. WordPress will support odt files, so I need to do more research on software that will convert pdf files to odt files.
Zotero - A professor of mine suggested I look around the resource Professor Hacker. There I found the suggestion that I look into using Zotero to store and organize my pdf files. I downloaded the software and plan on exploring that in the upcoming weeks.
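One way to put a number on the "Error Rate" entry above is a character error rate: the fewest single-character edits needed to turn the OCR output into a hand-checked transcription, divided by the length of that transcription. The sketch below is a generic illustration and not how any particular OCR product reports accuracy. The sample strings and the function names are made up for the example.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: fewest single-character insertions, deletions, or substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # drop a character from the reference
                current[j - 1] + 1,      # add a character from the hypothesis
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]

def character_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the hand-checked reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)

# Hypothetical example: a correct transcription and a flawed OCR reading of it.
reference = "underground press"
ocr_output = "undergr0und pres"
rate = character_error_rate(reference, ocr_output)
print(f"Character error rate: {rate:.1%}")
```

Running a few hand-transcribed sample pages through candidate OCR tools and comparing scores like this would be one way to judge which tool copes best with older, faded scans.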
Reflection of Learning:
This week was very productive for me. I learned a lot about what kinds of problems I will need to overcome to realize my vision. In particular, I need to find a way to make my scans searchable documents - easy enough to do if I want to have that function on just my home computer. This becomes problematic when thinking about making this a function of a website or blog. I am on the trail though, and after looking at the archive of the Browning letters, I am reassured that it is possible.
However, I can't help feeling like I am trying to reinvent the wheel. How do I connect with the people who have already digitized texts and made them searchable?
Ironically, I now have a follower on my blog. She is a librarian at Jacksonville State University where there is a very large collection of underground press publications on microfilm. There is also a tab where users can see a list of links to digitized publications. Here is what I saw:
I posted these scans earlier today. Later the same day, I found that I had been added to a college library research guide! It's so exciting!
But I see from her guide that several papers from the South already exist in digital form. Perhaps collaboration with many projects will be necessary in the future.