Bringing Mathematical Journals to the Web

The National Council of Teachers of Mathematics (NCTM) is a non-profit organization that promotes improvements in mathematics education for students of all ages. The organization publishes a variety of educational journals aimed at different educational levels, including:

  • Teaching Children Mathematics (TCM)
  • Mathematics Teaching in the Middle School (MTMS)
  • Mathematics Teacher (MT)
  • Journal for Research in Mathematics Education (JRME)

All of these journals are print journals. None of the content, going all the way back to 1920 when the organization was founded, was available on the web.

Logically enough, NCTM wanted to extend its efforts by making content available to members via the Internet. That's where I came in. NCTM didn't have the in-house expertise to accomplish a complex project like this, so they commissioned me to lead the effort. This case study describes the progress of this project, which ultimately proved to be one of the most successful projects of my career.


Part 2: Sprinting to the Conference

When I was brought into the project, there were no written requirements. There was only the general statement that NCTM needed to make the content available online to members, and that the content had to be searchable.

It also turned out that there was a tight deadline. NCTM wanted to demonstrate the system at their annual conference, which was six weeks away when I was brought in to get the project started.

Unlike other customers that I've dealt with, the management at NCTM already understood that the overall scope of the project greatly exceeded the six weeks that were available until the conference. They explained to me that they'd been taking some fire from their membership for not having searchable journal content available on their web site. But they really wanted to be able to demonstrate something to their members.

After a quick survey of the situation, I was able to confirm that the project vastly exceeded six weeks. Given the six weeks until the conference, I proposed a phased approach to the project, with Phase I consisting of a sprint to get something ready to be shown at the conference. The demonstration would consist of a basic search capability over a small repository containing a couple years of content. Phase II would then build additional features on top of the initial work, with the provision that much of the initial work might need to be scrapped and rewritten due to the time pressures.

The customer was delighted. Given the time constraint, they were ecstatic that I thought I'd be able to deliver something that they'd be able to demonstrate successfully.


Part 3: Shaping the Project

To deliver any features at all in the desired timeframe, I had to deal with four basic characteristics of the project:

  • Management: How was the project going to be managed?

  • Technology: What technologies were going to be used for the project?

  • Requirements: What exactly was going to be implemented?

  • Content: Where was the content coming from?

Six weeks was tight. This project couldn't be delivered in the timeframe if there was going to be a committee overseeing and trying to manage every decision concerning the project. I needed one contact person on their side, and that person needed the power to make decisions. NCTM appointed their webmaster to work with me. Beginning in Phase I, and extending throughout the course of the project, the webmaster and I worked closely as a team in planning the features for the system.

I needed to quickly make technology decisions concerning the project. NCTM was open to any technology that would reasonably accomplish their objectives. They already had an educational discount on the Inktomi search engine, but I was free to suggest alternatives if necessary. A quick investigation of the Inktomi search engine revealed that it was a worthy and enterprise-caliber search engine.

Within a week of being at NCTM, it was clear to me that they were a Microsoft shop. When choosing technologies for a project, you can't just choose any technology in the cosmos; you have to choose technologies that the customer is capable of supporting (or willing to add support for in the future). The database of choice was clearly Microsoft SQL Server. For a web development language, the choices were ASP and ASP.NET, but ASP.NET was relatively new at the time and something of an unknown quantity. I chose ASP for the project.

For Phase I, the requirements were easy. The webmaster had given me a good grounding in what NCTM wanted to achieve. But the clock was ticking and only five weeks were left. For Phase I, the requirements were going to be whatever I wanted them to be. NCTM was going to have to trust me.

The next area of concern to me was the content. In every significant web publishing initiative that I have ever done, the customer has always vastly underestimated the effort that would be needed to get content ready to be published on the web. And here, NCTM was no exception.

This was why I had a sit-down meeting with the relevant parties at NCTM extremely early in Phase I to explain to them what they needed to do to get me the content that I was going to need, both for Phase I and continuing through Phase II and beyond.


Part 4: Content Preparation

There are a bunch of technologies that will let you index the content of a web site and search it. But the requirements for any enterprise-caliber web publishing initiative typically go well beyond that. Ultimately, NCTM wanted to be able to search documents in a variety of ways.

First, and most obviously, they wanted to be able to search the text of a document. Equally important, and significantly harder to implement, they wanted to be able to search information related to the document, which is typically referred to as metadata. The metadata would be stored in the database and would consist of information such as:

  • Author Names
  • Title
  • Journal
  • Issue Number
  • Page Number (the starting page of the article within an issue)
  • Classification Tags
  • Abstract

To help NCTM understand the big picture, I began referring to a "virtual document," which consisted of the PDF version of the document plus the metadata in the database. The basic search that I was going to implement for the conference would only be able to search the text of the PDF files. In Phase II, I would implement more advanced search features that would also be able to search the metadata, allowing users to do things like search for all articles tagged as being about "Trigonometry."

To get content ready to be published on the web, NCTM was going to have to get the PDFs ready. They were also going to have to compile the metadata. Neither task was trivial.

To be searchable, the PDFs needed to be real documents that contained text. That may seem obvious, but it's not. A common mechanism for making content available as a PDF is to scan it in. What most people don't realize is that the scan is an image of the original document. And images aren't searchable.

NCTM had the source files for approximately 5 years of content. Many of the files had actually been put together by external companies. In many cases, the files contained all of the content and images, but didn't look like the originally published article because NCTM didn't have the fonts that the external company had used. NCTM was going to have to do some work to get each article into publishable format.

For articles older than 5 years, it was going to be even more problematic. Some of them might need to be either retyped or scanned in as text. There are always quality issues with text scanning, so there'd also have to be a concerted effort to proofread and correct scanned-in articles. Plus they'd have to do some work to integrate images back into the articles.

Metadata was also an issue. NCTM was going to have to compile the metadata manually. People were going to have to read articles, recording information such as the title, authors, journal, issue number, etc. They were going to have to define classification tags for each article and create an abstract for each article. All of this was going to take time.

To put it bluntly, NCTM had a lot of work to do. The sooner they got started, the better. To facilitate content preparation, we defined a naming convention for PDF files. We also defined a format for an Excel spreadsheet that would contain metadata, including information so that we could identify the PDF file that went with each metadata entry.
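The actual naming convention and spreadsheet layout aren't recorded in this case study, so the following is only a sketch of the idea: a file name that can be derived mechanically from a metadata row, so that each row identifies exactly one PDF. The file name pattern, the column names (including the year column) and the sample values are all hypothetical.

```python
from dataclasses import dataclass

# Journal abbreviations taken from the list at the top of this case study.
JOURNAL_CODES = {"TCM", "MTMS", "MT", "JRME"}


@dataclass(frozen=True)
class ArticleRow:
    """One spreadsheet row of metadata (hypothetical column layout)."""
    journal: str      # journal abbreviation, e.g. "MT"
    year: int         # publication year (an assumed column)
    issue: int        # issue number
    start_page: int   # first page of the article within the issue
    title: str


def pdf_filename(row: ArticleRow) -> str:
    """Derive the PDF file name that belongs to a metadata row.

    Assumed pattern: <JOURNAL>_<YEAR>_<ISSUE>_<START PAGE>.pdf
    Zero-padding keeps the files sorted in reading order.
    """
    if row.journal not in JOURNAL_CODES:
        raise ValueError(f"unknown journal code: {row.journal!r}")
    return f"{row.journal}_{row.year}_{row.issue:02d}_{row.start_page:04d}.pdf"


row = ArticleRow("MT", 2001, 3, 215, "A hypothetical article title")
print(pdf_filename(row))  # MT_2001_03_0215.pdf
```

Whatever the real pattern was, the useful property is the same: the people preparing content and the code loading it can both compute the file name from the metadata without a lookup table.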

NCTM's mission was to have two years of content ready in about four weeks, so that I could spend the last week of Phase I tuning the search capability I would have implemented by then.


Part 5: Onwards and Upwards

The Magazine Group (TMG), a design company that NCTM regularly worked with, created a rough look-and-feel for what NCTM had begun referring to as the "E-Resources" initiative. I incorporated that look-and-feel into a framework of the web site.

Working with the Inktomi search engine was an interesting experience. It turned out that the search engine was a fully functional web server implemented in Python and running on different ports than the rest of the web site. Everything that the search engine did was in Unicode, and the NCTM content was not Unicode (so lots of text conversions needed to be done during processing).
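To give a flavor of the kind of text conversion involved, here is a minimal Python sketch. It assumes the legacy content was in Windows-1252, which is only a guess for a Microsoft shop; the case study doesn't say which encoding the content actually used, and the function names are invented for illustration.

```python
def to_unicode(raw: bytes, source_encoding: str = "cp1252") -> str:
    """Decode legacy single-byte text into a Unicode string.

    The default encoding is an assumption. Bytes that have no meaning in
    the source encoding become U+FFFD instead of raising an error.
    """
    return raw.decode(source_encoding, errors="replace")


def to_legacy(text: str, target_encoding: str = "cp1252") -> bytes:
    """Encode Unicode text back for a non-Unicode consumer.

    Characters the target encoding cannot represent become "?".
    """
    return text.encode(target_encoding, errors="replace")


# 0x92 is a curly apostrophe in Windows-1252.
legacy_title = b"Euclid\x92s Algorithm"
print(to_unicode(legacy_title))
```

Doing the conversion at one well-defined boundary (on the way into and out of the search engine) keeps the rest of the processing from having to care which representation it is holding.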

To implement a search capability, I had to take Inktomi's search capability, as embodied in a set of HTML templates with embedded Python, and customize it to meet NCTM's needs. I also had to alter their templates to incorporate the new NCTM look-and-feel. After a good bit of trial and error, I was able to get the search functionality working and integrated seamlessly into the E-Resources web site.

On another front, I got PDF versions of the articles from NCTM, along with the metadata in a series of Excel spreadsheets. In parallel with the development of the search feature, I created an organized repository of the PDF documents. I loaded the Excel spreadsheets into a temporary table in the database, and then created some SQL to massage the data and move it into the destination tables.
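The staging-table approach can be sketched as follows. This is an illustration rather than the original code: it uses Python's built-in sqlite3 module as a stand-in for SQL Server, and the table names, column names and sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_articles (      -- raw spreadsheet rows, all text
        journal TEXT, issue_number TEXT, start_page TEXT,
        title TEXT, pdf_file TEXT
    );
    CREATE TABLE issues (
        issue_id     INTEGER PRIMARY KEY,
        journal      TEXT    NOT NULL,
        issue_number INTEGER NOT NULL,
        UNIQUE (journal, issue_number)
    );
    CREATE TABLE articles (
        article_id INTEGER PRIMARY KEY,
        issue_id   INTEGER NOT NULL REFERENCES issues(issue_id),
        start_page INTEGER NOT NULL,
        title      TEXT    NOT NULL,
        pdf_file   TEXT    NOT NULL UNIQUE
    );
""")

# Rows as they might arrive from a spreadsheet: stray spaces, numbers as text.
rows = [
    (" MT ", "3", "215", "First sample title ", "MT_2001_03_0215.pdf"),
    ("MT", "3", "230", "Second sample title", "MT_2001_03_0230.pdf"),
    ("TCM", "1", "12", "Third sample title", "TCM_2001_01_0012.pdf"),
]
conn.executemany("INSERT INTO staging_articles VALUES (?, ?, ?, ?, ?)", rows)

# Step 1: one issue row per distinct (journal, issue number) pair.
conn.execute("""
    INSERT INTO issues (journal, issue_number)
    SELECT DISTINCT TRIM(journal), CAST(issue_number AS INTEGER)
    FROM staging_articles
""")

# Step 2: cleaned-up article rows, each linked to its issue.
conn.execute("""
    INSERT INTO articles (issue_id, start_page, title, pdf_file)
    SELECT i.issue_id, CAST(s.start_page AS INTEGER),
           TRIM(s.title), TRIM(s.pdf_file)
    FROM staging_articles AS s
    JOIN issues AS i
      ON i.journal = TRIM(s.journal)
     AND i.issue_number = CAST(s.issue_number AS INTEGER)
""")
conn.commit()

issue_count = conn.execute("SELECT COUNT(*) FROM issues").fetchone()[0]
article_count = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
print(issue_count, article_count)  # 2 3
```

Loading everything as text first and cleaning it with SQL means a bad spreadsheet cell never blocks the import; it just shows up as a row to fix in the staging table.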

By the end of Phase I, I had successfully created a web site with a decent look-and-feel, a working search capability that could search through a repository of PDF documents and the rough capability to view the full content of an issue, as well as individual articles. NCTM was able to demonstrate E-Resources to great acclaim at their annual conference.


Part 6: Implementation of Phase II

In Phase II, I had to turn the capabilities that had been demonstrated in Phase I into a full web site that included all of the tools that NCTM would need to maintain the web publishing capability on an ongoing basis. As it turned out, I had done a good job with my design for Phase I. None of the code that had initially been created needed to be scrapped or drastically rewritten. Instead, I was able to build directly on what I had already done.

For this development phase, NCTM needed the following features:

  • New Look-and-Feel
  • Journal Home Pages
  • Admin Area
    • Journal Maintenance
    • Issue Maintenance
    • Article Metadata Maintenance
  • Security
  • Content Navigation Schemes
    • Metadata Search
    • Journal/Issue Drill-down
  • E-Commerce

New Look-And-Feel

The new look-and-feel turned out to be a multi-level design created by The Magazine Group (TMG). It was a quality design, but required some work to integrate into the web site. As mandated by NCTM, each journal delivered through the E-Resources web site was to have its own "home page", with a distinctive look for that journal. Ultimately, there were four "levels" of the look-and-feel, with four versions of each of the two journal-specific levels.

Thus, there were 10 different looks for various parts of the web site (the two shared levels, plus four versions of each of the two journal-specific levels), all compatible but subtly different. In practice, it worked like a dream, but the implementation was a good bit of work. I was pleased, though, because I thought TMG had done a good job, and the design did an excellent job of showcasing all of the features that I was implementing.

Admin Area

There was enough work on the project that I brought in a second developer to help me. For Phase II, I concentrated on the overall architecture and did the heavy lifting when it came to programming. My new developer concentrated on the grunt work that needed to be done, notably the entire Admin Area of the web site.

NCTM needed an extensible system for maintaining information associated with journals, issues and articles. We created maintenance web pages for all of this information.

Security

We also implemented a login mechanism, complete with a permission-based scheme for regulating access to different features. This was mostly to support administrative users. We also created a subscription feature, with access to journal articles controlled based on whether a user had an active subscription to a particular journal.

Access to articles was an important consideration. I've seen web publishing systems where articles were freely accessible if you happened to know where they were located and what the article naming conventions were. NCTM wanted the documents to be totally inaccessible unless you had a valid subscription.

I stored the PDF document repository outside the web-accessible directory tree of the web site. To facilitate access to individual articles, I created an ASP page that delivered the PDF document to the requester. That ASP page made sure the user was logged in and verified that the user's subscription allowed access to the article. Assuming access was allowed, the web page informed the requester's browser that it was sending back a PDF file. The code then read the file in, divided it into chunks and sent the PDF file to the requester's web browser. No article could be accessed without going through this ASP page.

It worked perfectly, and successfully handled even the largest PDF files (which were about 15 MB).
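The original page was written in ASP and its code isn't reproduced here. As an illustration only, the same gatekeeper logic might look like this in Python. The function names, the chunk size and the way subscriptions are represented are all assumptions made for the sketch.

```python
import os
import tempfile
from typing import Iterator, Optional, Set

CHUNK_SIZE = 64 * 1024  # bytes per chunk; an arbitrary choice for this sketch

# Tells the browser that a PDF is coming back.
RESPONSE_HEADERS = {"Content-Type": "application/pdf"}


class AccessDenied(Exception):
    """Raised when the requester may not see the article."""


def _read_chunks(path: str) -> Iterator[bytes]:
    """Yield the file piece by piece so it is never fully held in memory."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(CHUNK_SIZE)
            if not chunk:
                break
            yield chunk


def open_article(
    repository_dir: str,
    pdf_name: str,
    journal: str,
    user_subscriptions: Optional[Set[str]],
) -> Iterator[bytes]:
    """Check login and subscription, then return the PDF as chunks.

    user_subscriptions is None for an anonymous visitor; otherwise it is
    the set of journal codes the logged-in user subscribes to.
    """
    if user_subscriptions is None:
        raise AccessDenied("login required")
    if journal not in user_subscriptions:
        raise AccessDenied("no active subscription for " + journal)
    # Accept bare file names only, so "../x.pdf" cannot escape the repository.
    if (os.path.basename(pdf_name) != pdf_name
            or "\\" in pdf_name
            or pdf_name in ("", ".", "..")):
        raise AccessDenied("invalid article name")
    return _read_chunks(os.path.join(repository_dir, pdf_name))


# Demo against a throwaway directory that stands in for the repository.
with tempfile.TemporaryDirectory() as repo:
    with open(os.path.join(repo, "sample.pdf"), "wb") as out:
        out.write(b"%PDF-1.4 placeholder bytes, not a real PDF")
    body = b"".join(open_article(repo, "sample.pdf", "MT", {"MT"}))
    print(RESPONSE_HEADERS["Content-Type"], len(body))
```

The checks run before any file is opened, and because the repository lives outside the web root, this function is the only route to the bytes.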

Content Navigation Schemes

A navigation scheme is a mechanism for making web site content accessible to users. The most common navigation schemes for web sites are menus (which the E-Resources site already had). Beyond that, NCTM wanted users to be able to access documents in two distinct ways: 1) searching, and 2) drill-down.

I had implemented a basic search capability in Phase I. In Phase II, I was able to modify Inktomi's indexing capability to index fields of data from the database as part of the "virtual document" that it would search. I then created an Advanced Search screen that would allow users to modify their search results by searching these additionally indexed fields. Thus, users could search for articles by a particular author or that corresponded to a particular tag, such as "Geometry."
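To make the "virtual document" idea concrete, here is a toy in-memory version of a fielded search. It is not Inktomi's API or indexing mechanism, just a sketch of the concept: the class, the field names and the sample documents are hypothetical, and the matching is naive substring and exact-value comparison.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VirtualDocument:
    """PDF text plus the database metadata, treated as one searchable unit."""
    pdf_file: str
    body_text: str
    metadata: Dict[str, List[str]] = field(default_factory=dict)


def matches(doc: VirtualDocument, text_query: str = "", **field_filters: str) -> bool:
    """True if every query word occurs in the body and every filter matches."""
    body = doc.body_text.lower()
    if not all(word in body for word in text_query.lower().split()):
        return False
    for name, wanted in field_filters.items():
        values = [value.lower() for value in doc.metadata.get(name, [])]
        if wanted.lower() not in values:
            return False
    return True


def search(docs: List[VirtualDocument], text_query: str = "", **field_filters: str) -> List[str]:
    """Return the PDF names of all matching virtual documents."""
    return [d.pdf_file for d in docs if matches(d, text_query, **field_filters)]


docs = [
    VirtualDocument(
        "a.pdf", "Students measure angles in triangles.",
        {"author": ["A. Author"], "tag": ["Geometry"]},
    ),
    VirtualDocument(
        "b.pdf", "Sine and cosine describe angles on the unit circle.",
        {"author": ["B. Writer"], "tag": ["Trigonometry"]},
    ),
]

print(search(docs, "angles"))                  # ['a.pdf', 'b.pdf']
print(search(docs, "angles", tag="geometry"))  # ['a.pdf']
print(search(docs, author="B. Writer"))        # ['b.pdf']
```

A basic search only has the first argument to work with; the advanced search adds the keyword filters, which is why the metadata had to be indexed alongside the PDF text.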

Further, NCTM wanted users to be able to view a list of all the issues in a journal, and then drill down and view the list of articles in that issue. Even better, users could view an article by clicking on the title of an article. This information was easily supported by the metadata that we already needed to store in the database to facilitate the search feature.

This allowed users two distinct ways to navigate to the content that they wanted to view.
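The drill-down amounts to two simple queries over the metadata. The sketch below uses Python's sqlite3 module as a stand-in for SQL Server; the schema, function names and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE issues (
        issue_id INTEGER PRIMARY KEY, journal TEXT, issue_number INTEGER
    );
    CREATE TABLE articles (
        article_id INTEGER PRIMARY KEY, issue_id INTEGER,
        start_page INTEGER, title TEXT
    );
    INSERT INTO issues VALUES (1, 'MT', 1), (2, 'MT', 2), (3, 'TCM', 1);
    INSERT INTO articles VALUES
        (1, 1, 30, 'Sample title B'),
        (2, 1, 12, 'Sample title A'),
        (3, 2,  5, 'Sample title C');
""")


def list_issues(db: sqlite3.Connection, journal: str) -> list:
    """First level: every issue of one journal, in issue order."""
    return db.execute(
        "SELECT issue_id, issue_number FROM issues "
        "WHERE journal = ? ORDER BY issue_number",
        (journal,),
    ).fetchall()


def list_articles(db: sqlite3.Connection, issue_id: int) -> list:
    """Second level: the articles of one issue, in page order."""
    return db.execute(
        "SELECT article_id, start_page, title FROM articles "
        "WHERE issue_id = ? ORDER BY start_page",
        (issue_id,),
    ).fetchall()


print(list_issues(conn, "MT"))   # [(1, 1), (2, 2)]
print(list_articles(conn, 1))    # [(2, 12, 'Sample title A'), (1, 30, 'Sample title B')]
```

Ordering articles by starting page reproduces the printed table of contents without storing a separate sequence number.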

E-Commerce

The E-Resources site was targeted for members of NCTM. But what if a person didn't have a membership? Or what if a person just wanted to buy an individual article?

I worked with NCTM to create an e-commerce capability that would allow users to purchase a membership in NCTM (which granted them access to one journal of their choice), purchase additional journals or purchase individual articles. After buying an individual article, the user would continue to have access to that article online for a year.
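The resulting access rule combines subscriptions with time-limited article purchases. A minimal sketch of that rule follows; the function name and data shapes are assumptions, and "a year" is approximated as 365 days.

```python
from datetime import date, timedelta
from typing import Dict, Set

ARTICLE_ACCESS_DAYS = 365  # "a year", approximated for this sketch


def can_view(
    article_id: int,
    journal: str,
    subscriptions: Set[str],
    purchases: Dict[int, date],
    today: date,
) -> bool:
    """Decide whether a logged-in user may open an article.

    subscriptions: journal codes the user currently subscribes to.
    purchases: article id -> date the user bought that single article.
    """
    if journal in subscriptions:
        return True
    bought_on = purchases.get(article_id)
    if bought_on is None:
        return False
    return bought_on <= today < bought_on + timedelta(days=ARTICLE_ACCESS_DAYS)


subs = {"MT"}                       # membership with MT as the chosen journal
bought = {42: date(2003, 1, 15)}    # one TCM article purchased individually

print(can_view(7, "MT", subs, bought, date(2003, 6, 1)))    # True: subscribed
print(can_view(42, "TCM", subs, bought, date(2003, 6, 1)))  # True: purchased
print(can_view(42, "TCM", subs, bought, date(2004, 6, 1)))  # False: expired
print(can_view(43, "TCM", subs, bought, date(2003, 6, 1)))  # False: never bought
```

Passing `today` in explicitly, rather than reading the clock inside the function, makes the expiry rule easy to test.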


Part 7: Conclusion

So, how did the project work out overall? Well, the customer was extremely happy with the project. They were able to gain credibility with their membership by demonstrating an early edition of the E-Resources site at their annual conference, which was something they hadn't really thought was feasible.

They also got a fully customized content management system with features that allowed them to successfully maintain the site going forward. Even better, the site included an e-commerce feature that made the whole initiative a self-sustaining and money-making endeavor.

The site was a success for the audience of mathematics educators, as well. Previously, a membership in NCTM got you four quarterly journal issues. With E-Resources, your membership got you not only the four print issues of your selected journal, but also access to many years' worth of content from that journal. And if there was really something that met your needs in another journal, you could easily buy an individual article.

For the audience, E-Resources was a considerable value-add over their regular membership subscription, both in terms of amount of content and the immediate accessibility of the content.

After-Thoughts

Would I do anything differently if I were starting the project today? Well, today I'd be doing the development in ASP.NET, although ASP (now generally referred to as Classic ASP) was a good choice at the time. Otherwise, I'm obviously pretty happy with the project, and with NCTM as a customer.

And now for a brief note about the Inktomi search engine product. I was able to create the fielded search capability using the basic Inktomi product. It required some understanding of how they implemented the search engine and some perusing of their API and online documentation to figure out how to do it.

Shortly after I finished the project, I noticed that Inktomi had removed all information about how to do what I had done from their documentation. The reason was that they had come out with an "add-on" product that would do what I had done for a considerable chunk of additional money.

In general, this is pretty typical of search engine vendors. The companies tend to try to slice-and-dice their products into discrete features that you have to pay extra for when those features are almost always already inherent in their base product. I'm not a big fan of what I consider to be "artificial products," i.e., add-on products created by factoring out features that most customers would consider a valid part of the core product.



Comments

David Keener By dkeener on Sunday, June 13, 2010 at 08:01 AM EST

This is arguably one of the most successful projects that I ever worked on. The project was accomplished from 2002 to 2003. The first phase allowed the organization to successfully demo real functionality to their membership at their annual conference (despite the 6-week initial timeframe). The remainder of the project completed the work, and delivered a project that NCTM was extremely happy with. And it's STILL running now, with only minor changes, more than 7 years later (that's like decades in Internet years).

