
Open Archives Data Service Prototype and Automated Subject Indexing Using D-Lib(R) Archive Content As a Testbed
http://www.dlib.org/dlib/december03/mongin/12mongin.html


Introduction

The Indiana University School of Library and Information Science opened a new research laboratory in January 2003: the Indiana University School of Library and Information Science Information Processing Laboratory [IU IP Lab]. The purpose of the new laboratory is to facilitate collaboration between scientists in the department in the areas of information retrieval (IR) and information visualization (IV) research. The lab has several areas of focus, including grid and cluster computing and a standard Java-based software platform that supports plug-and-play research datasets, a selection of standard IR modules, and standard IV algorithms. Future development includes software to enable researchers to contribute datasets, IR algorithms, and visualization algorithms to the standard environment. We decided early on to use OAI-PMH as a resource discovery tool because it is consistent with our mission.

D-Lib Magazine Archive Structure

We are using the D-Lib Magazine archives [D-Lib Magazine] as a dataset for our prototype for several reasons. D-Lib Magazine is provided via open access for non-commercial, educational use with few restrictions (see the D-Lib Magazine access terms and conditions at <http://www.dlib.org/access.html>). The content of the articles is of interest to our software developers, which is convenient during development. D-Lib Magazine also has a metadata file associated with each article: each article is stored in a directory with the article HTML file, its images, and a .meta.xml file that contains metadata about the article.
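
For example, the January 2003 Canós et al. article used as the running example below would sit in a directory along these lines (a hypothetical sketch; the exact metadata file name is our guess from the article file name, not confirmed by the source):

  dlib/january2003/canos/
      01canos.html        the article itself (plus its image files)
      01canos.meta.xml    the metadata file for the article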

D-Lib Metadata

Since March 1999, D-Lib has created an XML metadata file for each article published in the magazine. The metadata format is related to, but not quite the same as, Dublin Core [Dublin Core], the metadata structure used by the Open Archives Initiative.

Below is an example D-Lib metadata file from an article by José Canós et al. in the January 2003 issue [Canós 2003]:

<?xml version="1.0"?>
<!DOCTYPE dlib-meta0.1 SYSTEM "http://www.dlib.org/dlib/dlib-meta01.dtd">
<dlib-meta0.1>
  <title>Building Safety Systems with Dynamic Disseminations of Multimedia Digital Objects</title>
  <creator>Jose H. Canos</creator>
  <creator>Javier Jaen</creator>
  <creator>Juan C. Lorente</creator>
  <creator>Jennifer Perez</creator>
  <publisher>Corporation for National Research Initiatives</publisher>
  <date date-type="publication">January 2003</date>
  <type resource-type="work">article</type>
  <identifier uri-type="DOI">10.1045/january2003-canos</identifier>
  <identifier uri-type="URL">http://www.dlib.org/dlib/january2003/canos/01canos.html</identifier>
  <language>English</language>
  <relation rel-type="InSerial">
      <serial-name>D-Lib Magazine</serial-name>
      <issn>1082-9873</issn>
      <volume>9</volume>
      <issue>1</issue>
  </relation>
  <rights>Jose H. Canos, Javier Jaen, Juan C. Lorente, and Jennifer Perez</rights>
</dlib-meta0.1>

Most of the elements map into Dublin Core [Dublin Core]. Of the two identifier elements, the one typed "URL" has to be selected (the record also carries a DOI identifier). The creator elements have to be combined for all the authors. D-Lib also has a few fields that don't map into Dublin Core at all; this is a common problem with Dublin Core's limited element set.
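
As an illustration of this mapping, here is a minimal Java sketch. The element names follow the example record above; the class name, the helper method, and the comma-joining of creators are our own assumptions, not code from the IU IP Lab platform or RVOT:

  import java.io.File;
  import java.util.*;
  import javax.xml.parsers.DocumentBuilderFactory;
  import org.w3c.dom.*;

  // Hypothetical sketch of the D-Lib meta.xml -> Dublin Core mapping
  // described above.
  public class DlibToDublinCore {

    public static Map<String, String> map(File metaXml) throws Exception {
      Document meta = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder().parse(metaXml);
      Map<String, String> dc = new LinkedHashMap<>();

      dc.put("title", text(meta, "title"));

      // Combine all <creator> elements into one creator value.
      List<String> creators = new ArrayList<>();
      NodeList cs = meta.getElementsByTagName("creator");
      for (int i = 0; i < cs.getLength(); i++)
        creators.add(cs.item(i).getTextContent().trim());
      dc.put("creator", String.join(", ", creators));

      dc.put("publisher", text(meta, "publisher"));
      dc.put("date", text(meta, "date"));
      dc.put("type", text(meta, "type"));

      // Of the two <identifier> elements, keep the one typed "URL".
      NodeList ids = meta.getElementsByTagName("identifier");
      for (int i = 0; i < ids.getLength(); i++) {
        Element id = (Element) ids.item(i);
        if ("URL".equals(id.getAttribute("uri-type")))
          dc.put("identifier", id.getTextContent().trim());
      }

      dc.put("language", text(meta, "language"));
      dc.put("rights", text(meta, "rights"));
      return dc;
    }

    private static String text(Document d, String tag) {
      NodeList nl = d.getElementsByTagName(tag);
      return nl.getLength() > 0 ? nl.item(0).getTextContent().trim() : "";
    }
  }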

Choosing an OAI Repository Program

The Open Archives Initiative [OAI] is an open-standards, open-source group. It offers a number of tools, most of them free of charge, to help implement repositories and harvesters. We wanted a program that is easy to install, is implemented in Java, and makes it easy to load metadata files. The program we chose was the "Rapid Visual OAI Tool" (RVOT) from Old Dominion University [Old Dominion RVOT]. RVOT is a stand-alone Java program that includes a lightweight HTTP server, so it is easy to install as a user program. RVOT includes several mapping procedures for converting metadata into its native rfc1807 metadata format [rfc1807], sets are supported, and the program includes an interactive user interface for mapping metadata fields into the native rfc1807 format. There is some concern that the native file/directory storage might not perform well for large datasets, and we may need to migrate to a database repository in the future to ensure efficient performance.

Below is an example of the rfc1807 metadata record for the Canós et al. D-Lib article:

  TITLE:: Building Safety Systems with Dynamic Disseminations of Multimedia Digital Objects
  AUTHOR:: Jose H. Canos
  AUTHOR:: Javier Jaen
  AUTHOR:: Juan C. Lorente
  AUTHOR:: Jennifer Perez
  ORGANIZATION:: Corporation for National Research Initiatives
  DATE:: January 2003
  TYPE:: article
  ID:: http://www.dlib.org/dlib/january03/canos/01canos.html
  LANGUAGE:: English
  RELATION:: D-Lib Magazine
  COPYRIGHT:: Jose H. Canos, Javier Jaen, Juan C. Lorente, and Jennifer Perez

This rfc1807 record maps into an RVOT Dublin Core record that uses the same syntax but substitutes Dublin Core tags for the rfc1807 tags: Author maps to Creator, Organization maps to Publisher, and Copyright maps to Rights.
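
The tag substitution itself is mechanical. Below is a minimal Java sketch, assuming the "TAG:: value" record syntax shown above (our own illustration, not RVOT code):

  import java.util.Map;

  // Hypothetical table of the rfc1807 -> Dublin Core tag renames
  // described above.
  public class Rfc1807ToDublinCore {
    static final Map<String, String> TAG_MAP = Map.of(
        "TITLE", "TITLE",
        "AUTHOR", "CREATOR",
        "ORGANIZATION", "PUBLISHER",
        "DATE", "DATE",
        "TYPE", "TYPE",
        "ID", "IDENTIFIER",
        "LANGUAGE", "LANGUAGE",
        "RELATION", "RELATION",
        "COPYRIGHT", "RIGHTS");

    // Rewrite one "TAG:: value" line, leaving unknown tags untouched.
    public static String rename(String line) {
      int sep = line.indexOf("::");
      if (sep < 0) return line;
      String tag = line.substring(0, sep).trim();
      return TAG_MAP.getOrDefault(tag, tag) + "::" + line.substring(sep + 2);
    }
  }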

Below is the RVOT Dublin Core record for the Canós article:

  TITLE:: Building Safety Systems with Dynamic Disseminations of Multimedia Digital Objects
  CREATOR:: Jose H. Canos
  CREATOR:: Javier Jaen
  CREATOR:: Juan C. Lorente
  CREATOR:: Jennifer Perez
  PUBLISHER:: Corporation for National Research Initiatives
  DATE:: January 2003
  TYPE:: article
  IDENTIFIER:: http://www.dlib.org/dlib/january2003/canos/01canos.html
  LANGUAGE:: English
  RELATION:: D-Lib Magazine
  RIGHTS:: Jose H. Canos, Javier Jaen, Juan C. Lorente, and Jennifer Perez

Note that the rfc1807 fields [rfc1807] are quite similar to the Dublin Core XML used in OAI-PMH.
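
For comparison, the same record rendered in the oai_dc XML that an OAI-PMH harvester receives might look like the following (a hand-constructed sketch using the standard oai_dc and dc namespaces, not actual output from our prototype):

<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Building Safety Systems with Dynamic Disseminations of Multimedia Digital Objects</dc:title>
  <dc:creator>Jose H. Canos</dc:creator>
  <dc:creator>Javier Jaen</dc:creator>
  <dc:creator>Juan C. Lorente</dc:creator>
  <dc:creator>Jennifer Perez</dc:creator>
  <dc:publisher>Corporation for National Research Initiatives</dc:publisher>
  <dc:date>January 2003</dc:date>
  <dc:type>article</dc:type>
  <dc:identifier>http://www.dlib.org/dlib/january2003/canos/01canos.html</dc:identifier>
  <dc:language>English</dc:language>
  <dc:relation>D-Lib Magazine</dc:relation>
  <dc:rights>Jose H. Canos, Javier Jaen, Juan C. Lorente, and Jennifer Perez</dc:rights>
</oai_dc:dc>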

Extracting Data from D-Lib XML Files

Most of the OAI-PMH fields can be mapped directly from the D-Lib meta.xml files that describe each article. The main field missing from the D-Lib metadata file is a subject terms field. D-Lib does include a <meta> tag in the <head> element of every article, but it carries the same three keywords in every article, so it is not useful for subject access. Since term association is a major part of a search service, we decided to use IR algorithms to compute subject (keyword) terms for each article.

Below is an example of the metadata XML file resulting from our use of the IR algorithm to compute keywords:

<article>
  <title>
  Building Safety Systems with Dynamic Disseminations of Multimedia Digital Objects
  </title>
  <creator>Jose H. Canos</creator>
  <creator>Javier Jaen</creator>
  <creator>Juan C. Lorente</creator>
  <creator>Jennifer Perez</creator>
  <publisher>Corporation for National Research Initiatives</publisher>
  <date>
      <month>January</month>
      <year>2003</year>
  </date>
  <type>article</type>
  <identifier>http://www.dlib.org/dlib/january03/canos/01canos.html</identifier>
  <language>English</language>
  <relation>D-Lib Magazine</relation>
  <rights>Jose H. Canos, Javier Jaen, Juan C. Lorente, and Jennifer Perez</rights>
  <subject>train javax interfaces driver size aspect components ejbs home page http station states 
       evacuation screen stations tunnels section decision transportation
  </subject>
  <xmluri>01canos.xml</xmluri>
</article>

All of the fields except the subject field were derived directly from the D-Lib metadata file.

Generating Keywords from D-Lib Articles

Here is the term selection algorithm we used (a code sketch of the filtering steps follows the list):

  • Extract tokens from HTML article text with Java callback parser.
  • Drop terms that contain special characters, except terms that merely end in punctuation (. , ? ;).
  • Convert characters to lower case.
  • Drop terms less than 4 characters in length.
  • Drop terms in stop word list.
  • Add term to term frequency matrix.
  • When the number of terms in the current document exceeds 100, start a new document (long articles are thus split into roughly 100-term documents).
  • When the end of the file is reached, process the term by document matrix as follows:
    • Compute df (document frequency matrix).
    • Drop terms that don't appear in at least 10% of documents.
    • Compute tf/idf: tf/idf = tf × log(N/df). (Term frequency (tf) is the number of times a term occurs in a single document; inverse document frequency (idf), log(N/df), measures the distinctiveness of the term across the document space.)
    • Select n terms with highest tf/idf weights.
    • Compress tf/idf matrix.
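
Below is a minimal Java sketch of the filtering steps referenced above (lowercasing, the special-character, length, and stop-word filters, and raw frequency counting). The class name and the abbreviated stop word list are our own illustration; the 100-term document splitting, the df threshold, and the Java callback HTML parser are omitted:

  import java.util.*;

  public class TermFilter {
    // Tiny placeholder stop word list; the real list is much longer.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "and", "with", "that", "this", "from"));

    // Count raw frequencies of the tokens that survive the filters.
    public static Map<String, Integer> termFrequencies(List<String> tokens) {
      Map<String, Integer> tf = new HashMap<>();
      for (String raw : tokens) {
        // Keep terms that merely end in punctuation by stripping it ...
        String t = raw.replaceAll("[.,?;]+$", "").toLowerCase();
        if (!t.matches("[a-z]+")) continue;    // ... drop other special characters,
        if (t.length() < 4) continue;          // terms under 4 characters,
        if (STOP_WORDS.contains(t)) continue;  // and stop words.
        tf.merge(t, 1, Integer::sum);
      }
      return tf;
    }
  }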

tf/idf

Each cell of the term-by-document matrix holds the raw count of the number of times a term occurred in a particular document. A separate document frequency array holds the number of documents in which each term is found. These two arrays are used to compute tf/idf. The surviving terms amount to about 25% of the tokens in the HTML document after rejecting terms with special characters, tokens of fewer than four characters, and terms in the stop word list.

tf/idf = tf × log(N/df)

The tfidf(termArray) method computes tf/idf from the raw frequency arrays.

Below is a snippet of sample Java code for this process:

        public void ctfidf() {
          int i, j;
          int N = docs;                      // total number of documents
          for (i = 0; i < nterms; i++) {     // each term
            for (j = 0; j < N; j++) {        // each document
              if (htdf[i][j] > 0 && hdf[i] > 0) {
                // tf/idf = tf * log(N/df); the cast avoids integer division
                htdf[i][j] = htdf[i][j] * Math.log((double) N / hdf[i]);
              } else {
                htdf[i][j] = 0;
              }
            }
          }
        }

The tf/idf method provides a way of weighting terms in a document space. A term that occurs in every document is not a very useful search term, and a term that appears very infrequently is also probably not that useful [Salton, McGill: 1983].
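
As a worked example with made-up numbers: with N = 100 documents, a term that occurs in every document has df = 100, so log(N/df) = log(1) = 0 and its weight is zero no matter how large tf is. A term with tf = 4 that appears in only 10 documents gets a weight of 4 × ln(100/10) ≈ 4 × 2.30 ≈ 9.2 (the code above uses Math.log, the natural logarithm).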
