
Notes on using Lemur

Using Lemur with Java + Eclipse

Start coding against Lemur’s libraries
In Eclipse, create a new project.
Then open the project’s “Build Path” settings (right-click the project name and choose “Build Path…”, or use Menu: Project > Properties > Java Build Path).
On the “Libraries” tab, add the lemur.jar file.

Run code that uses Lemur’s libraries
Open Menu: Project > Properties…
On the “Order and Export” tab, make sure the lemur.jar box is checked.
Then run it. It should work!
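
A minimal sketch to check that the setup compiles and runs against lemur.jar. It assumes the jar ships the SWIG-generated lemurproject.indri bindings (QueryEnvironment, ScoredExtentResult); the index path, the query string, and the getter names on the result objects are placeholders/assumptions, not taken from the Lemur docs verbatim.

import lemurproject.indri.QueryEnvironment;
import lemurproject.indri.ScoredExtentResult;

public class LemurHello {
    public static void main(String[] args) throws Exception {
        QueryEnvironment env = new QueryEnvironment();
        env.addIndex("/path/to/the/index");   // placeholder path to an existing Indri index
        // ask for the top 10 documents for a free-text query
        ScoredExtentResult[] results = env.runQuery("information retrieval", 10);
        for (ScoredExtentResult r : results) {
            // getDocument()/getScore() accessor names are an assumption about the SWIG wrapper
            System.out.println(r.getDocument() + "\t" + r.getScore());
        }
        env.close();
    }
}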

Note on re-running indexing!
If the index data destination already exists, nothing new is created; the current data is kept and the new data is appended to it!
You should clean up that folder so that fresh results can be created!
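
If you want to script that clean-up, something like the following works (plain Java, no Lemur classes; the index path is a placeholder): it wipes the old destination so the index builder starts from an empty folder instead of appending to stale data.

import java.io.IOException;
import java.nio.file.*;
import java.util.Comparator;
import java.util.stream.Stream;

public class CleanIndexDir {
    public static void main(String[] args) throws IOException {
        Path indexDir = Paths.get("/path/to/the/index");   // placeholder destination folder
        if (Files.exists(indexDir)) {
            try (Stream<Path> walk = Files.walk(indexDir)) {
                walk.sorted(Comparator.reverseOrder())      // delete children before parent dirs
                    .forEach(p -> p.toFile().delete());
            }
        }
        Files.createDirectories(indexDir);                  // recreate an empty destination
    }
}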

Lemur – my log from studying Lemur.

mar 2008

source: The Lemur Toolkit – Tutorials: Starting Out: Overview: A Beginner’s Guide to Indexing

Running the Jelinek-Mercer model.

To issue a query via IndriRunQuery, you need to create a parameter file, much like the one that was created to build an index, and then run IndriRunQuery with that parameter file as its argument.

At its most basic, an IndriRunQuery parameter file consists of an index path and a query. For example:


<parameters>
  <memory>256M</memory>
  <index>/path/to/the/index</index>
  <query>the query to issue</query>
</parameters>
Set the rule element inside the above parameters tag to choose the Jelinek-Mercer model. The rule element specifies the smoothing rule (TermScoreFunction) to apply. The format of a rule is:

( key ":" value ) [ "," key ":" value ]*

Here’s an example rule in command line format:

-rule=method:linear,collectionLambda:0.2,field:title

and in parameter file format:
<rule>method:linear,collectionLambda:0.2,field:title</rule>

This corresponds to Jelinek-Mercer smoothing with background lambda equal to 0.2, only for items in a title field.

If nothing is listed for a key, all values are assumed. So, a rule that does not specify a field matches all fields. This makes -rule=method:linear,collectionLambda:0.2 a valid rule.

Valid keys:

method – smoothing method (text)
field – field to apply this rule to
operator – type of item in the query to apply to { term, window }

Valid methods:

dirichlet – (also ‘d’, ‘dir’) (default mu=2500)
jelinek-mercer – (also ‘jm’, ‘linear’) (default collectionLambda=0.4, documentLambda=0.0); collectionLambda is also known as just “lambda”, and either will work
twostage – (also ‘two-stage’, ‘two’) (default mu=2500, lambda=0.4)

If the rule doesn’t parse correctly, the default is Dirichlet, mu=2500.
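
When querying through the Java API instead of IndriRunQuery, the same rule string can apparently be set programmatically. This is only a sketch under the assumption that lemur.jar’s lemurproject.indri.QueryEnvironment exposes a setScoringRules(String[]) method mirroring the C++ API; the index path and query are placeholders.

import lemurproject.indri.QueryEnvironment;

public class JelinekMercerQuery {
    public static void main(String[] args) throws Exception {
        QueryEnvironment env = new QueryEnvironment();
        env.addIndex("/path/to/the/index");   // placeholder index path
        // Jelinek-Mercer smoothing with collection lambda = 0.2, applied to all fields
        // (setScoringRules(String[]) is assumed to mirror the C++ QueryEnvironment API)
        env.setScoringRules(new String[] { "method:linear,collectionLambda:0.2" });
        System.out.println(env.runQuery("the query to issue", 10).length + " results");
        env.close();
    }
}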


13 mar 2008


Lemur keeps saying my file is malformed – why?
I want to index my files using Lemur, but although I try to follow the TRECTEXT format as below, it still does not work!

<DOC>
<DOCNO> 1 </DOCNO>
<TEXT>
(document text here)
</TEXT>
</DOC>

At last, after an hour of trying (using the sample data provided in the Lemur package), I found out that we have to put an extra newline character at the end of the file – a funny requirement of Lemur for the TRECTEXT format! Also, pay attention to which newline character is used (\r\n or just \n).

<DOC>
<DOCNO> 1 </DOCNO>
<TEXT>
(document text here)
</TEXT>
</DOC>

// enter twice at the end of the file!!!!
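
To avoid this when generating TRECTEXT files from code, a small sketch (standard Java only; the file name and document body are placeholders) that writes \n line endings and the extra blank line at the end:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

public class WriteTrecText {
    public static void main(String[] args) throws IOException {
        String doc =
            "<DOC>\n" +
            "<DOCNO> 1 </DOCNO>\n" +
            "<TEXT>\n" +
            "some document text here\n" +     // placeholder document body
            "</TEXT>\n" +
            "</DOC>\n" +
            "\n";                             // the extra trailing newline Lemur wants
        Files.write(Paths.get("mydata.trectext"),           // placeholder file name
                    doc.getBytes(StandardCharsets.UTF_8));
    }
}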

mar 2008

source: The Lemur Toolkit – Tutorials: Starting Out: Overview: A Beginner’s Guide to Indexing

What is an index?

An index, or database, is basically a collection of information that can be quickly accessed, using some piece of information as a point of reference or key (what it’s indexed by). In our case, we index information about the terms in a collection of documents, which you can access later using either a term or a document as the reference.

Specifically, we collect term frequency, term position, and document length statistics, because those are most commonly needed for information retrieval. For example, from the index you can find out how many times a certain term occurred in the collection of documents, or how many times it occurred in just one specific document. Retrieval algorithms that decide which documents to return for a given query use the information collected in the index in their scoring calculations.
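
As a toy illustration of that idea (plain Java, not Lemur code, invented here just for this note): an index keyed by term that can answer both “how often does term t occur in the whole collection” and “how often does it occur in document d”.

import java.util.*;

public class ToyIndex {
    // term -> (docId -> term frequency in that document)
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

    public void addDocument(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            postings.computeIfAbsent(term, t -> new HashMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // total occurrences of the term across all documents
    public int collectionFrequency(String term) {
        int sum = 0;
        for (int tf : postings.getOrDefault(term, Collections.emptyMap()).values()) {
            sum += tf;
        }
        return sum;
    }

    // occurrences of the term in one specific document
    public int termFrequency(String term, int docId) {
        return postings.getOrDefault(term, Collections.emptyMap()).getOrDefault(docId, 0);
    }

    public static void main(String[] args) {
        ToyIndex idx = new ToyIndex();
        idx.addDocument(1, "to be or not to be");
        idx.addDocument(2, "to index is to know");
        System.out.println(idx.collectionFrequency("to"));  // 4
        System.out.println(idx.termFrequency("be", 1));     // 2
    }
}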

11 mar 2008


source: The Lemur Toolkit – Tutorials: Starting Out: Overview: Overview of the Lemur Toolkit

Lemur currently supports the following features:

  • Indexing:
    • English, Chinese and Arabic text
    • word stemming (Porter and Krovetz stemmers)
    • omitting stopwords
    • recognizing acronyms
    • token level properties, like part of speech and named entities
    • passage indexing
    • incremental indexing
    • in-line and offset annotation support
  • Retrieval:
    • ad hoc retrieval (TFIDF, Okapi, and InQuery)
    • passage retrieval
    • cross-lingual retrieval
    • language modeling (KL-divergence)
      • query model updating for pseudo feedback
      • two-stage smoothing
      • smoothing with Dirichlet prior or Markov chain
    • relevance feedback
    • structured query language
    • suffix-based wildcard term matching (Indri Query Language only)