Archive for March, 2008

Lemur – my log while studying Lemur.

mar 2008

source: The Lemur Toolkit – Tutorials :Starting Out : Overview: A Beginner’s Guide to Indexing

Run Jelinek Mercer model.

To issue a query via IndriRunQuery, you need to create a parameter file, much like the one created to build an index, and run it by executing “IndriRunQuery” followed by the path to that parameter file.

At the most basic, an IndriRunQuery parameter file should consist of an index path, and a query. As an example:

query: the query to issue
rule: specifies the smoothing rule (TermScoreFunction) to apply. Set the rule element inside the parameters tag above to choose the Jelinek-Mercer model. The format of the rule is:

( key ":" value ) [ "," key ":" value ]*

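Here’s an example rule in command line format:

-rule=method:linear,collectionLambda:0.2,field:title

and in parameter file format:

<rule>method:linear,collectionLambda:0.2,field:title</rule>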

This corresponds to Jelinek-Mercer smoothing with background lambda equal to 0.2, only for items in a title field.

If nothing is listed for a key, all values are assumed. So, a rule that does not specify a field matches all fields. This makes -rule=method:linear,collectionLambda:0.2 a valid rule.

Valid keys:

method: smoothing method (text)
field: field to apply this rule to
operator: type of item in query to apply to { term, window }

Valid methods:

dirichlet (also ‘d’, ‘dir’) (default mu=2500)
jelinek-mercer (also ‘jm’, ‘linear’) (default collectionLambda=0.4, documentLambda=0.0); collectionLambda is also known as just “lambda”, either will work
twostage (also ‘two-stage’, ‘two’) (default mu=2500, lambda=0.4)

If the rule doesn’t parse correctly, the default is Dirichlet, mu=2500.
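
To make the two main smoothing rules concrete, here is a minimal Java sketch of the textbook Jelinek-Mercer and Dirichlet formulas. It is only an illustration: the class name, method names and the counts in main are made up, and Indri's own scoring code may differ in detail (for example, it also supports documentLambda for field-level smoothing).

```java
class SmoothingSketch {
    /**
     * Textbook Jelinek-Mercer smoothing:
     * (1 - lambda) * tf / docLen + lambda * P(term | collection).
     */
    static double jelinekMercer(long tf, long docLen, double collectionProb, double lambda) {
        double docProb = docLen == 0 ? 0.0 : (double) tf / docLen;
        return (1.0 - lambda) * docProb + lambda * collectionProb;
    }

    /**
     * Textbook Dirichlet-prior smoothing:
     * (tf + mu * P(term | collection)) / (docLen + mu).
     */
    static double dirichlet(long tf, long docLen, double collectionProb, double mu) {
        return (tf + mu * collectionProb) / (docLen + mu);
    }

    public static void main(String[] args) {
        // Hypothetical statistics: the term occurs 3 times in a 100-term document
        // and has probability 0.001 in the whole collection.
        long tf = 3;
        long docLen = 100;
        double collectionProb = 0.001;

        // collectionLambda = 0.2, as in the example rule above
        System.out.println("JM:        " + jelinekMercer(tf, docLen, collectionProb, 0.2));
        // mu = 2500, the default listed for the dirichlet method
        System.out.println("Dirichlet: " + dirichlet(tf, docLen, collectionProb, 2500));
    }
}
```

A larger lambda (or mu) pulls the estimate towards the collection statistics, a smaller one trusts the document more.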

13 mar 2008

Lemur keeps saying my file is malformed – why?
I want to index my files with Lemur, but although I tried to follow the trectext format as below, it still does not work!


At last, after an hour of trying (using the sample data provided in the Lemur package), I found out that the file must end with a newline (Enter) character, a funny requirement of Lemur for the trectext format! Also pay attention to which new-line character is used (\r\n or just \n).


//press Enter twice at the end of the file!!!!

mar 2008

source: The Lemur Toolkit – Tutorials :Starting Out : Overview: A Beginner’s Guide to Indexing

What is an index?

An index, or database, is basically a collection of information that can be quickly accessed, using some piece of information as a point of reference or key (what it’s indexed by). In our case, we index information about the terms in a collection of documents, which you can access later using either a term or a document as the reference.

Specifically, we can collect term frequency, term position, and document length statistics because those are most commonly needed for information retrieval. For example, from the index, you can find out how many times a certain term occurred in the collection of documents, or how many times it occurred in just one specific document. Retrieval algorithms that decide which documents to return for a given query use the collected information in the index in their scoring calculations.

11 mar 2008

source: The Lemur Toolkit – Tutorials :Starting Out : Overview: Overview of the Lemur Toolkit

Lemur currently supports the following features:

  • Indexing:
    • English, Chinese and Arabic text
    • word stemming (Porter and Krovetz stemmers)
    • omitting stopwords
    • recognizing acronyms
    • token level properties, like part of speech and named entities
    • passage indexing
    • incremental indexing
    • in-line and offset annotation support
  • Retrieval:
    • ad hoc retrieval (TFIDF, Okapi, and InQuery)
    • passage retrieval
    • cross-lingual retrieval
    • language modeling (KL-divergence)
      • query model updating for pseudo feedback
      • two-stage smoothing
      • smoothing with Dirichlet prior or Markov chain
    • relevance feedback
    • structured query language
    • suffix-based wildcard term matching (Indri Query Language only)

Java tips

Get process id in Java?
Google: java + get process id -> Java Programming [Archive] – Getting Process ID (PID)
There’s no pure-Java way (at least before Java 9). You must use the shell variable $$.
So when running the Java program, pass it in as a system property with -DprocessId=$$, then use System.getProperty("processId") to read it. Note that $$ is the process id of the shell that launches java, not of the JVM itself.


$ cat x.java
public class x {
    public static void main(String a[]) {
        System.out.println(System.getProperty("pid"));
    }
}
$ javac x.java
$ java -Dpid=$$ x
$ java -Dpid=$$ x
$ ps -p $$ -f
ijbalazs 1844 1842 0 12:02 pts/2 00:00:00 /bin/bash
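
As a small sketch of both approaches side by side: the property trick from this tip, and the ProcessHandle API that newer JDKs (Java 9 and later) provide. The class and method names are only illustrative.

```java
class PidSketch {
    /** Returns the value passed with -DprocessId=..., or null when the property was not given. */
    static String pidFromProperty() {
        return System.getProperty("processId");
    }

    /** Returns the process id of the running JVM itself (needs Java 9 or newer). */
    static long pidFromProcessHandle() {
        return ProcessHandle.current().pid();
    }

    public static void main(String[] args) {
        // Run as: java -DprocessId=$$ PidSketch
        System.out.println("processId property (the launching shell): " + pidFromProperty());
        System.out.println("JVM process id: " + pidFromProcessHandle());
    }
}
```

The two numbers normally differ, because the property holds the id of the shell while ProcessHandle reports the id of the java process.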


Read from/write to file in Java?
Google:java + buffered + stream -> Lesson: Basic I/O (The Java™ Tutorials > Essential Classes)

How to convert a byte[] bytesArray to a long[] longsArray?
Google: java + convert long array to byte array -> Java Programming [Archive] – Cast bytearray into long

byte[] to long[] (longsArray must already be allocated, with at most bytesArray.length / 8 elements):
ByteBuffer.wrap(bytesArray).asLongBuffer().get(longsArray) ;

long[] to byte[]:
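
One way to go in both directions with java.nio.ByteBuffer is sketched below. The class and method names are my own; the byte order is ByteBuffer's default (big-endian), and any trailing bytes that do not fill a whole long are ignored.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

class ByteLongConversion {
    /** byte[] to long[]: every 8 bytes become one long (big-endian). */
    static long[] toLongs(byte[] bytes) {
        long[] longs = new long[bytes.length / Long.BYTES];
        ByteBuffer.wrap(bytes).asLongBuffer().get(longs);
        return longs;
    }

    /** long[] to byte[]: every long becomes 8 bytes (big-endian). */
    static byte[] toBytes(long[] longs) {
        ByteBuffer buffer = ByteBuffer.allocate(longs.length * Long.BYTES);
        // The LongBuffer is a view on the same memory, so put() fills the byte buffer.
        buffer.asLongBuffer().put(longs);
        return buffer.array();
    }

    public static void main(String[] args) {
        long[] original = {1L, -2L, Long.MAX_VALUE};
        byte[] bytes = toBytes(original);
        long[] back = toLongs(bytes);
        System.out.println(bytes.length + " bytes, round trip ok: " + Arrays.equals(original, back));
    }
}
```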

Read from a text file and write text to another file?
Google: java write text to file -> Java Practices -> Reading and writing text files

Usage context:
You have a text file to read from, and want to write the loaded content to another file. How?

Main idea:
Read from the text file filenameIn and store it in a String content:

File f = new File(filenameIn);
BufferedReader br = new BufferedReader(new FileReader(f));
StringBuffer sb = new StringBuffer();
String l;
while ((l = br.readLine()) != null) {
    sb.append(l).append("\n"); // readLine() strips the line terminator
}
br.close();
String content = sb.toString();

Write content to the text file filenameOut:

Writer w = new BufferedWriter(new FileWriter(filenameOut));
w.write(content);
w.close();
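
Put together as a complete program, the idea could look like the sketch below. It assumes Java 7 or newer for try-with-resources (which closes the files even when an exception is thrown); the class and method names are only illustrative, and the platform default character encoding is used, as with FileReader and FileWriter in general.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

class CopyTextFile {
    /** Reads the whole text file into a String, ending every line with '\n'. */
    static String readAll(String filenameIn) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(new FileReader(filenameIn))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    /** Writes the given text to the file, replacing any previous content. */
    static void writeAll(String filenameOut, String content) throws IOException {
        try (Writer w = new BufferedWriter(new FileWriter(filenameOut))) {
            w.write(content);
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java CopyTextFile inputFile outputFile
        writeAll(args[1], readAll(args[0]));
    }
}
```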


How to run external program or command line in Java?
Google: java + compress + object -> Learn Java: Running external commands in Java applications

Main idea:
Use the Runtime.getRuntime().exec(…) function. Below is a sample of usage.


  • Avoid the .exec(String) version, because it is hard to pass parameters with it.
  • You cannot redirect standard output using ‘>’, as in .exec("someCommand > outputFile")
  • Nor can you use piping with ‘|’, as in .exec("command1 | command2")

Sample usage:

//execute the command ‘command’
String[] cmd = {
    "command",
    "argument1",
    "argument2", …
};
Process p = Runtime.getRuntime().exec(cmd);

//print the standard output of the command
InputStream is = p.getInputStream();
InputStreamReader isr = new InputStreamReader(is);
BufferedReader br = new BufferedReader(isr);
while (true) {
    String s = br.readLine();
    if (s == null) {
        break;
    }
    System.out.println(s);
}
p.waitFor(); //exitValue() throws if the process has not finished yet
System.out.println("Exit (code=" + p.exitValue() + ").");
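
If you do need ‘>’ or ‘|’, one possible workaround is to let a shell interpret the command line. The sketch below uses ProcessBuilder to run "sh -c ...", so it assumes a Unix-like system where sh is available; the class and method names are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

class RunCommandSketch {
    /** Runs the command line through "sh -c" so that the shell handles '>' and '|'. */
    static String runInShell(String commandLine) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("sh", "-c", commandLine);
        pb.redirectErrorStream(true); // merge stderr into stdout so only one stream must be read
        Process p = pb.start();

        StringBuilder output = new StringBuilder();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = br.readLine()) != null) {
                output.append(line).append('\n');
            }
        }

        int exitCode = p.waitFor(); // wait until the command has really finished
        if (exitCode != 0) {
            throw new IOException("command exited with code " + exitCode);
        }
        return output.toString();
    }

    public static void main(String[] args) throws Exception {
        // A pipe that would not work with a plain exec("echo hello | tr a-z A-Z")
        System.out.print(runInShell("echo hello | tr a-z A-Z"));
    }
}
```

Passing the whole command line to a shell is convenient, but never build that string from untrusted input.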

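How to compress/decompress a bytes array? Or even better, an object?
Google: java + compress + object -> Compressing and Decompressing Data using Java

Search for the “Compressing Object” heading there.

ByteArrayOutputStream baos = new ByteArrayOutputStream();
GZIPOutputStream gz = new GZIPOutputStream(baos);
ObjectOutputStream oos = new ObjectOutputStream(gz);
oos.writeObject(obj); // obj: the Serializable object to compress (placeholder name)
oos.close();          // closing also finishes the GZIP stream
baos.toByteArray(); // this is the compressed data of your objects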

Google: java + compress + bytes array
-> Compressing a Byte Array (Java Developers Almanac Example)
-> Decompressing a Byte Array (Java Developers Almanac Example)
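
For a plain byte array, a round trip can be sketched with the GZIP streams from java.util.zip. Note that this is my own variant and not the code from the linked pages, which may use a different class such as Deflater/Inflater; the class and method names are illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipBytes {
    /** Compresses the bytes into GZIP format. */
    static byte[] compress(byte[] data) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(baos)) {
            gz.write(data);
        } // closing the GZIP stream writes the trailer, so read baos only afterwards
        return baos.toByteArray();
    }

    /** Restores the original bytes from GZIP data. */
    static byte[] decompress(byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = gz.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] original = new byte[10000]; // highly repetitive data compresses well
        byte[] packed = compress(original);
        System.out.println(original.length + " -> " + packed.length + " bytes, round trip ok: "
                + Arrays.equals(original, decompress(packed)));
    }
}
```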

How to convert an object to a bytes array and convert back from a bytes array to an object?
(so as to write/read an object stored in a binary file)

To convert an object to a bytes array, see the code below from How can I convert any Java Object into byte array? And byte array to file object

public static byte[] getBytes(Object obj) throws java.io.IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    ObjectOutputStream oos = new ObjectOutputStream(bos);
    oos.writeObject(obj); // obj must implement java.io.Serializable
    oos.flush();
    oos.close();
    byte[] data = bos.toByteArray();
    return data;
}

To convert a byte array back to an object, see the code here: Java Programming [Archive] – convert byte array to object

// .. get data into an array named "byteArray"
ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(byteArray));
// .. let's say you have a class named "Person"
Person person = (Person) ois.readObject();
// close the input stream (probably not necessary in this case)
ois.close();
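
Both directions can be combined into one small round-trip sketch. The class and method names are my own, and an ArrayList stands in for the "Person" class, which is not shown in these notes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;

class SerializationSketch {
    /** Serializes the object into a byte array. */
    static byte[] toBytes(Serializable obj) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }
        return bos.toByteArray();
    }

    /** Rebuilds the object from a byte array produced by toBytes. */
    static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ArrayList<String> names = new ArrayList<>();
        names.add("lemur");
        names.add("indri");

        byte[] data = toBytes(names);
        Object copy = fromBytes(data);
        System.out.println(data.length + " bytes, equal after round trip: " + names.equals(copy));
    }
}
```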

Java – binary file – random access?
The title is the Google search keyword.
Sources I found:
  1. Physics Simulation and Java – Lecture 9A: Binary File I/O Example
  2. Java I/O: skip to the “RandomAccess Files” section.

What you may need, as I did:

  • FileInputStream – DataInputStream (and FileOutputStream – DataOutputStream for writing)
    • readInt – writeInt
    • readLong – writeLong
  • RandomAccessFile f
    • seek(0) //go to the beginning of the file
    • seek(f.length()) //go to the end of the file
    • getFilePointer() //current position of file pointer
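
A minimal sketch of how these calls fit together: append a long at the end of a binary file, remember its offset, and later jump straight back to it. The class and method names are only illustrative.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

class RandomAccessSketch {
    /** Appends a long at the end of the file and returns the offset where it was written. */
    static long appendLong(String path, long value) throws IOException {
        try (RandomAccessFile f = new RandomAccessFile(path, "rw")) {
            f.seek(f.length());               // move to the end of the file
            long offset = f.getFilePointer(); // current position = where the value will go
            f.writeLong(value);               // a long always takes 8 bytes
            return offset;
        }
    }

    /** Reads the long stored at the given byte offset. */
    static long readLongAt(String path, long offset) throws IOException {
        try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
            f.seek(offset);
            return f.readLong();
        }
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("random-access", ".bin");
        tmp.deleteOnExit();
        appendLong(tmp.getPath(), 10L);
        long second = appendLong(tmp.getPath(), 20L);
        System.out.println("value at offset " + second + ": " + readLongAt(tmp.getPath(), second));
    }
}
```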

How to use an external jar file jarExt in a jar file myJar?
Very easy: when you build myJar, add the Class-Path attribute to your manifest file, pointing to the location of jarExt, as below:
Class-Path: path/to/your/external/jar the/2nd/jar/space/separated
That’s all. I don’t know why Sun doesn’t make this more prominent in their tutorial, given how common the need is.

How to compile java source file?
Create your source files, putting them into folders according to their package (e.g. class packageName.any.className must be put in packageName/any/).
From your current folder, compile your code by calling:
javac p2ur.source.files -classpath p2ur.used_packageName.folder -d p2ur.storing.compiledOutput.folder


p2ur is shorthand for “path to your”.
If your code does not use any class other than the ones from the JDK, then forget the classpath argument, which means:

javac p2ur.source.files -d p2ur.storing.compiledOutput.folder

If you also skip the -d argument, i.e. javac p2ur.source.files, then the output is placed in the same folder as the source.

Special note:

If your code uses classes that are not in the JDK, then without a classpath the compiler cannot tell what those classes are! So you have to tell the JDK where those classes are stored. Remember to give the path according to the classpath rule below.

The JDK actually tries to find those classes in your current folder when no classpath is given. This explains why other classes referred to by your code and stored in the same folder are compiled automatically when you compile your code without giving a classpath.
The classpath may contain many paths, separated by a colon on Linux and a semicolon on Windows.

If your code

How to run java source file?

You have a .class bytecode file with the main method and want to run it? Just call (giving the class name, without the .class extension):
java p2ur.bytecode.file
It is that simple when your code uses only JDK classes. What if you use other classes (ones you wrote yourself, ones from other people, etc.)? Then the JDK looks in the classpath, which by default (when not provided) is the current folder where you call java. You add the classpath when calling java as below.
java -classpath <classpath> p2ur.bytecode.file
Pay attention to the way the JDK works with the classpath. When a class needs to be found, it iterates over every path in the classpath, called cpEntry, and checks whether cpEntry/any/package/class/name exists.

Note that if your class has a package, the full package-qualified name must be given when calling java, i.e. the whole packageName.any.className, not only the className.

Also notice that in the above command the -classpath option must precede the class name or it won’t work.

Having the compiled classes, now how to easily distribute?
The answer is to use jar files. The goal is to combine a set of classes into a single file. To do this we call:
jar cfm jarOutputFile manifestFile -C classesFolder .
We need some context to understand the above command. Let’s say we have:

  • 2 classes: pack.age1.class1 and pack.age2.class2
    stored accordingly in src/pack/age1/ and src/pack/age2/
    then compiled into bin/pack/age1/class1.class and bin/pack/age2/class2.class
    You may need to look back at the compiling tip above if you don’t know how to do this.

The above command creates the jar file jarOutputFile, which includes all the files stored recursively in classesFolder. Here, the working folder must be the bin/ folder. You get this by adding -C path/to/bin theChosenInBin, where theChosenInBin names the files in the bin/ folder to add to the jar file. Because we want to add all files, the option should be -C path/to/bin . (note the dot “.” at the end, which means the bin/ folder itself, making the jar tool look for the files inside recursively).

Having the jar file, we can run it with:
java -jar jarOutputFile
but this requires a starting point, i.e. one main method among the classes must be chosen.
The main method to execute is the one indicated in the manifest file as:
Main-Class: pack.ageX.classX (enter here)
(end of file here)

Note:
again, the entry must end with a new-line character; without it the entry won’t be recognized by the JDK,
and we give just the name of the main class, without .class or .java at the end.

How to print a percentage?

Double perc = Double.valueOf(value); // value: the percentage to print (placeholder name)
System.out.print(String.format("%1$.2f", new Object[]{perc}));
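
A slightly fuller sketch that computes the percentage from two counts and appends the % sign. The helper name is made up, and Locale.ROOT is my own choice so that the decimal separator is always a dot, whatever the default locale of the machine is.

```java
import java.util.Locale;

class PercentSketch {
    /** Formats part/total as a percentage with two decimals, e.g. 1 of 8 gives "12.50%". */
    static String percent(long part, long total) {
        double perc = total == 0 ? 0.0 : 100.0 * part / total;
        // "%%" prints a literal percent sign; ".2f" keeps two digits after the dot
        return String.format(Locale.ROOT, "%.2f%%", perc);
    }

    public static void main(String[] args) {
        System.out.println(percent(1, 8)); // prints 12.50%
        System.out.println(percent(2, 3)); // prints 66.67%
    }
}
```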

About classpath
How the JDK finds your referred classes – the classpath rule

Both the compiler and the JVM construct the path to your .class files by adding the package name to the class path. For example, if the package name is

com.example.graphics

and your class path is

<path_two>\classes

then the compiler and JVM look for .class files in

<path_two>\classes\com\example\graphics

A class path may include several paths, separated by a semicolon (Windows) or colon (Unix). By default, the compiler and the JVM search the current directory.
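
The lookup rule can be mimicked in a few lines of Java. This is a simplified sketch of the rule for directory entries only (real class paths may also contain jar files); the class name pack.age1.class1 and the helper are purely illustrative.

```java
import java.io.File;

class ClassPathRule {
    /** Builds the file that would be looked for under one directory entry of the class path. */
    static String candidate(String classPathEntry, String fullyQualifiedClassName) {
        // Dots in the package name become folder separators
        return classPathEntry + File.separator
                + fullyQualifiedClassName.replace('.', File.separatorChar) + ".class";
    }

    public static void main(String[] args) {
        // java.class.path holds the class path the running JVM was started with
        String classPath = System.getProperty("java.class.path");
        for (String entry : classPath.split(File.pathSeparator)) {
            System.out.println(candidate(entry, "pack.age1.class1"));
        }
    }
}
```

File.pathSeparator is the colon or semicolon mentioned above, so the same code works on both Unix and Windows.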

Perl working tips

Cool tool to test regular expression

Great tool to try regular expression: EditPad Pro (only need the demo version)

Regular expression (re) notes:

Basic syntax:

if (/your-re/)
Alternate syntax:
if (m|your-re|) – note that | can be replaced by almost any other delimiter character!
if (m|
Putting grouping parentheses:
if (m{something (?:one|two)+ anything}) – now “one” or “two” can appear one or more times, in any mix! (without the group you can only write one+ or two+). The delimiter here is {} because | is already used for the alternation inside the pattern.
These tips make working with regular expressions much easier!

Some quick notes

Remove leading and trailing spaces: $string =~ s/^\s+|\s+$//g ; (the /g flag is needed so that both ends are trimmed)
Access the command line arguments: use $ARGV[0], $ARGV[1], …
Open a new text file to write:

open FILE, ">filename" ;
print FILE "file content";
close FILE ;

Search & replace:

$string =~ s/regex/replacement/g


How to run a Linux shell command in Perl & get the result?

Run shell command:
Source: Perl run shell command -> Run Shell Command in Perl Script
According to the page above, the way to run a shell command is Perl’s system function.

Get the shell-command result:
Source: Perl run shell command -> Live Search: perl shell command get output
There, they say there are three basic ways of running external commands:

  • system $cmd;  # using system()
  • $output = `$cmd`; # using backticks (``)
  • open (PIPE, "$cmd |"); # using open()

I prefer the backticks syntax for getting the command’s output, it is so convenient! Specifically, I wrote the following to get the size of free RAM:

my $cmd="cat /proc/meminfo | grep MemFree" ;
my $memFree = `$cmd` ; #note the backticks `, not the ' !
$memFree =~ s/
^\D+ (\d+) \D+$
/$1/x ;

What a find!

Notes about arguments when using sub in Perl!

Source: I tried it myself!

Assume we have the code:

mySub(); #this is where mySub is called
my $outside=123 ;
sub mySub {
#here $outside has no value yet!
}

So note that a sub cannot rely on an outside variable that is assigned after the line where the sub is called: at call time the assignment has not run yet, so the variable is still undefined. Be careful!

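How to get Perl’s array size?

Source: google: perl array size -> Array Manipulation in Perl

Assuming we have an array @arr, to get its size use either:

  • directly using $# : $#arr + 1 ($#arr is the index of the last element)
  • from an indirectly converted variable: my $size = @arr; (an array in scalar context gives its size)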


Unix/Linux OS tips – my logged questions/problems when working with it

Check if null value?

if [ -z "$variable" ]; then
    echo "Null"
else
    echo "Not null"
fi

Search and replace in bash command?
google: unix bash search and replace ->One-line shell script for find and replace [unix] [bash] [perl] [shell]

perl -pi -e 's/find/replace/g' *.txt 

How to know the file system’s block size?
google: file system + minimum block size -> obtain filesystem block size
Block size = the minimum size of the storage unit used to store file content.
Call the command:

/sbin/dumpe2fs /dev/<device> | grep "Block size"
where <device> could be:

  • hda, hda1, hda2, …
  • sda, sda1, sda2, ….

In “kate”, how to execute the current open file?
Simply press Ctrl-Shift-X

If command – condition command?
google: unix bash command condition ->

Syntax (the spaces around the brackets and operators must be respected exactly; every if is closed by fi):

if [ string1 = string2 ]; then

if [ number1 -eq number2 ]; then


Check if defined or not

if [ -n "$X" ]; then # -n tests to see if the argument is non empty
echo "the variable X is not the empty string"
fi

Sort a file numerically and remove duplicated values?
google: unix command remove duplicated value -> Unix Toolbox

Sort numerically:

sort -n file2sort > output

Remove duplicated lines (uniq only drops adjacent duplicates, so run it on sorted input; sort -n -u does both steps at once):

uniq file2RemoveDup > output

Install new font in Unix system?
google: linux install new font -> Installing new fonts – The UNIX Forums

First, extract and copy the fonts to:


Login as root user.
Then, continue to use the command:

fc-cache /usr/share/fonts/local/directory_you_put_fonts

Print elapsed time from bash command ?
google: bash command elapsed time -> Using bc in bash script

time1=`date +%s`
# your program that need to measure time runs here
time2=`date +%s`
elapsed=`echo $time2-$time1 | bc`
echo "Elapsed time: $elapsed seconds."

Delete files recursively – how to?

find . -type f -name "yourPattern" -delete

How to open the file explorer GUI in Linux?

Open a console and run the command nautilus.

How to remote copy files/folders between Linux computers?

google: linux + remote copy ->

Use scp (simpler):
scp file2Copy username@hostIP:targetPath
After calling the above command, you’ll be asked for the password of username.
To copy a folder recursively:
scp -r folder2Copy username@hostIP:targetPath

Use rcp:
There it says you need a .rhosts file in your home folder, where each line is:
(lost information here -> empty! Sorry!)

Then call rcp to copy files, for example:
rcp myLocalFile hostname:targetFile

How to view the memory used by processes?

google: process + view memory used + linux -> Process memory usage –

ps -AH v

How to change the shell prompt’s color?

google: linux + command line -> some beginner site (to widen the set of keywords I know) -> shell prompt
google: linux + shell prompt + color -> Tip: Prompt magic

Prompt basics

Under bash, you can set your prompt by changing the value of the PS1 environment variable, as follows:

$ export PS1="> "

Changes take effect immediately, and can be made permanent by placing the “export” definition in your ~/.bashrc file. PS1 can contain any amount of plain text that you’d like:

$ export PS1="This is my super prompt > "
This is my super prompt >

While this is, um, interesting, it’s not exactly useful to have a prompt that contains lots of static text. Most custom prompts contain information like the current username, working directory, or hostname. These tidbits of information can help you to navigate in your shell universe. For example, the following prompt will display your username and hostname:

$ export PS1="\u@\H > "
drobbins@freebox >

Sequence Description

\u Your username

\w Current working directory (such as “/home/drobbins”)


Colors are selected by adding special sequences to PS1, basically sandwiching numeric values between a “\e[” (escape open-bracket) and an “m”. If we specify more than one numeric code, we separate each code with a semicolon. Here’s an example color code:

"\e[0m"

When we specify a zero as a numeric code, it tells the terminal to reset foreground, background, and boldness settings to their default values. You’ll want to use this code at the end of your prompt, so that the text that you type in is not colorized. Now, let’s take a look at the color codes. Check out this screenshot:
[Figure: color chart]

For example, green on black uses the codes 32 and 40, so this prompt:

export PS1="\w> "

becomes:

export PS1="\e[32;40m\w> "

So far, so good, but it’s not perfect yet. After bash prints the working directory, we need to set the color back to normal with a “\e[0m” sequence:

export PS1="\e[32;40m\w> \e[0m"

This definition will give you a nice, green prompt, but we still need to add a few finishing touches. We don’t need to include the background color setting of 40, since that sets the background to black which is the default color anyway. Also, the green color is quite dim; we can fix this by adding a “1” color code, which enables brighter, bold text. In addition to this change, we need to surround all non-printing characters with special bash escape sequences, “\[” and “\]”. These sequences will tell bash that the enclosed characters don’t take up any space on the line, which will allow word-wrapping to continue to work properly. Without them, you’ll end up with a nice-looking prompt that will mess up the screen if you happen to type in a command that approaches the extreme right of the terminal. Here’s our final prompt:

export PS1="\[\e[32;1m\]\w> \[\e[0m\]"

Don’t be afraid to use several colors in the same prompt, like so:

export PS1="\[\e[36;1m\]\u@\[\e[32;1m\]\H> \[\e[0m\]"
Bash allows these prompt strings to be customized by inserting a number of backslash-escaped special characters that are decoded as follows:
\a an ASCII bell character (07)
\d the date in "Weekday Month Date" format (e.g., "Tue May 26")
\e an ASCII escape character (033)
\h the hostname up to the first `.'
\H the hostname
\j the number of jobs currently managed by the shell
\l the basename of the shell's terminal device
\n newline
\r carriage return
\s the name of the shell, the basename of $0 (the portion following the final slash)
\t the current time in 24-hour HH:MM:SS format
\T the current time in 12-hour HH:MM:SS format
\@ the current time in 12-hour am/pm format
\u the username of the current user
\v the version of bash (e.g., 2.00)
\V the release of bash, version + patchlevel (e.g., 2.00.0)
\w the current working directory
\W the basename of the current working directory
\! the history number of this command
\# the command number of this command
\$ if the effective UID is 0, a #, otherwise a $
\nnn the character corresponding to the octal number nnn
\\ a backslash
\[ begin a sequence of non-printing characters, which could be used to embed a terminal control sequence into the prompt
\] end a sequence of non-printing characters

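To keep this prompt permanently, put the command export PS1="blabla" in the .bashrc file (located in your home folder).
My extracted script:

#The following sets the prompt to show username and working folder in color
#(assumed definitions of the color variables, following the codes explained above)
boldGreen="\e[32;1m"; green="\e[32m"; default="\e[0m"
export PS1="\[$boldGreen\]\u\[$default$green\] \w $ \[$default\]"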


How to search & replace on multiple files on Linux OS?

google: Linux + search and replace multiple files -> Search and replace over file(s) with Perl [linux] [perl] [replace] [search]

They use the sed command with the in-place option (-i) and an expression (-e, a regular expression) as:
sed -i -e 's/source/destination/g' *.html

More detail on sed?

google:Linux + sed command -> sed – Linux Command – Unix Command

How to rename multiple files on Linux OS?

Source: google: linux rename multiple files -> Howto Linux rename multiple files at a shell prompt

Syntax:

rename oldText newText *.files

For example, to rename all *.bak files to *.txt, enter:

$ rename .bak .txt *.bak

The examples below use Perl expressions, which only the Perl-based version of rename understands (the syntax above belongs to the util-linux version).

Remove all blank spaces with the rename command:
$ rename "s/ *//g" *.mp3

To remove the .jpg file extension, write the command as follows:

$ rename 's/\.jpg$//' *.jpg

To convert all uppercase filenames to lowercase:
$ rename 'y/A-Z/a-z/' *

How to concatenate strings in a shell script?

Source: google: linux shell command + “concatenate string” -> concatenate strings – Linux Forums
There, you can see that we can easily concatenate strings by putting variables/strings next to each other.

How to get the number of lines of a huge file?

Source: my professor Jean Pierre Chevallet -> wc -l

Call the wc command with the -l option: wc -l <file>

How to get memory (RAM) size in Linux OS system by command line?

I needed to know the memory size (RAM size) of a server running Linux. I looked on the web and found this not so easy, since the search keywords must be precise.

cat /proc/meminfo
There you can see the line MemFree: ### kB, which shows the free memory (the MemTotal line shows the total RAM size).

For a brief output, let’s use:

cat /proc/meminfo | grep MemFree
This will show only the MemFree line! What a command!

Next, we can send the free memory size:

  • to a file with cat /proc/meminfo | grep MemFree > outputFile
  • into a Perl variable, as shown in the Perl tips above.

How to start a program, e.g. Kate, in the background?

Source: google: unix + background + start -> Working With the Unix Shell

Just put an & after your command (e.g. kate &) to make ‘kate’ run in the background.

What are the useful programs to use on Linux?

I myself use:

  • Kate as the code/text (e.g. Perl) editor
  • and Krusader as the file browser.
  • Moreover, I use SSH Secure Shell to transfer files between Windows XP and Linux, in both directions.