NLP Grid Tutorial
What is the NLP Grid?
The NLP Grid is a computing cluster shared among NLP researchers at Penn. It allows for large-scale parallel processing that would be impossible on your own machine. In total, there are 11 machines, each with 64 cores and 500GB of RAM.
The purpose of this page is to (1) provide a new user with instructions to start using the NLP grid, (2) collect a cheatsheet of useful arguments to the cluster, and (3) provide some troubleshooting tips for common problems.
How to Access the NLP Grid
To get access to the NLP Grid, email research@seas.upenn.edu and cc the NLP faculty member who is sponsoring your request. Then, you can ssh into nlpgrid.seas.upenn.edu with your account information.
Where to Store Your Data
The home directory on the NLP Grid machines is the same one that you access by logging into eniac, biglab, or a physical computer in a lab. Unfortunately, this means that the home directory is pretty restricted in terms of how much space you can use.
The NLP Grid has extra disks that are significantly larger than the SEAS home directories, but they are only accessible on the NLP Grid machines.
The recommended place to store your data is /nlp/data/<username>. You can create your own directory there and put your private code and data in it.
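For example, a minimal sketch of setting one up (this assumes /nlp/data lets users create their own subdirectories and that $USER matches your grid username):
mkdir /nlp/data/$USER
chmod 700 /nlp/data/$USER   # optional: keep the directory private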
If you have data that you want to share with everyone on the NLP Grid, it should be saved under /nlp/data/corpora. For example, a lot of the LDC data is stored under /nlp/data/corpora/LDC.
The Sun Grid Engine
The NLP Grid uses the Sun Grid Engine (SGE) to schedule jobs and manage shared resources. Jobs on the NLP Grid take the form of a bash script. The bash script is submitted to the cluster, which puts it into the queue and runs the job when the required resources are free.
The job bash scripts are identical to regular bash scripts except that they also include extra arguments which are passed to the cluster. Arguments such as the name of the job, where the stdout should be written, how much memory the job requires, etc., can all be included within the bash file on lines that start with #$.
For example, in the following hello.sh script
#$ -N hello-world
#$ -o hello.stdout
#$ -e hello.stderr
#$ -cwd
echo "Hello, World"
the -N specifies the name of the job, -o and -e specify where the standard out and standard error should be saved, and -cwd marks that the job should be run from the current working directory.
Jobs are submitted to the NLP Grid with the qsub command.
qsub hello.sh
Any arguments you provide after the script name will be passed to the script as if you ran it with sh.
The scripts can read or write to files, run python or Java code, execute bash commands, etc. Anything you can do in a normal bash script can also be done here.
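For instance, here is a minimal sketch of a job script that reads its first command-line argument; the script name process.sh and the input file are only placeholders:
#$ -N process-file
#$ -cwd
# $1 is the first argument given after the script name when submitting
echo "Processing $1"
which could be submitted with
qsub process.sh input-file.txt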
After the script has been submitted, you should be able to see the job listed in the queue with the qstat command. The output from qstat shows information about all of the jobs currently queued or running on the grid. The state column marks whether the job is running (r), queued (qw), or has failed with an error (Eqw).
Once the resources your job requires are free, the cluster will assign it to a worker node which will execute the job. Anything written to the standard out and standard error will be written to the files you specified (if you did not specify these, the default is in your home directory).
Each job is allocated 1 core and about 8GB of RAM. If you want to consume more cores or memory, see the arguments in the next section about how to request more resources. Although I don't think the resource limits are enforced in practice, it is polite to the other users to only use the resources you request so your jobs do not interfere with theirs.
If you want to cancel a job, either because it is in an error state or you realized it won't do what you intended, you can do so with the qdel command
qdel <job-id>
where the job ID can be found using the qstat command.
If you are doing very basic processing, then that’s about all you need to know. But, if you need to submit a large number of jobs, use multiple cores, or consume a lot of memory, then you may need to read more details about the available cluster arguments in the next section.
Job Arguments
This section provides more details on the (more advanced) cluster arguments which I have found to be useful. All of these arguments can be passed to the qsub command or written at the top of the bash script. That is, the same hello.sh example from above could also have been submitted with the command
qsub -N hello-world -o hello.stdout -e hello.stderr -cwd hello.sh
where all of the SGE arguments have been deleted from hello.sh.
This is particularly useful when you have to submit a lot of jobs all at once using a second bash script and you want to dynamically change the parameters of each job without editing the job script itself, as in the sketch below.
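For example, a sketch of such a submission loop over input files (the directory layout, job names, and run.sh script are hypothetical):
for f in data/*.txt; do
    name=$(basename "$f" .txt)
    # one job per input file, each with its own name and log files
    # (the logs/ directory must already exist; see the -o/-e note below)
    qsub -N "process-$name" -o "logs/$name.stdout" -e "logs/$name.stderr" -cwd run.sh "$f"
done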
The useful arguments are as follows (a combined example script follows the list):
- Set the name of the job, which will appear in the qstat output.
-N <name>
- Set the files where the standard out and standard error should be written. The directory where the files will be written must already exist, otherwise the job will fail with an unhelpful error. The paths can be relative or absolute.
-o path/to/stdout -e path/to/stderr
- Set the directory that should be used as the working directory for the script. -cwd marks that the current working directory should be the one that the qsub command was executed from. -wd specifies a particular directory.
-cwd
-wd path/to/working/directory
- Send an email to a particular address when the job finishes.
-M <email-address>
- Request a specific number of slots (e.g. 4 in this example). Each slot is equivalent to 1 core. Since each machine has 64 cores, the maximum number of slots you can request is 64. The default is 1.
-pe parallel-onenode 4
- Set environment variables for the job. To set multiple environment variables, I believe you repeat the -v flag multiple times, although this has not been tested yet.
-v OMP_NUM_THREADS=1
- Request a specific amount of memory for the job. There are two ways to do this, either by requesting an amount of memory per slot (mem) or for the entire job (h_vmem). I recommend using h_vmem so it is more transparent how much memory you are using. Each machine has 500GB of memory, so the maximum you can request is 500GB (in practice, I generally request at most 490GB). Note that multiple -l arguments are allowed for a single command.
-l mem=8G
-l h_vmem=24G
- Request that a job runs on a specific node or nodes.
-l h=nlpgrid12
-l h=(nlpgrid12|nlpgrid13)
- Request that a job does not run on a specific node or nodes.
-l h=!nlpgrid12
-l h=!(nlpgrid12|nlpgrid13)
- Force a job to start running only after another completes (job IDs are comma-separated).
-hold_jid <job-ids>
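Putting a few of these together, here is a sketch of a job script that requests 4 slots and 24GB of memory and pins the thread count (the job name, log paths, and train.py command are placeholders):
#$ -N big-job
#$ -o logs/big-job.stdout
#$ -e logs/big-job.stderr
#$ -cwd
#$ -pe parallel-onenode 4
#$ -l h_vmem=24G
#$ -v OMP_NUM_THREADS=4
python train.py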
Qlogin Sessions
In addition to submitting jobs, another way many users interact with the NLP Grid is through qlogin sessions. When you first ssh onto the NLP Grid, you log in to the head node. The head node is a machine that has very little memory and processing power. The idea is that you should only use this machine for lightweight tasks, such as submitting jobs. However, sometimes you want to develop code without having to continually submit jobs. That is where qlogin comes in.
The qlogin command will start a session for you by logging you into one of the larger worker nodes. There, you have access to a full machine with no artificial limits on the number of cores or the amount of memory used. You can use this session to run more computationally intensive tasks that you can't run on the head node.
The number of qlogin sessions is limited per machine, so it is important to close your session by running exit when you are done. Otherwise, the session will hang around and other users won't be able to start their own sessions. You can see whether you have any active sessions using qstat. If you lost the connection or closed the terminal without exiting, you can still terminate a qlogin session using qdel and the job ID from qstat.
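For example, cleaning up a stale session might look like the following (the job ID here is made up):
qstat            # find your interactive session in the list and note its job ID
qdel 123456      # replace 123456 with that job ID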
By default, qlogin will try to log you in to nlpgrid10 and will fail if there are not enough sessions available. In these cases, you can email someone who has a lot of active sessions, or some very old sessions that they likely forgot to terminate, and ask them to delete those sessions. Otherwise, you can log in to a specific worker node using the following
qlogin -l h=nlpgrid14 -now n
Cluster Commands
This section details some SGE commands that are useful for cluster-level tasks.
- Check the status of the queue
qstat
- Get detailed information about a specific job
qstat -j <job-id>
- Check the status of the nodes
qhost
- Delete a job
qdel <job-id>
qdel -u <username>  # Deletes all of the jobs for a specific user
Useful Scripts
I have written a couple of scripts to make some common tasks on the NLP Grid a little easier.
The scripts are stored in this git repository.
I have these scripts on my path on the NLP Grid.
The two that I think are worth pointing out are qlogin-node and qsub-script.
qlogin-node is a shortcut for logging into a specific node. It accepts 1 argument, which is the name of the node to log in to, such as nlpgrid12. I always forget the specific arguments for logging into a worker node, and this script just saves me from having to keep looking them up.
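For example, assuming the script is on your path, logging into a specific node would look like
qlogin-node nlpgrid12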
qsub-script is a python script that I made after I got tired of creating bash scripts to run one-off jobs. You can pass a command to qsub-script and it will create a temporary bash file that runs that command and submit it to the cluster. You can specify some common job arguments before the command, such as -N or -o and -e.
For example, if you want to quickly submit a job that runs a python script, you can run
qsub-script -N my_job -o stdout.txt -e stderr.txt \
"python example.py input-file.txt"
Troubleshooting
Errors starting the JVM
If you are on the head node (e.g. the command-line prompt will say @nlpgrid and not @nlpgrid10), then you cannot start any JVM instance because doing so requires more resources than you are allowed to use on the head node.
If you are running a lot of jobs that use Java, then you may observe that the JVM fails to start for some number of jobs. For some reason, there is a limit on the number of JVM instances that can be running on any individual machine. This number seems to be around 15. The workaround I’ve found is to artificially request more slots than your job actually needs to prevent the cluster from scheduling too many jobs on any individual machine. Requesting around 4 slots per job seemed to work out well for me.
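For example, following the note above, a Java job could be padded out to 4 slots like this (the script name is a placeholder):
qsub -pe parallel-onenode 4 my-java-job.sh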
Processes consuming too many cores
Some common Python libraries, such as spacy and PyTorch, try to consume as many cores as they can access to speed up processing.
The consequence of this is that your process may try to consume all 64 cores when you think it should only be using one.
There is no easy way to diagnose this problem other than to qlogin to the node where your job is running and check whether htop shows your process using more cores than expected.
To fix this, set the environment variable OMP_NUM_THREADS=1 for the job to limit the number of cores that are used. In my experience, this does not make the job run any slower; in fact, I have seen jobs run faster with just 1 core.
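For example, you can either pass the variable with the -v flag described above or export it inside the job script; the script and command names here are placeholders:
qsub -v OMP_NUM_THREADS=1 my-job.sh
or, inside the script,
#$ -N my-job
#$ -cwd
export OMP_NUM_THREADS=1   # limit libraries like spaCy and PyTorch to a single core
python example.py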
Environment variable TERM not set
If you qlogin to a worker node and some common bash commands (e.g. clear, top) report an error that the TERM environment variable is not set, this is easily fixed by adding export TERM="xterm-256color" to your ~/.bashrc file.