Use Jump Host for RSYNC

I have a computer A, a storage server B, and the HPCC.

Previously, I used A to SSH into B and then RSYNC files from HPCC to B. One day, B could no longer PING, SSH, RSYNC, or SCP to HPCC. But B could still do all of these to A, and A could still communicate with HPCC. So I am using A as a jump host to help B download files from HPCC.

I got help from https://www.freeture.ch/?p=815.

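The first step is to make sure SSH key authentication works between A and B, and between A and HPCC. With the ProxyCommand below, B logs in to HPCC through the tunnel, so B's key must also be accepted by HPCC.

In the ~/.ssh directory on B, create a file named config with the following content.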

Host remote_server.com
HostName remote_server.com
User remote_server_user_id
ProxyCommand ssh jump_host_user_id@jump_host.com nc remote_server.com 22
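
Once the config above is in place, rsync on B can use the Host name directly and ssh routes the connection through A. The helper below is only a sketch: pull_from_hpcc is a made-up function name, the RSYNC override is added here just to allow a preview, and both paths are placeholders.

```shell
# Run on B. "remote_server.com" must match the Host entry in ~/.ssh/config,
# so that ssh (and therefore rsync) goes through the jump host A.
# pull_from_hpcc is a hypothetical helper name; both arguments are placeholders.
pull_from_hpcc() {
    src_on_hpcc="$1"   # directory on the HPCC
    dest_on_b="$2"     # directory on B
    # -a: archive mode, -v: verbose, -z: compress, --partial: keep partial files
    # Setting RSYNC=echo prints the arguments instead of copying anything.
    "${RSYNC:-rsync}" -avz --partial "remote_server.com:${src_on_hpcc}" "${dest_on_b}"
}
```

A call would look like pull_from_hpcc /path/on/hpcc/ /path/on/B/. If nc is not installed on A, a common substitute for the last config line is ProxyCommand ssh -W %h:%p jump_host_user_id@jump_host.com, which uses OpenSSH's own forwarding.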

Install charmm on HPCC

I have to use the CHARMM force field for PEG in water. To better understand how to use the CHARMM force field and how the charmm program works, I tried to install charmm (the free, non-parallel edition) on my iMac but failed.

Our nice HPCC manager David Chaffin helped me install it on the AHPCC with the following commands.

module purge
module load intel/14.0.3 mkl/14.0.3 impi/5.1.2
export MPI_ROOT=$I_MPI_ROOT
export MPI_F90=mpiifort
./install.com em64t medium M MPIF90

Ten handy python libraries for (aspiring) data scientists

As suggested by the HPCC managers, I started learning Python last year for simple array operations. Now I am a pretty good entry-level Python programmer. With NumPy and SciPy, I can handle most of my work. Here is a post from http://bigdata-madesimple.com/ten-handy-python-libraries-for-aspiring-data-scientists/ that briefly introduces popular Python modules that make programming easier.

Data science has gathered a lot of steam in the past few years, and most companies now acknowledge the integral role data plays in driving business decisions.

Python, along with R, is one of the most handy tools in a data scientist’s arsenal. It’s also one of the simplest computer languages to learn and use, primarily because most concepts can be expressed in fewer lines of code in Python, than in other languages.

Hence, beginners venturing out into the field of data science should definitely familiarise themselves with Python.

Python also offers a slew of active data science libraries and a vibrant community. Below are some of the most commonly used libraries and tools:

NumPy

NumPy is an open source extension module for Python. It provides fast precompiled functions for numerical routines. It’s very easy to work with large multidimensional arrays and matrices using NumPy.

Another advantage of NumPy is that you can apply standard mathematical operations on an entire data set without having to write loops. It is also very easy to export data to external libraries that are written in low-level languages (such as C or C++), and for data to then be imported from these external libraries as NumPy arrays.

Even though NumPy does not provide powerful data analysis functionalities, understanding NumPy arrays and array-oriented computing will help you use other Python data analysis tools more effectively.

SciPy

SciPy is a Python module that provides convenient and fast N-dimensional array manipulation. It provides many user-friendly and efficient numerical routines, such as routines for numerical integration and optimization. SciPy has modules for optimization, linear algebra, integration and other common tasks in data science.

Matplotlib

Matplotlib is a Python module for visualization. Matplotlib allows you to quickly make line graphs, pie charts, histograms and other professional grade figures. Using Matplotlib, you can customise every aspect of a figure. When used within IPython notebook, Matplotlib has interactive features like zooming and panning. It supports different GUI backends on all operating systems, and can also export graphics to common vector and graphic formats like PDF, SVG, JPG, PNG, BMP, GIF, etc.

Scikit-Learn

Scikit-Learn is a Python module for machine learning built on top of SciPy. It provides a set of common machine learning algorithms to users through a consistent interface. Scikit-Learn helps to quickly implement popular algorithms on datasets. Have a look at the list of algorithms available in Scikit-Learn, and you will realise that it includes tools for many standard machine-learning tasks (such as clustering, classification, regression, etc.).

Pandas

Pandas is a Python module that contains high-level data structures and tools designed for fast and easy data analysis operations. Pandas is built on NumPy, which makes it easy to use in NumPy-centric applications, and it provides data structures with labelled axes. Explicit data alignment prevents common errors that result from misaligned data coming in from different sources.

It is also easy to handle missing data using Pandas. Pandas is one of the most widely used tools for data munging.

Theano

Theano is a Python library for numerical computation, and is similar to NumPy. Some libraries such as Pylearn2 use Theano as their core component for mathematical computation. Theano allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.

NLTK

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries. NLTK has been used successfully as a platform for prototyping and building research systems.

Statsmodels

Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics are available for different types of data and each estimator.

PyBrain

PyBrain is an acronym for “Python-Based Reinforcement Learning, Artificial Intelligence, and Neural Network Library”. It is an open source library mainly used for neural networks, reinforcement learning and unsupervised learning.

Neural networks form the basis of this library, making it a powerful tool for real-time analytics.

Gensim

Gensim is a Python library for topic modeling. It is built on NumPy and SciPy.

The figure below summarizes the number of GitHub contributors to the most popular data science libraries.

[Figure: popular data science Python libraries]

These are some of the best libraries I’ve tried or come across. But there are others.

If I’ve missed out any Python data science libraries that you swear by, do let me know what they are by leaving a comment below this blog.

See more at: http://bigdata-madesimple.com/ten-handy-python-libraries-for-aspiring-data-scientists/

Port forwarding does not work in VirtualBox 5.0

Port forwarding used to work normally, but after I shut down the frontend it stopped working. I had run into this before, and that time I reinstalled VirtualBox, which wasted a lot of time.

After testing for a long time, I found a solution at http://superuser.com/questions/323424/unable-to-do-port-forwarding-in-virtual-box.

In my case, the guest IP was 35.0.0.1. In the Port Forwarding Rules, enter it in the Guest IP field and leave Host IP empty. Then it works!
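
The same rule can probably be set from the command line instead of the GUI. This is a sketch under several assumptions: the VM is attached through a plain NAT adapter on adapter 1, the VM is named "frontend" (a hypothetical name), and the ports are 2222 on the host and 22 on the guest. The natpf_rule and add_ssh_forward helpers are made up for illustration.

```shell
# Build a NAT port-forwarding rule in VBoxManage's comma-separated format:
#   name,protocol,host IP,host port,guest IP,guest port
# natpf_rule is a hypothetical helper name.
natpf_rule() {
    printf '%s,tcp,%s,%s,%s,%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Host IP left empty, Guest IP filled in, as described above.
rule=$(natpf_rule ssh "" 2222 35.0.0.1 22)
echo "$rule"

# Apply the rule to adapter 1 of a powered-off VM.
# usage: add_ssh_forward VM_NAME   (hypothetical helper)
add_ssh_forward() {
    VBoxManage modifyvm "$1" --natpf1 "$rule"
}
```

With a VM called frontend, the call would be add_ssh_forward frontend. The empty third field is what "leave Host IP empty" corresponds to in the GUI.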

Install ImageMagick on HPCC.

It took me a long time to understand how to install software on HPCC. Previously, because of limited permissions, I failed to install anything on the university HPCC.

The solution is simple in most cases: set up the install directory inside your own home directory. I have a folder at ~/bin and add it to the PATH with export PATH=~/bin:$PATH. Then I install software there.

When you run ./configure in a source directory, adding --prefix=$HOME/bin is fine in most cases (write $HOME rather than ~, because the shell does not expand ~ after the = in an option). But you should check the configure script to see whether it has other requirements.
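
The prefix idea above can be sketched in shell as follows. PREFIX is just an illustrative variable name. The note about executables landing in $PREFIX/bin reflects the usual autoconf default, not something specific to any one package, so it may be necessary to add that directory to PATH as well.

```shell
# PREFIX is an illustrative name for the user-writable install root.
PREFIX="$HOME/bin"

# In an option argument, a tilde after "=" is passed through literally ...
literal=$(printf '%s' --prefix=~/bin)
# ... while $HOME is expanded by the shell before configure ever sees it.
expanded=$(printf '%s' --prefix="$PREFIX")
echo "$literal"
echo "$expanded"

# With the usual autoconf defaults, "make install" puts executables in
# $PREFIX/bin, so that directory is the one to add to PATH.
export PATH="$PREFIX/bin:$PATH"
```

Comparing the two echoed lines shows why a configure script may complain that the prefix is not an absolute directory name when ~ is used.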

Here I use ImageMagick as an example. I often use it to generate animations with VMD and to combine figures generated by GNUPLOT. (Citation: http://www.imagemagick.org/script/install-source.php)

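# Download and unpack the source (the directory name depends on the version downloaded)
wget http://www.imagemagick.org/download/ImageMagick.tar.gz
tar xvzf ImageMagick.tar.gz
cd ImageMagick-6.9.1-8
# Use $HOME rather than ~ so the shell expands the path inside the option
./configure --prefix=$HOME/bin --exec-prefix=$HOME/bin
make
make install
make check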

All 76 tests passed with no errors : )

Set up my own test HPCC with VirtualBox. Part 2. Install Torque PBS scheduler.

The UARK HPCC uses Torque as its scheduler, and I am trying to mimic this. The tutorials on the Internet for installing Torque are all tricky for me. Rocks Cluster makes this much easier.

I chose Rocks 6.1.1. It has many rolls, each of which installs a corresponding function. The Torque roll can be downloaded from ftp://ftp.uit.no/pub/linux/rocks/torque-roll/6.0.0/torque-6.0.0-1.x86_64.disk1.iso. I put it in /root/ISO/ and then used the following commands to install the roll. (Citation: https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2014-April/065021.html)

cd /export/rocks/install
rocks add roll /root/ISO/torque-6.0.0-1.x86_64.disk1.iso
rocks enable roll torque
rocks create distro
rocks run roll torque | sh
reboot

Then you can use qstat or showq to see if it works.
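
Beyond looking at an empty queue, a tiny test job gives a more convincing check. The sketch below writes a minimal Torque job script and submits it only if qsub is available. The file name hello.pbs and the job name are arbitrary choices. It assumes you run it as a regular user, since Torque is usually configured to reject jobs submitted by root.

```shell
# Write a minimal job script; hello.pbs is an arbitrary file name.
cat > hello.pbs <<'EOF'
#!/bin/bash
#PBS -N hello
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:01:00
#PBS -j oe
cd "$PBS_O_WORKDIR"
hostname
EOF

# Submit the job and list the queue, but only where Torque is installed.
if command -v qsub >/dev/null 2>&1; then
    qsub hello.pbs
    qstat
fi
```

If the job runs, the output file it leaves in the submission directory should contain the name of the compute node that executed it.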

I have two nodes and one frontend. I am planning to put the two nodes into one queue, but I haven’t found any tutorial about this. I will update this post soon.

Set up my own test HPCC with VirtualBox. Part 1. Install a frontend and two nodes.

I am very interested in HPCC. So after using the university HPCC for two years, I am trying to build my own HPCC with VirtualBox. Just for fun : )

This post is about how to build the frontend and two nodes. Since my laptop has four cores and eight threads, it can handle this test. I will build more nodes later when I am familiar with building an HPCC.

I use Rocks Cluster 6.1.1, which is based on CentOS 6.5. CentOS 6.5 is also the system on the university HPCC. Rocks Cluster makes building an HPCC much easier. The ISO file was downloaded from ftp://ftp.rocksclusters.org/pub/rocks/rocks-6.1.1/linux/area51+base+bio+fingerprint+ganglia+hpc+htcondor+java+kernel+kvm+os+perl+python+sge+web-server+zfs-linux-6.1.1.x86_64.disk1.iso.

I mainly followed the instructions from https://www.youtube.com/watch?v=BQil0smjbX8&list=PLKB1wqoi4EhJ2y16O82xozIczztN-Q9I4&index=7. My setup differs from the tutorial in a few ways.

1. I do not have enough memory. I gave the frontend 1GB of memory and 1 thread. Each of the two nodes has 512MB of memory and 1 thread. Because the memory is low, you only see a text interface when you install the system on the nodes. For the frontend, you see a graphical interface.

2. I changed DNS server to the one used in the university rather than 192.168.*.* in the tutorial.

3. The VirtualBox version is 5.0. My laptop runs Ubuntu 15.04.

After the installation, you can see the nodes in Ganglia as the tutorial shows.

Then I set up port forwarding in the frontend's Settings for the NAT network. I chose port 2222 on the host and port 22 on the guest. This is not tricky, but sometimes it didn’t work, so I reinstalled VirtualBox as well as the frontend several times : (

After all of this, I can ssh from anywhere in the world to visit my HPCC by the following command.

ssh -t user@host_ip 'ssh -p 2222 root@127.1'