Friday, March 28, 2014

CitusDB experience


Recently, I set up a CitusDB cluster (1 master node and 2 worker nodes) for analyzing data. Here is some of my own experience with it.

You can find instructions for building the cluster on CitusDB's official website.


1. The Way To Load Data into Cluster:
From the official documentation: "CitusDB currently requires that you log into one of the worker nodes to stage data, and connect to the master node from the worker node using psql."

For example, here's my master node and worker nodes info:
hostname  port
master    5432
worker1   9701
worker2   9702

They are installed on 3 separate servers. If you want to load data from one of the servers, for example worker1, you have to log in to that worker and connect to the master:
$ /opt/citusdb/3.0/bin/psql -h master -p 5432 -d postgres
and then invoke the \STAGE command. If you have multiple tables to load, you can invoke the \STAGE command from the shell prompt instead. Here's my Python script; it loops the stage command over multiple tables:

#!/usr/bin/python
import os

CITUS_DB_EXECUTABLE = '/opt/citusdb/3.0/bin/psql'
HOST = 'master'
PORT = 5432
DATABASE = 'postgres'
TABLE_DEF = '(ColumnA VARCHAR(10), ColumnB INTEGER, ColumnC TEXT)'
SOURCE_FOLDER = '/tmp/'

def run_psql(command):
    # Run a single command through psql against the master node.
    os.system(CITUS_DB_EXECUTABLE + ' -h ' + HOST + ' -p ' + str(PORT) +
              ' -d ' + DATABASE + ' -c "' + command + '"')

def create_table(table_name):
    run_psql('CREATE TABLE ' + table_name + ' ' + TABLE_DEF)
    print 'created table:', table_name

def load_to_table(table_name):
    # \STAGE loads the CSV file named after the table from SOURCE_FOLDER.
    run_psql('\STAGE ' + table_name + ' FROM ' + SOURCE_FOLDER +
             table_name + '.csv (FORMAT CSV)')
    print 'loaded table:', table_name

def main():
    # One table name per line in the input file.
    with open('/home/haifzhan/table_names.txt', 'r') as f:
        for line in f:
            table_name = line.strip()
            create_table(table_name)
            load_to_table(table_name)

if __name__ == '__main__':
    main()

After loading data into the database, both workers hold the same data. For example, if the table name on the master node is table_2014_03_27, each worker stores it with a shard ID appended as a suffix, such as table_2014_03_27_102018. The suffix annoyed me when I tried to delete 300 loaded tables on both workers.
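Deleting the suffixed shard tables can at least be scripted. Here is a minimal sketch; the shard names below are made up, and in practice you would first list the real ones from the worker's catalog (e.g. matching relname LIKE 'table_2014_03_27_%') before generating the DROP statements:

```python
# Hypothetical helper: turn a list of discovered shard table names into
# DROP TABLE statements that can be fed to psql -c on each worker.
def drop_statements(shard_names):
    # IF EXISTS keeps the cleanup re-runnable if a shard is already gone.
    return ['DROP TABLE IF EXISTS %s;' % name for name in shard_names]

for stmt in drop_statements(['table_2014_03_27_102018',
                             'table_2014_03_27_102019']):
    print(stmt)
```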

You cannot load data by logging into worker1 or worker2 directly.

2.  Query Performance
I wrote sample queries to test it, and performance is as good as I expected. I have a 90-million-row table (simulating what I need for my work), about 13GB of data with 14 columns.
My query is:
select count(*) from my_table where ColumnA = 'columnA' and ColumnB > 0 and ColumnC = 'columnC';

It took only about 2 minutes to return the result; MySQL (single node) gave me the same result in around 15 minutes. I got this performance without any table tuning, which was a pleasant surprise. BUT I found that a lot of functionality does not work on CitusDB, such as DISTINCT and subqueries, which I use often in my work, and that is the reason I gave up on CitusDB.

3. Self created Database
I tried to create a new database on the master node, but it doesn't show up on the workers. I contacted CitusDB, and they told me I have to create a database with the same name on each worker node manually; after that, CitusDB can replicate data to the worker nodes, and it really works well.
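Creating the database once per node is easy to script. A sketch, assuming the hostnames and ports from section 1, and assuming CitusDB's bundled createdb sits next to psql (I have not verified that path):

```python
# Build one createdb invocation per node so the database exists
# everywhere; hosts/ports match the cluster layout from section 1.
NODES = [('master', 5432), ('worker1', 9701), ('worker2', 9702)]
CREATEDB = '/opt/citusdb/3.0/bin/createdb'  # assumed location

def createdb_commands(dbname, nodes=NODES):
    return ['%s -h %s -p %d %s' % (CREATEDB, host, port, dbname)
            for host, port in nodes]

for cmd in createdb_commands('analytics'):
    print(cmd)  # run each with os.system, or paste into a shell
```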


Finally, CitusDB is not a fit for me; if it could make complex queries work, it would definitely be my first choice. Now I am looking into Amazon Redshift...

Friday, March 7, 2014

Comparing two files which have different number of columns


There are two files, file1.txt and file2.txt, where file1.txt has two columns and file2.txt has three columns.

file1.txt
abc,1
cba,2



file2.txt
1,abc,001
2,cba,002 
4,haifzhan,003
To get the common lines of those two files, execute this command:
awk -F',' 'NR==FNR{a[$1, $2]++;next} (a[$2,$1])' file1.txt file2.txt  > comm.txt
The output is written into comm.txt. You can see the output below contains 3 columns; that's because once a common line is found, awk prints the line from the second input file, file2.txt.
1,abc,001
2,cba,002


It takes the 1st and 2nd columns of file1.txt and the 2nd and 1st columns of file2.txt, and checks those fields for equality. file1.txt and file2.txt do not have to be sorted when executing the above command.
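The same join logic can be sketched in Python, which makes the key construction explicit: remember the (column 1, column 2) pairs from file1.txt, then keep each line of file2.txt whose (column 2, column 1) pair was seen.

```python
def common_lines(file1_lines, file2_lines):
    # First pass (awk's NR==FNR block): remember ($1, $2) from file1.
    seen = set()
    for line in file1_lines:
        fields = line.strip().split(',')
        seen.add((fields[0], fields[1]))
    # Second pass: keep file2 lines whose ($2, $1) pair was seen.
    result = []
    for line in file2_lines:
        fields = line.strip().split(',')
        if (fields[1], fields[0]) in seen:
            result.append(line.strip())
    return result

f1 = ['abc,1', 'cba,2']
f2 = ['1,abc,001', '2,cba,002', '4,haifzhan,003']
for line in common_lines(f1, f2):
    print(line)  # prints 1,abc,001 then 2,cba,002
```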