Tuesday, April 8, 2014

Unknown/unsupported storage engine: InnoDB


140408 16:36:12 [ERROR] Unknown/unsupported storage engine: InnoDB
140408 16:36:12 [ERROR] Aborting

This error usually means InnoDB failed to start, often because the existing InnoDB log files no longer match the configured innodb_log_file_size. Removing the stale log files lets InnoDB recreate them on the next start:

$ sudo rm /var/lib/mysql/ib_logfile*

Restart MySQL and the error should be gone.

Setup MariaDB Spider Engine

1. Install MariaDB (10.0.4 or later; the Spider engine ships with MariaDB from 10.0.4 on).
1) Add the following to /etc/yum.repos.d/MariaDB.repo, or find your repo here:
# MariaDB 10.0 CentOS repository list - created 2014-04-08 20:31 UTC
# http://mariadb.org/mariadb/repositories/
[mariadb]
name = MariaDB
baseurl = http://yum.mariadb.org/10.0/centos6-amd64
gpgkey=https://yum.mariadb.org/RPM-GPG-KEY-MariaDB
gpgcheck=1

2) Install the packages:
$ sudo yum install MariaDB*

2. Activate the Spider engine

$ mysql -uroot -p
mysql> source /usr/share/mysql/install_spider.sql

3. Modify privileges so the database can be accessed remotely. First stop the server:
$ sudo service mysql stop
1) add the following to /etc/my.cnf
[mariadb]                                                                            
port=3306 
2) sudo service mysql start
3) figure out your mysql path; mine, for example, is
/usr/bin/mysql
4) set privileges (log in to MySQL remotely; this is for communication among all servers, so run these commands on each node)
I have 3 nodes, dbnode1, dbnode2, and dbnode3
$ mysql -u root -p                                                                   
Enter password:                                                                      
mysql> use mysql                                                                     
mysql> GRANT ALL ON *.* to root@'dbnode1' IDENTIFIED BY 'your-root-password';  
mysql> GRANT ALL ON *.* to root@'dbnode2' IDENTIFIED BY 'your-root-password'; 
mysql> GRANT ALL ON *.* to root@'dbnode3' IDENTIFIED BY 'your-root-password';  
mysql> FLUSH PRIVILEGES;  
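With more nodes, the GRANT statements can be generated instead of typed by hand. A minimal sketch (node names and the placeholder password follow the example above):

```python
def grant_statements(nodes, password):
    """Build one GRANT per node, plus a final FLUSH PRIVILEGES."""
    stmts = ["GRANT ALL ON *.* TO root@'%s' IDENTIFIED BY '%s';" % (n, password)
             for n in nodes]
    stmts.append('FLUSH PRIVILEGES;')
    return stmts

# Print the statements so they can be pasted into the mysql prompt.
for stmt in grant_statements(['dbnode1', 'dbnode2', 'dbnode3'], 'your-root-password'):
    print(stmt)
```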


4. Set aliases (easy access from other servers/nodes)
alias backend1='/usr/bin/mysql --user=root --password=your-pass --host=dbnode1 --port=3306'
alias backend2='/usr/bin/mysql --user=root --password=your-pass --host=dbnode2 --port=3306'
alias backend3='/usr/bin/mysql --user=root --password=your-pass --host=dbnode3 --port=3306'


5. Create database and table

I'm using dbnode3 as my Spider server; dbnode1 and dbnode2 act as normal backend databases:
1) dbnode1 and dbnode2
CREATE DATABASE backend;                                                             
CREATE TABLE backend.sbtest (                                                        
id int(10) unsigned NOT NULL AUTO_INCREMENT,                                         
k int(10) unsigned NOT NULL DEFAULT '0',                                             
c char(120) NOT NULL DEFAULT '',                                                     
pad char(60) NOT NULL DEFAULT '',                                                    
PRIMARY KEY (id),                                                                    
KEY k (k)                                                                            
) ENGINE=InnoDB; 

2) on dbnode3 (spider server)
CREATE SERVER backend1                                                             
  FOREIGN DATA WRAPPER mysql                                                         
OPTIONS(                                                                             
  HOST 'dbnode1',                                                                    
  DATABASE 'backend',                                                                
  USER 'root',                                                                       
  PASSWORD 'your-pass',                                                               
  PORT 3306                                                                          
);    

CREATE SERVER backend2                                                               
  FOREIGN DATA WRAPPER mysql                                                         
OPTIONS(                                                                             
  HOST 'dbnode2',                                                                    
  DATABASE 'backend',                                                                
  USER 'root',                                                                       
  PASSWORD 'your-pass',                                                               
  PORT 3306                                                                          
);                                                                                   
                                                                                     
CREATE DATABASE IF NOT EXISTS backend;                                               
                                                  
CREATE  TABLE backend.sbtest                                                         
(                                                                                    
  id int(10) unsigned NOT NULL AUTO_INCREMENT,                                       
  k int(10) unsigned NOT NULL DEFAULT '0',                                           
  c char(120) NOT NULL DEFAULT '',                                                   
  pad char(60) NOT NULL DEFAULT '',                                                  
  PRIMARY KEY (id),                                                                  
  KEY k (k)                                                                          
) ENGINE=spider COMMENT='wrapper "mysql", table "sbtest"'                            
 PARTITION BY KEY (id)                                                               
(                                                                                    
 PARTITION pt1 COMMENT = 'srv "backend1"',                                           
 PARTITION pt2 COMMENT = 'srv "backend2"'                                            

) ; 


Now it should work.

Important: check the firewall settings on each node.


If you want to install MariaDB Galera Cluster (5.5 series), take a look at this link:
http://matthewcasperson.blogspot.ca/2013/07/setting-up-galera-cluster-in-centos-6.html

Make sure you install MariaDB-Galera-Server rather than MariaDB-Server.
























Wednesday, April 2, 2014

Amazon Redshift Experience

The purpose is to analyse a big SQL table. For example, the table I am going to query contains 90 million rows, each with 17 columns. The data, spread across 20K+ CSV files, takes 13GB of disk space.

I checked online for Redshift's performance and compared several well-known analytic frameworks. All the comparison results suggest Redshift is my best choice, and performance should improve as the number of cluster nodes increases.

Important:
When loading data from Amazon S3 to Redshift, S3 and Redshift MUST be in the same region, or you will get an error.


Loading data from local to S3
Before uploading to S3, I split my 13GB file into 20K+ files and compressed them with gzip (the reason is you can import gzip files into Redshift directly, which I like!). I wrote a Python script (using boto) to gzip all the files and upload them to S3. Other options:

  • s3cmd
  • boto python library
  • Amazon S3 upload option
  • Amazon S3 CLI
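The gzip-and-upload step can be sketched with boto roughly as below. This is a sketch, not my exact script; the bucket name and key prefix are placeholders, and boto must be installed and configured with AWS credentials:

```python
import gzip
import os
import shutil

def gzip_file(src_path):
    """Compress src_path to src_path + '.gz' and return the new path."""
    dst_path = src_path + '.gz'
    with open(src_path, 'rb') as src, gzip.open(dst_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    return dst_path

def upload_to_s3(paths, bucket_name, prefix):
    """Upload the given files to S3 under prefix/ using boto."""
    import boto  # third-party: pip install boto
    conn = boto.connect_s3()  # reads AWS credentials from the environment
    bucket = conn.get_bucket(bucket_name)
    for path in paths:
        key = bucket.new_key(prefix + '/' + os.path.basename(path))
        key.set_contents_from_filename(path)
```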


Loading data from S3 to Redshift
Install Postgres 8.4 as a client to communicate with the Redshift database.
The following command accesses the database: -h is the Redshift endpoint name, -U is the master user name, -d is the database name, and -p is the port number.
$ psql -h ... -U ... -d ... -p ...

An alternative is SQL Workbench, which did not work on my Ubuntu 12.04.

I wrote another Python script to load data from S3 into Redshift. One thing you need to know: there is no way to provide the password as part of the shell command; for example, it is not possible to pass a password to "psql -h ... -U ... -d ... -p ...".
I worked around this by setting the password in the environment:

$ export PGPASSWORD=yourpassword
$ psql -h ... -U ... -d ... -p ... -c "copy ... delimiter ... gzip;"

Then loop the "COPY" command to transfer the files; psql will not prompt for the password anymore.
-c runs the quoted SQL query from your terminal. If you import a CSV file whose field separator is a comma, you have to provide "delimiter ',' "; if the CSV file is gzip-compressed, you have to add gzip at the end. You also need to provide the access key ID and secret key in your COPY command.
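The COPY loop can be sketched as follows. This is a sketch, not my exact script; the endpoint, database, bucket, and key values are placeholders, and PGPASSWORD is assumed to be exported as above:

```python
import subprocess

def copy_command(table, s3_path, access_key, secret_key):
    """Build the psql invocation that COPYs one gzipped CSV from S3."""
    sql = ("copy %s from '%s' "
           "credentials 'aws_access_key_id=%s;aws_secret_access_key=%s' "
           "delimiter ',' gzip;" % (table, s3_path, access_key, secret_key))
    # Placeholder connection details; substitute your cluster's endpoint.
    return ['psql', '-h', 'my-endpoint', '-U', 'master', '-d', 'mydb',
            '-p', '5439', '-c', sql]

def run_copies(tables):
    """Run one COPY per table; PGPASSWORD must already be in the environment."""
    for table in tables:
        cmd = copy_command(table, 's3://my-bucket/%s.csv.gz' % table,
                           'MY_ACCESS_KEY', 'MY_SECRET_KEY')
        subprocess.check_call(cmd)
```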

Performance:
1. At the beginning, on a single-node cluster (dw2.large), the queries were:
select count(*) from my_table;
select count(distinct columnX) from my_table;
select count(distinct columnY) from my_table where data_date like '2013-%';
They took 5 seconds, 18 seconds, and 1 minute 55 seconds respectively.

2. After resizing to a 2-node cluster (both dw2.large), the 3rd query's time dropped to 46 seconds.

Amazon Redshift provides a UI to monitor the cluster; you can easily check status, performance, query history, and load history.


Friday, March 28, 2014

CitusDB experience


I recently set up a CitusDB cluster (1 master node and 2 worker nodes) for analysing data. Here is some of my own experience with it.

How to build the cluster is covered on CitusDB's official website. Click here


1. The Way To Load Data into Cluster:
From the official documentation: "CitusDB currently requires that you log into one of the worker nodes to stage data, and connect to the master node from the worker node using psql"

For example, here's my master node and worker nodes info:
hostname   port
master 5432
worker1 9701
worker2 9702

They are installed on 3 separate servers. If you want to load data from a worker, for example worker1,
you have to log in from it as
$ /opt/citusdb/3.0/bin/psql -h master -p 5432 -d postgres
and then invoke the \STAGE command. If you have multiple tables to load, you can invoke \STAGE from the shell prompt. Here's my Python script; it loops the \STAGE command over multiple tables:

#!/usr/bin/python
import os

CITUS_DB_EXECUTABLE = '/opt/citusdb/3.0/bin/psql'
HOST = 'master'
PORT = 5432
DATABASE = 'postgres'
TABLE_DEF = '(ColumnA VARCHAR(10), ColumnB INTEGER, ColumnC TEXT)'
SOURCE_FOLDER = '/tmp/'

def create_table(table_name):
    # Create the table on the master node.
    os.system('%s -h %s -p %d -d %s -c "CREATE TABLE %s %s"'
              % (CITUS_DB_EXECUTABLE, HOST, PORT, DATABASE, table_name, TABLE_DEF))
    print 'created table:', table_name

def load_to_table(table_name):
    # \STAGE must be invoked from a worker node; see the note above.
    os.system('%s -h %s -p %d -d %s -c "\STAGE %s FROM %s%s.csv (FORMAT CSV)"'
              % (CITUS_DB_EXECUTABLE, HOST, PORT, DATABASE, table_name,
                 SOURCE_FOLDER, table_name))
    print 'loaded table:', table_name

def main():
    with open('/home/haifzhan/table_names.txt', 'r') as f:
        for line in f:
            table_name = line.strip()
            create_table(table_name)
            load_to_table(table_name)

if __name__ == '__main__':
    main()

After loading, both workers hold the same data. For example, if the table name on the master node is table_2014_03_27, each worker stores it with a serial-number suffix, such as table_2014_03_27_102018. The suffix annoyed me when I tried to delete 300 loaded tables on both workers.
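Deleting those suffixed tables can at least be scripted. A sketch (the suffix list here is hypothetical; in practice you would collect the actual shard names from each worker first):

```python
def drop_statements(base_names, suffixes):
    """Build DROP TABLE statements for every base-name/shard-suffix pair."""
    return ['DROP TABLE IF EXISTS %s_%s;' % (base, suffix)
            for base in base_names
            for suffix in suffixes]

# Feed the output to psql on each worker, one statement per -c invocation.
for stmt in drop_statements(['table_2014_03_27'], ['102018']):
    print(stmt)
```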

You cannot load data by logging in to worker1 or worker2 directly.

2.  Query Performance
I wrote sample queries to test it, and it is as good as I expected. I have a 90-million-row table (simulating what I need for my work), about 13GB of data, 14 columns.
My query is:
select count(*) from my_table where ColumnA = 'columnA' and ColumnB > 0 and ColumnC = 'columnC';

It took only about 2 minutes to return the result; MySQL (single node) gives me the same result in around 15 minutes. I got this performance without optimizing my table at all, which surprised me. BUT, I found that a lot of functionality does not work on CitusDB, like "distinct" and subqueries, which I use often in my work, and that is the reason I gave up on CitusDB.

3. Self-created Databases
I tried to create a new database on the master node, but it doesn't show up on the workers. I contacted CitusDB, and they told me I have to create a database with the same name on each worker node manually; after that, data replicates to the worker nodes, and it really works well.


Finally, CitusDB is not a fit for me; if it made complex queries work, it would definitely be my first choice. Now I am looking into Amazon Redshift...

Friday, March 7, 2014

Comparing two files which have different number of columns


There are two files, file1.txt and file2.txt; file1.txt has two columns and file2.txt has three.

file1.txt
abc,1
cba,2



file2.txt
1,abc,001
2,cba,002 
4,haifzhan,003
To get the common lines of those two files, execute this command:
awk -F',' 'NR==FNR{a[$1, $2]++;next} (a[$2,$1])' file1.txt file2.txt  > comm.txt
The output is written into comm.txt. You can see the output below contains 3 columns; that's because once a common line is found, awk prints the record from the second input file, i.e. file2.txt.
1,abc,001
2,cba,002


The command indexes the 1st and 2nd columns of file1.txt, then checks the 2nd and 1st columns of file2.txt against them for equality. file1.txt and file2.txt do not have to be sorted when executing the above command.
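For readers who prefer Python, the same comparison can be sketched like this, mirroring the awk logic above (build a key set from file1, then filter file2):

```python
def common_lines(file1_lines, file2_lines):
    """Return lines of file2 whose (2nd, 1st) columns match file1's (1st, 2nd)."""
    keys = set()
    for line in file1_lines:
        a, b = line.strip().split(',')
        keys.add((a, b))          # same as awk's a[$1, $2]++
    out = []
    for line in file2_lines:
        parts = line.strip().split(',')
        if (parts[1], parts[0]) in keys:   # same as awk's (a[$2, $1])
            out.append(line.strip())
    return out
```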

Friday, February 28, 2014

Be careful about local time and UTC time when converting Date time to Unix time

In the last post, I pasted my Python script for converting local time to UTC and vice versa.
The reason is that I made a mistake when converting a date-time to Unix time (not local time to UTC or UTC to local time): I treated local time as UTC time and got wrong results.

Here is the comparison of the correct way and the wrong way:
You can see in the Python code below that the input, "2013-08-23 10:00:00" with format '%Y-%m-%d %H:%M:%S', is the same for both converters, but the resulting Unix times are DIFFERENT!


>>> calendar.timegm((datetime.datetime.strptime("2013-08-23 10:00:00", '%Y-%m-%d %H:%M:%S')).utctimetuple())
1377252000
>>> int(str(time.mktime(datetime.datetime.strptime("2013-08-23 10:00:00", '%Y-%m-%d %H:%M:%S').utctimetuple()) )[:-2])
1377277200




Why?
The reason is that calendar.timegm(t) takes a UTC time as input, while time.mktime(t) takes a local time as input! Such simple methods, but it took me way too long to find out what exactly was happening!
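The difference can be checked directly: the two results differ by exactly the machine's UTC offset at that date. A small sketch (the local value and the printed difference depend on your machine's timezone):

```python
import calendar
import datetime
import time

t = datetime.datetime.strptime("2013-08-23 10:00:00", "%Y-%m-%d %H:%M:%S")

as_utc = calendar.timegm(t.utctimetuple())   # interprets t as UTC
as_local = int(time.mktime(t.timetuple()))   # interprets t as local time

# as_local - as_utc is the machine's UTC offset (in seconds) on that date.
print("utc:%d local:%d diff:%d" % (as_utc, as_local, as_local - as_utc))
```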


Here's the python script:

import sys
import datetime
import time
import calendar

#date format 
TIME_FORMAT = '%Y-%m-%d %H:%M:%S'
#hours offset from UTC 
TIME_OFFSET_TO_UTC = 0


#Converts date time to Unix time
def date_to_unix(date_time):
    delta = datetime.timedelta(hours=TIME_OFFSET_TO_UTC)
    return calendar.timegm((datetime.datetime.strptime(date_time, TIME_FORMAT) - delta).utctimetuple())

date_to_unix("2014-02-28 16:24:00")

Python local time and UTC time conversion

This python script is to show how to get Local time and UTC time, and how to convert Local time to UTC time and vice versa.

Two methods deserve extra attention: time.mktime(t) and calendar.timegm(t).
time.mktime(t) takes a local time as input, while calendar.timegm(t)'s t is a UTC time. The excerpts below are from the Python docs.

Therefore, when I convert local time to UTC, I use time.mktime(t), and when I convert UTC to local time, I use calendar.timegm().

There are tons of ways to convert between local time and UTC; the script below is one of many choices.

time.mktime(t): This is the inverse function of localtime(). Its argument is the struct_time or full 9-tuple (since the dst flag is needed; use -1 as the dst flag if it is unknown) which expresses the time in local time, not UTC. It returns a floating point number, for compatibility with time(). If the input value cannot be represented as a valid time, either OverflowError or ValueError will be raised (which depends on whether the invalid value is caught by Python or the underlying C libraries). The earliest date for which it can generate a time is platform-dependent.

calendar.timegm(tuple): An unrelated but handy function that takes a time tuple such as returned by the gmtime() function in the time module, and returns the corresponding Unix timestamp value, assuming an epoch of 1970, and the POSIX encoding. In fact, time.gmtime() and timegm() are each others' inverse.




#!/usr/bin/python
import calendar
import datetime
import time

TIME_FORMAT = '%Y-%m-%d %H:%M:%S'

#Gets local time in given format
def get_current_local_time():
        local = datetime.datetime.now()
        print "Local:", local.strftime(TIME_FORMAT)


#Gets UTC time in given format
def get_current_utc_time():
        utc = datetime.datetime.utcnow()
        print "UTC:", utc.strftime(TIME_FORMAT)


#Converts local time to UTC time
def local_2_utc():
        local = datetime.datetime.now().strftime(TIME_FORMAT)
        print "local_2_utc: before convert:", local
        timestamp = int(time.mktime(datetime.datetime.strptime(local, TIME_FORMAT).timetuple()))
        utc = datetime.datetime.utcfromtimestamp(timestamp)
        print "local_2_utc: after convert:", utc


#Converts UTC time to local time
def utc_2_local():
        utc = datetime.datetime.utcnow().strftime(TIME_FORMAT)
        print "utc_2_local: before convert:", utc
        timestamp =  calendar.timegm((datetime.datetime.strptime( utc, TIME_FORMAT)).timetuple())
        local = datetime.datetime.fromtimestamp(timestamp).strftime(TIME_FORMAT)
        print "utc_2_local: after convert:", local


#Invokes methods
get_current_local_time()
get_current_utc_time()
local_2_utc()
utc_2_local()



Friday, February 21, 2014

Sublime Text --- Visible white-space(indentation) characters

People who use Geany know that you can enable "show white space" to display the indentation spaces; it is really helpful for developers reading source code such as Python.

It looks like this:


This feature can be configured in Sublime Text as well. Choose Preferences, then Settings - Default, and paste in the JSON configuration below:

{
"font_size": 10,
"draw_white_space": "all"
}

Reference: link here
Then the source code in Sublime will look like this:
















Tuesday, January 21, 2014

Scala and Netbeans IDE


1. I am currently using Netbeans 7.3.1.

2. Download Scala from its official website, version scala-2.10.3.tgz, and extract it to /usr/local/scala-2.10.3.

3. Open the netbeans.conf file; it is located at /usr/local/netbeans-7.3.1/etc/netbeans.conf on my Fedora.

4. vim netbeans.conf and search for the keyword "netbeans_default_options"; after scala.home is added, the line looks like this:

netbeans_default_options="--laf Nimbus -J-Dscala.home=/usr/local/scala-2.10.3  -J-client -J-Xss2m -J-Xms32m -J-XX:PermSize=32m -J-Dapple.laf.useScreenMenuBar=true -J-Dapple.awt.graphics.UseQuartz=true -J-Dsun.java2d.noddraw=true -J-Dsun.java2d.dpiaware=true -J-Dsun.zip.disableMemoryMapping=true"

The -J-Dscala.home=/usr/local/scala-2.10.3 option holds the absolute path to Scala.

5. Restart Netbeans and run your Scala HelloWorld project; it should work!