Wednesday, April 4, 2018

How to set up an AWS Application Load Balancer for Couchbase

Introduction

You might be wondering why we really need a load balancer for Couchbase. That is a reasonable question to ask, and I thought of clarifying this matter before we move further into the topic.
The objective here is NOT to use a load balancer for its noble task, which is distributing application traffic. The load balancer is used here as a tool to help manage the Couchbase connection string. At the end of this article you will understand how to manage the Couchbase connection string in a fully automated fashion.

How client applications work with Couchbase

Let's go through a brief description on how client applications work with Couchbase server.

Any application that connects to Couchbase has a connection string which consists of one or more cluster nodes. Since Couchbase is a distributed database, a typical production deployment of Couchbase has multiple nodes in a cluster. As a result, you need to provide one or more such cluster node IP addresses or host names in the connection string. It is not required to list all the cluster nodes in the connection string; however, it is best practice to specify at least two host names, because if the application cannot connect to one, it will try the next node.

After establishing the connection to Couchbase, the smart client receives a cluster map from the server. In a nutshell, the cluster map has information about the Couchbase nodes, the distribution of vBuckets, and so on. From that point, the smart client knows precisely where to read and write data. This process is called the client bootstrap.

However, over time, the Couchbase cluster may shrink or grow depending on the application load. If that happens, the cluster nodes will change. So what happens to the connection string you created at the initial cluster setup? That information is no longer fully accurate. Of course you could update the connection string manually, but in an automated environment any manual work is undesirable.

To resolve this, and to let the client application know the latest cluster nodes with no human intervention, you can use an AWS Application Load Balancer.

Typical role of application load balancer

AWS Application Load Balancer (ALB) is one of the newer services in the AWS arsenal. A load balancer is typically used as a single point of contact for clients, distributing incoming application traffic across multiple targets, such as EC2 instances. Knowing how applications connect to a Couchbase cluster, we can see that the typical use of a load balancer does not quite fit in this context: the load balancer knows nothing about Couchbase, so it cannot meaningfully distribute the database traffic.

Use of Application Load Balancer in Couchbase

The following diagram illustrates the basic architecture of an ALB used with Couchbase.

Create ALB for Couchbase cluster

It is now common to deploy a Couchbase cluster on an AWS autoscaling group, which provides a lot of benefits and an automation-friendly environment. Let's go through the step-by-step process of creating an ALB for Couchbase.

Assume you already have a Couchbase cluster deployed on an AWS autoscaling group.

1. Create an ALB

I'm not going to repeat the steps from the AWS docs here. Please follow the link given below to start creating the ALB, and I will highlight the things we need to know in the Couchbase context while creating it.


In Basic Configuration there are two schemes you can select: internet-facing and internal.

It's your choice whether to expose the Couchbase admin UI to the internet or keep it internal; select the Scheme accordingly. Let's assume we selected the "internet-facing" scheme.

In Availability Zones, select your VPC and the subnets where you want to place the ALB. For the internet-facing scheme, you need to place your ALB in subnets that have internet access (an Internet Gateway). You need to select more than one subnet for high-availability purposes.
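If you prefer to script this step instead of clicking through the console, the boto3 sketch below creates an internet-facing ALB. The subnet and security group IDs are placeholders you would replace with your own (the security group is created in step 2).

import boto3

elbv2 = boto3.client('elbv2')

# Create an internet-facing ALB in two subnets (placeholders) for high availability
response = elbv2.create_load_balancer(
    Name='couchbase-alb',
    Scheme='internet-facing',
    Type='application',
    Subnets=['subnet-aaaa1111', 'subnet-bbbb2222'],
    SecurityGroups=['sg-0123456789abcdef0']
)
alb_arn = response['LoadBalancers'][0]['LoadBalancerArn']
print(alb_arn)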

2. Create security group

Follow the steps as given on that page. The recommended rules are given in the link below;


Please note, since our ALB is internet-facing, the Inbound Source of the security group should be "0.0.0.0/0".
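The same security group can be created with boto3; here is a minimal sketch. The VPC ID is a placeholder, and port 80 is just an example listener port.

import boto3

ec2 = boto3.client('ec2')

# Create the security group for the ALB (VPC ID is a placeholder)
sg = ec2.create_security_group(
    GroupName='couchbase-alb-sg',
    Description='Allow inbound HTTP to the Couchbase ALB',
    VpcId='vpc-0123456789abcdef0'
)

# Internet-facing ALB: allow inbound traffic from anywhere on the listener port (80 here)
ec2.authorize_security_group_ingress(
    GroupId=sg['GroupId'],
    IpPermissions=[{
        'IpProtocol': 'tcp',
        'FromPort': 80,
        'ToPort': 80,
        'IpRanges': [{'CidrIp': '0.0.0.0/0'}]
    }]
)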

3.  Create target group

In the Configure Routing section, please make sure you change the highlighted settings as shown below;
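If you want to script this part, here is a hedged boto3 sketch. It assumes the target group forwards to the Couchbase admin port (8091 by default); the health check path is also an assumption that you may need to adjust for your Couchbase version.

import boto3

elbv2 = boto3.client('elbv2')

# Target group forwarding to the Couchbase admin port (8091 by default).
# The health check path below is an assumption; use any path that returns 200 on your Couchbase version.
tg = elbv2.create_target_group(
    Name='test-tg-couchb',
    Protocol='HTTP',
    Port=8091,
    VpcId='vpc-0123456789abcdef0',
    HealthCheckProtocol='HTTP',
    HealthCheckPath='/ui/index.html'
)
tg_arn = tg['TargetGroups'][0]['TargetGroupArn']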

4. Register target

Remember that our Couchbase cluster runs on an AWS autoscaling group, so we need to register the autoscaling group as one of the targets for the ALB. In this section you could register individual instances too, but that is not what we need here. So just skip this section for now and click Review.

Review all the settings and, if they are accurate, click Create.

At this point you have your ALB created. However, your target group is still empty because we have not yet registered any EC2 instances to it. We will do that in the next step.

5. Register autoscaling group as the target

Select the autoscaling group which has Couchbase server deployed.

On Details tab, select Edit.
Select the target group you just created above. See the screenshot below;

Then click on Save.
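The same attachment can be scripted; this sketch assumes an autoscaling group name of couchbase-asg and reuses the target group ARN from the previous step (both placeholders).

import boto3

autoscaling = boto3.client('autoscaling')

# Attach the target group to the autoscaling group that runs Couchbase
autoscaling.attach_load_balancer_target_groups(
    AutoScalingGroupName='couchbase-asg',
    TargetGroupARNs=['arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/test-tg-couchb/0123456789abcdef']
)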

6. Verify target group

As soon as you finish step #5, switch over to Target Groups under Load Balancing and select the target group, e.g. test-tg-couchb.

On the Targets tab you can now see that the Couchbase cluster EC2 instances are registered. See the screenshot below.


Since the Couchbase cluster has a single instance (in this example), it shows just one instance. In the same section you can also see that the health check has passed, which means the ALB is able to communicate with the Couchbase server.
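You can also verify the registration programmatically; a small boto3 check (the target group ARN is a placeholder):

import boto3

elbv2 = boto3.client('elbv2')

# List the registered instances and their health state for the target group
health = elbv2.describe_target_health(
    TargetGroupArn='arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/test-tg-couchb/0123456789abcdef'
)
for t in health['TargetHealthDescriptions']:
    print(t['Target']['Id'], t['TargetHealth']['State'])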

7. Access Couchbase server admin UI

Remember that we created an internet-facing ALB, so we can now access the admin UI using the ALB DNS name.


Just copy and paste this URL into the browser to access the Couchbase UI. Make sure the security group you created for the ALB accepts traffic from the internet.

See below that the security group is open to internet traffic.

8. Enable sticky session

If you are having trouble accessing the Web UI, make one more change in the target group to enable sticky sessions. This is especially required when you have multiple nodes in the Couchbase cluster.

Select the desired target group. Under Attributes on the Details tab, click Edit attributes and enable sticky sessions as shown below;
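If you want to script this attribute change, a boto3 sketch would look like the following (the target group ARN is a placeholder and the cookie duration is just an example value):

import boto3

elbv2 = boto3.client('elbv2')

# Enable load-balancer-generated cookie stickiness on the target group
elbv2.modify_target_group_attributes(
    TargetGroupArn='arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/test-tg-couchb/0123456789abcdef',
    Attributes=[
        {'Key': 'stickiness.enabled', 'Value': 'true'},
        {'Key': 'stickiness.type', 'Value': 'lb_cookie'},
        {'Key': 'stickiness.lb_cookie.duration_seconds', 'Value': '86400'}
    ]
)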


Then try the Couchbase Web UI once again.

Setting up Route 53

Earlier we just used the Amazon-generated URL of the load balancer to access the Couchbase Web UI. Most often when we design things we want to do it neatly, and it is always better to have a customized URL for each Couchbase cluster so that it is easier to remember as well. You can achieve this by creating a public or private Route 53 entry and pointing it to the ALB you just created above. You can refer to the link below on how to set up Route 53 for the load balancer.
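If you want to automate this piece as well, here is a minimal boto3 sketch that looks up the ALB's DNS name and canonical hosted zone and then upserts an alias A record. The hosted zone ID, record name, and load balancer name are placeholders.

import boto3

elbv2 = boto3.client('elbv2')
route53 = boto3.client('route53')

# Look up the ALB's DNS name and its canonical hosted zone ID
alb = elbv2.describe_load_balancers(Names=['couchbase-alb'])['LoadBalancers'][0]

# Upsert an alias A record that points the friendly name at the ALB
route53.change_resource_record_sets(
    HostedZoneId='Z0EXAMPLE12345',   # your Route 53 hosted zone (placeholder)
    ChangeBatch={
        'Changes': [{
            'Action': 'UPSERT',
            'ResourceRecordSet': {
                'Name': 'couchbase.couchb2.com',
                'Type': 'A',
                'AliasTarget': {
                    'HostedZoneId': alb['CanonicalHostedZoneId'],
                    'DNSName': alb['DNSName'],
                    'EvaluateTargetHealth': False
                }
            }
        }]
    }
)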

How many load balancers you need?

Well, the load balancer has a cost associated with it, and whatever cloud design you come up with also needs to consider the cost aspect. Having a load balancer for each Couchbase cluster is definitely expensive, but thanks to the new ALB, you can now use one ALB to connect to multiple Couchbase clusters using a path-based or host-based routing strategy. The section below describes the routing strategy and how to configure it.

Routing strategy

In the Amazon web console, select the ALB that you created above and open the Listeners tab of the load balancer's properties. You will see something like the screenshot below;


This means the load balancer we created has just one listener. Notice the Rules section. Click View/edit rules and you will see the Rule Editor, as shown below;


As you can see, we can define rules for the ALB. Even though we did not create any explicit rules, there is a default rule, and it is associated with the target group (test-tg-couchb) we just created.

Here you can add, modify, and delete rules. Let's assume you have another target group pointing to a different Couchbase cluster that runs on another autoscaling group; call that target group test-tg-couchb2. Assume that you also created a Route 53 entry pointing to the same ALB we created above; call that Route 53 entry couchbase.couchb2.com.

By adding a listener rule as shown above, you can tell the listener to divert all incoming calls for couchbase.couchb2.com to the target group test-tg-couchb2. Refer to the screenshot below;


Let's back up a little and try to understand what we just did.

We added a new listener rule to the same ALB. The new rule is a host-based one, and it diverts all incoming traffic for the couchbase.couchb2.com Route 53 DNS name to the new target group called test-tg-couchb2, which essentially points to a different Couchbase cluster running on another autoscaling group. That is amazing, right?
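The same rule can be created with a hedged boto3 call; the listener and target group ARNs below are placeholders, and the priority value is arbitrary as long as it is unique on the listener.

import boto3

elbv2 = boto3.client('elbv2')

# Host-based rule: requests for couchbase.couchb2.com go to the second cluster's target group
elbv2.create_rule(
    ListenerArn='arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/couchbase-alb/abc/def',
    Priority=10,
    Conditions=[{'Field': 'host-header', 'Values': ['couchbase.couchb2.com']}],
    Actions=[{
        'Type': 'forward',
        'TargetGroupArn': 'arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/test-tg-couchb2/0123456789abcdef'
    }]
)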

Refer to the links below for more details about Listener Rules.

Couchbase connection string

Now you have set up a stable Route 53 DNS name for your Couchbase cluster. The next step is to use that DNS name to connect to your Couchbase cluster. Just to be clear, we are not going to use the DNS name directly to bootstrap the client SDK; we have already seen how it can be used to connect to the Couchbase admin UI.

Below is a Python code snippet that you can use to get the latest Couchbase nodes and then build the connection string from those nodes as you normally would. Now you do not have to worry about adding or removing nodes in your Couchbase cluster, because nothing in this code needs to change. Everything will be taken care of by the ALB.

import json
import requests

# Query the Couchbase REST API through the stable DNS name that fronts the ALB
url = 'http://couchbase-couchb2.com/pools/nodes'
u = 'admin'
p = 'Ch@ng3m3!'
response = requests.get(url, auth=(u, p))
nodes = json.loads(response.text)['nodes']

hosts = ""
for n in nodes:
    # Note: the hostname may come back as host:port (e.g. 10.0.0.1:8091); strip the port if your SDK expects bare hosts
    hosts = hosts + ',' + n['hostname']
hosts = hosts.lstrip(',')
print(hosts)

# Use the hosts to build the connection string, e.g. couchbase://host1,host2
connection_string = 'couchbase://' + hosts
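As a usage example, the discovered host list can then be fed to the client SDK as usual. The snippet below continues from the code above and assumes the Python SDK 2.x API and a bucket named "default"; both are assumptions, not part of the original post.

from couchbase.cluster import Cluster, PasswordAuthenticator

# Bootstrap the SDK using the hosts discovered through the ALB
# (strip any :8091 suffixes from 'hosts' first if your SDK expects bare host names)
cluster = Cluster('couchbase://' + hosts)
cluster.authenticate(PasswordAuthenticator(u, p))
bucket = cluster.open_bucket('default')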

Summary

In this article we went through how to use the Amazon Application Load Balancer to manage the connection string of a Couchbase cluster. We also learned how to use a single ALB to connect to multiple Couchbase clusters to reduce cost. While there are many solutions available to handle the connection string, you will find this one fully automated and easy to implement. Hope you learned something new.

However, the Couchbase-recommended method is to use a DNS SRV record, which is described below;


While the DNS SRV method is efficient, it has some management overhead because you need to do extra programming to manage the DNS SRV records as you add or remove nodes in the cluster. I still believe Couchbase has no developer-friendly method to manage the connection string in a cloud environment. In my opinion, this is something Couchbase needs to implement sooner rather than later.

Thursday, November 16, 2017

Couchbase: Bucket (Database) Architecture

Introduction

In Couchbase, a bucket is similar to a database in other DB systems. Applications insert data directly into a bucket, and there is no other schema or objects inside it. A bucket contains documents.

Wednesday, November 8, 2017

Rack awareness feature in Couchbase

In distributed database systems, data is distributed across many nodes. If you consider Cassandra, it is not uncommon to have a cluster (in Cassandra terms, a ring) with 1000 nodes or even more. These nodes are then grouped into different racks, or in cloud terms, different availability zones. The reason is that in the event a whole rack (availability zone) goes down, data remains available because the replica partitions are on separate racks.

When it comes to Couchbase, the so-called rack awareness feature is controlled by using Groups. You can assign Couchbase servers into different Groups to achieve the rack awareness capability. 

If you are provisioning a Couchbase cluster on AWS, you can create the server Groups analogous to the availability zones on AWS. This logical grouping in Couchbase allows administrators to specify that active and replica partitions be created on servers that are part of a separate rack zone. 

See figure-1 below and notice that the Couchbase cluster deployed on AWS has two server groups matching the availability zones.

Figure-1 - Multi-dimensional Couchbase cluster deployed on AWS

This cluster has two nodes for each service offering: data, index, and query. The servers are logically grouped into two groups, rack-1a and rack-1b, which correspond to availability zones 1a and 1b on AWS respectively. As a result, the servers are physically arranged in two racks.

It is recommended to have the same number of servers in each server group. If one server group has an unequal number of servers, the rebalance operation makes a best effort to distribute replica vBuckets evenly across the cluster.

The rack awareness feature is available only in Enterprise Edition of Couchbase. 
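For completeness, server groups can also be created through the REST API rather than the UI. This is just a sketch, assuming the /pools/default/serverGroups endpoint with placeholder host, credentials, and group names matching the figure above.

import requests

# Create two server groups, one per availability zone (names and credentials are examples)
admin = ('admin', 'Ch@ng3m3!')
base = 'http://couchbase-node:8091'

for group in ['rack-1a', 'rack-1b']:
    requests.post(base + '/pools/default/serverGroups', auth=admin, data={'name': group})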

Cheers!

Friday, October 20, 2017

Couchbase - the engagement database system

Today I wrapped up the four-day Couchbase administration course from Couchbase. Since I'm still new to this NoSQL world, there is a ton of new stuff. Couchbase refers to itself as more than just a NoSQL database; they call it an engagement database system.

Couchbase also has a SQL-style query language which they call N1QL. Using N1QL you can query the JSON documents stored in Couchbase buckets.
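For example, a N1QL statement can be sent to the query service (port 8093 by default) over REST. The snippet below is just a sketch with placeholder host, credentials, and the travel-sample bucket assumed to be loaded.

import requests

# Run a simple N1QL query against the query service REST endpoint
resp = requests.post(
    'http://couchbase-node:8093/query/service',
    auth=('admin', 'Ch@ng3m3!'),
    data={'statement': 'SELECT name FROM `travel-sample` WHERE type = "airline" LIMIT 5'}
)
print(resp.json()['results'])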

What is special about Couchbase as opposed to other NoSQL databases like Cassandra and MongoDB is that it provides multi-dimensional scalability: different database components such as the data service, query service, index service, and full-text service can be scaled in/out independently. This concept makes the distributed system even more complex, in my opinion. The main advantage of multi-dimensional scaling is that you can select different hardware resources for each service to best suit its workload, which is a very good thing.

Couchbase provides very good free online courses if you want to get familiar with the database system.


So next few months I'll be working on Couchbase closely with automation on AWS platform. 

Cheers!

Thursday, September 21, 2017

Creating test data for Apache Cassandra cluster

In many cases you are required to generate test data for Cassandra to do various types of testing. Read the rest of the blog post if this is something you need; at the end, you should be able to create some test data very quickly.

I use Ubuntu 16.04.2 LTS and Apache Cassandra 3.0 for this testing. The Cassandra cluster I'm using here is deployed on AWS, but that should not be a factor for test data creation.

Use the steps mentioned below to create the test data.

    1. Download the csv file

curl -O http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv

Note: you can download this file directly onto the EC2 instance, but due to a formatting issue it did not work correctly for me. I downloaded the file to my local machine (Mac) first, opened it with a text editor, and then copied the contents to the EC2 instance.

    2. Create the csv file on EC2

I just used the vi editor to open a new file called realstatesdata.csv and then pasted the file contents (including headers) that I copied in step #1. Save the file. (You should be familiar with the vi editor to perform this step.)

After completing this step, you now have "realstatesdata.csv" on EC2.

    3. Connect to the Cassandra cluster using cqlsh



    4. Create a Keyspace

You need to create a keyspace (in general terms, a database), which is a high-level hierarchical object that contains Cassandra tables. The keyspace name I've chosen is "kp_realstate", but feel free to use any name you want.



    5. Create a table

You also need a table, which is the actual object that contains your data. Unlike MongoDB, you need to have the table schema created before you insert any data. I already analyzed the data set in the csv file we just downloaded and decided on the columns based on that. The table name is "realestate_data"; again, it's your choice.



Note that the order of the columns in the CREATE TABLE statement should be the same as the order in which they appear in the csv file.
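Since the original screenshots for steps 4 and 5 are not reproduced here, the sketch below shows equivalent statements issued through the Python driver. The replication settings, column types, and primary key are assumptions you should adjust for your own cluster and data; only the keyspace name, table name, and column order come from the steps above.

from cassandra.cluster import Cluster

# Connect to one of the cluster nodes (placeholder IP)
session = Cluster(['10.0.0.10']).connect()

# Keyspace: SimpleStrategy with RF=3 is an assumption, not what the original screenshots showed
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS kp_realstate
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")

# Table: column order matches the csv header; types and primary key are assumptions
session.execute("""
    CREATE TABLE IF NOT EXISTS kp_realstate.realestate_data (
        street text, city text, zip text, state text,
        beds int, baths int, sq__ft int, type text,
        sale_date text, price int, latitude text, longitude text,
        PRIMARY KEY (street, sale_date)
    )
""")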

    6.  Load the data

You use the COPY command to load data from a file to Cassandra.

COPY kp_realstate.realestate_data (street,city,zip,state,beds,baths,sq__ft,type,sale_date,price,latitude,longitude) FROM 'realstatesdata.csv' WITH HEADER = TRUE;

It has to be executed at the CQL prompt.

If the import is successful you will see messages like the ones below.

Processed: 985 rows; Rate:    1151 rows/s; Avg. rate:    1865 rows/s
985 rows imported from 1 files in 0.528 seconds (0 skipped).

   7. Make sure the data is imported successfully

At the CQL prompt, you can execute either of the statements below.

select count(*) from kp_realstate.realestate_data;
select * from kp_realstate.realestate_data limit 30;

At the end, you have a full data set for your testing.

Cheers!

Wednesday, September 13, 2017

JEMalloc and Cassandra

Memory management in Cassandra

Cassandra depends on the JVM (Java Virtual Machine) to accomplish its memory management requirements. The JVM memory is mainly divided into two areas as follows;
  1. Heap - data area which contains the runtime structures.
  2. Internal data structures - Java methods, thread stacks and native methods.

Cassandra uses its memory in four ways, as mentioned below. This includes OS memory too.
  1. Java heap
  2. Offheap memory (OS memory that is not managed by JVM G.C-Garbage Collector)
  3. OS page cache
  4. OS TCP/IP stack I/O cache

Since Cassandra uses the JVM for its memory management, tuning the JVM is necessary to get optimal performance from Cassandra. Tuning the JVM includes changing the following settings in cassandra-env.sh (example values are shown after the list);
  • MAX_HEAP_SIZE
  • HEAP_NEWSIZE
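For example, in cassandra-env.sh (the values below are illustrative only and depend on the node's available RAM):

MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="800M"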

What is JEMalloc?

JEMalloc is an enhanced memory allocator for Linux-based platforms. With JEMalloc, memory allocation for multithreaded applications scales well as the number of processors increases. The previously used memory allocator, malloc(3), suffered from scalability bottlenecks for some multithreaded applications, which led to the emergence of JEMalloc.

The use of JEMalloc was introduced in Cassandra after version 2.0.

Ensure JNA (Java Native Access) and JEMalloc are installed on the Linux AMI. If you're creating an Amazon AMI for Cassandra, then you want to install both of these.

yum install -y jna
yum install -y jemalloc

The cassandra.yaml configuration requires the change mentioned below in order to make use of JEMalloc for off-heap memtables.

memtable_allocation_type: offheap_objects

Note. The above setting is set to "heap_buffers" by default. 

What is the benefit of using JEMalloc in Cassandra?

By enabling JEMalloc in Cassandra, you reduce the amount of Java heap space that Cassandra uses. Data written to Cassandra is first stored in memtables in heap memory. Memtables are then flushed to SSTables on disk when they get full. The JVM's garbage collection process is used to clear the heap memory. Sometimes this garbage collection causes issues in Cassandra due to garbage collection pauses.

The benefit of JEMalloc is that it reduces garbage collection pressure, because Cassandra uses off-heap memory allocation with JEMalloc.

References


Wednesday, April 26, 2017

Cloud Spanner - Google's mission critical and massively scale RDBMS with NoSQL features

Isn't this amazing?
A database system which blends the most important relational components and highly scalable NoSQL capabilities in one system. Google calls it Cloud Spanner.

As per Google, it can scale massively to hundreds of data centers around the globe. Most importantly, it speaks SQL, which is the strongest database language so far.

In a nutshell, Cloud Spanner has the following features. 
  • Data consistency
  • Scales horizontally across data centers
  • Speaks SQL 
  • Strong consistency and availability at global scale
  • Massively distributed
  • ACID compliance
  • Automatic sharding and synchronous replication with low latency
  • Schema updates without downtime
  • Auto-scales up and down as needed
  • Simple pricing model  
This proves the point that eventually both relational and non-relational databases will merge!

Cheers!