Tuesday, April 9, 2019

Mongodb 3.4.5 Standalone Community to Enterprise 



Recently, I was given the task of upgrading a Mongodb 3.4.5 standalone community edition to Mongodb 3.4.5 standalone enterprise edition. As I started working on it, I realized that there was very little documentation for versions below 4.0. Here are the steps, issues and workarounds, that I encountered during the upgrade:

  • First shutdown the existing Mongodb Community server using the following command and make sure that there are no other mongo services running: 
service mongod stop
 ps -ef | grep "mongo" 
  • Under /etc/yum.repos.d/ directory, create a mongodb-enterprise.repo file
  • Add the following to the mongodb-enterprise.repo file :
[mongodb-enterprise]
name=MongoDB Enterprise Repository
baseurl=https://repo.mongodb.com/yum/redhat/$releasever/mongodb-enterprise/3.4/$basearch/gpgcheck=1enabled=1
gpgkey=https://www.mongodb.org/static/pgp/server-3.4.asc
  • Make sure that there is no other repo file in this directory other than the mongodb-enterprise.repo file. 
  • Next, install Mongodb 3.4.5 enterprise by issuing the following command
sudo yum install -y mongodb-enterprise-3.4.5 mongodb-enterprise-server-3.4.5 mongodb-enterprise-shell-3.4.5 mongodb-enterprise-mongos-3.4.5 mongodb-enterprise-tools-3.4.5
  • While installing, you may encounter package conflicting errors. We had to remove the following conflicting packages:
sudo yum remove mongo-10gen-server mongo-10gen-tools mongo-10gen-shell mongo-10gen-mongos mongo-10gen
  • If there is no other conflict, Mongodb enterprise should be installed by now.
  • Next, start Mongodb Enterprise Server with the following command, if using a config file.
/usr/bin/mongod -f /etc/mongod.conf 
  • Sometimes, you will get an error when starting the Mongodb server. Check the mongo logs and address the issues accordingly. We had an issue with the number of open files at the OS level, and increasing the number of open files helped. Mongodb recommended number for open files is 64000 and it can be increased with the following command
 ulimit -n 64000 
   



Thursday, September 3, 2015

Comparison of Salaries between School Districts in California

If you are a California state employee, have you ever wondered what your counterparts in other counties and cities within California make? Let us look at the data provided by transparentcalifornia.com.

Transparentcalifornia  provides salary and pension information for all the state employees. I am going to download this data into an Oracle database first and then to 'R' to analyze and generate some interesting graphs.

Let us first look at the school district data. We have created a table "SCHOOL_DISTRICT" with the following columns as shown below. We will import the school district data into the school_district table and run some queries on that.


SCHOOL_DISTRICT


JobTitle

BasePay

OverTimePay

OtherPay

Benefits

TotalPay

TotalPay_Benefits

School_year

Agency
TEACHER 86817.53 0 0 26158.91 86817.53 112976.44 2014 ABC Unified
TEACHER 89501.99 0 5761 16385.89 95262.99 111648.88 2014 ABC Unified

Let us see which position earns the top salary in each school district.

There are 537 distinct school districts in California,so I will pick a few school districts and run the query

SELECT sal.jobtitle,sal.agency,sal.totalpay_benefits FROM (
                SELECT jobtitle, to_char(totalpay_benefits, 'fm9999999.90')totalpay_benefits, agency,
                RANK() OVER (PARTITION BY agency ORDER BY totalpay_benefits desc)"Rank"
                FROM school_district
                WHERE agency IN('Antioch Unified','Bakersfield City School District',
                'TurlockUnified', 'San Francisco Unified', 'Los Angeles Unified', 'Mountain View 
                Whisman' ))sal       
WHERE sal."Rank" < 2
ORDER BY sal.totalpay_benefits DESC

Comparison of Salaries among different School Districts
 

Let us look at the top ten school employees in California who get paid the highest salary.

SELECT jobtitle, to_char(basepay,'fm9999999.90')basepay, overtimepay, to_char(totalpay_benefits,'fm9999999.90')totalpay_benefits, agency FROM
              (SELECT jobtitle,basepay,overtimepay,totalpay_benefits,agency FROM school_district
                ORDER BY totalpay_benefits DESC)
WHERE rownum < 11
Top Ten Salaries of School Employees in CA
Let us also look at the teachers who are in the top ten in terms of salary in California.

SELECT jobtitle,to_char(basepay,'fm9999999.90')basepay, overtimepay, otherpay, benefits, to_char(totalpay_benefits,'fm9999999.90')totalpay_benefits, agency FROM
    (SELECT jobtitle, basepay, overtimepay, totalpay_benefits, agency, otherpay, benefits FROM  school_district
        WHERE jobtitle like 'Teacher%'
            ORDER BY totalpay_benefits DESC)
WHERE rownum < 11

Top Ten Salaries of Teachers in CA

Wow, teachers in Orange County seem to have excellent benefits. Their basepay is also very good. If you are looking for a teaching position, you know which school district you want to be in.

I think by now, you must have gotten an idea of how to find the top ten values in a given data set. You can download this data from transparentcalifornia.com and try these queries and much more on your own.

We will load this data in 'R' and generate some cool graphs in my next post.

Tuesday, August 18, 2015

Find nth highest value using SQL Queries

Let us see how to calculate the first, second and the nth highest value.

Our employee table is as shown below.

Employee
Employee ID Salary
1 4000
2 2500
3 4500
4 1800

To calculate the highest salary and the second highest

SELECT sal.employeeid, sal.salary FROM
   (select employeeid, salary from employee order by salary desc) sal
WHERE rownum <  3

EmployeeID        Salary
3                          4500
1                          4000

To calculate 2nd highest salary

SELECT max(salary)"Second Max Salary" FROM employee
where salary not in (select max(salary) from employee)

Second Max Salary
4000

To calculate nth highest salary(Using Rank() function) 

Select EmployeeID, Salary, RANK() OVER (ORDER BY salary desc) "RANK" from employee

EmployeeID      Salary          Rank
 3                       4500                 1
1                        4000                 2
2                        2500                 3
4                        1800                 4

In the next blog, let us apply these queries to a real data set and analyze the results.

Sunday, August 9, 2015

Who works the hardest?

Recently, one of the Republican candidates made a statement about Americans - "People need to work longer hours to achieve 4% growth in our nation's economy". Really? I was curious to see how many hours did an average American work compared to his counterparts.  I downloaded data provided by International Labor Organization. Data is available in a csv format. I have loaded this data into 'R'.  Let us analyze this data set and see where we stand.

Let us see if there are any countries working more than 40 hrs a week.

Countries working more than 40 hrs/week

In Korean Republic (South Korea), people work more than 50hrs a week . Asian countries like Malaysia,Thailand,Hong Kong China and Bangladesh all work close to 45 hrs per week .
 
Let us see which countries are working between 35 and 40 hrs per week.



Countries working between 35 and 40 hrs/week


Our country of interest, United States, works 35 hrs per week which is in par with several other developed countries. Not sure what our candidate is taking about.

Japan's median working hours is only 36.5 hrs per week. I thought they worked harder. Something is not right. Well to verify, let us just extract Japan's data over the years and plot it instead of just taking the median value.

That's interesting. Japanese have started working less number of hours since 1990. In fact, they work less than the United States.

Now, I am very curious to know which countries work least. Let us plot the data for working hrs between 25 and 30.

Countries with least number of working hours/week
 Friends, maybe it is time to move to one of these these countries. What do you say?



Tuesday, June 9, 2015

Alcohol consumption and Liver Disease

There have been lot of studies on high alcohol consumption and liver disease

Out of curiosity, I wanted to find out which country consumed the maximum amount of alcohol? Does consumption of alcohol in large quantities have any effect on the liver? I found data on Alcohol consumption per country provided by WHO. I found another data set on liver cancer per country again provided by WHO. We are going to analyze these data sets separately and see if there is any co-relation between alcohol consumption and liver cancer. Accordingly, countries with high alcohol consumption should have high cases of liver disease too as consuming alcohol in large quantities can result in liver damage. Let us start the analysis and see if the data supports this study.

The first data set comprises of Alcohol consumption per adult in litres for the years 2005 and 2008. There are 189 countries listed in this dataset . We loaded the dataset into a dataframe in R and used the melt function in package reshape2 to convert data into a long format.


I took the mean of the values for 2005 and 2008 grouped by country. Dplyr package was used to summarize the data. Let us start plotting the data.

Let us plot the lowest alcohol consuming countries(0-1 lt/person) first. Do you see anything common between these countries? 

Lowest Alcohol Cosumption

Now, let us look at highest alcohol consuming countries in the world( 15 lts/person and above)

Highest Alcohol Consumption

Maldova is the highest consumer of alcohol at 19.8 lts/person. Next comes Slovenia and South Korea.

Let us see how some of the first world countries are doing.

Alcohol Consumption amongst First World Countries

United States with less than 10lts/person is the lowest consumer in the group. That's unbelievable. I always thought that United States would top the list.

Let us analyze the liver cancer data now.

Countries with highest liver cancer cases in men

Mongolia seems to have the highest number of male liver cancer patients in the world. Next is Mozambique followed by North Korea and South Korea.

Now let us look at the female liver cancer data

Countries with highest liver cancer cases in women


Mongolia and Mozambique have higher numbers compared to the rest for women. But the numbers are half of what is for men. Wonder why? Is it because women in some cultures don't drink at all or drink lesser than men?

Now let us take a look at the liver cancer rates for countries with high alcohol consumption.



         
The co-relation between the two data sets do not exist. Look at the Liver cancer(men) data for countries with high alcohol consumption. Maldova has a liver cancer rate of 12/100,000 for men and 4/100,000 for women compared to Mongolia with a liver cancer rate of 98/100,00 for men and 46/100,000 for women. We were expecting Maldova to have the highest liver disease cases but Mongolia won the race. In this case, the data is not supporting the study.


 

Friday, June 5, 2015

Are South Asian Women Literate enough?

Literacy rates are rising, but women continue to lag behind. I was curious to find out how women in certain parts of the world are doing compared to their male counterparts. I started to look at data provided by UNESCO on females and males ages 15 and above who can, with understanding, read and write a short, simple statement on their everyday life. 

The data is available separately for the genders for all the countries. I combined them together and extracted only the countries I was interested in so that I could compare and contrast.

Raw data from UNESCO can be downloaded in CSV format. To massage the data and produce the graphs, I used ggplot2, dplyr, reshape2 and gridExtra packages available in R. 

Plots are as shown below. I have used grid.arrange to put multiple plots on a display with ggplot2

grid.arrange(pakistan,SriLanka,india,bangladesh,nepal,madives,ncol=3)

To save the plots I used ggsave

ggsave(file="South_Asia.jpg")




Looking at the data, it is obvious that some countries like India, Pakistan and Nepal need to do more to make their women literate.

Great job by Sri Lanka, Maldives and Bangladesh. Women are almost as literate as men.




Thursday, April 9, 2015

Wine Quality Analysis

Have you ever wondered what chemical properties qualify a wine to be a good quality wine? As part of the final project for my certification in 'R', I had to do some Exploratory Data Analysis on a data set which had various chemical properties and the associated quality rating for different types of wines grown in and around Portugal. Let us look at the different properties and their influence on quality in the following study.

The scope of this analysis is to understand the relationship of various chemical properties and their impact on the quality ratings of Red and White wine.

The wine data sample set contains information on 6497 observations of 13 variables. I combined the red wine data which had 1599 observations with white wine data consisting of 4898 observations. I started by analyzing the individual variables in the red and white wine separately,and then I combined them and explored interesting questions and leads as I continued to make observations on plots.



The pH of wine is an important factor which determines the quality of the wine in terms of taste, color, oxidation and chemical stability.  Most wines optimally have a pH between 2.9 and 4.0 Lower pH values allow the wine to stay fresher for a longer period and retain its original color and flavor. 




From the analysis of the plot above ,we can see that, as the quality of the wine goes up, the pH stays at around 3.5. Medium quality wines seem to have a slightly higher pH compared to the fine quality wines.


Alcohol has a negative correlation with density. As the alcohol content increases, the density decreases. This is true for both red and white wine samples.

The plot below indicates that white wine does exhibit higher SO2 components as compared to Red Wine across all pH values within the sample.There seems to be higher variation for both SO2 values in both red and white wines between a pH value of 3.5 and 4.0.



Red wine has less residual sugar and less SO2 compared to white wine.








Interesting observation is as the acidity goes up, the pH comes down. Also, acidity seems to be higher for red wine sample compared to the white wine sample.





Wine makers add the extra SO2 to white wine to prevent oxidation. Red wines have natural anti-oxidant. White wines lack this anti-oxidant and wine makers add SO2 to prevent the oxidation.The analysis proves that White wine exhibits a higher level of SO2 properties compared to red wine for the same pH.


Reflection

The pH of wine is an important factor which determines the quality of the wine in terms of taste, color, oxidation and chemical stability. The general rule for pH in wine making is the higher the pH reading, the lower amount of acidity in the wine. Most wines optimally have a pH between 2.9 and 4.0, with different values based on the type of wine. Any pH above 4.0 indicates that the wine will spoil quickly and be chemically unstable. Lower pH values allow the wine to stay fresher for a longer period and retain its original color and flavor. The pH can also impact the amount of free sulfur dioxide (SO2) present. The free SO2 helps in breaking down bacteria and prevent spoiling.

For the final analysis, I chose the effect of pH on acidity,free SO2 and Quality.

Some of the learnings from the analysis were as follows:
  1. pH value is a key factor in determining the quality of the wines and from the analysis, we found that higher quality wines had a low pH and high fixed acidity. From the samples analyzed,the wines with lower pH had higher acidity content. The acidity went down as the pH level increased. Red wines have more acidity and lower pH compared to white wines.
  2. Wine makers add the extra SO2 to white wine to prevent oxidation.Red wines have natural anti-oxidant. White wines lack this anti-oxidant and wine makers add SO2 to prevent the oxidation.The analysis proves that White wine exhibits a higher level of SO2 properties compared to red wine for the same pH. SO2 can cause allergic reactions in some people. It is dangerous for asthmatics even at very low levels. So, now that we know that white wine has more SO2 than red, it is better to avoid white wine, if one is asthmatic.
  3. High pH values for a wine will affect its quality. Lower pH values will make the wine taste tart and crisp. Our analysis proves this factor. As the quality goes up, the pH remains at around 3.5. Medium quality wines seem to have a slightly higher pH compared to the fine quality wines. For the white wine sample with a quality value of 9, the pH remained less than 3.5
  A limitation of the current analysis is that the data consists of samples collected from a specific region in Portugal. It will be interesting to compare wines grown in other regions with the current data set and current analysis. It would be intriguing to see the effect of climate on the quality of wine.