Adding method dynamically in Python

Here is a good answer on SO:

>>> class Dog():
...     def __init__(self, name):
...         self.name = name
...
>>> skip = Dog('Skip')
>>> spot = Dog('Spot')
>>> def talk(self):
...     print 'Hi, my name is ' + self.name
...
>>> Dog.talk = talk   # add method to class
>>> skip.talk()
Hi, my name is Skip
>>> spot.talk()
Hi, my name is Spot
>>> del Dog.talk      # remove method from class
>>> skip.talk()       # won't work anymore
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: Dog instance has no attribute 'talk'
>>> import types
>>> f = types.MethodType(talk, skip, Dog)
>>> skip.talk = f     # add method to specific instance
>>> skip.talk()
Hi, my name is Skip
>>> spot.talk()       # won't work, since we only modified skip
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: Dog instance has no attribute 'talk'

So it looks like adding a method to a class is simple:

SomeClass.method = method

I've also been using the types approach to bind a method to a single instance, but it actually makes the code harder to understand and maintain.
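
For reference, here is a minimal sketch of the same idea in Python 3 (my own adaptation, not part of the quoted answer, which is Python 2): print becomes a function, and types.MethodType takes just the function and the instance.

import types

class Dog:
    def __init__(self, name):
        self.name = name

def talk(self):
    print('Hi, my name is ' + self.name)

# Add the method to the class: available on every instance.
Dog.talk = talk
skip = Dog('Skip')
skip.talk()                    # Hi, my name is Skip

# Remove it again.
del Dog.talk

# Bind it to one specific instance only.
skip.talk = types.MethodType(talk, skip)
skip.talk()                    # Hi, my name is Skip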


Running wordpress on ECS

Recently I needed to set up a POC website for one of my projects to validate the idea, and I decided to go with WordPress since it's a very powerful tool and easy to set up. All my apps are on ECS already, so I decided to run WordPress on ECS as well.

WordPress already provides a Docker image. To make use of it, there are two pieces of stateful data that need to be handled:

  1. database data
  2. wp-content directory

Database

For the database, I already have an RDS instance. It's also easy to configure in the WordPress image via environment variables (see the task definition below).

WP-CONTENT

The wp-content directory is where WordPress stores all uploaded files, installed plugins, and theme files. It must be mounted into the Docker container so changes persist outside of ECS.

EFS is used to do this: I created an EFS volume and mounted it on each of my ECS instances (set up in user data so it mounts automatically on new instances), then in the ECS task definition mounted the EFS drive to the Docker volume /var/www/html/wp-content.

For a comparison between EFS, EBS, and S3, check the EFS FAQ. Basically, EFS strikes a good balance between performance and availability, and it fits my need for concurrent access.

Detailed Steps

Assuming an ECS cluster already exists:

  1. Create an EFS volume; make sure it's in the same availability zones as the ECS cluster.
  2. I used spot instances for my ECS cluster (highly recommended for cost savings), so I just created a new spot fleet request; in its user data we only need to join the ECS cluster and mount the EFS drive. A sample user data script is shown below.
  3. Update the ECS task and service to use the mount. A sample script that registers the task and updates the service is also shown below.

User Data


#!/bin/bash
# join ECS cluster
echo ECS_CLUSTER=cluster-name >> /etc/ecs/ecs.config

# mount efs at /efs-drive (the task definition below expects wp-content under /efs-drive)
mkdir -p /efs-drive
yum install -y amazon-efs-utils
mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 efs-id.efs.us-east-1.amazonaws.com:/ /efs-drive
mount | grep efs

# persist the mount across reboots
cp /etc/fstab /etc/fstab.bak
echo 'efs-id.efs.us-east-1.amazonaws.com:/ /efs-drive nfs4 nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 0 0' | tee -a /etc/fstab
mount -a

Update ECS Task


import boto3

client = boto3.client('ecs')
image = 'wordpress'
app_name = 'my-website'


def main():
    # register new version of task
    resp = client.register_task_definition(
        family=app_name,
        containerDefinitions=[
            {
                'name': app_name,
                'image': image,
                'cpu': 256,
                'memory': 256,
                'portMappings': [
                    {
                        'containerPort': 80
                    },
                ],
                'essential': True,
                'mountPoints': [
                    {
                        'containerPath': '/var/www/html/wp-content',
                        'sourceVolume': 'efs-wp-content'
                    }
                ],
                'environment': [
                    {
                        'name': 'WORDPRESS_DB_HOST',
                        'value': 'db-host'
                    },
                    {
                        'name': 'WORDPRESS_DB_NAME',
                        'value': 'db-name'
                    },
                    {
                        'name': 'WORDPRESS_DB_USER',
                        'value': 'db-user'
                    },
                    {
                        'name': 'WORDPRESS_DB_PASSWORD',
                        'value': 'db-pass'
                    }
                ],
                'healthCheck': {
                    'command': [
                        'CMD-SHELL',
                        'curl -f http://localhost:80 || exit 1'
                    ],
                    'interval': 60,
                    'timeout': 20,
                    'retries': 1,
                    'startPeriod': 120
                }
            },
        ],
        volumes=[
            {
                'host': {
                    'sourcePath': '/efs-drive/wp-content'
                },
                'name': 'efs-wp-content'
            }
        ],
        requiresCompatibilities=['EC2']
    )

    # update ecs service to pick new task
    family = resp['taskDefinition']['family']
    revision = resp['taskDefinition']['revision']
    new_task_arn = family + ':' + str(revision)
    resp = client.update_service(
        cluster='cluster-name',
        service=app_name,
        desiredCount=2,
        taskDefinition=new_task_arn,
        healthCheckGracePeriodSeconds=120
    )


if __name__ == '__main__':
    main()


Duplicate Results in Elasticsearch Spark Job

Version: 5.2.2. Both the cluster and the Spark driver (elasticsearch-hadoop) are on this version.

Issue

I found an interesting issue with the elasticsearch-hadoop driver: when I read data from Elasticsearch I get duplicated rows, while the total number of results remains the same. This means some rows are replaced by the duplicates.

For example, I was running a simple Spark job that exports the results of a query matching 50M rows from the Elasticsearch cluster. The result still contains 50M rows, but maybe 20M of them are duplicates. So in the end I got 30M unique rows, with 20M rows replaced by duplicates.

I didn't find any open issue in the elasticsearch-hadoop repo on Elastic's GitHub page, but I did manage to find a workaround.

Solution

es.input.max.docs.per.partition: <number of docs in smallest shard>

Why

This is a new parameter added in ES 5. It controls how reads from Elasticsearch are sliced: the parameter is basically the batch size per input partition, and the default value is 100K.

So when a shard has more than 100K docs, the Spark driver slices that shard into multiple reads. I'm guessing there's a bug in how the slice boundaries are calculated, so that some documents get read more than once.

The workaround is simply to set this number large enough to avoid sliced reads. That's why we need to set this parameter to the number of docs in your smallest shard.
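
For reference, here is a minimal sketch of how the workaround can be applied in a PySpark job using the elasticsearch-hadoop connector (the ES host, index name, and the 50M figure are placeholders for my setup):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('es-export').getOrCreate()

df = (spark.read
      .format('org.elasticsearch.spark.sql')
      .option('es.nodes', 'es-host:9200')
      .option('es.resource', 'my-index/my-type')
      # Workaround: set the per-partition limit high enough that no shard
      # gets sliced into multiple reads.
      .option('es.input.max.docs.per.partition', '50000000')
      .load())

print(df.count())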

Last

I love Elasticsearch and the Hadoop driver. I just haven't had the time to reproduce this issue at a smaller scale and report it, so this bug is unconfirmed through official channels. What I can say is that the workaround worked for me: no more duplicates.


A Summary on Classification Model

  1. Evaluation Metrics
     1. Confusion Matrix – predicted class vs. actual class, accuracy, etc.
     2. Cost Matrix – multiply each entry of the confusion matrix by the corresponding entry of the cost matrix and sum them up.
     3. Cost-sensitive Measures – precision and recall

Precision and Recall

Precision: what % of tuples that the classifier labeled as positive are actually positive

Recall: what % of positive tuples did the classifier label as positive

F measure – the harmonic mean of precision and recall; "a measure of a test's accuracy" (Wikipedia).
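
A minimal sketch computing these metrics from raw confusion-matrix counts (the function name and the example numbers are my own illustration):

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)   # % of predicted positives that are actually positive
    recall = tp / (tp + fn)      # % of actual positives the classifier found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Example: 80 true positives, 20 false positives, 40 false negatives
print(precision_recall_f1(80, 20, 40))  # approximately (0.8, 0.667, 0.727)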


Store zipcode+date in an integer type in mysql

What I'm trying to do is generate a primary key in the database. The easiest way is probably an auto-increment id, but in this case the table is partitioned by id, and auto-increment won't balance the load across partitions. Since the unique key is zipcode (5 digits) + date (8 digits), it's natural to generate the primary key from them. You may think it's easy to just append the date to the zipcode, producing an id like 9000120130928.

But the problem is that, due to certain constraints, I can only use the INT type instead of BIGINT. Simple concatenation won't fit in an INT, whose max value is 2^32 ≈ 4.2B (unsigned), whereas an id built this way can easily go up to 9,000B as shown above.

Theoretically, there are about 43,000 zipcodes in the U.S., and we store historical data from 1830 to the present. So there will be at most 43,000 * 365 * 200 ≈ 3.1B records, which fits in a regular unsigned INT. The key is to find a function that maps each (zipcode, date) pair to a unique value in that range, i.e. to find an f(x) that fits. One possible mapping is sketched below.
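
A minimal sketch of one such mapping (my own illustration, not necessarily the scheme I ended up using): rank zipcodes densely from 0 to ~42,999 via a lookup table, count days since 1830-01-01, and combine the two indices.

from datetime import date

EPOCH = date(1830, 1, 1)
# Roughly 200 years of days; 43,000 * DAYS_IN_RANGE stays below 2^32.
DAYS_IN_RANGE = (date(2030, 1, 1) - EPOCH).days

def make_id(zip_rank, d):
    # zip_rank is a dense 0..42999 index from a hypothetical zipcode lookup table
    day_index = (d - EPOCH).days
    return zip_rank * DAYS_IN_RANGE + day_index

# Example: the zipcode ranked 9000, on 2013-09-28
print(make_id(9000, date(2013, 9, 28)))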


How to learn hadoop

This is an email I sent to my colleagues sharing my hadoop learning experience. I hope it can help a little bit to others in the Internet as well.

===========================================

I thought it would be good to share some of my experience learning Hadoop, to help others avoid the mistakes I've made and to share some useful links. I learned on the Apache distribution, so this probably applies mainly to Apache (and maybe partially to other distributions). Anyway, I hope this helps you in some way. Here are my suggestions:

1. To start, you may want to follow this tutorial to get a MapReduce job running in single-node mode.

This will definitely help you understand how Hadoop works, and it provides a really good prototype for you to scale out to a real cluster. In this tutorial you'll learn how to set up a single-node cluster and how to write a MapReduce job. Here's another article on setting up a single-node cluster that's easier to read.

2. Set up a multi-node cluster

Following this article will help you set up a real cluster. We don't have enough machines to do that, but with VMware Player you won't need to worry about it (just create 2-3 Ubuntu instances in it).

3. Run a bigger hadoop job

When you scale out and start processing a much bigger data set, issues will always appear, so you may want to do this on a real cluster to know what it feels like. The job doesn't have to be complicated, but the input dataset should be HUGE, so you can see how the cluster runs, distributes data, and manages its nodes.

Other than the above, the Hadoop documentation and Stack Overflow are always my top choices for troubleshooting. The Hadoop Wiki is also a very valuable resource for the common questions you may have while learning Hadoop.

Besides, Hadoop: The Definitive Guide is a very popular book recommended by many users; another book, Hadoop in Action, is the one I used the most while setting up clusters.

Hope this helps. Cheers

