Majek's technical blog: May 2008

2008-05-29

Concurrent programming: It’s not about the language, it’s the framework

There’s a huge discussion on the web about concurrent programming. We now have 4-core processors, and that number will double every few years. The problem is that programmers don’t know how to use multiple CPUs.

There are several approaches that address this issue:

Intel is developing a compiler that will automatically parallelize software
Python developers are working on an extension that lets Python use multiple CPUs through a thread-like API (normal Python threads use only one CPU - see GIL)
there are many extensions to C that make it easy to write parallel software
Java has built-in threading support
everyone admires Haskell’s support for multiple CPUs
some people believe that Software Transactional Memory is the silver bullet for parallel processing

I wonder if we need parallelization at this level at all. Maybe the next level over “one processor” is not “multiple processors” but rather “multiple machines”. Here are the strategies that are popular nowadays:

Google map/reduce
Erlang

Erlang as a language is horrible. It’s a language for really determined programmers, because the learning curve is so steep. But Erlang’s framework is excellent. You can easily scale over many machines, and with Erlang message passing you can accomplish more than with a thousand lines in other languages.

I believe that the Erlang framework ideas aren’t tied to the Erlang language. I’d love to have such a powerful framework for other languages.

Maybe we should skip the “multiple processors” phase and learn to use “multiple machines” technologies right now.

2008-05-22

Finding the iris in an image

During one of my classes we were supposed to find the iris in given images. I created a hackish script that does it. The script isn't finished and, to be frank, it barely works. But I think the resulting images look cool.

2008-05-12

Google App Engine tips&tricks

A while ago I wrote some sample applications (source) for Google App Engine. I noted the things that may be useful to other GAE programmers.

I used Google's webapp framework, and the code here relies on it.

Please take a look at the shell application; it can help you test simple code.

How to dynamically get application name and version?

This question was asked before. You can use os.getcwd() or os.environ['PATH_TRANSLATED'].

>>> os.getcwd()
'/base/data/home/apps/shell/1.21'
>>> os.getcwd().split('/')[-2]
'shell'
>>> os.getcwd().split('/')[-1]
'1.21'

>>> os.environ['PATH_TRANSLATED']
'/base/data/home/apps/shell/1.21/shell.py'
>>> os.environ['PATH_TRANSLATED'].split('/')[-3]
'shell'
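
The two lookups above can be folded into one helper. This is only a sketch: the name parse_app_path is made up for illustration, and it assumes the '/base/data/home/apps/<app>/<version>' layout seen in the session above, which Google may change at any time.

```python
def parse_app_path(path):
    """Split a GAE-style path into (app_name, version).

    Assumes the layout '/base/data/home/apps/<app>/<version>[/...]'
    observed above; returns (None, None) for anything else.
    """
    parts = path.split('/')
    try:
        idx = parts.index('apps')
        return parts[idx + 1], parts[idx + 2]
    except (ValueError, IndexError):
        # 'apps' is missing, or nothing follows it.
        return None, None

print(parse_app_path('/base/data/home/apps/shell/1.21'))
print(parse_app_path('/base/data/home/apps/shell/1.21/shell.py'))
```

Because it looks for the 'apps' component instead of counting from the end, the same function works for both os.getcwd() and os.environ['PATH_TRANSLATED'].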

How to identify current host?

There's a very interesting file that should be unique for every server:

>>> open('/base/python_dist/search.config').read()
'datapath .\nsorttempdir .\ndisk /export/hdc3/borgletdata/dirs/prod-appengine.\
mpm_python_dist_v12.apphosting.77627982/bigfiledata/466024'

>>> open('/base/python_dist/search.config').read()
'datapath .\nsorttempdir .\ndisk /export/hdc3/borgletdata/dirs/prod-appengine.\
mpm_python_dist_v12.apphosting.77627739/bigfiledata/465336'

You can identify the machine on which the process is deployed by hashing this file. Something like this:

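def get_server_id():
    # search.config appears to differ between servers, so its hash
    # works as a rough machine identifier.
    try:
        fd = open('/base/python_dist/search.config')
        try:
            data = fd.read()
        finally:
            fd.close()
    except IOError:
        return 'unknown'

    return '%s' % hash(data)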
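
A possible variant, assuming hashlib is available in the sandbox: the built-in string hash can differ between interpreter builds (and is randomized per process in recent Python versions), so a digest gives an id that is comparable across processes. The function name server_id_from_config and the 12-character truncation are arbitrary choices for this sketch.

```python
import hashlib

def server_id_from_config(data):
    """Return a short, stable identifier for the given config contents."""
    if isinstance(data, str):
        # hashlib works on bytes; encode text first.
        data = data.encode('utf-8')
    return hashlib.md5(data).hexdigest()[:12]

# The two config endings shown above stand in for two different servers.
id_a = server_id_from_config('apphosting.77627982/bigfiledata/466024')
id_b = server_id_from_config('apphosting.77627739/bigfiledata/465336')
print(id_a != id_b)
```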

Google doesn't tell you how many machines your application is deployed on (this probably depends on the traffic your site generates). But you can add this server_id to your site footer. Then you can issue multiple requests to find out how many unique machines your app is deployed on.

$ for i in `seq 20`; do
    curl -s http://cometchat.appspot.com|\
    grep server_id; \
  done    |sort -n|uniq -c
  
     20 server_id: '7341146770217830363'

It seems that my app is deployed on only one server.

How to identify current process?

Again, how many processes are running your app? This time the trick uses a global variable:

the_process_global = "something"

def get_process_id():
    return '%s' % id(the_process_global)

Now I know that my application is deployed using two processes:

$ for i in `seq 20`; do
    curl -s http://cometchat.appspot.com|\
    grep _id;
  done    |sort -n|uniq -c

    13 process_id: '12457625149327067176'
     7 process_id: '3996238433791648184'

Are we on production or development server?

I use this snippet:

if os.environ.get('SERVER_SOFTWARE','').startswith('Devel'):
    HOST='local'
elif os.environ.get('SERVER_SOFTWARE','').startswith('Goog'):
    HOST='google'
else:
    # logging.error('Unknown server. Production/development?')
    HOST='unknown'

Captcha on GAE?

Joscha Feth wrote a tutorial about using reCaptcha on GAE.

Cookies?

Google suggests that request and response objects follow the WebOb interfaces. This works for getting cookies from request:

username = self.request.cookies.get('username', '')

Unfortunately you can't use the WebOb method response.set_cookie. But you can set cookies by hand:

self.response.headers.add_header(
        'Set-Cookie',
        # 31 Dec 2020 falls on a Thursday.
        'username=%s; expires=Thu, 31-Dec-2020 23:59:59 GMT' \
          % username.encode())
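
If you would rather not write the header string by hand, the standard library can build it. The sketch below uses the Python 3 module names (http.cookies; the same class lives in the Cookie module on Python 2). The helper name build_set_cookie and the sample value 'majek' are made up for illustration, and I haven't verified this inside the GAE sandbox.

```python
import calendar
from email.utils import formatdate
from http.cookies import SimpleCookie

def build_set_cookie(name, value, expires_utc):
    """Return a Set-Cookie header value; expires_utc is a UTC time tuple."""
    cookie = SimpleCookie()
    cookie[name] = value
    # formatdate derives the weekday from the timestamp, so it can't
    # disagree with the date the way a hand-written string can.
    cookie[name]['expires'] = formatdate(calendar.timegm(expires_utc), usegmt=True)
    return cookie[name].OutputString()

header_value = build_set_cookie('username', 'majek',
                                (2020, 12, 31, 23, 59, 59, 0, 0, 0))
print(header_value)
```

The returned string is what would be passed as the second argument to self.response.headers.add_header('Set-Cookie', ...).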

You can find some other hints in the google-app-engine discussions. I don't know if cookies work with django-helper.

Debugging datastore access

I created a very simple datastore debugger. It appends some debugging info to the footer of the generated page. To use it, just change your classes to inherit from debug.DebugMiddleware instead of webapp.RequestHandler.

For example:

class List(debug.DebugMiddleware):
    def get(self):
        ... blabla ...

A sample footer can look like this:

**** Request took:   830ms/170ms (real time/cpu time)
**** GQLs, datastore accessed 1 times.
98ms GQL app: ":self"
            kind: "Image"
            Order {
            property: "modified"
            direction: 2
            }
            args: (50,) {}

This GQL log was caused by the code:

ims = Image.all().order("-modified").fetch(50)

Yet another example of the output footer:

**** Request took:   150ms/130ms (real time/cpu time)
**** GQLs, datastore accessed xx times.
  219ms PUT ({'full':...
  178ms PUT ({'full':...
    6ms GET ([datastore_types.Key.from_path('Image', 350L, _app=u'srv')],) {}
    2ms GET ([datastore_types.Key.from_path('Image', 349L, _app=u'srv')],) {}
    2ms GET ([datastore_types.Key.from_path('Image', 348L, _app=u'srv')],) {}

This datastore debugger can easily be modified to work as Django middleware.
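
To give a rough idea of how such a footer can be produced, here is a generic sketch of the wrap-and-time approach. It is not the code of debug.DebugMiddleware; the class CallLog, its methods and the fake_fetch stand-in are hypothetical names used only for illustration.

```python
import time

class CallLog(object):
    """Collect (label, elapsed_ms) pairs for calls wrapped with timed()."""

    def __init__(self):
        self.entries = []

    def timed(self, label, func):
        """Return a wrapper around func that records how long each call took."""
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed_ms = int((time.time() - start) * 1000)
                self.entries.append((label, elapsed_ms))
        return wrapper

    def footer(self):
        """Format the recorded calls as plain-text footer lines."""
        lines = ['**** datastore accessed %d times.' % len(self.entries)]
        lines += ['%5dms %s' % (ms, label) for label, ms in self.entries]
        return '\n'.join(lines)

log = CallLog()
# Stand-in for a datastore call; a real debugger would wrap the datastore API.
fake_fetch = log.timed('GET', lambda n: list(range(n)))
rows = fake_fetch(3)
print(log.footer())
```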

Dynamic image uploading

This is the code I use. The template:

<form action="." method="post" enctype="multipart/form-data">
    <label>File: </label><input name="file" type="file"><br />
    <input type="submit">
</form>

Server side:

class Image(db.Model):
    name        = db.StringProperty()
    content     = db.BlobProperty()

class UploadImage(webapp.RequestHandler):
    def post(self):
        # Reject the request if the field is missing, or if it is present
        # but carries no uploaded file.
        upload = self.request.POST.get('file', None)
        if upload is None or not getattr(upload, 'filename', None):
            self.error(400)
            self.response.out.write("file not specified!")
            return

        im = Image()
        im.name    = upload.filename
        im.content = upload.file.read()
        im.put()
        self.response.out.write("image %r saved." % im.name)

How to get image size and type

Tj9991 found an implementation of the function getImageInfo that can extract the image size without any external libraries. The usage is straightforward:

content_type, width, height = getImageInfo(im.content)
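
To show why no external library is needed, here is a minimal sketch of the same idea for PNG and GIF only. It is not the getImageInfo implementation mentioned above; the name sketch_image_info is made up, and JPEG (which needs a scan over its segments) is left out.

```python
import struct

def sketch_image_info(data):
    """Return (content_type, width, height) for PNG or GIF bytes.

    Returns (None, -1, -1) when the format is not recognized.
    """
    # PNG: 8-byte signature, then the IHDR chunk (4-byte length, b'IHDR',
    # big-endian 32-bit width and height).
    if len(data) >= 24 and data[:8] == b'\x89PNG\r\n\x1a\n' and data[12:16] == b'IHDR':
        width, height = struct.unpack('>II', data[16:24])
        return 'image/png', width, height
    # GIF: 6-byte signature, then little-endian 16-bit width and height.
    if len(data) >= 10 and data[:6] in (b'GIF87a', b'GIF89a'):
        width, height = struct.unpack('<HH', data[6:10])
        return 'image/gif', width, height
    return None, -1, -1

# Hand-built headers, just enough bytes to exercise the parser.
png_header = (b'\x89PNG\r\n\x1a\n' + struct.pack('>I', 13) + b'IHDR'
              + struct.pack('>II', 640, 480))
gif_header = b'GIF89a' + struct.pack('<HH', 32, 16)
print(sketch_image_info(png_header))
print(sketch_image_info(gif_header))
```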

Dynamic image serving

There's an article about this topic in the official docs. Here's my non-optimal code:

class ServeImage(webapp.RequestHandler):
    def get(self, key):
        im = db.get(db.Key(key))
        if not im:
            self.error(404)
            return
        
        content_type, width, height = getImageInfo(im.content)
        self.response.headers.add_header("Expires", "Mon, 01 Dec 2014 16:00:00 GMT")
        self.response.headers["Content-Type"] = content_type
        self.response.out.write(im.content)

Image resizing

Google doesn't support image-conversion libraries like PIL. You have to convert images using an external service: upload your data somewhere outside GAE and then fetch the resized image back. I created a service specifically for this (unfortunately, it is not the most stable solution). You can try other people's methods as well.

Is comet/http-push/long polling supported by GAE?

No, but keep reading. You could try normal polling, for example by loading ajax data every second. But GAE resources are limited: only 650k requests/day are available. Just 8 users polling every second around the clock would exceed this limit. I created an external service that allows you to use comet techniques from GAE.
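
The arithmetic behind that number, as a quick check. The quota constant is the 650k figure quoted above; the function name and the one-second default are just for illustration.

```python
DAILY_REQUEST_QUOTA = 650000      # the 650k requests/day figure quoted above
SECONDS_PER_DAY = 24 * 60 * 60    # 86400

def requests_per_day(users, poll_interval_s=1):
    """Requests generated by `users` clients polling around the clock."""
    return users * SECONDS_PER_DAY // poll_interval_s

for users in (7, 8):
    total = requests_per_day(users)
    print(users, total, total > DAILY_REQUEST_QUOTA)
```

Seven users stay just under the quota (604800 requests), while eight go over it (691200).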

You can also take a look at some of my sample applications that use my external services (source).

2008-05-11

Shared Fridge Magnets - simple collaboration for GAE

Recently I read The ELC Community Blog with their example of sharing objects using red5.

The obvious task is to do the same without red5 or any flash.

So I created yet another example of the comet daemon service.

Just open the fridge example site in two browsers. You can move, resize (with shift) and upload images.

2008-05-07

Google App Engine: Ytalk-like multiuser chat

This is a follow-up to my last post describing missing services for AppEngine.

The idea is to help developers writing apps for AppEngine by offering, as external services, common functionality that is missing from AppEngine and impossible to have there.

These services are accessible through a simple API over HTTP, and it is easy to call them from AppEngine applications using urlfetch methods.

The services available now are image resizing and comet.

Image resizing is built around a work queue, to which you can POST resize requests with a URL that will be called back when the resize is finished.

The comet service is much more interesting. It is a standalone comet server with properly implemented long running requests (not cheating with constant polling by the client) and allows many to many conversations using channels/chat rooms.

The comet server can notify your AppEngine application when users enter or leave channels, so your app can display updated presence info.

But the conversation can be going directly between client browsers and comet server, so it doesn't eat the precious AppEngine traffic limits.

As a demo of the comet server I have written a chat application modeled after an old-school unix program called ytalk.

The unique feature of the ytalk implementation is that it sends messages after every keystroke, rather than at the end of a line. That way you can see words not only being written, but being deleted too, letter by letter. It is quite funny to watch, especially when you realize how much HTTP traffic happens in the background.

The biggest challenge is to maintain high speed of sending updates, which depends on the connection latency and on the speed of our server. The server is event driven, and is mainly multiplexing streams of bytes flowing through file descriptors, so the overhead here is small.
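
For readers unfamiliar with the event-driven style, here is a tiny generic illustration of multiplexing several byte streams in one loop. It is not the comet server's code: it uses the Python 3 selectors module with local socket pairs standing in for client connections, and the labels 'client-a' and 'client-b' are made up.

```python
import selectors
import socket

def pump_once(sel):
    """Run one iteration of the event loop; return the bytes read per label."""
    received = {}
    for key, _events in sel.select(timeout=1.0):
        # Only sockets that are ready show up here, so recv() won't block.
        received[key.data] = key.fileobj.recv(4096)
    return received

sel = selectors.DefaultSelector()
a_writer, a_reader = socket.socketpair()
b_writer, b_reader = socket.socketpair()
sel.register(a_reader, selectors.EVENT_READ, data='client-a')
sel.register(b_reader, selectors.EVENT_READ, data='client-b')

# One keystroke from each "client".
a_writer.sendall(b'h')
b_writer.sendall(b'y')
result = pump_once(sel)
print(result)

sel.close()
for sock in (a_writer, a_reader, b_writer, b_reader):
    sock.close()
```

A single thread waits on all descriptors at once, so the cost per idle connection stays small.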

Summing up
Anyway, the point I want to stress is the easy integration with external applications. You can write your app in any framework and use the comet server as a service (via HTTP requests used only for control).

The service doesn't require any proxy server, which is sometimes used in similar setups.

Of course the services are still in development; call it an experimental version.

Here you can see my demos of AppEngine apps using the described services.

You can test the ytalk example, or take a look at the video to see it in action (yes, in the movie there are two different browsers open: Firefox on the left, Opera on the right):

2008-05-01

Missing services for Google App Engine (comet as a service!)

Google App Engine is a great product, but it lacks several features. I created a few simple services to help GAE developers. Of course the services aren't GAE-specific; you can use them from any site.

The services are:

Image resizing
Cron service
Comet service

Image resizing
One can't easily resize an image on Google's architecture - they blocked PIL. This service just resizes images and uploads them back to your site.

Example: simple gallery
Image queue on my server.
image resizing service documentation draft

Cron service
Sometimes you need to do something at regular intervals, like collecting garbage or updating counters. This service just wgets your URL at a specified hour.

cron service documentation draft

Comet daemon service

This is my favorite. I wondered if it was possible to create a web service that would allow developers to easily write comet sites. It wasn't easy, but here it finally is. Now you don't need your own advanced comet server. You can use my service instead! Maybe someone will be interested in the possibility of outsourcing comet servers.

During the work I solved many very interesting problems, but that's a topic for another post.

The only major limitation is that you have to own a domain. You can't build a comet application and serve it from *.appspot.com.

Summary
Of course you can download the source code of my examples from Google App Engine.

I find it fascinating to use Google's infrastructure together with other people's (my?) services to create really advanced sites without touching the servers. That brings new possibilities. I'm really excited about it.
