
Tag Archives: mongodb

My New Python Project Setup

[update 2012-01-18] postgres has been updated to 9.1.2; the latest version as of today.

[update 2012-01-17] Feel free to ignore my comments about Lua. Lua might sit in an interesting place between Python and Java as an embedded scripting language, but lunatic-python does not compile, and lupa depends on LuaJIT2, which is compatible with Lua 5.1 even though the current Lua version, 5.2, was recently released. The comment from the LuaJIT team about adoption was also a little snarky. I have to think about something else.

[update 2011-12-29] I forgot to add twitter’s bootstrap CSS/JS. I’ll cover that in a future post when I also discuss modern-package-template

It’s pretty simple to set things up. There are some prerequisites and some basic install packages that need root access but the intent is to get the config in userspace as soon as possible. This article covers VM slices at Rackspace using Ubuntu 11.10.

First: Install and update:

  • allocate the OS
  • select the OS and wait for it to complete.  You’re going to receive an email with the root password
  • login, change the root password
  • create an “admin” privileged user (usually my name or “builder”)
  • add this user to the sudo group
  • change its password
  • edit /etc/ssh/sshd_config and disable root login
  • update the package definitions (apt-get update)
  • upgrade the packages (apt-get upgrade)
  • reboot

Second: Install the required root packages

  • postfix – when prompted select the default values
    • apt-get -y install postfix
  • apt-get -y install python-setuptools daemontools daemontools-run python-dev mailutils mutt build-essential uuid-dev python-nose vim htop sysstat dstat ifstat screen locate apache2-utils unzip siege python-virtualenv bwm-ng libcairo2-dev libglib2.0-dev libpango1.0-dev libxml2-dev fail2ban openssl libssl-doc openvpn libssl-dev libgcrypt11-dev lighttpd lighttpd-dev libevent-dev libcurl4-openssl-dev  libreadline6-dev beanstalkd tree
  • apt-get install postgresql-9.1 postgresql-client-9.1 postgresql-doc-9.1 postgresql-plperl-9.1 postgresql-plpython-9.1 postgresql-server-dev-9.1
  • easy_install pip
  • easy_install mercurial
  • easy_install pycurl
  • pip install virtualenvwrapper

That’s it, essentially, for the second layer. Here’s an explanation of the modules from a macro perspective:

  • python-setuptools – make the installer, easy_install, available
  • daemontools daemontools-run – there are so many ways to implement a ‘daemon’; these tools make daemon deployment simple
  • python-dev python-nose python-virtualenv – basic prereqs for python development. virtualenv is needed so that packages can be installed in userspace
  • mailutils mutt – generate emails
  • build-essential uuid-dev  - basic developer tools
  • vim screen – editor and console tool
  • htop sysstat dstat ifstat locate unzip bwm-ng – debug /monitoring tools
  • libcairo2-dev libglib2.0-dev libpango1.0-dev libxml2-dev – libs used when rendering usage graphics
  • fail2ban – detect login attempts and put the IP in time-out
  • openssl libssl-doc openvpn libssl-dev libgcrypt11-dev libcurl4-openssl-dev - crypto
  • lighttpd lighttpd-dev – web server that should be in front of the framework
  • libevent-dev – event notification library (wraps epoll, kqueue and similar)
  • apache2-utils siege – performance simulation tools
  • beanstalkd – message queue

And finally the third layer, the userspace framework layer. But before you start installing packages you need to create the virtual environment:

  • cd ${HOME}
  • mkdir -p src
  • cd ${HOME}/src
  • virtualenv currentenv
  • . ./currentenv/bin/activate
Now install the third layer.
  • pip install tornado
  • pip install flask
  • pip install flask-rest
  • easy_install pip
  • pip install pycurl
  • pip install simplejson
  • pip install tornado
  • pip install Fabric
  • pip install PasteDeploy
  • pip install PasteScript
  • pip install modern-package-template
  • pip install requests
  • pip install gevent
  • pip install pystache
  • pip install nose
  • pip install redis
  • pip install pymongo
  • pip install hoover
  • pip install pyzmq
  • pip install pyyaml
  • pip install beanstalkc
  • pip install django
  • pip install django-redis-cache
  • pip install clint
  • pip install djangorestframework
  • pip install pyparsing
  • pip install flup

I’m hoping that there is a practical use-case for embedding Lua in Python. There are a few interesting projects like lunatic-python and lupa. Normally I would not consider Lua for anything beyond “hello world”; however, the redis team is embedding Lua, it seems like a very lightweight codebase, and it can be embedded in just about any language (do a google search).

  • cd ${HOME}
  • mkdir -p tmp
  • cd tmp
  • wget http://www.lua.org/ftp/lua-5.2.0.tar.gz
  • tar zxvf lua-5.2.0.tar.gz
  • cd lua-5.2.0
  • make linux
  • sudo make install

NOTE: the lunatic project does not compile under Lua 5.2. So this thread is postponed for now.

  • pip install lunatic-python

Alternatively I tried [lupa] but that requires LuaJIT 2.0 which is currently in beta (version 9)

  • cd ${HOME}
  • mkdir -p tmp
  • cd tmp
  • wget http://luajit.org/download/LuaJIT-2.0.0-beta9.tar.gz
  • tar zxvf LuaJIT-2.0.0-beta9.tar.gz
  • cd LuaJIT-2.0.0-beta9
  • make
  • sudo make install
  • sudo ldconfig

Then install lupa.

  • pip install lupa
NOTE: hoover is a client library for loggly.com.  You’ll need an account if you want to use this service.

In closing, I would like to include a few more libraries; however, the current versions in apt-get are too old, so I’d prefer installing them from scratch. They are necessary packages, so for the time being I’m just going to list them. They should be installed when installing the first layer and by the root user (or sudo).

  • ZeroMQ - trivial to build and deploy if you follow the instructions
    • cd /tmp
    • wget http://download.zeromq.org/zeromq-2.1.11.tar.gz
    • tar zxvf zeromq-2.1.11.tar.gz
    • cd zeromq-2.1.11
    • ./configure
    • make
    • make install
    • ldconfig
  • MongoDB – (can actually be installed in userspace)
    • cd /tmp
    • wget http://fastdl.mongodb.org/linux/mongodb-linux-x86_64-2.0.2.tgz
    • mkdir -p ${HOME}/bin
    • cd ${HOME}/bin
    • tar zxvf /tmp/mongodb-linux-x86_64-2.0.2.tgz
    • sudo mkdir -p /data/db
    • sudo chown `id -u` /data/db
    • ./mongodb-linux-x86_64-2.0.2/bin/mongod
    • … or …
    • cd ${HOME}/bin
    • find ./mongodb-linux-x86_64-2.0.2/bin/ -type f -exec ln -s {} \;
    • ./mongod
  • Redis – trivial to build and deploy if you follow the instructions
    • cd /tmp
    • wget http://redis.googlecode.com/files/redis-2.4.6.tar.gz
    • tar zxvf redis-2.4.6.tar.gz
    • cd redis-2.4.6
    • make
    • make install
  • SQLite – a simple SQL DB
    • cd /tmp
    • wget http://www.sqlite.org/sqlite-autoconf-3070900.tar.gz
    • tar zxvf sqlite-autoconf-3070900.tar.gz
    • cd sqlite-autoconf-3070900
    • ./configure
    • make
    • make install
    • ldconfig
  • ISO8583 – ISO8583 lib

Good luck!

PS: You should consider scripting this installation so that the deploy can be automated, especially via Fabric, Chef, or Puppet.
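As a starting point for that automation, here is a minimal Python sketch that uses only the standard library. It collects a subset of the commands from the layers above into ordered steps and either prints them (dry run) or executes them. The names `build_steps` and `run_steps` and the shortened package lists are made up for this illustration; they are not part of Fabric, Chef, or Puppet, and a real deployment would list every package and deal with failures and re-runs more carefully.

```python
import shlex
import subprocess

# A subset of the packages named in this post; extend as needed.
ROOT_PACKAGES = ["python-setuptools", "python-dev", "build-essential",
                 "python-virtualenv", "libevent-dev", "beanstalkd"]
PIP_PACKAGES = ["tornado", "flask", "requests", "pymongo", "pyzmq"]


def build_steps(root_packages, pip_packages, env_dir="currentenv"):
    """Return the install commands, in order, as lists of arguments."""
    steps = [
        ["sudo", "apt-get", "update"],
        ["sudo", "apt-get", "-y", "install"] + list(root_packages),
        ["virtualenv", env_dir],
    ]
    # Use the virtualenv's own pip so packages land in userspace.
    pip = env_dir + "/bin/pip"
    steps += [[pip, "install", name] for name in pip_packages]
    return steps


def run_steps(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for step in steps:
        print(" ".join(shlex.quote(arg) for arg in step))
        if not dry_run:
            subprocess.check_call(step)  # stop at the first failure


if __name__ == "__main__":
    run_steps(build_steps(ROOT_PACKAGES, PIP_PACKAGES), dry_run=True)
```

Running it as-is only prints the commands; pass `dry_run=False` once the list looks right.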

 

Posted by on 2011/12/30 in architecture, beta

 


Response to “Seven Databases in Seven Weeks”

For this “7 in 7” book I just glanced at the author’s motives for selecting the DBs. What caught my attention was the TOC. While the title of the book suggests that this is going to be a reference to modern databases and the NoSQL movement, it includes Postgres. What’s curious here is that a) Postgres is neither a new database nor a NoSQL database, and b) while its current implementation is modern, none of the modern features are mentioned.

And then there is a huge gap where BDB, BerkeleyDB, should be. While BDB is sometimes considered a NoSQL database, it does not fit the CAP framing that is consistently attached to NoSQL DBs. What makes BDB interesting, and what would seem to be the subliminal rationale for the many query dialects of the NoSQL DBs, is an essay that Mike Olson wrote in which he justified BDB’s APIs and the absence of a formal query language: [programmers know their data better than any query optimizer] and [the extra steps to compile and optimize are time consuming and better done at compile time instead of runtime].

CAP is the anti-pattern to ACID. Essentially CAP comes down to a principle of economics [pick two of the following three attributes]. A lot of time has been devoted to this paper and the many followup research papers. I’m not qualified to rebut the thesis but I always wonder if there is a spoiler out there. VoltDB has a novel approach that suggests that you can, in fact, have your cake and eat it too. (It’s also absent)

The real challenge with the NoSQL movement and this publication is that they are implementing code as fast as they can. By the time this article is posted something new and interesting will have been deployed.

Missing from consideration:

  • memcache
  • leveldb
  • big table
  • S3
  • BDB (mentioned)
  • Orient
  • UnSQL (a completely different movement)
  • SQLite

Finally, the one thing that is missing for me is a comprehensive, or at least a beginner, list of use-cases and the DBs that best satisfy those use-cases and why. For example, Riak seems to be a special purpose DB where MongoDB seems to be more of a general purpose DB. There are still some edge cases… but when you’re talking about the volume of data that many of the NoSQL people talk about you had better have a good plan, especially if you think you might be moving the data from one storage engine to another.

 

Posted by on 2011/12/21 in database, future

 


Website name?

I need help deciding on a name for my project website. It’s going to use (a) MongoDB and (b) Mojolicious to store, manage and print mailing labels. This is just a sample project to demonstrate (a) and (b).

 

Posted by on 2011/10/14 in database, site, web

 


Eventually Consistent Storage Will Save Mankind

I recently read a tweet from @justinsheehy, the very public face of Riak @ basho.com. He wrote:

Paraphrasing @GeorgeReese: to be protected from failure, put as much of your data in an eventually-consistent system as possible.

In response, and without thinking too deeply, I asked the question:

@justinsheehy @georgereese good point so why not a flat file and import later? Why all the extra cycles/rotations to write to any type DB?

And then @georgereese and I started to converse at 140-byte intervals until he sent me a link to this article: Eventual consistency – Wikipedia, the free encyclopedia.  At that moment I realized that my original question was really more of a statement: in its absolute simplest form, the wiki definition for eventual consistency can be applied to a flatfile on a DOS-based computer so long as you take backups and restore them on another computer… at some point in time.
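To make the flat-file argument concrete, here is a small Python sketch of that idea under simple assumptions: one writer appends to a log file, and a "replica" only learns about writes when it re-imports the file. The function names and the JSON-lines format are invented for this example; it illustrates the definition and is not how Riak or any real store works.

```python
import json
import os
import tempfile


def append_write(log_path, key, value):
    """Primary: record a write by appending one JSON line to a flat file."""
    with open(log_path, "a") as log:
        log.write(json.dumps({"key": key, "value": value}) + "\n")


def replay(log_path):
    """Replica: rebuild state by replaying the log; later writes win."""
    state = {}
    with open(log_path) as log:
        for line in log:
            record = json.loads(line)
            state[record["key"]] = record["value"]
    return state


def demo():
    """Show a replica that is stale until it imports the log again."""
    log_path = os.path.join(tempfile.mkdtemp(), "writes.log")
    append_write(log_path, "balance", 10)
    replica = replay(log_path)           # first import
    append_write(log_path, "balance", 25)
    stale = replica["balance"]           # replica has not caught up yet
    replica = replay(log_path)           # "eventually": the next import
    return stale, replica["balance"]
```

Between the second write and the second import the replica still answers with the old value; once no new writes arrive and the import runs, both copies agree.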

That said, I think, and I could be wrong, that Sheehy and Reese were probably talking about Riak, which has a lot more moving parts in it than, say, a zipped flatfile and rsync… and there is plenty of computer science reference material that discusses BLOC (bugs per line of code).

I’m currently designing and implementing a credit card payment gateway. It’s not overly complicated, however, the most interesting piece of this implementation is the use of Redis as the storage engine. While Redis stores everything in memory, I have enabled the feature/function that saves the data to disk; so while I have not enabled replication… this “system” can be described as eventually consistent.

Eventually Consistent is so much more interesting when applied generally and globally across systems instead of narrowly defined applications.

In the interest of full disclosure; I recently interviewed with @justinsheehy for a position on the Riak project. While I recognized that I had not performed well after only having a few hours sleep thanks to my pair of newborns I have not yet received any formal feedback. This conversation and post are meant to be informative and with the sincere hope that one day basho might offer me a position.

 

Posted by on 2011/10/03 in database

 


mojolicious – first app (part 1)

[Update] I should have mentioned that the elapsed time was only about 15 min; maybe less.

I’m a perl programmer from way back and while it has been 2 years since I’ve written any perl of consequence I’ve decided that my client’s application was going to be implemented in perl… just for the fun of it. (I’ve been writing a lot of python+tornadoweb+redis+mongodb and I’m very comfortable in this space. Mojolicious is a half step outside my comfort zone because I really do not want to rewrite this app in python if I don’t have to. I’m a big believer in DRY)

I’m certain that everything is installed. I used the same installation for documentation’s sake recently:

wget http://fastdl.mongodb.org/linux/mongodb-linux-x86_64-2.0.0.tgz
# need to untar and then relocate the files to the path

# MOJOLICIOUS - (as root)
curl -L cpanmin.us | perl - App::cpanminus
curl -L cpanmin.us | perl - Mojolicious
curl -L cpanmin.us | perl - Redis
curl -L cpanmin.us | perl - ZeroMQ
curl -L cpanmin.us | perl - JSON
curl -L cpanmin.us | perl - MongoDB

And we should be ready for the first app.

The application folder structure is going to look like this:

.
|-- COPYRIGHT
|-- INSTALL
|-- LICENSE
|-- README
|-- bin
|-- data
|-- doc
|-- src
`-- test

It’s far from perfect and it’s missing the Mojolicious application (webapp). So let’s create a mojo app (adapted from the docs):

Nope, not yet. First you have to know whether you are going to create a lite_app or a full-blown app.

mojo generate lite_app

or

mojo generate app

The difference is quite extensive. The second selection is probably best suited for anything more than a few dynamic pages, and the lite app seems suited for all-in-one file deployment. Right now I’m not sure which is best for this design problem; however, I know that I can convert later. I can cut-paste everything and with any luck I won’t be repeating myself.

So before I make that decision: what does my application look like?

  • I need a login screen and session state so that the data is private for my client. One user is sufficient.
  • I need to upload a CSV file to a file system and queue that file for processing.
  • I need a background process that looks for queued files and then processes them.
  • The process is as simple as a GET against a URL, storing the result in another CSV and then getting the next record from the input file.
  • Once all of the records have been processed the second file is converted to a PDF of mailing labels.
  • The PDF can be downloaded from inventory.
  • and finally the user should be able to delete the PDF, the original CSV and the intermediate CSV.
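The background step in the list above (read a record, do a GET, write the result to a second CSV, move to the next record) can be sketched in a few lines. This is a Python illustration rather than the Perl the app will be written in, and the column names and the `fake_fetch` stand-in are assumptions made up for the example; the real lookup URL and fields are not shown here.

```python
import csv
import io


def process_queue_file(infile, outfile, fetch):
    """Read input rows, call fetch(row) for each, write result rows.

    `fetch` stands in for the HTTP GET; it takes a dict (one CSV row)
    and returns a dict to write to the intermediate CSV.
    """
    reader = csv.DictReader(infile)
    writer = None
    count = 0
    for row in reader:
        result = fetch(row)
        if writer is None:
            # Take the output columns from the first result.
            writer = csv.DictWriter(outfile, fieldnames=list(result))
            writer.writeheader()
        writer.writerow(result)
        count += 1
    return count


def fake_fetch(row):
    """Hypothetical lookup; a real one would GET a URL built from row."""
    return {"id": row["id"], "label": row["name"].upper()}


def demo():
    """Run the loop over an in-memory CSV instead of an uploaded file."""
    src = io.StringIO("id,name\n1,ann\n2,bob\n")
    dst = io.StringIO()
    count = process_queue_file(src, dst, fake_fetch)
    return count, dst.getvalue().splitlines()
```

Passing the fetch function in as an argument keeps the loop testable without a network; the PDF step would then read the intermediate CSV.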

So that’s pretty simple. I need a few pages:

  • login
  • error
  • queue status
  • upload
  • are you sure (popup??)
  • completed (optional, could always be inline)

So for the time being lite_app it is. This is what I did:

mkdir webapp
cd webapp
mojo generate lite_app lisapp.pl

Notice that I specified my application name (lisapp.pl). Starting the app was a breeze even though it bugs me that the mojo team misused daemon and does not even want to be open minded about it. Oh well, it’s a good platform in spite of ‘daemon’. And then I noticed the output:

$ ./lisapp.pl daemon
[Wed Sep 21 00:13:47 2011] [info] Server listening (http://*:3000)
Server available at http://127.0.0.1:3000.

Notice that the output is stating that it’s listening on ‘*’ and 127.0.0.1.  What’s up with that? When I did a ‘netstat -ln’ I only had one instance of port 3000.

tcp        0      0 0.0.0.0:3000            0.0.0.0:*               LISTEN

I’m not certain I understand the motives here. But I have to say that it drove me nuts because I knew that I had to test on 0.0.0.0 since my server is at Rackspace, and at first glance I only saw the 127.0.0.1. So it’s on me to pay attention and on them to be more clear.
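Both lines appear to describe the same socket: ‘*’ in the log and 0.0.0.0 in netstat are two spellings of the wildcard address, meaning every interface on the machine, and 127.0.0.1 is just one of the addresses that the wildcard covers. A minimal Python sketch (standard library only; `bound_address` is a name made up for this example) shows the difference between binding to the wildcard and binding to loopback:

```python
import socket


def bound_address(host):
    """Bind a TCP socket to host on an ephemeral port; return the address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, 0))   # port 0 lets the OS pick a free port
        sock.listen(1)
        return sock.getsockname()[0]
    finally:
        sock.close()
```

A socket bound to "0.0.0.0" accepts connections on the public address as well as on loopback, which is what a remote test against a hosted server needs; one bound to "127.0.0.1" is reachable only from the machine itself.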

When I pointed my browser to the URL it popped, as I expected since this is a very lightweight app. So for now I will take a break. Next time I will add the login screen. It’s part of the 3rd screencast and pretty simple. I also need to read up on the under command. It seems to be where all the power comes from. Keep in mind that the home page is going to be a login screen. No data should leak out. Everyone logs in.

 

Posted by on 2011/09/21 in database, nosql, web

 


How does [mongoDB] concurrency work

This is a short post to call attention to this [mongoDB concurrency]. I do not doubt, for a second, that concurrency is difficult. It’s probably very hard. How hard depends in part on the overall architecture, resultset size, record version collision, replication, sharding, and conflict resolution, among others.

I do not have a particular concern other than that it is an important problem and that it’s going to affect write performance and potentially some reads. (This is very similar to the GIL problem in Python.) There is a ticket on their system that references this issue and plans to resolve it by pushing the lock to the collection instead of the DB. While this is more granular, I’m not sure it resolves the real issue with a workable solution. (And I do not have a formal recommendation.)
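To illustrate what moving the lock from the DB to the collection changes, here is a toy Python sketch. It is not how MongoDB is implemented; the class names and the in-memory store are invented for this example, and it only shows the difference in granularity.

```python
import threading
from collections import defaultdict


class GlobalLockStore:
    """One lock for the whole database: every write serializes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = defaultdict(list)

    def lock_for(self, collection):
        return self._lock            # same lock regardless of collection

    def insert(self, collection, doc):
        with self.lock_for(collection):
            self._data[collection].append(doc)


class CollectionLockStore(GlobalLockStore):
    """One lock per collection: writes to different collections
    do not block each other."""

    def __init__(self):
        super().__init__()
        self._locks = defaultdict(threading.Lock)

    def lock_for(self, collection):
        with self._lock:             # guard creation of per-collection locks
            return self._locks[collection]
```

With the per-collection version, two writers only contend when they target the same collection. A workload that hammers a single hot collection sees no benefit, which is one reason the finer lock may not resolve the real issue.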

It’s not clear to me whether this problem references a lock across all shards or just the one shard. Therefore, could the problem be alleviated by adding shards? (doubt that)

I can say this, however: most RDBMS servers have addressed this problem by making the locking level tunable via a pragma. To know whether this is an issue for you, you will need to know the distribution of calls [read, insert, update and delete] and the TPS rates too.

I’m less convinced that the day of the RDBMS is over. (Recently PostgreSQL 9.1 was released)

 

Posted by on 2011/09/15 in database, nosql

 


Mongolab Surprise

I’m still digging into all things NoSQL and I have tried mongohq in the past but now I wanted to try mongolab. I was not expecting anything too super fantastic as I created my user account, my database and then my first collection. But I was surprised.

I had a csv file on my Mac. I had previously installed mongo on my computer… I clicked on the import/export link on the mongolab website and they gave me a list of commands that I could copy/paste to the command line. After I inserted my username and password I executed it. It worked first time. Very nice.

Viewing and editing documents is a snap but subject for some self exploration.

Right now my only criticism is that it’s expensive and I cannot determine the value proposition solely based on the webapp. I could install a mongodb server for a small fraction of the cost, and then clustering and HA become something I can measure.

 

Posted by on 2011/09/04 in database, nosql

 


NoSQL != NoDBA

For the reader who is not familiar; the title of this article reads: NoSQL not equal NoDBA. And what I mean by it is that while the traditional function of the DBA is different in the NoSQL environment; one still needs a subject matter expert (SME) on the payroll in order to keep the “engine” running smoothly. NoSQL is just another specialty.

Many years ago I was caught-up in SleepyCat’s BDB libraries. They worked, they were fast, and as they promised; you could forgo a DBA. I developed a few proof of concept applications using BDB and they worked great. They included speed, big data, ACID and everything they promised. Luckily for me, at the time, the projects never ran long enough for a disaster to occur. I know now that, at the time, I did not know enough about BDB to recover from even a moderate system failure.

Today we are inundated with NoSQL alternatives. Riak, MongoDB, Redis, Cassandra, Volt, Orient; just to name a few. To my knowledge, none of them actually state that a DBA is not required, however, they all seem to imply that your developers are going to assume the responsibility. At least Riak and MongoDB have enterprise consoles for the NOC (network operations center) suggesting that they realize otherwise.

Let’s start with the schema. Most developers will knock out their first or second iteration of the schema over lunch. And in most cases it’s probably pretty simple. It’s not until you get into production that “you” realize the warts in your perfect parochial schema. I’ve implemented several payment systems. The first holds 12B active accounts and processes 12M sale transactions a day (333TPS). The second had a hard time at 25TPS. The first contained only 5 tables and the second was a beautiful 100 table constraint nightmare.
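A TPS figure depends on how many hours of the day carry the traffic. As a quick check of the arithmetic: 12M transactions spread evenly over 24 hours average about 139 TPS, and the same volume squeezed into a ten-hour window averages about 333 TPS. The ten-hour window is an assumption chosen for illustration; the actual traffic profile is not stated here.

```python
def average_tps(transactions_per_day, active_hours=24):
    """Average transactions per second over the active window."""
    return transactions_per_day / (active_hours * 3600)
```

Peak rates will be higher than either average, which is the number that matters when sizing the database.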

And then there is “real world” data. For example, when you’re doing 12M transactions a day on Oracle it’s still a challenge to export the data so that it can be warehoused and reported upon. ETL is going to take time. That’s when one might consider sharding and other approaches to optimization, even normalization (all functions that should be performed by a DBA). However, in the NoSQL/NoDBA world, this function is going to fall on the developer… who is no longer working on new functions or revenue generating opportunities but is instead sandbagging the dam.

As far as SME’s go. They tend to know vertical markets or applications very well. They tend not to know every last detail about the data store.

For example, there was a time when my DOS based PC would crash and I’d have to fix my harddisk. There was a time when I could and would repair the filesystem by hand; however, after Norton Utilities performed that function in a fraction of the time I had to turn in my keys. And now, when that type of failure occurs on my Linux machine I simply reinstall. I do not have the time or the inclination to repair the data.

That function was always left to the DBA when it came to traditional RDBMS and the sysadmin when the filesystem went bad. I just cannot imagine that anyone would want to perform that function when there are people who specialize in it.

Just because you have read the docs for the client libraries, and maybe the source code, does not make you an SME. And there is nothing that is going to replace the SME. Just because you’re not calling him/her a DBA does not mean that the function is not being performed.

 

Posted by on 2011/08/23 in database

 


Chef installation : you gotta be kidding me!

Last night I started working on puppet and things were iffy. At least the server and client installed from their ubuntu packages. Admittedly there were errors in the end but they might have been mine… and there are some compatibility issues that have been documented. So I switched to chef with good intentions.

Before I get to the details… in hindsight I must have been nuts to try chef. My first clue was the list of package dependencies; there must have been 50+. What were the designers thinking?

First of all they need a DB and an MQ, and I think I like the idea that they are using packages that exist in the open source environment… but I am amazed that they would use such behemoths. CouchDB and RabbitMQ both depend on erlang and all those extra packages, when a standard SQL-type DB like SQLite or, if they really need a document repo, MongoDB would be fine. At least the packages are small, available in binary form and they have a REST interface that is easy enough to write to. Of course there are so many other DBs that are integrated directly into Ruby or with shallow dependencies.

The same can be said for their choice of MQ. RabbitMQ is the thousand pound gorilla. There are two strong candidates in ZeroMQ and beanstalkd. Both are extremely lightweight to install and deploy. They are fast and reasonably functional.

So even though I have a personal dislike for all things ruby (based on personal experience in the Birmingham Alabama area) it can do the same job that other dynamic and non-dynamic languages can. Performance and some of the edge cases notwithstanding… I hate deep dependencies… (same reason I dislike most package managers including maven).

 

Posted by on 2011/07/22 in database, nosql, ProgLang, Tools

 


“The Network Is the Machine” : Erlang is not all that

I like erlang and I like it most because it solves a number of problems; however, the problems that I think it solves in general application development are not the kinds of problems that most erlang programmers want to solve. For the life of me I cannot understand why erlang programmers would implement a database like Riak. It’s a complicated undertaking and frankly, considering how deep the callstack has to be at times, it does not seem practical without a real debugger.

As I consider the amount of work that it takes to implement a single credit card transaction I realize that the entire callstack is going to consist of a few thousand instructions regardless of the language. The hardest part of a credit card transaction is the DB record versioning and not the actual in-memory workflow.

So then we start talking about the threading and IPC. MEH. I no longer care about that stuff. Not even for a second. With libraries like ZMQ “we” should reconsider how we allow processes to communicate. Modern MQs are fast, reliable, persistent, and easy.

Finally, if you have a transaction that takes a predictable number of machine cycles (especially in the context of a CC transaction) then executing the transactions synchronously via a fixed number of workers will have less overhead than each transaction being launched at the same time. As light as they may be there is still overhead. O(1)+1 still has a +1 and at some point… say after 1M transactions, they will count for serious performance.
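The fixed-number-of-workers idea can be sketched as follows. This is a minimal Python illustration using only the standard library; the function name, the default of four workers, and the handler callable are assumptions for the example, and it makes no claim about how erlang or any payment system schedules work.

```python
import queue
import threading


def run_with_fixed_workers(transactions, handler, workers=4):
    """Process every transaction using a fixed number of worker threads."""
    pending = queue.Queue()
    for txn in transactions:
        pending.put(txn)
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                txn = pending.get_nowait()
            except queue.Empty:
                return               # no work left; the worker exits
            outcome = handler(txn)   # each transaction runs synchronously
            with results_lock:
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The thread start-up cost is paid `workers` times instead of once per transaction, which is the +1 being avoided; results come back in completion order, not submission order.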

[update 2011-06-22: When the machine is idle then parallel execution and all the thread happiness in erlang makes sense; however, when the machine is busy then single process execution makes the most sense as there is no overhead, no matter how small. That is, if you have 100K transactions you still have 100K worth of work to complete. The mean execution time will be higher just because of the latency and overhead but on the whole there is no real advantage. See nodejs as a partial example.]

[update 2011-08-24: I get it and it was not because I was talking to the CTO at Riak. Two days ago I was looking at the connection pool to a Postgres DB. That's when I realized that an erlang implemented database... at least for socket(s) and systems with long runtimes like connection pooling; was a very strong use-case for erlang.]

 

Posted by on 2011/06/21 in nosql, ProgLang

 


 