Tag Archives: cdr

Reliable Asterisk CDRs

This is going to be a long and technical article about capturing CDRs (call detail records) using the custom extensions file and an AGI script.

I designed, built, deployed, and maintain a number of Asterisk servers which are used as part of a VOIP arbitrage system.

The first generation system, which I inherited, was a single machine that housed the Asterisk server, the database server for CDR and other billing and reporting data, and a PHP webapp that reported on the data. The system worked (a) when the call volume was low and (b) when the overall amount of data was low. Needless to say I was brought in when the “system” started (1) hanging, (2) losing call detail records, and (3) timing out in the browser before the webapp could return a result. It was a mess.

The new system uses multiple machines (n+1). A limited number of Asterisk servers can be connected to a dashboard. The dashboard is where all the data is stored, where the ETL is performed, and where the reporting is initiated. The following design supports about 5000 channels on moderate hardware; will auto-restart/recover if there is a crash; and will operate independently from the dashboard.

Here’s how it works. In the VOIP reporting business we live and die by the accuracy of the CDR. In VOIP there is no association equivalent of Visa or MasterCard that sets the rules or arbitrates the discrepancies. It’s always going to be “on you”. Therefore it’s always important to get the data from the switch. RDBMS people like to call the transaction ACID. Something very similar applies here.

In the basic Asterisk installation there are a number of ways to get the CDRs from the system. You can export them directly into flat files or directly into one of several brands of SQL databases like SQLite3. The problem with this approach is that the database is expensive in terms of resources and the flat file is inefficient because it’s one big file. This is additionally cumbersome when you’re trying to report and monitor in realtime.

My strategy is twofold. (1) Export the CDRs to a small flat file and switch to a new flat file once a minute. (2) Then send each completed flat file to the dashboard server for processing. This is surprisingly efficient and it allows the system to continue to process calls if the dashboard is rebooting or in maintenance mode.
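A minimal sketch of the once-a-minute flat file idea, in Python for illustration only (the production code described later is C and Ruby). The function names, the `cdr-YYYYMMDD-HHMM.tsv` naming pattern and the use of UTC are all assumptions, not the layout actually used on these servers.

```python
import os
import time


def cdr_path(base_dir, epoch_seconds):
    """Return the flat-file path for the minute containing epoch_seconds.

    The file name pattern is hypothetical; any name that changes once a
    minute would do.
    """
    stamp = time.strftime("%Y%m%d-%H%M", time.gmtime(epoch_seconds))
    return os.path.join(base_dir, "cdr-" + stamp + ".tsv")


def append_cdr(base_dir, fields, epoch_seconds=None):
    """Append one tab-separated record to the current minute's file."""
    if epoch_seconds is None:
        epoch_seconds = time.time()
    path = cdr_path(base_dir, epoch_seconds)
    with open(path, "a", encoding="utf-8") as handle:
        handle.write("\t".join(str(f) for f in fields) + "\n")
    return path
```

With a scheme like this, any file whose minute has already passed is no longer being written and can be shipped to the dashboard whenever the dashboard is reachable.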

While this approach has been wildly successful there is still some room for improvement. The first improvement went live today.

Today’s challenge: When Asterisk receives an incoming call it authenticates the source and then tries to locate a route for the call (or the destination). The routing of the call takes place in a file called extensions_custom.conf. In this file you’ll see some “code” that is more of a macro or script than an actual programming language. This macro tells Asterisk what to do with the incoming call and, at the end of the call, to hang up. There are some other more complex functions like interactive voice prompts and voicemail but we’re just interested in routing. When the call completes we have to initiate a “hangup” and then we need to record the CDR.

So based on the approach above when the call was terminated (hangup) control would be passed to a 3rd layer script (through the AGI interface). This script could be written in any language and it would collect all of the data from the call and append the CDR to the flat file.

So let’s review:

  1. call initiated
  2. authenticate the call source
  3. check the extensions config for an appropriate route
  4. when the call is complete hangup
  5. send the CDR to a PHP script through the AGI interface

Step #5 looks like this:

exten => _X.,n,Hangup
exten => h,1,Set(CDR(userfield)=Hangupcause:${HANGUPCAUSE} Qos:${RTPAUDIOQOS})
exten => h, n, AGI(cdr_new.php, ${SIPCALLID}, ${CDR(dcontext)}, ${SUPPLIER},${CDR(start)}, ${CDR(duration)}, ${CDR(billsec)}, ${CDR(disposition)}, ${HANGUPCAUSE}, ${V_NETWORK}, ${CDR(lastapp)}, ${DEST})

The module that was replaced was “cdr_new.php”. The new module took the same parameters and was called “cdr_pub”.
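For readers who have not written an AGI script: when Asterisk launches one, it writes a block of `agi_key: value` lines to the script's standard input, ended by a blank line, and the dialplan arguments show up there as `agi_arg_1`, `agi_arg_2`, and so on (as well as on the command line). The sketch below, in Python for illustration, parses that header. The function names are hypothetical and this is not the code of cdr_new.php or cdr_pub.

```python
def read_agi_environment(stream):
    """Read 'agi_key: value' lines until the blank line that ends the header."""
    env = {}
    for raw in stream:
        line = raw.rstrip("\r\n")
        if line == "":
            break  # a blank line terminates the AGI environment block
        key, sep, value = line.partition(":")
        if sep:
            env[key.strip()] = value.strip()
    return env


def agi_arguments(env):
    """Return agi_arg_1, agi_arg_2, ... in order as a list."""
    args = []
    index = 1
    while "agi_arg_%d" % index in env:
        args.append(env["agi_arg_%d" % index])
        index += 1
    return args
```

In a real script the stream would be `sys.stdin`; the argument order would follow the AGI() line above, starting with the SIP call id.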

The problem with the original PHP code was that it processed the incoming data, then created the target filename… and then opened the target in order to append a record. It’s been working great but we are at a point where we might be losing some CDRs. (This is not definitive, just intuition.) With 5000 channels running there can be as many as 5000 instances of the routing process. That means when 5000 calls terminate at once there is a rush to append their CDRs. It’s simply not efficient for PHP to block when appending to the file. Not to mention that there is a lot of overhead for the PHP interpreter to load with each call completion.

The performance issues:

  • the latency to load php with each call completion
  • the possible deadlocks when more than one process tries to append to the same file at the same time. Blocking and resolution are not guaranteed.

The new plan. I rewrote the PHP script in C. Even with the few libraries I needed it’s not more than 20 or 30K. Since it’s native C it loads very fast. This program gets all of the data from the AGI in the form of command line parameters and data on STDIN. Then, instead of rushing to append the data to a file, the small program sends or “publishes” the CDR to a redis pub/sub queue. There is a single, external application that subscribes to the redis queue, and when a message event arrives that external app writes the CDR to the flat file. Since there is only one external app appending to the flat file it cannot have the same contention problems.
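The single-writer pattern can be sketched without Redis at all. In the Python illustration below an in-process `queue.Queue` stands in for the redis pub/sub channel and threads stand in for the per-call publisher processes; that substitution, and every name in the block, is an assumption made to keep the example self-contained. It is not the C publisher or the Ruby subscriber.

```python
import queue
import threading

_STOP = object()  # sentinel telling the writer to shut down


def publish(cdr_queue, fields):
    """Publisher side: hand the record off and return immediately."""
    cdr_queue.put("\t".join(str(f) for f in fields))


def run_writer(cdr_queue, path):
    """Subscriber side: the only code that ever opens the flat file."""
    with open(path, "a", encoding="utf-8") as handle:
        while True:
            record = cdr_queue.get()
            if record is _STOP:
                break
            handle.write(record + "\n")
            handle.flush()


def demo(path, calls=50):
    """Simulate many calls hanging up at once with one writer draining them."""
    cdr_queue = queue.Queue()
    writer = threading.Thread(target=run_writer, args=(cdr_queue, path))
    writer.start()
    publishers = [
        threading.Thread(target=publish,
                         args=(cdr_queue, ["call-%d" % i, "ANSWERED"]))
        for i in range(calls)
    ]
    for thread in publishers:
        thread.start()
    for thread in publishers:
        thread.join()
    cdr_queue.put(_STOP)  # every record is queued ahead of the sentinel
    writer.join()
```

The publishers never touch the file, so nothing contends for it. One difference worth keeping in mind: this in-process queue holds records until the writer takes them, whereas Redis pub/sub only delivers to subscribers that are connected at the moment of the publish.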

One side note. If the publisher fails then the message event is posted to the syslog. And if the subscriber fails to append to the flat file then it also posts an event onto syslog. If something goes horribly wrong (with the exception of disk space) then we should have a chance to replay the calls in the dashboard by scrubbing the syslog file.
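Scrubbing the syslog for lost CDRs could look something like the Python sketch below. The `CDR_FAILED ` marker and the pipe-delimited payload are invented for the example; the actual message format the publisher and subscriber write is not shown in this post.

```python
def recover_cdrs(syslog_lines, marker="CDR_FAILED "):
    """Return the payloads of lines that contain the (hypothetical) marker."""
    recovered = []
    for line in syslog_lines:
        _, found, payload = line.rstrip("\n").partition(marker)
        if found:
            recovered.append(payload)
    return recovered
```

Everything after the marker is treated as the original record, so unrelated syslog lines are simply skipped.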

PS: one side note. This configuration also limits the number of simultaneous channels. Therefore if the CDR recording process blocks for any reason, that will prevent the system from accepting the next call when the system is running at capacity for that source.

PS: the subscriber app was written in Ruby. Installing Ruby on my production Asterisk server was not my first choice but it was worth it. The Ruby code was compact and it handled exceptions nicely. There were some idioms that I liked a little more than Python. Because some of the development took place in Ruby 1.9.3 and the default version on the server was 1.8.7, I did have some challenges getting it to run and I needed to install some additional packages… which as a side note confirms all of my previous beliefs about full stack awareness.

PS: One last note. When deciding on the publisher implementation, and after initially abandoning C because it lacked a JSON library that made sense, I tried Go and then considered Java, other JVM-based languages and several dynamic languages… In the end C was the only choice because of its size, load latency and runtime.

 
Posted by on 2012/03/10 in nosql, ProgLang, VOIP

 

Loading CDRs into MongoDB

Sweet. This was as slick as you’d expect.

The task was to load 235529 records from 100+ CDR files into MongoDB using the mongoimport tool. I used a Rackspace server with 512 MB of RAM and a 20 GB disk… but it’s all virtual anyway.

Here are the numbers (not scientific at all):

  • 1m 10s – with verbose turned on
  • 34s – with verbose turned off

I’m certain that some portion of the latency with verbose on is that the console was remote and so there was some lag in the i/o across the internet.

The import:
$ . ./bin/cdrmongoimport.sh
connected to: 127.0.0.1
dropping: data.cadb
			30700	10233/second
			57500	9583/second
			85000	9444/second
			113600	9466/second
			144000	9600/second
			170700	9483/second
			197200	9390/second
			223600	9316/second
imported 235529 objects

Just to be sure I checked that all of the data was loaded… some people have been complaining that data has been lost.

$ wc -l /tmp/20110515/*
. . .(snip). . .
  235529 total

And then I checked the count on mongo.

$ ./mongo/mongo
MongoDB shell version: 1.9.0
connecting to: test
> use data
switched to db data
> db.cadb.find().count();
235529
> 
bye

So everything is exactly where it needs to be in terms of performance. With any luck the loading is going to be linear, so if I loaded 20M records I could expect it to take about 40 minutes.
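The 40 minute figure is a linear extrapolation, and it is easy to check against the numbers above. The small Python sketch below (the function name is just for illustration) assumes the rate stays constant, which is exactly the "with any luck" part.

```python
def estimated_minutes(total_records, sample_records, sample_seconds):
    """Linearly extrapolate load time from a measured sample."""
    rate = sample_records / sample_seconds  # records per second
    return total_records / rate / 60.0


# Whole run: 235529 records in 34 seconds (verbose off).
whole_run = estimated_minutes(20_000_000, 235529, 34)

# Steady-state rate reported by mongoimport: roughly 9400 records/second.
steady_state = estimated_minutes(20_000_000, 9400, 1)
```

The whole-run timing gives roughly 48 minutes and the steady-state rate gives roughly 35 minutes, so "about 40 minutes" sits between the two.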

What is interesting here… is that 40 minutes of loads all at once would normally cause a SQL/RDBMS to burp as the locks were escalated and as indexes needed rebalancing etc. This is one of the main reasons why DBAs prefer to bulk load the initial data into temp tables before moving it into its final resting place. And why Postgres supports partitioned tables that can be temporarily detached while the import takes place.

[update]

I decided to try loading a similar range of files remotely over the WAN. It got off to a slow start but then it got to about 75% of the performance that “on the same box” did… and this was through an encrypted tunnel.

rbucker@klub:~$ . ./bin/cdrmongoimport.sh 
connected to: 127.0.0.1
dropping: data.cadb
dashboard@cadb.bigbllc.com's password: 
			100	33/second
			23100	3850/second
			48700	5411/second
			80300	6691/second
			106500	7100/second
			131700	7316/second
			158200	7533/second
			183200	7633/second
imported 187600 objects

 
Posted by on 2011/06/17 in Tools, VOIP

 

Reported my First Bug to MongoDB

I have a client that generates several million Asterisk CDRs (call detail records). These CDRs are not perfect. In fact they are formatted as TSV rather than CSV, and each record has a leading TAB character. Since the CDRs are generated in 5 minute intervals and the files contain a few thousand CDRs it does not make sense to load the DB a record at a time. It actually makes more sense to bulk load so that the data is processed at as low a level in the DB engine as possible.

My first attempt to load data into MongoDB failed. The data was all askew. The problem is/was the leading TAB in the TSV file. The first column of my data was mostly empty, so most records began with a TAB character. And during the normal processing of the input file the import utility was stripping all leading whitespace regardless of the filetype.

Since TAB counts as whitespace it was deleted before the record was processed, which shifted every remaining field one column to the left.
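The effect is easy to reproduce. The Python sketch below is only an illustration of the class of bug, not the mongoimport source and not my patch; both function names are made up for the example.

```python
def parse_tsv_buggy(line):
    """Strip all surrounding whitespace first, which also eats a leading TAB."""
    return line.strip().split("\t")


def parse_tsv_fixed(line):
    """Remove only the line terminator so an empty first column survives."""
    return line.rstrip("\r\n").split("\t")


record = "\t2011-06-17\tANSWERED\n"  # first column is empty
buggy = parse_tsv_buggy(record)      # ['2011-06-17', 'ANSWERED']
fixed = parse_tsv_fixed(record)      # ['', '2011-06-17', 'ANSWERED']
```

The buggy version returns two fields instead of three, so every value lands under the wrong column name.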

So I did what any open source guy would do. I opened a ticket. Fixed the bug. And presented my patch in the ticket. I hope they will accept it.

 
Posted by on 2011/06/17 in beta, database, nosql, Tools

 
