Statistics on the length and linguistic complexity of bills

  Where would you go to find out what the longest bill of the 112th Congress was by number of sections (H. R. 1473)?  How about by number of unique words (H.R. 3671)?  What about by Flesch-Kincaid reading level  (S. 475)?

  Head on over to this table of bills, updated daily for the 112th Congress, which contains the following fields:

  • Bill Name
  • Publish Date
  • Bill Title
  • Stage
  • Section Count
  • Sentence Count
  • Word (Token) Count
  • Unique Words (Tokens)
  • Unique Stem Count
  • Avg. Word Length
  • Avg. Sentence Length
  • Reading Level (Flesch-Kincaid)

I’ll be adding more automated analysis and figures over the next few weeks, but for now, here’s a morsel to get your gears turning.

Posted in Law, Programming, Research | Tagged , , , , | 5 Comments

Hash collision attacks hose Oracle Transportation Management

Ars had a great write-up a few weeks ago about the huge hash resource starvation attack that was uncovered awhile back (when I say awhile back, I mean from 2003).  The press has been focused on the impact for a relatively small number of the affected vendors; this is not only somewhat unfair, but also leads many to ignore that the issue may hit closer to home.

If you want to avoid script-kiddieism and have had a basic introduction to algorithms and O-notations, I strongly recommend reading the whitepaper Crosby-Wallach Usenix document.  If you just want to test your stack for issues, however, there’s a simple Python script you can grab to fire away at yourself on Github.  I pointed this at a sandbox OTM 6.2.3 installation, and, after a few sample payloads, struck gold.

INFO | 2012/02/01 17:25:11 | SEVERE: All threads (200) are currently busy, waiting. Increase maxThreads (200) or check the servlet status
INFO | 2012/02/01 17:25:33 | [INFO ][memory ] [YC#20] 682796.926-682796.970: YC 4173305KB->2485427KB (4915200KB), 0.044 s, sum of pauses 43.666 ms, longest pause 43.666 ms.
INFO | 2012/02/01 17:25:40 | Feb 1, 2012 5:25:40 PM org.apache.jk.common.ChannelSocket processConnection
INFO | 2012/02/01 17:25:40 | WARNING: processCallbacks status 2

 

I let the services continue to run, even after HTTP processes had died.  The web/app tier were unable to recover gracefully.  Moral of the story – just because Oracle didn’t release any notes specific to OTM doesn’t mean you don’t need to address the issue.

Posted in Technology | Tagged , , | Leave a comment

Now hosted on EC2

  After a few days of configuration and testing, it’s official – michaelbommarito.com and computationallegalstudies.com will now be hosted on an AWS EC2 instance with S3 acting as CDN for large files.  If you notice any issues with performance or see any missing images, make sure to shoot me an email.

Posted in Technology | Tagged , | Leave a comment

Visual Summary of #jan25 Twitter Activity

  Last year, I covered a number of the so-called “Twitter protests” in China (#cn220), Iran (#25bahman), and Algeria (#fev12).  Since these protests began in January 2011, the Arab Spring has claimed many members of both ruling and revolting groups – Mubarak in Egypt, Gaddafi in Libya, Ben Ali in Tunisia, Saleh in Yemen, and countless civilian and military casualties throughout the region.

  Despite this seemingly stark year-over-year contrast, some aspects remain the same.  One year after the original #jan25 movement that began in Egypt, protesters again took to Twitter to air their discontent.  In order to document this movement, I used my Python tweet archiving script to record all tweets over a window leading up to and after the January 25th event.  The final dataset has the following properties:

  • First tweet: 20 Jan 2012 02:20:58
  • Last tweet: 27 Jan 2012 14:12:36
  • 33,997 unique Twitter handles
  • 193,935 tweets

  To summarize these tweets, I produced the figures below to highlight who tweeted, when they tweeted, and how the tweets built user-to-user networks. The first plot below lists the 30 highest frequency users on the #jan25 tag.

  This second figure displays the frequency of #jan25 tweets; each bar represents a five-minute interval, and the color of each bar indicates its intensity relative to other intervals in the sample.

  This last figure is a network visualization of the 3-core subgraph of the giant component of the Twitter user mention/RT graph formed by this dataset.  This simplified visualization of the graph contains 9,914 users and 62,631 edges, whereas the entire network is made up of 31,805 users and 89,554 edges.

  While there are a number of trivial observations that could be stated about these three figures, I’ll refrain.  Instead, I’ll be working on answering a more interesting question – how does one “movement” compare to another?  Stay tuned for upcoming analysis ranking #jan25 against other Twitter movements.

Posted in Research, Society | Tagged , , , , | 3 Comments

Network of foreign key constraints in Oracle Transportation Management

  If you’ve ever worked under the hood with OTM, you’ll know that there’s some pretty serious normalization.  As I recently explained, deleting a single order can involve traversing over a hundred tables.  So, what does this network of table relationships look like?  After some quick SQL and a few minutes in Gephi, we get the visualization below:

OTM foreign key constraint network.

  If you’re interested, download a PDF copy of this OTM foreign key network here.

Posted in Programming, Research | Tagged , | Leave a comment

Debugging ORA-02292: integrity constraint (OWNER.CONSTRAINT) violated – child record found

  Working with Oracle databases can be daunting to developers and analysts who lack a deep understanding of relational models and SQL.  One excellent example of a mystifying pitfall is the dreaded ORA-02292: integrity constraint error.  This error often occurs when users are attempting to DELETE rows in a well-normalized schema.  For example’s sake, I’ll focus on Oracle Transportation Management (OTM), which is where I’ve spent much of my time over the last six months.

  Given the T in OTM, configurators and analysts often focus on orders and shipments.    After discovering the backdoor SQL Servlet or connecting via SQL Developer, new configurators and analysts may think they can simply delete rows from the ORDER_RELEASE or SHIPMENT table to remove unwanted data.  This often goes something like this…

> DELETE FROM order_release
    WHERE order_release_gid = 'CUSTOMER.ORDER_NUMBER'

Error starting at line 1 in command:
DELETE FROM order_release WHERE order_release_gid = 'CUSTOMER.ORDER_NUMBER'
Error report: SQL Error: ORA-02292: integrity constraint (GLOGOWNER.FK_SSUL_OR_GID) violated - child record found
02292. 00000 - "integrity constraint (%s.%s) violated - child record found"
*Cause: attempted to delete a parent key value that had a foreign dependency.
*Action: delete dependencies first then parent or disable constraint.

  Oops.  The database is telling us that there are records in a different table in the database that point to this record.  If we delete this record, the broken reference will confuse the database. But how do we know where the reference is?  There are nearly 2000 tables in OTM, so looking through them one-by-one isn’t an option…

  At this point, most analysts will learn not to try direct SQL deletions for tables.  If you ask, they’ll tell you it’s not possible to do outside of the OTM UI.  While it’s true that the UI or application should be used for most common deletion tasks, this misunderstanding of database constraints and normalization can promulgate into broader misunderstandings of feasibility and relational models.  So let’s back up and understand what actually went on in this ORA-2292 error above.  How do we know where the dependencies on ORDER_RELEASE are?

  Oracle databases contain a special table that stores all constraints in the database, conveniently named ALL_CONSTRAINTS.  Looking at ALL_CONSTRAINTS, you’ll see that the first two fields are OWNER and CONSTRAINT_NAME.  To find the constraint violated in ORA-2292 above, look for the following portion of the message – “integrity constraint (GLOGOWNER.FK_SSUL_OR_GID) violated.”  In this error string, GLOGOWNER corresponds to the constraint owner, and FK_SSUL_OR_GID corresponds to the constraint name.  So, where does this get us?

> SELECT owner, constraint_name, constraint_type, table_name, r_owner, r_constraint_name
    FROM all_constraints
    WHERE owner='GLOGOWNER'
      AND constraint_name='FK_SSUL_OR_GID';

owner     | constraint_name | constraint_type | table_name       | r_owner   | r_constraint_name
GLOGOWNER | FK_SSUL_OR_GID  | R               | S_SHIP_UNIT_LINE | GLOGOWNER | PK_ORDER_RELEASE

  OK, but what does this mean? In a nutshell, the table S_SHIP_UNIT_LINE has a column that must match up with the PK_ORDER_RELEASE constraint…yes, another constraint to look up.

> SELECT owner, constraint_name, constraint_type, table_name, r_owner, r_constraint_name
    FROM all_constraints
    WHERE owner='GLOGOWNER'
      AND constraint_name='PK_ORDER_RELEASE';

owner     | constraint_name  | constraint_type | table_name       | r_owner   | r_constraint_name
GLOGOWNER | PK_ORDER_RELEASE | P               | ORDER_RELEASE    |           | 

 > SELECT * FROM all_cons_columns
  WHERE owner='GLOGOWNER'
    AND constraint_name = 'PK_ORDER_RELEASE';

owner     | constraint_name  | table_name    | column_name       | position
GLOGOWNER | PK_ORDER_RELEASE | ORDER_RELEASE | ORDER_RELEASE_GID | 1

  This one isn’t so bad – PK constraints in OTM’s schema always map to the corresponding GID, or unique identifier for a record.  So, to summarize the ORA-2292 error above, we can’t delete the record from ORDER_RELEASE because a record from S_SHIP_UNIT_LINE is tied to that ORDER_RELEASE_GID.

  Now, if that makes this seem easy, stop yourself – it’s not.  This is just one of many possible integrity constraints that an ORDER_RELEASE may have.  Let’s try to summarize the above process into a single query.  Given a constraint, how would we find the tables and columns that are tied together?

> SELECT ac.owner AS left_owner, ac.constraint_name AS left_name, ac.table_name AS left_table, acc.column_name AS left_column, acc.position AS left_position, acr.owner AS right_owner, acr.constraint_name AS right_name, acr.table_name AS right_table, accr.column_name AS right_column, accr.position AS right_position
  FROM all_constraints ac
  JOIN all_cons_columns acc ON  ac.constraint_name=acc.constraint_name
  JOIN all_constraints acr ON ac.r_constraint_name=acr.constraint_name
  JOIN all_cons_columns accr ON acr.constraint_name=accr.constraint_name
  WHERE ac.owner='GLOGOWNER'
    AND ac.constraint_name='FK_SSUL_OR_GID';

LEFT_OWNER | LEFT_NAME      | LEFT_TABLE       | LEFT_COLUMN       | LEFT_POSITION | RIGHT_OWNER | RIGHT_NAME       | RIGHT_TABLE   | RIGHT_COLUMN      | RIGHT_POSITION
GLOGOWNER  | FK_SSUL_OR_GID | S_SHIP_UNIT_LINE | ORDER_RELEASE_GID | 1             | GLOGOWNER	 | PK_ORDER_RELEASE | ORDER_RELEASE | ORDER_RELEASE_GID | 1

  Great! While the query isn’t small, the only parts you should need to change are the two final WHERE constraints.  The output is easy to understand – LEFT_TABLE.LEFT_COLUMN references RIGHT_TABLE.RIGHT_COLUMN.

  For extra credit, how would we find every such dependency between a table and other tables?  While this may seem much harder than the original query, we’ve already joined the relevant table.

> SELECT ac.owner AS left_owner, ac.constraint_name AS left_name, ac.table_name AS left_table, acc.column_name AS left_column, acc.position AS left_position, acr.owner AS right_owner, acr.constraint_name AS right_name, acr.table_name AS right_table, accr.column_name AS right_column, accr.position AS right_position
  FROM all_constraints ac
  JOIN all_cons_columns acc ON  ac.constraint_name=acc.constraint_name
  JOIN all_constraints acr ON ac.r_constraint_name=acr.constraint_name
  JOIN all_cons_columns accr ON acr.constraint_name=accr.constraint_name
  WHERE acr.table_name='ORDER_RELEASE';

...
(37 rows returned)

  There – 37 first-degree constraints on the ORDER_RELEASE table. Now, if only each of these 37 tables didn’t have their own integrity constraints…but more to come on this in a later post! In the meantime, a few things to remember or bookmark:

  • While I’ve focused on the OTM example in this post, the processes in this post are applicable to any Oracle database.
  • To learn more about the constraint and constraint column tables, read the documentation: ALL_CONSTRAINTS ; ALL_CONS_COLUMNS.
  • These constraints can extend “recursively” through tables.  For example, the S_SHIP_UNIT_LINE table depends on the PK_S_SHIP_UNIT_LINE table.
  • FKs build a real, directed network!  For real-world data models, these networks can be substantial and complex.
Posted in Programming | Tagged , , , , | 1 Comment

Saving memory in redis and python with struct.pack

  In redis, every object is either a binary-safe string or collection thereof.  Even if you’re storing numbers from your client library or using the INCR/DECR counter within redis, the actual values are stored in memory as strings.  In some cases, this may be fine; however, if you are storing many numbers, whole or decimal, this may add up.

  If you’re using python to store and retrieve data from redis, there’s an easy way to save loads of memory – the built-in struct library.  struct handles converting python types to binary representations of underlying C types.  In the case of numbers, this means that we can quickly and painlessly convert that sequence of characters into the underlying binary representation, which is quite often smaller.  The simple pattern below shows an example of handling non-negative whole numbers.

for number in numbers:
	if number < 2**8:
		rp = struct.pack('B', number)
	elif number < 2**16:
		rp = struct.pack('H', number)
	elif number < 2**32:
		rp = struct.pack('I', number)
	else:
		rp = struct.pack('L', number)

  In a simple test of 1M integers uniformly distributed between 1 and 10^5, this pattern resulted in a 45% reduction in memory usage. While there’s no hard-and-fast rule, I’ve had at least 25% reductions in almost every real-world application of this technique. YMMV, but given the ease of this hack and the cost of memory, it’s definitely worth a shot!

(If you have a better handle on the range or structure of data and aren’t afraid of math, you should also look at rolling your own encoding with GETRANGE or GETBIT. More to come on this in a future post.)

Posted in Programming, Research | Tagged , , , | Leave a comment

Building, configuring, and benchmarking redis from Github source

  Part of blogging for myself is making notes about process that may or may not garner a wide audience.  The first of these notes will cover the process to build, configure, and benchmark redis that I’ve put together.

Building redis

  The first step is to obtain and build redis.  Since I regularly use both 32-bit and 64-bit versions of redis, I keep these source repositories and installation paths distinct.  The process looks like this:

mjbommar@verde:~/src$ git clone https://github.com/antirez/redis redis32
mjbommar@verde:~/src$ cd redis32/
mjbommar@verde:~/src$ make 32bit
mjbommar@verde:~/src$ sudo make PREFIX=/opt/redis32 install
mjbommar@verde:~/src$ sudo mkdir /opt/redis32/etc
mjbommar@verde:~/src$ sudo mkdir /opt/redis32/var/
mjbommar@verde:~/src$ sudo cp redis.conf /opt/redis32/etc/6379.conf
mjbommar@verde:~/src$ git clone https://github.com/antirez/redis redis64
mjbommar@verde:~/src$ cd redis64/
mjbommar@verde:~/src$ make
mjbommar@verde:~/src$ sudo make PREFIX=/opt/redis64 install
mjbommar@verde:~/src$ sudo mkdir /opt/redis64/etc
mjbommar@verde:~/src$ sudo mkdir /opt/redis64/var/
mjbommar@verde:~/src$ sudo cp redis.conf /opt/redis64/etc/6379.conf

Configuring redis

  Once redis has been installed to /opt/redis32 and /opt/redis64, I next tweak some of the configuration parameters in /opt/redis32/etc/6379.conf and /opt/redis64/etc/6379.conf.  Below is the list of changes, which make it easier to maintain multiple redis instances on one host for my typical usage.  Note that this configuration disables automatic saving; I prefer to let the application or user handle save scheduling, as the I/O and memory cost of SAVE and BGSAVE is significant.

  • daemonize yes
  • pidfile /opt/redis32/var/6379.pid
  • logfile /opt/redis32/var/6379.log
  • databases 2
  • Comment out all save lines
  • dbfilename 6379.rdb
  • dir /opt/redis32/var
  • appendfsync no
  • Tweak the hash-max-*, list-max-*, set-max-*, and zset-max-* variables if known a priori for problem.

Running and benchmarking redis

  Once we have redis installed and configured, it’s time to run and benchmark.  This stage looks like the following:

mjbommar@verde:~/src$ cd /opt/redis64
mjbommar@verde:/opt/redis64$ sudo nice -n -10 bin/redis-server etc/6379.conf

  At this point, the server should be running. We can check in (at least) two ways:

mjbommar@verde:/opt/redis64$ ps aux | grep redis
mjbommar@verde:/opt/redis64$ bin/redis-cli INFO

  And it’s finally time to benchmark. While redis-benchmark runs a whole suite of redis commands, I find just GET and SET to be enough to test the host hardware and compiler.

mjbommar@verde:/opt/redis64$ bin/redis-benchmark -t GET,SET
====== SET ======
  10000 requests completed in 0.07 seconds
  50 parallel clients
  3 bytes payload
  keep alive: 1

100.00% <= 0 milliseconds
138888.89 requests per second

====== GET ======
  10000 requests completed in 0.06 seconds
  50 parallel clients
  3 bytes payload
  keep alive: 1

100.00% <= 0 milliseconds
153846.16 requests per second

  If you want machine readable or even less verbose output, you can also pass the -q flag:

mjbommar@verde:/opt/redis64$ bin/redis-benchmark -q -t GET,SET
SET: 149253.73 requests per second
GET: 151515.16 requests per second

  Note that there are better results to be had if you can afford to use Unix/IPC sockets instead of TCP; while you can’t cluster/shard across machines using Unix sockets, these TCP loopback results are more than 25% slower than using Unix sockets.

Happy redis-ing!

Posted in Programming, Technology | Tagged , | Leave a comment

Blogging for myself, not you.

  In between writing the wrong year on documents, I’ve been reflecting on this blog.  Specifically, I’ve wondered whether the frequency and content of the blog work for me.  When I look at my workspace, it’s clear that I’ve had no shortage of started posts; however, it’s also clear that much of what I start doesn’t make it onto this page.  The primary issue is that I worry about audience perception – too much, too little, too technical, too too vague.

  And so, I’ve come up with a solution – to blog for myself sometimes, not just for you.  While I still enjoy putting together longer-form essays or multi-part technical pieces, I’d like this blog to get a little bit closer to the first word of the tagline – thoughts – whether or not they’re well thought.

Posted in Personal | 1 Comment

Building Legal Language Explorer: Interactivity and drill-down, noSQL and SQL

  Dan and I recently released a new legal informatics project with a few colleagues. The project, which we’ve named the Legal Language Explorer, provides an interface similar to Google Ngrams Viewer for the U.S. Supreme Court. Unlike Google’s viewer, however, the Legal Language Explorer also allows users to drill-down into case-level information for each n-gram. The technical architecture that allowed us to robustly provide both of these worlds wasn’t simple, but I felt that the experience and result are worth sharing.

Legal Language Explorer

Legal Language Explorer

  Before I get into the technical aspects, I wanted to mention one especially important feature of the project – the low cost of failure in interaction. By engineering a system where the downside of bad choices is small relative to the payoffs, people are much more willing to experiment. Sure, that search for “gorilla” was a dud, but it only took 50ms. Before disappointment can even register, you’ve already gone on to “mickey mouse” or “Kant.” This may seem like an unimportant point, but when search spaces are large, exploring them requires asymptotically small costs.  For Legal Language Explorer, this low cost is a direct consequence of combining two technologies – SQL and noSQL.

(Aside: While SQL and noSQL are often viewed as antithetical, I think this is a product of a confused conversation. Our real goal should be to store or process data efficiently. noSQL is a response to the frustrations of attacking problems like graph traversal, document store, or tall-and-skinny numeric with traditional, row-store SQL.  The world isn’t so black-and-white that we can generalize well, however.  For example, SQL databases like Vertica can compete with noSQL databases like kdb for financial data, and SQL databases that support WITH RECURSIVE can compete with neo4j on shallow traversals.)

noSQL

  The front page of Legal Language Explorer displays a time series plot of usage for “interstate commerce,” “railroad,” and “deed” between 1791 and 2005.   This plot contains 645 (year, n-gram) values and represents a total of 103,285 occurrences.  If your experience is like most people’s, the plot has loaded before the page text has even rendered.  Is it just a cached PNG?  Is the data hard-coded into the page?

  The answer lies in redis, a small, neatly written key-value store that many refer to as a noSQL database (to me, redis is more of an in-memory data structure server than a database).  Regardless of what you call it, redis excels at providing blazingly fast read and write access to common structures.  It does this by keeping all data live in memory, organized for efficient access.  Here are some basic facts and figures on our redis instance:

  • redis Architecture: Single 64-bit instance
  • Memory: 12GB running (4GB dump)
  • Keyspace: 103,281,603 keys, up to 214 fields per key
  • Average Keyspace Hit Rate: 77.7%

  Despite the size of these numbers, the HGETALL query that drives the Legal Language Explorer plot returns in under 50ms for every query we’ve tested, even under load from  100 concurrent requests. Could you deliver a solution like that in a traditional RDBMS?

SQL

  If you find an n-gram that you want to learn more about, Legal Language Explorer allows  you to drill down and view the list of all cases that n-gram occurs in.  This table provides the full citation, year, and title for every case.  Since there are often many parties with long names, storing this in redis would be expensive!  What do we do?

  For drill-down, we rely on a traditional, row-store database; in this case, postgres.  The recipe is fairly straightforward.  Populate a (somewhat) normalized database with a case table, and an occurrence table, where occurrences are pairs of (ngram, case_id).  Throw in an index on the n-gram, and you’re done.  The final product looks something like this:

  • SELECT COUNT(*) FROM occurrences: 298,471,830
  • Indices: 10GB
  • Total Tablespace: 25GB

  When we serve up data to the user, we SELECT on occurrences, constraining by n-gram, and JOIN the matching cases.  (Would normalizing the n-grams help us get to 3NF?  Yes, but it didn’t help us conserve CPU while competing with redis.)

  The end result is that most queries are returned in under 10 seconds.  This is great news for everyone who wanted a list of Supreme Court opinions referencing Elvis Presley.  Search away!

Wrapping it up

  Legal Language Explorer allows users to explore n-grams in the Supreme Court corpus at the macroscopic and microscopic level.  This dual mandate is met through a marriage of SQL and noSQL, and we feel that the whole experience is much greater than the sum of the parts.  Over the next few months, we hope to raise enough money to add new data and features to the site, so please stay tuned!

(If you’re really paying attention to the geeky stuff, you may have noticed that I said nothing about the hardware we’re running on.  I didn’t want to draw attention to this, since this post is about software, not hardware.  However, if you have to know, the site is running on an EC2 m2.2xlarge spot instance for about $10/day.  Not bad!)

Posted in Law, Programming, Research, Technology | Leave a comment