Text File Databases: Part 1

October 19, 2010

There is a lot of data stored in plain-ascii text files consisting of records separated by newlines, each record consisting of multiple fields, and it is useful to have a function library for dealing with them. This exercise looks at some functions for reading the data; the next exercise will look at some functions for processing the data.

We will consider four common types of text file databases. A file with fixed-length data fields has records of a fixed number of characters, each record containing fields that are similarly in fixed positions; the data may be preceded by a fixed-length header. A file with character-delimited fields has variable-length records, each with fields separated by a single-character delimiter; the delimiter is often a tab or vertical bar. A particular type of variable-length delimited text database is known as comma-separated values, where the delimiter is a comma and fields may be surrounded by double-quote characters so that a comma within a quoted field loses its meaning as a field separator; in that case, a literal double-quote character may appear within a quoted field as two double-quote characters in succession. The fourth type that we will consider is a name-value record, where each record consists of multiple fields, one field per line, separated by blank lines, each field consisting of a type-name and a value separated by a delimiter; this format is often used for databases that have many optional fields, such as bibliographic databases.

We want reader functions for each of these file formats that all return a single record each time they are called, or an end-of-file marker when the input is exhausted, and advance the file pointer to the beginning of the next record. The return value should be a list or array, whichever is convenient, containing the value of one field in each element, except for the name-value record, which should return a list of name/value pairs.

Different operating systems have different methods of signalling the end of a line. For maximum portability, your functions should accept the end of a line indicated by a carriage return, a line feed, or both characters in either order. You should be prepared to accept any type of line marker because the data may come from any source; for instance, your computer running MS Windows with a CRLF line marker may fetch data from a Linux computer with a bare LF for the line marker. You should also accept the final line in the file whether or not it has a trailing line marker.

Your task is to write functions to read one record from each of the four file types described above. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2

Greplin issued a programming challenge recently that required programmers to solve three problems; when completed, Greplin issued an invitation to send them a resume. The first problem required the programmer to find the longest palindrome in the following 1169-character string:


Fourscoreandsevenyearsagoourfaathersbroughtforthonthisconta
inentanewnationconceivedinzLibertyanddedicatedtotheproposit
ionthatallmenarecreatedequalNowweareengagedinagreahtcivilwa
rtestingwhetherthatnaptionoranynartionsoconceivedandsodedic
atedcanlongendureWeareqmetonagreatbattlefiemldoftzhatwarWeh
avecometodedicpateaportionofthatfieldasafinalrestingplacefo
rthosewhoheregavetheirlivesthatthatnationmightliveItisaltog
etherfangandproperthatweshoulddothisButinalargersensewecann
otdedicatewecannotconsecratewecannothallowthisgroundThebrav
elmenlivinganddeadwhostruggledherehaveconsecrateditfarabove
ourpoorponwertoaddordetractTgheworldadswfilllittlenotlenorl
ongrememberwhatwesayherebutitcanneverforgetwhattheydidhereI
tisforusthelivingrathertobededicatedheretotheulnfinishedwor
kwhichtheywhofoughtherehavethusfarsonoblyadvancedItisrather
forustobeherededicatedtothegreattdafskremainingbeforeusthat
fromthesehonoreddeadwetakeincreaseddevotiontothatcauseforwh
ichtheygavethelastpfullmeasureofdevotionthatweherehighlyres
olvethatthesedeadshallnothavediedinvainthatthisnationunsder
Godshallhaveanewbirthoffreedomandthatgovernmentofthepeopleb
ythepeopleforthepeopleshallnotperishfromtheearth

Your task is to write a function that finds the longest palindrome in a string and apply it to the string given above. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2

Rotate An Array

October 12, 2010

Today’s exercise is simple but tricky: write a function to rotate the elements of an array. The function takes two arguments: the array to be rotated and the number of elements to rotate, where a positive count rotates to the left and a negative count rotates to the right. For instance, given the array [1 2 3 4 5 6], a rotation of 2 to the left gives [3 4 5 6 1 2], and a further rotation of 2 to the right restores the original array.

You should be sure to handle the edge cases properly. If the count is zero or the length of the array, the array should remain unchanged. If the count is greater than the length of the array, you should still do the right thing; for instance, a rotation of 8 on the array given above gives [3 4 5 6 1 2], the same as a rotation of 2.

Your task is to write the rotate function described above. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2

Zeller’s Congruence

October 8, 2010

Zeller’s Congruence is a simple mathematical method for determining the day of the week for a given date.

In the 1880s, Christian Zeller, a German mathematician, noticed that, if you run a year from March through February, the cumulative number of days in each month forms a nearly straight line; that works because February, which would normally perturb the straight line, is moved to the end. He worked out the formula ⌊(13m−1)/5⌋ to give the number of weekdays that the start of the month moves each month, where m is the month number.

Then it is easy to calculate the day of the week for any given day: add the day of the month, the offset for the number of months since March, an offset for each year, and additional offsets for leap years and leap centuries (remembering to subtract one year for dates in January and February), taking the whole thing mod 7. It’s fun to work out the arithmetic yourself, but if you don’t want to take the time, the whole formula is shown in the solution.

Your task is to write a function that uses Zeller’s Congruence to calculate the day of the week for any given date. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2

In two posts on Usenet, George Marsaglia, a mathematician, statistician, computer scientist, and professor at Washington State and Florida State Universities, described a series of nine high-quality 32-bit random number generators; the references pages describe the various generators and their tasteful use in a variety of situations. Here they are in C:

#define znew  (z=36969*(z&65535)+(z>>16))
#define wnew  (w=18000*(w&65535)+(w>>16))
#define MWC   ((znew<<16)+wnew)
#define SHR3  (jsr^=(jsr<<17), jsr^=(jsr>>13), jsr^=(jsr<<5))
#define CONG  (jcong=69069*jcong+1234567)
#define FIB   ((b=a+b),(a=b-a))
#define KISS  ((MWC^CONG)+SHR3)
#define LFIB4 (c++,t[c]=t[c]+t[UC(c+58)]+t[UC(c+119)]+t[UC(c+178)])
#define SWB   (c++,bro=(x<y),t[c]=(x=t[UC(c+34)])-(y=t[UC(c+19)]+bro))
#define UNI   (KISS*2.328306e-10)
#define VNI   ((long) KISS)*4.656613e-10
#define UC    (unsigned char) /*a cast operation*/
typedef unsigned long UL;

/* Global static variables: */
static UL z=362436069, w=521288629, jsr=123456789, jcong=380116160;
static UL a=224466889, b=7584631, t[256], x=0, y=0, bro;
static unsigned char c=0;

/* Example procedure to set the table, using KISS: */
void settable(UL i1,UL i2,UL i3,UL i4,UL i5, UL i6)
{ int i; z=i1; w=i2; jsr=i3; jcong=i4; a=i5; b=i6;
for(i=0;i<256;i=i+1) t[i]=KISS; }

Personally, I think that code is a bit ugly on the outside but absolutely beautiful and exceedingly elegant on the inside. Marsaglia also provided a test suite, which should print seven zeros on successive lines:

#include <stdio.h>
int main(void){
int i; UL k;
settable(12345,65435,34221,12345,9983651,95746118);

for(i=1;i<1000001;i++){k=LFIB4;} printf("%u\n", k-1064612766U);
for(i=1;i<1000001;i++){k=SWB  ;} printf("%u\n", k- 627749721U);
for(i=1;i<1000001;i++){k=KISS ;} printf("%u\n", k-1372460312U);
for(i=1;i<1000001;i++){k=CONG ;} printf("%u\n", k-1529210297U);
for(i=1;i<1000001;i++){k=SHR3 ;} printf("%u\n", k-2642725982U);
for(i=1;i<1000001;i++){k=MWC  ;} printf("%u\n", k- 904977562U);
for(i=1;i<1000001;i++){k=FIB  ;} printf("%u\n", k-3519793928U); }

Your task is to translate those nine random number generators into your favorite language. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2

Oban Numbers

October 1, 2010

Copyright FXer, license is Gnu FDL, from WikiPediaOban is a small town on the west coast of Scotland, home to a rocky port, a small distillery, and a half-built folly started by a banker with not-enough money and less sense. I fondly remember a visit to Oban in 1987. My mother bought a scarf, a tin of shortbread, and a wee dram of the local whiskey at a small shop on the main street (it may have been one of the shops in the picture, I’m not sure) and was scolded by the waitress for using the wrong fork at dinner. Oban numbers have nothing to do with Scotland, except that the name was the occasion of pleasant reminiscing.

MathWorld describes an oban number as a number from which the letter “o” is missing when it is spelled out in words (the letter “o” is “banned”).

Your task is to make a complete list of all the oban numbers. If your language provides a function that spells out a number in words, don’t use it, but write your own instead. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2

Maxiphobic Heaps

September 28, 2010

Priority queues are a data structure in which items arrive in random order and exit in priority order; they provide the operations find-min, delete-min, insert and merge. We implemented priority queues using leftist heaps in one exercise and using pairing heaps in another exercise. In today’s exercise, we will implement priority queues using the maxiphobic heaps of Chris Okasaki.

Maxiphobic heaps are a variant of leftist heaps. Like leftist heaps, maxiphobic heaps are represented as binary trees arranged according to the heap property that every element is less than or equal to its two children. Find-min looks at the root of the tree, delete-min discards the root and merges its two children, and insert merges the existing tree with a singleton tree containing the new element.

The key to leftist trees and maxiphobic trees is the merge operation. In leftist trees, the rank of each left child is always less than the rank of its sibling, where the rank of a node is defined as the length of its right spine, and the merge operation enforces this invariant. In maxiphobic trees, the merge operation is implemented by comparing the roots of the two trees. The smaller root survives as the root of the merged tree. That leaves three sub-trees: the tree that lost in the comparison of the two roots, and the child sub-trees of the winner. They are merged by first merging the two smaller trees, where the size of a tree is determined as the number of elements it contains, then attaching that merged tree along with the remaining tree as the children of the new root. The name maxiphobic, meaning “biggest avoiding,” refers to the merge of the two smaller sub-trees.

Your task is to write functions that implement maxiphobic trees. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2

Alien Numbers

September 24, 2010

Today’s exercise comes to us from the Google Code Jam via Christian Harms’ blog:

Problem

The decimal numeral system is composed of ten digits, which we represent as “0123456789” (the digits in a system are written from lowest to highest). Imagine you have discovered an alien numeral system composed of some number of digits, which may or may not be the same as those used in decimal. For example, if the alien numeral system were represented as “oF8”, then the numbers one through ten would be (F, 8, Fo, FF, F8, 8o, 8F, 88, Foo, FoF). We would like to be able to work with numbers in arbitrary alien systems. More generally, we want to be able to convert an arbitrary number that’s written in one alien system into a second alien system.

Input

The first line of input gives the number of cases, N. N test cases follow. Each case is a line formatted as

alien_number source_language target_language

Each language will be represented by a list of its digits, ordered from lowest to highest value. No digit will be repeated in any representation, all digits in the alien number will be present in the source language, and the first digit of the alien number will not be the lowest valued digit of the source language (in other words, the alien numbers have no leading zeroes). Each digit will either be a number 0-9, an uppercase or lowercase letter, or one of the following symbols !”#$%&'()*+,-./:;?@[\]^_`{|}~

Output

For each test case, output one line containing “Case #x: ” followed by the alien number translated from the source language to the target language.

Your task is to write a program to solve the Google Code Jam. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2

Kaprekar Numbers

September 21, 2010

Wolfram’s MathWorld describes Kaprekar numbers like this:

Consider an n-digit number k. Square it and add the right n digits to the left n or n-1 digits. If the resultant sum is k, then k is called a Kaprekar number. For example, 9 is a Kaprekar number since 92 = 81 and 8 + 1 = 9 and 297 is a Kaprekar number since 2972 = 88209 and 88 + 209 = 297.

Your task is to write a function that identifies Kaprekar numbers and to determine the Kaprekar numbers less than a thousand. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2

In the previous exercise, we wrote RESIDUE, which implements the first half of Morrison and Brillhart’s factorization algorithm based on continued fractions. In today’s exercise, we complete the factorization of F7 by writing the ANSWER program, which takes the factored Qn and computes a factorization of N.

The math underlying the continued fraction factorization algorithm, which we skipped in the previous exercise, is that we are searching for x and y such that x2y2 (mod N) with xy (mod N). The convergents of the continued fraction expansion of √N have the property that An−1 and Qn, as defined in the previous exercise, will indicate a factor of N whenever Qn is a perfect square, as we saw in the example of Q52 in the expansion of √13290059. But such Qn are rare, which is why the method of Lehmer and Powers never found favor. Much more common is the case that two or more Qn multiply to a perfect square, as they share some odd-power prime factors that, when multiplied together, become an even power.

Consider the example of the previous exercise. The product of Q5, Q22 and Q23 is the perfect square 2050 · 4633 · 226 = 2146468900 = 463302 = (2 · 5 · 41 · 113)2. The product of the corresponding An−1 is 171341 · 5235158 · 1914221 = 1717050890347212038 ≡ 1469504 (mod 13290059), and we have the congruence (171341 · 5235158 · 1914221)2 ≡ (2 · 5 · 41 · 113)2 (mod 13290059) or 14695042 ≡ 463302 (mod 13290059). Thus a factor of N = 13290059 is GCD(1469504 − 46330, 13290059) = 4261 and N = 3119 · 4261.

ANSWER uses the Gaussian elimination algorithm of linear algebra to find a set of Qn (Morrison and Brillhart call it an S-set) with product a perfect square. The primes in the factorizations of the Qn are 2, 31, 41, 43, 53 and 113, and we also need −1, as there is an implied factor of −1 attached to each Qn with odd n. Thus, any Qn can be represented as a vector of the powers of the exponents of its factors, modulo 2; for instance, Q5, with odd-power factors 2 and 41, is represented by the vector [1 1 0 1 0 0 0]. Associated with each exponent vector is a history vector that records our work, which initially has 0 everywhere except 1 in the ordinal position corresponding to the row with which it is associated. Here are the exponent matrix, on the left, and history matrix, on the right, corresponding to the seven Qn of our example problem; the columns of the exponent matrix are factors, left to right from low to high, starting with -1, the columns of the history matrix are Qn, left to right from low to high, and the rows of both matrices are Qn, top to bottom from low to high:

1 1 0 1 0 0 0    1 0 0 0 0 0 0
0 0 1 0 1 0 0    0 1 0 0 0 0 0
0 0 0 1 0 0 1    0 0 1 0 0 0 0
1 1 0 0 0 0 1    0 0 0 1 0 0 0
0 1 1 0 0 1 0    0 0 0 0 1 0 0
1 1 0 0 0 0 1    0 0 0 0 0 1 0
0 1 0 0 1 1 0    0 0 0 0 0 0 1

Once the exponent matrix and history matrix are established, the reduction procedure, which performs the forward step of Gaussian elimination, is done as follows:

For each column in the exponent matrix, working from right to left:

Find the “pivot” vector of smallest subscript (nearest the top) whose rightmost 1 is in the current column. If none exists, continue working on the next column to the left.

For each other vector with rightmost 1 in the current column:

Replace the exponent vector with the element-wise sum, modulo 2, of the pivot vector and the current vector.

Replace the associated history vector with the element-wise sum, modulo 2, of the pivot vector and the current vector.

The reduction of the initial matrices given above is shown below:

1 1 0 1 0 0 0    1 0 0 0 0 0 0
0 0 1 0 1 0 0    0 1 0 0 0 0 0
0 0 0 1 0 0 1    0 0 1 0 0 0 0
0 0 0 0 0 0 0    1 0 1 1 0 0 0
0 1 1 0 0 1 0    0 0 0 0 1 0 0
0 0 0 0 0 0 0    1 0 1 0 0 1 0
0 0 0 0 0 0 0    0 1 0 0 1 0 1

Each of the three rows with zero exponent vectors represents an S-set. Some of those S-sets lead to factorizations, but others fail. For instance, the S-set in the seventh row of the reduced matrix gives the congruence (6700527 · 11455708 · 3213960)2 ≡ (2 · 31 · 43 · 53)2 (mod 13290059), or 1412982 ≡ 1412982 (mod 13290059), which is useless. The S-set in the sixth row gives the congruence (171341 · 5235158 · 1895246)2 ≡ (2 · 52 · 41 · 113)2 (mod 13290059), or 130584092 ≡ 2316502 (mod 13290059); however, GCD(13058409 − 231650, 13290059) = 1 and the factorization fails. However, the S-set in the fourth row gives a completed factorization, as shown above.

Your task is to write the ANSWER program described above, and then to find the factors of F7 using the factor base and residues from the previous exercise. When you are finished, you are welcome to read, run or print a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2 3