Examples of Non-database Sources in Report Building

 

Besides databases, local files, HDFS files ,NoSQL databases and JSON data streams are also frequently used data sources in building reports. These types of data sources are not as good as databases in terms of computational abilities. Since they can’t implement algorithms such as statistical queries, group and aggregation, and sort and merge in the stage of data source preparation, they can only achieve the same effect using complicated report scripts or JAVA language.

esProc has rich class libraries for structured-data handling and supports heterogeneous data sources, making it convenient to build reports with Non-database sources, and thus a suitable tool for report data source preparation. The reporting tool regards an esProc script as the stored procedure, passes parameter to it and gets the returned result after execution through JDBC. For more details, see How to Use esProc to Assist Reporting Tools.

Here follows some typical scenarios about Non-database sources and offers their esProc solutions.

Simple grouping and aggregation

The sales.csv file stores sales orders. Requirement: In the reporting tool, group data by SellerId and calculate the total sales amount per seller. Below is the source file:

OrderID   Client        SellerId     Amount    OrderDate

26     TAS   1       2142.4      2009-08-05

33     DSGC        1       613.2        2009-08-14

84     GC    1       88.5 2009-10-16

133  HU   1       1419.8      2010-12-12

32     JFS    3       468.0        2009-08-13

39     NR    3       3016.0      2010-08-21

43     KT     3       2169.0      2009-08-27

71     JFE   3       240.4        2010-10-01

99     RA    3       1731.2      2009-11-05

The reporting tool can import data in its entirety, perform group and aggregation and then hide the detailed data. But the performance is not equal to that of esProc used specially for arranging data source.

  A
1 =file(“D:\\sales.csv”).import@t()
2 =A1.groups(SellerId;sum(Amount))

Explanation: First import the source file as a structured two-dimensional table; then perform the group and aggregate.

By default The import function imports all fields using tab as the separator; @t means importing the first row as the field names. The groups function can perform group and aggregate on a two-dimensional table.

Here’s the result:

esProc_text_file_group_1

Extended information:

  1. If the source data has been ordered by SellerId, we could use @o to speed up the data handling by only comparing the adjacent records with the code =A1.groups@o(SellerId;sum(Amount)).
  2. esProc allows for grouping data according to multiple fields and the implementation of various aggregate algorithms. To calculate the sales amount and count the orders for each seller per year, for example, we could use this code:groups(SellerId,year(OrderDate);sum(Amount),count(OrderID)).

Here’s the result:

esProc_text_file_group_2

  1. When the source file is too big to be entirely loaded into the memory but the aggregate result would be smaller (this is what usually happens), we can import the file as the cursor and perform the aggregate with the group function, thus the code will be file(“D:\\sales.txt”).cursor@t().groups(SellerId;sum(Amount)).
  2. If the aggregate result set isn’t small enough to be completely put into the memory, we could carry out the aggregation using groupx Unlike the group function, groupx returns a cursor.

Duplicate-related operations

This scenario is about removing the duplicate rows in the Sales.csv file and present the result in the report. The rule is: for two rows with the same OrderID, only keep the first one. Below is a selection of the source file:

OrderID   Client        SellerId     Amount    OrderDate

1       WVF Vip 5       440.0        2009-02-05

1       WVF Vip 5       1040.0      2009-02-04

1       UFS Com          5       1040.0      2009-02-03

2       UFS Com  13     1863.4      2009-07-05

2       UFS Com  13     1863.4      2009-07-05

3       SWFR        2       1813.0      2009-07-08

4       JFS Pep     27     670.8        2009-07-08

5     DSG       15    3730.0    2009-07-09……

esProc code:

  A
1 =file(“D:\\sales.csv”).import@t()
2 =A1.group@1(OrderID)

Explanation: Group the imported data by OrderID and get the first record of each group without aggregation. The group function is able to group data without performing the aggregate; @1 means getting the first row of each group and returning the results as a new two-dimensional table。

Here’s the result:

esProc_text_file_group_3

Extended information:

  1. It’s a more practical practice to retain records according to a certain condition. To keep the record with the earliest date in each group, use the following code: group(OrderID).(~.maxp(-OrderDate)).

Here’s the result:

esProc_text_file_group_4

To find the first of each group of records sorted by OrderDate in ascending order and by Amount in descending order, we can find the records with earliest OrderDate and then find the one from them with the greatest amount. The code is A1.group(OrderID).(~.maxp([-OrderDate,Amount])).

If there are multiple records satisfying the condition and we want them all, use A1.group(OrderID).conj(~.maxp@a([-OrderDate,Amount])) .

To get the top two from the multiple eligible records, use

A1.group(OrderID).conj(~.top(2,OrderDate,-Amount)).

  1. Sometimes it is the duplicate records that we want, then get them with group(OrderID).select(~.len()>1).conj().

Here’s the result:

esProc_text_file_group_5

  1. Sometimes we simply want the distinct values of a certain field. To get the client list, for instance, use id(Client). The result:

esProc_text_file_group_6

The id function performs the distinct operation, so it is the equivalent of A1.group@1(Client).(Client).

  1. Other times we want all the rows that have different field values, instead of filtering the records by ID. In those cases we do it with

A1.group@1(OrderID,Client,SellerId,Amount,OrderDate)

The result:

esProc_text_file_group_7

Without intending further structured-data computation, we can import the file as a set of rows and perform distinct operation using file(“D:sales.csv”).import@tsi().id(). This is what we get:

esProc_text_file_group_8

The grouped report from a multi-level JSON file

Cells.json is a multi-level nested JSON file, which you need to present in a grouped report. The grouping fields are name, type and image.”xlink:href”. There is another a field with 3 subdocuments: custom. Identifier, custom. Classifier and custom. Output, which are of the same structure but contain different number of documents each.

Below is the source data:

{

“cells”: [

{

“name”: “b”,

“type”: “basic.Sensor”,

“custom”: {

“identifier”: [

{

“name”: “Name1”,

“URI”: “Value1”

},

{

“name”: “Name4”,

“URI”: “Value4”

}

],

“classifier”: [

{

“name”: “Name2”,

“URI”: “Value2”

}

],

“output”: [

{

“name”: “Name3”,

“URI”: “Value3”

}

]

},

“image”: {

“width”: 50,

“height”: 50,

“xlink:href”: “”

}

},

……

]

}

 

esProc code:

  A
1 =file(“D:\\cells.JSON”).read().import@j()
2 =A1.cells.conj(custom.(  identifier.new(name:cname,URI:cURI,”identifiler”:cType,A1.cells.name:name,A1.cells.type:type,A1.cells.image.’xlink:href’:image)|

classifier.new(name:cname,URI:cURI,”classifier”:cType,A1.cells.name:name,A1.cells.type:type,A1.cells.image.’xlink:href’:image)|

output.new(name:cname,URI:cURI,”output”:cType,A1.cells.name:name,A1.cells.type:type,A1.cells.image.’xlink:href’:image)))

esProc merges the three subdocuments into a single two-dimensional table, gives them a new field name ctype to be identified and appends the grouping fields to the tabled. Thus a typical “table with subtables” is created. Below is the merging result of A2:

esProc_text_file_group_9

Now it’s easy to build a grouped report based on this esProc result.

Joins between MongoDB collections

Both sales and emp are MongoDB collections. sales has SellerId field as its logical foreign key that points to emp’s EId field. You need to query orders in sales by the specified time period and associate them with emp through a left join, and then present the result in a report. Below are selections of the source data:

Collection sales

OrderID Client SellerId Amount OrderDate
26 TAS 1 2142.4 2009-08-05
27 HP 13 538.6 2010-08-06
28 DY 11 307.2 2010-08-07
29 WVF 19 420 2010-08-08
30 UFS 18 1200.8 2009-08-09
31 SWFR 15 1488.8 2010-08-12
32 JFS 3 468 2009-08-13

Collection emp

EId State Dept Name Gender Salary Birthday
2 New York Finance Ashley F 11001 1980-07-19
3 New Mexico Sales Rachel F 9000 1970-12-17
4 Texas HR Emily F 7000 1985-03-07
5 Texas R&D Ashley F 16000 1975-05-13
6 California Sales Matthew M 11000 1984-07-07
7 Illinois Sales Alexis F 9000 1972-08-16

esProc code:

  A
1 =MongoDB(“mongo://localhost:27017/test?user=root&password=sa”)
2 =A1.find(“sales”,”{‘$and’:[{‘OrderDate’:{‘$gte’:'”+string(begin)+”‘}},{‘OrderDate’:{‘$lte’:'”+string(end)+”‘}}]}”,”{_id:0}”).fetch()
3 =A1.find(“emp”,,”{_id:0}”).fetch()
4 =A1.close()
5 =join@1(A2:sales,SellerId;A3:emp,EId)
6 =A5.new(sales.OrderID:OrderID,sales.Client:Client,sales.Amount:Amount,sales.OrderDate:OrderDate,emp.Name:Name,emp.Dept:Dept,emp.Gender:Gender)

Access MongoDB to perform queries on collection sales and collection emp respectively, associate them through a left join, and retrieve the desired fields. Here’s the result:

esProc_text_file_group_10

Query on the HDFS file

The text file employee.gz stored in HDFS contains employee data. The requirement is to perform query by birthday and gender and then present the result. The query condition is, for example, the female employees who were born after January 1, 1981 inclusive. Below is a selection of the source data:

EID   NAME       SURNAME        GENDER  STATE        BIRTHDAY        HIREDATE         DEPT         SALARY

1       Rebecca   Moore      F       California 1974-11-20       2005-03-11       R&D          7000

2       Ashley      Wilson      F       New York 1980-07-19       2008-03-16       Finance    11000

3       Rachel      Johnson   F       New Mexico     1970-12-17       2010-12-01       Sales         9000

4       Emily         Smith        F       Texas        1985-03-07       2006-08-15       HR    7000

5       Ashley      Smith        F       Texas        1975-05-13       2004-07-30       R&D          16000

6       Matthew Johnson   M     California 1984-07-07       2005-07-07       Sales         11000

7       Alexis        Smith        F       Illinois       1972-08-16       2002-08-16       Sales         9000

8       Megan     Wilson      F       California 1979-04-19       1984-04-19       Marketing        11000

9       Victoria    Davis        F       Texas        1983-12-07       2009-12-07       HR    3000

10     Ryan         Johnson   M     Pennsylvania    1976-03-12       2006-03-12       R&D          13000

11     Jacob        Moore      M     Texas        1974-12-16       2004-12-16       Sales         12000

12     Jessica     Davis        F       New York 1980-09-11       2008-09-11       Sales         7000

13     Daniel       Davis        M     Florida      1982-05-14       2010-05-14       Finance    10000

esProc code:

  A  
1 =hdfsfile(“hdfs://192.168.1.210:9000/user/employee.gz”,”UTF-8″).cursor@t()  
2 = A1.select(BIRTHDAY>=argBirthDay && GENDER==argGender)

Use cursor to open the HDFS file object, making the first row the headers and using tab as the default field separator. The character set is UTF-8. The compression type will be determined by the file extension. Here it is the gzip format.

Then use the select function to perform the query, in which argBirthDay and argGender are report parameters.

Concatenating strings by groups

The sales.csv file stores sales orders, based on which we want a report of client list having three fields – year, month and comma-separated client list. Below is the source file:

OrderID   Client        SellerId     Amount    OrderDate

26     TAS   1       2142.4      2009-08-05

33     DSGC        1       613.2        2009-08-14

84     GC    1       88.5 2009-10-16

133  HU   1       1419.8      2010-12-12

32     JFS    3       468.0        2009-08-13

39     NR    3       3016.0      2010-08-21

43     KT     3       2169.0      2009-08-27

71     JFE   3       240.4        2010-10-01

99     RA    3       1731.2      2009-11-05

esProc code:

  A
1 =file(“D:\\sales.cvs”).import@t()
2 =A1.group(year(OrderDate),month(OrderDate);~.(Client).string())

Explanation: Group the imported data by the two parts of OrderDate, get the set of clients in each group and convert them to comma-separated strings. Here’s the result:

esProc_text_file_group_11

Intra-group alignment

The empContact.csv file has six columns, the first four of which contain employee information where duplicate values exist; the fifth column holds the ways of contact and the last column is the specific contact information. Each employee has at least one way for contact. Now we want to merge the records to make a new table where the fifth, sixth and seventh columns are work phone, cell phone and work email for reporting. Below is source file:

21458952,John,technology,support staff,work phone,555-555-5555

21458952,John,technology,support staff,work email,johndoe@whatever.net

21458952,John,technology,support staff,cell phone,233-1132-9762

99946133,Jane,technology,administration,work phone,444-444-4444

99946133,Jane,technology,administration,work email,janepaul@whatever.net

99946133,Jane,technology,administration,cell phone,121-3433-9788

99946133,Jane,technology,administration,home phone,431-443-8434

esProc code:

  A
1 =file(“D:\\empContact.csv”).import(;”,”)
2 =A1.group(#1,#2,#3,#4;~.align([“work phone”,”cell phone”,”work email”],#5):g)
3 =A2.new(#1,#2,#3,#4,g(1).#6,g(2).#6,g(3).#6)

Explanation: Group the imported records by the first four columns and align records in each group according to the order of work phone\cell phone\work email (the duplicate values will be automatically filtered away); then build a new two-dimensional table by sequentially retrieving the records. Here’s the result:

 esProc_text_file_group_12

Grouping by sequence numbers

In the following file, every three lines correspond to one record. For example, the first record is 26\TAS\2009-08-05. We want to present it in the report as a multi-field two-dimensional table. Here’s the source data:

26

TAS

2009-08-05

33

DSGC

2009-08-14

84

GC

2009-10-16

esProc code:

  A
1 =file(“D:\\data.txt”).import@si()
2 =A1.group((#-1)\3)
3 =A2.new(~(1):OrderId,~(2):Client,~(3):OrderDate)

Explanation: Import the file as a sequence, during which @s means that fields won’t be extracted; group the sequence every three lines, during which “#” represents the sequence number of each line and “\” represents integer division; finally create a table sequence based on the groups, during which ~(1) is the first member of the current group.

The result:

esProc_text_file_group_13

Extended information: If the file has a complicated format. For example, there are three fields in the second line of each group:

26

TAS   1       2142.4

2009-08-05

33

DSGC        1       613.2

2009-08-14

84

GC    1       88.5

2009-10-16

In this case, we just need to modify A3 into this : =A2.new(~(1):OrderID, (line=~(2).array(“\t”))(1):Client,line(2):SellerId,line(3):Amount,~(3):OrderDate )

The result:

esProc_text_file_group_14

Intra-group order-related calculation

postTurn.txt records the schedule of rotating shifts for a post. The file is grouped by dates and some persons work consecutive shifts. We want to find out how many times the two neighboring names work successive shifts. The fields would be first, next and times in the report. Below is the source file:

#2015-06-01

Ashley

Matthew

#2015-06-02

Ashley

Matthew

Matthew

#2015-06-03

Ashley

Ashley

Matthew

Ryan

#2015-06-04

Ashley

Ryan

Ryan

Matthew

#2015-06-05

Jessica

#2015-06-06

Ashley

Jessica

Matthew

Ashley

Jessica

Jessica

Ryan

esProc code:

  A
1 =file(“D:\\postTurn.txt.txt”).import@i()
2 =A1.group@i(pos(~,”#”))
3 =A2.conj(~.([~[-1],~]).to(3,))
4 =A3.group(~).new(~(1)(1):first,~(1)(2):next,count(~):times)

Explanation: Group the imported data according to the condition that whether a row contains “#” or not, and each group is a day’s work shift schedule; for each group, put every two neighboring names into a subset and concatenate all the subsets; group each concatenated set every two members, get the first name and the next name from each group and count how many times they appear together. Here’s the result:

esProc_text_file_group_15

Inserting data between groups

In the sales.txt file, the value of SellerId changes every multiple rows. We want to export the OrderId field in its original order and add the string “Begin” every time a group of records with new value begins and the string “End” after a group ends. Below is the source file:

OrderID   Client        SellerId     Amount    OrderDate

26     TAS   1       2142.4      2009-08-05

33     DSGC        1       613.2        2009-08-14

71     JFE   3       240.4        2010-10-01

99     RA    3       1731.2      2009-11-05

84     GC    1       88.5 2009-10-16

133  HU   1       1419.8      2010-12-12

32     JFS    3       468.0        2009-08-13

39     NR    3       3016.0      2010-08-21

43     KT     3       2169.0      2009-08-27

esProc code:

  A
1 =file(“D:\\sales.csv”).import@t()
2 =A1.group@o(SellerId).conj(“Begin”|~.(OrderID)|”End”)

Explanation: Write the code directly according to what it asks. @o enables grouping by comparing the adjacent records. To retrieve a certain field from a group of records, use ~.(OrderID). The operator “|” is used to concatenate sets or members of a set. The conj function handles subsets respectively and then concatenates the results.

Here’s the result:

esProc_text_file_group_16

Extended information:

  1. Stop using @o when data is unordered, but the performance will decrease a little.
  2. To add an aggregate value, the number of records in each group, for example, write A2 as group@o(SellerId).conj((“Begin: “+string(~.count()))|~.(OrderID)|”End”)

Here’s the result:

esProc_text_file_group_17

  1. To add strings to both ends of a group of records, use the following code:
  A B
1 =file(“D:\\sales.csv”).import@t() =create(OrderID,Client,SellerId,Amount,OrderDate)
2 for A1.group@o(SellerId) =B2.insert(0,””,”Begin”)
3   =B2.insert@r(0:A13)
4   =B2.insert(0,””,”end”)
5 return B2  

The result:

esProc_text_file_group_18

Sampling by groups

The sales.txt file stores ordering data, and we want to retrieve one record from each seller’s records to present it in the report. Here’s the rule: Find out records whose OrderDates are fewer than 60 days after their previous ones, and retrieve one randomly from these sampled records; if there is an insufficient number of records satisfying the condition, just get the record that has the smallest OrderDate. Besides, it is specified that the second record will always appear in each group. Below is the source file:

OrderID   Client        SellerId     Amount    OrderDate

144  WZ   6       86.4 2010-12-23

70     DSG 7       288.0        2009-09-30

131  FOL  7       103.2        2009-12-10

65     YZ     8       29600.0   2009-01-06

esProc code:

  A B C
1 =file(“D:\\sales.csv”).import@t()  
2 for A1.group(SellerId) =A2.sort(OrderDate).select(interval(OrderDate[-1],OrderDate)<60)
3   if B2!=[] =B1=B1|B2(rand(B2.len())+1)
4   else =B1=B1|B2(1)

Explanation:

Group the imported records by SellerId and loop through the groups. For each loop, first sort the group by OrderDate and then take the sample as required. If there’re enough records sampled, retrieve one randomly and append it to B1; if the number of records meeting the condition is insufficient, then get the first record of this group. OrderDate[-1] represents the previous record relative to the current one. Below are the results of three sampling:

esProc_text_file_group_19

Handling the big file by groups

The sales.txt file is too big to be held by the memory, thus we need to filter out the records of the clients whose total order amount is greater than 10,000, and present the filtered file in a report. The source data is already ordered by SellerId, as shown below:

OrderID   Client        SellerId     Amount    OrderDate

26     TAS   1       2142.4      2009-08-05

33     DSGC        1       613.2        2009-08-14

84     GC    1       88.5 2009-10-16

133  HU   1       1419.8      2010-12-12

32     JFS    3       468.0        2009-08-13

esProc code:

  A B C
1 =file(“D:\\sales.csv”).cursor@t()  
2 for A1;SellerId if A2.sum(Amount)>=10000 =C1=C1|B2
3 return C1    

Explanation: Loop through the cursor by SellerId and import a group of data into the memory at a time; sum up the amounts for the current group and write the records to cell C1 if the sum satisfies the condition. Here’s the result:

esProc_text_file_group_20

If the source data isn’t ordered, we can first sort it using the sortx function, which returns a cursor, too.

Multilevel grouping

The Stock.txt file keeps the information of goods in and out of the warehouse. A same kind of goods may be found in and out several times a day, or there may be nothing in and out during continuous days. The initial inventory of the goods is 0. We use In to represent goods shipped in, and Out for those shipped out. Now we want to present the daily inventory for all kinds of goods in a report. Below is the source file:

date name        quantity   flag

2014-04-01       Item1       15     In

2014-04-01       Item1       4       In

2014-04-02       Item1       3       In

2014-04-02       Item1       10     Out

2014-04-03       Item1       3       In

2014-04-04       Item1       5       Out

2014-04-07       Item1       4       In

2014-04-10       Item1       2       Out

2014-04-01       Item2       20     In

2014-04-02       Item3       30     In

2014-04-03       Item3       14     Out

esProc code:

  A B
1 =file(“D:\\stock.txt”).cursor@t()  
2 =A1.group(name,date;~.select(flag==”In”).sum(quantity):in,~.select(flag==”Out”).sum(quantity):out)
3 =A2.group(name) =periods((t=A2.id(date)).min(),t.max(),1)
4 for A3 =A4.align(B3,date)
5   >c=0
6   =B4.new(A4.name:name,
B3(#):date,
c:Opening,
in,
(b=c+in):Total,
out,
(c=b-out):Close)
7   =@|B5

Explanation: A2 calculates the daily in-and-out quantity for each kind of goods; obtain the complete list of dates according to the earliest and latest dates and store it in B3; group records by items, loop through each group to align the current group of records with B3 and calculate the daily inventory in order. Here’s the result:

esProc_text_file_group_21

 

Advertisements
Posted in Uncategorized | Leave a comment

Collection of Solutions to Knotty Report Building Problems

There are many knotty problems around report development. These include complex layout, which is difficult to build with a reporting tool alone; complex data sources requiring dynamic access and involving different types of databases or non-database data; as well as the handling of order-related computation, subsets of grouped data, fixed grouping, string splitting and concatenation, date intervals and other complicated computations, all of which you may find hard to deal with in SQL.

Using esProc as the computing middleware will handle those issues with simple scripts. esProc encapsulates rich functions for processing structured data, as well as supports dynamic parsing of expressions, heterogeneous data source handling, order-related computations, set operations, irregular grouping and post-grouping computation. A reporting tool will pass parameters to the esProc script, execute it in the same way as it executes a database stored procedure, and get the returned result set via JDBC interface. You can learn more from How to Use esProc to Assist Reporting Tools.

In the following, there are examples of the typical problems concerning report development, and their esProc solutions.

Data transposition

Sometimes you might want to display data in a cross-tab or grouped report, but you could not use the source data directly. The data needs to be transposed first. With either SQL or the stored procedure, you must write code from the lowest level using a high-level language, which involves substantial amount of work and is really hard. esProc can remedy this failing through its support for order-related computation and dynamic scripting. Here is a simple example of transposing rows and columns.

Below is a selection of the database table SALES that stores order data:

OrderID Client SellerId Amount OrderDate
1 DSG 13 19480 2014-06-20 00:00
2 ERN 18 979 2014-06-13 00:00
3 JFE 19 28972 2014-12-11 00:00
4 OFS 21 4829 2014-02-24 00:00
5 ERN 22 21392 2014-02-01 00:00

You need to calculate the total order amount, the maximum/minimum order amount and the total number of orders each month in a specified year, and transpose the data into 13 columns and 4 rows. That is, the four algorithms constitute the first column named subtotal and each month is a column named 1, 2, 3, 4…

esProc script:

esProc_report_problems_1

Explanation: First perform group and aggregate operations in SQL, and then run a loop to transpose every row to column, and fill in values in sequence to each column of A2. Here is the result:

esProc_report_problems_2

Though the transposition in this example is simple, there are other unusual layouts such as post-grouping multi-layer transposition and dynamic location transposition. You can refer to esProc Arranges Data for Reports with Unconventional Layouts for more.

Multi-column layout

Most reporting tools support vertical layout but they really have difficulties in building unusual layouts such as arranging data in multiple columns horizontally or in an interlocking way. Normally you have to turn to high-level languages to work them out. With order-related computation and dynamic scripting features, esProc can help manage this well. Let’s take a look at horizontal multi-column layout:

The database table emp has three fields. Below is a selection of the table:

EId Name Dept
4 Emily HR
5 Ashley R&D
6 Matthew Sales
7 Alexis Sales
8 Megan Marketing
9 Victoria HR
10 Ryan R&D
11 Jacob Sales

The desired layout is that data is presented in a horizontally sequenced order in 3 columns, as shown below:

esProc_report_problems_3

You can first transform the original 3-field table to a 9-field table in esProc, and then build the above layout with the reporting tool. The code is as follows:

esProc_report_problems_4

Explanation: Divide rows into three parts by their sequence numbers and store them respectively in A2, B2 and C2. And then concatenate the fields in B2 and C2 with those in A2. Here is the result you’ll get:

esProc_report_problems_5

Apart from the horizontal multi-column layout, there are other uncommon report layouts, like interlocking multi-column layout, dynamic vertical multi-column layout based on a single field, duplicating rows in sequence and condition-controlled formats for grouped reports. More information can be found in esProc Arranges Data for Reports with Unconventional Layouts.

Unconventional data sources

In addition to conventional databases, data sources of a reporting tool could be JSON files, MongoDB, txt files, Excel and HDFS files. Some of them do not have any computability and some have not enough. Both cases require writing a lot of code for report development. esProc supports various data sources and is capable of helping a reporting tool out. For example, you might want to display data from a multi-level JSON file in a grouped report:

Cells.json is a multi-level, nested JSON file which you want to present as a grouped report. The grouping fields are name, type and image. “xlink:href”. There is also a field with 3 subdocuments in each document: custom. Identifier, custom. Classifier and custom. Output, which are of the same structure but contain different number of documents.

Source data:

{

“cells”: [

{

“name”: “b”,

“type”: “basic.Sensor”,

“custom”: {

“identifier”: [

{

“name”: “Name1”,

“URI”: “Value1”

},

{

“name”: “Name4”,

“URI”: “Value4”

}

],

“classifier”: [

{

“name”: “Name2”,

“URI”: “Value2”

}

],

“output”: [

{

“name”: “Name3”,

“URI”: “Value3”

}

]

},

“image”: {

“width”: 50,

“height”: 50,

“xlink:href”: “

AAADDPmHLAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAABEJAAARCQBQGfEVAAAABl0RVh0

U29mdHdhcmUAd3Vi8f+k/EREURQtsda2Or/+nFLqP6T5Ecdi0aJFL85msz2Qxyf4JIumMAx/

ClmWt23GmL1kO54CXANAVH+WiN4Sx7EoNVkU3Z41BDHMeXAxjvOxNr7RJjzHX7S/

jAflwBxkJr/RwiOpWZ883Nzd+Wpld7tkBr/SJr7ZHZbHZeuVweSnPfniocMAWYwcGBaf

H0OoPamFGAaY4ZBZjmmFGAaY4ZBZjmmFGAaY4ZBZjmmFGAaY7/B94QnX08zxKL

AAAAAElFTkSuQmCC”

}

},

……

]

}

 

esProc script:

esProc_report_problems_6

Explanation: Merge the three subdocuments into one two-dimensional table, represent them with a new field name cType and append grouping fields to the table. Thus a typical “table with subtables” has been created. The result is as follows:

esProc_report_problems_7

According to this esProc result, it’s easy to create a grouped report with a reporting tool.

There are more similar cases, including querying a multi-level subdocument in a JSON file, subreports using different JSON files, MongoDB collection join, and etc. You can find them in esProc Integrates Heterogeneous Data Sources for Report Development.

Dynamic data sources

Sometimes you might want to reference data sources dynamically through a parameter, merge data sources into one, or dynamically pass data source name to a subreport or Table control. Often reporting tools – especially those supporting single data source, such as BIRT and Jasper – have to use a high-level language to accomplish these computations, or trade security for reduced complexity.

esProc can help to perfect these computations, thanks to its support for dynamic parsing of expressions and for heterogeneous data sources. The following explains esProc way of switching data sources dynamically through parameter:

myDB1 and oraDB are data sources that point to different databases. Each holds a sOrder table with the same structure. The report requires connecting to data sources dynamically via a parameter, querying sOrder for orders whose amounts are greater than 1,000, and displaying them.

Below is a selection of sOrder in myDB1:

OrderID Client SellerId Amount OrderDate
1 WVF 5 440 2009-02-03
2 UFS 13 1863 2009-07-05
4 JFS 27 671 2009-07-08
5 DSG 15 3730 2009-07-09
6 JFE 10 1445 2009-07-10
7 OLF 16 625 2009-07-11
8 PAER 29 2491 2010-07-12
9 DY 20 518 2010-07-15
10 JDR 17 1120 2010-07-16

Below is a selection of sOrder in oraDB:

OrderID Client SellerId Amount OrderDate
101 WAN 22 396 2010-11-07
102 SPL 15 142 2010-11-08
103 LA 23 713 2010-11-11
104 ERN 5 5678 2010-11-11
105 FUR 28 154 2009-11-12
106 BSF 27 10742 2009-11-13
107 RHD 4 569 2009-11-14
108 BDR 12 480 2010-11-15
109 OFS 17 1106 2009-11-18

esProc script:

esProc_report_problems_8

Explanation: Both pSource and pAmount are report parameters. pSource represents the data source name; ${…} indicates parsing a string or a string variable into an expression. pAmount stands for the order amount.

When pSource=“myDB1”, A1 has the following result:

esProc_report_problems_9

When pSource=“oraDB”, A1 gets this result:

esProc_report_problems_10

More similar cases, including multi-data-source pre-join, result sets union, different data sources for main report and subreport, dynamic join between main report and subreports and displaying data by different time units specified by parameter, can be found in esProc Implements Dynamic Data Sources for Reporting Tools.

Heterogeneous data sources

Reporting tools can normally handle a type of single data source. To present the result of hybrid computation involving different types of data sources, the only choice is using a high-level language or ETL tools. The process is extremely tedious. esProc is more suitable for doing this because it supports the hybrid computation. Let’s look at the esProc approach through the join of MongoDB and MySQL:

emp1 is a MongoDB collection, whose CityID field is the logical foreign key pointing to CItyID field of cities, a MySQL table that has two fields – CityID and CityName. You need to query employee records from emp1 according to specified time period and switch its CityID field to CityName of cities.

esProc script:

esProc_report_problems_11

Explanation: Query the two data sources using MongoDB syntax and SQL syntax respectively, and then use esProc switch function to establish foreign key association between them, and finally retrieve the desired fields. Here is the result:

esProc_report_problems_12

Besides, you can perform hybrid computation between JSON, Excel and txt, or between a database and a non-database. More details can be found from esProc Integrates Heterogeneous Data Sources for Report Development.

Parallel data retrieval

Sometimes a considerable amount of data needs to be retrieved from the database for presentation or export. But data retrieval with JDBC is always slow – maybe even slower than other computations involved in the report development. This causes the extreme inefficiency of the development. In contrast, esProc can realize data retrieval with multiple threads in parallel using simple code, which solves the JDBC problem. The following is such an example:

The user status table stores detail data – about 3,600,000 records. You need to present it in report and print it out. The desired layout is like this:

esProc_report_problems_81

esProc script:

esProc_report_problems_13

Explanation: Divide data into 4 parts according to the value range of userid and set 4 threads; each is assigned one part through the parameters and returns the result to the main thread.

Note that only when sufficient database resource is available will the parallel processing increase performance considerably. If the database is already fully loaded, the parallel data retrieval will aggravate its burden instead.

Order-related computations

Order related computations include inter-row comparison, getting relative interval and computing ranks. As a SQL result set doesn’t have explicit sequence numbers, you must generate them through a computed column. This is quite a hassle. For databases that lack sufficient support for window functions, such as MySQL, the code will be complicated.

One of the intrinsic qualities of esProc is its support for order-related computations, making the implementation easier. Let’s look at two cases – computing link relative ratio and making a year-on-year comparison.

The database table sOrder stores orders per day from each seller. The report needs to present results of comparing the sales amount per month during a specified time period with that of the previous month and with that of the same month in the previous year. Below is the source data:

OrderID Client SellerId Amount OrderDate
26 TAS 1 2142 2009-08-05 00:00:00
33 DSGC 1 613 2009-08-14 00:00:00
84 GC 1 89 2009-10-16 00:00:00
133 HU 1 1420 2010-12-12 00:00:00
32 JFS 3 468 2009-08-13 00:00:00
39 NR 3 3016 2010-08-21 00:00:00
43 KT 3 2169 2009-08-27 00:00:00
71 JFE 3 240 2010-10-01 00:00:00
99 RA 3 1731 2009-11-05 00:00:00

Script for computing link relative ratio:

esProc_report_problems_14

Explanation: First group and aggregate data to get sales amount per month per year, and then compute link relative ratio using the formula “sales amount of the current month/sales amount of the previous month”. The result is as follows:

esProc_report_problems_15

Script for comparing sales amount between same months on an annual basis:

esProc_report_problems_16

Explanation: Group data and sort it by months, and then perform the computation with formula “sales amount of the current month/sales amount of the same month in the previous year. The result is as follows:

esProc_report_problems_17

A similar example is computing inventory over the row.

The database table inout stores warehouse-in and -out records per product per day during consecutive days and the initial inventory is zero. You need to display the data in a report and add an extra column to represent stock per day.

product date in out
cola 6/1/2015 100 25
cola 6/2/2015 50 50
cola 6/3/2015 40 0
cola 6/4/2015 20 50
juice 6/1/20 100 50
juice 6/2/2015 10 20
juice 6/3/2015 100 20
juice 6/4/2015 40 35
wine 6/1/2015 50 10
wine 6/2/2015 0 30
wine 6/3/2015 20 10
wine 6/4/2015 30 5

esProc script:

esProc_report_problems_18

Explanation: Compute current day’s stock for each kind of product (through data grouping) with the formula “current day’s warehouse-in – current day’s warehouse-out + warehouse stock of the previous day”. esProc uses [-1] to represent the previous record and produces intuitive expression. Here is the result:

esProc_report_problems_19

If the dates are discontinuous, you need to align data with a sequence of consecutive dates before computing inventory. The script is as follows:

esProc_report_problems_20

Another typical case involving order-related computation is the installment loan.

The loan table stores loan information, including total loan amount, payment terms by month and annual interest rate. You need to build a grouped report in which details of each term of payment for every loan record – payment, interest, principal, principal balance – will be presented. Below is the source data:

LoanID LoanAmt Term Rate
L01 100000 5 4.75
L02 20000 2 5
L03 500000 12 4.5

esProc script:

esProc_report_problems_21

Explanation: First calculate the monthly interest rate and each term’s payment for every loan using a formula; then create a two-dimensional detail table for each loan according to the available information and calculate the current term’s interest, principal and principal balance; finally concatenate these detail tables. Below is the detail table for loan 1:

esProc_report_problems_22

Post-grouping subset handling

Some cases, require non-aggregate operations, such as record query, member modification or set operations, after data is grouped, but it’s hard to do this in SQL as a SQL grouping has a built-in aggregate operation. Being able to handle post-grouping operations and set operations, esProc is better at dealing with those cases.

Compare goal difference

The database table Mytable stores goal differences of each player in each team for every competition. The report needs to present the number of competitions in which a specified player is superior in goal difference compared with every other player in the team. Below is the source data:

Player Team Game Goals
Messi A G1 2
Neymar A G1 0
Messi A G2 2
Neymar A G2 3
Messi A G3 4
Neymar A G3 1
Neymar A G4 5
Messi A G5 2
Neymar B G5 2
Valdes A G1 1

esProc script:

esProc_report_problems_23

Explanation: Use a SQL statement to query the goal difference of the specified player and that of each of his teammates in each competition, and then group data by his teammates and query from each group (set) the competitions (subset) in which the specified player scores more goals than his teammate. You can get the result by counting members of a subset. Here is the result:

esProc_report_problems_24

Tag members of a group

In the database table test, classLev1 and classLev2 are grouping fields of two levels and seqInlev2 represents intragroup sequence numbers. You need to create a list to present data and add a tag field to tag rows in a group. Rules for assigning values to tag field are: Assign “first” to the first row and “last” to the last row, and assign “middle” to those in between; if there is only one row in a group, assign null to it.

Below is the source data:

classLev1 classLev2 name unit seqInlev2
1 1 test1 box 1
1 1 test2 box 2
1 2 test1 box 1
1 2 test2 box 2
1 2 test3 box 3
1 3 test1 box 1
2 1 test1 box 1
2 1 test2 box 2
2 1 test3 box 3
2 1 test4 box 4
2 2 test1 box 1

esProc script:

esProc_report_problems_25

Explanation: Retrieve data with a SQL statement and create an empty field tag; group data and select groups in which there is more than one member; tag all rows with “middle” and then change the first row to “first” and the last row to “last” in each group. Here is the result:

esProc_report_problems_26

Querying branch stores where DVDs are in short supply

The Branch table stores information of branch stores selling DVDs; the DVD table stores titles of DVDs and categories which they belong to; DVDCopy table stores information about DVD copies (the real products), and associates with the Branch table through BranchID field and with DVD table through DVDID field. That the Status field in DVDCopy table is “Miss” means this copy is missing; if LastDateReturned field is empty, the copy has been borrowed but hasn’t returned yet. The report needs to present information of branch stores where the available DVD copies have fewer than 4 categories. Below is the source data:

Branch table:

BID Street City
B001 street1 New York
B002 street2 Houston
B003 street3 LA
B004 street4 Lincoln

DVD table:

DVDID Category Title
D001 science fiction Transformers IV
D002 science fiction Transformers II
D003 science fiction Guardians of the Galaxy
D004 act The Expendables III
D005 sport Need for Speed
D006 feature Grace of Monaco

DVDCopy table:

CopyID DVDID BID Status LastDateRented LastDateReturned MemberID
C000 D001 B001 7/10/2014 7/13/2014 M001
C001 D004 B001 7/10/2014 7/13/2014 M001
C002 D001 B001 7/10/2014 M001
C003 D005 B001 7/10/2014 7/13/2014 M003
C004 D006 B001 7/10/2014 7/13/2014 M003
C009 D004 B002 7/10/2014 7/13/2014 M009
C010 D005 B002 7/10/2014 7/13/2014 M010
C011 D006 B002 Miss 7/10/2014 7/13/2014 M010

esProc script:

esProc_report_problems_27

Explanation: Select available DVD copies possessed by branch stores from DVDCopy table and group them by BID to get copies of each branch store; find DVDs corresponding to the DVD copies each branch store has and calculate how many categories where they fall in; find branch stores that have fewer than 4 categories of available DVD copies and that don’t exist in the DVDCopy table (both types are eligible); then combine data of the two types stores. The result is as follows:

esProc_report_problems_28

Fixed grouping

Sometimes the grouping criteria come from other database tables or files, or are defined by external parameters, and their number is large and there is big difference between each other; other times the grouping criteria are dynamic and overlap each other. The problem is that SQL grouping criterion must be some field in the current database table, so it’s difficult to handle this kind of fixed grouping in SQL. esProc can easily implement this algorithm because it supports alignment grouping, enumeration grouping and the flexible equal grouping.

Grouping data by a parameter list

The sales table stores ordering records, in which CLIENT filed holds client names and AMOUNT field holds order amount. The report requires that CLIENT be grouped by the parameter list argClient and that AMOUNT be aggregated to calculate sums. Below is the source data:

OrderID Client SellerId Amount OrderDate
26 TAS 1 2142.4 2009/8/5
33 DSGC 1 613.2 2009/8/14
84 GC 1 88.5 2009/10/16
133 HU 1 1419.8 2010/12/12
32 JFS 3 468 2009/8/13
39 NR 3 3016 2010/8/21
43 KT 3 2169 2009/8/27

If the parameter list has relatively small number of members, you can compose CLIENT into a dummy table with SQL union statement. If it has a large number of members, you can store data permanently in a database table before composing. This problem, however, has a dynamic, external parameter list with unknown number of members and containing members that even don’t exist in CLIENT field. To solve it in SQL, you must create a temporary table. It’s unnecessary to create such a temporary table in esProc. The following is esProc code:

esProc_report_problems_29

Explanation: Group sales with align function by the parameter list, calculate sum for each group and join the result with the parameter list. Here is the result:

esProc_report_problems_30

Grouping data by overlapped criteria

Suppose you want to group and aggregate the sales table according to criteria that are probably overlapped, such as the order amount is between 1,000 and 4,000, less than 2,000 and greater than 3,000. You cannot implement the algorithm in SQL without using a great many of union/with/minus statements, and this really is difficult when the criteria are specified through a parameter. In contrast, esProc does this with just the following code:

esProc_report_problems_31

Explanation: enum function can perform grouping operation with grouping criteria specified by an external parameter. @r allows duplicate members. Here is the result:

esProc_report_problems_32

Cross-tab reports with rows and columns presented in intervals

You might want to count employees with a cross-tab report. Columns are grouped by age ranges and rows are grouped by salary ranges. For instance, 5 columns –  1-30,31-32,33-35,36-40,41-50, and 3 rows – 1-5000,5001-10000,10001-20000 – would be displayed if you type in [30,32,35,40,50] and [5000,10000,20000] respectively. The following is source data:

EId Name SurName Gender State Birthday Salary HireDate Dept
2 Ashley Wilson F New york 1980-07-19 00:00:00 11000 2008-03-16 00:00:00 2
3 Rachel Johnson F New mexico 1970-12-17 00:00:00 9000 2010-12-01 00:00:00 1
4 Emily Smith F Texas 1985-03-07 00:00:00 7000 2006-08-15 00:00:00 3
5 Ashley Smith F Texas 1975-05-13 00:00:00 16000 2004-07-30 00:00:00 1
6 Matthew Johnson M California 1984-07-07 00:00:00 11000 2005-07-07 00:00:00 4
7 Alexis Smith F Illinois 1972-08-16 00:00:00 9000 2002-08-16 00:00:00 2
8 Megan Wilson F California 1979-04-19 00:00:00 11000 1984-04-19 00:00:00 5
9 Victoria Davis F Texas 1983-12-07 00:00:00 3000 2009-12-07 00:00:00 6
10 Ryan Johnson M Pennsylvania 1976-03-12 00:00:00 13000 2006-03-12 00:00:00 1
11 Jacob Moore M Texas 1974-12-16 00:00:00 12000 2004-12-16 00:00:00 3
12 Jessica Davis F New york 1980-09-11 00:00:00 7000 2008-09-11 00:00:00 4
13 Daniel Davis M Florida 1982-05-14 00:00:00 10000 2010-05-14 00:00:00 5

esProc script:

esProc_report_problems_33

Explanation: Group A1 by parameter intervals argAge and argSalary, calculate the number of employees in each group, and finally change the column and row names to a format permitted by the report. You can also calculate ages with SQL, but the esProc age function can prevent SQL code from being altered during database migration. pseg function will return a member’s sequence number in the interval. This is the result you’ll get:

esProc_report_problems_34

Post-grouping order-related computations

Performing order-related computations based on grouped data makes code writing in SQL even harder. But usually esProc handles it easily.

Calculating the remaining class periods

Every student has the same 10-day long class periods. The database table Course stores each student’s attendance information during these days. The report needs to present the classes that are already attended and the remaining class periods for each student each month. Below is the source data:

ID Name Date
1 Sandy 2015-05-06
2 Candy 2015-05-06
3 Sandy 2015-05-28
4 Candy 2015-05-29
5 Candy 2015-06-01

esProc script:

esProc_report_problems_35

Explanation: Perform group and aggregate operations to calculate the number of attended classes, and add a computed column to get the remaining number of class periods. Here is the result:

esProc_report_problems_36

Getting top N members

The database table sOrder stores orders per day from each seller. The report needs to present each seller’s N biggest orders. N is a parameter. Below is the source data:

OrderID Client SellerId Amount OrderDate
26 TAS 1 2142 2009-08-05 00:00:00
33 DSGC 1 613 2009-08-14 00:00:00
84 GC 1 89 2009-10-16 00:00:00
133 HU 1 1420 2010-12-12 00:00:00
32 JFS 3 468 2009-08-13 00:00:00
39 NR 3 3016 2010-08-21 00:00:00
43 KT 3 2169 2009-08-27 00:00:00
71 JFE 3 240 2010-10-01 00:00:00
99 RA 3 1731 2009-11-05 00:00:00

esProc script:

esProc_report_problems_37

Explanation: Group data by SellerID; get N records that have the biggest Amount values from each group with top function and concatenate them with union function. -Amount means getting records from large to small values. Here is the result:

esProc_report_problems_38

Use A2.(~.maxp@a(Amount)) to retrieve the record, or the records holding the greatest value from each group. To retrieve the first record holding the greatest value from each group, sort the records by Amount in SQL and use A1.group@1(SellerId). Use A2.group(SellerId;~.to(argM,argN)).conj(#2) to get records from the Mth to the Nth from each group. To compare the Amount field values between the first record and the second record in a group, use A1.group(SellerId;~(1).Amount-~.m(2).Amount). The m function can get members by their sequence numbers and judge where a group ends (when a group has only one record).

Modifying data to make even value for each group

Suppose you want to create a grouped report that gets data from the database table stock, which makes Title as the grouping field and has detail fields ID and Number. But you need to modify the smallest Number value in each group to ensure that the sum of Number values is 800. Below is the source data:

ID Title Number
1 A 200
2 A 300
3 B 400
4 C 500

esProc script:

esProc_report_problems_39

Explanation: Sort data by Title and Number and then group data by Title; loop through each group to modify the first Number value so that the sum of Number is 800. Here is the result:

esProc_report_problems_40

String splitting and concatenation

Sometimes you need to handle structured data processing based on strings generated from splitting a larger string. Other times you need to split one row into multiple rows or combine multiple rows into one. It’s difficult to implement these calculations in SQL. esProc, however, is better at handling them thanks to its support for explicit sets.

Splitting one row into multiple rows

The database table data has two fields. You want to split one of the fields – ANOMOALIES – into multiple strings and combine each string with the corresponding ID value in the original table to generate new records. Below is the source data:

ID ANOMALIES
3903 B1 D1 CAT1
3904 D7 D2 B1 CAD4

esProc script:

esProc_report_problems_41

Explanation: Split ANOMOALIES field of each record into a sequence according to spaces, create two-dimensional tables according to the sequences respectively and concatenate these two-dimensional tables. This is the result you’ll get:

esProc_report_problems_42

Concatenating members by grouping fields

In the database table table1, the first three fields are grouping fields and the fourth one is the detail field. You need to build a report where values of the fourth field are combined according to the grouping fields. Below is the source data:

Col1 Col2 Col3 Col4
10 1234 Sam A12G3
10 1234 Sam K78DE
10 1234 Sam MAT12
20 1456 Tom E12F4
20 1456 Tom KAR3R
20 3217 Tom G45G4

esProc script: myDB1.query(“select * from table1”).group(Col1,Col2,Col3;~.(Col4).string():Col4

Explanation: The aim of this case is to implement group_concat. But not every database supports this function. You can use esProc to achieve the same target, regardless of which database you are using. The following is the result:

esProc_report_problems_43

Date handling

A sequence containing datetime data is more complicated than the one consisting of natural numbers. Thus you may found it’s very difficult to handle dates in SQL, such as getting work days, intersection of some dates and combining time intervals into a larger one. Equipped with built-in date functions and ordered sets, esProc is convenient to use in date handling.

Calculating work days

The database table project stores project information including beginning and ending dates. The report needs to present the information and an extra column for displaying the number of work days. Below is the source data:

pid pname manager beginDate endDate
p041-27 project1 steven 2014-01-08 2014-07-29
p042-v-1 project2 steven 2014-04-01 2014-06-05
p03-09 project3 mike 2014-08-12 2015-01-31
p02-04 project4 elisa 2014-09-22 2015-09-22

esProc script:

esProc_report_problems_44

Explanation: With workdays function, it’s easy to calculate the number of work days during a certain time period. Here is the result:

esProc_report_problems_45

Calculating peak travel dates

The database table trip stores business trip information. You need to get the 10 busiest dates and sort them by the number of trips. Below is the source data:

       star_date end_date
2015-01-05 2015-03-02
2015-02-01 2015-02-10
2015-02-05 2015-02-09
2015-02-20 2015-04-05
2015-03-06 2015-03-15
2015-03-08 2015-03-11
2015-03-10 2015-03-22
2015-03-12 2015-03-25
2015-03-15 2015-03-23
2015-03-15 2015-03-19
2015-03-18 2015-03-29

esProc script:

esProc_report_problems_46

Explanation: Split each trip into a sequence of dates according to the beginning and ending dates, concatenate the sequences and perform group and aggregate by dates; and then sort the records to get the first 10 ones. Here is the result:

esProc_report_problems_47

Merging overlapped time intervals

In the database table project, beginDate and endDate represent the beginning date and ending date of each project. The report needs to present the time intervals with or without projects (each is designated by “busy” and “free”) in sequence. Below is the source data:

pname beginDate endDate
project1 2015-01-06 2015-01-31
project2 2015-01-15 2015-02-11
project3 2015-03-10 2015-03-22
project4 2015-03-12 2015-03-25
project5 2015-03-15 2015-03-23
project6 2015-03-25 2015-04-10

esProc script:

esProc_report_problems_48

Explanation: A2-A4 merges smaller time intervals into bigger ones. This is done by first removing dates within larger intervals. That is, if the ending date of a time interval is after the beginning date of the next interval, then merge them as a bigger interval from the beginning date of the first interval. After that, group data by dates and use the ending date of the last record in each group as the bigger interval’s ending date. A5 calculates free time intervals. Finally, concatenate the busy and free time intervals and perform sort. Here is the result:

esProc_report_problems_49

 

Posted in Report | Leave a comment

The Three Stages in Building Reports with Heterogeneous Data Sources

There are many different types of report data sources, including relational databases, NoSQL databases, local files, HDFS files and JSON data stream. It’s easy to build a report with a single data source, but it’s difficult to build one that needs data from more than one type of data source, i.e. heterogeneous data sources. The data preprocessing involving different data sources is always one of the bottlenecks in report development. This article will introduce three stages, from raw to perfect, in dealing with the heterogeneous data sources, and attempt in each stage to offer one or a few solutions with an example detailed.

The first stage: Foreign key lookups

Problem definition: Data set A has two fields – code and value, and data set B has multiple attribute fields, including a code field which is the logical foreign key pointing to data set A. The requirement is to present data set B in a report, where the code field is replaced with data set A’s value field.

Solution 1: SQL query

With each record of data set B displayed, code value will be passed as the parameter value to data set A, which will then receive a SQL query to show its value field in place of B’s code value in the cell.

Advantages: Most reporting tools support this solution. It just needs the simple configuration without writing an expression or a script.

Disadvantages: The solution requires multiple queries and thus has an unsatisfactory performance. It serves to display only the values referenced by the foreign key, but is far from being able to handle data preparation based on multiple types of data sources. Besides, the data source itself must support queries against it, like the database. Data sources that doesn’t support queries, such as files or the JSON data stream, just defy the solution.

Solution 2: Making use of the built-in functions of reporting tools, such as the script query.

With each record of data set B displayed, traverse data set A using a script to find out the corresponding value field to display in the cell. This is a more common method.

Advantages: Satisfactory performance and support for any type of data source.

Disadvantages: Sometimes a script is needed, but since different types of reports require different approaches, there isn’t a standard way to implement the solution. Some reports allow for direct traversal on data set A, but others require that data set A be stored in a global variable (like BIRT) in advance. Also, the solution has a very narrow range of applications, and is unable to handle all scenarios involving multiple types of data sources.

Example: Emp is an Oracle table and Sales is a MySQL table. SellerId field is the logical foreign key of the Sales table that points to the Emp table’s EId field. The requirement is to display the Sales table in a report where SellerId field is replaced with Name field of the Emp table. Below is the source data:

esProc_report_three_stages_1

We’ll use SQL queries to build the report in BIRT with the following steps:

  1. Create data sources and datasets.

Create two data sources – oracleDB and mySQLDB.

Create data set dsEmp based on oracleDB using the SQL statement “select emptb.Name from emptb where emptb.EId=?”, in which “?” corresponds to the default parameter param_1.

Create data set dsSales based on mySQLDB using the SQL statement “select salestb.OrderID,salestb.Client,salestb.SellerId,salestb.Amount,salestb.OrderDate from salestb”.

  1. Build a table-type report according to the dsSales data set:

esProc_report_three_stages_2

3. Switch to a new column binding:

Right click on cell [SellerId], select Change data Column, and click dsEmp in the Data Set drop down list and check “Name” field.

After the above operation, the Display Name for the cell will automatically turn into “Name”.

4. Set foreign key parameter

Click cell [Name], open the “Binding” page in the Property Editor, click “Data Set Parameter Binding” and edit the “param_1” in Expression Editor by changing its “Value”, and then select Available Column Bindings, table and SellerId.

After that, the value of param_1 will present itself as an expression – row[“SellerId”].

5. Flavor the report with a more tidy format, as the preview below shows:

esProc_report_three_stages_3

The second stage: Joining data sets

In the first stage, the problem of displaying values referenced by the foreign key has been solved. But it isn’t the handling of heterogeneous data sources in its real sense for report building. It is in this second stage that we begin to face the computational challenge.

Problem definition: Join different types of data sets outside the database to generate a single data source, and then present data from it in the report.

Solution: Use the built-in functions of reporting tools, such as BIRT’s Joint Data Set option or Jasper’s Virtual Data Source.

Advantages: Only a few steps are needed and no script is required. The solution has some degree of universality, applying to common cases of report development with heterogeneous data sources. It supports any other type of data source, in addition to the databases. But if the data source is the databases, a pre-join and a simple post-join are allowed.

Disadvantages: Only a few reporting tools support the solution. The built-in functions are tightly coupled with the reporting tools and thus almost resist portability. To join N data sets, we need to perform the operation for N-1 times and generate N-2 extra data sets. Despite the universality to some extent, it has too many restrictions to have the free computing power over heterogeneous data source handling.

Example:

Both Cities.txt and States.txt are tab-separated text files, related through StateId. The requirement is to build a group report to present the big cities of each state, with headers being Name and ShortName in States.txt and detailed data being Name and Population in Cities.txt. Below is a selection of the source data:

esProc_report_three_stages_4

We’ll use BIRT’s Joint Data Set to create the report in the following steps:

  1. Create two data sets.

dsCities: The data comes from Cities.txt and uses the TSV format.

dsStates: The data comes from States.txt and uses the TSV format.

  1. Create the joint dataset, by performing a left join on the two data sets.

esProc_report_three_stages_5

3. Design a group report according to the joint dataset, as shown below. No extra configuration is needed.

esProc_report_three_stages_6

4. A preview:

esProc_report_three_stages_7

The third stage: Free handling of heterogeneous data sources

A real touching of heterogeneous data source handling though, the second stage lacks many features to do a better job. If the data source itself doesn’t have computing ability, it’s hard to do a pre-join, such as files or JSON data stream. And the post-join can only be used in simple querying and static querying. It’s almost impossible to implement queries based on the flexible composite field and the variable, let alone the multi-step computations, such as data grouping, aggregation and re-filtering.

In addition, the operation of joining data sets has a tight coupling with the reporting tools, making it difficult to perform multilevel associations, and impossible to realize dynamic data sources and multi-data-source subreports.

While in the third stage, we can completely throw off all restraints to achieve free manipulation of heterogeneous data sources.

Solution 1: Combine heterogeneous data sources into a single data source by making use of the database.

Advantages: High performance and high data consistency.

Disadvantages: Substantial amount of work is needed in development, and the hardware and software costs are high.

Solution 2: Consolidate heterogeneous data sources using the user-defined data source or high-level languages like JAVA.

Advantages: It boasts the highest degree of computing freedom.

Disadvantages: Low performance; huge developing workload due to the lack of class libraries for structured-data handling; and extremely tight coupling with data sources and reporting tools.

Solution 3: Professional computing middlewares, such as esProc.

Advantages: This solution is essentially the variant of the previous solution. Take esProc as an example. It works as the computing middleware with loose coupling, supports any type of data source, provides rich class libraries for handling structured data to significantly reduce the workload in program development, requires no extra hardware cost, produces simple and high-performance parallel code, as well as enables convenient integration with reporting tools by offering JDBC interface.

Disadvantages: Despite the light developing workload, the solution is unable to achieve zero coding. And it pales in maintaining data consistency in comparison to the data warehouse.

Example:

Sales is a MySQL table holding each day’s orders of some salespeople. The SellerId field contains these salespeople’s numbers. Emp.txt is a text file, recording information of the salespeople. EId field contains their numbers. The requirement is to present the OrderID, OrderDate, Amount, Seller Name and DeptName with the condition that the ordering dates are within the past N days (Suppose N=30) or the orders belong to certain important departments (such as Marketing and Finance).

The Sales table:

esProc_report_three_stages_8

The Emp.txt file:

esProc_report_three_stages_9

All the three solutions in the above apply to this case. Here we choose to use esProc to build the report in Jasper. Steps are as follows:

  1. Write esProc code:

esProc_report_three_stages_10

Explanation: Query records of the Sales table from myDB1 and import records from Emp.txt. Create an association between them through left join. Filter the related data set according to the condition that the ordering dates are within the last N days (corresponding parameter is days) (expression: OrderDate>=after(date(now()),days*-1)), or that the orders belong to certain important departments (corresponding parameter is depts) (expression: depts.array().pos(SellerId.Dept)). The operator || represents the logical relationship of “OR”. Finally, retrieve all the desired fields.

The switch function replaces the foreign key values with the corresponding referenced records. Here A1’s field is replaced with A2’s records, as shown below:

esProc_report_three_stages_11

A5 gets the final result as follows:

esProc_report_three_stages_12

2. Prepare data source and data sets

esProc provides JDBC interface to the external applications. Both esProc and the common databases have the same way of creating data sources. For more details, see How to Use esProc to Assist Reporting Tools. Note: With the JDBC driver, esProc script can be accessed by any Java-based reporting tool, such as Jasper, BIRT, crystal, Style Report and FreeReportBuilder.

After the data source is prepared, save the esProc script as afterjoin1.dfx, which can be called from JasperReport SQL Designer using the code afterJoin1 $P{pdays},$P{pdepts}. The parameters pdays and pdeps correspond to the two parameters in the esProc script.

3. Design a simple list, without extra configuration:

esProc_report_three_stages_13

4. A preview of the report:

esProc_report_three_stages_14

Applications of the third-stage solutions

Below are several typical types of reports with heterogeneous data sources. To check the universality of the third-stage solutions, we build them using these solutions.

Case 1: Dynamic data sources

The data sources myDB1 and oraDB point to different databases, which each hold a same-structure sOrder table. The desired report will be able to connect to the data sources dynamically depending on the parameter to query sOrder and display orders whose amounts are greater than 1000.

Below is a selection of sOrder in myDB1:

esProc_report_three_stages_15

Below is a selection of sOrder in oraDB:

esProc_report_three_stages_16

Both foreign key lookup and the method of joining data sets are helpless in creating such a report, so we can only use the solutions in the third stage. Below is the esProc code:

esProc_report_three_stages_17

pSource and pAmount are report parameters. The former represents the data source name, in which ${…} means parsing a string or a string variable into an expression. The latter represents the ordering amount.

When pSource value is “myDB1”, A1’s result is as follows:

esProc_report_three_stages_18

When pSource value is “oraDB”, A1’s result is as follows:

esProc_report_three_stages_19

The main report with a subreport using heterogeneous data sources

Design a main report with a subreport to present ordering records of each employee by salary range. The source of the data used by the main report is emp, a MySQL table, and the source of the data used by the subreport is sales, a MSSQL table.

For a report using the same type of data source, if the main report and the subreport get data from different sources, the database URLs should be explicitly passed in or the different data sources of the same type should be combined using Java classes. The first method will compromise security and the second one will produce complicated code. But with the-third-stage solutions, this can be handled conveniently. Below is the esProc script:

empEsProc.dfx (This the script for the main report)

esProc_report_three_stages_20

A1: Query the emp table in the MySQL database according to the salary ranges.

salesEsProc.dfx (This the script for the subreport)

esProc_report_three_stages_21

A1: Find out the matching orders from the sales table in MSSQL database by employee ID. Suppose eid is 1, then A1’s result is as follows:

esProc_report_three_stages_22

We can see that the two data sources have been combined into one data source, and the main report and the subreport just need to call different esProc scripts to display their data.

There’s another type of report called multi-data-source subreports, which means a main report has multiple subreports (or table controls) that use different data sources. Only a third-stage solution is capable of building such a report.

Case 2: Multilevel associations

The Orders table is the fact table stored in the database, the Customer, Product, Region and Supplier tables are dimension tables coming from four files. Their relationships are shown as follows:

esProc_report_three_stages_23

The requirement is to find out the orders whose customers and suppliers belong to the same region, calculate the number of the orders and their amount in each city, and present the result in a report.

The method of joining data sets can be used to handle the report building in this case, but it will generate a lot of intermediate result sets. By comparison, the third-stage solutions are much simpler. Below is the esProc code:

esProc_report_three_stages_24

Explanation: Import the text files and establish relationships between the fact table and the dimension tables. Then query the desired orders according to these relationships and perform group and aggregate operations.

Case 4: Complex data compuation with heterogeneous data sources

According to the database table Sales and the emp.txt file, find out the top three days when the sales amount of each salesperson increases the most after a given date, and display the salespeople’s names, the three dates, the sales amounts and the growth rates in a report.

Limited by the computing power, the second-stage solution can’t perform the post-join multi-step computation this case will present. But a third-stage solution can handle it easily. Below is the esProc code:

esProc_report_three_stages_25

Explanation: Create data sets by retrieving data from the two data sources and create an association between them. Group the related result set by SellerId. Then regroup each group of data by OrderDate and SellerId and aggregate the order amount for every group, that is, the sales amount per day per salesperson. A6 calculates the daily growth rate of the sales amount for each salesperson, using the formula “Sales amount of the current day – Sales amount of the previous day)/ Sales amount of the previous day”. esProc uses subtotal[-1] to represent the sales amount of the previous day, expressing relative positions easily. In the next step, remove the first record from each group (because it doesn’t have the growth rate), find out the records of the top three dates when the sales amount of each salesperson increases the fastest, and concatenate the desired records from each group. In the last step, export all the desired fields as follows:

esProc_report_three_stages_26

 

 

Posted in Report | Leave a comment

esProc Assists BIRT in Handling Irregular Month Grouping

Problem source: http://developer.actuate.com/community/forum/index.php?/topic/36323-months-and-quarters-group-issue/

Irregular month grouping: If the start date is 2014-01-10, group dates from this date to 2014-02-09 together and dates from 2014-02-10 to 2014-03-9 together. If the start date is 2014-01-31, put dates from this date to 2014-02-27 into a group and take dates from 2014-02-28 to 2014-03-30 as a group.

esPro code:

esProc_report_birt_irregular_1

A1: Query database according to the start date and end date. Both startDate and endDate are external parameters.

A2: Count the number of months between the start date and the end date. For example, there are 6 month between 2014-01-31 and 2014-07-31.

B2: Find the day when each of the irregular months begins according to the initial start date and the intervals. Put the start date before the expression for this computation. The sign “|” represents concatenation and “~” represents the current member in the set, i.e. the numbers from 1 to 6. after function is used to get the irregular months.

esProc_report_birt_irregular_2

A3: Group A1’s data by B2’s intervals and calculate sales amount for each of these irregular months; B2 is appended as the last column. pseg function returns the sequence number of the interval in which the data is held. “~” in ~.sum(Amount) represents the current group. “#” represents the sequence number of the current group. Result is as follows:

esProc_report_birt_irregular_3

A4: Retrieve the second and third column from A3 and return them to the reporting tool. esProc provides JDBC interface outward to be identified by a reporting tool as a normal database.

Create a simple list table with BIRT:

esProc_report_birt_irregular_4

A report calls an esProc script in the same way as it calls the stored procedure. Save the above script as BirtUnregulMonth.dfx. You can invoke it with call BirtUnregulMonth(?,?) from BIRT’s stored procedure designer. Below is a preview of the appearance and layout of the result report:

esProc_report_birt_irregular_5

Posted in Report | Tagged , , , | Leave a comment

Calculate Growth Rate in Jasper Crosstabs

Problem source: http://community.jaspersoft.com/questions/847490/how-get-annual-growth-rate-crosstab .

As every column in a crosstab is generated dynamically, you also need to reference them dynamically when performing inter-row calculations. There is some difficulty in handling this dynamic reference using a Jasper script. But the data preparation can be made easier using esProc. Let’s look at an example.

The database table store holds sales amount of multiple products in the year 2014 and 2015. You need to display the sales amount of each product per year using a crosstab and calculate the annual growth rate of every product. Below is a selectin from original data:

esProc_report_jasper_crosstab_1

esProc code:

esProc_report_jasper_crosstab_2

A1: Retrieve records from the store table.

A2: Append annual growth rate of every product to A1. group is used to group data by products; run is used to perform the required calculations by loop; and record is used to append records. ~(i) represents the ith record in the current group. Below is A2’s result:

esProc_report_jasper_crosstab_3

A3: Return A2’s result to the report. Reporting tools will identify esProc equipped with JDBC interface as a normal database.

Then you can create the simplest crosstab with Jasper:

esProc_report_jasper_crosstab_4

Below is a preview of the finished report:

esProc_report_jasper_crosstab_5

A report calls an esProc script in the same way as it calls the stored procedure. Save the above script as AnnualRate.dfx. You can invoke it with call AnnualRate () and input parameters into it from Jasper’s SQL designer.

Posted in Report | Tagged , , | Leave a comment

esProc Joins a Structured Text file with a JSON File

structure.txt is a tab-separated structured text file. json.txt contains non-structured JSON strings. We need to join the two files to create a new file result.txt. The original data is as follows:

structure.txt

esProc_text_structure_json_1

Json.txt

esProc_text_structure_json_2

Result.txt

esProc_text_structure_json_3

esProc will first import json.txt as a structured table sequence and then join it with structure.txt. The code is as follows:

esProc_text_structure_json_4

A1: Import the JSON file into the memory as a table sequence.

A2: Retrieve name column and cluster column from it. #1 represents the first column. Result is as follows:

esProc_text_structure_json_5

A3: Split cluster column into two parts, name them key and value respectively and thus generate a structured data object. Result is as follows:

esProc_text_structure_json_6

A4: Import the structured text file.

A5: Perform join operation. _2 represents A4’s second column. Result is as follows:

esProc_text_structure_json_7

A6,B7: Retrieve desired columns and export them to result.txt.

Posted in Structured Text, Text Processing | Tagged , , | Leave a comment

Find Differences between Text Files with esProc

Problem source:http://bbs.csdn.net/topics/360033579

Compare the following two text files of the same format and write the differences (different records) into a text file.

Content of 1.txt

Proto      Local Address Foreign Address     State

TCP 111.11.1.111:1975   221.181.70.12:5926       ESTABLISHED

Content of 2.txt

Proto      Local Address Foreign Address     State

TCP 111.11.1.111:1975   221.181.70.12:5926       ESTABLISHED

TCP 111.11.1.111:1111   169.254.241.33:2222     ESTABLISHED

 

Content of result.txt with exported data will be:

Proto      Local Address Foreign Address     State

TCP 111.11.1.111:1111   169.254.241.33:2222     ESTABLISHED

esProc script for doing this:

2015-05-25_143240

A1: The path where text files are stored

A2: fns is the script parameter for storing the two text files, 1.txt and 2.txt, under comparison. Import the files and sort each of them by all the fields

A3: The first half line of the code – A2.merge@d(#1,#2,#3,#4) – gets difference between 1.txt and 2.txt; the second half – A2.rvs().merge@d(#1,#2,#3,#4) – gets difference beweeen 2.txt and 1.txt. Results will then be merged.

A4: Export A3’s result to result.txt, which concludes the computation.

Posted in Text Processing | Tagged , , | Leave a comment