Examples of Non-database Sources in Report Building

 

Besides databases, local files, HDFS files ,NoSQL databases and JSON data streams are also frequently used data sources in building reports. These types of data sources are not as good as databases in terms of computational abilities. Since they can’t implement algorithms such as statistical queries, group and aggregation, and sort and merge in the stage of data source preparation, they can only achieve the same effect using complicated report scripts or JAVA language.

esProc has rich class libraries for structured-data handling and supports heterogeneous data sources, making it convenient to build reports with Non-database sources, and thus a suitable tool for report data source preparation. The reporting tool regards an esProc script as the stored procedure, passes parameter to it and gets the returned result after execution through JDBC. For more details, see How to Use esProc to Assist Reporting Tools.

Here follows some typical scenarios about Non-database sources and offers their esProc solutions.

Simple grouping and aggregation

The sales.csv file stores sales orders. Requirement: In the reporting tool, group data by SellerId and calculate the total sales amount per seller. Below is the source file:

OrderID   Client        SellerId     Amount    OrderDate

26     TAS   1       2142.4      2009-08-05

33     DSGC        1       613.2        2009-08-14

84     GC    1       88.5 2009-10-16

133  HU   1       1419.8      2010-12-12

32     JFS    3       468.0        2009-08-13

39     NR    3       3016.0      2010-08-21

43     KT     3       2169.0      2009-08-27

71     JFE   3       240.4        2010-10-01

99     RA    3       1731.2      2009-11-05

The reporting tool can import data in its entirety, perform group and aggregation and then hide the detailed data. But the performance is not equal to that of esProc used specially for arranging data source.

  A
1 =file(“D:\\sales.csv”).import@t()
2 =A1.groups(SellerId;sum(Amount))

Explanation: First import the source file as a structured two-dimensional table; then perform the group and aggregate.

By default The import function imports all fields using tab as the separator; @t means importing the first row as the field names. The groups function can perform group and aggregate on a two-dimensional table.

Here’s the result:

esProc_text_file_group_1

Extended information:

  1. If the source data has been ordered by SellerId, we could use @o to speed up the data handling by only comparing the adjacent records with the code =A1.groups@o(SellerId;sum(Amount)).
  2. esProc allows for grouping data according to multiple fields and the implementation of various aggregate algorithms. To calculate the sales amount and count the orders for each seller per year, for example, we could use this code:groups(SellerId,year(OrderDate);sum(Amount),count(OrderID)).

Here’s the result:

esProc_text_file_group_2

  1. When the source file is too big to be entirely loaded into the memory but the aggregate result would be smaller (this is what usually happens), we can import the file as the cursor and perform the aggregate with the group function, thus the code will be file(“D:\\sales.txt”).cursor@t().groups(SellerId;sum(Amount)).
  2. If the aggregate result set isn’t small enough to be completely put into the memory, we could carry out the aggregation using groupx Unlike the group function, groupx returns a cursor.

Duplicate-related operations

This scenario is about removing the duplicate rows in the Sales.csv file and present the result in the report. The rule is: for two rows with the same OrderID, only keep the first one. Below is a selection of the source file:

OrderID   Client        SellerId     Amount    OrderDate

1       WVF Vip 5       440.0        2009-02-05

1       WVF Vip 5       1040.0      2009-02-04

1       UFS Com          5       1040.0      2009-02-03

2       UFS Com  13     1863.4      2009-07-05

2       UFS Com  13     1863.4      2009-07-05

3       SWFR        2       1813.0      2009-07-08

4       JFS Pep     27     670.8        2009-07-08

5     DSG       15    3730.0    2009-07-09……

esProc code:

  A
1 =file(“D:\\sales.csv”).import@t()
2 =A1.group@1(OrderID)

Explanation: Group the imported data by OrderID and get the first record of each group without aggregation. The group function is able to group data without performing the aggregate; @1 means getting the first row of each group and returning the results as a new two-dimensional table。

Here’s the result:

esProc_text_file_group_3

Extended information:

  1. It’s a more practical practice to retain records according to a certain condition. To keep the record with the earliest date in each group, use the following code: group(OrderID).(~.maxp(-OrderDate)).

Here’s the result:

esProc_text_file_group_4

To find the first of each group of records sorted by OrderDate in ascending order and by Amount in descending order, we can find the records with earliest OrderDate and then find the one from them with the greatest amount. The code is A1.group(OrderID).(~.maxp([-OrderDate,Amount])).

If there are multiple records satisfying the condition and we want them all, use A1.group(OrderID).conj(~.maxp@a([-OrderDate,Amount])) .

To get the top two from the multiple eligible records, use

A1.group(OrderID).conj(~.top(2,OrderDate,-Amount)).

  1. Sometimes it is the duplicate records that we want, then get them with group(OrderID).select(~.len()>1).conj().

Here’s the result:

esProc_text_file_group_5

  1. Sometimes we simply want the distinct values of a certain field. To get the client list, for instance, use id(Client). The result:

esProc_text_file_group_6

The id function performs the distinct operation, so it is the equivalent of A1.group@1(Client).(Client).

  1. Other times we want all the rows that have different field values, instead of filtering the records by ID. In those cases we do it with

A1.group@1(OrderID,Client,SellerId,Amount,OrderDate)

The result:

esProc_text_file_group_7

Without intending further structured-data computation, we can import the file as a set of rows and perform distinct operation using file(“D:sales.csv”).import@tsi().id(). This is what we get:

esProc_text_file_group_8

The grouped report from a multi-level JSON file

Cells.json is a multi-level nested JSON file, which you need to present in a grouped report. The grouping fields are name, type and image.”xlink:href”. There is another a field with 3 subdocuments: custom. Identifier, custom. Classifier and custom. Output, which are of the same structure but contain different number of documents each.

Below is the source data:

{

“cells”: [

{

“name”: “b”,

“type”: “basic.Sensor”,

“custom”: {

“identifier”: [

{

“name”: “Name1”,

“URI”: “Value1”

},

{

“name”: “Name4”,

“URI”: “Value4”

}

],

“classifier”: [

{

“name”: “Name2”,

“URI”: “Value2”

}

],

“output”: [

{

“name”: “Name3”,

“URI”: “Value3”

}

]

},

“image”: {

“width”: 50,

“height”: 50,

“xlink:href”: “”

}

},

……

]

}

 

esProc code:

  A
1 =file(“D:\\cells.JSON”).read().import@j()
2 =A1.cells.conj(custom.(  identifier.new(name:cname,URI:cURI,”identifiler”:cType,A1.cells.name:name,A1.cells.type:type,A1.cells.image.’xlink:href’:image)|

classifier.new(name:cname,URI:cURI,”classifier”:cType,A1.cells.name:name,A1.cells.type:type,A1.cells.image.’xlink:href’:image)|

output.new(name:cname,URI:cURI,”output”:cType,A1.cells.name:name,A1.cells.type:type,A1.cells.image.’xlink:href’:image)))

esProc merges the three subdocuments into a single two-dimensional table, gives them a new field name ctype to be identified and appends the grouping fields to the tabled. Thus a typical “table with subtables” is created. Below is the merging result of A2:

esProc_text_file_group_9

Now it’s easy to build a grouped report based on this esProc result.

Joins between MongoDB collections

Both sales and emp are MongoDB collections. sales has SellerId field as its logical foreign key that points to emp’s EId field. You need to query orders in sales by the specified time period and associate them with emp through a left join, and then present the result in a report. Below are selections of the source data:

Collection sales

OrderID Client SellerId Amount OrderDate
26 TAS 1 2142.4 2009-08-05
27 HP 13 538.6 2010-08-06
28 DY 11 307.2 2010-08-07
29 WVF 19 420 2010-08-08
30 UFS 18 1200.8 2009-08-09
31 SWFR 15 1488.8 2010-08-12
32 JFS 3 468 2009-08-13

Collection emp

EId State Dept Name Gender Salary Birthday
2 New York Finance Ashley F 11001 1980-07-19
3 New Mexico Sales Rachel F 9000 1970-12-17
4 Texas HR Emily F 7000 1985-03-07
5 Texas R&D Ashley F 16000 1975-05-13
6 California Sales Matthew M 11000 1984-07-07
7 Illinois Sales Alexis F 9000 1972-08-16

esProc code:

  A
1 =MongoDB(“mongo://localhost:27017/test?user=root&password=sa”)
2 =A1.find(“sales”,”{‘$and’:[{‘OrderDate’:{‘$gte’:'”+string(begin)+”‘}},{‘OrderDate’:{‘$lte’:'”+string(end)+”‘}}]}”,”{_id:0}”).fetch()
3 =A1.find(“emp”,,”{_id:0}”).fetch()
4 =A1.close()
5 =join@1(A2:sales,SellerId;A3:emp,EId)
6 =A5.new(sales.OrderID:OrderID,sales.Client:Client,sales.Amount:Amount,sales.OrderDate:OrderDate,emp.Name:Name,emp.Dept:Dept,emp.Gender:Gender)

Access MongoDB to perform queries on collection sales and collection emp respectively, associate them through a left join, and retrieve the desired fields. Here’s the result:

esProc_text_file_group_10

Query on the HDFS file

The text file employee.gz stored in HDFS contains employee data. The requirement is to perform query by birthday and gender and then present the result. The query condition is, for example, the female employees who were born after January 1, 1981 inclusive. Below is a selection of the source data:

EID   NAME       SURNAME        GENDER  STATE        BIRTHDAY        HIREDATE         DEPT         SALARY

1       Rebecca   Moore      F       California 1974-11-20       2005-03-11       R&D          7000

2       Ashley      Wilson      F       New York 1980-07-19       2008-03-16       Finance    11000

3       Rachel      Johnson   F       New Mexico     1970-12-17       2010-12-01       Sales         9000

4       Emily         Smith        F       Texas        1985-03-07       2006-08-15       HR    7000

5       Ashley      Smith        F       Texas        1975-05-13       2004-07-30       R&D          16000

6       Matthew Johnson   M     California 1984-07-07       2005-07-07       Sales         11000

7       Alexis        Smith        F       Illinois       1972-08-16       2002-08-16       Sales         9000

8       Megan     Wilson      F       California 1979-04-19       1984-04-19       Marketing        11000

9       Victoria    Davis        F       Texas        1983-12-07       2009-12-07       HR    3000

10     Ryan         Johnson   M     Pennsylvania    1976-03-12       2006-03-12       R&D          13000

11     Jacob        Moore      M     Texas        1974-12-16       2004-12-16       Sales         12000

12     Jessica     Davis        F       New York 1980-09-11       2008-09-11       Sales         7000

13     Daniel       Davis        M     Florida      1982-05-14       2010-05-14       Finance    10000

esProc code:

  A  
1 =hdfsfile(“hdfs://192.168.1.210:9000/user/employee.gz”,”UTF-8″).cursor@t()  
2 = A1.select(BIRTHDAY>=argBirthDay && GENDER==argGender)

Use cursor to open the HDFS file object, making the first row the headers and using tab as the default field separator. The character set is UTF-8. The compression type will be determined by the file extension. Here it is the gzip format.

Then use the select function to perform the query, in which argBirthDay and argGender are report parameters.

Concatenating strings by groups

The sales.csv file stores sales orders, based on which we want a report of client list having three fields – year, month and comma-separated client list. Below is the source file:

OrderID   Client        SellerId     Amount    OrderDate

26     TAS   1       2142.4      2009-08-05

33     DSGC        1       613.2        2009-08-14

84     GC    1       88.5 2009-10-16

133  HU   1       1419.8      2010-12-12

32     JFS    3       468.0        2009-08-13

39     NR    3       3016.0      2010-08-21

43     KT     3       2169.0      2009-08-27

71     JFE   3       240.4        2010-10-01

99     RA    3       1731.2      2009-11-05

esProc code:

  A
1 =file(“D:\\sales.cvs”).import@t()
2 =A1.group(year(OrderDate),month(OrderDate);~.(Client).string())

Explanation: Group the imported data by the two parts of OrderDate, get the set of clients in each group and convert them to comma-separated strings. Here’s the result:

esProc_text_file_group_11

Intra-group alignment

The empContact.csv file has six columns, the first four of which contain employee information where duplicate values exist; the fifth column holds the ways of contact and the last column is the specific contact information. Each employee has at least one way for contact. Now we want to merge the records to make a new table where the fifth, sixth and seventh columns are work phone, cell phone and work email for reporting. Below is source file:

21458952,John,technology,support staff,work phone,555-555-5555

21458952,John,technology,support staff,work email,johndoe@whatever.net

21458952,John,technology,support staff,cell phone,233-1132-9762

99946133,Jane,technology,administration,work phone,444-444-4444

99946133,Jane,technology,administration,work email,janepaul@whatever.net

99946133,Jane,technology,administration,cell phone,121-3433-9788

99946133,Jane,technology,administration,home phone,431-443-8434

esProc code:

  A
1 =file(“D:\\empContact.csv”).import(;”,”)
2 =A1.group(#1,#2,#3,#4;~.align([“work phone”,”cell phone”,”work email”],#5):g)
3 =A2.new(#1,#2,#3,#4,g(1).#6,g(2).#6,g(3).#6)

Explanation: Group the imported records by the first four columns and align records in each group according to the order of work phone\cell phone\work email (the duplicate values will be automatically filtered away); then build a new two-dimensional table by sequentially retrieving the records. Here’s the result:

 esProc_text_file_group_12

Grouping by sequence numbers

In the following file, every three lines correspond to one record. For example, the first record is 26\TAS\2009-08-05. We want to present it in the report as a multi-field two-dimensional table. Here’s the source data:

26

TAS

2009-08-05

33

DSGC

2009-08-14

84

GC

2009-10-16

esProc code:

  A
1 =file(“D:\\data.txt”).import@si()
2 =A1.group((#-1)\3)
3 =A2.new(~(1):OrderId,~(2):Client,~(3):OrderDate)

Explanation: Import the file as a sequence, during which @s means that fields won’t be extracted; group the sequence every three lines, during which “#” represents the sequence number of each line and “\” represents integer division; finally create a table sequence based on the groups, during which ~(1) is the first member of the current group.

The result:

esProc_text_file_group_13

Extended information: If the file has a complicated format. For example, there are three fields in the second line of each group:

26

TAS   1       2142.4

2009-08-05

33

DSGC        1       613.2

2009-08-14

84

GC    1       88.5

2009-10-16

In this case, we just need to modify A3 into this : =A2.new(~(1):OrderID, (line=~(2).array(“\t”))(1):Client,line(2):SellerId,line(3):Amount,~(3):OrderDate )

The result:

esProc_text_file_group_14

Intra-group order-related calculation

postTurn.txt records the schedule of rotating shifts for a post. The file is grouped by dates and some persons work consecutive shifts. We want to find out how many times the two neighboring names work successive shifts. The fields would be first, next and times in the report. Below is the source file:

#2015-06-01

Ashley

Matthew

#2015-06-02

Ashley

Matthew

Matthew

#2015-06-03

Ashley

Ashley

Matthew

Ryan

#2015-06-04

Ashley

Ryan

Ryan

Matthew

#2015-06-05

Jessica

#2015-06-06

Ashley

Jessica

Matthew

Ashley

Jessica

Jessica

Ryan

esProc code:

  A
1 =file(“D:\\postTurn.txt.txt”).import@i()
2 =A1.group@i(pos(~,”#”))
3 =A2.conj(~.([~[-1],~]).to(3,))
4 =A3.group(~).new(~(1)(1):first,~(1)(2):next,count(~):times)

Explanation: Group the imported data according to the condition that whether a row contains “#” or not, and each group is a day’s work shift schedule; for each group, put every two neighboring names into a subset and concatenate all the subsets; group each concatenated set every two members, get the first name and the next name from each group and count how many times they appear together. Here’s the result:

esProc_text_file_group_15

Inserting data between groups

In the sales.txt file, the value of SellerId changes every multiple rows. We want to export the OrderId field in its original order and add the string “Begin” every time a group of records with new value begins and the string “End” after a group ends. Below is the source file:

OrderID   Client        SellerId     Amount    OrderDate

26     TAS   1       2142.4      2009-08-05

33     DSGC        1       613.2        2009-08-14

71     JFE   3       240.4        2010-10-01

99     RA    3       1731.2      2009-11-05

84     GC    1       88.5 2009-10-16

133  HU   1       1419.8      2010-12-12

32     JFS    3       468.0        2009-08-13

39     NR    3       3016.0      2010-08-21

43     KT     3       2169.0      2009-08-27

esProc code:

  A
1 =file(“D:\\sales.csv”).import@t()
2 =A1.group@o(SellerId).conj(“Begin”|~.(OrderID)|”End”)

Explanation: Write the code directly according to what it asks. @o enables grouping by comparing the adjacent records. To retrieve a certain field from a group of records, use ~.(OrderID). The operator “|” is used to concatenate sets or members of a set. The conj function handles subsets respectively and then concatenates the results.

Here’s the result:

esProc_text_file_group_16

Extended information:

  1. Stop using @o when data is unordered, but the performance will decrease a little.
  2. To add an aggregate value, the number of records in each group, for example, write A2 as group@o(SellerId).conj((“Begin: “+string(~.count()))|~.(OrderID)|”End”)

Here’s the result:

esProc_text_file_group_17

  1. To add strings to both ends of a group of records, use the following code:
  A B
1 =file(“D:\\sales.csv”).import@t() =create(OrderID,Client,SellerId,Amount,OrderDate)
2 for A1.group@o(SellerId) =B2.insert(0,””,”Begin”)
3   =B2.insert@r(0:A13)
4   =B2.insert(0,””,”end”)
5 return B2  

The result:

esProc_text_file_group_18

Sampling by groups

The sales.txt file stores ordering data, and we want to retrieve one record from each seller’s records to present it in the report. Here’s the rule: Find out records whose OrderDates are fewer than 60 days after their previous ones, and retrieve one randomly from these sampled records; if there is an insufficient number of records satisfying the condition, just get the record that has the smallest OrderDate. Besides, it is specified that the second record will always appear in each group. Below is the source file:

OrderID   Client        SellerId     Amount    OrderDate

144  WZ   6       86.4 2010-12-23

70     DSG 7       288.0        2009-09-30

131  FOL  7       103.2        2009-12-10

65     YZ     8       29600.0   2009-01-06

esProc code:

  A B C
1 =file(“D:\\sales.csv”).import@t()  
2 for A1.group(SellerId) =A2.sort(OrderDate).select(interval(OrderDate[-1],OrderDate)<60)
3   if B2!=[] =B1=B1|B2(rand(B2.len())+1)
4   else =B1=B1|B2(1)

Explanation:

Group the imported records by SellerId and loop through the groups. For each loop, first sort the group by OrderDate and then take the sample as required. If there’re enough records sampled, retrieve one randomly and append it to B1; if the number of records meeting the condition is insufficient, then get the first record of this group. OrderDate[-1] represents the previous record relative to the current one. Below are the results of three sampling:

esProc_text_file_group_19

Handling the big file by groups

The sales.txt file is too big to be held by the memory, thus we need to filter out the records of the clients whose total order amount is greater than 10,000, and present the filtered file in a report. The source data is already ordered by SellerId, as shown below:

OrderID   Client        SellerId     Amount    OrderDate

26     TAS   1       2142.4      2009-08-05

33     DSGC        1       613.2        2009-08-14

84     GC    1       88.5 2009-10-16

133  HU   1       1419.8      2010-12-12

32     JFS    3       468.0        2009-08-13

esProc code:

  A B C
1 =file(“D:\\sales.csv”).cursor@t()  
2 for A1;SellerId if A2.sum(Amount)>=10000 =C1=C1|B2
3 return C1    

Explanation: Loop through the cursor by SellerId and import a group of data into the memory at a time; sum up the amounts for the current group and write the records to cell C1 if the sum satisfies the condition. Here’s the result:

esProc_text_file_group_20

If the source data isn’t ordered, we can first sort it using the sortx function, which returns a cursor, too.

Multilevel grouping

The Stock.txt file keeps the information of goods in and out of the warehouse. A same kind of goods may be found in and out several times a day, or there may be nothing in and out during continuous days. The initial inventory of the goods is 0. We use In to represent goods shipped in, and Out for those shipped out. Now we want to present the daily inventory for all kinds of goods in a report. Below is the source file:

date name        quantity   flag

2014-04-01       Item1       15     In

2014-04-01       Item1       4       In

2014-04-02       Item1       3       In

2014-04-02       Item1       10     Out

2014-04-03       Item1       3       In

2014-04-04       Item1       5       Out

2014-04-07       Item1       4       In

2014-04-10       Item1       2       Out

2014-04-01       Item2       20     In

2014-04-02       Item3       30     In

2014-04-03       Item3       14     Out

esProc code:

  A B
1 =file(“D:\\stock.txt”).cursor@t()  
2 =A1.group(name,date;~.select(flag==”In”).sum(quantity):in,~.select(flag==”Out”).sum(quantity):out)
3 =A2.group(name) =periods((t=A2.id(date)).min(),t.max(),1)
4 for A3 =A4.align(B3,date)
5   >c=0
6   =B4.new(A4.name:name,
B3(#):date,
c:Opening,
in,
(b=c+in):Total,
out,
(c=b-out):Close)
7   =@|B5

Explanation: A2 calculates the daily in-and-out quantity for each kind of goods; obtain the complete list of dates according to the earliest and latest dates and store it in B3; group records by items, loop through each group to align the current group of records with B3 and calculate the daily inventory in order. Here’s the result:

esProc_text_file_group_21

 

Advertisements

About datathinker

a technical consultant on Database performance optimization, Database storage expansion, Off-database computation. personal blog at: datakeywrod, website: raqsoft
This entry was posted in Uncategorized. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s