Title: | Working with Google's 'Public Data Explorer' DSPL Metadata Files |
---|---|
Description: | Provides a collection of functions to set up the 'Google Public Data Explorer' <https://www.google.com/publicdata/> data visualization tool with your own data, automatically building the corresponding DataSet Publishing Language (DSPL) XML metadata file together with the CSV files, all zipped up and ready to be published in 'Public Data Explorer'. |
Authors: | George Vega Yon [aut, cre] |
Maintainer: | George Vega Yon <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.16.1 |
Built: | 2024-11-03 03:19:30 UTC |
Source: | https://github.com/gvegayon/googlepublicdata |
Checks whether a string is a valid Joda-Time format specification supported by the DSPL language.
checkTimeFormat(fmt)
fmt | String representing a time format to be verified. |
Public Data Explorer currently supports daily, monthly and yearly data. Joda-Time, the time format on which DSPL times are based, declares formats using lowercase "d" (days), uppercase "M" (months) and lowercase "y" (years). Some examples:
Format Specification | Data Example |
"yyyy" | 1988 |
"yyyy-MM" | 1988-03 |
"yyyy-MMM" | 1988-Mar |
"dd-MM-yyyy" | 02-03-1988 |
Logical. TRUE if the string passes the test.
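To make the idea of the test concrete, the sketch below checks that a pattern uses only the three symbols described above. It is an illustration under that assumption only: `uses_only_ymd` is a hypothetical helper, not the package's implementation, and the real checkTimeFormat may accept or reject patterns that this sketch does not.

```r
# Hypothetical helper (NOT the package's code): TRUE when every letter in
# the pattern is one of "y", "M" or "d".
uses_only_ymd <- function(fmt) {
  # Extract every alphabetic character found in the pattern.
  symbols <- regmatches(fmt, gregexpr("[A-Za-z]", fmt))[[1]]
  length(symbols) > 0 && all(symbols %in% c("y", "M", "d"))
}

uses_only_ymd("yyyy-MM")   # year and month symbols only
uses_only_ymd("mmmyyyy")   # lowercase "m" is not one of the three symbols
```

Separators such as "-" are ignored because only letters are inspected.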
George G. Vega Yon
Google Public Data Explorer DSPL time definition: https://developers.google.com/public-data/docs/canonical/time?hl=es
Google Public Data Explorer Cookbook for time definitions: https://developers.google.com/public-data/docs/cookbook#time_recipes
Joda Time 2.1 API: http://joda-time.sourceforge.net/api-release/org/joda/time/format/DateTimeFormat.html
See also dspl
checkTimeFormat("yyyy-MM") # TRUE
checkTimeFormat("MMMyyyy") # TRUE
checkTimeFormat("mmmyyyy") # FALSE
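When the time column is prepared in R, it can help to know which base R conversion codes produce values matching a given Joda-Time pattern. The lookup below is an illustrative assumption (the name `joda_to_strptime` is hypothetical and nothing here is exported by the package); it covers only the numeric patterns from the table in the Details section.

```r
# Hypothetical lookup table: Joda-Time pattern -> base R format() code.
joda_to_strptime <- c(
  "yyyy"       = "%Y",
  "yyyy-MM"    = "%Y-%m",
  "dd-MM-yyyy" = "%d-%m-%Y"
)

d <- as.Date("1988-03-02")

# Render the same date under each pattern.
formatted <- vapply(joda_to_strptime, function(f) format(d, f), character(1))
print(formatted)
```

The "yyyy-MMM" pattern is left out on purpose: its closest base R counterpart, "%b", prints month abbreviations according to the current locale.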
This data set is one used in the DSPL Tutorial. Specifically, it contains the basic columns used to define geographical dimensions, in this case, countries.
A data frame containing 5 observations.
DSPL Google Code Page Downloads: https://developers.google.com/public-data/docs/tutorial
This data set is one used in the DSPL Tutorial. Specifically, it contains the population magnitudes at country level from 1960 to 1963.
A data frame containing 13 observations.
DSPL Google Code Page Downloads: https://developers.google.com/public-data/docs/tutorial
dspl parses the csv, tab or xls(x) files found in a given directory and generates a complete DSPL file. If an output string is specified, the function generates the complete ZIP (DSPL file plus csv files) ready to be uploaded to Google Public Data Explorer.
dspl(path, output = NA, replace = F, targetNamespace = "",
  timeFormat = "yyyy", lang = c("es", "en"), name = NA,
  description = NA, url = NA, providerName = NA, providerURL = NA,
  sep = ";", dec = ".", encoding = getOption("encoding"),
  moreinfo = NULL)

new_dspl(path, output = NA, replace = F, targetNamespace = "",
  timeFormat = "yyyy", lang = c("es", "en"), name = NA,
  description = NA, url = NA, providerName = NA, providerURL = NA,
  sep = ";", dec = ".", encoding = getOption("encoding"),
  moreinfo = NULL)
path | String. Path to the folder containing the tables (csv|tab|xls). |
output | String, optional. Path to the output ZIP file. |
replace | Logical. If |
targetNamespace | String. As the DSPL documentation states: “Provides a URI that identifies your dataset. This URI is not required to point to an actual resource, but it's a good idea to have the URI resolve to a document describing your content or dataset”. |
timeFormat | String. The time format of the collection, specified according to the Joda-Time format. See the Details section for more information. |
lang | A list of strings with the languages supported by the dataset. May contain a single language. |
name | List of strings. The name of the dataset, defined according to the |
description | List of strings. Description of the dataset. It also supports multiple descriptions, as the |
url | The corresponding URL for the dataset. |
providerName | List of strings. The data provider name. |
providerURL | List of strings. The data provider website URL. |
sep | The separator character of the tables in the 'path' folder. Currently accepts the following values: “,” or “;” (for .csv files), “\t” (for .tab files) and “xls” or “xlsx” (for Microsoft Excel files). |
dec | String. Decimal point. |
encoding | The character encoding of the input tables. Currently ignored for Microsoft Excel files. |
moreinfo | A special tab file generated by the function |
If no output is defined, the function returns a list of class dspl that includes, among its contents, an XML object (the DSPL file). Otherwise, the result consists of two things: a ZIP file containing everything needed for uploading to publicdata.google.com (a collection of csv files and the written DSPL XML file) and a message (character object).
Internally, the parsing process consists of the following steps:
Loading the data,
Generating an id for each column,
Identifying the data types,
Building concepts,
Identifying dimensional concepts and distinguishing between categorical, geographical and time dimensions, and
Executing internal checks.
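The first three steps can be pictured with base R alone. The sketch below is only an approximation under stated assumptions: the table contents are made up, `make_id` is a hypothetical id rule, and the way dspl actually derives ids and types may differ.

```r
# Write a tiny semicolon-separated table (made-up values) to a temporary folder.
folder <- tempfile("tables")
dir.create(folder)
writeLines(
  c("Country Name;Year;Population",
    "AA;1960;1.5",
    "BB;1960;2.5"),
  file.path(folder, "population.csv")
)

# Step 1: load every csv file found in the folder.
files  <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(files, read.csv, sep = ";", dec = ".",
                 stringsAsFactors = FALSE, check.names = FALSE)
names(tables) <- basename(files)

# Step 2: derive an id per column (hypothetical rule: lower case,
# runs of non-alphanumeric characters replaced by "_").
make_id <- function(x) gsub("[^a-z0-9]+", "_", tolower(x))
ids <- lapply(tables, function(tab) make_id(names(tab)))

# Step 3: record the R type of each column.
types <- lapply(tables, function(tab) vapply(tab, class, character(1)))

print(ids)
print(types)
```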
So that the ZIP file (DSPL file plus CSV data files) loads properly, the function runs a series of internal checks on the data structure:
Slices with the same dimensions: DSPL requires that each slice represents one dimensional cut; that is, there should not be more than one data table with the same dimensions.
Duplicated concepts: When a single concept (statistic) has multiple data types, e.g. integer in one table and float in another, dspl may get confused. During the parsing process, where possible, it collapses duplicated concepts into a single concept and assigns it the common data type (float).
Correct time format definition: checkTimeFormat is used to ensure that the specified time format is compatible with DSPL.
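The first two checks can be sketched on plain R objects. This is an illustration only: the table names, dimension names and the collapsing rule are assumptions made for the example, and the package's internal data structures are not shown in this manual.

```r
# Hypothetical description of three tables: the dimensions each one uses.
slice_dims <- list(
  pop_by_country        = c("country", "year"),
  pop_by_country_gender = c("country", "gender", "year"),
  pop_by_country_copy   = c("year", "country")
)

# Check 1: two tables clash when they use the same set of dimensions,
# regardless of column order.
keys  <- vapply(slice_dims, function(d) paste(sort(d), collapse = "+"),
                character(1))
clash <- names(keys)[duplicated(keys) | duplicated(keys, fromLast = TRUE)]

# Check 2: a concept seen with more than one type is collapsed to "float".
concepts <- data.frame(
  id   = c("population", "population", "year"),
  type = c("integer", "float", "integer"),
  stringsAsFactors = FALSE
)
collapsed <- vapply(
  split(concepts$type, concepts$id),
  function(t) if (length(unique(t)) > 1) "float" else t[1],
  character(1)
)

print(clash)
print(collapsed)
```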
If no output is defined, dspl returns a list of class "dspl".
An object of class "dspl" is a list containing:
dspl | A character string containing the DSPL XML document as defined by the |
concepts.by.table | A data frame object of concepts stored by table. |
dimtabs | A data frame containing dimensional tables. |
slices | A data frame of slices. |
concepts | A data frame of concepts (all of them). |
dimensions | A data frame of dimensional concepts. |
statistics | A matrix of statistics. |
Otherwise, the function builds a ZIP file at the path given in output, containing the CSV and DSPL (XML) files.
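A minimal sketch of the first case (no output), assuming the package is installed and ships the "dspl-tutorial" folder used in its own examples. The call is guarded so the snippet does nothing when the package is missing; what each component looks like in practice is not asserted here.

```r
ans <- NULL  # stays NULL when the package is not installed

if (requireNamespace("googlePublicData", quietly = TRUE)) {
  # Tutorial tables shipped with the package (path taken from the package examples).
  path <- system.file("dspl-tutorial", package = "googlePublicData")

  # No 'output' is given, so a list of class "dspl" is returned.
  ans <- googlePublicData::dspl(path)

  # Inspect some of the components listed above.
  print(class(ans))
  str(ans$slices)
  str(ans$concepts)
}
```

Passing a file path in output would instead request the ZIP bundle described above.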
George G. Vega Yon
Google Public Data Explorer Tutorial: https://developers.google.com/public-data/docs/tutorial
demo(dspl)
Methods to print and summarize dspl class objects.
## S3 method for class 'dspl'
print(x, path = NULL, replace = FALSE, quiet = FALSE, ...)

## S3 method for class 'dspl'
summary(object, ...)
x | An object of class |
path | String. Output path where to save the XML DSPL file. |
replace | Logical. If |
quiet | Whether or not to print information on the screen. |
... | Arguments passed on to |
object | An object of class |
print.dspl | None (invisible |
summary.dspl | Returns the class attributes and a list containing as defined by |
George G. Vega Yon
See also dspl
## Not run: 
# Parsing some xlsx files at "my stats folder"
# (sep = "xls" tells dspl to read Microsoft Excel files)
mydspl <- dspl(path="my stats folder/", sep="xls")

# Checking the summary of the data bundle
summary(mydspl)

# Writing the DSPL XML definition into a file
outputfile <- tempfile()
print(mydspl, path=outputfile)

## End(Not run)
This data set is one used in the DSPL Tutorial. Specifically, it contains the population magnitudes at country and gender level from 1960 to 1961.
A data frame containing 13 observations.
DSPL Google Code Page Downloads: https://developers.google.com/public-data/docs/tutorial
This data set is one used in the DSPL Tutorial. Specifically, it contains the basic columns used to define a categorical dimension such as gender.
A data frame containing 2 observations.
DSPL Google Code Page Downloads: https://developers.google.com/public-data/docs/tutorial
genMoreInfo parses the csv, tab or xls(x) files found in a given directory and generates a data frame used to complete a DSPL bundle with a fuller definition of its concepts, including description, URL, etc.
genMoreInfo(path, encoding = getOption("encoding"), sep = ";",
  output = NA, action = "merge", dec = ".")
path | String. Path to the folder where the tables are saved. |
encoding | The encoding of the files to be parsed. |
sep | The separator character of the tables in the 'path' folder. Currently accepts the following values: “,” or “;” (for .csv files), “\t” (for .tab files) and “xls” or “xlsx” (for Microsoft Excel files). |
output | If defined, the path where the data frame is saved as a tab file. Otherwise the function returns a data frame object. |
action | Tells the function what to do if a copy of the file already exists. Available actions are “merge” and “replace”. |
dec | String. Decimal point. |
If no output is defined (NA), the function returns a data frame with one concept per row, which the user can fill in with more descriptive information about the concepts. Otherwise, it writes that data frame to a tab file. The user may reuse an existing file by passing “merge” in the action argument.
If no output is defined, genMoreInfo returns a data frame with the following columns.
id | XML id of the concept (autogenerated). |
label | The label of the concept (autogenerated). |
description | A brief description of the concept. |
topic | The topic of the concept. |
url | A URL for the concept where, for example, more information can be found. |
totalName | A total name as specified by the DSPL language (applies to dimensional concepts). |
pluralName | A plural name as specified by the DSPL language (applies to dimensional concepts). |
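The editing workflow can be tried without any input tables by building a stand-in data frame with the same columns. Everything below is an assumption made for illustration: the ids, labels and descriptions are invented, and the exact layout of the tab file that genMoreInfo writes is not documented here, so the `write.table` call only mimics a tab-separated export.

```r
# Stand-in for the data frame returned by genMoreInfo(); values are made up.
info <- data.frame(
  id          = c("population", "country"),
  label       = c("Population", "Country"),
  description = NA_character_,
  topic       = NA_character_,
  url         = NA_character_,
  totalName   = NA_character_,
  pluralName  = NA_character_,
  stringsAsFactors = FALSE
)

# Fill in descriptive fields for selected concepts.
info$description[info$id == "population"] <- "Number of inhabitants"
info$topic[info$id == "population"]       <- "Demographics"
info$pluralName[info$id == "country"]     <- "Countries"

# Save as a tab-separated file and read it back.
tabfile <- tempfile(fileext = ".tab")
write.table(info, tabfile, sep = "\t", row.names = FALSE,
            quote = FALSE, na = "")
reloaded <- read.delim(tabfile, stringsAsFactors = FALSE, na.strings = "")

print(reloaded[, c("id", "topic", "pluralName")])
```

Selecting rows by id rather than by position keeps the edits valid if the order of the concepts changes.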
George G. Vega Yon
Google Public Data Explorer: http://publicdata.google.com
# Getting the path where all the datasets are
path <- system.file("dspl-tutorial", package="googlePublicData")
info <- genMoreInfo(path) # This is a dataframe

# Setting the 5th concept as topic "Demographics"
info[5, "topic"] <- "Demographics"

# Generating the dspl file
ans <- dspl(path, moreinfo = info)
ans

## Not run: 
# Parsing some xlsx files at "my stats folder" to gen a "moreinfo" dataframe
INFO <- genMoreInfo(path="my stats folder/", sep="xls")

# Rows 1 to 10 are about "Poverty" and rows 11 to 20 about "Education"
# So we fill the "topic" column with it.
INFO$topic[1:10] <- "Poverty"
INFO$topic[11:20] <- "Education"

# Finally, we build the DSPL ZIP including more info
dspl(path="my stats folder/", sep="xls", moreinfo=INFO)

## End(Not run)
The googlePublicData package provides a collection of functions to set up the Google Public Data Explorer data visualization tool with your own data, automatically building the corresponding DSPL (XML) metadata file together with the CSV files, all zipped up and ready to be published at Public Data Explorer.
It also includes several data structure checks to avoid surprises when uploading your ZIP file to the Public Data Explorer page.
Please visit the project home for more information and examples: http://github.com/gvegayon/googlePublicData.
George G. Vega Yon
googlePublicData project site: http://github.com/gvegayon/googlePublicData
Public Data Explorer site: http://publicdata.google.com/
Public Data Explorer Developers site: https://developers.google.com/public-data/
googleVis package: https://github.com/mages/googleVis#googlevis
## Not run: 
demo(dspl)
## End(Not run)
This data set is one used in the DSPL Tutorial. Specifically, it contains the population magnitudes and unemployment rate at state level from 1960 to 1963.
A data frame containing 9 observations.
DSPL Google Code Page Downloads: https://developers.google.com/public-data/docs/tutorial
This data set is one used in the DSPL Tutorial. Specifically, it contains the basic columns used to define geographical dimensions, in this case, US States.
A data frame containing 8 observations.
DSPL Google Code Page Downloads: https://developers.google.com/public-data/docs/tutorial