Data visualization is an integral part of data science. Scala has many characteristics that make it popular for data science use cases, alongside languages like R and Python. Immutable data structures and functional constructs are some of the features that make it attractive to data scientists. Popular big data frameworks like Spark and Flink contribute their fair share to an ever-growing ecosystem of tools and libraries for data analysis and engineering. Scala is particularly well suited to building robust libraries for scalable data analytics.
In this post we introduce Breeze, a library for fast linear algebra on data sets, together with tools for visualization and NLP. Starting with the basic creation of vectors, we will build an application for plotting stock prices. The stock data is obtained from Yahoo Finance, but can also be downloaded here for SAP, YAHOO, BMW, and IBM.
You can follow the steps described here by cloning the sample project from GitHub (the project uses Scala 2.11.7 and SBT 0.13.8) or by creating your own standalone project with SBT. Make sure you add Breeze as a dependency in your build file, build.sbt:
scalaVersion := "2.11.7"

libraryDependencies ++= Seq(
  "org.scalanlp" %% "breeze" % "0.12",
  "org.scalanlp" %% "breeze-natives" % "0.12",
  "org.scalanlp" %% "breeze-viz" % "0.12"
)
For fast exploration, sbt console executed within the project's directory can be used to follow the steps below.
Loading the Stock Price Data
As a preliminary step we are going to import the dependencies we will need for our data exploration:
scala> import breeze.linalg._
import breeze.linalg._

scala> import breeze.numerics._
import breeze.numerics._
Next we will load the data from the CSV files obtained previously. Each file starts with a header line describing its schema. The files have 7 columns: (1) Date, (2) Open, (3) High, (4) Low, (5) Close, (6) Volume, and (7) Adj. Close. Visualizing the close values in a first plot is a good exercise for creating a Vector with Breeze.
Breeze comes with data structures well known and used by data scientists: Vectors and Matrices. Turning our provided CSV files of stock price information into a usable data structure for manipulation and plotting is described in the next steps. We will use a DenseVector to represent the closing prices for each day. A separate vector will be used for the days themselves. Additionally Breeze provides SparseVector optimized for storage and a HashVector optimized for access. Read the overview here about Breeze data structures for more details.
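To get a feeling for the three vector types before loading real data, here is a minimal sketch that can be pasted into sbt console. It assumes Breeze 0.12 is on the classpath as configured above; the sizes, indexes and values are arbitrary illustration values.

```scala
import breeze.linalg.{DenseVector, HashVector, SparseVector}

// Dense: every element is stored, backed by a plain array.
val dense = DenseVector(75.31, 74.72, 74.47)

// Sparse: only explicitly set entries are stored, the rest read as 0.0.
val sparse = SparseVector.zeros[Double](1000)
sparse(3) = 1.5
sparse(500) = 2.5

// Hash-backed sparse vector with the same indexing interface.
val hashed = HashVector.zeros[Double](1000)
hashed(3) = 1.5

println(dense.length)      // number of elements in the dense vector
println(sparse.activeSize) // number of entries actually stored
println(sparse(4))         // an entry that was never set
```

For the stock prices every day has a value, so a DenseVector is the natural fit.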
Reading the closing prices into a vector in Scala looks like the code below. Everything is done in a single expression:
scala> import scala.io._
import scala.io._

scala> val sap_stock_close = DenseVector(
     |   Source.fromFile("sap_stocks.csv")
     |     .getLines.drop(1).map(_.split(",")(4).toDouble).toSeq: _*
     | )
sap_stock_close: breeze.linalg.DenseVector[Double] = DenseVector(75.31, 74.72, 74.47, 76.21, 76.4, 75.69, 76.22, 76.54, 76.39, 75.18, 74.82, 74.43, 74.14, 72.32, 71.73, 72.9, 73.36,
The steps explained in detail:
- The Source class of the Scala standard library provides an iterable representation of source data, here of our CSV file: Source.fromFile("sap_stocks.csv")
- Next we iterate over each line with getLines, skipping the header line with drop(1).
- Iterating over the lines in a functional manner gives us the opportunity to parse each line into the Double value of its fifth column (index 4): map(_.split(",")(4).toDouble)
- DenseVector expects a sequence and not an Iterator, hence the conversion with toSeq.
- To pass the elements of the sequence as individual arguments to DenseVector we have to use the splat operator : _*
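The one-liner never closes the file it opens. A slightly more defensive variant of the steps above could look like the following sketch. The helper names parseCloses and readCloses are made up for this example, and the sample lines contain invented values rather than real quotes.

```scala
import scala.io.Source

// Parse the closing price (fifth column, index 4) from CSV lines,
// skipping the header line. Works on any iterator of lines.
def parseCloses(lines: Iterator[String]): Array[Double] =
  lines.drop(1).map(_.split(",")(4).toDouble).toArray

// Read a file and make sure the underlying handle is closed afterwards.
// toArray above is eager, so all lines are consumed before close() runs.
def readCloses(path: String): Array[Double] = {
  val source = Source.fromFile(path)
  try parseCloses(source.getLines())
  finally source.close()
}

// Usage with in-memory sample lines (made-up values):
val sample = Iterator(
  "Date,Open,High,Low,Close,Volume,Adj Close",
  "2015-01-02,10.0,11.0,9.5,10.5,1000,10.5",
  "2015-01-05,10.5,12.0,10.0,11.5,1200,11.5"
)
val closes = parseCloses(sample)
```

With Breeze on the classpath the resulting array can be wrapped directly via DenseVector(closes), which avoids the splat operator.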
On the x-axis of our stock graph we plot the time of the corresponding closing price. Please note that for simplicity we simply number the days, as Double values, when creating the DenseVector. Again, all in one line 😉
scala> val sap_stock_days = DenseVector.range(0, sap_stock_close.length, 1).map(_.toDouble)
sap_stock_days: breeze.linalg.DenseVector[Double] = DenseVector(0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0,
The steps explained in detail:
- We create a DenseVector from a range running from 0 up to the number of values contained in sap_stock_close, in steps of one: .range(0, sap_stock_close.length, 1)
- Our numbering needs to be of type Double, so that sap_stock_close and sap_stock_days have the same element type. We again use functional iteration to convert each element to Double: .map(_.toDouble)
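As an aside, the same vector of day numbers can be built without the intermediate Int vector. The sketch below compares both approaches; it assumes Breeze 0.12 on the classpath, and n is a small stand-in for sap_stock_close.length.

```scala
import breeze.linalg.DenseVector

val n = 5 // stand-in for sap_stock_close.length

// Approach from the text: integer range first, then convert each element.
val viaRange = DenseVector.range(0, n, 1).map(_.toDouble)

// Alternative: compute each element from its index directly.
val viaTabulate = DenseVector.tabulate(n)(i => i.toDouble)

println(viaRange == viaTabulate)
```

Which of the two reads better is a matter of taste; the rest of the post keeps using the range variant.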
Basic Vector Manipulation in Breeze
The Linear Algebra Cheat Sheet of Breeze is a fairly good resource to get an overview of the manipulations supported for Vectors. One common scenario is slicing and indexing of vectors. Another one is element-wise manipulation, of which we've already seen an example with the map call that converted each day number to Double.
Getting specific values from a vector can be achieved by using an index. In contrast to Scala collections Breeze data structures do support negative indexes, for example to obtain the last value in a vector:
scala> sap_stock_days(4)
res5: Double = 4.0

scala> sap_stock_close(4)
res6: Double = 76.4

scala> sap_stock_close(-4)
res7: Double = 55.0

scala> sap_stock_close(-1)
res8: Double = 53.0

scala> sap_stock_close(1 to 3)
res10: breeze.linalg.DenseVector[Double] = DenseVector(74.72, 74.47, 76.21)

scala> sap_stock_close(1 until 3)
res11: breeze.linalg.DenseVector[Double] = DenseVector(74.72, 74.47)
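Slices can be combined with arithmetic operators, which makes small calculations such as day-to-day price changes very compact. The following sketch assumes Breeze 0.12 on the classpath and uses the first four closing prices shown earlier as a stand-in for the full vector; the value names are chosen for this example only.

```scala
import breeze.linalg.{DenseVector, max, sum}

val close = DenseVector(75.31, 74.72, 74.47, 76.21)

// Slices are vectors too, so they support arithmetic directly.
val today       = close(1 until close.length)
val yesterday   = close(0 until close.length - 1)
val dailyChange = today - yesterday // element-wise difference

val scaled  = close * 2.0 // multiply every element by a scalar
val highest = max(close)  // largest closing price
val total   = sum(close)  // sum of all closing prices

println(dailyChange)
println(highest)
```

Because of floating point rounding the printed differences may carry a few extra digits.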
Plotting Graphs with Scala
Data visualization is an integral part of data science. In this example we can use breeze-viz for drawing out the stock prices. Breeze-viz is the visualization package by Breeze that wraps the very popular Java charting library JFreeChart.
First we need to import the package to make it available:
scala> import breeze.plot._
import breeze.plot._
The first step for plotting with breeze-viz is to create a Figure. This will launch an empty Java Swing application that you might see appear in your task bar depending on your OS:
scala> val fig = Figure()
fig: breeze.plot.Figure = breeze.plot.Figure@3c609f12
A Figure supports multiple plots and we can add one plot using the subplot function:
val plt = fig.subplot(0)
Having a plot added to our figure we can now draw the actual function on top of it. The changes will become visible once the figure gets refreshed. After the refresh you should be able to see your plot:
scala> plt += plot(sap_stock_days, sap_stock_close)
res16: breeze.plot.Plot = breeze.plot.Plot@686128ee

scala> fig.refresh()
On the x-axis we display the numbered trading days, while the y-axis shows the closing price on each day:
Let us do some simple customization of the plot to demonstrate basic features. For example, we will add a label to the plot. In order to see the label of the graph we also have to enable the legend.
scala> plt += plot(sap_stock_days, sap_stock_close, name="SAP Stock")
res1: breeze.plot.Plot = breeze.plot.Plot@7e74f7a1

scala> plt.legend = true
plt.legend: Boolean = true

scala> fig.refresh
Between trading days 4000 and 4500 on the plot we can notice an unusual jump in the stock price, which is likely an error in the data or a very unusual event in the history of SAP.
scala> sap_stock_close(4221 to 4225)
res7: breeze.linalg.DenseVector[Double] = DenseVector(57.375, 227.0, 235.5, 230.0, 230.5)
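Instead of reading the position of the jump off the plot, it can also be located programmatically. The sketch below is one possible approach, assuming Breeze 0.12 on the classpath: it computes the day-to-day differences and asks for the index of the largest one. The short vector is a stand-in for sap_stock_close; its first value is made up, the others are taken from the output above.

```scala
import breeze.linalg.{DenseVector, argmax}

// Stand-in series with a single large jump.
val close = DenseVector(57.0, 57.375, 227.0, 235.5, 230.0)

// Difference between each day and the previous one.
val change = close(1 until close.length) - close(0 until close.length - 1)

// argmax returns the index into `change`; add 1 to get the index
// of the first day at the new price level in `close`.
val jumpAt = argmax(change) + 1

println(jumpAt)
```

Applied to the full sap_stock_close vector, the same two lines would point at the trading day where the price level changes.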
To highlight it further we can work with so-called domain markers and range markers, which we will set around the values of that particular event. We are also going to add axis labels:
scala> import org.jfree.chart.plot.ValueMarker
import org.jfree.chart.plot.ValueMarker

scala> plt.plot.addDomainMarker(new ValueMarker(4219.0))

scala> plt.plot.addRangeMarker(new ValueMarker(227.0))

scala> plt.xlabel = "trading days"
plt.xlabel: String = trading days

scala> plt.ylabel = "price"
plt.ylabel: String = price

scala> fig.refresh
Lastly, we can place a text annotation within the plot to further highlight the drastic event we discovered.
scala> val txt = s"Gold Rush ->"
txt: String = Gold Rush ->

scala> import org.jfree.chart.annotations.XYTextAnnotation
import org.jfree.chart.annotations.XYTextAnnotation

scala> plt.plot.addAnnotation(new XYTextAnnotation(txt, 3890.0, 200.0))
Our final plot:
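To keep a copy of a finished figure, breeze-viz can write it to an image file. The following sketch repeats the plotting steps on a tiny stand-in data set and then saves the result. It assumes Breeze 0.12 with breeze-viz on the classpath and an environment that can open a window; the file name sap_stock.png is an arbitrary choice for this example.

```scala
import breeze.linalg.DenseVector
import breeze.plot._

// Tiny stand-in for sap_stock_days and sap_stock_close.
val days  = DenseVector(0.0, 1.0, 2.0, 3.0)
val close = DenseVector(75.31, 74.72, 74.47, 76.21)

val fig = Figure()
val plt = fig.subplot(0)
plt += plot(days, close, name = "SAP Stock")
plt.legend = true
plt.xlabel = "trading days"
plt.ylabel = "price"
fig.refresh()

// Write the current state of the figure to disk.
fig.saveas("sap_stock.png")
```

The file is written relative to the directory sbt console was started from.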