Like many data scientists, we often face the question of how to extract all of the information contained in a data set without exploding feature dimensionality, while setting aside a reasonable timeframe for feature engineering. Let's emphasize the word "information".

Working with time series, we most commonly think of seasonality, be it daily, weekly, or otherwise. For many data sets related to human behavior, we expect people to keep their habits at regular intervals. Accordingly, one would produce corresponding features (e.g. lags of the time series, as sketched below) and probably employ smoothing to rule out random fluctuations (noise). Although methods exist to identify the most informative lag terms, and models exist that handle autoregressive time-series forecasts out of the box, they might not always extract the maximal amount of information. Certain pieces of information can get lost, for example local context (e.g. particular daily patterns) and/or affine transformations of the structure of known patterns (i.e. rotation or inversion in time, scaling by a constant, translation of particular patterns along the timeline). Several model architectures are capable of identifying latent features in larger input spaces, given that those possess some kind of topological structure (e.g. a sequence, a grid, a 3-D space). In such models, feature extraction and a learning task like regression or classification are carried out one after the other as parts of a single computation graph. The two main families of such algorithms are recurrent and convolutional neural networks. Abstracting from details and the various combinations of both, the first type iterates through the data, mixing in the previous iteration's outputs, while the second type slides a kernel/filter/mask over the whole input, performing a convolution operation.
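
As a minimal illustration of such hand-crafted features, here is a hedged sketch using pandas on a toy hourly series (the data, the lag choices and the window size are assumptions for demonstration only):

import numpy as np
import pandas as pd

# toy hourly series standing in for a real consumption signal (assumption)
idx = pd.date_range("2021-01-01", periods=24 * 28, freq="H")
df = pd.DataFrame({"consumption": np.random.rand(len(idx))}, index=idx)

# classic seasonal lag features: previous day and previous week
df["lag_24h"] = df["consumption"].shift(24)
df["lag_168h"] = df["consumption"].shift(168)

# rolling mean to smooth out random fluctuations (noise)
df["smoothed"] = df["consumption"].rolling(window=24).mean()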

For developing a forecast model based on these algorithms, time-series non-stationarity (i.e. when the distribution of the data varies over time) and structural breaks in the time series are yet other hurdles to overcome. Those issues can be approached with additional explanatory meta-data variables, if available, which can be introduced into the computation in combination with our time-series inputs. For instance, external factors like weather and solar radiation may have a great influence on energy consumption and are naturally represented as metric time series. Other important behavioral influences can only be explained through categorical variables like weekdays, or binary variables representing holidays, bridging days and other special occasions. Such features might be able to explain non-seasonal variation inside the time series, but can also introduce a many-to-one relation, where one feature value is related to a certain subset of the time series. Uniting a whole data sequence with a sensible set of additional variables should provide the best performance, yet it is tricky to decide how to "unite" them in an optimal way. Joining inputs horizontally brings redundancy through repetition (i.e. each external factor has to be repeated along the whole length of the sequence, as the sketch below shows). Vertically joining or extending the sequence would simply make the data set inconsistent.
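
To make the redundancy of horizontal joining concrete, here is a small NumPy sketch (all shapes are illustrative assumptions):

import numpy as np

# one week of hourly data with 3 channels (assumed shape)
seq = np.random.rand(168, 3)
static = np.array([1.0, 0.0, 0.0])  # e.g. a one-hot weekday feature

# "horizontal" join: the static vector must be repeated at every time step
joined = np.concatenate([seq, np.repeat(static[None, :], len(seq), axis=0)], axis=1)
print(joined.shape)  # (168, 6): half of the channels are pure repetition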

Another approach is to use one of the previously named neural network architectures to project the data into a latent space and then concatenate that projection with the extra variables for further computations. To do so, a projector network is built using the initial time-series data. The most common way is to train a model with an autoencoder architecture and keep only the "encoder" part as the projection model. The projected representations can then be used alongside the original data as independent variables. Though completely valid and workable, this method has an unwanted property: one must discard half of the weights of the full autoencoder after spending the computation power necessary to train them.
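
A minimal sketch of this encoder extraction, assuming a plain fully-connected autoencoder (the input length and latent size are arbitrary placeholders):

from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model

seq_len = 168   # assumed input length
latent_dim = 8  # assumed size of the latent representation

inp = Input(shape=(seq_len,))
encoded = Dense(latent_dim, activation="relu")(inp)
decoded = Dense(seq_len, activation="linear")(encoded)

autoencoder = Model(inputs=inp, outputs=decoded)
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(x, x, ...)  # trained to reconstruct its own input

# keep only the encoder; the decoder weights are trained and then thrown away
encoder = Model(inputs=inp, outputs=encoded)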

To rectify this situation, one could jointly train a projector and a target regressor/classifier neural network. As in the previous suggestion, the outputs of one or more projector networks are concatenated with the other inputs and fed forward to a target network. The main advantage is that you save some training time and adapt the weights in accordance with the end task. However, projector networks might need more time to converge because of this sophisticated architecture. The overall complexity of a network with a large number of layers may yield vanishing or exploding gradients, and there is little control over how exactly a projector network operates.

Still, such architectures allow stacking as many projectors, for as many data sources, as the task requires. This brings us to a computation graph accepting multiple heterogeneous inputs, which will further be referred to as multi-input networks. But… first things first. What does a "multi-input network" actually mean? Thinking of this concept, one may build up an impression that can be roughly visualized as follows:

Well, frankly speaking, reality is not far from what is sketched above, with one remark: the grape-like clusters of circles and lines may represent any kind of neural network, with the limitation that the output $latex Z$ of each network is supposed to be of shape $latex Z \in \mathbb{R}^{N_{batch} \times N_{dim}}$, i.e. the very last layer of the network is either the output of an MLP or the flattened output of a different architecture. And instead of "result" one would typically use a concatenation (e.g. the tf.keras.layers.Concatenate layer or its functional interface/alternative, depending on the computation framework you are working with). The TensorFlow (version 2) framework offers a convenient way to define the whole model for further backpropagation and gradient computation: using tf.keras.layers.Concatenate and tf.keras.Model, it is possible to define the whole graph as a single function.

In the example below, a combined network is trained to forecast electricity consumption based on three different input sources. It receives a sequence of weather data (solar radiation, temperature, etc.) as input1, a set of one-dimensional features characterizing the target (e.g. day of week, hour, month; embedded or dummy-encoded) as input2 and, finally, a sub-sequence of the electricity consumption time series, a future element of which the network aims to predict, as input3. Note that the order of the individual networks in which each part of the data is processed is completely irrelevant and can be arbitrary, whereas the choice of architecture is solely driven by the nature of each corresponding data set. The target time series is input3 shifted 24 hours ahead.
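
One hedged way to frame the series as supervised samples with a 24-hour-ahead target (the helper below, its window length and the horizon are illustrative assumptions):

import numpy as np

WINDOW = 168   # one week of history per sample (assumption)
HORIZON = 24   # predict 24 hours ahead

def make_supervised(series, window=WINDOW, horizon=HORIZON):
    # pair each window of history with the value `horizon` steps after it
    x, y = [], []
    for i in range(len(series) - window - horizon + 1):
        x.append(series[i:i + window])
        y.append(series[i + window + horizon - 1])
    return np.array(x)[..., None], np.array(y)  # channel axis for Conv1D/LSTM

# x_seq_train, y_train = make_supervised(consumption_series)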

Consider a snippet implementing an LSTM-based recurrent neural network that processes input1 and produces a vector of latent features of length output_dim_rnn. Given that input1 is a multidimensional time series of shape $latex t \times n$, where $latex t$ is the number of time steps and $latex n$ the number of variables, this network is able, in an ideal scenario, to extract stationary features of shape $latex 1 \times h$ and reduce dimensionality, so that $latex t \cdot n \gg h$.


from tensorflow.keras.layers import Dense, Input, LSTM
from tensorflow.keras.models import Sequential

# rnn model: input_shape1, dim, activation and output_dim_rnn are
# hyperparameters assumed to be defined elsewhere
model1 = Sequential()
model1.add(Input(shape=input_shape1))
return_seq = True
for i, d in enumerate(dim):
    if i == (len(dim) - 1):
        # the last LSTM layer returns a single vector, not the full sequence
        return_seq = False
    model1.add(LSTM(d,
                    activation=activation,
                    recurrent_activation="sigmoid",
                    recurrent_dropout=0,
                    unroll=False,
                    use_bias=True,
                    return_sequences=return_seq))
model1.add(Dense(output_dim_rnn, activation='relu'))
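
At this point, model1.output is a 2-D tensor of shape $latex N_{batch} \times N_{dim}$ (here $latex N_{dim}$ equals output_dim_rnn), exactly the form required for the concatenation discussed above.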

The following code implements a simple fully-connected network that maps input2 to another set of latent features, whose dimensionality is given by the last entry of dim.


# mlp model: input_shape2 and dim are assumed to be defined for this branch
model2 = Sequential()
model2.add(Input(shape=input_shape2))
for d in dim:
    model2.add(Dense(d, activation=activation))

The third block is a 1-D convolutional neural network that takes input3 (also sequential) and once again produces a set of latent features, of dimensionality output_dim_cnn. As mentioned before, the choice of architecture here is driven by the data.


from tensorflow.keras.layers import Activation, BatchNormalization, Conv1D, Flatten, MaxPooling1D

# cnn model: filters, KERNEL_SIZE, strides, pool_size and output_dim_cnn are
# assumed hyperparameters; chanDim is -1 for channels-last 1-D inputs
model3 = Sequential()
model3.add(Input(shape=input_shape3))
for (i, f) in enumerate(filters):
    model3.add(Conv1D(f,
                      KERNEL_SIZE[i],
                      padding="causal",  # no leakage from future time steps
                      strides=strides,
                      use_bias=False))
    model3.add(Activation("relu"))
    model3.add(BatchNormalization(axis=chanDim))
    model3.add(MaxPooling1D(pool_size=pool_size))
model3.add(Flatten())
model3.add(Dense(output_dim_cnn, activation='relu'))

Finally, all the latent features are concatenated and passed to another fully-connected output layer. It may be beneficial to use more than one layer on top of the joint latent features; this is a matter of validation and hyperparameter search. Using the tf.keras.Model API, it is easy to access the inputs and outputs of already defined models, as well as to define a new model based only on its inputs and outputs.


from tensorflow.keras.layers import concatenate
from tensorflow.keras.models import Model

# join the three latent representations and map them to the target
combinedInput = concatenate([model1.output, model2.output, model3.output])
y = Dense(final_output_dim, activation=target_activation)(combinedInput)

# build the final model over all three inputs
model = Model(inputs=[model1.input, model2.input, model3.input], outputs=y)

After the model is defined, it is straightforward to train it, with the only difference that instead of a single input, an ordered list of inputs is provided.

from tensorflow.keras.optimizers import Adam

# LR and WD (learning rate and its decay) are assumed hyperparameters
opt = Adam(learning_rate=LR, decay=WD)
model.compile(loss="mean_absolute_percentage_error", optimizer=opt)
hist = model.fit(
    [x_w_train, x_ex_train, x_seq_train], y_train,
    epochs=25, batch_size=128)

When designing a multi-input neural network, one is not limited in the choice of topology. Depending on the application, it can also be useful to concatenate features already processed by one network directly with new inputs, as in the sketch below.
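
A hedged example of such late fusion, reusing model3 from above and introducing a hypothetical extra input (all dimensions are placeholders):

# a new raw input joining the graph after model3 has done its processing
extra_input = Input(shape=(4,))
latent = Dense(16, activation="relu")(model3.output)

# mix the intermediate representation directly with the fresh input
merged = concatenate([latent, extra_input])
out = Dense(1, activation="linear")(merged)

late_fusion_model = Model(inputs=[model3.input, extra_input], outputs=out)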

Here is an example of what the final architecture can look like (note: this network consists of 2 CNNs and an MLP).

Weight initialization can be a game changer in such sophisticated architectures, so it is crucial to keep in mind how you handle it. Checking the distribution of the selected data set and sampling the initial weights of the neural network from an appropriately matched random distribution is a good way to go. Pretraining the individual networks will also speed up convergence of the joint network.
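
As a small sketch of both ideas (the initializer choice and seed are illustrative, and pretrained_branch is a hypothetical pretrained copy of a sub-network):

import tensorflow as tf

tf.random.set_seed(42)  # reproducible weight draws

# He initialization is a common heuristic for relu activations
layer = Dense(64, activation="relu", kernel_initializer="he_normal")

# pretraining: fit each branch on an auxiliary task first, then copy the
# learned weights into the joint network before fine-tuning end to end
# model1.set_weights(pretrained_branch.get_weights())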

It was verified on a practical case study with 3 years of data that this type of network architecture offers a mechanism to leverage heterogeneous data structures without extensive feature engineering. As already described, the experimental network used 3 inputs: the primary electricity consumption time series, weather curves and various exogenous features, analyzed by 2 CNNs and an MLP respectively. The resulting network was able to outperform all combinations of 2-input and 1-input networks (no surprise, though: the more data, the better), as well as a big MLP with plain flattened inputs (unrolled sequences result in considerably more trainable parameters, despite the same number of layers). Mean Absolute Percentage Error was used as the error metric. As a baseline and state of the art for this data set, a gradient boosting model with weeks spent on feature engineering and feature selection was used.
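
For reference, the metric is defined as $latex \mathrm{MAPE} = \frac{100\%}{N} \sum_{i=1}^{N} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$, which matches the mean_absolute_percentage_error loss the network was trained with.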

The plot below depicts one week of predictions on the test set using the best-performing combined network.

In this post, we revisited the problem of tackling multiple inputs of different structure and form. While not always offering the highest prediction accuracy, multi-input or joint networks offer a lot of flexibility and adaptability at a low development cost. It doesn't matter whether one is analyzing time series or pictures with accompanying meta-data: the mechanism is the same. The post underlines how uncomplicated it has become to build branched network architectures using TensorFlow v2 (and, previously, Keras). As a proof of concept, a 3-input combined network was built and benchmarked against simpler versions. The results showed improvements from adding each sub-network for its corresponding input type. Once again, it may not be the best fit for every problem, but it offers acceptable results in much less development time, thanks to zero feature engineering and only basic data preparation efforts.