It is a running joke that data scientist is the most difficult profession to explain to your mother. Likewise, companies doing data science, like ours, have a hard time describing their strengths, their core skills, what they can do (and, very importantly, what they do not do) to their potential customers. Many, therefore, resort to verbal smoke screens, filled with jargon but with very little informative content.

We do not.

This document describes what we at Circiter are good at, what we have done and feel proud of, and what we have done and are not quite as happy with, in the expectation that you will find our services useful. And, also, so that you do not waste your time if you don't.

What Circiter is

Circiter is a niche consulting firm based in Spain providing data science services. We are a group of three consultants surrounded by a constellation of third-party freelancers with whom we have collaborated on previous projects.

We specialise in turnkey data science projects, typically involving some degree of analytical complexity. In fact, we have an enviable track record of succeeding in projects where others had previously failed to deliver.

Our projects are typically onion shaped: an analytical core embedded in a technological integration layer ready to be deployed on our customer's systems. Both parts are discussed below.

Our approach to technology

As part of our past projects we have designed and deployed databases, created ETL processes, built dashboards, connected different systems, etc., both on premises and in the cloud. We abhor overengineering and prefer simple, tested, open tools to get the job done: we have seen too many data-related projects fail because of a misguided choice of the latest fancy tool just for the sake of it.

We have worked in truly big data environments, but most of our projects are not of that kind. Moving and analysing a few dozen million records does not merit specialised tools intended for terabytes of information: traditional database engines (such as MySQL or PostgreSQL), together with adequately crafted Python/Pandas processes, have covered most of our needs. That said, we are fluent in standard and widely used technologies such as Spark, Snowflake, Power BI, etc., with which we have interacted in the past. Very often, our deliverables are packaged as Docker containers for ease of deployment.

We need only add that the only tech-related hurdles we have faced in the past arose either when we had to use or interact with legacy technology and solutions, or when we had to work in Microsoft shops (i.e., companies with a high dependency on Microsoft's tech stack).

Our approach to data science

As for the data science part of our business, we should start right away with the things we do not do: we do not do image, video, or text analysis. There are people who do just that for a living, and we cannot and will not pretend that we can do better than them. Having said that, it is also true that some of our projects have involved small, easy, particular applications of some of these techniques (e.g., NLP) and we managed to implement them satisfactorily.

We have done, although it is not our cup of tea, standard analyses such as customer clustering/segmentation or churn prediction. Although there is a large market for such applications of data science, it is not where we can truly offer a differentiated service.

We can proudly state that we have managed to implement the state of the art in analytics in many of our projects. We know our business and we are good at choosing the right technique and the right approach for the problem at hand. This sets us apart from competitors who often default to two or three standard tools (e.g., XGBoost or deep neural networks) even when the business problem calls for a different approach.

As a result, we have used an ample variety of tools, ranging from classical statistics and traditional operations research (e.g., to optimise routes) to the latest developments in Bayesian analysis and machine learning.

Recently (as of the writing of these lines) a couple of different customers called to tell us that the models we had deployed worked as expected. It was very kind of them, but also quite shocking: you do not call your mechanic to inform him that the car battery he just replaced works as expected; you do not call your supermarket to congratulate them because the cheese you purchased was actually cheese. But the fact that our customers did underlines a sad feature of our sector: there is a lot of overselling and a lot of underdelivering.

We, however, are the first to add clauses to our proposals to cut a project short (and cheap) if, after a first, quick analysis, we find that it will be impossible to achieve the expected results. We also play (perhaps too much) devil's advocate against ourselves, and often against our customers' expectations, so as to guarantee the quality of our deliverables and that they will perform in production as well as in our testing environments. It is often just too tempting to report overoptimistic results on test data, but we refrain from doing so.

Some of our past projects

Here we describe a selection of our past projects. Except where we have our customers' explicit consent, we have concealed their identities.

Email campaign optimisation

The customer, a main actor in the digital advertising industry in Spain, wanted to optimise the outcome of their email campaigns. The project was divided into two phases:

  • Predicting the response of a given customer to a given campaign in terms of openings and conversions.
  • Using the output of these models to optimally allocate emails to customers on a weekly basis, maximising profit while complying with a number of business constraints (e.g., limiting the number of contacts per customer).

In summary, the goal was to minimise the number of emails sent while guaranteeing a certain level of profit (or, dually, to maximise profit while keeping the number of email contacts at reasonable levels).

Data science

We implemented fairly standard machine learning models to estimate the propensities. Accuracy was a concern, but so was the need for the models to be retrained automatically from time to time without human intervention. Another difficulty was class imbalance: most emails are not opened, much less acted upon.
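As a purely illustrative sketch of how such an imbalanced propensity model can be set up (not the customer's actual code; the data and feature values here are synthetic), one could combine class weights with probability calibration in Scikit-learn:

    # Toy propensity model on an imbalanced "opened / not opened" target.
    # Synthetic data stands in for the real campaign history.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.calibration import CalibratedClassifierCV

    # Roughly 5% positives (opens), as in a typical email campaign.
    X, y = make_classification(n_samples=20000, n_features=10,
                               weights=[0.95, 0.05], random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, test_size=0.2, random_state=42)

    # Class weights keep the model from trivially predicting "not opened".
    base = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                                  random_state=42)

    # Calibration lets the scores be read as propensities by the allocation step.
    model = CalibratedClassifierCV(base, method="isotonic", cv=5)
    model.fit(X_train, y_train)

    propensities = model.predict_proba(X_test)[:, 1]  # P(open) per customer/campaign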

The allocation problem was addressed using tools and techniques that are seldom seen (and often overlooked) in data science projects: linear programming solvers. A number of adaptations were required to obtain reasonable execution times on extremely large instances (with millions of variables and constraints). In fact, we managed to reduce the computing time needed to find optimal solutions from hours (with the original, standard approach) to minutes (almost real time) by using a number of ad hoc heuristics.
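The shape of such an allocation problem can be sketched as follows. This is a deliberately tiny, hypothetical example (using the open source PuLP modeller; the real formulation, constraints, and heuristics were far larger and more involved):

    # Tiny illustrative allocation: which campaign(s) to send to which customer,
    # maximising expected profit subject to a contact cap per customer.
    from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary

    customers = ["c1", "c2", "c3"]
    campaigns = ["offer_a", "offer_b"]
    propensity = {("c1", "offer_a"): 0.10, ("c1", "offer_b"): 0.02,
                  ("c2", "offer_a"): 0.05, ("c2", "offer_b"): 0.07,
                  ("c3", "offer_a"): 0.01, ("c3", "offer_b"): 0.09}
    value = {"offer_a": 20.0, "offer_b": 15.0}   # expected value of a conversion
    max_contacts = 1                             # at most one email per customer/week

    prob = LpProblem("email_allocation", LpMaximize)
    x = {(i, j): LpVariable(f"x_{i}_{j}", cat=LpBinary)
         for i in customers for j in campaigns}

    # Objective: expected profit of the weekly plan.
    prob += lpSum(propensity[i, j] * value[j] * x[i, j]
                  for i in customers for j in campaigns)

    # Business constraint: contact pressure per customer.
    for i in customers:
        prob += lpSum(x[i, j] for j in campaigns) <= max_contacts

    prob.solve()
    plan = [(i, j) for (i, j), var in x.items() if var.value() == 1]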

Technology

The customer already had an effective data storage model, so we just created some ETL pipelines in Python (Pandas), built the predictive models in Python (Scikit-learn), created a pipeline for periodic model retraining, and used a state of the art linear programming solver to address the optimal allocation problem.

Churn prediction for a security firm (home/small business alarms)

This is a recent project in which we implemented a classic churn model, but with a number of advanced and ad hoc modifications. For instance, we did not just provide a churn probability for each customer: we also categorised the most likely reason behind a high probability (e.g., system failures or price increases), so as to facilitate case management by the customer success team. These reasons were also ranked in terms of actionability.

It was also agreed that the model would attempt to predict churn far into the future. It is just too easy to detect churn that has already happened but has perhaps not been recorded yet (e.g., the contract is still running and will run until expiry, but the customer has already decided not to renew it). With this restriction the model would not perform quite as well on paper, but it would be much more useful in terms of customer retention.
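To illustrate that labelling choice (nothing more than a sketch; the actual project was implemented in R and the column names here are invented), the target can be built as "will churn within a given horizon after the prediction date", with features computed strictly from data available before that date:

    # Hypothetical horizon-based churn label in pandas.
    import pandas as pd

    snapshot = pd.Timestamp("2023-01-01")    # prediction date
    horizon = pd.DateOffset(months=6)        # how far into the future we predict

    df = pd.DataFrame({
        "customer_id": [1, 2, 3],
        "cancel_date": [pd.Timestamp("2023-03-10"), pd.NaT,
                        pd.Timestamp("2024-02-01")],
    })

    # Label = 1 if the cancellation falls within the six months after the snapshot.
    df["churn_6m"] = df["cancel_date"].between(snapshot, snapshot + horizon).astype(int)

    # Features must be computed only from data observed before `snapshot`
    # (usage, incidents, price changes, ...), which is what makes the model
    # predictive rather than merely descriptive.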

Data science

Churn models are quite straightforward in terms of data requirements and tool availability. What made this project different from most is that we used natural language analysis tools to analyse transcripts of conversations with customers in search of relevant clues concerning churn.

Technology

The project was fully implemented in R, including the data extraction, the modelling, and the refitting pipeline.

Mortality monitoring

The customer, a Spanish government agency, wanted to completely refurbish their legacy mortality monitoring tool: the platform that tracks mortality in Spain in real time and monitors excess mortality due to heat waves, extreme cold spells, flu or, later, COVID.

Data science

The project involved the creation and adaptation of a number of time series models so as to:

  • Predict today's actual mortality based on incomplete data (records arrive with some delay).
  • Predict today's expected mortality based on historical information.
  • Measure the significance of deviations of actual from expected mortality so as to raise warnings.

There exist state of the art models for mortality monitoring, and we adapted some of them to the specifics of Spain (where we track not just overall mortality but also much more granular levels: e.g., females in a particular age range in a given province).
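As a highly simplified illustration of the "expected mortality" baseline and the excess-mortality alarm (the production models are richer, are fitted per stratum, and also correct for reporting delay; the data below is synthetic), one could fit a seasonal Poisson baseline and flag days above an upper threshold:

    # Toy seasonal baseline for expected mortality plus a simple alert rule.
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    days = pd.date_range("2015-01-01", "2019-12-31", freq="D")
    t = np.arange(len(days))
    seasonal = 100 + 15 * np.cos(2 * np.pi * t / 365.25)   # winter peak
    deaths = rng.poisson(seasonal)                         # synthetic daily deaths

    # Poisson GLM with yearly harmonics as the "expected mortality" model.
    X = sm.add_constant(np.column_stack([np.cos(2 * np.pi * t / 365.25),
                                         np.sin(2 * np.pi * t / 365.25)]))
    baseline = sm.GLM(deaths, X, family=sm.families.Poisson()).fit()

    mu = baseline.predict(X)          # expected deaths per day
    upper = mu + 2 * np.sqrt(mu)      # approximate one-sided alert threshold

    excess_days = days[deaths > upper]   # days with significant excess mortality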

Technology

We created a fresh MySQL database to store mortality related data and Python scripts to automatically populate it from different official sources. Models were developed in R and a number of dashboards were created using Python’s Dash and R’s Flexdashboards.

The ETL and models run daily in a completely automated fashion (as compared to the previous system, which required lots of human intervention).

Operating temperature nowcasting for electrical transformers

Electrical transformers at wind farms are subject to much more heat-related material fatigue than standard transformers because of their load variability: wind variability means many more cycles of heating/cooling (and, therefore, of expansion/contraction). The customer, an international company operating wind farms in several countries, asked us to nowcast transformer operating temperature from a number of climatic and operational variables, so as to detect deviations and be able to perform preemptive maintenance actions.

Data science

The core models for the nowcasting system were relatively standard: predict the current operating temperature given the current state of the system and its relevant recent history (the last 2-3 hours). Most of the effort went into selecting a relatively small number of relevant variables that define (or summarise) the state of the system, and into determining how much of the operational history affects the current state.

Once these variables had been selected, a classic, simple regression model was found to work well within the customer's operational requirements. Moreover, careful analysis of model failures (i.e., periods where the model did not seem to work correctly) revealed that the climatic data was incomplete and missed events (e.g., episodes of rain) that affected the transformer temperature.
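The gist of such a nowcast can be sketched as follows (synthetic data and invented variable names and lags, purely for illustration):

    # Nowcast the operating temperature from current and lagged covariates.
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(1)
    n = 5000                                   # synthetic 10-minute readings
    df = pd.DataFrame({
        "load_kw": rng.uniform(0, 2000, n),
        "ambient_temp": 15 + 10 * np.sin(np.arange(n) / 144 * 2 * np.pi),
    })
    df["oil_temp"] = (30 + 0.02 * df["load_kw"] + 0.8 * df["ambient_temp"]
                      + rng.normal(0, 1, n))

    # Lagged load summarising the last ~1-3 hours of operation.
    for lag in (6, 12, 18):                    # 1h, 2h, 3h at 10-minute resolution
        df[f"load_kw_lag{lag}"] = df["load_kw"].shift(lag)
    df = df.dropna()

    features = [c for c in df.columns if c != "oil_temp"]
    model = LinearRegression().fit(df[features], df["oil_temp"])

    nowcast = model.predict(df[features])
    residual = df["oil_temp"] - nowcast        # large residuals suggest an anomaly
                                               # and trigger a maintenance check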


Bid optimisation in online marketplaces

The customer was a company providing consulting services to sellers in a number of online marketplaces (mostly Amazon). In particular, they were interested in a system to optimise online bids.

Data science

Bid optimisation is a relatively new application area. The state of the art involves Bayesian models, which optimally incorporate the uncertainty associated with data scarcity (mostly for new products or ads). However, a direct implementation of fully Bayesian approaches to this problem is computationally too demanding. As a consequence, we implemented a fast approximation to these Bayesian methods (using INLA).

Having a measure of the uncertainty of the estimates allowed us to split the ad lifetime into an exploration period, where money is invested to assess the ad performance, and an exploitation period where an optimal policy is implemented and profits are finally reaped.
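The project itself ran on R and INLA; the following toy Python sketch only illustrates the underlying exploration/exploitation idea, with a conjugate Beta-Binomial posterior and made-up conversion rates standing in for the real models:

    # Posterior uncertainty drives the exploration/exploitation trade-off
    # (Thompson-sampling flavour, with invented conversion rates).
    import numpy as np

    rng = np.random.default_rng(7)
    true_rates = np.array([0.02, 0.035, 0.05])   # unknown conversion rates of 3 ads
    alpha = np.ones(3)                           # Beta(1, 1) priors
    beta = np.ones(3)

    for _ in range(10000):                       # impressions, one at a time
        draws = rng.beta(alpha, beta)            # one posterior draw per ad
        ad = int(np.argmax(draws))               # bid on the ad that looks best now
        converted = rng.random() < true_rates[ad]
        alpha[ad] += converted                   # Bayesian update
        beta[ad] += 1 - converted

    posterior_mean = alpha / (alpha + beta)      # wide posteriors -> exploration;
                                                 # narrow posteriors -> exploitation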

Technology

The customer provided us with a very well designed MySQL database, so we could just query it (via Python) and model the results (using R and INLA as the analytical backend). The solution was fully dockerised and deployed on the customer's servers.

Transit analysis in physical stores

A large actor in the retail sector was interested in analysing customer behaviour in their physical stores. They had already deployed a network of azimuthal cameras able to detect activity. However, the output of the cameras was just streams of raw numbers associated with pixels. It was therefore necessary to map this activity onto actual store locations, i.e.:

  • Identify the points on the store plan corresponding to each camera pixel.
  • Combine information from different cameras when they project onto the same store area.
  • Create visualisations (e.g., heatmaps) representing store activity over time.

Data science

The initial setup of a new store used to be a very time-consuming process. We greatly accelerated it with a model of the camera optics and image deformation (they were fisheye cameras) based on a limited set of inputs (as these inputs still required human intervention).

The mapping was built using standard nonlinear models which could learn each camera's orientation (i.e., the direction in which it was pointing) and deformation.
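A condensed, hypothetical sketch of the calibration idea (the parametric form, point values, and parameter names below are invented for illustration): from a few manually matched points, fit a simple radial-distortion-plus-pose model by nonlinear least squares, after which any activity pixel can be projected onto the store plan:

    # Fit a toy pixel -> store-plan mapping from a handful of reference points.
    import numpy as np
    from scipy.optimize import least_squares

    # Reference points marked by a human: pixel coordinates and plan coordinates.
    pixels = np.array([[120, 80], [400, 90], [260, 300], [80, 350], [430, 340]], float)
    floor = np.array([[1.0, 2.0], [6.0, 2.2], [3.5, 6.0], [0.5, 7.0], [6.5, 6.8]])

    def project(params, px):
        cx, cy, k, s, theta, tx, ty = params
        d = px - np.array([cx, cy])                   # offset from the optical centre
        r = np.linalg.norm(d, axis=1, keepdims=True)
        undist = d * (1 + k * r**2)                   # crude radial (fisheye) correction
        R = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
        return s * undist @ R.T + np.array([tx, ty])  # rotate, scale, translate

    def residuals(params):
        return (project(params, pixels) - floor).ravel()

    x0 = np.array([260.0, 210.0, 0.0, 0.01, 0.0, 0.0, 0.0])  # rough initial guess
    fit = least_squares(residuals, x0)

    # With the fitted parameters, every activity pixel maps to plan coordinates.
    plan_coords = project(fit.x, pixels)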

Technology

The solution was used by at least two large retailers in Spain, so we implemented it on Google Cloud (Compute Engine, Cloud Functions, and BigQuery). Users had access to hourly statistics, maps, and heatmaps via a high-performance dashboard built on Python Flask and R Shiny.

In summary

Readers of this document may be surprised by the variety and heterogeneity of the projects we have worked on. At some level they are indeed heterogeneous; however, we find them all quite similar, differing only in superficial attributes. We are good at abstracting problems and finding similarities among them: we will probably see your seemingly specific problem as just one more instance of a broad category of problems.

But we are also good at adapting the output of these global solutions to the specifics of your own requirements and business needs.

We also have a preference for simple, standard, time-tested tools. We would very seldom suggest the latest shiny, trendy tool. In fact, very recently we were discussing the possibility of presenting some of our work at a data science conference and realised we would not be welcome: at a conference you cannot say that in a given project you used a database engine from the nineties, computing tools from the early 2000s, and a few statistical ideas from the 60s (or earlier!). But it works, and it works reliably and as expected.