Expert in Spark architecture on top of Hadoop. Taught Spark development (with DataSets and RDDs) using the MongoDB, CSV, JSON and XML libraries. Taught agile prototyping with notebooks such as Jupyter and Zeppelin.
Installation and configuration of a plug-and-play Datalab environment enabling execution on multiple clusters, whatever the target Hadoop distribution (Hortonworks, Cloudera, MapR or plain Hadoop). Integration of the latest Spark 2.1.0 and of the Spark history server, tested on Hortonworks 2.5 (in AWS and local modes). Preparation of open-source tools for data engineers, data analysts and data scientists (e.g. Jupyter Notebook and Zeppelin running on top of Spark). Docker containerization of Hadoop nodes, Spark applications and notebook servers.
Data Engineering and Big Data Architecture projects driven by Data and Use Cases:
Data transformations with Spark (e.g. XML and log4j logs to columnar format), dashboards with Banana on the SolrCloud search engine, machine learning with Spark ML, prototyping with Jupyter Notebook and the Spark kernel, on-demand deployments of Spark applications, development in Python, Java and Scala, training sessions.
Rapid delivery of challenging projects oriented toward log analysis:
Leading a data-driven approach around the EDRMS (Electronic Document and Records Management System), closely involving the project owner in an agile way in order to clarify his needs, such as getting statistics and detecting outliers.
One of the challenges was to transform a wide variety of XML logs coming from the EDRMS with a Java Spark transformer, to run complex aggregations with Spark SQL, then to index the results with SolrCloud, and finally to analyze the aggregated clean data with Banana dashboards.
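As an illustration of that chain, a minimal PySpark sketch of the transform-then-aggregate steps (the field names eventType, userId and durationMs, and the one-record-per-line layout, are assumptions; the SolrCloud indexing step is omitted):

    import xml.etree.ElementTree as ET
    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.appName("edrms-logs").getOrCreate()

    def parse_event(line):
        # Flatten one XML log record into a Row (field names are assumptions).
        root = ET.fromstring(line)
        return Row(eventType=root.findtext("eventType"),
                   userId=root.findtext("userId"),
                   durationMs=int(root.findtext("durationMs") or 0))

    events = (spark.sparkContext.textFile("hdfs:///logs/edrms/")
              .filter(lambda l: l.lstrip().startswith("<event"))
              .map(parse_event))
    spark.createDataFrame(events).createOrReplaceTempView("events")

    # Aggregations expressed in Spark SQL, before indexing into SolrCloud.
    spark.sql("""SELECT eventType, COUNT(*) AS nb, AVG(durationMs) AS avg_ms
                 FROM events GROUP BY eventType""").show()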
Machine Learning POC for quality analysis of CRM data:
Using Spark ML and GraphLab Create to search for hidden anomalies in the data. Data preparation, and clustering algorithms such as K-Means, GMM and LOF.
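A minimal sketch of distance-to-centroid outlier scoring with Spark ML's K-Means (the CRM column names and k=8 are illustrative assumptions):

    import numpy as np
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("crm-quality").getOrCreate()
    crm = spark.read.csv("hdfs:///data/crm.csv", header=True, inferSchema=True)

    # Assemble the numeric columns (illustrative names) into a feature vector.
    vectors = VectorAssembler(inputCols=["age", "revenue", "nb_contracts"],
                              outputCol="features").transform(crm)
    model = KMeans(k=8, seed=42).fit(vectors)
    centers = model.clusterCenters()

    # Score each record by its distance to its centroid: far means suspicious.
    scored = model.transform(vectors).rdd.map(
        lambda r: float(np.linalg.norm(r["features"].toArray()
                                       - centers[r["prediction"]])))
    print(scored.top(20))  # distances of the 20 most abnormal records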
Innovating and assisting the data scientists who develop their business use cases with machine learning, leveraging the benefits of Big Data technologies such as Hadoop with Spark, DataFrames and machine learning algorithms (example use cases: anti-fraud detection, churn, appetency, text classification).
Recommendation system:
Implementation of a Proof of Concept consisting of a real-time recommendation system that recommends insurance products on new clients' sales receipts as they pass through the tills of hypermarkets.
Spark machine learning pipeline for learning in batch mode: supervised learning with the Alternating Least Squares (ALS) algorithm of Spark ML (matrix factorization, using the user vectors and the rank parameter for dimensionality reduction); unsupervised learning with the K-Means algorithm of Spark ML in order to find clusters of similar purchase behavior.
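A minimal sketch of the ALS batch-learning step (the receipts schema, rank=20 and the implicit-feedback choice are illustrative assumptions):

    from pyspark.ml.recommendation import ALS
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("reco-batch").getOrCreate()
    # Implicit ratings rebuilt from receipts (illustrative schema):
    # (clientId, productId, nbPurchases).
    receipts = spark.read.parquet("hdfs:///data/receipts")

    als = ALS(rank=20, maxIter=10, implicitPrefs=True, userCol="clientId",
              itemCol="productId", ratingCol="nbPurchases")
    model = als.fit(receipts)             # factorizes the client-product matrix
    top3 = model.recommendForAllUsers(3)  # 3 product suggestions per client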
Hadoop HDFS for storing the logs coming from the tills, for later feature engineering and training of the learning algorithms. Hadoop YARN for the Spark and Kafka clusters (processing and computing in batch mode and in real time).
Real time with Kafka and Spark Streaming for predictions (evaluation of the sales receipts coming from the supermarket tills).
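A hedged sketch of that streaming path with the Spark Streaming Kafka connector (the topic name, broker address and batch interval are assumptions; the scoring body is a placeholder):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="reco-realtime")
    ssc = StreamingContext(sc, 2)  # 2-second micro-batches (assumption)

    # Direct stream over the (assumed) "receipts" topic.
    stream = KafkaUtils.createDirectStream(
        ssc, ["receipts"], {"metadata.broker.list": "broker1:9092"})

    def score(rdd):
        # Each receipt would be featurized and evaluated against the models
        # trained in batch mode; printing stands in for the real sink.
        for _, receipt in rdd.take(10):
            print(receipt)

    stream.foreachRDD(score)
    ssc.start()
    ssc.awaitTermination()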
Spark Python programming. Development of a Python-Scala bridge in order to improve the performance of UDFs (User Defined Functions). Linux shell programming for packaging and deployment in the production environment. Jupyter Notebook and the Eclipse PyDev IDE used as development environments.
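One common shape for such a bridge, as a sketch (the Scala object com.example.Udfs and the UDF name "normalize" are hypothetical): the UDF is registered on the JVM side, so rows are never serialized to Python workers as they would be with a plain PySpark UDF.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("udf-bridge").getOrCreate()

    # Scala side, packaged in the job's jar (hypothetical object), e.g.:
    #   object Udfs { def register(s: SparkSession): Unit =
    #     s.udf.register("normalize", (x: String) => x.trim.toLowerCase) }
    # Triggering the JVM-side registration through the Py4J gateway:
    spark._jvm.com.example.Udfs.register(spark._jsparkSession)

    spark.read.parquet("hdfs:///data/clients").createOrReplaceTempView("clients")
    # The UDF now executes entirely on the JVM, unlike a UDF declared
    # with pyspark's udf(), which ships every row to a Python worker.
    spark.sql("SELECT normalize(name) AS name FROM clients").show()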
PageRank:
Understanding the PageRank algorithm and implementing it with Spark RDDs in order to give a pedagogical presentation of graph processing to the data scientists and convince them to use Spark GraphX coupled with a graph database like Neo4j.
Implementation of a Proof of Concept based on Spark RDDs (to present the PageRank algorithm) and Spark GraphX (for PageRank, Connected Components and Triangle Counting). Technos: Spark / Hadoop YARN, Hadoop HDFS, Python, Scala, Jupyter Notebook.
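The classic RDD formulation used for that presentation, as a sketch (the edge-list path and iteration count are illustrative):

    from pyspark import SparkContext

    sc = SparkContext(appName="pagerank")
    # One "src dst" edge per line (illustrative path).
    links = (sc.textFile("hdfs:///data/edges.txt")
             .map(lambda l: tuple(l.split()))
             .groupByKey().cache())
    ranks = links.mapValues(lambda _: 1.0)

    for _ in range(10):
        # Each page spreads its rank evenly over its outgoing links...
        contribs = links.join(ranks).flatMap(
            lambda kv: [(dst, kv[1][1] / len(kv[1][0])) for dst in kv[1][0]])
        # ...then ranks are rebuilt with the usual 0.15/0.85 damping.
        ranks = (contribs.reduceByKey(lambda a, b: a + b)
                         .mapValues(lambda r: 0.15 + 0.85 * r))

    print(ranks.top(10, key=lambda kv: kv[1]))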
POC with Spark DataFrames (SQL) and Spark Machine Learning:
Implementation of Proofs of Concept with Spark DataFrames and SQL, and Spark Machine Learning (using objects like Transformer and Estimator for Pipelines, Evaluator, CrossValidator).
Development using ML algorithms such as Linear and Logistic Regression, Random Forest, Neural Networks and ALS for recommendations.
Presentation of this work to the data scientists in order to show them how, from their Jupyter environment, they can develop both with Python (Scikit-Learn and Pandas) and with Spark ML and DataFrames.
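A minimal sketch of the Pipeline / Estimator / CrossValidator pattern presented to them (the "text" and "label" columns are assumptions):

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.feature import HashingTF, Tokenizer
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ml-pipeline-poc").getOrCreate()
    train = spark.read.parquet("hdfs:///data/train")  # assumed "text", "label"

    # Two Transformers feeding one Estimator, chained in a Pipeline.
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    tf = HashingTF(inputCol="words", outputCol="features")
    lr = LogisticRegression(labelCol="label", featuresCol="features")
    pipeline = Pipeline(stages=[tokenizer, tf, lr])

    # The CrossValidator selects the best regParam with the given Evaluator.
    grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
    cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator(), numFolds=3)
    model = cv.fit(train)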
Implementation of a benchmark comparing the performance of Pig, Hive and Spark SQL on a five-node Hadoop 2.6 cluster, based on a Left Outer Join query over tables containing retail data (the Spark SQL side is sketched below).
Technos: Linux shell programming for packaging and deployment of the project on different Hadoop-Spark clusters, Spark DataFrames, Python.
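The core of the Spark SQL variant of that bench, as a sketch (table and column names are illustrative):

    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("join-bench").getOrCreate()
    # Illustrative retail tables.
    spark.read.parquet("hdfs:///retail/sales").createOrReplaceTempView("sales")
    spark.read.parquet("hdfs:///retail/products").createOrReplaceTempView("products")

    t0 = time.time()
    n = spark.sql("""SELECT s.*, p.label
                     FROM sales s
                     LEFT OUTER JOIN products p ON s.product_id = p.product_id
                  """).count()  # count() forces the join to fully execute
    print("%d rows in %.1f s" % (n, time.time() - t0))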
Csv2Hive: this tool dynamically infers the schema of big CSV files containing many columns; it enables quick, automatic injection of external data to feed the Hive metastore and Hadoop HDFS. Technos: 95% Linux shell scripting, 5% Python.
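The schema-inference idea behind that tool, sketched here in Python (the production tool is mostly shell; the type-promotion order and the DDL template are illustrative assumptions):

    import csv

    def guess_type(values):
        # Try the narrowest Hive type that accepts every sampled value.
        for hive_type, cast in (("BIGINT", int), ("DOUBLE", float)):
            try:
                for v in values:
                    cast(v)
                return hive_type
            except ValueError:
                continue
        return "STRING"

    def csv_to_ddl(path, table, sample=1000):
        # Sample the first rows, guess one type per column, emit Hive DDL.
        with open(path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)
            rows = [r for _, r in zip(range(sample), reader)]
        cols = ["`%s` %s" % (name, guess_type([r[i] for r in rows if i < len(r)]))
                for i, name in enumerate(header)]
        return ("CREATE EXTERNAL TABLE %s (%s) ROW FORMAT DELIMITED "
                "FIELDS TERMINATED BY ','" % (table, ", ".join(cols)))

    print(csv_to_ddl("clients.csv", "clients"))  # illustrative file and table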
Miscellaneous tasks:
Administration of a four-node Cloudera cluster (Hadoop 2.3.0-cdh5.0.2), used mainly for Pig and Hive.
Installation and administration of a five-node Cloudera cluster (Hadoop 2.6.0-cdh5.4.4); sizing per node: 4 cores at 2.6 GHz, 96 GB RAM, 1 TB disk.
Installation of Spark on YARN with Anaconda on each Hadoop node.
Configuration of Jupyter Notebook for Spark on YARN, allowing the data scientists to discover Spark on the Hadoop cluster.
In charge of feeding business data into the Hadoop data lake (hence Csv2Hive).
Developed MapReduce jobs in Java (e.g. inverted index).
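The same inverted-index logic, sketched here in Python in the Hadoop Streaming style (the original jobs were written in Java; the "doc_id<TAB>text" input layout is an assumption):

    import sys

    def mapper():
        # Input: "doc_id<TAB>text" lines (assumed layout);
        # emit one "word<TAB>doc_id" pair per distinct word.
        for line in sys.stdin:
            doc_id, _, text = line.rstrip("\n").partition("\t")
            for word in set(text.lower().split()):
                print("%s\t%s" % (word, doc_id))

    def reducer():
        # Streaming sorts by key, so all doc ids of a word arrive contiguously.
        current, docs = None, []
        for line in sys.stdin:
            word, doc_id = line.rstrip("\n").split("\t")
            if current is not None and word != current:
                print("%s\t%s" % (current, ",".join(sorted(set(docs)))))
                docs = []
            current = word
            docs.append(doc_id)
        if current is not None:
            print("%s\t%s" % (current, ",".join(sorted(set(docs)))))

    mapper() if sys.argv[1] == "map" else reducer()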
Machine-Learning Challenge (Retail domain):
Multi-categorization challenge for the CDiscount company (https://www.datascience.net/fr/challenge/20/details). Developed a program with more than 500 Multinomial Naive Bayes models, using stemming, stratified sampling and mutual information.
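A minimal sketch of one such classifier with scikit-learn (the file name and column names are hypothetical; stemming is assumed to have been applied upstream):

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    df = pd.read_csv("cdiscount_train.csv")  # hypothetical extract of the data
    texts, labels = df["description_stemmed"], df["category"]

    # stratify=labels keeps the category proportions in both splits.
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, stratify=labels, test_size=0.2)

    clf = make_pipeline(CountVectorizer(max_features=50000), MultinomialNB())
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))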
Defining a test plan to compare the performance of the Solr and Exalead search engines.
08/2014 - Call for tender:
Survey around a possible hybrid SQL-NoSQL implementation for a multilingual platform based on SQL-Server and Cassandra.
05/2014-06/2014 - Bench for Kafka broker:
As part of a large strategic project, defining a performance test plan for the Apache Kafka broker used to feed the Cassandra NoSQL database.
Development of a configurable tool (real-time production of large volumes of data with statistics, using customized message formats such as text, JSON, XML and others) covering all the test cases for Kafka and Cassandra in the new Cdiscount environments; the core idea is sketched below.
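A hedged sketch of that load-generation idea, written here with the kafka-python client (an assumption; the original tool's stack is not specified, and the broker address and topic name are illustrative):

    import json
    import time
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="broker1:9092")  # assumed broker

    # Pluggable message formats, as in the original tool.
    FORMATS = {
        "json": lambda i: json.dumps({"id": i, "ts": time.time()}).encode(),
        "text": lambda i: ("message %d" % i).encode(),
        "xml":  lambda i: ("<msg id='%d'/>" % i).encode(),
    }

    def run(topic, fmt, count):
        # Produce `count` messages in the chosen format and report throughput.
        encode = FORMATS[fmt]
        t0 = time.time()
        for i in range(count):
            producer.send(topic, encode(i))
        producer.flush()
        dt = time.time() - t0
        print("%d %s messages in %.1f s (%.0f msg/s)" % (count, fmt, dt, count / dt))

    run("bench", "json", 100000)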
Architecture surveys around a web application implemented in Java and JavaScript, dedicated to producing real-time content about traffic jams, car crashes, roadworks, etc.
Yield Management in the domain of train transportation:
Audits for designing the n-tier architectures of existing applications dedicated to the Yield Management business domain (optimizing train ticket prices).
List of technologies involved: AngularJS, jQuery with jqPlot for the GUIs, web services with a JBoss AS server configured in high availability, Drools rules engine (BRMS) to automatically compute the recurrent, simple business rules.
Support and compliance of operational processes and action plans:
Keeping project plans and operational processes compliant in order to deliver the statistical business applications on time.
List of technologies: Java-EE, Customer's frameworks, ClearCase, Maven, SonarQube, WebLogic, Oracle, SAS, Unix.
Design and development of a SSO launcher for SAS Enterprise Guide V4.3:
Launcher deployed in all the agencies, providing automatic authentication for end users such as the statisticians.
Design and development (in Java) of a reliable integration chain for SAS components running in Cobol and Unix:
A solution similar to a continuous integration system, automatically preparing each SAS component for a specific target environment such as production.
Quotation for a successfully won call for tender, consisting of the creation of a Java EE web application for managing the European historical heritage (estimation method: Use Case Points).
Specifications around Java ESB technologies in order to transfer business data in a secure and reliable way (solution based on Java OSGi and Apache Camel).
French lead and coordinator of a European R&D project called Usenet (ITEA2 consortium), aiming to create a new European standard in the M2M (Machine to Machine) domain.
11/2006 - Architecture and Development around ETL (Talend) for COMPLETEL-BOUYGUES:
Design and development of a Java ETL application based on Talend, in order to automatically transfer, every month, the invoices coming directly from the clients.
10/2006 - Architecture and Development around ETL (Talend) for CNAMTS:
Design and development of a Java ETL application based on Talend, in order to import and export data from the fleet-management sources to SIEBEL and GLPI.
09/2006 - Pre-sale & Investigations:
Cost estimates and technical surveys around solutions using GPS localization in the Fleet Management domain.
Investigations and recommendations for evolving client-server architectures towards N-tier JEE architectures.
Investigations around access-control systems for time management.
Technical support for the development teams.
Development around a Java ETL (based on Oracle Sunopsis) for real-time synchronization and bidirectional communication between heterogeneous databases.
Specifications and developments in Java around a web PDM application (Product Data Management). List of technologies: Rational Rapid Developer, WebSphere, Tomcat, Oracle RDBMS, tests with IBM Workload Simulator.
Development to automatically install any Oracle 10g database in "silent" mode. List of technologies: InstallShield, Java, Ant.
Many developments based on Java EJBs, JBoss, WebLogic and WebSphere servers.
Development in Java of a CTI server (Computer Telephony Interface) for Alcatel-Lucent PABXs, on top of the Genesys middleware, providing highly available CTI integration for up to hundreds of connected operators using a CRM application (Customer Relationship Management).
Main technologies: Java, Siebel CRM, CTI Genesys, Oracle RDBMS, MySQL RDBMS, SQL-Server RDBMS, MQSeries Broker, SWIFT, LDAP Directory.
Main customers: CNAMTS, MGEN GROUP CIC, GROUPAMA, URSSAF, CNCA.