Stanford CS345S Autumn 2016 Project Ideas
I'm happy to indulge many different data-related projects in this course, including projects that are directly related to your own research. You'll need to eventually convince me why your proposal is suitable for this course and should be considered research, but I have confidence in your persuasiveness. I'm especially excited about projects that solve real problems, with real data and the potential for real impact.
To get you thinking, I've listed a bunch of project ideas below that could be really interesting for a quarter-long project. Many of these are more speculative than concrete; my suggestion: read them over, see what jumps out, and I'll happily work with you to scope a project that's appropriate for the quarter (and possibly beyond!).
Several of these projects refer to MacroBase, a new analytics engine we're developing in our research group. The broader opportunity here is that, if you do something cool in MacroBase (which is open source), a number of companies you've almost certainly heard of may use your research, and you'll have an opportunity to work closely with the FutureData M.S. and Ph.D. students and the recent Ph.D. graduate whose advice they sometimes listen to.
As always, happy to talk! — PB
"ML" is eating the world
- Deep/Slow, Shallow/Fast? Deep networks excel at automatic featurization in many tasks, but they can be expensive to evaluate. Simpler linear models require more manual featurization to perform accurately, but they're also much cheaper to evaluate. Pick a task (my suggestion: time-series modeling; we have datasets) and evaluate the speed-accuracy trade-off in this design space (bonus: measure both training time and scoring time; bonus bonus: do it for many models). Can you build an optimizer that decides which model to use for a given data stream, and adapts that choice over time?
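To make the trade-off concrete, here's a toy measurement harness: it scores and times a cheap linear model against a more expensive non-parametric one (1-NN standing in for the "heavy" model) on a synthetic time-series task. Everything here (data generator, window size, the specific models) is invented for illustration; a real project would swap in the course datasets and real deep/linear models.

```python
# Toy speed/accuracy harness: cheap linear model vs. expensive 1-NN.
# All data and model choices are illustrative, not from the course.
import random
import time

random.seed(0)
WINDOW = 16

def make_window(label):
    # Class-1 windows drift upward over the window; class-0 windows are flat noise.
    return [random.gauss(i * 0.1 * label, 1.0) for i in range(WINDOW)]

train = [(make_window(l), l) for l in [0, 1] * 200]
test = [(make_window(l), l) for l in [0, 1] * 100]

# Cheap "linear" model: nearest-class-mean, i.e. score = w . x against a bias.
sums = {0: [0.0] * WINDOW, 1: [0.0] * WINDOW}
counts = {0: 0, 1: 0}
for x, l in train:
    counts[l] += 1
    for i, v in enumerate(x):
        sums[l][i] += v
mu0 = [s / counts[0] for s in sums[0]]
mu1 = [s / counts[1] for s in sums[1]]
w = [a - b for a, b in zip(mu1, mu0)]
bias = sum(wi * (a + b) / 2 for wi, a, b in zip(w, mu1, mu0))

def linear_predict(x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > bias else 0

# Expensive model: 1-nearest-neighbor over the raw training windows.
def nn_predict(x):
    nearest = min(train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))
    return nearest[1]

results = {}
for name, predict in [("linear", linear_predict), ("1-NN", nn_predict)]:
    t0 = time.perf_counter()
    acc = sum(predict(x) == l for x, l in test) / len(test)
    ms = (time.perf_counter() - t0) * 1000
    results[name] = acc
    print(f"{name}: accuracy={acc:.2f}, scoring time={ms:.1f}ms")
```

The interesting project is what sits on top of a harness like this: an optimizer that looks at a stream's characteristics (and a latency budget) and picks a point on the measured trade-off curve.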
- Own a benchmark: NAB (the Numenta Anomaly Benchmark) is a benchmark competition for time-series anomaly detection. I have reason to believe we can win this competition. Your project: actually win. Your project++: evaluate runtime in addition to accuracy.
- Discretization: Win or Loss? Many recently proposed techniques for generating explanations for important events rely on discretization in order to efficiently generate hypotheses. Is this strictly necessary for scale and accuracy? Is there an optimal way to discretize a space to generate the best explanations? Or should we forget about discretization and instead focus on scaling up continuous techniques such as subspace clustering? If this description is too vague, Peter can explain on the board in five minutes. This project has serious implications for next-gen engines like MacroBase. Opportunities for hacking, theory, and everything in-between.
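To pin down what "discretization for hypothesis generation" means here, below is a toy sketch: bucket a continuous attribute into fixed-width bins, then score each bin by how over-represented outliers are inside it (a risk-ratio-flavored explanation search). The data, bin width, and scoring rule are all invented for illustration; the research question is whether this kind of bucketing is fundamentally needed, or whether continuous methods can match it at scale.

```python
# Toy discretization-based explanation search: which bins of a continuous
# attribute are over-represented among outliers? Data and parameters invented.
import random
from collections import Counter

random.seed(0)
# Synthetic latency readings: inliers around 100ms, outliers around 160ms.
points = [(random.gauss(100, 10), False) for _ in range(1000)]
points += [(random.gauss(160, 5), True) for _ in range(50)]

def bin_of(x, width=20):
    lo = int(x // width) * width
    return (lo, lo + width)

outliers, inliers = Counter(), Counter()
for x, is_outlier in points:
    (outliers if is_outlier else inliers)[bin_of(x)] += 1

# Score each bin by its outlier fraction. A real engine would use a risk
# ratio plus a support threshold, but the shape of the computation is the same.
scores = {b: outliers[b] / (outliers[b] + inliers[b])
          for b in set(outliers) | set(inliers)}
best = max(scores, key=scores.get)
print("most explanatory bin:", best)
```

Note how the bin width silently controls both cost (number of hypotheses) and explanation quality; that coupling is exactly what the project would interrogate.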
- Warning: Sharp edges! Phones are amazing. They're basically supercomputers! Take your favorite ML operator (my pick: MacroBase's unsupervised density estimators) and figure out how to implement it on a fleet of phones (we have some), then figure out how to keep it up to date (we have ideas), then figure out how to do it with low power (wow, what a cool project!). We have applications in industrials + wearables + ... you name it!
- Model Party! In general, we'll want to run and evaluate many models at once, then combine their outputs via ensembling methods. There are many interesting questions here, in terms of what scales, what's accurate, etc.
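As a starting point for "combine their outputs," here is one of the simplest ensembling schemes: normalize each model's scores, then average them per point. The models and scores below are toy stand-ins; real questions start when models disagree, arrive at different rates, or cost very different amounts to run.

```python
# Minimal score-averaging ensemble: z-normalize each model's outputs,
# then average per data point. Models and scores are toy stand-ins.
def zscore(xs):
    m = sum(xs) / len(xs)
    sd = (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - m) / (sd or 1.0) for x in xs]

def ensemble(score_lists):
    normed = [zscore(s) for s in score_lists]
    return [sum(col) / len(col) for col in zip(*normed)]

# Two toy "models" scoring five points; both consider point 3 anomalous,
# but on very different raw scales.
model_a = [0.1, 0.2, 0.1, 0.9, 0.2]
model_b = [1.0, 2.0, 1.5, 9.0, 1.1]
combined = ensemble([model_a, model_b])
top = max(range(len(combined)), key=combined.__getitem__)
print("consensus anomaly index:", top)
```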
- I can see clearly now... We want to try out a large set of visualization techniques in MacroBase, including improvements to faceted browsing, feature selection, and result explanation. Want to hack in JS? Or... want to figure out how to push down visualization properties like pixel density into data analysis algorithms? Peter has projects.
Applications are eating the world
- Eyes in the sky: Imagine you have access to a large dataset of commercial satellite imagery that is updated over time. What changes over time are most interesting? Can we detect deforestation events? Urbanization? Drought? What features are indicative and important? Project idea: using a corpus we have collected, try to detect events like the above (think: fancy featurization, supervised models, etc.). Figure out how to do this in a streaming manner. Implement in MacroBase!?!
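For the "streaming manner" part, one baseline worth beating: track a per-region summary statistic (say, a vegetation index) with an exponentially weighted moving average and flag large deviations as change events. The detector, thresholds, and series below are all invented for illustration; real imagery needs the fancy featurization mentioned above first.

```python
# Toy streaming change detector: EWMA of mean and variance, flag points
# whose z-score exceeds a threshold. All parameters/data are illustrative.
def detect_changes(stream, alpha=0.1, threshold=3.0):
    mean, var = None, 1e-4
    events = []
    for t, x in enumerate(stream):
        if mean is None:
            mean = x  # initialize on the first observation
            continue
        z = (x - mean) / max(var, 1e-12) ** 0.5
        if abs(z) > threshold:
            events.append(t)
        # Update running estimates after scoring, so a change is scored
        # against the pre-change state.
        mean = (1 - alpha) * mean + alpha * x
        var = (1 - alpha) * var + alpha * (x - mean) ** 2
    return events

# A toy "vegetation index" series with a sharp drop (think: clearing event).
series = [0.8] * 20 + [0.2] * 10
events = detect_changes(series)
print("change points:", events)
```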
- Which video, now? Video represents a huge proportion of data volumes (think: bodycams, surveillance, cell phone streamers); how do we quickly search through video feeds to find what matters? Project idea: scrape online video feeds (Peter has suggestions), build a prototype online recommendation engine that continuously monitors video feeds and updates its suggestions for what to watch. Requires lots of hacking and fun with featurization, image processing, deep learning, data serving... This is a super hard problem; building something will likely uncover some very interesting research challenges.
- How many logs make a cabin? Systems like Splunk and Elasticsearch are hugely valuable for searching over large amounts of semi-structured text such as logs. In theory (i.e., in the research literature), there are many techniques for finding unusual patterns in these logs, but nobody actually uses them (/exaggeration). What actually works, and what scales?
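To give a feel for the simplest thing that could possibly work: mask the variable parts of each log line to get a "template," count templates, and flag the rare ones. The log lines and masking rule below are toy examples; the literature's techniques (and the project's question of what actually scales) start where this crude version breaks down.

```python
# Toy log-template miner: mask numbers, count templates, flag rare ones.
# Log lines and the masking rule are invented for illustration.
import re
from collections import Counter

logs = [
    "connection from 10.0.0.1 port 443",
    "connection from 10.0.0.2 port 443",
    "connection from 10.0.0.3 port 8080",
    "disk error on sda1 sector 19291",
    "connection from 10.0.0.4 port 443",
]

def template(line):
    # Replace every run of digits with a placeholder token.
    return re.sub(r"\d+", "<NUM>", line)

counts = Counter(template(line) for line in logs)
rare = [t for t, c in counts.items() if c == 1]
print("unusual patterns:", rare)
```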
- Physics! How do they work? A large number of data-driven domains (self-driving cars, many kinds of equipment maintenance, mobility monitoring) incorporate physics-based models. How should we encode these physics-based models into a more generic model-serving architecture like MacroBase? In an ensemble, in what regimes should we trust them versus other models?
Everything is fine
- Databases are broken, apps are broken, we all get free Bitcoin: It turns out that many applications have a bunch of bugs, both because databases don't implement transactions correctly, and because application writers don't use transactions correctly. It's possible to exploit these bugs to get things like free Bitcoin, unlimited t-shirts, and access to private data. We've got some new results in this space, and a set of potential projects around automatically generating attacks on web apps (and patching them). Contact Peter.
- Flirting with data science? We have a large number of datasets to explore. If you just want to get some hands-on data science experience, we have opportunities.
- Language separates humans from animals: Many exciting new platforms (e.g., MacroBase for analytic monitoring, Lambda for serverless computation) have extremely low-level APIs for authoring new dataflow pipelines / computations. This project: prototype a new language and/or set of interfaces (apocryphally quoth Dawson Engler: "APIs are just semantically impoverished programming languages") for a platform like MacroBase, Lambda, or Docker.
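To illustrate the gap between a low-level API and a "semantically impoverished programming language," here is a toy fluent pipeline interface of the kind a higher-level language for a MacroBase/Lambda-style platform might compile down to. The `Pipeline` class and its methods are invented for illustration; the project would design the surface language and figure out what it should compile to.

```python
# Toy fluent dataflow interface; the API is invented for illustration.
class Pipeline:
    def __init__(self, source):
        self.source = source
        self.stages = []  # ordered list of (kind, function) stages

    def map(self, fn):
        self.stages.append(("map", fn))
        return self

    def filter(self, pred):
        self.stages.append(("filter", pred))
        return self

    def run(self):
        # A real platform would plan/optimize here instead of interpreting.
        data = list(self.source)
        for kind, fn in self.stages:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

result = (Pipeline(range(10))
          .filter(lambda x: x % 2 == 0)
          .map(lambda x: x * x)
          .run())
print(result)  # [0, 4, 16, 36, 64]
```

Even this tiny example exposes the real design questions: where optimization happens, how stages are typed, and what the language can reject at "compile" time rather than at runtime.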