If you get stuck, ask questions on Stack Overflow or Reddit. Good luck!
Online resources, there are many. Too many sometimes. Grab anything from Codecademy, Code School, or Treehouse. But remember that's not the only thing you need to know.
If you check my profile, we do remote programming courses where people work together with a real teacher for 6 weeks. We offer scholarships (100% free).
Remember:
* Focus on learning programming. What's the scope of a variable? What's immutability? Etc.
* Practice a lot. Code as much as you can.
* Look for a group to work with.
Reasons to segregate:
1) The single biggest one is that it firewalls the liability of the businesses from each other. Whether this is important or not for you depends on what the businesses are doing: if it's Regular Internet Stuff then your E&O policy is probably good enough in terms of risk mitigation, but if 1+ of your products are in highly regulated spaces (hello HIPAA, finance, etc) then putting them in their own LLC isn't a crazy solution.
2) If you're religious about doing not just the paper ownership but the business accounts separately for each business, that makes eventually selling or otherwise disposing of them much, much easier. Otherwise you're looking at weeks of work and/or very fun professional services bills when you decide to do the division later.
3) If you have co-founders or investors, or the prospect of getting co-founders or investors, separate legal entities are going to be pretty much required. You don't want them to accidentally get ownership of your side projects; they don't want to own your side projects (ownership is a risk; they know the risks they're signing up for and don't want additional sources of uncontrolled unknown risk).
4) A minor factor, but there is non-zero social friction involved in "We've been talking about my trading name of $FOO but remember that the invoice/contract/etc will be from $BAR, LLC."
Reasons to not segregate:
1) It's a lot of extra work.
2) There's a running cost to keeping an LLC open, both the yearly fees and the operational complexity of maintaining separate books, accounts at various providers, and (if you're doing things in a complicated fashion) keeping up appearances with regards to the LLCs being formally separate from each other.
As an ex-consultant with some accidental knowledge of the payments space: I would be doing double-plus firewalling between any payments startup and anything I'd lose sleep about losing, and I would be happily writing a sizable check right about now to a lawyer rather than taking HN's advice about my compliance obligations and potential sources of risk.
I work at a small startup with a roughly 10-person eng team.
When we write docs we focus mainly on architecture and processes. The architecture docs often emerge from a "tech spec" that was written for the development of a feature that required a new service, or substantial changes to an existing one. We keep everything in GitHub, in a README or other markdown files.
We also write API docs for HTTP endpoints. These are written with client developers and their concerns in mind. When doing this for a Rails app we use rspec_api_documentation, which is nice, but it can be annoying to have testing and documentation so tightly coupled. We've talked about changing how we do this, but we always have more pressing things to do.
We never write docs for classes or modules within an app/service.
The rest of the systems are documented ad-hoc. Some readme files here and there, a large block of comments inside of confusing files, the occasional style guide, etc.
We also have an onboarding guide for new devs (just a PDF) which walks them through our systems, our tools, etc. Nothing fancy, about 10 pages.
I'm actually pretty proud of the search that I put together for this setup too. It's all done in the browser: the indexes are built at compile time and then downloaded in full for each search, which sounds silly, but it works surprisingly well.
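The actual setup is in-browser JavaScript, but the compile-time indexing idea can be sketched in a few lines (Python stands in here; the page names and whitespace tokenization are simplified assumptions):

```python
# Sketch of a compile-time inverted index: built once during the site build,
# serialized as a static asset, and searched with plain dictionary lookups.
import json
from collections import defaultdict

def build_index(pages):
    """pages: {page_name: raw_text} -> JSON string shipped to the client."""
    index = defaultdict(list)
    for name, text in sorted(pages.items()):
        for word in sorted(set(text.lower().split())):
            index[word].append(name)
    return json.dumps(index)

def search(serialized_index, word):
    """Client-side part: a single dict lookup against the downloaded index."""
    return json.loads(serialized_index).get(word.lower(), [])
```

Downloading the whole index per search only stays cheap while the doc set is small, which is presumably why it works surprisingly well here.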
So we have a doc folder in the repo that is like staring into the maw of Cthulhu, and checkout takes up 90% of our build time on the CI server just sucking down that mass of garbage.
Saner systems have been proposed, but rejected because the powers that be are too averse to change...
Trying to get people onto Sphinx, and using it for some non-sanctioned documentation with good success, but it's unlikely to become official.
I really think version control is important: what changed, who changed it, provisional changes through branches, and removing the bottleneck of "I updated the docs, please everyone check before release and send me your comments". It should be patches, and only patches.
Besides the README.md to get started, the app defaults to a private portal with a component playground (for React), internal docs (for answering "how do I"), and tools for completely removing the need for doc pages at all.
I believe that documentation has to be part of the workflow, so component documentation should be visible while working on the component, tools for workflow should have introductions and helpful hints rather than being just forms and buttons, etc.
So far, this is proving fruitful.
(Side note: wikis are where docs go to die.)
- Easy editing (namely markdown files in folders)
- Runs on "cheap" hosting/everywhere (built with PHP)
- Supports multiple languages (so you can create docs in English, German, etc.)
- Can have editable try-on-your-own demos embedded into the documentation
- SEO friendly (clean URLs and navigation structure)
- Themeable (themes are separated and run with the Twig templating engine)
- Works on mobiles out of the box
- Supports plugins/modules for custom content/behaviour
- Formats reference pages for objects/classes/APIs in a nice way
- Supports easy embedding of Disqus for user feedback
- Other stuff I forgot right now
The system powers the knowledge base of my recent app "InSite" for web developers: https://www.insite-feedback.com/en/help
Another instance of docEngine runs for my pet HTML5 game engine: http://wearekiss.com/gamekit
This one uses the default theme, has most pages in two languages, and again incorporates a couple of live demos.
I host a little documentation about the engine itself here, but it's not complete right now: http://wearekiss.com/docEngine
You can also find the GitHub link to the project in the footer of every hosted documentation.
Have fun with it - I give it away for free. Critique and comments welcome! Everything I have linked was built by myself.
The important thing about docs is to keep in mind the audience. This is important because it lets you estimate their mental model and omit things that are redundant: for example, if it's internal documentation for a codebase, there is little need to explicitly list out Doxygen or JSDoc style information, because they have access to the damned source code. External audiences may need certain terms clarified, or some things explained more carefully because they can't just read the source.
I'd say that the biggest thing missing in the documentation efforts I've seen fail is the lack of explanation for the overarching vision/cohesive architecture of the work. This is sometimes because there isn't a single vision, or because the person who has the vision gets distracted snarfing on details that are not helpful without a preexisting schema to hang them on when learning. So, always always always have a high-level document that describes the general engineering problem the project solves, the main business constraints on that solution, and a rough sketch of how the problem is solved.
Ideally, the loss of the codebase should be less of a setback than the loss of the doc.
I will say that, as your documentation improves, you will hate your project more and more--this is the nature of the beast as you drag yourself through the broken shards of your team's engineering.
So whenever a new staffer comes along, I get asked to give them wiki access... but I'm the only one here who reads my edits (I'm the only ops staffer). Sure, have some wiki access, for all the good it will do you!
I really don't recommend our model :)
Anyway, this is an important point: documentation is not free. It takes time. Even shitty documentation takes time. If you want good documentation, you need to budget time away from other tasks. When I used to work in support, the field repair engineers would budget 30% of their hours for doing paperwork - not documentation specifically, but it clearly shows that 'writing stuff' is not something that springs as a natural/free parallel to other activity.
I believe that literate style of code writing has many benefits in any language.
Basically mix markdown with the codebase and export the documentation from the same file.
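A toy version of that mix-markdown-with-code idea: lines beginning with `#:` (an invented convention for this sketch) are treated as markdown prose, and the prose alone can be exported as the documentation:

```python
# Toy literate-style extractor: comment lines starting with "#:" are treated
# as markdown prose; everything else is code. Exporting just the prose gives
# you the documentation from the same file the code lives in.
def extract_docs(source):
    doc_lines = []
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("#:"):
            doc_lines.append(stripped[2:].strip())
        elif doc_lines and doc_lines[-1] != "":
            doc_lines.append("")  # blank line separates prose blocks
    return "\n".join(doc_lines).strip()
```

Real literate tools (Docco, Pycco, etc.) do essentially this, plus syntax highlighting and side-by-side rendering.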
For a very well executed and interactive example check out
Which makes it easy to create HTML, PDF, EPUB, LaTeX formats, etc.
I like to create a user guide, developer guide, and ops guide for each large project.
Beautiful documents, but it takes a decent chunk of time to create. We do extract some docs via XML to generate code, somewhat backwards from how most engineers merge docs and code.
1) Write to all of your target audience. For example if your product is targeted at both technical and non-technical people, then write the documentation in such a way that non-technical folks can understand it. Don't just write for the technical people.
2) If possible, write documentation around several "how do I do XYZ task?" questions. My experience has been that people tend to turn to documentation when they want to execute a specific task, and they tend to search for those phrases.
3) As much as is possible, include examples. This tends to remove ambiguities.
* MS doc(x) on a network folder with an excel spreadsheet to keep track of docs (and a lot of ugly macros).
* MS doc(x) in a badly organized Subversion repository (side note here: doc comments and revision mode are heavily used in those contexts, which is really annoying)
* rst + sphinx documentation in a repository to generate various outputs (html, odt, pdf...) depending on the client.
In some cases we also use Mako (a Python template engine) before Sphinx to instantiate the documentation for a specific platform (e.g. Windows, RedHat, Debian...), with just a few "if" conditions (Sphinx could do it in theory, but it's quite buggy and limited).
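The Mako-before-Sphinx step boils down to conditional text expansion. Here is a stdlib-only sketch of the idea, with control lines loosely modeled on Mako's `% if` syntax (real Mako supports arbitrary expressions, nesting, loops, and much more):

```python
# Minimal pre-Sphinx conditional expansion: keep or drop blocks of the
# template depending on the target platform. The "% if platform == '...'" /
# "% endif" lines only loosely mimic Mako's real control-line syntax.
def render(template, platform):
    out, keep = [], True
    for line in template.splitlines():
        if line.startswith("% if platform == "):
            keep = line.split("'")[1] == platform  # block's target platform
        elif line.startswith("% endif"):
            keep = True
        elif keep:
            out.append(line)
    return "\n".join(out)
```

Run once per platform, the output is plain reStructuredText that Sphinx then builds as usual.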
I've also put in place a continuous build system (just an ugly shell script) rebuilding the sphinx html version every commit (it's our "badly implemented readthedocs.org", but it's good enough for our needs).
In other cases we use specification tools like PowerAMC or Eclipse/EMF/CDO based solutions, the specification tool in that case works on a model, and can generate various outputs (docx, pdf, rtf, html...).
At home, for my personal projects, I use rst + sphinx + readthedocs, or if the documentation is simple, just a simple README.md at the root of my repository.
As a personal opinion, I like to keep the documentation close to the code, but not too close.
For example, I find it really annoying when the sole documentation is Doxygen (or equivalent). It's necessary to document each public method/attribute individually, but it's not sufficient: in most cases you also need "bigger picture" documentation on how stuff works together (software and system architecture).
On the other side, keeping the documentation away from the code (in a wiki or worse) doesn't work that well either; it's nearly a guarantee that the documentation will soon be out of date, if it isn't already.
I found having a doc directory in the source code repository a nice middle ground.
I find wikis annoying in most cases: rarely up to date, badly organized, and difficult to version coherently and properly (e.g. having a version of the docs matching the software version).
We also have higher level documentation, which is meant to serve as a sort of conceptual overview of the framework, as well as to show what the framework comes with out of the box. This section is written mostly in kramdown, which gets parsed by jekyll before it's turned into HTML.
We generate the bulk of those manuals based on our object model, which is liberally sprinkled with (text only) descriptions. We've created a simple XML-based authoring framework which allows us to create pretty tidy documentation. Including images, tables, code examples etc.
We convert that XML to Apache FOP. At the end of the process, we're left with a bunch of tidy PDF manuals in a variety of languages.
This is the most important step. If you cannot remember it from a blank slate, then no one can. Keep doing that until you understand the code at first glance. Then your code will be easy for anyone to maintain.
Ideally, I'd love to find a mechanism that:
- Provides the OO principles in documents: encapsulation, abstraction, polymorphism, inheritance.
- Accessible & maintainable by non-techies.
- Allows scripting (I toyed with PlantUML, but it was a bit rigid).
As for code, auto generated docs from jsdoc etc. headers are fine but I never use them honestly. I find unit tests to be the ultimate documentation in terms of code level docs.
I have a CVS repository of PDF and Word docs.
The business side uses docx format, so using markdown and generating docx is not really feasible. I have run into issues with people changing the filename and it creating a new entry in the version control. I have an idea I plan to implement to fix this.
What I would really like is some Linux tool that would make it easy to pull the text out of docx and make it searchable. I would want something that could run on the command line and does not have a ton of dependencies.
This seems like something that is a really good idea, but is hard to find any projects for it.
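This turns out to be doable with no dependencies at all, because a docx file is just a zip archive of XML. A rough sketch (text runs can split words mid-token, so this is only good enough for search-style indexing, not clean prose recovery):

```python
# A .docx file is a zip archive; the visible text lives in <w:t> elements
# inside word/document.xml, so the stdlib alone can pull it out for indexing.
import re
import zipfile

def docx_text(path):
    with zipfile.ZipFile(path) as z:
        xml = z.read("word/document.xml").decode("utf-8")
    # Crude but dependency-free: grab the contents of every <w:t> run.
    runs = re.findall(r"<w:t[^>]*>(.*?)</w:t>", xml, re.DOTALL)
    return " ".join(runs)
```

Pipe the output into grep, or feed it to a full-text indexer, and the docx repository becomes searchable from the command line.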
I'm trying to gather a community of git supporters to push for git.
However, after three months I still haven't gotten a computer suitable for my job.
Obviously I think just locking your doors (if you don't already) is the smartest idea. And setup a camera outside, at least that way you can start to deter this behavior and possibly catch a glimpse if it does happen. I'd also bet if this is happening multiple times, it is someone you know, or a local teenager etc. That's what happened to me when I found my car getting egged etc repeatedly one summer, I setup a system to catch them which allowed me to have a little discussion with them to make it stop before it escalated.
A little fun too: if you only use the outside camera, put a sign in the car that says "don't look back" or something to that effect (to get them to look towards the camera). Most people, when they read that, immediately do the behavior you described to see why, and they'll look right at the camera. Only the real criminals will just bolt and not do it.
Rather than 64-bit and ECC RAM, you could have high redundancy at the module level. AFAIK Google does not use server-grade systems, just lots of them in a failure-tolerant configuration.
An important part of the goal is to get an as "closed computational environment" as possible, where risk for BIOS/firmware infection by hacker is minimal.
So just CPU, ECC RAM, ethernet, and microSD (or USB) to boot off.
"What do you mean more?" her husband says. "I can't afford more."
"I mean more better," the producer replies.
The problem isn't so much WHAT Mayer did, but how badly she did it. She wanted to boost Yahoo's video division, for example, so she (a) paid Katie Couric (someone my Mom really liked) more money than God, (b) licensed the streaming rights to SATURDAY NIGHT LIVE (which needs to be put out of its misery) and (c) decided to fund a season of COMMUNITY (which nobody ever watched). Now THERE'S a compelling product offering. WhoTF would watch that?
She redesigned the site to make it mobile-friendly-- which made it almost unreadable for desktop users. Apparently no one told her that browsers send information about what type of device is reading the site. I liked the tech news, but it became so painful that I just gave up.
(She also killed a lot of the personalization features, which is what had attracted people.)
She bought a lot of tiny companies, most of which made products that were never integrated-- and whose employees left as soon as their employment agreements expired.
She bought Tumblr... then had absolutely no idea what to do with it, and ended up not doing anything. Shut down their sales force to integrate it with the Yahoo sales force (which didn't want to sell Tumblr)-- and then realized they'd have to restart the Tumblr sales force to monetize at all.
She inherited an email system that was a hot mess and made it a different type of hot mess-- with enough glitches that people couldn't rely on it for their email.
A lot of the ideas were reasonable, but the execution was so horrific that there was never any value-add.
Why is Yahoo so politicized and who cares? I think Yahoo is no different from any other large corporation. It's just that the players are engaging and using the media a lot more for their stupid rich asshole power struggles that 99% of earth doesn't care about.
They got me on to monitoring EVERYTHING with statsd. Great stuff.
Edit: Too bad you can't use asterisks in HN comments...
Fellow longtime Gentoo and recent Alpine user here. I haven't encountered undue conflict burden from configuration file updates. Some projects do churn whitespace etc. in configuration defaults files, which is unfortunate but not specific to any distro.
If an application supports a conf.d style override, I use that, containing only settings which differ from default.
Is there something inherent about Alpine packaging that handles local config differently?
* Don't use queues. Use logs, such as Apache Kafka, for example. It is unlikely to lose any data, and in case of some failure, the log with transactions is still there for some time. Also, Kafka guarantees the order of messages (within a partition), which might be important (or not).
* Understand what is the nature of data and what are the queries that are made later. This is crucial for properly modeling the storage system.
* Be careful with the NoSQL Kool-Aid. If mature databases such as PostgreSQL can't handle the load, choose some NoSQL store, but be careful. I would suggest HBase, but your mileage may vary.
* NoSQL DBs typically limit the queries that you might issue, so the modelling part is very important.
* Don't index data that you don't need to query later.
* If your schema is relational, consider de-normalization steps. Sometimes it is better to replicate some data than to keep a relational schema and make huge joins across tables.
* Don't use MongoDB
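The de-normalization point above can be made concrete. In this sketch, stdlib sqlite3 stands in for PostgreSQL, and the campaigns/clicks schema is invented for illustration: you pay once at write time to copy the joined-in field onto each fact row, and every read afterwards is join-free.

```python
# De-normalization trade-off in miniature: the "wide" table duplicates the
# campaign name onto every click row so report queries need no join.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE campaigns (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE clicks (campaign_id INTEGER, ts TEXT)")
cur.execute("INSERT INTO campaigns VALUES (1, 'spring_sale')")
cur.executemany("INSERT INTO clicks VALUES (?, ?)", [(1, "2016-02-24")] * 3)

# Normalized read: every report query pays for the join.
joined = cur.execute(
    "SELECT ca.name, COUNT(*) FROM clicks cl "
    "JOIN campaigns ca ON ca.id = cl.campaign_id GROUP BY ca.name"
).fetchall()

# De-normalized: replicate the name at write time, then reads are join-free.
cur.execute("CREATE TABLE clicks_wide (campaign_name TEXT, ts TEXT)")
cur.execute(
    "INSERT INTO clicks_wide SELECT ca.name, cl.ts FROM clicks cl "
    "JOIN campaigns ca ON ca.id = cl.campaign_id"
)
wide = cur.execute(
    "SELECT campaign_name, COUNT(*) FROM clicks_wide GROUP BY campaign_name"
).fetchall()
```

At a few rows the join is free; at billions of rows per day, avoiding it is the whole point.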
I hope it helps!
Most of the big data tools out there will work with data in this format -- BigQuery, Redshift, EMR. EMR can do batch processing against this data directly from S3 -- but may not be suitable for anything other than batch processing. BigQuery and/or Redshift are more targeted towards analytics workloads, but you could use them to load the data into another system that you use for OLAP -- MySQL or Postgres probably.
BigQuery has a nice interface and it's a better hosted service than Redshift IMO. If you like that product, you can do streaming inserts in parallel to your gcs/s3 uploading process for more real-time access to the data. The web interface is not bad for casual exploration of terabytes of raw data. And the price isn't terrible either.
I've done some consulting in this space -- feel free to reach out if you'd like some free advice.
My advice is to step away from AWS (because of price, as you noted). Bare metal servers are a startup's best friend for large data in regards to performance and storage. This way you avoid virtualized CPUs or distributed file systems that are more of a bottleneck than an advantage.
Look for GorillaServers at https://www.gorillaservers.com/
You get 40TB of storage with 8 to 16 cores per server, along with 30TB of bandwidth included, for roughly 200 USD/month.
This should remove the IOPS limitation and provide enough working space to transform the data. Hope this helps.
* Use a queue. RabbitMQ is quite good. Instead of writing to files, generate data/tasks on the queue and have them consumed by more than one client. The clients should handle inserting the data into the database. You can control the pipe by the number of clients you have consuming tasks, and/or by rate-limiting them. Break those queue-consuming clients into small pieces. It's OK to queue item B on the queue while processing item A.
* If your data is more fluid and changing all the time, and/or if it comes in a JSON-serializable format, consider switching to PostgreSQL 9.4+ and using JSONB columns to store this data. You can index/query those columns, and performance-wise it's on par with (or surpasses) MongoDB.
* Avoid AWS at this stage. As someone here commented, bare metal is a better friend to you. You'll also know exactly how much you're paying each month; no surprises. I can't recommend SoftLayer enough.
* Don't overcomplicate things. If you can think of a simple solution to something, it's preferable to the complicated solution you might have had before.
* If you're going the queue route suggested above, you can pre-process the data as you take it in. If it's going to be placed into buckets, do it then; if it's normalised, do it then. The tasks on the queue should be atomic and idempotent. You can use something like memcached if you need your clients to communicate with each other (like checking whether a queue item is already being processed by another consumer and thus locked).
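The queue pattern above, in miniature: Python's stdlib `queue.Queue` stands in for RabbitMQ, a locked set stands in for the memcached idempotency check, and the doubling "work" and item values are invented for illustration.

```python
# Several consumers drain one shared queue; a seen-set (stand-in for
# memcached) makes redelivered duplicates harmless, i.e. idempotent.
import queue
import threading

tasks = queue.Queue()
seen, seen_lock = set(), threading.Lock()
results = []

def consume():
    while True:
        item = tasks.get()
        if item is None:          # poison pill: shut this worker down
            tasks.task_done()
            return
        with seen_lock:
            duplicate = item in seen
            seen.add(item)
        if not duplicate:
            results.append(item * 2)  # the "insert into the database" step
        tasks.task_done()

workers = [threading.Thread(target=consume) for _ in range(4)]
for w in workers:
    w.start()
for item in [1, 2, 2, 3, 3, 3]:       # duplicates simulate queue redelivery
    tasks.put(item)
for _ in workers:
    tasks.put(None)                   # one pill per worker
for w in workers:
    w.join()
```

Scaling throughput is then just a matter of raising the worker count, exactly as described above.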
Have you looked at Google at all? Cloud Bigtable runs the whole of Google's Advertising Business and could scale per your requirements.
In my previous job we processed 100s of millions of row updates daily on a table with much contention and ~200G size and used a single PostgreSQL server with (now somewhat obsoleted by modern PCIe SSDs) TMS RamSAN storage, i.e. Fibre-Channel based Flash. We had some performance bottlenecks due to many indexes, triggers etc. but overall, live query performance was very good.
Realistically, this is what I would do (I work on something very similar but not really in adtech space):
1. Load data in text form (assuming it sits in S3) inside hadoop (EMR/Spark)
2. Generate reports you need based on your data and cache them in mysql RDS.
3. Serve the pre-generated reports to your users. You can get creative here and generate bucketed reports where the user will feel it's more "interactive". This approach will take you a long way, and when you have time/money/people, maybe you can try getting fancier and better.
Getting fancy: if you truly want near-real-time querying capabilities, I would look at Apache Kylin or LinkedIn's Pinot. But I would stay away from those for now.
Bigtable: As someone pointed out, Bigtable is a good solution (although I haven't used it), but since you are in the AWS ecosystem, I would stick there.
(Hope that might be helpful! A bunch of us hang out on IRC at #cassandra if you're curious.)
If your data are event data, e.g. user activity, clicks, etc., these are non-volatile data which should be preserved as-is; you can enrich them later on for analysis.
You can store these flat files in S3 and use EMR (Hive, Spark) to process them and store the results in Redshift. If your files are character-delimited, you can easily create a table definition with Hive/Spark and query them as if they were in an RDBMS. You can process your files in EMR using spot instances, and it can cost as little as under a dollar per hour.
This architecture will get you to massive, massive scale and is pretty resilient to spikes in traffic because of the Kafka buffer. I would avoid Mongo/MySQL like the plague in this case. A lot of designs focus on the real-time aspect for data like this, but if you take a hard look at what you really need, it's batch map-reduce on a massive scale, with a dependable schedule and linear growth metrics. With an architecture like this deployed to AWS EMR (or even Kinesis/S3/EMR) you could grow for years. Forget about the trendy systems, and go for the dependable tool chains for big data.
And pay a little to read this book: http://www.amazon.com/Designing-Data-Intensive-Applications-...
And this one: http://www.amazon.com/Big-Data-Principles-practices-scalable...
Nathan Marz brought Apache Storm to the world, and Martin Kleppmann is pretty well known for his work on Kafka.
Both are very good books on building scalable data processing systems.
The bottleneck is usually not I/O, but computing aggregates over data that continuously gets updated. This is quite CPU intensive even for smaller data sizes.
You might want to consider PostgreSQL, with Citus to shard tables and parallelise queries across many PostgreSQL servers. There's another big advertising platform that I helped move from MySQL to PostgreSQL+Citus recently and they're pretty happy with it. They ingest several TB of data per day and a dashboard runs group-by queries, with 99.5% of queries taking under 1 second. The data are also rolled up into daily aggregates inside the database.
There are inherent limitations to any distributed database. That's why there are so many. In Citus, not every SQL query works on distributed tables, but since every server is PostgreSQL 9.5, you do have a lot of possibilities.
Looking at your username, are you based in the Netherlands by any chance? :)
- How CloudFlare uses Citus: https://blog.cloudflare.com/scaling-out-postgresql-for-cloud...
- Overview of Citus: https://citus-conferences.s3.amazonaws.com/pgconf.ru-2016/Ci...
- Documentation: https://www.citusdata.com/documentation/citusdb-documentatio...
If you just want reports (and are okay getting them in a matter of minutes), then you can continue storing them in flat files and using Apache Hive/Pig-equivalent software (or whatever equivalent is hot right now; I'm out of date on this class of software).
If you want a really good out-of-box solution for storage + data processing, google cloud products might be a really good bet.
DO NOT write to txt files and read them again. This is unnecessary disk IO and you will run into a lot of problems later on. Instead, have an agent which writes into Kafka (like everyone mentioned), preferably using protobuf.
Then have an aggregator which does the data extraction and analysis and puts them in some sort of storage. You can browse this thread to look for and decide what sort of storage is suitable for you.
We looked at just about every open source and commercial platform that we might use as a backend, and decided that none were appropriate, for scale, maturity, or fairness / scheduling. So we ended up building, from scratch, something that looks a bit like Google's Dremel / BigQuery, but runs on our own bare metal infrastructure. And then we put postgres on top of that using Foreign Data Wrappers so we could write standard SQL queries against it.
Some blog posts about the nuts and bolts you might find interesting:
If we were starting today, we might consider Apache Drill, although I haven't looked at the maturity and stability of that project recently.
If you need to generate rich multi-dimensional reports, I recommend you create an ETL pipeline into a star-schema-like sharded database (à la OLAP).
Dimension normalization sometimes dramatically reduces data volume; most dimensions can even fit into RAM.
Actually, 200GB per day is not that much in terms of throughput; you can manage it pretty well on a PostgreSQL cluster (with the help of pg_proxy). I think MySQL will also work OK.
Dedicated hardware will be cheaper than AWS RDS.
* Logs are written to S3 (either ELB does this automatically, or you put them there)
* S3 can put a message into an SQS queue when a log file is added
* A "worker" (written in the language of your choice, running on EC2 or Lambda) pops the message off the queue, downloads the log, and "reduces" it into grouped counts. In this case a large log file would be "reduced" to lines where each line is [date, referrer domain, action, count] (e.g. [['2016-02-24', 'news.ycombinator.com', 'impression', 500], ['2016-02-24', 'news.ycombinator.com', 'conversion', 20], ...])
* The reduction can either be persisted in a db that can handle further analysis, or you reduce it further first.
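The "reduce" step above might look something like this (the whitespace-delimited field layout is an assumption; a real worker would use a proper parser for whatever the ELB log format is):

```python
# Toy log reducer: collapse raw lines into grouped
# [date, referrer, action, count] rows, exactly the shape described above.
from collections import Counter

def reduce_log(lines):
    counts = Counter()
    for line in lines:
        date, referrer, action = line.split()[:3]  # assumed field layout
        counts[(date, referrer, action)] += 1
    return [[d, r, a, n] for (d, r, a), n in sorted(counts.items())]
```

A gigabyte of raw log lines collapses into a few thousand such rows, which any small database can then handle for analysis.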
In a shell (modified for speed and ease of use) you get, insert, and update data in a simple way, without all the fat from other mainstream (Java) solutions.
Tip: focus on how to backup and restore first, the rest will be easy!
AWS also has Kinesis, which is deliberately intended to be a sort of event drain. Under the hood it uses S3 and they publish an API and an agent that handles all the retry / checkpoint logic for you.
Does the business really need exactly this? What is their actual goal? Are they aware of the effort and resources required to get this report?
If it's modelled in SQL, it's probably relational and normalized, so you'll be joining tables together. This balloons the complexity of querying the data pretty fast. Denormalizing data simplifies the problem, so see if you can get it into a K/V store instead of a relational database. Not saying relational isn't a fine solution; even if you keep it in MySQL, denormalizing will reduce the complexity of querying it.
Once you determine whether you can denormalize, you can look at sharding the data: instead of having the data in one place, you have it in many places, and the key of the record determines where to store and retrieve the data. Now you have the ability to scale your data horizontally across instances, dividing the problem's complexity by n, where n is the number of nodes.
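Key-based shard routing can be as simple as hashing the record key (the node names and node count here are purely illustrative):

```python
# The record key alone determines the node, so reads and writes never need
# to scan every shard. Note: plain modulo remaps almost every key when a
# node is added; real systems use consistent hashing to avoid that.
import hashlib

NODES = ["node-a", "node-b", "node-c"]

def shard_for(key):
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]
```

Every client that agrees on the node list and hash function routes the same key to the same node, with no central lookup service required.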
Unfortunately, the network is not reliable, so you suddenly have to worry about the CAP theorem and what happens when nodes become unavailable. You'll start looking at replication and consistency across nodes, and need to figure out, within your problem domain, what you can tolerate. E.g. bank accounts have different consistency requirements than social media data, where a stale read isn't a big deal.
Aphyr's Call Me Maybe series has reviews of many datastores under pathological conditions, so read about your choice there before you go all in (assuming you do want to look at different stores). Dynamo-style DBs like Riak are what I think of immediately, but read around; this guy is a wizard. https://aphyr.com/tags/Jepsen
AWS has a notorious network, so really think about those failure scenarios. Yes, it's hard, and the network isn't reliable. Dynamo DBs are cool though, and fit the big problems you're looking at if you want to load and query the data.
If you want to work with the data, then Apache Spark is worth looking at. You mention MapReduce, for instance. Spark is quick.
It's sort of hard because there isn't a lot of information about the problem domain, so I can only shoot in the dark. If you have strong consistency needs, or need to worry about concurrent state across data, that's a different problem than processing one record at a time without needing a consistent view of the data as a whole. In the latter case you can just process the data via workers.
But think sharding, to divide the problem across nodes, and denormalization, e.g. via key/value lookup, for simple runtime complexity. But start where you are: look at your database and make sure it's very well tuned for the queries you're making.
Do you even need to load it into a db? You could distribute the load across clusters of workers if you have some source that you can stream the data from; then you don't have to load and then query the data. It depends heavily on your problem domain. Good luck. I can email you to discuss if you want; I just don't want to post my email here. Data isn't so much where I hang out; processing lots of things concurrently in distributed systems is, so others who have gone through similar problems may have better ideas.
There are some cool papers like the Amazon Dynamo paper and I read the Google Spanner paper the other day (more globally oriented and around locking and consistency items). You can see how some of the big companies are formalizing thinking by reading some of the papers in that space. Then there are implementations you can actually use but you need to understand them a bit first I think.
Short answer: I think you're looking in the wrong direction. This problem isn't solved by a database but by a full data processing system like Hadoop, Spark, Flink (my pick), or Google Cloud Dataflow. I don't know what kind of stack you guys are using (IMO the solution to this problem is best built on Java), but I would say you could benefit a lot from either the Hadoop ecosystem or Google Cloud's ecosystem. Since you say you're not experienced with that volume of data, I recommend Google Cloud's ecosystem; specifically, look at Google Dataflow, which supports autoscaling.
Long answer: to answer your question more directly, you have a bunch of data arriving that needs to be processed and stored every X minutes, then made available for interactive analysis or later processing in a report. This is a common task, and it's exactly why the Hadoop ecosystem is so big right now.
The 'easy' way to solve this problem is Google Dataflow, a stream processing abstraction over Google Cloud that lets you set your X-minute window (or more complex windowing) and automatically scales your compute servers (you pay only for what you use, not what you reserve). For interactive queries they offer Google BigQuery, a robust SQL-based column database that lets you query your data in seconds and only charges based on the columns you queried (if your data set is 1TB but the columns used in your query are only some short strings, they might only charge you for querying 5GB). As a replacement for your MySQL problems they also offer managed MySQL instances and their own Google Bigtable, which has many other useful features. Did I mention these services are integrated into an interactive IPython-notebook-style interface called Datalab, and fully integrated with your Dataflow code?
This might all get a little expensive, though (in terms of your cloud bill). The other solution is to do some harder work in the Hadoop ecosystem. The problem of processing data every X minutes is called windowing in stream processing. Your problems are solved by Apache Flink, a relatively easy and fast stream processing system that makes it simple to set up clusters as you scale your data processing. Flink will help with your report generation and make it easy to process this streaming data in a fast, robust, and fault-tolerant (that's a lot of buzzwords) fashion.
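The windowing idea itself is simple, though. Here's the core of a tumbling window - the "process every X minutes" pattern - in plain Python, just to show the concept Flink (and Dataflow) implement for you at scale; the window size and event data are made up:

```python
from collections import defaultdict

WINDOW_MINUTES = 5  # the "X minutes" from the question; illustrative

def window_start(epoch_seconds: int) -> int:
    # Assign each event to the tumbling window containing its timestamp.
    size = WINDOW_MINUTES * 60
    return (epoch_seconds // size) * size

def aggregate(events):
    # events: (epoch_seconds, value) pairs; returns per-window sums.
    windows = defaultdict(int)
    for ts, value in events:
        windows[window_start(ts)] += value
    return dict(windows)

# 0s and 120s fall in the first 5-minute window; 301s starts the next one.
events = [(0, 1), (120, 2), (301, 5)]
print(aggregate(events))  # {0: 3, 300: 5}
```

What the real systems add on top is the hard part: handling late and out-of-order events, checkpointing window state, and distributing it all across a cluster.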
Please take a look at the Flink programming guide or the data Artisans training sessions on this topic. Note that doing SQL queries with Flink is not solved (yet); this feature is planned for release this year. However, Flink will solve all your data processing problems in terms of the cross-table reports and preprocessing for storage in a relational database or distributed filesystem.
For storing this data and making it available, you need something fast but just as robust as MySQL. The 'correct' solution at this time, if you're not using all the columns of your table, is a columnar store. From Google's cloud you have BigQuery; from the open source ecosystem you have Drill, Kudu, Parquet, Impala, and many many more. You can also try Postgres or RethinkDB for a full relational solution, or HDFS/QFS + Ignite + Flink from the Hadoop ecosystem.
For interactively working with your data, try Apache Zeppelin (free, Scala required I think) or Databricks (paid but with lots of features, Spark-only I think). Or take the results of your query from Flink or similar and analyze them interactively with Jupyter/IPython (the solution I use).
The short answer is: dust off your old Java textbooks. If you don't have a Java dev on your team and aren't planning on hiring one, the Google Dataflow solution is way easier and cheaper in terms of engineering. If you need help, I do need an internship ;)
If you want to see all the possible solutions from the Hadoop ecosystem, look at: https://hadoopecosystemtable.github.io/
For google cloud ecosystem it's all there on their website.
Oops, it seems I left out ingestion - I would use Kafka or Spring Reactor.
P.S. The Flink mailing list is very friendly; try asking this question there.
Although it isn't foolproof, it's obviously better than not doing it.
Search for your URL, as well as targeted keywords.
What an incredible thought! Though, given the noise to signal ratio round here, one that I fear you may be alone in thinking.
Even non-meta questions can be rather lazy... I mean, a couple of throwaway sentences that don't provide much context suggest it's probably not that important.
For example: https://news.ycombinator.com/item?id=11160872
While I don't think of "Ask HN" as Stack Overflow, there's something to the response "What code have you tried?" and the idea that a two-sentence question doesn't necessarily deserve a long, detailed, comprehensive answer.
So I guess what I'm saying is that I'm not particularly surprised by its speed, because a lot of the stories submitted here deserve to decline quickly.
It's not exactly that but https://news.ycombinator.com/ask exists.
It's an incredibly sparse list, though, given GitHub's popularity. Compare it to Atlassian's marketplace, which seems to include everybody, both big and small.
The most valuable thing Github has is the enormous portfolio of FOSS projects and all the free eyeballs that come with it. Yes, it's not paid, but because its public offerings are so popular, most developers are familiar with it and that gives it a trusted inlet to basically every software organization in existence. After all, in a non-dysfunctional organization, the guy who decides what products to buy to support their developers should give far more weight to what those developers want, than what is said by a vendor salesman whose job it is to convince them to buy a particular product.
Which means it's a mistake to neglect their FOSS users -- it seems self-evident that the biggest, best, most cost-effective way for GitHub to sell its paid features is word-of-mouth from its free users. It's very hard for a competitor to replicate their enormous network effect, which is also what makes it so effective. Their underwhelming response to the "dear github" letter suggests that their upper management is blind to this. I think there's a serious possibility that, in the next five to ten years, their position will be as marginalized as, say, SourceForge is today -- long ago it was the "gold standard" in hosting services, but today it's only used by barely-maintained projects that can't even scrape together the resources needed to change their code hosting provider.
The idea was based on our experience that you spend too much time searching through code as a developer. Coati makes it much easier to see how the different parts of the software play together.
For example, I've seen many Scala programmers struggle with achieving maximal type safety due to getting bogged down in boilerplate. To attack this problem (as I see it anyway, but that's all which matters for now) I'm researching ways of utilizing Scala macros to cut down on such boilerplate while retaining type safety.
This has been huge for my own side work (which means the utility is definitely there), but I'm still uncertain as to the cost, i.e. what effects my approach might have on a "serious" commercial project. I'm just going to have to find out the old fashioned way.
A DIY sensor needs to either be some kind of test strip or a needle. Both are a lot easier to just get through insurance than to mimic.
I'm sure you could perform the chemical reactions yourself, but you'll probably find that whatever you make needs to be replaced frequently. Meanwhile, the software, modified Android phones, etc. last a lot longer, so that's where the action is.
The challenge for the homebrewer is to build something traceable: when you measure blood sugar today, the same sample should yield the same number - not only today but also ten years from now. And yes, Theranos is fighting with traceability too.
It's not necessarily a bad idea to play around with MOOCs to take a look at other skills while working at your job, but if a university course wasn't enough to make up your mind a MOOC won't be either.
(In general, a thesis or internship is a good way to look deeply into a specific niche. But if you're three months from graduating, I don't think that advice will help you...)
More schooling will teach more of the sort of things that are taught in school. A job will teach the sort of things that are taught on a job.
If the priority is learning more of the things that are taught in school, get a graduate degree. There is nothing to keep a person from spending time on MOOCs while working; a lot of people do.
Work is not like school: most new grads will learn a lot really fast, because the learning is by doing, alongside people with more experience, in a culture with a great deal of institutional experience in the thing being done.
To put it another way, if there's some really interesting MOOC, take it now.
The thing with doing MOOCs on your own is that it seems easy to get lazy and end up wasting a lot of time with nothing to show on your resume.
Email them and say "I'd love to work at your company in a year, what can I do to prepare".
They'll tell you. You do want to get a job, be it contracting or something, so you have some professional experience.
But the main thing is that you learn what you need to learn: you figure out whether you need some GitHub contributions, MOOCs, books to read, etc.
Your app would have to be great at non-secure notes AND have an option to add secure notes.
Evernote, btw, does support secure (encrypted) notes. They have a lousy UI for them but the option is there.
If you don't think your app is better in at least some ways than existing note-taking apps, then having secure notes will not make a difference.
* Emacs (desktop / laptop editor)
* Orgzly (Android editor)
* org-mode (the note mechanism itself)
* Unison (for file sync)
* Ubuntu LTS + OpenSSH (on the file server)
Happy to provide more detail if you're interested.
Stord.io is a key/value store. This is often modelled as a hashmap or a dictionary in programming languages.
Under the hood, stord.io is powered by Redis, with a thin Python application wrapper based on Flask. Stord.io doesn't assume anything about your data - use whatever nested schema you want!
Full disclosure: if this wasn't already clear, it's my project. I would LOVE feedback/feature suggestions.
I think it would be safe to assume there are collision problems in unauthenticated ones as well...
What are the key/value durability requirements? (OK to drop values now and then, or does it need to keep them until the end of time?). Need backups? Do values expire, or do you have to expire them manually? Since you can't enumerate or search, how do you delete things? Allowed sizes of keys and values, between bytes and terabytes? How far should it scale? Shared namespace, or namespace per user? Do you need a latency guarantee? How low? Are you gonna use it for something important and need an SLA on the availability of the service as a whole?
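To make the expiry question concrete, here's a toy in-memory KV store with per-key TTLs and lazy expiry on read - a sketch of the semantics you'd want the service to pin down, not anything to do with stord.io's actual implementation (all names here are made up):

```python
import time

class ExpiringKV:
    """Toy in-memory KV store with optional per-key TTL, expired lazily on read."""

    def __init__(self):
        self._data = {}  # key -> (value, expiry_epoch or None)

    def put(self, key, value, ttl_seconds=None):
        expires = time.time() + ttl_seconds if ttl_seconds is not None else None
        self._data[key] = (value, expires)

    def get(self, key, default=None):
        item = self._data.get(key)
        if item is None:
            return default
        value, expires = item
        if expires is not None and time.time() >= expires:
            del self._data[key]  # lazily expire on access
            return default
        return value

kv = ExpiringKV()
kv.put("session", "abc123", ttl_seconds=60)
kv.put("config", {"a": 1})  # no TTL: lives until explicitly deleted
print(kv.get("session"))  # abc123
```

Note that lazy expiry means a key nobody reads lingers in memory forever; Redis, for instance, combines lazy expiry with periodic background sweeps. Those are exactly the kinds of durability decisions the questions above are probing.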
A couple of "nearby" points in the solution space:
Amazon S3 is a KV store where the keys look like filenames and the values look like files: high durability, good scaling, pretty high latency. You could also obviously layer a KV store on top of ElastiCache or DynamoDB, which will have different properties.
Going low-level and implementing your own in, say, Go would probably be the most fun though :p
Hard to say whether we could use a SaaS KV store at work without a lot more technical detail on the solution. I'm having a hard time thinking of an app where you'd want a KV store but not need a database or NoSQL store, which you could use instead.
Unless you can find an editor for the category you're trying to submit to.
Those short URLs are often used in mass mailings, guestbooks, comment forms, etc., and they'll get reported pretty quickly to URL blacklists. That gives you the chance to disable the shady short URLs before too many people come across them.
Here is a list of URL blacklist providers that you should check each URL against, both before accepting it into your database and again before you deliver it to the user:
multi.surbl.org, uribl.swinog.ch, dbl.spamhaus.org, url.rbl.jp, uribl.spameatingmonkey.net, iadb.isipp.com, dnsbl.sorbs.net
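Mechanically, these lists are queried over DNS: you prepend the domain to the blacklist's zone and do an A-record lookup, and a successful answer means the domain is listed. A minimal sketch (IP-based lists work differently, reversing the octets, and each list has its own usage policy you should read first):

```python
import socket

def dnsbl_query_name(url_domain: str, blacklist: str) -> str:
    # URI blacklists are queried by prepending the domain to the list's zone,
    # e.g. example.com checked against multi.surbl.org
    # becomes example.com.multi.surbl.org
    return f"{url_domain.strip('.').lower()}.{blacklist}"

def is_listed(url_domain: str, blacklist: str = "multi.surbl.org") -> bool:
    # A successful A-record lookup means the domain is on the list;
    # NXDOMAIN (a lookup failure) means it is not.
    try:
        socket.gethostbyname(dnsbl_query_name(url_domain, blacklist))
        return True
    except socket.gaierror:
        return False
```

In production you'd fan this out across all the lists above and cache results, since a DNS round trip per list per URL adds up.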
You can find out more about this stuff on http://www.surbl.org/
Since I've implemented these checks I only receive complaint emails every other month and no contact from the authorities so far (in contrast to the disposable email services I run).
When you get a complaint about a URL, your software should let you disable all the URLs of the reported domain. There are times when spammers have taken over a forum (or any site, really) and created a large amount of spammy content.
Hope this helps.
My first thought is to wonder why someone would want to be in a space where an arms race against people devoted to child porn was both necessary and constant. Particularly when that space wasn't providing capital sufficient to retain a lawyer.
My gut tells me that as a matter of will, it's going to be hard to keep this class of activity from reoccurring regardless of how this incident works out and that scaling the service will scale the headaches of monitoring and policing user behavior along with it. Personally, I'd rather deal with users I liked. YMMV.
Take a look at the Center for Missing and Exploited children and perhaps call them. They operate a tip line to report this kind of activity. Sad that we need to think about these things.
In practice we paired when doing difficult stuff. It worked well.
Don't pair at your desk. Use a pairing station.
Pair for a task, short if possible. If not, timebox.
Pair for short periods!
Train yourself to be productive while pairing. Don't relax. A short period of hard work.
Pick a consistent pairing time and make it a habit. Such as after standup, after lunch, whatever. Some times when you would not be productive solo, like after lunch, you can be productive in a pair. Experiment.
Long, unproductive, exhausting pairing periods. People will quit.
That one guy that nobody wants to pair with. Happens, sorry.
Sitting there twiddling your thumbs while your pair partner writes an email. Only pair for coding / debugging / testing! Don't pair at your desk.
As a fast touch typist, I preferred to be the one at the keyboard. I sometimes like to sketch things out first to test my initial idea.
I think pair programming can be extremely effective and fun when done in an ambitious work environment with open minded people who can work well with others. I much prefer doing pair programming than not since it's an opportunity to discuss requirements/design/solutions and also a kind of code review before the actual code review.
Coding on your own hides wrong assumptions, incorrect thinking, lack of teamwork/collaboration, misunderstandings, and code quality issues that would likely be found in pair programming sessions before a code review. It can create great synergy when solving complex problems where two people balance different aspects of problem solving.
But all day pair programming would drive me crazy.
I just haven't had a great experience with it, though I remain open to trying it again.
Pair programming glorifies the task of programming as overly difficult. Solving problems can be difficult; programming them shouldn't be. Solving a tough problem in a code editor is not a good idea, IMO.
* someone I know and like... I imagine being paired with a person you disliked would be awful
For instance, the machine learning group (Yoshua Bengio, Ian Goodfellow, et al.) at the University of Montreal (the people behind open source software like Theano, Pylearn2, etc.) regularly makes papers available on arXiv, and is currently working on a book about deep learning, drafts of which are freely available.
[1]: http://scholar.google.com
[2]: http://arxiv.org
[3]: https://github.com/Theano/Theano
[4]: https://github.com/lisa-lab/pylearn2
[5]: http://www.deeplearningbook.org